OCaml Libraries from Jane Street

This blog entry is about using Jane Street’s publicly released OCaml libraries for your own projects. It tells you what to install and how to compile programs, and provides a short example program that tries to do something useful using the Command module and the Async library. For the official documentation about all this, take a look at the URLs towards the bottom of this blog post.

A word of warning: This blog entry does not aim to teach you functional programming, OCaml, monads or how to install and configure things on Linux. What it will show you, however, is a small, fully working example and an overview of how to work with this rather rich code base. I assume you already know how to program in OCaml and maybe a bit about monads. Also, Jane Street’s libraries don’t work on Windows. For now they are Linux only.

Installing

The first thing you need to know about is OPAM. OPAM serves as an alternative to GODI as a means of installing the OCaml compiler and OCaml libraries. It plays the same role that CPAN does for Perl, gem does for Ruby and cabal does for Haskell. The latest release of the OCaml compiler and Jane Street’s libraries (which are updated weekly) can be installed through OPAM.

Install OPAM. The directions are on the website and are simple enough. Once you have OPAM and the latest OCaml installed (which should be version 4.xx right now), install the Jane Street libraries (for example, with “opam install core core_extended async”). Here are some worth mentioning:

1) Core
Core is Jane Street’s alternative to the standard library, i.e. it contains all the basic stuff. Expect to find List, Set, Hashtbl, Unix bindings etc. here. Very often these implementations extend and improve on what ships from INRIA. Core also contains some interesting modules such as Command, the command line parsing module that is commonly used to build command line programs.

2) Core_extended
Core_extended is everything else that is still useful but is not as carefully code reviewed as Core. Core_extended contains a great many things. Bench, for instance, is a micro-benchmarking module in Core_extended. Ascii_table, as the name suggests, is a module for drawing tables using ASCII art.

3) Async
Async is Jane Street’s lightweight concurrency library. Async is a workhorse of a library and shares similarities with Lwt from the Ocsigen project. Async lets you write programs that can contain hundreds of lightweight, concurrently running jobs whose scheduling is done in user space without OS support. Async exposes a monadic interface for writing these jobs. The primary author of Async is Stephen Weeks, who is also known for his work on the MLton compiler, though it has been worked on by several people at Jane Street.

4) Async_extended and Async_unix
These are asyncifications of several things that one may find in Core and the Unix modules.

5) Sexplib
Pretty much anything you install will pull along things like sexplib, which is a camlp4 preprocessor extension that allows one to annotate type declarations with “with sexp”. The annotation generates s-expression serializers and deserializers for the type in question.
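
To make the “with sexp” annotation more concrete, here is a hand-written sketch of the kind of converter pair it generates. It uses only the standard library: the sexp type, the point record and the function bodies below are illustrative assumptions written for this post, not Sexplib’s actual definitions or generated code.

```ocaml
(* A tiny s-expression type, defined here only for illustration. *)
type sexp = Atom of string | List of sexp list

(* The type one would annotate with "with sexp" when using Sexplib. *)
type point = { x : int; y : int }

(* Hand-written serializer, analogous to a generated sexp_of_point. *)
let sexp_of_point p =
  List [ List [ Atom "x"; Atom (string_of_int p.x) ];
         List [ Atom "y"; Atom (string_of_int p.y) ] ]

(* Hand-written deserializer, analogous to a generated point_of_sexp. *)
let point_of_sexp = function
  | List [ List [ Atom "x"; Atom x ]; List [ Atom "y"; Atom y ] ] ->
    { x = int_of_string x; y = int_of_string y }
  | _ -> failwith "point_of_sexp: unexpected s-expression"

(* Render an s-expression as text. *)
let rec to_string = function
  | Atom s -> s
  | List l -> "(" ^ String.concat " " (List.map to_string l) ^ ")"

let () = print_endline (to_string (sexp_of_point { x = 1; y = 2 }))
```

With the real library you write only the type declaration and the annotation, and the two converters are produced for you.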

The best way to find out what these libraries contain is to go fish out their mli files. This is the best (and the only) documentation that you have at the moment. OPAM creates a folder called ~/.opam in your home directory where it installs things. You can find the mli files in folders corresponding to each library installation. For example, the mli files for Core may be found in ~/.opam/4.00.1/lib/core.

Overview of the example program

Here is the sample program we are going to write. When run at the command line, it basically displays this:

$ hg_stats.exe -?
Summarize status of hg repos in all subdirectories.

  hg_stats.exe 

=== flags ===

  [-incoming]    Show incoming changesets.
  [-max MAX]     Crawl atmost MAX repos.
  [-outgoing]    Show outgoing changesets.
  [-build-info]  print info about this build and exit
  [-version]     print the version of this build and exit
  [-help]        print this help text and exit
                 (alias: -?)

The program will recursively find all the Mercurial repositories under your current directory and display their status. In other words, it will run “hg stat”, “hg paths”, “hg incoming” and “hg outgoing” for each repository and summarize the results in a nice table. Here is a sample of such output.

[Screenshot: sample output table from hg_stats.exe]

The program executes hg commands on multiple repos at the same time instead of querying one repo after the other. This is achieved using the Async library. Since much of the work is IO bound (waiting for the hg process to return information), hg_stats.exe gets a lot of effective concurrency out of using Async. Also, in case there are thousands of sub-repos, hg_stats.exe throttles the number of queries running at once so that the system does not run out of open file handles and such.
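
The throttling idea can be sketched without Async. The following is a synchronous simulation using only the standard library, not Async’s Throttle module: run_throttled, its max_concurrent parameter and the way completion is simulated are all assumptions made for illustration.

```ocaml
(* Simulated throttle, standard library only. Jobs are plain thunks and
   "completion" is simulated by finishing the oldest in-flight job.
   Assumes max_concurrent >= 1. *)
let run_throttled ~max_concurrent jobs =
  let waiting = Queue.create () in
  List.iter (fun job -> Queue.add job waiting) jobs;
  let in_flight = Queue.create () in
  let peak = ref 0 in
  let results = ref [] in
  while not (Queue.is_empty waiting && Queue.is_empty in_flight) do
    (* Start waiting jobs only while there is spare capacity. *)
    while Queue.length in_flight < max_concurrent
          && not (Queue.is_empty waiting) do
      Queue.add (Queue.pop waiting) in_flight
    done;
    peak := max !peak (Queue.length in_flight);
    (* The oldest in-flight job finishes and frees a slot. *)
    let job = Queue.pop in_flight in
    results := job () :: !results
  done;
  (List.rev !results, !peak)

let () =
  let jobs = List.init 25 (fun i -> fun () -> i * i) in
  let results, peak = run_throttled ~max_concurrent:10 jobs in
  Printf.printf "%d results, peak in flight: %d\n" (List.length results) peak
```

However many jobs are queued, the number in flight never exceeds the limit, which is what keeps the number of open file handles bounded.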

How to compile programs

One way to compile programs is to run something similar to the following from the command line.

ocamlfind ocamlopt -package async -package core -package core_extended \
  -thread -linkpkg hg_stats.ml

Alternatively, one can use omake. To use omake, first go to the folder where you wish to build the program and type “omake --install”. This will generate two files called OMakeroot and OMakefile. Edit the OMakefile to look like this:

USE_OCAMLFIND = true

OCAMLPACKS[] = core core_extended async

if $(not $(OCAMLFIND_EXISTS))
   eprintln(This project requires ocamlfind, but it was not found.)
   eprintln(You need to install ocamlfind and run "omake --configure".)
   exit 1

OCAMLOPTFLAGS += -thread

FILES[] =
   hg_stats

PROGRAM = hg_stats.exe

.DEFAULT: $(OCamlProgram $(PROGRAM), $(FILES))

Once that is done, you can compile hg_stats.ml to hg_stats.exe using omake. If you type “omake -P”, sometimes called ‘server mode’, omake will wait for changes and rebuild automatically.

Source code

Here is the file hg_stats.ml which implements the program above.

open Core.Std
open Core_extended.Std
open Async.Std
 
let exec ~prog ~args ?working_dir () = 
  Process.run ~prog ~args ?working_dir () 
  >>= function 
  | Error _ -> return None
  | Ok res -> 
    match String.strip res with 
    | "" -> return (Some [])
    | src -> return (Some (String.split ~on:'\n' src))
 
let ascii_table cols rows = 
  let cols = List.mapi cols ~f:(fun i col -> 
    let align, col = match String.chop_prefix col ~prefix:"-" with 
    | Some col -> Ascii_table.Align.right, col
    | None -> Ascii_table.Align.left, col in 
    Ascii_table.Column.create col ~align (fun row -> List.nth_exn row i)) in 
  Ascii_table.output ~limit_width_to:120 cols rows ~oc:stdout
 
let length_string_opt = function 
  | None -> ""
  | Some ls -> Int.to_string (List.length ls)
 
let sort_by_hd ls = 
  List.sort ls ~cmp:(fun ls1 ls2 -> 
    String.compare (List.hd_exn ls1) (List.hd_exn ls2)) 
 
let hg_stat ~incoming ~outgoing repo = 
  exec ~prog:"hg" ~working_dir:repo ~args:["stat"] ()
  >>= fun files -> 
  exec ~prog:"hg" ~working_dir:repo ~args:["paths"] ()
  >>= (function 
  | None | Some [] ->
    return ("", (if incoming then [""] else []),
            (if outgoing then [""] else []))
  | Some (path::_) ->
    (if incoming then
      exec ~prog:"hg" ~working_dir:repo
        ~args:["incoming"; "-q"; "--template";"\"{node|short}\\n\""] ()
      >>= fun lines ->
      return [length_string_opt lines]
     else return [])
    >>= fun incoming ->
    (if outgoing then
      exec ~prog:"hg" ~working_dir:repo
        ~args:["outgoing"; "-q"; "--template";"\"{node|short}\\n\""] ()
      >>= fun lines ->
      return [length_string_opt lines]
     else return [])
    >>= fun outgoing ->
    return (path, incoming, outgoing))
  >>= fun (path, incoming, outgoing) -> 
  return ([ repo; path; length_string_opt files] @ incoming @ outgoing)
 
let find_hg_and_report ~max ~incoming ~outgoing () = 
  exec ~prog:"find" ~args:[".";"-name";".hg"] ()
  >>= fun lines_opt -> 
  let repos = List.map (Option.value_exn lines_opt) 
    ~f:(fun x -> fst (String.rsplit2_exn ~on:'/' x)) in 
  let repos = List.take repos max in 
  printf "%d mercurial repositories found.\n%!" (List.length repos);
  let throttle = Throttle.create 
    ~continue_on_error:false ~max_concurrent_jobs:10 in 
  Deferred.List.map ~how:`Parallel repos ~f:(fun repo -> 
    Throttle.enqueue throttle (fun () -> 
      hg_stat ~incoming ~outgoing repo))
  >>= fun rows -> 
  let cols = ["repo"; "path"; "-changes"]  
    @ (if incoming then ["-incoming"] else []) 
    @ (if outgoing then ["-outgoing"] else []) in 
  ascii_table cols (sort_by_hd rows);
  return ()
 
let command = 
  let open Command.Spec in 
    Command.async_basic 
      ~summary:"Summarize status of hg repos in all subdirectories."
      (empty 
       +> flag "-max" (optional_with_default 50 int) ~doc:"MAX Crawl atmost MAX repos."
       +> flag "-incoming" no_arg ~doc:" Show incoming changesets."
       +> flag "-outgoing" no_arg ~doc:" Show outgoing changesets.")
      (fun max incoming outgoing () -> 
        find_hg_and_report ~max ~incoming ~outgoing ())
 
let () = Command.run command

There is a lot happening here. Here are some high level notes. This program uses Core, Core_extended and Async. Each Jane Street library contains a module called Std, which is meant to be opened at the top of the ml/mli files that use the library. Hence the “open Core.Std” etc. in the source code above.

The program shows how to put together a small command line parser using the Command module. At a high level one specifies a set of flags (flags can take arguments), and a set of anonymous arguments. If the flags and anonymous arguments are correctly parsed from the command line, the given function is called with the arguments in order. Use the above as the starting point to explore what Command provides and look at Command.mli for more details.
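
For comparison, the same three flags can be parsed with the Arg module from OCaml’s standard library. This is a different API from Command and is shown only as a self-contained sketch of the flags-to-function idea. The parse function and its tuple result are illustrative assumptions, not part of hg_stats.ml.

```ocaml
(* Parse the three hg_stats flags with the standard library's Arg module
   (not Core's Command). Returns (max, incoming, outgoing). *)
let parse argv =
  let max_repos = ref 50 and incoming = ref false and outgoing = ref false in
  let spec = [
    "-max", Arg.Set_int max_repos, "MAX Crawl at most MAX repos.";
    "-incoming", Arg.Set incoming, " Show incoming changesets.";
    "-outgoing", Arg.Set outgoing, " Show outgoing changesets.";
  ] in
  (* hg_stats takes no anonymous arguments, so reject any that appear. *)
  let reject arg = raise (Arg.Bad ("unexpected argument: " ^ arg)) in
  Arg.parse_argv ~current:(ref 0) argv spec reject
    "Summarize status of hg repos in all subdirectories.";
  (!max_repos, !incoming, !outgoing)

let () =
  let max_repos, incoming, outgoing =
    parse [| "hg_stats.exe"; "-max"; "5"; "-incoming" |] in
  Printf.printf "max=%d incoming=%b outgoing=%b\n" max_repos incoming outgoing
```

With Arg the results come back through mutable references, whereas Command passes the parsed values directly as arguments to your function, in the order in which the flags were declared.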

The program also serves as an example of using Async. When reading Async code, here is the first thing to know: the type 'a Deferred.t is your monadic type. In other words, it is the Async equivalent of the monadic type “m a” where m is a monad. For the Async monad, return has type 'a -> 'a Deferred.t and bind (written as >>=) has the type 'a Deferred.t -> ('a -> 'b Deferred.t) -> 'b Deferred.t, as expected.
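
The shape of this interface can be tried out without installing Async. Below is a toy stand-in written with the standard library only. Toy_deferred and count_lines are hypothetical names, and unlike a real Deferred.t the toy value is always already determined, so this shows only the types of return and bind, not Async’s scheduling.

```ocaml
(* A toy stand-in for Async's Deferred, for illustration only: a value is
   always already determined here, whereas a real Deferred.t may be filled
   in later by the scheduler. *)
module Toy_deferred : sig
  type 'a t
  val return : 'a -> 'a t
  val bind : 'a t -> ('a -> 'b t) -> 'b t
  val value : 'a t -> 'a
end = struct
  type 'a t = Determined of 'a
  let return x = Determined x
  let bind (Determined x) f = f x
  let value (Determined x) = x
end

let ( >>= ) = Toy_deferred.bind
let return = Toy_deferred.return

(* Written in the same style as exec in hg_stats.ml: a computation,
   then >>=, then a function that inspects the result. *)
let count_lines text =
  return (String.trim text)
  >>= function
  | "" -> return 0
  | s -> return (List.length (String.split_on_char '\n' s))

let () = Printf.printf "%d\n" (Toy_deferred.value (count_lines "a\nb\nc\n"))
```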

Even though Async exposes a monadic interface, it does not behave like a “real monad” in the following sense: any time one creates a term of type 'a Deferred.t, its execution is immediately scheduled. In other words, Async computations do not have to be placed into any special context for their evaluation. If one creates an 'a Deferred.t and ignores it, it will still be run.
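
The difference can be illustrated by analogy with two toy types in plain OCaml. This is a sketch using the standard library only. The eager and delayed types are assumptions made up for the comparison, and the eager one merely mimics the “runs even if ignored” behaviour; it is not how Async is implemented.

```ocaml
(* Record every side effect so we can see when it happened. *)
let log = ref []
let record msg = log := msg :: !log

(* Eager style: the effect happens as soon as the value is created,
   even if the caller then ignores the value. *)
type 'a eager = Done of 'a
let eager_job name = record ("ran " ^ name); Done name

(* Delayed style: creating the value does nothing; the effect happens
   only when someone explicitly runs it. *)
type 'a delayed = Delayed of (unit -> 'a)
let delayed_job name = Delayed (fun () -> record ("ran " ^ name); name)
let run (Delayed f) = f ()

let () =
  let _ignored = eager_job "eager" in
  let pending = delayed_job "delayed" in
  Printf.printf "after creation: %d effect(s)\n" (List.length !log);
  ignore (run pending);
  Printf.printf "after run: %d effect(s)\n" (List.length !log)
```

In this analogy a Deferred.t behaves like the eager value: once it has been created, the work behind it is already on its way.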

Deferred.List.map runs the function on every element of the list of repos in parallel, thereby creating as many jobs as there are repos (the Throttle then lets at most 10 of them run at a time). In an Async program, control can transfer from one job to another only at a bind. Hence if you write regular OCaml code, control does not switch out. Blocking operations (typically IO operations) cause a job to sleep until its operation is complete while other waiting jobs get to run. In the above code, every call to Process.run is such a blocking call, which causes the job to wait for an external process to run.
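
Here is a minimal sketch of that cooperative behaviour, using only the standard library. The run queue, the yield function (standing in for a bind that has to wait) and the job function are illustrative assumptions, far simpler than Async’s real scheduler.

```ocaml
(* A run queue of parked continuations and a trace of what ran when. *)
let queue : (unit -> unit) Queue.t = Queue.create ()
let trace = Buffer.create 64

(* [yield k] plays the role of a bind that has to wait: the rest of the
   job, [k], is parked and control returns to whoever runs next. *)
let yield k = Queue.add k queue

let job name =
  Buffer.add_string trace (name ^ "1 ");
  (* Plain OCaml code: no other job can run between these two steps. *)
  Buffer.add_string trace (name ^ "2 ");
  yield (fun () -> Buffer.add_string trace (name ^ "3 "))

(* The "scheduler": run parked continuations until none are left. *)
let run () =
  while not (Queue.is_empty queue) do (Queue.pop queue) () done

let () =
  job "a"; job "b"; run ();
  print_endline (Buffer.contents trace)
```

Steps 1 and 2 of each job always run back to back, and the jobs interleave only at the point where they yield.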

Does Async provide real parallelism? Yes and no. No, in the sense that only one OCaml thread is actually executing OCaml code. For example, if every Async job in a process were doing some arithmetic, the program would only use one CPU core. If a context switch happens from one Async job to another, the new job effectively suspends the old job. The fact that only one OCaml thread of execution is allowed at a time goes back to a limitation of the OCaml runtime. There is work underway to change this, but I don’t know what its status is. However, Async does offer real parallelism when Async jobs are blocked on IO. All the IO operations that Async jobs are blocked on get to execute in parallel. So if you were to run our example program, hg_stats.exe, you would see all your CPU cores light up. Every Async job that is blocked on an IO request is concurrently waiting while the IO requests (in this case the external hg commands) execute in parallel on your multiple cores.

More

Finally, take a look at Jane Street’s GitHub page for updates. They currently have some installation notes, Hello World style examples, some HTML docs generated from the mli files and a bunch of other useful reading.

There is also a little intro to the Async library.
