Wednesday, 12 August 2015

Opam and BSD Owl Support for Travis CI container-based infrastructure

This article is for users of Travis CI services interested in moving their OCaml, opam or BSD Owl projects to the new container-based infrastructure provided by Travis CI.

Travis, a continuous-integration service, introduced a new container-based infrastructure, promising more speed and reactivity than the old virtual-machine-based infrastructure, which is now deemed deprecated. Users willing to move to the new infrastructure face a major obstacle: the container-based infrastructure does not support the sudo command, which in turn means that Travis users can no longer install packages from arbitrary package sources. These dependencies now need to be installed from source, unless a repository containing them has been white-listed.
The script anvil_travisci_autoinstall.sh distributed with anvil will ease this operation for OCaml, opam and BSD Owl users!

Setting up Travis

We set up Travis to take advantage of its cache, which is not optional since a compilation matrix involving the three latest OCaml compilers needs about 15 minutes of setup. For the purpose of the discussion, we consider the example of mixture, an OCaml library implementing common mixins. Let us walk through its .travis.yml file:
language: c
sudo: false
addons:
  apt:
    sources:
    - avsm
    packages:
    - ocaml
    - opam
    - ocaml-native-compilers
install: sh -ex ./Library/Ancillary/autoinstall bmake bsdowl opam
cache:
  directories:
  - ${HOME}/.local
  - ${HOME}/.opam
script: sh -ex ./Library/Ancillary/travisci
env:
  - TRAVIS_OCAML_VERSION=4.00.1
  - TRAVIS_OCAML_VERSION=4.01.0
  - TRAVIS_OCAML_VERSION=4.02.3
The first declarations, language, sudo and addons, constitute the typical prelude of OCaml projects. The script ./Library/Ancillary/autoinstall installs dependencies from sources and initialises opam. The sources are installed to ${HOME}/.local and opam files are stored in ${HOME}/.opam; caching these directories allows us to skip this step completely in case of a cache hit. We present the autoinstall script later, but right now we want to take a look at the last lines of .travis.yml: they define the actual continuous-integration script and a build environment matrix.
The continuous-integration script is anything but fancy: it sets up opam to target the compiler announced by TRAVIS_OCAML_VERSION and runs the traditional autoconf; ./configure; bmake all combo:
INSTALL_PREFIX="${HOME}/.local"
eval $(opam config env)
autoconf
./configure --prefix="${INSTALL_PREFIX}"
bmake -I "${INSTALL_PREFIX}/share/bsdowl" all
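Note that the snippet above assumes that the appropriate opam switch has already been created and selected by the autoinstall step. If the script had to select the compiler explicitly, the corresponding step could look like the following sketch, written against the opam 1.x command-line interface available at the time:
# Select the compiler announced by the build matrix, then read the
# corresponding opam environment.  This is only a sketch: in the actual
# setup the switch is prepared by the autoinstall script.
opam switch "${TRAVIS_OCAML_VERSION}"
eval $(opam config env)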

The autoinstall script

The script installing dependencies from sources actually delegates the job to anvil_travisci_autoinstall.sh. This script could be bundled in the distribution instead of being downloaded, but downloading it eases updates. The autoinstall script is:
: ${local:=${HOME}/.local}
: ${srcdir:=${HOME}/.local/sources}

if [ -f "${local}/.anvil_autoinstall_cached" ]; then exit 0; fi

git clone 'https://github.com/michipili/anvil' "${srcdir}/anvil"
/bin/sh -ex "${srcdir}/anvil/subr/anvil_travisci_autoinstall.sh" "$@"\
    && touch "${local}/.anvil_autoinstall_cached"
When the installation is successful, it leaves a cookie in the cache, whose existence guards an early exit condition. The autoinstall script supports three arguments, bmake, bsdowl and opam, each requesting the setup of the corresponding package. When setting up opam, the file .travis.opam is read to find out which compilers and packages need to be installed:
compiler:
  - 4.00.1
  - 4.01.0
  - 4.02.3
repository:
  - ocamlfind
git:
  - https://github.com/michipili/broken.git
The syntax of this file imitates the YAML format used in .travis.yml, but it is converted to a tabular format with sed, so imaginative formatting is discouraged. There are two ways to specify a dependent package: either by referring to a name in the official repository, or directly with a git repository supporting opam pinning.

Friday, 5 June 2015

Configuration files for shell scripts

Sometimes it is convenient to pass arguments to a program using configuration files rather than command-line arguments. Using a sourced file in shell scripts is a popular approach because it is very easy to implement; however, it has two flaws:

  1. It is not very easy to validate against malicious input, which might be a concern in some contexts.

  2. It is not very structured, for it does not offer any way to structure information.

In recent years¹ a style of configuration files coming from the Microsoft Windows world has gained popularity on Unix platforms. This style of file is known as INI files, and Wikipedia wants us to believe that they look like this:

# last modified 1 April 2001 by John Doe
[owner]
name=John Doe
organization=Acme Widgets Inc.

[database]
# use IP address in case network name resolution is not working
server=192.0.2.62
port=143
file=payroll.dat

Can we use such files to store configuration values for shell scripts, without relying on any fancy dependency? Certainly: here is a sed script which converts these files to a tabular format, like:

owner|name|John Doe
owner|organization|Acme Widgets Inc.
database|server|192.0.2.62
database|port|143
database|file|payroll.dat

The sed script follows; it must be run with the -n option, since it prints the bindings explicitly:

# Configuration bindings found outside any section are given
# to the default section.
1 {
  x
  s/^/default/
  x
}

# Lines starting with a #-character are comments.
/^#/n

# Sections are unpacked and stored in the hold space.
/^\[/ {
  s/\[\(.*\)\]/\1/
  x
  b
}

# Bindings are unpacked and decorated with the section
# they belong to, before being printed.
/=/ {
  s/^[[:space:]]*//
  s/[[:space:]]*=[[:space:]]*/|/
  G
  s/\(.*\)\n\(.*\)/\2|\1/
  p
}

It is then easy to extract interesting values from the output of the script with awk or read.
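For instance, assuming the sed script above has been saved as ini2table.sed and the sample configuration as owner.ini (both file names are chosen only for the example), a single value can be retrieved with awk, and the whole table can be walked with read:

# Print the database port of the sample configuration; note the -n option.
sed -n -f ini2table.sed owner.ini \
  | awk -F '|' '$1 == "database" && $2 == "port" { print $3 }'

# Iterate over all bindings with read.
sed -n -f ini2table.sed owner.ini \
  | while IFS='|' read -r section key value; do
      printf '%s.%s is %s\n' "${section}" "${key}" "${value}"
    done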

Of course, the script is not very robust and does not support fancy features of other implementations, like values spreading over multiple lines or enclosed between double quotes. It nevertheless mitigates the risks bound to malicious input, since unlike a sourced file it does not provide a way to execute arbitrary code, and it can be extended to support more advanced use cases.

¹ Maybe it has been a decade, actually.

Sunday, 10 May 2015

Of course grepping logs is terrible!

There has been a heated discussion on LinuxFR¹ about a blog article, “Grepping logs is terrible”, whose author strongly advocates in favour of binary log storage solutions, which perform consistently better than text log storage solutions in every regard. The author, whose name remains undisclosed², presents himself as an experienced system administrator and gives a fair account of the arguments usually presented by proponents of the text log storage faction.

This author is definitely right in asserting that binary log storage solutions perform better than text log storage solutions. This is, in some sense, totally obvious, and discussing this alternative from a performance perspective misses an important point that I would like to discuss here.

Our author is a professional system administrator and he, of course, has to work with astronomical quantities of data and to make as much sense out of this data as possible. The numbers he can infer from log analysis give the pulse of the system. How fast he can get to relevant information has a direct bearing on the time he needs to accomplish basic tasks in his work. How easily he can retrieve log records satisfying complex criteria determines how fast he can diagnose and repair system malfunctions. As a professional, he has to pick and use tools matching his needs.

As the title of his article suggests, “grepping logs is terrible”, but it necessarily is, for grep is a generic tool which can be used in many other contexts than log exploration! Any specialised search tool has to be better than grep, otherwise what would be the point of writing and using it? Most Unix users are not system administrators or do not have requirements similar to our author's, and they only occasionally need to interact with the logs. For these users, being able to interact with the logs using generic tools they possibly already know is more convenient than having to use a dedicated command. And even if, by some wonder, they one day had hundreds of gigabytes to analyse, it would be acceptable to do this in half an hour with grep rather than in a few minutes with a specialised tool, because this is a one-time operation.

Our author is therefore obviously right in his statement, as he merely observes that specialised tools perform better than generic tools.

The promise made by Unix systems to represent as much data as possible as text files implies that the user who learns the generic tools to analyse and transform text³ will be able to use them on any kind of data, because, after all, it is text. This is also the promise that time spent learning these tools is time well spent, because they can be used in a variety of contexts, not only the context which triggered the need to learn them. These tools are generic and are incredibly useful for prototyping systems, which can then be implemented in a more robust or efficient manner. The scenario where the system administrator observes that his system has reached a scale where grep does not perform well enough to let him do his work efficiently is just an occurrence of a prototype needing a perennial implementation. And this is the Unix way, which has nothing to do with a silly war between fanatical text and binary factions, a war which unfortunately caught our author in its crossfire.


¹ A site similar to Slashdot for French speakers.
² Actually, it is Gergely Nagy!
³ Let us mention grep, sed, awk, sort, join and paste as some of the most important.

Thursday, 16 April 2015

Organise collections of LaTeX documents with BSD Owl Scripts

Let us discuss how to handle collections of LaTeX documents with the build system BSD Owl Scripts. In our example we pretend that we are preparing an electronic journal and want to distribute each article of the journal as a separate electronic document.

Organisation on the file-system

We use the following simple organisation at the file-system level:
  1. We prepare a directory holding each issue of our journal, for instance ~/journal.
  2. Each issue of the journal is represented by a subdirectory.
  3. Each article of the journal is represented by a subdirectory of the directory corresponding to the issue it belongs to.
Assume we already have several articles, as demonstrated by the following command output:
% find ./journal -name '*.tex'
./journal/issue-2013-1/01-galdal/article.tex
./journal/issue-2013-1/02-arathlor/article.tex
./journal/issue-2013-2/01-mirmilothor/article.tex
./journal/issue-2013-2/02-eoron/article.tex
./journal/issue-2013-2/03-echalad/article.tex
Names like galdal and arathlor are the names of fictional authors of articles in our journal. Each submission has a directory containing the article's text, article.tex.

Typeset each single article

We rely on BSD Owl Scripts to transform each article into a PDF file. We therefore add a Makefile to each directory corresponding to an article.
% find ./journal -name 'Makefile'
./journal/issue-2013-1/01-galdal/Makefile
./journal/issue-2013-1/02-arathlor/Makefile
./journal/issue-2013-2/01-mirmilothor/Makefile
./journal/issue-2013-2/02-eoron/Makefile
./journal/issue-2013-2/03-echalad/Makefile
Each of these Makefiles can actually be as simple as
DOCUMENT=       article.tex
.include "latex.doc.mk"
These Makefiles can also define file-system locations where TeX will look up common assets, define rules to automatically build some tables or figures, or use any of the more advanced techniques described in the documentation. Since we want to keep the focus on the organisational features of BSD Owl Scripts, we will stick to this minimalistic Makefile.

Bundle the articles together

To orchestrate the preparation of all our articles with BSD Owl Scripts we just need to write additional Makefiles.
./journal/Makefile
./journal/issue-2013-1/Makefile
./journal/issue-2013-2/Makefile
./journal/issue-2013-3/Makefile
Each Makefile basically contains the list of subdirectories where make should descend to actually build, install or clean. Readers fond of design patterns will recognise aggregates implementing a delegate pattern.
The file ./journal/Makefile should contain:
PACKAGE=        journal

SUBDIR=         issue-2013-1
SUBDIR+=        issue-2013-2
SUBDIR+=        issue-2013-3
.include "bps.subdir.mk"
The file ./journal/issue-2013-1/Makefile should contain:
SUBDIR=         01-galdal
SUBDIR+=        02-arathlor
.include "bps.subdir.mk"
The remaining files ./journal/issue-2013-2/Makefile and ./journal/issue-2013-3/Makefile can be similarly prepared. With these settings, the targets all, build, clean, distclean, realclean and install are delegated to Makefiles found in the subdirectories listed by SUBDIR.
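With this in place, a typical session at the top of the project could look like the following sketch, using the delegated targets listed above:
% cd ~/journal
% make build      # descend into each issue, then each article, and typeset it
% make install    # copy the typeset articles to their installation directory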
The variable SUBDIR_PREFIX can be used to define a customised installation path for each article, so that the Makefile building a document could be
DOCUMENT=       article.tex
DOCDIR=         ${HOME}/publish/journal${SUBDIR_PREFIX}
.include "latex.doc.mk"
With this setting, the document ./journal/issue-2013-1/01-galdal/article.pdf will be installed as ${HOME}/publish/journal/issue-2013-1/01-galdal/article.pdf, and so on. It is possible to tweak this to use arbitrary naming schemes for the installed articles, for instance ${HOME}/publish/journal/issue-2013-1/01-galdal.pdf or whatever we fancy.

Declare locations of file assets

We can elaborate on our basic setup to handle the case where our documents share assets, for instance a logo for our journal or some custom LaTeX packages. In BSD Owl Scripts we can use the TEXINPUTS variable to declare one or more such locations. For instance the declaration
TEXINPUTS=      ${HOME}/share/texmf/tex/latex/journal
will arrange for TeX to find all files in ${HOME}/share/texmf/tex/latex/journal when it needs them. This statement can be added to the individual Makefiles responsible for the preparation of an article, or it can be added to ./journal/Makefile.inc. The latter file is read by make every time it processes a Makefile based on BSD Owl Scripts. Adding that declaration to ./journal/Makefile.inc is therefore similar to adding it to each single Makefile in the project.

Monday, 13 April 2015

Testing complex shell programs without installing them

A simple shell script fitting in one file can easily be tested from the command line. Complex scripts relying on several shell subroutine libraries and other file assets are a bit more complicated to test, because the file assets used by the script lie at different locations on the file system when the script is developed and when it is installed. Let us see how we can use parameter expansion modifiers to easily test our shell scripts without installing them, and how this is actually done in a small project, anvil.

Test the scripts in a specially crafted environment

We can use environment variables to reconcile the shell script's idea of where to find its file assets when it is installed and when it is tested as part of its development cycle. More precisely, we can take advantage of the Assign Default Values parameter expansion modifier. This modifier is described as follows in sh(1) on FreeBSD:
${parameter:=word}
Assign Default Values. If parameter is unset or null, the expansion of word is assigned to parameter. In all cases, the final value of parameter is substituted. Quoting inside word does not prevent field splitting or pathname expansion. Only variables, not positional parameters or special parameters, can be assigned in this way.
Assume that our software package is called anvil and its source repository contains a collection of function libraries found in the folder subr, which are copied to /usr/local/share/anvil/subr when the package is installed. If we write
: ${subrdir:=/usr/local/share/anvil/subr}
near the top of our shell program, we can use ${subrdir} to access our subroutine libraries, for instance
. "${subrdir}/common.sh"
In order to test our script as part of its development cycle, we must ensure that it reads the subroutines found in its source repository, and not the libraries found in /usr/local/share/anvil/subr, left there by the installation of an older version of the software package. For this, it is enough to run the script in an environment where the variable subrdir has been set to ./subr or to the absolute path of that directory.
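Concretely, a test run from the root of the source repository could look like the following sketch, where the script name bin/anvil_example is purely hypothetical:
# Run the development version against the subroutine libraries found in
# the source repository rather than the installed ones.
% env subrdir="$(pwd)/subr" sh ./bin/anvil_example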

Configure installation paths

Instead of hard-wiring the installation paths in the sources, we may want to use configuration parameters for this. It is easy to use autoconf for this, but in order to avoid running the slow generated ./configure script repeatedly, we may prefer a solution based on autoconf and Makefiles, so that the ./configure script is run once and for all and make is used to edit configuration parameters in the scripts. This is the approach used by anvil, which can be studied as an example of this technique.
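To give an idea of the technique without reproducing anvil's actual build rules, the substitution performed at installation time could be as simple as the following sketch, where the placeholder @subrdir@ and the file anvil_example.in are hypothetical:
# Run the configure script once to record the installation parameters.
% ./configure --prefix=/usr/local

# At installation time, a make rule bakes the recorded value into the
# installed copy of the script.
% sed -e 's|@subrdir@|/usr/local/share/anvil/subr|g' \
    anvil_example.in > anvil_example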

Sunday, 12 April 2015

Delegating complex treatments to filters in shell programs

Novice shell programmers tend to reproduce procedural structures they learnt from classical procedural languages like Pascal or C¹. While this produces results, the approach is catastrophic, and complex treatments should instead be delegated to filters. I will first convince you that catastrophic is not as much of a hyperbole as it may seem, by discussing a simple concrete example from a code review I recently made.

Example presentation

I reviewed a submission for opam, the package manager recently adopted by the OCaml community, and spent quite some time commenting on a code snippet which confronts the list of compilers supported by the system with a list of compiler versions that are to be kept, the purpose of the script being to remove the remaining ones.
Each supported compiler is represented on the filesystem by a directory, whose path relative to ${OPAMROOT} — the path to data owned by opam — has the following structure:
compilers/${SERIES}/${VERSION}
So for instance, the directory compilers/4.02.1/4.02.1+PIC corresponds to the compiler 4.02.1+PIC in the 4.02.1 series.
The snippet I reviewed solves a rather straightforward problem: given a list of compiler versions held by the variable COMPILER_VERSIONS, it removes from the file-system the compilers whose version is not listed in COMPILER_VERSIONS.

Mimicking the classic procedural approach

The classic procedural approach to solving this problem can be worded as “consider each compiler; if it is not in my little list, then delete it.” It can be implemented in the shell like this:
is_in_compiler_versions()
{
  local version
  for version in ${COMPILER_VERSIONS}; do
      if [ "${version}" = "$1" ]; then return 0; fi
  done
  return 1
}

for compiler in compilers/*/*; do
    if ! is_in_compiler_versions "${compiler##compilers/*/}"; then
        rm -r -f "${compiler}"
    fi
done
There is nothing terribly surprising here and it definitely works, so why should we consider this approach catastrophic? Here are a few reasons:
  1. The code is hard to read: there is no function name advertising the purpose of the main treatment, and this purpose is well hidden in a conditional in the body of a for-loop.
  2. The code is hard to debug, since of the three important data sets involved in this treatment, only the list COMPILER_VERSIONS can be easily examined by the maintenance programmer. The list of supported compilers and the list of compilers to remove from the file-system exist only in an evanescent manner in this code and cannot be easily examined.
  3. The code is hard to reuse, because the enumeration of the compilers to remove and the actual removing are tightly bound together.
To put this in a few words, mimicking the classical procedural approach led to code which is hard to read, hard to debug and hard to reuse. Maybe labeling this catastrophic was not an exaggeration, after all. And we did not even consider execution speed: the shell being rather slow, that kind of code performs poorly when it has to handle a lot of data.
The example itself is of course quite innocent, but things get worse when we consider more complicated treatments and larger programs. Now, what can we do about this?

Delegating complex treatments to filters

Understanding that complex treatments should not be performed by the shell itself but delegated to filters is probably the most important perspective shift required to program the shell properly. Here is how we can rewrite the previous snippet using filters.
find_compilers()
{
  find "compilers" -type d -depth 2
}

select_not_in()
{
    awk -F '/' -v filter_out_list="$1" '
BEGIN {
  split(filter_out_list, s, " ")
  for(i in s){
    filter_out[s[i]]
  }
}
!($3 in filter_out) {print}
'
}

find_compilers\
  | select_not_in "${COMPILER_VERSIONS}"\
  | xargs rm -r -f
This solves all the problems found before. The code is easy to read because the function names make their purpose obvious. We do not need to understand awk to guess what the filter select_not_in does, since it is pretty clear from its name; using awk here is essentially irrelevant, as any language can be used to perform this selection step. It is very easy to scan the code down to the end of the pipeline to see that its purpose is to remove some files. The code is also easy to debug, because the maintenance programmer can break the pipe sequence anywhere to examine the output of the program at that point, insert tees, insert a filter to pause between each line, or mock the input of a filter. This code is easy to reuse because each of the three steps is independent. Last, it is way faster than the previous program and starts far fewer processes.
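For instance, the pipeline can be instrumented as follows while debugging; the temporary file names are arbitrary:
# Keep a copy of each intermediate list while the pipeline runs.
find_compilers \
  | tee /tmp/compilers.all \
  | select_not_in "${COMPILER_VERSIONS}" \
  | tee /tmp/compilers.to-remove \
  | xargs rm -r -f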

Consequences for code organisation

Once we have understood the benefits of delegating complex treatments to filters, we can draw a few consequences for the organisation of our programs and for the skills we should develop.
  1. As a rule of thumb, shell variables should not contain any complex data² and should contain only values from the Unix world, that is, paths in the file system and PIDs. Everything else is stored in files or flows from one process to the other through a pipe.
  2. It is crucial to know well a tool which we can use to quickly write these filters. I am very pleased with sed and awk, as the versions in BSD systems are quite lightweight in comparison to some others, but there are plenty of reasonable choices here.


¹ Yes I wrote Pascal or C. This is a bit old-school, I know.
² Are base64-encoded files complex data?

Saturday, 11 April 2015

Drawing METAPOST pictures with BSD Owl Scripts



METAPOST, a program by John Hobby, is a powerful language for creating technical drawings, and it is found in most if not all TeX distributions. While most LaTeX compilation assistants do not pay much attention to METAPOST, it is very well integrated in BSD Owl Scripts, so that preparing a LaTeX document containing beautiful METAPOST pictures is achieved by a Makefile as simple as
DOCUMENT=        galley.tex
SRCS+=           figures.mp
.include "latex.doc.mk"
It is also possible to produce the pictures on their own, using a Makefile similar to
DOCUMENT=        figures.mp
MPDEVICE=        eps pdf png svg
.include "mpost.doc.mk"
It will produce EPS, PDF, PNG and SVG versions of the figures.
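Assuming the usual targets provided by BSD Owl Scripts modules, a typical session is then as short as:
% make            # run METAPOST and convert the figures to each listed device
% make install    # copy the results to the configured installation directory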

If you do not know METAPOST, here are a few figures drawn with it: a performance comparison chart, a timeline and a UML diagram.
These pictures are examples found in my Blueprint project, a library of METAPOST definitions. This project also illustrates the use of BSD Owl Scripts to produce METAPOST pictures.

See also: Producing LaTeX documents (BSD Owl Scripts documentation), TeX Users Group page dedicated to METAPOST, André Heck's METAPOST tutorial.