Wednesday, 24 December 2008

CloudCamb

I've put up a (heavily-edited) transcript of my CloudCamb talk for anyone interested.

Comments have been automatically disabled | digg this | reddit | del.icio.us

Monday, 22 December 2008

Revisiting Erlang

This year's holiday project is to revisit my attempt to learn Erlang, this time in a more structured fashion, following Joe Armstrong's book.

The first substantive exercise is:

Write a ring benchmark. Create N processes in a ring. Send a message round the ring M times so that a total of N * M messages get sent. Time how long this takes for different values of N and M.

Write a similar program in some other programming language you are familiar with. Compare the results. Write a blog, and publish the results on the internet!

Programming Erlang
— Joe Armstrong

No benchmarking or comparison, I'm afraid, but here's my solution.

ringfun(_, 0) ->
    ok;
ringfun(Pid, Times) ->
    receive
        next ->
            io:format("Pid ~p sending message ~p~n", [self(), Times]),
            Pid ! next,
            ringfun(Pid, Times-1);
        Pid2 ->
            ringfun(Pid2, Times)
    end.

ring(N, M) when N>1, M>=1 ->
    FirstPid = spawn(fun() -> ringfun(ok, M) end),
    FirstPid ! lists:foldl(fun(_, Pid) ->
                                   spawn(fun() -> ringfun(Pid, M) end)
                           end, FirstPid,
                           lists:seq(1, N-1)),
    FirstPid ! next,
    ok.

The version above has no timing information. It also doesn't tell the parent process when the messages have finished circulating. These two issues are related: the parent process needs to know when to stop timing, and both timestamps need to be taken in the parent process in case the child processes have a different notion of time. The updated version below fixes both issues:

ringfun(_, 0, ok) ->
    ok;
ringfun(_, 0, ParentPid) ->
    ParentPid ! done;
ringfun(NextPid, Times, ParentPid) ->
    receive
        next ->
            NextPid ! next,
            ringfun(NextPid, Times-1, ParentPid);
        NewPid ->
            ringfun(NewPid, Times, ParentPid)
    end.

ring(N, M) when N>1, M>0 ->
    S = self(),
    FirstPid = spawn(fun() -> ringfun(ok, M, S) end),
    FirstPid ! lists:foldl(fun(_, Pid) ->
                                   spawn(fun() -> ringfun(Pid, M, ok) end)
                           end, FirstPid,
                           lists:seq(1, N-1)),
    T1 = erlang:now(),
    FirstPid ! next,
    receive
        done ->
            T2 = erlang:now()
    end,
    Tdiff = timer:now_diff(T2, T1)/1000000,
    io:format("Sent ~p messages around a ring of length ~p in ~p seconds~n",
              [M, N, Tdiff]).

and a quick proof from the pudding:

13> chapter8:ring(3000,3000).
Sent 3000 messages around a ring of length 3000 in 10.3243 seconds
ok
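
For what it's worth, here is a rough sketch of how the "similar program in some other programming language" half of the exercise might look in Python. It is a stand-in under loud assumptions rather than a fair comparison: threads and queue.Queue play the part of Erlang processes and mailboxes, and the names ring, worker and inboxes are mine, not from the book.

```python
import queue
import threading
import time


def ring(n, m):
    """Pass a token round a ring of n threads m times; return (messages, seconds)."""
    inboxes = [queue.Queue() for _ in range(n)]
    sent = [0] * n  # each worker only ever touches its own slot

    def worker(i):
        nxt = inboxes[(i + 1) % n]
        for _ in range(m):
            inboxes[i].get()   # block until the token arrives
            nxt.put("next")    # forward it to the next worker in the ring
            sent[i] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    start = time.perf_counter()
    inboxes[0].put("next")     # kick the ring off, as FirstPid ! next does
    for t in threads:
        t.join()               # every worker has forwarded the token m times
    return sum(sent), time.perf_counter() - start


if __name__ == "__main__":
    messages, seconds = ring(100, 100)
    print("Sent %d messages in %.4f seconds" % (messages, seconds))
```

As in the Erlang version, each worker forwards the token M times and then exits, so N * M messages are sent in total. Operating-system threads are far heavier than Erlang processes, so a ring of 3000 is probably best not attempted this way.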

Tuesday, 18 November 2008

Two quick git tricks

A handy oneliner

To retrieve the contents of a file from a different branch without affecting your current git state:

git show "$BRANCH_NAME:$FILE_PATH"

Creating full git history from release tarballs

Given a set of tar-balls from a project whose version control is not exposed to the public:

toby-whites-macbook:python-memcached tow$ ls /Volumes/ftp.tummy.com/old-releases/*.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.2.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.2_tummy1.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.2_tummy2.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.2_tummy3.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.2_tummy4.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.2_tummy5.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.2_tummy6.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.31.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.32.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.33.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.34.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.36.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.37.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.38.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.39.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.40.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.41.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.42.tar.gz
/Volumes/ftp.tummy.com/old-releases/python-memcached-1.43.tar.gz

We can turn this into a git repository with full version history fairly easily:

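PROJECT=python-memcached
# Start with an empty repository holding a single empty commit.
mkdir "$PROJECT"
(
  cd "$PROJECT"
  git init
  git commit --allow-empty -m "Empty repo"
)
# The shell expands the glob in lexical order, which here matches release order.
for i in "$PROJECT"*.tar.gz; do
  # Strip the project prefix and the extension, leaving the version.
  vntgz=${i#"$PROJECT"-}
  vn=${vntgz%.tar.gz}
  echo "$vn"
  tar xzf "$i"
  # Lend the repository to the freshly unpacked tree ...
  mv "$PROJECT/.git" "$PROJECT-$vn"
  (
    cd "$PROJECT-$vn"
    # ... stage every addition, change and deletion, and commit ...
    git add -A .
    git commit -m "$vn"
  )
  # ... then hand it back, ready for the next release.
  mv "$PROJECT-$vn/.git" "$PROJECT"
done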

For re-use, you might need to play around with some of the substring matching to get the right naming conventions for another project.

You don't get a useful Changelog, but now you can do git bisect tricks and the like.


Thursday, 6 November 2008

Python anonymous classes

This post is as much to remind me of how to do this as anything else.

I sometimes find myself wanting anonymous classes; essentially, I want to be able to have objects which are just bags of properties. Clearly you can get most of this behaviour from a dictionary:

anon_object_0 = {}

and then assign properties as dictionary entries:

anon_object_0['property'] = 1

But, in my head at least, there is a difference in intent when manipulating a dictionary, and when manipulating an object with properties, and sometimes I want to explicitly be using an object, with property-accessor syntax.

It's also useful to be able to generate things that behave like objects, for driving interfaces which expect objects of a certain type; that is, "mock objects", though that term means different things to different people (and there are Python libraries for those already). That use case is most often seen in testing, but I've needed it in production code occasionally.

So usually, when I want anonymous objects, I want to be able to set them up trivially; I don't want to pull in a library, nor do I want a lengthy definition - one line is verging on too much already.

There are basically two ways to create useful anonymous classes in Python.

The obvious way is just:

class Anonymous(object): pass

Thereafter you can do:

anon_object_1 = Anonymous()
anon_object_1.property = 1

My preferred way takes advantage of the python type constructor; this creates an anonymous class:

type("", (), {})

and this, of course, creates an instance:

anon_object_2 = type("", (), {})()

which I prefer to the first. Firstly, it genuinely is a one-liner to define an anonymous object, and I don't have to worry about the anonymous class per se; and secondly, for the relatively unimportant reason that

>>> anon_object_1
<__main__.Anonymous object at 0x6f8b0>
>>> type(anon_object_1)
<class '__main__.Anonymous'>

>>> anon_object_2
<__main__. object at 0x6f8f0>
>>> type(anon_object_2)
<class '__main__.'>

The first contains a reference to the name of its defining class, so is not exactly anonymous; and it's invaded your namespace a little. Largely in an unimportant way, since you can re-use that part of your namespace easily enough, but this is true:

>>> 'Anonymous' in globals()
True

where it wasn't before.

An interesting variation of the first way can be seen in Peter Norvig's Python Infrequently Asked Questions - he has a "Struct" class, which has a series of useful methods which help with initialization and update of arbitrary properties.
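
As a small sketch of how the type() one-liner can be stretched to set properties at creation time: the helper name anon below is my own invention for illustration, not anything standard (and not Norvig's Struct). Note that the keyword arguments end up as attributes of the throwaway class rather than of the instance.

```python
def anon(**attrs):
    """Return an instance of a fresh, nameless class carrying attrs."""
    return type("", (), attrs)()


point = anon(x=1, y=2)
point.z = 3  # further properties can still be added afterwards

print(point.x, point.y, point.z)
print(repr(type(point).__name__))
```

Each call builds a new class, so two anonymous objects never share a type; whether that matters depends on what you're doing with them.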


Friday, 19 September 2008

Testing with nose, or why setuptools is annoying.

I use nose for my Python tests. It's not the only Python testing framework out there, but it seems to fit my needs.

Anyway; so nose has this concept of plugins, which let you extend test discovery, or add extra fixtures, or whatever. Indeed, nose's core functionality is implemented by bundled plugins. It picks up all available plugins automatically by scanning the entrypoints from packages installed by setuptools. This has the irritating effect that

  • nose itself can't work without being installed, since it needs to find its own bundled plugins.

  • any additional plugins you write or use have to be installed as well.

Now I don't like this at the best of times; I get annoyed by software that insists it knows better than me where it should live, and I especially don't like blindly installing new software which might go & stomp all over existing installed software. Once you introduce versioning into the equation, I get even more annoyed; you end up with a Python version of DLL Hell, with one application needing version 0.8 of a package, and another needing version 0.9.

But - since most of the rest of the world apparently shares none of my concerns with these issues, I struggle manfully onwards.

This week, there's been a thread on the testing-in-python mailing list "why you should distribute tests with your application / module". I agree 100% - tests should always be distributed with applications; I think almost all of the software I've ever written has had a bundled test-suite. (This was particularly useful for Fortran software, where there is such a wide range of compilers, but it still helps in tracking down system-specific issues even in Python).

Unfortunately, nose's setup.py requirements fly in the face of this; most users won't have nose installed. I could just about forgive nose for this, if I could rely on distributing custom plugins with my package, and being able to pick them up from the local path; but I can't even do that.

It's also worth noting that this is an issue even for software with a very limited distribution. If you're working collaboratively on a project, then all your colleagues need to be able to run tests too; and if you're working in a heterogeneous environment, then adding additional dependencies and installation requirements becomes rapidly onerous, and liable to piss off your co-developers.

Anyway. To round this story off with at least a moderately cheerful ending, I was happy enough with nose's usability not to abandon it, but pissed off with its requirements enough to try and fix them. There's a patch in the nose bug-tracker which at least partly fixes the issue, so that nose will pick up plugins from sys.path.

I suspect in the long term, though, the answer to most of these issues lies in the use of virtualenv. Enough people insist on requiring setuptools-based install, that it will probably be easier simply to isolate every app with its own dependencies in a virtualenv, and just distribute that instead.

In the meantime, for anyone actually reading this; REQUIRING SETUPTOOLS IS FUCKING ANNOYING, MMM'KAY? DON'T DO IT


Thursday, 11 September 2008

New job + Python descriptors

So, there's been a bit of a hiatus in my blogging activity, which has coincided with a change in my job.

I'm no longer employed by the university - as of the start of August I've been working as a founder of a startup. We're still in stealth mode, so output here will be work-related, but not too revealing, initially at least. I think it's probably safe to say that there will be much more Python than Fortran from now on!

Anyway, I thought it good practice to start writing English again, after several weeks of nothing but Python. Naturally of course these English words will concern Python …

So today I first used Python descriptors in anger. The particular pattern used I hadn't seen before, so I thought I'd write about it.

The problem I faced was how to nicely deal with an object which is expensive to initialize, which there should only be one of, and which is used by a number of other objects. If I were writing in Java, this would be a classic use-case for a Singleton, with some form of delayed initialization. How to do it in a more Pythonic way, though?

The easiest way to get Singleton-ish behaviour is probably to have the ExpensiveObject defined in its own module, with one instance instantiated as a module-level variable, and thus initialized on module import. This means that any other objects which need access to it can simply have a class attribute pointing at it.

elsewhere.py:
class ExpensiveObject(object):
    ...

expensive_instance = ExpensiveObject()

user.py:
class ObjectUser(object):
    from elsewhere import expensive_instance
    reference = expensive_instance
    ...

This doesn't delay instantiation, though - the instantiation is performed whenever the ObjectUser definition is processed. Since expensive_instance isn't always needed, it's annoying to have to always create it.

In order to avoid that, clearly we need to remove the expensive_instance from elsewhere and replace the ObjectUser attribute reference with a function call.

We could do this in ObjectUser by overriding its __getattr__ appropriately, to do the normal trick where we check whether expensive_instance is defined on this object, and if not, putting it there:

def __getattr__(self, name):
    # Only reached when normal attribute lookup has failed.
    if name == 'reference' and name not in self.__dict__:
        from elsewhere import expensive_instance
        setattr(self.__class__, name, expensive_instance)
    # Retry the normal lookup; raises AttributeError for unknown names.
    return object.__getattribute__(self, name)

which has a few problems.

  • Firstly, this check runs on every attribute lookup that falls through to __getattr__ on this object, which is an unnecessary price.

  • Secondly, if we are doing lots of __getattr__ tricks for other attributes as well, it's messy to have them all in the same method.

  • Thirdly, we've set expensive_instance to be a class attribute, which means that every class for which we do this will get its own expensive_instance.

We could solve the latter two issues with inheritance - have a small class (ExpensiveFactory?) which does nothing but override __getattr__ for the attribute of interest. This isolates the __getattr__ logic for this attribute, and makes sure that only one copy of expensive_instance is instantiated (as a class variable of ExpensiveFactory).

class ObjectUser(ExpensiveFactory):
    ...

But: we still haven't solved the first problem (speed of __getattr__) and we've introduced another - if any of these child classes want to override __getattr__, they have to remember to call super() all the way up the inheritance hierarchy (see Python's Super considered harmful).

Anyway - so this (I think) was exactly the reason that descriptors were invented. Instead of ExpensiveFactory, we have ExpensiveDescriptor:

elsewhere.py:
class ExpensiveDescriptor(object):
    _expensive_instance = None
    def __get__(self, instance, owner):
        # Build the shared instance on first access only.
        # ExpensiveObject is defined earlier in this module.
        if self.__class__._expensive_instance is None:
            self.__class__._expensive_instance = ExpensiveObject()
        return self.__class__._expensive_instance

user.py:
import elsewhere

class ObjectUser(object):
    expensive_instance = elsewhere.ExpensiveDescriptor()
    ...

The first time ObjectUser().expensive_instance is accessed, the descriptor's __get__ method is invoked, and an ExpensiveObject created - but not before then. Later accesses get the same instance back.

This happens for ObjectUser, and any classes which inherit from it, without any further interference in them.

And __get__ is only invoked for this one attribute, so there is no cost when accessing any others.

And, of course, since _expensive_instance is a class attribute of the Descriptor, there should only ever be one created.
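
To check those claims, here's a minimal single-file sketch of the same pattern. The construction counter on ExpensiveObject and the OtherUser subclass are my additions for illustration only; everything lives in one module, so the elsewhere/user split above is collapsed.

```python
class ExpensiveObject(object):
    """Stand-in for something costly to build; counts constructions."""
    instances_created = 0

    def __init__(self):
        type(self).instances_created += 1


class ExpensiveDescriptor(object):
    _expensive_instance = None

    def __get__(self, instance, owner):
        cls = type(self)
        if cls._expensive_instance is None:
            # First access from anywhere: build the one shared instance.
            cls._expensive_instance = ExpensiveObject()
        return cls._expensive_instance


class ObjectUser(object):
    expensive_instance = ExpensiveDescriptor()


class OtherUser(ObjectUser):
    """Inherits the descriptor without any extra code."""


# Defining the classes has not built anything yet.
assert ExpensiveObject.instances_created == 0

first = ObjectUser().expensive_instance
second = OtherUser().expensive_instance

# Both users see the same object, and it was only built once.
assert first is second
assert ExpensiveObject.instances_created == 1
```

Because the cache lives on the descriptor class rather than on ObjectUser, a second, unrelated class declaring its own ExpensiveDescriptor() attribute would share the same instance too.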

Actually, you could have ExpensiveDescriptor manipulating the module attribute elsewhere.expensive_instance - this would let you get at the expensive_instance from anywhere in the code without having to go through an object - but only after it had been instantiated by one of the accessing objects. Might or might not be useful, depending on your use cases.

Anyway, so that's why descriptors are brilliant! For more reading, try:


Wednesday, 23 April 2008

XPath and QNames in content

As any fule no, QNames are how XML does namespaces. Where a namespace has been declared:

<c:cml xmlns:c="http://www.xml-cml.org/schema"/>

the "c" prefix on the element name is associated, via the xmlns attribute, with the namespace URI. This is trivially manipulable with any namespace-aware tool.

So far so good. However, when QNames are used in content (typically, as an attribute value) then the situation is more complex. The two nodes below are equivalent under QName-in-content processing.

<c:cml xmlns:c="http://www.xml-cml.org/schema"
  att="c:comp"/>
<d:cml xmlns:d="http://www.xml-cml.org/schema"
  att="d:comp"/>

This usage is blessed by the W3C, http://www.w3.org/2001/tag/doc/qnameids.html, and XSLT depends on it working.

But it's significantly harder to work with using most XML toolkits.

node()[@att='string']

The above XPath returns all nodes which have att="string". However, it turns out that matching on a namespace-resolved QName needs the following:

node()[substring-after(@att, ':')='comp'
       and @att[../namespace::*
                 [name()=substring-before(../@att,':')]
                ='http://www.xml-cml.org/schema']
      ]

if you only allow for prefixed QNames (eg c:comp above). If you want to be able to match unprefixed QNames as well, that is, QNames in the default namespace:

<cml xmlns="http://www.xml-cml.org/schema"
  att="comp"/>

then you need to extend the expression to the following:

node()[(substring-after(@att, ':')='comp'
        and @att[../namespace::*
                  [name()=substring-before(../@att,':')]
                 ='http://www.xml-cml.org/schema'])
    or (@att='comp'
        and namespace::*[name()='']
             ='http://www.xml-cml.org/schema')
       ]

which is hardly transparent!
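
For comparison, here is roughly what the same resolution looks like done by hand outside XPath. This is only a sketch using the Python standard library's ElementTree, which reports namespace declarations as start-ns and end-ns events while parsing; the function name, the sample document and the http://example.org/other namespace are all made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def elements_with_qname_attr(xml_text, attr, namespace, local):
    """Return tags of elements whose attr, read as a QName, resolves to (namespace, local)."""
    in_scope = []  # stack of (prefix, uri) declarations, innermost last
    matches = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end-ns", "start")):
        if event == "start-ns":
            in_scope.append(payload)       # payload is a (prefix, uri) pair
        elif event == "end-ns":
            in_scope.pop()                 # a declaration has gone out of scope
        else:
            value = payload.get(attr)      # payload is the element just opened
            if value is None:
                continue
            prefix, _, name = value.rpartition(":")
            uri = dict(in_scope).get(prefix)   # '' stands for the default namespace
            if uri == namespace and name == local:
                matches.append(payload.tag)
    return matches


doc = (
    '<root>'
    '<c:cml xmlns:c="http://www.xml-cml.org/schema" att="c:comp"/>'
    '<d:cml xmlns:d="http://www.xml-cml.org/schema" att="d:comp"/>'
    '<cml xmlns="http://www.xml-cml.org/schema" att="comp"/>'
    '<e:cml xmlns:e="http://example.org/other" att="e:comp"/>'
    '</root>'
)
print(elements_with_qname_attr(doc, "att", "http://www.xml-cml.org/schema", "comp"))
```

The prefixed and unprefixed cases fall out of the same lookup here, since an attribute value with no colon simply yields an empty prefix.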

Much as I think XPath 2 is a bad idea in general, this is one area where it is a significant step forward; it offers node functions:

which will do what they suggest. Of course XPath 2 then buggers things up again by saying:

In XPath Version 2.0, the namespace axis is deprecated and need not be supported by a host language

W3C Recommendation 23 January 2007
— XML Path Language (XPath) 2.0

Who needs backwards compatibility anyway?

But since libxml2 doesn't support XPath 2, I don't propose to worry very much about it.

In any case, unwieldy though the above solutions are, they work correctly.


Tuesday, 22 April 2008

m4Y - the Y combinator in m4

Just to get the goods up front:

define(`m4Y', `dnl
pushdef(`m4Y_recur',dnl
`pushdef(`m4Y_LL',dnl
`$1''changequote([,])(['changequote([,])`changequote`]`$[]1'(``$[]1'')dnl
['changequote([,])'changequote`])changequote`dnl
(changequote([,])`$[]1'changequote)dnl
`popdef(`m4Y_LL')')'dnl
`m4Y_LL')dnl
pushdef(`m4Y_LL',`dnl
m4Y_recur(`m4Y_recur')'changequote([,])(`$[]1')changequote`dnl
popdef(`m4Y_recur')`'popdef(`m4Y_LL')')`'dnl
'`m4Y_LL')`'dnl

So as seems to be popular, I've been working my way through The Little Schemer over the last few weeks.

And, as is equally common, I ground to a halt at the derivation at the end of Chapter IX, where they spring the Y Combinator on the unsuspecting audience. The best way to understand it is to work through it by yourself, so I thought I would see if you could do one in m4. And it turns out you can, though it's not very pretty!

Clearly what the world needs is to know about it, so I wrote it up, and you can follow the derivation in two essays:

  1. Higher-Order Programming in m4, which shows you how to do proper quoting to get macro Currying to work.

  2. The Y Combinator in m4, which uses those quoting techniques to do the full derivation of m4Y above.

Beware that m4 may be dangerous for the health of compulsive programmers.

— The GNU m4 manual

Monday, 21 April 2008

asciidoc source code highlighting

As I've mentioned before, all the entries in this blog are written in asciidoc, which is very nice for a lightweight markup language, particularly in terms of embedding code fragments and having them marked up nicely.

Asciidoc uses GNU Source-highlight as its backend for generating pretty code fragments, which does a reasonable job, and its author is very responsive to feedback - he's fixed a couple of bugs in the Fortran and XML modules for me.

However, I've been growing dissatisfied with its use, for two reasons.

  1. It has a very heavyweight dependency on boost, for its regex library. Compiling boost takes several hours, and this seems to me like massive overkill for a bit of code highlighting.

  2. It is purely regex-driven. Furthermore, all language front-ends are defined in terms of a mini-regex language. This means that its markup capabilities are fundamentally limited to a very simple regex subset.

In any case, it can't approach the expressiveness of Emacs font-lock highlighting, which is what I'm used to.

So, I thought it ought to be possible to abuse one of several available emacs-lisp packages to do the job, and indeed it was. The script available here is a wrapper around a modified version of htmlfontify, and works like so:

htmlfontify -mode $MODENAME $FILENAME

or if $FILENAME is -, it takes input on stdin. It will print out a properly marked-up fragment of HTML on stdout, marked up according to emacs, in $MODENAME-mode fontification.

Note
Importantly, it is entirely standalone, with no dependencies beyond Emacs 21 or better, which is installed everywhere these days.

This was easy to write an asciidoc filter for, so code in this blog will henceforth be marked up by emacs.


Wednesday, 16 April 2008

Finder WebDAV bugs, part II

As a follow-up to my last post on this, some good news and some bad.

Good news: Apple Engineering got back rapidly, with a good understanding of the authentication issue, and a good suggestion for how they might fix it. Of course the fix won't emerge until at least 10.5.3.

Bad news: there is another bug lurking in Finder's webdav implementation that I keep coming across. I haven't characterized it well enough to report, but I'm noting it down here so Google has some record of it at least.

The symptom is that after the state of the webdav server changes in some way (Certainly not every time it changes; I think this occurs when a directory that was previously readable has become unreadable because permissions have changed) then when you try and eject the mounted disk, Finder refuses with one of two error messages.

  1. It complains that the disk is in use, even when it's not - and this can be confirmed by

    sh-3.2# lsof /Volumes/webdav_mount
    lsof: WARNING: can't stat() webdav file system /Volumes/webdav_mount
          Output information may be incomplete.
          assuming "dev=2d000009" from mount table

    The fix for this is a simple

    umount -f /Volumes/webdav_mount

  2. It gives the unhelpful message: "error code -8072"

    In this case, the fix is first to unmount the disk with umount as above, and then to restart Finder.

    Note
    Make sure to unmount the disk first!!! If you try and restart Finder (or logout, or reboot) without doing so, then the OS is liable to hang in an unretrievable state, so that only pulling the power cable fixes it - which has happened to me more than once.

Thursday, 10 April 2008

shell history

Borrowed from plasmasturm.

sloth: tow$ history|awk '{a[$2]++} END {for(i in a){printf "%5d\t%s\n",a[i],i}}'|sort -rn|head
   99  ls
   85  cd
   66  git
   53  xsltproc
   44  vi
   38  ssh
   22  python
   15  grep
    8  wget
    8  rm


The problem with visual programming languages

People seem to like visual (or graphical) programming languages (VPLs), but I don't think they should.

Reports of success

This was prompted by a talk on Monday, at the Royal Society meeting on environmental e-Science. One of the speakers (I forget who) was demonstrating a workflow system, run through a VPL environment. He talked about having given a workshop, showing scientists how to use the system, and quoted one of them as saying (paraphrased from memory):

I've accomplished in one afternoon what it took me the whole of Summer 2005 to do.

and used this as evidence for how wonderful such "friendly" VPL environments are.

I've been to a number of talks where such things are shown off, and it's undoubtedly true that placing these tools into the hands of some scientists does result in such reactions (although I suspect less often than their proponents like to think; and I'm not sure how long-lived such reactions are).

However, contrary to the conclusions usually drawn, I don't think that the praise should be given to the VPL environment - indeed I think such things are actively harmful, and actually, you could get the same reactions via different means.

Cleaner interfaces

I suspect that such positive reactions aren't actually caused by the visual nature of the environment, so much as by the fact that (compared with the typical workflow system of bodged-together Perl scripts):

  1. interfaces between components are much simpler;

  2. they've been pre-written by someone else;

  3. they've been designed to plug together.

But because humans are very visual creatures, it is the obvious differences in the visual aspect of the interface that are noticed, and it's to those that positive effects are ascribed. It may well be that there is an advantage there, but I think it is fairly small, and very easily overstated; see for example "Why looking isn't always seeing: readership skills and graphical programming", which I don't think enough relevant people have read.

When pulling things together using bodged Perl scripts, then you are reliant on whatever interfaces to the script, and to whatever other programs are being called, that someone else has written.

These interfaces are probably not quite what you want for your purposes, so you'll need to munge them a bit.

They're almost certainly not well-designed - indeed probably little thought has gone into interface design at all, so much as ensuring that the necessary scientific job gets done.

They may not be very functionally oriented; i.e. not well-suited to being called as part of a larger workflow. For example, there may be lots of out-of-band set-up required in the way of global environment variables and so forth.

All of these problems will be ameliorated in building components for any workflow system, certainly those of the type likely to underlie typical VPLs. Interface design will be an integral part of making components that fit together to make workflows of the type envisaged; components will have been co-designed, so that interfaces between them match well their intended use; and they will have to be built such that they don't rely on global state.

The end result is that the components of such a system can be pulled together and made to interact far more easily than a set of programs designed in isolation with little thought for reuse. (Though if badly done, it may result in components which can't be easily re-used outside the original workflow domain.) And of course this is true regardless of what programming interface is used to edit the resulting workflows.

However, I also believe that beyond this, VPLs are actively harmful.

Text-munging tools

My objection boils down to the fact that we have an enormous range of text-munging tools, but we have far fewer tools for munging whatever graphical representations are fed to us on the screen.

The issue that most obviously shows itself to me is version control, or revision tracking.

As any halfway-sensible programmer does, I keep all my projects under version control. The reasons are well-rehearsed, but I firmly believe that just as any program longer than 10 lines probably has a bug, any program longer than 10 lines should be kept under version control. The same applies to programs in VPLs. If you've got more than 10 or so components strung together, each of which probably has 5 or 6 tunable parameters, then you want to be able to preserve the state of the system, and record changes between them.

And this is not just in order to be able to roll back to previous versions; but to be able to usefully compare source trees:

  • so that you can see the difference between last week's and this week's version

  • so that you can see the difference between your version and your colleague's version.

  • so that you can usefully merge in adaptations from variant source trees.

The tools for doing this are well-established for text source, as are standard practices for improving source code. For example, reformatting - whitespace-only - updates to source code are often deliberately isolated from semantic changes. Similarly, groups of related changes are often made together in changesets.

Such practices facilitate the use of text-based tools, and as a result, management of multiple similar source trees is well understood and documented.

By comparison; if code were stored, for example, in a Word document (or to make the same point more extravagantly, as a bitmap image showing the text), this would be impossible. It would be just as easy for the programmer to read - but you'd lose all the power of automatic comparisons; source trees diverge wildly with every minor change. This wouldn't stop you using version control, and backtracking to previous versions, but your diffs would carry no useful information.

And so with visual programming. Clearly there is some serialization underlying whatever interface you're given, and more than likely that serialization is textual. So in principle, you could keep your programs under version control (though your IDE might not make it very easy).

However, the visual interface is at liberty to rewrite the serialization without consulting you; and what might be a minor, or indeed insignificant change to you (moving a box without changing its connectors; adding one additional connector) might well result in the serialization being completely restructured.

As a result, you end up in the same situation as with your Word-encoded source. You can't compare your workflows with a colleague's. You can't trivially diff your current version with that of 6 months ago, and identify key differences. You can't check it against yesterday's version, and work out what you changed which caused the testsuite breakage that you noticed this morning.
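
To make that concrete, here's a toy sketch in Python. The workflow format is entirely made up (no real VPL's serialization is being described); ide_dump stands for whatever the visual editor happens to write out, layout and all, while canonical_dump is the kind of stable, semantics-only form that text tools could actually work with.

```python
import copy
import difflib
import json


def changed_lines(old, new):
    """Return the added and removed lines of a unified diff between two texts."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    return [line for line in diff
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]


def ide_dump(workflow):
    """Serialize as a (hypothetical) IDE might: layout included, order as stored."""
    return json.dumps(workflow, indent=1)


def canonical_dump(workflow):
    """Serialize only what matters semantically, in a stable order."""
    nodes = sorted(({"id": n["id"], "params": n["params"]} for n in workflow["nodes"]),
                   key=lambda n: n["id"])
    edges = sorted(workflow["edges"])
    return json.dumps({"nodes": nodes, "edges": edges}, indent=1, sort_keys=True)


before = {
    "nodes": [
        {"id": "read", "x": 10, "y": 10, "params": {"path": "in.dat"}},
        {"id": "filter", "x": 120, "y": 10, "params": {"cutoff": 0.5}},
        {"id": "plot", "x": 230, "y": 10, "params": {"style": "line"}},
    ],
    "edges": [["read", "filter"], ["filter", "plot"]],
}

# Drag one box, and let the editor store the nodes in a different order.
after = copy.deepcopy(before)
after["nodes"][2]["y"] = 80
after["nodes"].reverse()

print(len(changed_lines(ide_dump(before), ide_dump(after))), "lines differ in the IDE form")
print(len(changed_lines(canonical_dump(before), canonical_dump(after))), "lines differ in the canonical form")
```

Nothing about the workflow's behaviour changed between before and after, yet only the canonical form says so. The catch, of course, is that you rarely get to choose which form your IDE writes.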

Although version control is, to me, the most obvious useful application of text-munging, there are any number of others: automatic bug-finding tools, refactoring tools, and so on, all of which rely on the well-understood machinery of text analysis.

Even if some of these may exist in a given visual IDE (and I wouldn't be surprised if a few did), they certainly don't all, nor would it be easy to transfer them between IDEs, since there is no agreed-upon standard way to do visual programming.

Source control

The open source movement relies on source to programs being available. What that means in precise terms begins to break down when we move away from the realm of traditional compiled languages; but in any case, what it means to me is the ability to usefully manipulate and inspect the algorithms behind a program.

You can't do this if you're given only the compiled program - all the usefully-human-readable information is in the source code, and is thrown away when compiling; minor changes in source code can result in very different machine code, and decompilers are very imperfect instruments.

In fact, that's not quite true; there are three layers of information. There's the human-readable description of what the code should do, which is probably in someone's head. There's the partly human-, partly machine-readable form in the source code. And that source can be programmatically manipulated (or "compiled" as we like to say) into alternative forms, one of which is machine code for execution.

When writing in VPLs, you radically lower the utility of the intermediate stage. The human-accessible code is in people's heads, and is transferred to a graphical form that isn't actually very human-readable - at least, not in the sense that I can write tools to do anything with it.

So until people devote as much time to research into software engineering and code management tasks for non-textual code representations as they have for textual ones, I don't think VPLs are ever going to be properly successful.

In fact, any code that is written in such environments is information that is being essentially thrown away - expertise that is being just as much wasted as if you threw away the source code to your compiled programs.
