Archive for the ‘clojure’ Category

Using binding to mock out even “direct linked” functions in Clojure

Saturday, April 17th, 2010

The Clojure macro binding is frequently handy for mocking out functionality during testing, but this sometimes does not behave as desired in multi-threaded context. Fortunately, there's a solution...

For example, at work we have unit tests on functions that may attempt to talk to network services, which we'd rather they not do:

(defn send-request [request server]
  '(...do real RPC stuff here...))

(defn average-timestamp [time-servers]
  (/ (apply + (map #(send-request :get-timestamp %)
                   time-servers))
     (count time-servers)))

In order to test average-timestamp, we need a "mock" function that will stand in for send-request. This must be a function that takes a request and a server and returns a timestamp, just like the real send-request. For this example, it can take the timestamp itself as the "server" and simply return that timestamp. For bonus points we can make sure the request parameter is what we expect:

(defn mock-send-request [request server]
  (assert (= request :get-timestamp))
  server)  ; assume "server" is actually the timestamp

(mock-send-request :get-timestamp 5)
;=> 5

Using binding we can temporarily replace send-request with mock-send-request:

(binding [send-request mock-send-request]
  (send-request :get-timestamp 5))
;=> 5

(binding [send-request mock-send-request]
  (average-timestamp [5 15]))
;=> 10

So there's the background: a pretty normal way to mock out stuff in Clojure. What this doesn't address is when some clever co-worker (hi, Nathan!) realizes that average-timestamp would work better if it talked to multiple time-servers in parallel, and that this could be easily accomplished by replacing the use of map with pmap:

(defn average-timestamp [time-servers]
  (/ (apply + (pmap #(send-request :get-timestamp %) time-servers))
     (count time-servers)))

This is an easy single-letter change that indeed works quite well with the real send-request. But when we try to use binding to mock it out, we run into problems:

(binding [send-request mock-send-request]
  (average-timestamp [5 15]))
; java.lang.ClassCastException:
;   clojure.lang.PersistentList cannot be cast to java.lang.Number

It's not obvious from the error message, but what's happening is our original un-mocked send-request is getting called. This is because binding only has an effect on the current thread, but pmap causes send-request to be called in other threads.

The solution is simple enough in Clojure 1.0 -- you simply mock out pmap as well, and since the API is identical to map, it's quite easy to do. But as you can see, this does us no good at all in recent versions of Clojure:

(binding [send-request mock-send-request
          pmap map]
  (average-timestamp [5 15]))
; java.lang.ClassCastException:
;   clojure.lang.PersistentList cannot be cast to java.lang.Number

The reason is that starting with Clojure 1.1, most clojure.core Vars are linked directly into code that uses them. This means that our definition of average-timestamp above links directly to the actual definition of clojure.core/pmap, not just the Var that points to it, thus attempts to rebind the Var with binding are futile.

But do not despair, there is a solution even for this. All you need to do is tell Clojure not to directly link pmap when it compiles average-timestamp. This is done by adjusting pmap's metadata before average-timestamp is compiled, like so:

; Set pmap to be dynamically linked
(alter-meta! #'pmap assoc :dynamic true)

; Must now re-define the function that uses pmap
(defn average-timestamp [time-servers]
  (/ (apply + (pmap #(send-request :get-timestamp %)
                    time-servers))
     (count time-servers)))

; Finally, our mocking out of pmap works
(binding [send-request mock-send-request
          pmap map]
  (average-timestamp [5 15]))
;=> 10

The best place to put the alter-meta! depends on your particular use case. You might want to figure out how to do the alter-meta! only when in testing and not in production. Remember that the metadata on pmap is global -- you're changing the only metadata that clojure.core/pmap has, so the namespace you're in when you do it makes no difference at all.

In our case at work however, the performance impact of leaving pmap dynamic all the time was not enough to worry about. Dynamic linking is still pretty fast and is fine even for our production uses of pmap in this particular code base. So we simply put the alter-meta! at the top of the file that used pmap and got on with our unit testing.

http://blog.n01se.net/wp-content/uploads/2010/01/joy.png
The Joy of Clojure
Thinking the Clojure Way

by Michael Fogus and Chris Houser

Clojure: Controlling run-away trains, onions, and exercise bikes.

Wednesday, February 3rd, 2010

A normal Clojure REPL (or prompt) in a terminal window is by default a bit touchy about infinite seqs, deeply nested structures, and long-running operations. Any of these can cause your REPL to wander off into the weeds, busy spinning, perhaps printing pages of useless data, with the only apparent remedy being to press CTRL-C, which generally kills the entire JVM, Clojure and all, and dumps you back at your operating system prompt.

It's unfortunate that this is the default, but Clojure's youth shows particularly when it come to tools and settings like this. Happily it's sufficiently mature to provide several solutions that are not difficult to apply.

Run-away trains

The most common of the problems listed above is the infinite seq. Such a seq is easy to create, easy to print, and can result in an run-away REPL. The solution: setting your *print-length* to something short of infinity. I recommend 103:

(set! *print-length* 103)

Now it's safe to print infinite sequences:

(iterate inc 0)
;=> (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
    45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
    66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
    87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 ...)

Note the ... at the end of the list which indicates there was more to print, but Clojure is respecting our requested limit and giving up after 103 items. I've been mocked upon occasion for choosing 103, as if 102 is insufficient and 104 dangerous. Of course it's not so important exactly what you pick, so I'm happy to keep my rationale to myself.

Run-away onions

But *print-length* won't help you in the case of infinitely recursive structures, data nested inside data like layers of an onion. While less common, such nesting can be deep enough that it can still be a problem. To limit printing of recursive structures, use *print-level*. Usually 15 is about right for me:

(set! *print-level* 15)

Now it's safe to print infinitely recursive structures:

(let [x (atom 0)] (reset! x {:deeper x}))
;=> {:deeper #<Atom@4cb533b8: {:deeper #<Atom@4cb533b8:
    {:deeper #<Atom@4cb533b8: {:deeper #<Atom@4cb533b8:
    {:deeper #<Atom@4cb533b8: {:deeper #<Atom@4cb533b8:
    {:deeper #<Atom@4cb533b8: {:deeper #}>}>}>}>}>}>}>}

Again you get a little textual indicator that print is giving up, in this case it's just the # at the deepest level.

Run-away exercise bikes

But these settings only help take control of run-away printing. Sometimes the REPL gets hung up doing something other than printing, sitting there spinning away like a stationary bike, and neither of these settings will help you. My final advice is to use repl-utils to register your REPL thread as killable by CTRL-C:

(require 'clojure.contrib.repl-utils)
(clojure.contrib.repl-utils/add-break-thread!)

Now when you press CTRL-C, instead of shutting down the whole JVM, an exception will be thrown within the REPL thread, which is usually exactly what you need to halt run-away computation. Note that repl-utils does this using a deprecated Java API that is described as unsafe. And yet it remains useful for just such circumstances. For example:

(deref (promise))

Look at that -- instant deadlock! Now press CTRL-C, and you get:

java.lang.Exception: SIGINT (NO_SOURCE_FILE:0)

...and you're back to REPL prompt. But now you've violated Java, so I'd recommend saving what you need to save and then restart your JVM anyway. If you don't, weird things can happen. For example if you run the above deref-promise example again immediately, it may fail with a slightly different exception, even before you press CTRL-C. Similar strangeness can occur if you press CTRL-C at the REPL when no operation is running, though in that case it usually just kills your next expression, whatever it is.

Hopefully by setting your *print-length*, *print-level*, and break-thread, you can feel more comfortable at the REPL with less fear of getting stuck.

http://blog.n01se.net/wp-content/uploads/2010/01/joy.png
The Joy of Clojure
Thinking the Clojure Way

by Michael Fogus and Chris Houser

The Joy of Clojure, my perspective.

Sunday, January 31st, 2010

There are a few books about Clojure in the works now. But despite the variety in authorship and publisher, they all seem to be primarily in the vein of a language tutorial, starting from a basis of "common knowledge" among most programmers and taking the reader to a level of basic familiarity with Clojure.

Michael Fogus and I think there is more to Clojure that has yet to be addressed by any of the books currently being written. So we've started writing "The Joy of Clojure". Our vision for this book is to impart not just what Clojure is, but why it's the way it is; not just how to use it, but how to choose well from among the wide variety of options Clojure provides. And along the way we hope you'll experience the wonder, amazement, and, yes joy that Clojure can bring to programming tasks that often feel like nothing more than chores in other environments.

I also wanted to take this opportunity to brag a little on Fogus. He's been a fantastic partner throughout this process -- I wouldn't even be involved if it weren't for him. He has put into our book many references to papers and books that I've never heard of, but that illuminate much of Clojure's design -- clearly he has read a good deal more than I have. Once we're done writing, I intend to use the Bibliography as a reading list -- I'm sure it will be enlightening. He is dedicated to producing the best book we can, and he's the only reason we're meeting any of our deadlines. I'm looking forward to the next few months as we finish this book together.

http://blog.n01se.net/wp-content/uploads/2010/01/joy.png
The Joy of Clojure
Thinking the Clojure Way

by Michael Fogus and Chris Houser

What is Clojure-in-Clojure?

Friday, July 10th, 2009

The code making up Clojure's current implementation can be divided into 3 main groups:

1. The compiler -- used at the REPL and for AOT compilation, the compiler translates Clojure expressions into JVM bytecode. For this discussion, the reader lives in this group.

2. The data structures -- persistent maps, sets, and vectors, sorted and unsorted. For this discussion we'll also include here the reference types (ref, agent, var, atom) and the STM implementation.

3. clojure.core -- Destructuring, most of syntax quote, the Clojure functions for manipulating seqs (filter, map, for, doseq, etc.) and all the other programmer-facing functions we know and love.

Today, Clojure is written mostly in Java. Of the groups above, only the core library is written almost entirely in Clojure. This is no small thing, mind you -- the language that is defined by just the first two is an awkward, tiny little language that is rather painful to use. (You can see this demonstrated near the top of clojure/core.clj.) However, there is still quite a bit of code in the compiler and data structures, and they're almost entirely implemented in .java files.

Clojure-in-Clojure is the effort to re-write the compiler and built-in data structures in Clojure. Note that the primary compiler implementation would still run on the JVM and still produce JVM bytecode. This is not about getting rid of the JVM or implementing any new runtime or virtual machine.

But first a better 'new'

There's a bit of work that has to be done first. Specifically, although you could use gen-class or proxy to implement the Clojure data structures today, their performance would not match the current Java implementations. Both gen-class and proxy use an extra dereference on each method call which give them dynamic redefinition features. For application-level code this is often desirable, but for these low level data structures it is not. Rich Hickey's plan for solving this is a more featureful 'new' operator cleverly nicknamed "new-new" [update: this has since been renamed reify and has been augmented with the related constructs defprotocol and deftype].

Clojure is already sufficient for implementing the compiler (the reader being the low-hanging fruit there, I would think), and once new-new is in place the data structures can be ported to Clojure as well.

The benefits of Clojure-in-Clojure

So what's the point of all this? I'm not sure what Rich's primary motivation is here, but it will certainly be nice that writing bug fixes and features for Clojure itself will involve more time in Clojure and less in Java.

But a more fascinating benefit is that porting Clojure to non-JVM targets will be much easier. The majority of the effort so far put into ClojureCLR and ClojureScript has been rewriting the data structures for the target platform. This has required a lot of hand-written C# and JavaScript (respectively) all of which can become quickly obsolete as changes are made to the primary Java versions.

Once Clojure-in-Clojure is complete, there will be no need to hand-port the data structures. In fact, most of the compiler won't have to be ported either. Imagine you want to be able to run Clojure code on the Parrot VM. All you'd need to do is describe how each of Clojure's dozen or so special forms (including new-new) are to be compiled to Parrot bytecode. This code would be written in Clojure, and would initially run on the JVM to AOT-compile all the rest of Clojure (the rest of the compiler, the data structures, clojure.core) to Parrot bytecode. And poof you'd have ParrotClojure -- start it up in the Parrot VM and you'd have a ParrotClojure REPL.

Clojure is already great at allowing you to use the JVM without dealing with Java. Clojure-in-Clojure would be another step in the same direction, with the added benefit of making it easier to target non-JVM platforms.

Experimental branches of Clojure

Saturday, February 14th, 2009

The author of Clojure, Rich Hickey, has been contemplating changing the behavior of one of Clojure's core features, the seq abstraction. He's been working on a few different alternatives for a couple months now.

First, a quick review of how things stand now. Currently a lazy seq can be produced by using lazy-cons:

(defn my-range [from to]
  (when (< from to)
    (lazy-cons from (my-range (inc from) to))))

Calling my-range returns a lazy seq of the integers between from and to. This seq can be used by any of the functions that consume seqs:

user=> (reduce + (my-range 5 10))
35

Reduce takes the first two numbers in the seq (5 and 6) and passes them as parameters to the addition function. This result and the next value from the seq (11 and 7) are passed to the addition function again. This continues until the end of the seq is reached, and the result (35) is returned.

The first alternative I heard about is the Left Fold Enumerator (LFE). LFEs are about inversion of control, where the function passed to reduce would be in charge of the looping and could choose to stop before the end of the seq and signal to the seq to free any resources it was holding.

This was was explored in much more detail by Christophe Grand.

However after some consideration, Rich appears to have abandoned the idea of replacing Clojure's current seq abstraction with LFEs.

The next alternative is something called "streams". These are stateful iterators, carefully protected to help prevent the problems of state that Clojure sets out to solve. Their main benefits over the existing seq model are better performance and more complete laziness. Clojure's existing seq abstraction could be built on top of streams such that both the extra laziness and most of the performance would be available to user-level code that remains almost entirely unchanged

One user-visible change would be the loss of a feature called nil punning. Currently, something that returns a seq actually promises to return either an instance of ISeq (which is guaranteed to have at least one item in it), or nil. Since nil acts like false, you can write code like:

(if (filter odd? coll)          ; nil pun
  (println "coll is a bit odd")
  (println "no odds in coll"))

If there are no odd numbers in coll, filter promises to return nil, otherwise it will return a lazy seq containing all the odd numbers.

With streams, though, the behavior changes: filter could return a non-nil object representing an empty stream. This provides the extra laziness that's desirable, but the example above would no longer be correct. If you want to find out if the stream is empty, you must force it to do some traversal, the idiom for which is to get a seq of the stream:

(if (seq (filter odd? coll))
  (println "coll is a bit odd")
  (println "no odds in coll"))

The only way to save nil punning would be to provide a whole separate set of seq-library-like functions for dealing with streams: filter-stream, map-stream, concat-stream, etc. These would not allow nil punning, but the regular seq functions would continue to behave as they always had.

Despite the potential loss of nil punning, Rich thought this approach had enough promise that he forked a streams branch of Clojure and wrote some docs:

(defn my-range [from to]
  (let [a (atom (dec from))]
    (stream
      (fn [eos]
        (swap! a inc)
        (if (< @a to)
          @a
          eos)))))

But the stateful and imperative nature of the streams code were apparently unpleasant enough that he began looking at another alternative. This new lazy branch does not have quite the performance benefits of streams, but does keep the immutable nature of Clojure and adds the extra laziness demonstrated by streams.

The lazy-cons macro is gone, and instead there is a lazy-seq macro:

(defn my-range [from to]
  (lazy-seq
    (when (< from to)
      (cons from (my-range (inc from) to)))))

This has the same loss of nil punning as streams, but perhaps the extra laziness is worth it.

But what happened to the resource management touted by LFEs? It appears likely that a new macro (currently named 'scope' in the streams branch) will be introduced to help with that problem, regardless of whether the lazy or streams branch is finally adopted.

So there you have it -- two forks representing possible futures for Clojure. Which will be the chosen path? Well, I can't say for sure, but the only drawback of the lazy branch compared to today's Clojure is the loss of nil punning, and Rich seemed pleased by how little code in clojure.core was actually making that assumption. There's even a compile-time flag that can be turned on to help users track down their nil puns, which I'll blog about later.

Besides, I think Rich pushed streams as far as he could toward simplicity and protection against mutation, and still didn't like the results. I doubt the streams branch has much future, but Rich has been asking for feedback so now's the time to explore the options and make your opinion know.

In my next blog post I'll show my setup for easily switching between Clojure branches.

Update: My next blog post will not be about switching Clojure branches. As of Tue Feb 17 2009, the lazy branch was merged into trunk.

My path to Clojure

Monday, January 19th, 2009

Someone asked me about the path that lead me to Clojure. On the off chance that he's not the only one curious, I thought I'd post the answer here.

Sometime after college I started hunting for the "best language", not with any formal definition but as a personal quest for what I liked the best. A primary criteria was something I called "expressiveness". I don't know if I can define that accurately, but it has something to do with being able to provide APIs that concisely model whatever specific domain you're dealing with. A common example is how much easier it is to work with text in perl (because of strongly integrated regex support) than it is in C.

My "best language" before Clojure was Ruby, because of its clean object system and iterators, but I hadn't experienced any functional or immutable languages yet.

I had heard a lot about how amazing LISP was, but I didn't get it. I couldn't figure out why anyone liked LISP's annoying syntax, or why anyone (who wasn't deeply into Artificial Intellegence) would want to write programs that write programs. So one big turning point was reading "On Lisp" by Paul Graham (free download). By the time I was done with that book, I was convinced that any "best language" had to support macros.

So I tried to use Common Lisp, but just couldn't make it "stick". I had learned plenty of other languages, and usually enjoyed the process, but Common Lisp's lack of libraries really drove me nuts. I could do network, file or text stuff in 4 or 5 lines of ruby that took a page of code in Common Lisp. So I shelved Common Lisp for a while.

I do a lot of C++ and JavaScript at work, and was learning to really like JavaScript -- clean syntax, closures, direct support for hash-maps and regex, a builtin "gui" system. So it was easy for Yegge to lure me into trying JavaScript on the JVM, especially with the promise of optional static typing, which I was pretty sure I wanted.

Around the same time I learned about Project Euler, and thought it would be a great way to get used to using JavaScript for general purpose programming rather than just browser stuff. I did the first five PE problems in JavaScript before giving up. It was terrible. Not only was there no static typing yet, but things that had been minor annoyances in the browser now seemed to be giant obstacles to solving the problem at hand. Re-reading some of Yegge's cheerleading brought me to the conclusion that he's actually endorsing some private language based roughly on JavaScript. None of the other languages that Google lets him use are as malleable, and Rhino lets him code up whatever new features he wants in Java and morph the language to his will. This is great for him, I'm sure, but was lousy for me.

I hadn't used Java in any capacity for years, but now thanks to Yegge I had it installed and ready to go. This made me more willing to try Scala. I don't remember how I heard about it, but I picked it up to work on the rest of the PE problems, and this is when I started to learn about immutability. Scala encourages the use of immutable locals, and has a few immutable collections. I expended a lot of energy bending my problem-solving thought processes to this new context. Scala also has lazy streams, which was another big hurdle. Both of these were challenging to learn, but once I got it, they were very rewarding tools to have at my fingertips. Project Euler was my primary teaching tool, forcing me to learn new concepts a little at a time. The folks on the IRC channel helped me when I got stuck, though in exchange, I frequently had to endure condescension from some of them.

So this was going along okay, until I thought I had enough Scala under my belt to take a crack at a tricky little API I'd had on my mind. I'd been wanting a way of picking data out of HTML documents, something with the expressive power of XPath, but that embedded cleanly in the host language. I wanted to be able to provide new functions, filters, etc. (have you tried to work with dates in XPath and XSLT??), and for the return values to be host language data structures, not more XML documents. I'd actually built a beastly little project using xlstproc and ruby, but that's another story altogether.

Anyway, I began sculpting this API in Scala. I spent a day or two experimenting with syntax, the right combinations of traits, implicits, etc. to create the kind of XPath-like expressions I wanted. I finally got the beginnings of something that I thought would work, and compiled a minimal test. Well, tried to compile it, but the types weren't quite right. I then spent the another day or two trying to get the type declarations correct enough to even compile, and finally gave up in frustration. I'm sure it would have been possible to make it work, but I wanted to spend my time solving actual problems, not solving problems the compiler was making up for me. Maybe I didn't want static typing after all.

This is when I started learning Clojure. Paul Graham had convinced me I wanted a LISP. Yegge's hype about JavaScript had prepared me for the JVM. Scala had introduced me to immutability and lazy streams. Clojure let me use the best properties of each of these, so it was a very natural fit right from the beginning. I've since learned more -- how to use more immutable collections, software transaction memory, etc., and the discussion group and IRC channel have always been friendly and encouraging.

That's how I got to where I am. I don't think I'd recommend the JavaScript or Scala tangents to anyone else, but "On Lisp" may still be worth reading, at least until Programming Clojure comes out. Since bailing on Scala, I've done another 25 or so Project Euler problems in Clojure, and had so much fun I even went back and did the first 14 all over again. I also took another stab at XPath-inspired document querying, resulting in a zip-filter lib that I'll write about some other time.

Clojure's definitely closer to the "best language" than any other I've tried. If you haven't tried it yet, you should!

Merry Christmas

Friday, December 26th, 2008

Fractal Tree

In the beginning was the Word, and the Word was with God, and the Word was fully God. The Word was with God in the beginning. All things were created by him, and apart from him not one thing was created that has been created. In him was life, and the life was the light of mankind. And the light shines on in the darkness, but the darkness has not mastered it.

A man came, sent from God, whose name was John. He came as a witness to testify about the light, so that everyone might believe through him. He himself was not the light, but he came to testify about the light. The true light, who gives light to everyone, was coming into the world. He was in the world, and the world was created by him, but the world did not recognize him. He came to what was his own, but his own people did not receive him. But to all who have received him – those who believe in his name – he has given the right to become God’s children

-- John 1:1-12a, NET

The program to draw the tree is 30 lines of Clojure

Agents and watchers in Clojure

Wednesday, December 17th, 2008

Below is a walk-though of solving a problem using side-effect free functions, then agents, and finally agents and watchers. It's a very slightly edited copy of a post I made to the Clojure Google Group a week or two ago. I'm including it here because the audience for this blog is a bit different from the Google group

Spoiler warning -- this is about a Project Euler problem. If you don't follow the links, I think you'll be able to understand what I'm talking about without learning anything specific enough to ruin any particular puzzle. If you want to know which specific puzzle I'm discussing (to see if you've already done it, for example) you can go to the problem description. The problem number isn't mentioned in the full solution code.

For this puzzle, I had a grid of cells, each of which had a value that depends on the values of its neighbors in a way that guaranteed a stable solution. The value of one cell was given.

My initial solution [single-threaded.clj] represented the grid as a vector of vectors of Integers, and maintained a PersistentQueue of cells that needed to be updated, with a single loop to work through the queue. For each iteration of the loop, a cell would be popped off the queue, and a new value for that cell computed. If the new value was different from the old value, the 'recur' then updated the cell in the vector and pushed the neighboring cells onto the work queue. When the queue was empty, a stable state had been reached and the answer value could be read.

This ran fairly fast, used no mutable state, and the implementation seemed relatively clean to me. I was quite pleased with myself.

But my friend Aaron Brooks who had already solved the problem was encouraging me to created a multi-threaded solution. Note the solution I had was already thread-safe, but only used one of my two processor cores.

My second solution [using-agents.clj] represented the grid as a vector of vectors of agents. The action function computed a new value for a given agent, and then used 'send' to queue up the same action for neighboring agents. It also maintained other shared state to keep track of how many cell agents were running so that it could detect when a stable state for the whole grid had been reached.

Despite the complexity one might expect from all that, the agent solution was only 3 lines longer than the single-threaded solution. It also ran about 30% faster.

But it had a bug -- often returning the right answer, but sometimes returning an incorrect number. It also seemed more imperative than the first solution, because of the 'send' calls, updating shared counters, etc. When I mentioned this on IRC, it was recommended I try watchers.

So I added a watcher to every agent before kicking off the computation process. The watcher did no computation as such, but it was the perfect place to 'send' to neighboring agents, update the running count, etc. Moving this code to the watcher also meant the action function was now pure, with no side-effects.

It was during the process of separating the stateless and state-management code that I discovered my bug -- the code to manage global state had obscured an error in the computation logic. With them completely separate, it was easier to think about the specific responsibilities of each.

It was also now easy to see that the pure computation function was almost exactly the same whether I was using agents or just a simple loop. I finished factoring out this duplication and ended up with code that could use either mechanism to solve the same problem. [with-watchers.clj]

Writing a macro: for vs. doseq

Monday, November 10th, 2008

Looping: for vs. doseq

Newcomers to Clojure are often surprised to find that for is not a flow-control mechanism, but instead a lazy-generator. This may trip you up if you want side-effects:

(defn count-and-print-words [text]
  (let [words (re-seq #"\S+" text)]
    (for [word words]               ; not what I want
      (println word))
    (count words)))

user=> (count-and-print-words "What's that blue thing?")
4

No words are printed, because nothing uses the return value of for and so, being lazy, it does nothing at all. What you want is doseq:

(defn count-and-print-words [text]
  (let [words (re-seq #"\S+" text)]
    (doseq [word words]
      (println word))
    (count words)))

user=> (count-and-print-words "What's that blue thing?")
What's
that
blue
thing?
4

But for has a lot of power that's been missing from doseq, specifically nesting loops, :while, and :when. In case you're not familiar with these, here's a quick example that just (inefficiently) lists the factors of the Fibonacci numbers under 100:

(def fibs (lazy-cat [0 1] (map + fibs (rest fibs))))

(def some-factors
  (for [fib fibs :while (< fib 100)
        fac (range 2 fib) :when (zero? (rem fib fac))]
    [fib fac]))

user=> some-factors
([8 2] [8 4] [21 3] [21 7] [34 2] [34 17] [55 5] [55 11])

Isn't that the most beautiful definition of fib? Alas, it's not my invention.

Anyway, if I want to be able to have nesting, :while, and :when in an imperative, non-lazy loop, I'll have to write a macro. A good way to start is with an example of calling your macro. Start with the simplest useful example:

(my-doseq [i (range 10)]
  (prn i))

The first rule of writing macros

I heard somewhere that the first rule of writing macros is: don't write a macro. This is a cute way of reminding you to only write a macro when it's necessary. In this case it's necessary because I want my-doseq to be able to define new locals (i in the above example). If my-doseq were a function and i wasn't defined yet, the example wouldn't even compile.

So here are some ideas that I've found useful when writing macros:

  • Don't write a macro (perhaps a function would do or the macro already exists).
  • Write an example usage (start with a simple one).
  • Expand your example usage by hand.
  • Use macroexpand when your macro doesn't work.
  • Experiment at the REPL
  • Break up complicated macros into smaller functions.

Our own looping macro

I'll demonstrate all these as we go along. Write out by hand the code the macro should produce for this example:

(loop [i-seq (seq (range 10))]
  (when i-seq
    (let [i (first i-seq)]
      (prn i)
      (recur (rest i-seq)))))

Try that out to see that it prints 0 through 9, just like it should. Note that without the call to seq, passing in an empty collection (instead of range) would cause (when i-seq ...) to succeed when it shouldn't. It's an important detail to remember -- use seq to get a false value for an empty collection.

If you don't understand the above loop, take a moment to walk through it -- it's the foundation all the following examples are built on.

A macro for this simple case is pretty straightforward:

(defmacro my-doseq [[sym init] & body]
  `(loop [sq# (seq ~init)]
     (when sq#
       (let [~sym (first sq#)]
         ~@body
         (recur (rest sq#))))))

The hand-coded loop is in this macro almost verbatim. The backquote is what makes this work, and enables the placeholder magic: sq# for a temporary local name, ~ to insert a parameter, ~@ to splice in a parameter that's a seq.

This isn't terribly impressive, though -- it's almost a straight copy of how doseq is already defined. Next up: nested loops.

Nested loops

First, the example usage:

(my-doseq [i (range 3) j (range 3)] (prn i j))

Expanded by hand:

(loop [i-seq (seq (range 3))]
  (when i-seq
    (let [i (first i-seq)]
      (loop [j-seq (seq (range 3))]
        (when j-seq
          (let [j (first j-seq)]
            (prn i j)
            (recur (rest j-seq)))))
      (recur (rest i-seq)))))

I'm actually not sure if this one easier to expand by hand or to just write the macro to do it for you. The pattern to notice, though, is that where our earlier expansion had the body, this one has another loop. What we want is a function that produces the loop part, but which can be called again if there's a nested loop:

(defmacro my-doseq [seq-exprs & body]
  (let [emit (fn emit [[sym init & rest-exprs]]
               `(loop [sq# (seq ~init)] ; same loop as before
                  (when sq#
                    (let [~sym (first sq#)]
                      ~(if rest-exprs   ; more rest-exprs?
                         (emit rest-exprs)
                         `(do ~@body))
                      (recur (rest sq#))))))]
    (emit seq-exprs)))

The main loop is recognizable from the previous version, but now it's in its own local function so it can be called again. Where it used to simply splice in ~@body, it now has an if to either recurse (if there's another nested loop) or in the terminal case splices in ~@body like it did before.

The use of `(do ~@body) instead of just body may be a bit surprising, so let's take a moment to look at why. It shows up in a ~ expression inside a let body, so let's strip it down to just that and see what goes wrong with a plain body:

user=> (defmacro tmp [& body] `(let [] ~(when true body)))
nil
user=> (macroexpand '(tmp (prn 1) (prn 2)))
(let* [] ((prn 1) (prn 2)))

Can you spot what's wrong with that expansion? What if you run it:

user=> (tmp (prn 1) (prn 2))
1
2
java.lang.NullPointerException (NO_SOURCE_FILE:0)

Both the prn's get run, but then the first one's return value is at the head of a list, so Clojure tries to invoke that return value. The prn function always returns nil, and the attempt to invoke nil causes the NPE.

There a couple ways to solve this, one of which is to splice the whole body into a do:

user=> (defmacro tmp [& body] `(let [] ~(when true `(do ~@body))))
nil
user=> (macroexpand '(tmp (prn 1) (prn 2)))
(let* [] (do (prn 1) (prn 2)))

By putting the word do at the head of the list, we get the behavior we want.

Implementing the :when filter

So now that nested loops are working, next up is support for :when and :while. The main difficulty here is that the number of items our emit function has to consume each iteration depends on how many of these options follow the init value:

(my-doseq [prod (range 13)
           fac (range 2 prod) :when (zero? (rem prod fac))]
  (prn prod fac))

One approach to this kind of problem is to assume, for the moment, that the input to my-doseq was open to change. What format would make it easiest to write the emit function? Perhaps:

[{:name prod
  :init (range 13)}
 {:name fac
  :init (range 2 prod)
  :when (zero? (rem prod fac))}]

By grouping the name, initial expression, and all options together for each nesting level of the loop, the macro could be written:

(defmacro my-doseq [seq-exprs & body]
  (let [emit (fn emit [[bind-map & rest-binds]]
               `(loop [sq# (seq ~(:init bind-map))] ; changed to use :init
                  (when sq#
                    (let [~(:name bind-map) (first sq#)] ; changed to use :name
                      ~(if rest-binds
                         (emit rest-binds)
                         `(do ~@body))
                      (recur (rest sq#))))))]
    (emit seq-exprs)))

This works, but is very awkward to call... but that's a problem we can solve later. For now let's make :when work, by inserting an if clause that defaults to true if no :when is given:

(defmacro my-doseq [seq-exprs & body]
  (let [emit (fn emit [[bind-map & rest-binds]]
               `(loop [sq# (seq ~(:init bind-map))]
                  (when sq#
                    (let [~(:name bind-map) (first sq#)]
                      (if ~(or (:when bind-map) true) ; new "if" clause
                        ~(if rest-binds
                           (emit rest-binds)
                           `(do ~@body)))
                      (recur (rest sq#))))))]
    (emit seq-exprs)))

If no :when clause is given, such as with the prod loop shown earlier, that part of the macro will expand to:

(if true
  (loop...))

This is slightly inefficient, but completely removing the if would make the macro significantly more complex. Instead we'll leave it in and rely on the JVM to optimize it away at runtime.

Implementing the :while limiter

A minimal example using :while:

(my-doseq [{:name fib :init fibs :while (< fib 40)}] (prn fib))

That's really a terribly verbose syntax. I promise we'll fix it soon.

To get the :while part working, a when clause is inserted to skip both the body and the recur when the test expression is false:

(defmacro my-doseq [seq-exprs & body]
  (let [emit (fn emit [[bind-map & rest-binds]]
               `(loop [sq# (seq ~(:init bind-map))]
                  (when sq#
                    (let [~(:name bind-map) (first sq#)]
                      (when ~(or (:while bind-map) true) ; new "when" clause
                        (if ~(or (:when bind-map) true)
                          ~(if rest-binds
                             (emit rest-binds)
                             `(do ~@body)))
                        (recur (rest sq#)))))))]
    (emit seq-exprs)))

Try that out to see that it works.

Fixing the input syntax

Now that the emit function has all the required features, we can fix up the input syntax our macro expects. The task is to produce the following kind of format:

[{:name prod
  :init (range 13)}
 {:name fac
  :init (range 2 prod)
  :when (zero? (rem prod fac))}]

...from input that looks like:

[prod (range 13) fac (range 2 prod) :when (zero? (rem prod fac))]

So we need to write a function -- let's call it fix-vector:

(defn fix-vector [seq-exprs] ...)

At the highest level, we'll be consuming a seq of expressions and producing a new vector (of hash-maps). This probably means reduce will be a good fit. Here the new vector is called out-vec:

(defn fix-vector [seq-exprs]
  (reduce (fn [out-vec sym]
            ...)
          [] seq-exprs))

The type of expression at the front of each pair is what determines if that pair is a new name/init pair or a :when or :while that modifies an existing name/init pair. Talking about pairs from a seq suggests partition:

user=> (def test-vector
       '[prod (range 13) fac (range 2 prod) :when (zero? (rem prod fac))])
#=(var user/test-vector)

user=> (partition 2 test-vector)
((prod (range 13)) (fac (range 2 prod)) (:when (zero? (rem prod fac))))

...and since each of those will be passed as the second argument to our reduce function, we can pull the pair apart using destructuring:

(defn fix-vector [seq-exprs]
  (reduce (fn [out-vec [sym value]]
            ...)
          [] (partition 2 seq-exprs)))

The sym argument needs to be examined so this pair can be treated differently when its a keyword vs. when it's just a symbol. This function's getting complicated enough that I'd like to test it to make sure we're on the right track, so instead of ... I'm going to put in some stub code:

(defn fix-vector [seq-exprs]
  (reduce (fn [out-vec [sym value]]
            (if (keyword? sym)
              (conj out-vec [:found-keyword sym value]) ; stub for testing
              (conj out-vec {:name sym :init value})))
          [] (partition 2 seq-exprs)))

user=> (fix-vector test-vector)
[{:name prod,
  :init (range 13)}
 {:name fac,
  :init (range 2 prod)}
 [:found-keyword :when (zero? (rem prod fac))]]

Clojure doesn't have pretty-printing yet (someone want to get on that?) so I formatted the above output by hand for clarity.

Now for that stub -- we need to replace it with something that updates the final hash of out-vec. I thought by now I was pretty comfortable with Clojure's immutable collections, but I still found this a bit tricky to think through.

Here's what I ended up with: we're going to have a modified hash, so we'll want to drop the final hash of the vector to allow the modified one to be appended in its place. That suggests:

(conj (pop out-vec) ...)

To "modify" that hash we have to get it out of the vector and assoc the new key/value into it:

(assoc (peek out-vec) sym value)

Put them together in place of our earlier stub, and we get:

(defn fix-vector [seq-exprs]
  (reduce (fn [out-vec [sym value]]
            (if (keyword? sym)
              (conj (pop out-vec) (assoc (peek out-vec) sym value))
              (conj out-vec {:name sym :init value})))
          [] (partition 2 seq-exprs)))

Are we even close?

System Message: ERROR/3 (<stdin>, line 390)

Unexpected indentation.
user=> (fix-vector test-vector)
[{:name prod,
  :init (range 13)}
 {:when (zero? (rem prod fac)),
  :name fac,
  :init (range 2 prod)}]

Perfect! Now fix-vector just needs to be put into the macro:

(defmacro my-doseq [seq-exprs & body]
  (let [maps (reduce (fn [out-vec [sym value]]
                       (if (keyword? sym)
                         (conj (pop out-vec) (assoc (peek out-vec) sym value))
                         (conj out-vec {:name sym :init value})))
                     [] (partition 2 seq-exprs))
        emit (fn emit [[bind-map & rest-binds]]
               `(loop [sq# (seq ~(:init bind-map))]
                  (when sq#
                    (let [~(:name bind-map) (first sq#)]
                      (when ~(or (:while bind-map) true) ; new "when" clause
                        (if ~(or (:when bind-map) true)
                          ~(if rest-binds
                             (emit rest-binds)
                             `(do ~@body)))
                        (recur (rest sq#)))))))]
    (emit maps)))

And that's it. Now the for example at top of this article can be re-worked to use our new macro:

(my-doseq [fib fibs :while (< fib 100)
           fac (range 2 fib) :when (zero? (rem fib fac))]
  (prn fib fac))

This macro is actually available as doseq in Clojure as of revision 1090. Please note: that revision has several features (including the new doseq syntax) that may require you change your Clojure code and how you launch Clojure. These changes are not all well-document or fully tested yet, so until they are, you may choose to use my-doseq as shown above.

One final note: if you actually open up core.clj (the file formerly known as boot.clj) and look at the definition of doseq you will see a couple minor differences from my-doseq. This is because destructuring is not yet available at the point in core.clj where doseq is required, so I had to do a bit more first-ing, rest-ing, and apply-ing.

Is it hard to write a GUI in Clojure?

Wednesday, October 29th, 2008

Nope.

Exploring a parsed XML file using the gview function

Of course a screenshot doesn't prove something's easy. How about the complete codebase needed to make the screenshot?

(ns gview
  (:import (javax.swing JFrame JScrollPane JTree)))

(defn- make-node [parent obj]
  (proxy [javax.swing.tree.TreeNode] []
    (toString []          (pr-str obj))
    (getAllowsChildren [] (coll? obj))
    (getChildAt [i]       (make-node this (nth (seq obj) i)))
    (getChildCount []     (count obj))
    (getIndex [n]         -1)
    (getParent []         parent)
    (isLeaf []            (not (coll? obj)))))

(defn gview [obj]
  (doto (JFrame.)
    (add (JScrollPane. (JTree. (make-node nil obj))))
    (setTitle (str "gview: " (.getName (class obj))))
    (setDefaultCloseOperation JFrame/DISPOSE_ON_CLOSE)
    (pack)
    (setVisible true)))

With that, you've got a handy gview function you can call from the REPL whenever you want. Trying to make sense of the in-memory representation of an messy XML file? Or maybe just trying to reproduce my screenshot? Just type this at the REPL:

(gview (clojure.xml/parse "/etc/tomcat5.5/web.xml"))

Update 23 Jan 2009: Clojure has for quite a while come with a strikingly similar tool built right in:

(use 'clojure.inspector)
(inspect-tree (clojure.xml/parse "/etc/tomcat5.5/web.xml"))