Using StringReader and transient sequence in Libil Clojure
September 9, 2014
Clojure never stops to amaze me. I’ve implemented
libil in Clojure too. The
0.1.0 version of clojure
implementation makes use of persistent data and head-tail idiom, as you’ve usually see in LISP. I think it’s
already good until I read the implementation of clojure-csv. I
discovered new and faster way to tokenise string.
The Naive Implementation
The naive implementation is very straightforward. It’s implemented using head-tail idiom and treat the string like a list. I haven’t use it in the production, and haven’t tried to benchmark it, so I don’t know the result. This is the complete tokenisation code.
(defn tokenize-word "Tokenizing the word, to be able to be mapped" [^String w] (loop [l  rstr w] (let [pair (apply str (take 2 rstr)) fstr (str (first rstr))] (cond (empty? rstr) l (within? all-con (lower-case pair)) (recur (conj l pair) (apply str (-> rstr rest rest))) :else (recur (conj l fstr) (apply str (rest rstr)))))))
rest twice to skip the list if we meet digraphs like
th. There’s a subtle bug in
the implementation. Upon reading the code, I feel something not right. I don’t know what it is, my gut
just says so.
String Reader Discovery
I then remember the
Reader class from
java.io namespace. Alright, I think
perfect candidate for this as it will scan the String forward, char by char. In Clojure this can be implemented
(def s (StringReader. "Sangi")) (.read s) ;; returns ASCII code of character read (.skip s 1) ;; skip one char
Due to digraphs, I think this is better solution. I can skip one char or read two chars at once. Next question,
will be: how do I somehow know the next character without advancing the cursor? Well, upon peeking the clojure-csv
code, I found out a way to mark before advancing, from
(defn- rdr-peek [^Reader rdr] (.mark rdr 1) ;; set a mark (let [c (.read rdr)] ;; read and advance (.reset rdr) ;; back c))
This will return the next char without advancing. This is very good for my use case. So the replacement implementation becomes.
(defn tokenize-rdr "Tokenize using a Reader" [^Reader rdr] (loop [tokens  current (.read rdr) ahead (rdr-peek rdr)] (let [cc (str (char current))] (cond (== -1 ahead) (conj tokens cc) ;; Check if it's end of string (within? all-con (lower-case (str cc (char ahead)))) ;; Check digraphs (let [digraph (str cc (char ahead))] (.skip rdr 1) ;; if digraph advance two chars (if (== -1 (rdr-peek rdr)) (conj tokens digraph) (recur (conj tokens digraph) (.read rdr) (rdr-peek rdr)))) :else (recur (conj tokens cc) (.read rdr) (rdr-peek rdr)))))) ;; if not digraphs advance 1 char ;; tokenize-word replacement (defn tokenize-word "Tokenize a Word" [^String s] (tokenize-rdr (StringReader. s)))
It runs okay, until I discover another thing, transient collections.
Persistent data structure is encouraged when using Clojure. Clojure also provides a mutating data structure called transients. Wait What? MUTATION, THAT’S HARAAAM!.
Relax, Clojure is not pure functional language. As you can read from the docs:
If a tree falls in the woods, does it make a sound?
If a pure function mutates some local data in order to produce an immutable return value, is that ok?
In short, Clojure is pragmatist. Sometimes you can’t rely on purely immutable data structure if you need
performance. From the implementation above, the new vector of tokens is created by taking old vector and appending
the last token with
conj. In transients, we mutate the vector. The difference when using transient is rather than
conj, we use
conj!. In Clojure, function with exclamation mark signifies an exciting part code. And we
still need to return a persistent data structure by calling
persistent!. The implementation of
(defn tokenize-rdr "Tokenize a reader" [^Reader rdr] (loop [tokens (transient ) current (.read rdr) ahead (rdr-peek rdr)] (let [cc (-> current char str)] (cond (== -1 ahead) (persistent! (conj! tokens cc)) (within? all-con (lower-case (str cc (char ahead)))) (let [digraph (str (char current) (char ahead))] (.skip rdr 1) (if (== -1 (rdr-peek rdr)) (persistent! (conj! tokens digraph)) (recur (conj! tokens digraph) (.read rdr) (rdr-peek rdr)))) :else (recur (conj! tokens cc) (.read rdr) (rdr-peek rdr))))))
The changes follows this pattern:
- Calling transient on the initial vector
persistent!on function return