Smaller than Baby Steps with Julia

Julia has some impressive performance stats, so I gave it a whirl, or half a whirl.

Since I now have a lot of text to mine, I need to at least be able to segment strings into tokens and count occurrences of those words in sentences, phrases, paragraphs, chapters, and books. In Python this is simply a matter of split, Counter, and a generator or two, but unfortunately I couldn't match that productivity in Julia. I couldn't even get Julia to concatenate an array of strings together, or split or join a string using the weird Perl-ish syntax:

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.11
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-linux-gnu

julia> s = ["this", "is", "an", "array", "of", "strings"]
6-element Array{ASCIIString,1}:
 "this"   
 "is"     
 "an"     
 "array"  
 "of"     
 "strings"

julia> " ".join(s)
ERROR: type ASCIIString has no field join

julia> " $s"
" ASCIIString[\"this\",\"is\",\"an\",\"array\",\"of\",\"strings\"]"

julia> 
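As it turns out, the error above is a syntax mismatch rather than a missing feature: Julia exposes string operations as plain functions instead of methods on the string object, so the Python-style `" ".join(s)` fails while the function form works (at least in the 0.3 series shown here):

```julia
s = ["this", "is", "an", "array", "of", "strings"]

# join is a plain function taking (collection, delimiter), not a string method
join(s, " ")        # "this is an array of strings"
```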

Or split a sentence into tokens using the cool built-in Perl-style regex support:

julia> ismatch(r"^\s*(?:#|$)", "# a comment")
true

julia> match(r"^\s*(?:#|$)", "# a comment")
RegexMatch("#")

julia> regex = r"[^ ]+"
r"[^ ]+"

julia> captures(regex, "This is an sentence of workds (tokens).")
ERROR: captures not defined

julia> m = match(regex, "This is an sentence of workds (tokens).")
RegexMatch("This")

julia> m.captures
0-element Array{Union(SubString{UTF8String},Nothing),1}
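Again, this turns out to be an API mismatch rather than a missing feature: `m.captures` is empty simply because `r"[^ ]+"` contains no capture groups. The matched text itself lives in `m.match`, `eachmatch` iterates every match in the string, and plain `split` tokenizes on whitespace by default. A minimal sketch against the same 0.3-era API:

```julia
s = "This is a sentence of words (tokens)."
regex = r"[^ ]+"

m = match(regex, s)
m.match                                   # "This" -- the matched text itself

# Every match in the string, i.e. the tokens:
tokens = [m.match for m in eachmatch(regex, s)]

# Or skip the regex entirely: split defaults to whitespace
split(s)
```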

I’m afraid Python is too deeply ingrained in my thinking. So it’s back to Python and gensim for the total guten literature mining work.
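For the record, the split-and-count core of the task is only a few lines of Julia as well; a minimal sketch against the 0.3-era API shown above, with a plain Dict standing in for Python's Counter:

```julia
# Count word occurrences in a chunk of text, Counter-style.
text = "the quick brown fox jumps over the lazy dog the end"

counts = Dict()                           # untyped Dict to dodge version-specific string types
for token in split(lowercase(text))       # split() defaults to whitespace
    counts[token] = get(counts, token, 0) + 1
end

counts["the"]                             # 3
```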

Written on February 22, 2016