matt.freels.name

Little Ruby Tidbits

Sunday, 10 October 2010

Earlier today I stumbled upon an interesting and surprising behavior of String#split: split takes either a string or a regexp as its first argument. If none is provided, it splits on whitespace, much like /\s+/. (Technically it defaults to the value of $;, but you should immediately forget that fact.)
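A quick illustration. Note that the no-argument form isn't quite the same as an explicit /\s+/: it also swallows leading whitespace, where the regexp version produces a leading empty string:

```ruby
# No argument: split on runs of whitespace, leading whitespace ignored.
"  foo bar\tbaz ".split        #=> ["foo", "bar", "baz"]

# Explicit /\s+/: the leading whitespace match yields an empty first field.
"  foo bar\tbaz ".split(/\s+/) #=> ["", "foo", "bar", "baz"]
```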

What was new to me, however, was that if you split on a regexp that includes matching groups, those groups are included in the resulting array:

"foobarbaz".split(/bar/)   #=> ["foo", "baz"]
"foobarbaz".split(/(bar)/) #=> ["foo", "bar", "baz"]

Handy!
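For instance, keeping the delimiters around makes a quick-and-dirty tokenizer a one-liner:

```ruby
# Splitting on a captured group keeps the operators in the output.
"1+2*3".split(/([+*])/) #=> ["1", "+", "2", "*", "3"]
```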

to_proc

Another quick note: here's an implementation of Symbol#to_proc that overcomes both the slow creation and the slow application that to_proc usually entails:

class Symbol
  @@memoized_procs = {}
  def to_proc
    @@memoized_procs[self] ||= eval("lambda {|x| x.#{self} }")
  end
end
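Usage is unchanged from the built-in version; the only visible difference is that repeated calls hand back the very same proc object (the patch is repeated here so the snippet runs standalone):

```ruby
class Symbol
  @@memoized_procs = {}
  def to_proc
    @@memoized_procs[self] ||= eval("lambda {|x| x.#{self} }")
  end
end

# Normal usage works as before:
[1, 2, 3].map(&:to_s) #=> ["1", "2", "3"]

# ...but the memoized procs are effectively singletons:
:upcase.to_proc.equal?(:upcase.to_proc) #=> true
```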

The one minor caveat of this approach is that the procs become effectively singletons. I don't see how it would ever be an issue, but it is different. For comparison, here are some quick benchmarks demonstrating the performance characteristics of the different to_proc implementations, along with using a block literal:

loop 10,000,000 times:      user     system      total        real
block:                  1.460000   0.000000   1.460000 (  1.460893)
to_proc:                3.440000   0.060000   3.500000 (  3.505272)
to_proc (send):         2.340000   0.000000   2.340000 (  2.334712)
to_proc (eval):         1.430000   0.000000   1.430000 (  1.432513)
to_proc (memo):         1.440000   0.000000   1.440000 (  1.438504)

gen 10,000,000 times:
block:                 21.720000   5.290000  27.010000 ( 27.023781)
to_proc:               23.860000   5.530000  29.390000 ( 29.382054)
to_proc (send):        25.070000   5.540000  30.610000 ( 30.614251)
to_proc (eval):        79.030000   7.600000  86.630000 ( 86.648413)
to_proc (memo):         3.250000   0.010000   3.260000 (  3.253314)

One thing to note is that using eval to generate the proc is just as fast as using a block literal on 1.8.7. For the most part, using eval for metaprogramming on 1.8.x wherever possible will lead to faster code. Though as shown in the generation benchmark, it's pretty slow to run eval over and over again. Don't do that.
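The classic example of this trade-off is defining methods: a string handed to class_eval compiles into an ordinary method, whereas define_method wraps a closure that (on 1.8 especially) is slower to call. A sketch, with the class and attribute names made up for illustration:

```ruby
class Record
  # String-eval version: each generated def is a plain method.
  %w[name email].each do |attr|
    class_eval <<-RUBY
      def #{attr}
        @attributes[:#{attr}]
      end
    RUBY
  end

  # Block version: same behavior, but every call goes through a closure.
  #   define_method(attr) { @attributes[attr.to_sym] }

  def initialize(attributes)
    @attributes = attributes
  end
end

Record.new(:name => "Matt").name #=> "Matt"
```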

Also, rather embarrassingly, 1.8.7's built-in to_proc is slower than the implementation using send, which is surprising as it's effectively the same thing but in C.
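For reference, the non-memoized variants in the tables presumably look something like this (a sketch with made-up method names; the actual benchmark code may differ):

```ruby
class Symbol
  # "to_proc (send)": an ordinary closure that dispatches via send.
  def to_proc_send
    sym = self
    lambda {|x| x.send(sym) }
  end

  # "to_proc (eval)": build the proc from a string on every call --
  # the result is fast to invoke, but generation is very slow.
  def to_proc_eval
    eval("lambda {|x| x.#{self} }")
  end
end

:upcase.to_proc_send.call("foo") #=> "FOO"
:upcase.to_proc_eval.call("foo") #=> "FOO"
```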

For comparison, here are the same benchmarks using Rubinius 1.1:

loop 10,000,000 times:      user     system      total        real
block:                  0.404524   0.000161   0.404685 (  0.400118)
to_proc:                2.789626   0.002370   2.791996 (  2.780405)
to_proc (send):         2.737143   0.002231   2.739374 (  2.739562)
to_proc (eval):         2.740973   0.002626   2.743599 (  2.743766)
to_proc (memo):         2.736583   0.001882   2.738465 (  2.738590)

gen 10,000,000 times:
block:                  2.608039   0.006805   2.614844 (  2.589822)
to_proc:                1.929814   0.001922   1.931736 (  1.925800)
to_proc (send):         1.976845   0.001521   1.978366 (  1.975148)
to_proc (eval):         SLOW!
to_proc (memo):         2.255655   0.001574   2.257229 (  2.067706)

gen 100,000 times:
to_proc (eval):        18.293764   0.248443  18.542207 ( 15.189934)

Rubinius is pretty much faster all around, except that eval is way slower in Rubinius than in MRI 1.8.7 (which was already fairly slow). The first time I tried, I let that benchmark go for 5 minutes before killing it. Turning down the number of iterations reveals that eval in this simple case is about 2 orders of magnitude slower. I'm not particularly surprised by this, as Rubinius does more work, compiling everything to byte code. MRI just parses the string into an AST.

Interesting, to say the least. If you want to play along at home, here's the benchmark.
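A minimal reconstruction of the loop/gen split, in case the link rots, might look like this (iteration counts turned way down; the exact harness is assumed):

```ruby
require 'benchmark'

N = 100_000
memo = {}

Benchmark.bm(16) do |b|
  # "loop": build each proc once, then time N applications of it.
  block_literal = lambda {|x| x.to_s }
  b.report("loop block:")   { N.times {|i| block_literal.call(i) } }
  b.report("loop to_proc:") { p = :to_s.to_proc; N.times {|i| p.call(i) } }

  # "gen": time building the proc N times.
  b.report("gen to_proc:")  { N.times { :to_s.to_proc } }
  b.report("gen memoized:") { N.times { memo[:to_s] ||= eval("lambda {|x| x.to_s }") } }
end
```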