Fun with Clean Blocks: Speeding up Dictionary>>#at:

Dictionary lookup is used a lot. Let’s look at Dictionary>>#at:

at: key
    "Answer the value associated with the key."
    ^ self at: key ifAbsent: [self errorKeyNotFound: key]

The method looks simple and at a first glance, there are no obvious problems.

But if we look at it again, we see that for every execution of #at:, the block [self errorKeyNotFound: key] has to be created. The method stores a CompiledBlock in the literals, and a bytecode “create block” creates the block:

(Dictionary>>#at:) symbolic 

"'41 <4C> self
42 <40> pushTemp: 0
43 <40> pushTemp: 0
44 <F9 00 01> fullClosure:a CompiledBlock: [self errorKeyNotFound: key] NumCopied: 1
47 <A1> send: at:ifAbsent:
48 <5C> returnTop'"

This happens at every execution of #at:, even though the block that we spend time to create will never be executed at runtime (outside of real errors).

If you look at the rest of the code path of #at: in a Dictionary, it is carefully written to avoid block creation by using optimized contructs (ifTrue: and friends).

Could clean blocks help? Clean blocks are blocks that only use data that the compiler knows at compile time, thus we can create a CleanBlockClosure (which has the CompiledBlock) and store that as a Literal.

This then would mean that we could just use as fast pushLiteral: to push the block, no creation at runtime needed.

Clean blocks are not yet enabled by default, but we can use a compiler option pragma to enable them just for this method:

    <compilerOptions: #(+ optionCleanBlockClosure)>

The problem is that the block references both “self” and the key to be able to show a nice error message. Referencing either one make the block not clean.

Could we change the block to be clean? I guess we could not refer self and inline the method #errorKeyNotFound: (knowing the instance is not that important for logging the error in production, for example). But we do want to know the key that is not found, it really simplifies debugging especially if you have to look at log files.

If we closely look at the block: we know a bit more about it and how it is used. We know that if it gets evaluated, that evaluation will always happen with the homeContext of the block on the stack, as it is will always be #at:ifAbsent: that evaluates that block.

And in that case, there is a trick that we can do: we can use reflection to read the needed values via the stack.

You might not be aware, but the debugger needs some quite interesting features from the reflective layer of the system to provide the user experience that you take for granted (and that feels trivial). Imagine a block like that:

tt

    | temp1 temp2 |
    temp1 := 1.
    temp2 := 2.

    self class methods do: [ :each | self halt. each with: temp1 ].
    ^ temp2

The block does not reference temp2, so it actually is created as a block that does not know temp2 at all. temp2 is not accessibe from the block or it’s context. Yet, when debugging, you want to just be able to write temp2 in the block and eval it (without saving the method), as if you would access the temp in this block and recompile, the compiler would compile a different block that would know temp2.

So the reflective API of Context (and the infrastructure of reading Variables reflectively), is build in a way that #readVariableNamed: will use the stack to find the value of temps that are not available in the block context, but could be available if they would be referenced statically.

And if you think about it, we have just a case like that here, in reverse: if we would use #readVariableNamed: on the context to read the argument, the compiler would not see a read of “key”, thus compile a block that does not have that temp available. And if we then read “self” via thisContext, too, the compiler sees a block that it can compile as a clean block:

at: key
    "Answer the value associated with the key."
    
    <compilerOptions: #(+ optionCleanBlockClosure)>
    
    ^ self at: key ifAbsent: [
    
        "this block is never executed, yet we pay runtime cost to create it if it is a full block.
        We use instead the reflective API to read the argument and receiver via the stack, making
        the block clean.
    
        We enable cleanBlocks for this method by setting optionCleanBlockClosure as it is not yet
        enabled by default"
        
        KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: thisContext sender receiver
        "This is equivalent to:
            self errorKeyNotFound: key"
        ]

Let’s look at the printed symbolic bytecode:

Dictionary>>#at:) symbolic  

"'41 <4C> self
42 <40> pushTemp: 0
43 <20> pushConstant: [
    
        "this block is never executed, yet we pay runtime cost to create it if it is a full block.
        We use instead the reflective API to read the argument and receiver via the stack, making
        the block clean.
    
        We enable cleanBlocks for this method by setting optionCleanBlockClosure as it is not yet
        enabled by default"
        
        KeyNotFound signalFor: (thisContext readVariableNamed: #key) in: thisContext sender receiver
        "This is equivalent to:
            self errorKeyNotFound: key"
        ]
44 <A1> send: at:ifAbsent:
45 <5C> returnTop'"

you see that it now uses pushConstant: to push the pre-compiled CleanBlockClosure, which is much faster.

A very naive benchmark using a fairly large Dictionary (Smalltalk globals):

[Smalltalk globals at: #Object] bench 

8591996.801/7040335.266 "1.220395972119888"

Shows a speedup of ~20%, which is qute a lot!

Of course, this is a hack. For one, it just solves this for this one case. And we do not want to add code like that everywhere. And real world impact of speed will be quite limited due to that (it just speeds up Dictionary>#at:)

The alternative would be to inline #at:ifAbsent: This inlining, due to the existing subclasses, leads to quite some needed refactorings, and has the same problem of being a solution for this problem in this one place.

The real solution in the long term is a JIT that removes blocks like that by inlining automatically for the code it executes, while not changing the image level code at all. This would then solve this for all similar cases everywhere all at once.

But the nice aspect of the hack is: it is local (touches just this one method), gives us some speedup now and will be trivially removable when we deploy a better solution in the future.

But the code is quite ugly and indeed relies on reflection… which maybe for Dictionary>>#at: is not that nice (e.g. if you want to create a minimal image without reflection)

But it is save to add this to your own image, it works in Pharo11 and Pharo12 and I guess even in Pharo10 (not tested).

Posted on May 19, 2023 #Pharo #Blocks