In order to more fully understand how I'm going to implement the translator from my imaginary programming language, I have written a few direct translations from px to assembly. My thinking is that if the language doesn't rely on the abstractions provided by the "compile to C" interpreted languages, I should be able to end up with a systems level language with the clean syntax of python, ruby, et al.
My first example is to calculate the sum of a large list. Here are two versions in python.
UPDATE!!! After trying a few more versions, I realized that my loop counters were off in every example. I had made the silly python mistake of forgetting the range of range. range(0, 123456789) = 0,1,2...123456788. And I had subsequently propagated that mistake into my C and Assembly versions, producing the wrong answer in every case, sometimes very fast! Lesson learned: Check your math!
# listsum1.py - calculate 1 + 2 + ... + 12,345,678 = listsum listsum = 0 for f in range(0,12345678): listsum = listsum + f print "The result is " + str(listsum) # listsum2.py listsum = sum(range(0,12345678)) print "The result is " + str(listsum)
And here's what it might look like in px.
print listsum buildlist 12345678
But, of course, no library functions exist yet, so I'll spell out how the functions listsum and buildlist would be defined.
n... buildlist size, next 0: next next + 1 n next if next < size : n listsum list...: break if not list n n + list continue True :
buildlist is defined as taking one argument, size and producing a list, one item at a time. Listsum consumes these items and produces a final value when the whole list has been read. All items are passed as two 32-bit values, [type/status] and [value/pointer]. Each operation checks the status flag to check if it is receiving a valid object and what that object is exactly. The buildlist function uses this fact to push a false object onto the stack when it reaches "size". This tells listsum that there are no more values to be read. In assembly this is what the code structure looks like in simplified form:
list_size: .int 12345678 l_sum: .int 0 push $1 push list_size pop list_size pop edi #using edi as the status flag mov 0, ecx #and ecx for the counter buildlist: compare list_size, ecx cmovae 0, edi #if above, set edi to 0 (false) add 1, ecx push edi push ecx listsum: pop eax pop edi cmp 0, edi je end_listsum add eax, l_sum jmp buildlist end_listsum print l_sum
In the actual implementation, I changed the add to support a quadword because the results get large quite fast, but this gives you the idea.
Is It Fast?
Well, I knew it would be fast, but this is a pretty large list, the code between buildlist and end_listsum has to run over 12 million times to calculate the sum, this is what I got when comparing the two python versions and my assembly version:
$ time python listsum1.py The result is 76207876467003 real 0m15.379s user 0m14.819s sys 0m0.557s $ time python listsum2.py The result is 76207876467003 real 0m8.123s user 0m7.476s sys 0m0.647s $ time ./listsum The result is 76207876467003 real 0m0.275s user 0m0.273s sys 0m0.000s
That's pretty blazing fast if you ask me! But I was put in my place when I ran the C version which produced these results:
time ./f_sum3 The answer is 76207876467003.000000 real 0m0.123s user 0m0.120s sys 0m0.000s
I used a floating point double for the list sum, so maybe that's why C was faster. Still, if we use C as the baseline we have:
C 1 px 2.2 Py2 66.04 Py1 125.03
Not bad for a start. Now I'm off to dig through the compiled C code to see how it should be done...
UPDATE #2 - I've realized that there is a quite simple algorithm to solve this particular problem and that is: (n/2) * (n+1). This simple equation eliminates the loop completely making the calculation trivial. The updated python version runs in 0.038s and the C version runs in 0.002s producing the, now correct sum, 76207888812681.
I figured out the integer type I should have been using "uint_64_t" and the proper format string for printf is %llu. Turns out this one runs even faster, computing the value in 0.064 s. Much of the speed gains seem to be related to accessing the stack directly, rather than pushing and popping the stack. The compiler also byte-aligns the stack before using it. After implementing these optimizations, my assembly version clocks in at 0.095 and that includes the listsum checking at every step if it's receiving valid data. Sweet!
ReplyDeleteAs another test, I pulled all the heavily used data off the stack and kept them in the registers. The result 0.047s! I've beat the C version, but only because there are very few variables in play. The 8 bytes of the sum stay in the EDI and ESI registers, The status stays in the ECX register and the loop counter stays in the EAX/EDX. Strangely, it performs faster when the total loop count is on the stack vs. in EBX, maybe it's a processor feature for static values?
ReplyDelete