Java Grinder: Compile Java bytecode to microcontroller assembly

DannyB2 · on Feb 23, 2018

I am reminded of writing Java "Midlets", (J2ME - Java 2 Micro Edition) back in the early 2000's. These ran on the then-common flip-phones and candy bar phones. Predating smart phones and esp. iPhone.

I made my own "asteroids" game. Real Java code, written using Eclipse. The Java had some limitations. No floating point was the most notable. So there were some widely circulated Java libraries that did Trig and related functions using integers, offset by some decimal fraction so that about half the bits were right of decimal point. I simply drew vector graphics on each update frame of the game. An asteroid, for example, was a center point with a number of random points around it, connected by lines. Thinking of these points as r-theta instead of x,y made it simple for the asteroids (and the ship!) to rotate using simple sin/cos to convert the r-theta points into x,y for rendering the frame.

The whole thing compiled to a midlet of 21 K bytes. It ran pretty universally on flip phones of the time. The 1 and 4 keys on keypad controlled ship rotation. The 3 key was to fire.

No sound.

Now, how unrealistic is it that today's micro controllers could run a version of Java. Especially if a few J2ME type limitations were imposed.

derefr · on Feb 23, 2018

It's not even unrealistic for early-2000s microcontrollers. The https://en.wikipedia.org/wiki/Java_Card JVM profile is running on the microcontrollers in both your phone's SIM card and your credit card's chip.

mitchty · on Feb 23, 2018

They even gave out a ring that ran java iButton (I say we can blame java for the iSTUFF prefix Apple ran with just because):

https://www.javaworld.com/article/2076641/learn-java/an-intr...

https://www.slideshare.net/abhishekabhi1023/java-ring-new

mooreds · on Feb 24, 2018

Ah, midlets. I wrote up an entire presentation (using docbook) on networked j2me applications, based on lessons learned from a startup that built a j2me application that allowed consumers to search for homes (in 2004): http://www.mooreds.com/midp/midp.html

Good times, good times.

bri3d · on Feb 24, 2018

There was also TinyVM/leJOS, which ran on a Renesis H8 with 32K (quite generous for the time, and even today for an 8-bit!) of RAM. You could program your Lego Mindstorms with Java and I remember it being quite mind-blowing for the time.

microcolonel · on Feb 23, 2018

> offset by some decimal fraction so that about half the bits were right of decimal point.

Seems weird not to use binary fixpoints. Can't really see the point of using a decimal point instead of a binary point/radix.

vidarh · on Feb 23, 2018

Precision / rounding requirements is often the reason for this.

E.g. last time I implemented VAT rounding on a billing system (no I'm not suggesting I'd use trig functions for that; it's just an example of why precision matters) our financial controller was hovering over me with a copy of the HMRC rules for VAT rounding, and while we certainly could do it with floats or binary fixed point if we took some care, doing it with decimal fixed points made it a lot easier to demonstrate to our financial controller that I wasn't doing anything that'd make him liable for tax evasion.

Different representations works best for different things, but you better understand the tradeoffs.

sehugg · on Feb 23, 2018

I think they're talking about an implied decimal point, i.e. fixed point math.

vidarh · on Feb 24, 2018

Yes, but the question is in part why use an implied decimal point instead of an implied binary point. You can do fixed binary point just by treating everything past a given bit as the fraction - it's simpler.

The main reason would be differences in expected precision/rounding behavior depending on how many bits you use past the binary/decimal point. E.g. if you use 4 bits past the point, you can represent 1/10's in decimal but 1/16's in binary. But while the binary representation offers more precision, it allows you to represent different values. If you e.g. want to represent 0.2, a single decimal represented in 4 bits will allow you to do that, but with 4 bits past a binary decimal point means you'll need to round down to 0.1875 (3/16) or up to 0.25 (4/16).

valarauca1 · on Feb 24, 2018

if your dealing with currencies you can’t use fractions. Unrepresentative decimal values (.33, .34, .66) are non trivial without many bits of fraction space.

Your IEEE standard for floating points has a lot of complex rules and standard binary layouts to ensure these values are represented.

You end up needing a lot more space and rules than just diving by 10^3

vidarh · on Feb 24, 2018

Of course you can use fractions. You can't reasonably use binary fractions. But the discussion is about fixed point representations, not floating point anyway.

DannyB2 · on Feb 26, 2018

I should have said binary point.

pjc50 · on Feb 23, 2018

Some ARM devices can execute Java bytecode directly: https://en.wikipedia.org/wiki/Jazelle

It never really took off.

gh02t · on Feb 24, 2018

From the Wiki page, if people are wondering why it never took off:

> The published specifications are very incomplete, being only sufficient for writing operating system code that can support a JVM that uses Jazelle. The declared intent is that only the JVM software needs to (or is allowed to) depend on the hardware interface details. This tight binding facilitates that the hardware and JVM can evolve together without affecting other software. In effect, this gives ARM Holdings considerable control over which JVMs are able to exploit Jazelle. It also prevents open source JVMs from using Jazelle.

So it sounds like ARM tried (presumably) to leverage it to make some profit off of support agreements, but ended up falling on their own sword.

gok · on Feb 24, 2018

It was kind of a half baked idea. At best, it could lower the small overhead of the interpreter loop for a subset of Java bytecodes. At some point you’d need a JIT to get better performance, and that JIT would want to target real ARM code (so you can do things like register allocation)

One cool thing about Java on little processors is that you can get memory safety without an MMU. But again at some point you’re going to want an MMU anyway...

pjmlp · on Feb 24, 2018

That is the whole premises of language OSes, type safety over MMU.

But it is as you say, given enough memory to manage or the usual performance tricks for GC/JIT, and the MMU will be needed.

carom · on Feb 23, 2018

I understand with class methods you can just jump to them, but how are class instances treated in Java? Where are they stored in a JVM? I assume the templates for them are stored in the constant pool, but I'm having a hard time finding exactly what is stored and then from there, how they're instantiated.

Edit: In an attempt to answer my own question, I came across this blog [0] which is helpful.

0. http://blog.jamesdbloom.com/JVMInternals.html#constant_pool

jacobwilliamroy · on Feb 23, 2018

Here's an excerpt from Bruce Eckel's Thinking in Java (4th Edition), pg. 42-43

The stack

This lives in the general random -access memory (RAM) area, but has direct support from the processor via its stack pointer The stack pointer is moved down to create new memory and moved up to release that memory. This is an extremely fast and efficient way to allocate st orage, second only to registers. The Java system must know, while it is creating the program, the exact lifetime of all the items that are stored on the stack. This constraint places limits on the flexibility of your programs, so while some Java storage exis ts on the stack—in particular, object references—Java objects themselves are not placed on the stack.

The heap

This is a general-purpose pool of me mory (also in the RAM area) where all Java objects live. The nice thing about the he ap is that, unlike the stack, the compiler doesn’t need to know how long that storag e must stay on the heap. Thus, there’s a great deal of flexibility in using storage on the heap. Whenever you need an object, you simply write the code to create it by using new, and the storage is allocated on the heap when that code is executed. Of course th ere’s a price you pay for this flexibility: It may take more time to allocate and clean up heap storage than stack storage (if you even could create objects on the stack in Java, as you can in C++).

Not sure if this answers your question. I've only gotten through about 5% of the book, and I can't tell from the table of contents, if he ever goes into more detail on these explanations.

sehugg · on Feb 23, 2018

More information and example output here: http://www.mikekohn.net/micro/java_grinder.php

Note that it's just a subset of the bytecode spec (no GC, objects, virtual methods) but might be a nice way to get up to speed on a CPU architecture. I used the SDCC C compiler similarly to learn Z80.

sitkack · on Feb 24, 2018

How does no GC work? Static arrays?

fisian · on Feb 23, 2018

Nice project. This already gave me the motivation to look more into Java bytecode. Does anyone know a good resource which explains Java bytecode?

cle · on Feb 23, 2018

https://docs.oracle.com/javase/specs/jvms/se9/html/index.htm...

calebm · on Feb 23, 2018

Great name. I'd really like to see this used to convert Java Beans to assembly :)

Ecco · on Feb 23, 2018

Nice ! But without GC I guess it’s just a nice hack and will have a hard time being actually useful

amelius · on Feb 23, 2018

GC doesn't have to be complicated if performance isn't an issue.

Also, escape-analysis could be used to run this without GC in trivial cases.

vidarh · on Feb 23, 2018

As an example of how simple, as I think people tend to see gc as complicated and scary, a simple tri-color mark and sweep gc in pseudo code:

    push_registers_to_stack()  // Simplifies dealing with roots - you can now ignore registers
    // These are the roots:
    // (grey objects are directly referenced from a root, but have not yet been scanned for pointers)
    for_each_stack_value { mark_grey }   
    for_each_bss_variable { mark_grey }
    // We then keep iterating over grey objects, scanning the values of those objects for further
    // pointers (which we mark grey), and mark "finished" objects black:
    while (ob = grey_list.shift) { 
        for each pointer in ob { 
            if white: mark_grey(pointer); 
            mark_black(ob) 
        }
    }
    // Now any objects you haven't touched are still white (and unreachable), and thus garbage:
    for_each_object {
        if white: free
        else: mark_white
    }

The above is inefficient, but it takes very little space to implement. If you know precise object layout and takes it into account when scanning, it's also precise, if not it's conservative (it may leave garbage if values in memory looks like pointers). Downside is that it is "stop the world", so if your heap is big, it'll cause noticeable pauses.

Simple GC's are easy.... Fast/good/efficient GC's on the other hand...

vvanders · on Feb 23, 2018

The other downside is fragmentation(which is also why even embedded native tends to be static/up-front allocation based).

vidarh · on Feb 24, 2018

That's true, and certainly if you can avoid dynamic allocations, do.

But you can actually get "halfway" to a partially compacting gc very easily too if memory use is a bigger concern than performance and you want to avoid fragmentation:

Set a (small) fixed size for all objects. It can be anything from a single pointer to any data that will be present in all objects. E.g. you might have class pointers and flags and a fixed number of bytes for auxiliary data. If any class needs more instance data, you make a pointer in that primary layout point to space in a secondary heap.

Now compacting the secondary heap is easy: Just copy into new buffers and deallocate the old and update a single pointer per object. You can do that by a single pass through all allocated objects, or you can do something more complex.

Compacting the primary heap is still messy (you'd need to identify all sites pointing to each object), but if you chose a small size for the primary objects there's less reason to do so, as while it is still possible to create pathological allocation patterns, unless the application reaches a high water mark and then frees a huge number of objects that it then never need again, any empty space left behind can always be perfectly filled by new primary object allocations, and hopefully each such object is fairly small.

If you want to, though, you can also compact auxiliary data (the "secondary heap") into holes in the primary heap when there's a good enough fit. E.g. on top of the steps I outlined:

For each object, if there's an auxiliary data pointer, check if there's a suitable hole (keep lists by "hole size" for example) in the primary heap. If so, move there. If not, copy to end of new secondary heap. Update source pointer.

pjmlp · on Feb 24, 2018

Even simpler than your description, just plain reference counting, which is why it is usually the first GC algorithm many reach to.

Eventually coupled with a basic tracing GC like you describe for collecting cycles.

It won't win any performance awards, but does the job.

sehugg · on Feb 24, 2018

Yep, even Applesoft BASIC had a simple compacting GC for strings.

vidarh · on Feb 24, 2018

Yes, that's certainly "low hanging fruit" in compacting in that if you have a list of "objects" that e.g. contain just a pointer and a length of auxiliary data, so that you have an indirection, compacting gets really easy since you only need to update a single location, and for things like strings that auxiliary data is likely to make up most of the memory use.

DannyB2 · on Feb 23, 2018

Also if memory is fairly limited, then you don't get that much of a pause time from even a fairly simplistic GC implementation.

I would also point out that modern micro controllers run faster than the old systems mentioned in TFA. (eg, Apple II, TI/99, etc.)

There are already micro controller implementations of GC languages. MicroPython. Espruino (eg, JavaScript)

pjmlp · on Feb 24, 2018

That is what gets old about arguing against GC languages on micro-controllers.

Of course there are use cases using 8KB PIC micro-controllers and such.

On the other hand something like ESP32, has a dual core with 512KB, way better than the Amstrad PCW 512 used to be.

amelius · on Feb 23, 2018

As a concrete example of the second point: if your program basically does:

    for(;;)
      do_shortlived_task();

Then even if do_shortlived_task() has a complicated memory allocation pattern, you can get away by just throwing away the entire heap after each iteration.

vvanders · on Feb 23, 2018

That's assuming no statics/singletons.

vidarh · on Feb 23, 2018

That's still easily solved by providing arenas or barriers: allocate anything that needs to persist first, then mark, do your shortlived task and throw away anything above the mark in memory.

vvanders · on Feb 23, 2018

If your statics have containers(yes I know it's a bad idea, but not forbidden) then you can easily have invalid pointers now with that approach.

Really you're going to have to go through so many gyrations that your GC code may start looking a lot like manual memory management.

vidarh · on Feb 24, 2018

No, you just need to know what you're allocating and accept that if you're using an approach like that you're very much responsible for ensuring you don't allocate anything that should live on past the mark. It's an old, well tested mechanism - the brk() syscall in Unix-like systems for getting memory does just that, for example (though today some get more convoluted and may or may not opt to use mmap() as well)

And you'll find it used today even in systems with gc's too. E.g. mruby for example provides an API to do pretty much that for extensions that would otherwise leave a lot of garbage to explicitly allocate to an arena and throw away everything other than whatever you then explicitly register with the gc.

If you have complications like what you describe, you'd probably want a proper gc, but as I outlined elsewhere, a simple gc is also easy if your heap is small enough that performance isn't an issue. But for many systems it's not necessary.

xfer · on Feb 24, 2018

Look into javacard spec, it doesn't require a GC, but it is used everywhere(your credit card, SIM card etc.). The secure processors used here implement this spec.

nickpsecurity · on Feb 24, 2018

JavaCard is also one of the few areas where high-assurance security use commercially. Names that come to mind are Kesterel Institute and Gemalto with EAL7-style VM's with formal verification. Examples:

https://www.commoncriteriaportal.org/iccc/9iccc/pdf/B2404.pd...

http://www.kestrel.edu/home/projects/jcapplets/

_lbaq · on Feb 23, 2018

Most of the machines on the list have a 16 bit address range (65536 bytes without paging) and usually less than that in free memory, GC or no GC, does not matter.

dfox · on Feb 24, 2018

Original PARC Smalltalk implementations had 32k (16b) word heaps (object pointers were 15b + one tag bit distinguishing SmallInteger and everything else). You could write quite complex systems with just that.

Edit: this is true for the canonical ST80 VM described in Blue book, ST80 VM on Alto AFAIK had some indirection mechanism (ie. not 32k words, but 32k heap objects) and did paging.

pjmlp · on Feb 24, 2018

Also the Lisp implementations in the 60 and 70's were even more constrained on IBM mainframes.