Contrary to the hype surrounding Java and MONO, bytecode compilation is hardly a new thing. It dates back to the days of BCPL and Pascal, and possibly even further.
The general idea is that you take code written in some high level language, and rather than compiling it into "native" code for a specific hardware architecture, you compile it into a sort of "virtual assembly language," the instruction set for some sort of generic processor. This has quite a number of merits:
The code becomes somewhat more "opaque," which is good for those who want to distribute proprietary software written in scripting languages like Perl or Python;
Code is parsed once by the byte-compiling process and transformed into a form that can be loaded quickly, without the need for complex parsing at run time;
By removing whitespace and the like, there are sometimes space savings compared to the source code form (this commonly occurs when compiling ELisp code).
More importantly, there are almost always huge space savings compared to compiling to machine code.
For example, the calendrica code compiles in various forms to the following sizes:
Table 1. Compiling calendrica.lisp
| File | Form | Size (bytes) |
|---|---|---|
| calendrica.lisp | Source Code | 170347 |
| calendrica.x86f | CMUCL Machine Code | 472649 |
| calendrica.lbytef | CMUCL Bytecompiled | 87660 |
| calendrica.fas | CLISP Bytecode | 190873 |
| calendrica.lbytef.gz | CMUCL Bytecompiled, compressed | 34941 |
| calendrica.fas.gz | CLISP Bytecode, compressed | 30290 |
The critical comparison here is that the bytecoded forms are a whole lot smaller than the roughly 472K of calendrica.x86f.
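To make the "virtual assembly language" idea concrete, here is a minimal sketch of a stack-based bytecode interpreter in C. The opcodes and encoding are invented for illustration; no real system is this tiny, but every real one has this fetch-and-dispatch loop at its heart:

```c
#include <stdio.h>

/* Opcodes invented for illustration; real VMs have hundreds. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

static void run(const unsigned char *code)
{
    long stack[64];
    int sp = 0;                      /* evaluation stack pointer */
    for (int pc = 0; ; ) {           /* the fetch/decode/dispatch loop */
        switch (code[pc++]) {
        case OP_PUSH:  stack[sp++] = code[pc++];         break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:   sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_PRINT: printf("%ld\n", stack[sp - 1]);   break;
        case OP_HALT:  return;
        }
    }
}

int main(void)
{
    /* The "compiled" form of (2 + 3) * 4: ten bytes of bytecode. */
    static const unsigned char prog[] = {
        OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
        OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT
    };
    run(prog);   /* prints 20 */
    return 0;
}
```

Note that the entire program for (2 + 3) * 4 fits in ten bytes; compiled to native code, each of those virtual instructions would expand into several machine instructions of several bytes apiece.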
It is far more difficult to measure this, but bytecode is also likely to be stored more compactly in memory than machine code. This is one of the purposes of the way CMUCL combines native compilation with a bytecode compiler: code that is executed a lot benefits from compilation to native code, while bytecode-compiling the parts of a system that are seldom executed attains substantial memory savings. The compactness here comes from the fact that the "machine language" is designed not for the computer hardware, but rather for the application.
Hand in hand with the diminished size come convenience of implementation and improved computational efficiency. All three walk in together as joint merits of designing a "computational engine" specifically for the application.
Consider that if the application is intended to process strings, it makes sense to have strings as basic data types. Parrot has "string" operations length, concat, repeat, and tostring, which work with strings far more conveniently than the operators you would get with "real machine language." That convenience can make it easier to write compact, efficient code. Expanding this to "real machine code" would increase the size of the code considerably.
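As a sketch of what that buys you (the opcode names below echo Parrot's string operations, but the encoding and handlers are invented for illustration), a string-oriented VM makes concatenation a single one-byte instruction whose handler does all the allocation and copying inside the interpreter's own compiled code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Opcode names echo Parrot's string ops; the encoding is invented. */
enum { OP_SCONST, OP_CONCAT, OP_LENGTH, OP_SPRINT, OP_SHALT };

static char *dup_str(const char *s)        /* small portable strdup */
{
    char *r = malloc(strlen(s) + 1);
    strcpy(r, s);
    return r;
}

int main(void)
{
    const char *pool[] = { "byte", "code" };   /* string constants */
    char *stack[16];
    int sp = 0;

    static const unsigned char prog[] = {
        OP_SCONST, 0,   /* push "byte"                      */
        OP_SCONST, 1,   /* push "code"                      */
        OP_CONCAT,      /* one instruction: "bytecode"      */
        OP_LENGTH,      /* one instruction: prints 8        */
        OP_SPRINT, OP_SHALT
    };

    for (int pc = 0; ; ) {
        switch (prog[pc++]) {
        case OP_SCONST:
            stack[sp++] = dup_str(pool[prog[pc++]]);
            break;
        case OP_CONCAT: {   /* all the real work hides in the handler */
            char *b = stack[--sp], *a = stack[--sp];
            char *r = malloc(strlen(a) + strlen(b) + 1);
            strcpy(r, a); strcat(r, b);
            free(a); free(b);
            stack[sp++] = r;
            break;
        }
        case OP_LENGTH: printf("%zu\n", strlen(stack[sp - 1])); break;
        case OP_SPRINT: printf("%s\n", stack[sp - 1]);          break;
        case OP_SHALT:  free(stack[--sp]); return 0;
        }
    }
}
```

The equivalent in real machine code would amount to calls into a runtime library anyway; the bytecode simply makes each such call a one-byte instruction.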
A simulated "virtual machine" can be manipulated in ways that would be prohibitively complex to do on "bare hardware." For instance, in the Parrot system, it is easy enough to save sets of registers by pushing a few pointers onto a stack. On "bare hardware," the equivalent behaviour requires pushing a whole set of registers into memory locations. This has the unexpected result that bytecoding can, here and there, actually be faster than coding to bare hardware.
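Here is a hedged sketch of that register-saving trick, assuming the VM keeps its register sets in ordinary C structures (the names are invented): saving all the registers in the VM costs one index bump, where the hardware equivalent amounts to a block of stores.

```c
#include <string.h>

#define NREGS 32

struct regfile { long i[NREGS]; };   /* one set of integer registers */

static struct regfile frames[256];   /* preallocated register frames */
static int depth = 0;

/* VM "call": a fresh register set costs one index bump. */
static struct regfile *vm_call(void) { return &frames[++depth]; }

/* VM "return": the caller's registers reappear, again for free. */
static void vm_return(void) { --depth; }

/* The bare-hardware equivalent: every live register gets stored
 * out to memory individually; this memcpy stands in for 32 stores. */
static void hw_save(struct regfile *save, const struct regfile *live)
{
    memcpy(save, live, sizeof *save);
}

int main(void)
{
    struct regfile *r = vm_call();   /* enter a "subroutine" */
    r->i[0] = 42;
    vm_return();                     /* leave: one decrement */

    struct regfile saved;
    hw_save(&saved, &frames[0]);     /* what hardware must do instead */
    return 0;
}
```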
Bytecode machines have traditionally been stack-oriented, drawing objects out of memory onto a stack, processing them there, and storing results back.
The Parrot virtual machine is a little different: it has a register architecture, with four sets of 32 registers, one set each for integers, floats, strings, and Parrot Magic Cookies. The designers figure that this will lead to less stack thrashing.
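For a feel of the difference, here is a = b + c encoded both ways (both encodings invented for illustration; the variables live in slots 0, 1, and 2):

```c
/* Stack model: operands shuttle through an evaluation stack. */
enum { S_LOAD, S_ADD, S_STORE };
static const unsigned char stack_prog[] = {
    S_LOAD, 1,      /* push b                */
    S_LOAD, 2,      /* push c                */
    S_ADD,          /* pop two, push the sum */
    S_STORE, 0      /* pop the sum into a    */
};

/* Register model (Parrot-style): operands are named directly. */
enum { R_ADD };
static const unsigned char reg_prog[] = {
    R_ADD, 0, 1, 2  /* a = b + c: one instruction, one dispatch */
};
```

Four trips through the dispatch loop versus one; that overhead, plus all the pushing and popping, is much of the "stack thrashing" the Parrot designers are trying to avoid.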
It is convenient to create operations that do extremely complex processing. Such operations provide a compact representation for something complicated, which reduces the size of a program; they also substantially improve performance by allowing a lot of work to be done within the optimized code of the "virtual machine simulator."
The classic example of an arguably mistaken instance of this is the CRC instruction on the old VAX architecture. Calculating CRC checksums and evaluating polynomials are wonderful examples of "extremely complex processing." Rather a lot of microcode silicon was likely consumed by these instructions, and few compilers made use of them. Certainly not the C compiler! As a result, code implemented in C, which includes the popular bytecoded language interpreters, is unlikely to use these operations.
In an application where you expect to evaluate a lot of polynomials, a POLY operator will certainly be of great value, as would, very likely, a whole set of matrix math operators.
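To give a sense of how much work one such operation hides, here is the standard table-driven CRC-32 (reflected polynomial 0xEDB88320) in C. A VM offering this as a single opcode would run the entire computation within one instruction dispatch, much as the VAX did the analogous job in microcode:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Standard table-driven CRC-32, reflected polynomial 0xEDB88320. */
static uint32_t crc32_table[256];

static void crc32_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc32_table[n] = c;
    }
}

static uint32_t crc32(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;
    while (len--)
        c = crc32_table[(c ^ *buf++) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}

int main(void)
{
    crc32_init();
    const char *msg = "123456789";
    /* The well-known check value for "123456789" is CBF43926. */
    printf("%08X\n", crc32((const unsigned char *)msg, strlen(msg)));
    return 0;
}
```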
CLISP is known for having unusually good performance when processing BIGNUMs (arbitrary-precision integers). Other Common Lisp implementations tend to beat its pants off when working with small integers, where they can render code into native 32-bit arithmetic operations, as you might find with crypto applications; but once you cross the line into BIGNUMs, all the implementations wind up invoking function calls, and behave little differently from a bytecode interpreter. CLISP has an unusually good BIGNUM library, and so does better than many others in this area of strength.
As for the CRC instruction on the VAX being a "mistake": it is a mistake when it consumes silicon on the CPU that would have been better used for something else, and then goes unused because your favorite compilers never emit it. The same is not true for rarely-used bytecode instructions. If there are 160,000 gates on a CPU that aren't being used, that feels wasteful. If there is 16K of code in the bytecode interpreter that never gets used, and perhaps never even gets paged into memory, the waste is nowhere near as painful.
In the hardware world, RISC may have become "king," in that it allows silicon to be devoted to more registers and to improving the ability to execute code in parallel. In a bytecode interpreter, CISC is virtually always a win.
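One common expression of that win is the "superinstruction": when a sequence of operations turns up frequently, give the whole sequence its own opcode, spending a little interpreter code to save many trips through the dispatch loop. A hedged sketch, with invented opcodes:

```c
/* Before fusion, incrementing local variable 0 takes four dispatches:
 *   LOAD 0 ; PUSH 1 ; ADD ; STORE 0
 * After fusion, one "CISC" opcode does the whole job. */
enum { OP_INCR_LOCAL /* , ... the rest of the instruction set */ };

static long locals[16];

static void do_incr_local(unsigned char slot)
{
    locals[slot] += 1;   /* one dispatch instead of four */
}
```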
The BCPL compiler generated O-code, which was then interpreted or compiled to native code.
Implementations of the Icon language compile to bytecode, allowing deployment of compiled code on any supported platform.
Emacs ELisp code is commonly compiled into bytecode to speed load time, and reduce both disk and memory consumption.
OCAML includes a portable bytecode compiler.
Various Scheme implementations use bytecode interpreters.
In Perl 6, it is planned that code will be compiled into bytecode using something called Parrot.
Smalltalk systems typically are bytecoded; one interesting claim I have seen is that some Smalltalk implementations include Java bytecode integration.
One instance of this is VisualWorks: Frost, an environment that takes Java bytecode and transforms it into Smalltalk bytecode.
Many implementations of FORTH use "bytecompiled" code.
The MONO project is a free software implementation of a number of components of the Microsoft .NET "platform," notably including a C# compiler, a CLI runtime, and class libraries.
See also Mono development platform
There have periodically been some rather hysterical reports and theories about the relationship between MONO, GNOME, and Microsoft. Many are quite wild, with rather incoherent theories as to why anyone would have thought it sensible to implement MONO.
Contrary to some of the wild theories floating around on Slashdot, the reasoning has little to do with "using Microsoft code," or Microsoft Passport authentication, or anything else of the sort.
The real reasoning has to do with language. Microsoft is implementing all sorts of things as "part of .NET;" the parts MONO is looking at are:
A dynamic language
The big-name Ximian project is the email-and-stuff application Evolution.
The code for it is written in C, and apparently whopping huge portions of it consist of memory management code, which, in C, must be done quite manually.
Using a more dynamic language that offers garbage collection removes the need to write hordes of malloc() and free() calls, which would allow an application like Evolution to be both smaller and more quickly and easily written.
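To illustrate what those "hordes" look like, here is the sort of manual bookkeeping C imposes on every exit path of even a trivial constructor (the message structure here is invented for illustration); a garbage-collected language deletes all of it:

```c
#include <stdlib.h>
#include <string.h>

struct message {
    char *subject;
    char *body;
};

/* In C, every allocation needs a matching free on every exit path. */
struct message *message_new(const char *subject, const char *body)
{
    struct message *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->subject = strdup(subject);
    if (m->subject == NULL) {        /* error path 1: free m */
        free(m);
        return NULL;
    }
    m->body = strdup(body);
    if (m->body == NULL) {           /* error path 2: free subject, m */
        free(m->subject);
        free(m);
        return NULL;
    }
    return m;
}

void message_free(struct message *m)
{
    if (m != NULL) {
        free(m->subject);
        free(m->body);
        free(m);
    }
}
```

With garbage collection, message_new collapses to the two string copies and nothing else; the error-path cleanups and message_free disappear entirely.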
Java offers garbage collection, and so looks like an answer in this regard. So would languages such as Lisp, Smalltalk, Eiffel, and Modula-3.
A bytecoded (perhaps JIT-compiled) runtime, to provide some independence from the underlying platform.
This also would disconnect application code somewhat from the deep details of the many C-based libraries of GNOME. Apparently the not-always-organized growth of GNOME's libraries has made it somewhat difficult to use many of the offered services together.
Again, Java offers a "JVM." A number of other languages offer language-specific bytecoding schemes that somewhat parallel this.
Language- and platform-independence
One of the important characteristics of the GNOME project is that it intends to be relatively agnostic about what languages are used (in contrast with the somewhat C++-partisan KDE and the Objective C-partisan GNUStep).
The various "bytecode execution machines" presently available are generally not terribly friendly to the use of multiple languages; the JVM, for instance, is for Java.
There is a bit of a never-accomplished Holy Grail to this; witness UNCOL, the "universal computer oriented language" proposed back in the 1950s and never realized.
In practice, while there are around a dozen language compilers available that could be used with Mono, nearly all code is written in C#.
In effect, MONO represents something rather like the Java platform, with the conspicuous difference that it is specifically intended to be language neutral.
Here are some links to interviews and commentary from sundry GNOME folk on the subject:
Miguel with "The Long Reply About MONO"
In effect, MONO may be summed up by "Programmer to use new compiler and new garbage collector". In a way, it's not vastly more profound than that.
Havoc Pennington comments on MONO
Basically, "it's at least a couple of years away, so speculating about its functionality and importance now is fairly silly."
Mono Mythbusting, September 2010 Edition
Which tries to puncture some of the as-of-2010 hysteria and disinformation.
An IDE tool for use with Mono
BOO - a statically-typed OO language for the CLR, with syntax inspired by Python