Rune Object and Machine Model

Instruction Set

Instruction codes are one or two 16-bit words, plus an optional extension. They are generally split into two groups: instructions with an operation size (zz or zzz) and instructions without an operation size. Instructions without an operation size use the operation size set in the execute flow for the register. There is no auto-extension to a larger width, and width mixing is relatively restricted (pointers excepted when loading pointers from or storing pointers to memory). There are explicit EXT instructions for width promotion or demotion.

There are 16 variable-width integer registers, 16 fixed-width pointer registers, and 16 variable-width media registers. When used properly, pointers are designed to be machine portable: their width is not known until the final translation to the target machine. This formulation of the instruction set results in very high portability and good instruction compactness.

Condition status bits are only valid for the immediately following instruction. You can think of it this way: all instructions always set all condition bits. This is a huge translation and optimization aid.
0000zz00 ssssdddd                         MOVE    %rs,%rd            NOP2
0000zz01 ssssdddd                         MOVEU   %rs,%rd
0000zz10 vvvvdddd                         MOVE    $IMMQ4,%rd
0000zz11 vvvvdddd                         MOVE    $IMMQ4,%pd
0001zzxx 0000dddd IMM*                    MOVE    $IMM,%rd           NOP8 (64-bit IMM,%r0)
0001zzxx 0001dddd IMM*                    MOVE    $IMM,%pd
0001zzxx 0010dddd EXT*                    (reserved)
0001zzxx 0011dddd EXT*                    (reserved)
0001zzxx 0100dddd EXT*                    (reserved)
0001zzxx 0101dddd EXT*                    (reserved)
0001zzxx 0110dddd EXT*                    (reserved)
0001zzxx 0111dddd EXT*                    (reserved)
0001zzxx 1eeedddd ssssvvvv iiiiiiii EXT*  insn256 ea,ea          (dep on eee)
0001zzxx 1eeedddd ssssgggg iiiiiiii EXT*  insn256 ea,%rg,ea      (dep on eee)
0001zzxx 1eeedddd vvvvvvvv iiiiiiii EXT*  insn256 ea,ea          (dep on eee)
0010zzxx ssssdddd OFF*                    MOVE    OFF(%ps),%rd
0011zzxx vvvvdddd OFF*                    MOVE    $IMMQ4,OFF(%pd)
0100zz00 ssssdddd                         MOVE    %rs,%pd
0100zz01 ssssdddd                         MOVEU   %rs,%pd
0100zz10 ssssdddd                         MOVE    %ps,%pd
0100zz11 ssssdddd                         MOVE    %ps,%rd
010100xx iiiicccc OFF*                    implA16 OFF(%pc)
010101xx iiiidddd OFF*                    implB16 OFF(%pd)
010110xx ssssdddd OFF*                    FMOVE   OFF(%ps),%fd
010111xx ssssdddd OFF*                    FMOVE   %fs,OFF(%pd)
0110zzxx ssssdddd OFF*                    LEA     OFF(%ps),%pd
0111zzxx ssssdddd OFF*                    MOVE    OFF(%ps),%pd
01000iii ssssdddd                         insi8   %rs,%rd
01001iii ssssdddd                         insi8   %ps,%pd
01010iii ssssdddd                         insn8   %rs,%pd
01011iii ssssdddd                         insn8U  %rs,%pd
01100iii ssssdddd                         insn8   %ps,%rd
01101iii ssssdddd                         insn8U  %ps,%rd
01110iii vvvvdddd                         insn8   $IMMQ4,%rd
01111iii vvvvdddd                         insn8   $IMMQ4,%pd
100iiixx ssssdddd OFF*                    insn8   OFF(%ps),%rd
101iiixx ssssdddd OFF*                    insn8   %rs,OFF(%pd)
110iiixx ssssdddd OFF*                    insn8   OFF(%ps),%pd
111iiixx ssssdddd OFF*                    insn8   %ps,OFF(%pd)

NOTE: %ps/%pd - represent pointer registers
      %rs/%rd - represent integer registers
      %fs/%fd - represent media registers

zz: Operation Size

        Integer   Media
    00  8-bit     32-bit
    01  16-bit    64-bit
    10  32-bit    128-bit
    11  64-bit    256-bit

    NOTE: Media operations apply to FMOVE and all media instructions.
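To make the field layout concrete, here is a minimal Python sketch that splits a 16-bit word of the general shape `oooozzxx ssssdddd` and maps `zz` to an operation size using the table above. It assumes the bit patterns in the table are written most-significant bit first; the names `split_fields` and `operation_size_bits` are hypothetical and not part of Rune.

```python
# Sketch only: assumes the table's bit patterns are written MSB first.
# All names here are hypothetical illustrations, not part of Rune.

INTEGER_SIZE_BITS = {0b00: 8, 0b01: 16, 0b10: 32, 0b11: 64}
MEDIA_SIZE_BITS = {0b00: 32, 0b01: 64, 0b10: 128, 0b11: 256}


def split_fields(word):
    """Split a 16-bit word laid out as  oooozzxx ssssdddd  into its fields."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a 16-bit instruction word")
    return {
        "opcode": (word >> 12) & 0xF,  # leading four bits
        "zz": (word >> 10) & 0x3,      # operation size
        "xx": (word >> 8) & 0x3,       # extension word specification
        "ssss": (word >> 4) & 0xF,     # source register
        "dddd": word & 0xF,            # target register
    }


def operation_size_bits(zz, media=False):
    """Map the zz field to an operation size in bits."""
    table = MEDIA_SIZE_BITS if media else INTEGER_SIZE_BITS
    return table[zz]


# 0010zzxx ssssdddd with zz=10, xx=01, ssss=0011, dddd=0101, i.e. the
# MOVE OFF(%ps),%rd form as a 32-bit integer operation.
fields = split_fields(0b0010_1001_0011_0101)
print(fields, operation_size_bits(fields["zz"]))
```

Only forms that actually carry `zz` and `xx` in these positions decode meaningfully this way; the `insn8` and implied forms place other fields there.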
xx: Extension word specification

    00  none (implied $0 or v-bits depending)
    01  16-bit
    10  32-bit
    11  64-bit

    NOTE: The extension value is always sign-extended to the width of the operation (if immediate) or the width of the machine pointer (if offset). If the machine pointer width is less, the extension value will be truncated.

    NOTE: Code 00 indicates that there is no extension. The value normally derived from the extension will be 0. If 'v' bits are present and not otherwise used by the instruction, the value will be taken from these bits instead.

    NOTE: Extension words can be misaligned, but will still be laid out without any gaps. In such cases the extension words are laid out in such a way as to avoid having to do any shifts. Thus the extension words are extracted in their nominal bit positions, masked, and OR'd together as appropriate. For example, a 64-bit extension that is offset by 16 bits (i.e. not 64-bit aligned) will be laid out:

    0123456789ABCDEF    (byte offset)
    23456701            (laid-out - if little-endian)
    54321076            (laid-out - if big-endian)

ssss: Source Register
dddd: Target Register

    Register 0 in all domains always reads 0 and is a sink-null on write. The top 5 registers in the pointer domain have a special meaning and cannot be directly written to in usermode. This leaves 11 general pointer registers available for code backends.

    %db (%p11)  Data & library base pointer (e.g. per-library)
    %tp (%p12)  Thread pointer
    %ap (%p13)  Argument pointer (caller frame)
    %fp (%p14)  Frame pointer
    %pc (%p15)  Program counter

    Integer domain registers can be 8, 16, 32, or 64 bits. Media domain registers can be 32, 64, 128, or 256 bits. Total storage required to save the register set is ~768 bytes, inclusive of %r0, %p0, and %f0.
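The extension-word rule (sign-extend to the operation or pointer width, truncate when the pointer is narrower) and the register save-area figure can both be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, values are returned as unsigned bit patterns, and the 768-byte figure is reproduced by assuming 64-bit pointers on the target, which the document leaves open until final translation.

```python
# Sketch only; helper names are hypothetical.

EXTENSION_BITS = {0b00: 0, 0b01: 16, 0b10: 32, 0b11: 64}  # xx field -> extension size


def extend_or_truncate(raw, from_bits, to_bits):
    """Sign-extend a from_bits-wide value to to_bits, truncating if to_bits is narrower.

    The result is returned as an unsigned to_bits-wide bit pattern.
    """
    raw &= (1 << from_bits) - 1
    if raw & (1 << (from_bits - 1)):   # sign bit set: make the value negative
        raw -= 1 << from_bits
    return raw & ((1 << to_bits) - 1)  # re-encode (or truncate) at the target width


def register_set_bytes(pointer_bits=64):
    """Bytes needed to save 16 integer, 16 pointer and 16 media registers.

    Assumes the widest integer (64-bit) and media (256-bit) domains;
    pointer_bits is a property of the target machine (assumed 64 here).
    """
    integer = 16 * 64 // 8
    pointer = 16 * pointer_bits // 8
    media = 16 * 256 // 8
    return integer + pointer + media


print(hex(extend_or_truncate(0xFFFE, 16, 32)))  # 16-bit offset -2 on a 32-bit pointer
print(register_set_bytes())                     # 128 + 128 + 512
```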
cccc: Condition code (only applicable to related implied instruction)

    0000  always     BRA
    0001  Z          BEQ        a == b
    0010  C          BCS, BLO   a < b (unsigned)
                     BVSU       overflow (unsigned)
    0011  Z | C      BLS        a <= b (unsigned)
    0100  N          BMI, BLT   a < b (signed)
    0101  (reserved)
    0110  N | Z      BLE        a <= b (signed)
    0111             BVS        overflow (signed)
    1000  never      BNEVER
    1001  ~Z         BNE        a != b
    1010  ~C         BCC, BHS   a >= b (unsigned)
                     BVCU       no-overflow (unsigned)
    1011  ~Z & ~C    BHI        a > b (unsigned)
    1100  ~N         BPL, BGE   a >= b (signed)
    1101  (reserved)
    1110  ~N & ~Z    BGT        a > b (signed)
    1111             BVC        no-overflow (signed)

vvvv: Immediate or Offset Quick value

    -8 to +8 (code 0000 = +8)

    NOTE: insn $0,dea is automatically converted to insn %r0,dea

vvvvvvvv: Immediate or Offset Quick value (32-bit instructions only)

    -128 to +127 (0-inclusive)

    NOTE: Full 2s complement allows for 8-bit relocation.

eee: Effective Address Mode (32-bit instructions only)

    000  insn %xs,%xd
    001  insn $IMM,%xd
         insn $IMMQ,%xd
    010  insn %xs,%xg,OFF(%pd)    (three-register mode)
    011  insn OFF(%ps),%xg,%xd    (three-register mode)
    100  insn OFF(%ps),%xd
    101  insn %xs,OFF(%pd)
    110  insn $IMMQ,OFF(%pd)
    111  insn $IMM,OFFQ(%pd)

    NOTE: IMMQ and OFFQ use the vvvvvvvv bits, giving us very good compaction for the 32-bit instruction form.

    NOTE: OFF will make use of the vvvv bits if xx=00. IMM will make use of the vvvvvvvv bits if xx=00 (i.e. it is IMMQ).

    NOTE: %xs/%xd may represent integer (%rs/%rd) or media (%fs/%fd) registers depending on the instruction.

    NOTE: Remember that %r0, %f0, and %p0 always read 0 and are sink-null on write. These registers can be used whenever 0 is desired. In particular, absolute addressing may be formulated by using %p0 as the pointer register. Similarly, if extension words are omitted, any EA expecting an extension for its OFF or IMM value will receive either the value 0 or a value derived from the vvvv or vvvvvvvv bits.

    NOTE: Memory operations are always read, write, or read-modify-write. All instructions except those in the implied sets execute at most one memory operation.
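The condition table has a regular structure: the low three bits select a base test and the top bit inverts it. The sketch below evaluates `cccc` against a set of flags and decodes the two quick-value fields. Several points are assumptions rather than statements from this document: the flag tested by BVS/BVC is called `v` here because the table leaves that column blank, non-zero `vvvv` codes are treated as 4-bit two's complement (the text only states the range and that code 0000 means +8), and all function names are hypothetical.

```python
# Sketch only; names and the V flag are assumptions (see lead-in).

def condition_holds(cccc, z, c, n, v):
    """Evaluate a 4-bit condition code against the Z, C, N and (assumed) V flags."""
    low = cccc & 0b0111
    if low == 0b101:
        raise ValueError("condition codes 0101 and 1101 are reserved")
    base = {
        0b000: True,     # always
        0b001: z,        # Z
        0b010: c,        # C
        0b011: z or c,   # Z | C
        0b100: n,        # N
        0b110: n or z,   # N | Z
        0b111: v,        # overflow (flag name assumed)
    }[low]
    # The top bit selects the complemented test (never, ~Z, ~C, ...).
    return (not base) if cccc & 0b1000 else bool(base)


def decode_quick4(vvvv):
    """Decode the 4-bit quick value: code 0000 means +8, there is no zero.

    Assumes non-zero codes are ordinary 4-bit two's complement.
    """
    if vvvv == 0:
        return 8
    return vvvv - 16 if vvvv & 0b1000 else vvvv


def decode_quick8(v8):
    """Decode the 8-bit quick value as full two's complement (-128..+127)."""
    return v8 - 256 if v8 & 0x80 else v8


print(condition_holds(0b1011, z=False, c=False, n=False, v=False))
print([decode_quick4(code) for code in range(16)])
```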
Instruction Set

Implied Instructions A (conditional)

The implA instruction form encodes 16 instructions with an optional PC-relative effective-address and condition. Not all insn combinations are legal. xx must be 00 for instructions which do not use the PC-rel EA.

    0000  Bcc         OFF(%pc)
    0001  CALLcc      OFF(%pc)
    0010  (reserved)  OFF(%pc)
    0011  (reserved)  OFF(%pc)
    0100  LFENCE
    0101  SFENCE
    0110  MFENCE
    0111  (reserved)
    1000  REScc
    1001  RETcc
    1010  IRETcc
    1011  RAISEcc
    1100  LREScc
    1101  LRETcc
    1110  LIRETcc
    1111  LRAISEcc

    WARNING: xx bits must be 00 for any instruction which does not utilize the extension word capability.

Implied Instructions B (PTR source or target)

The implB instruction form encodes 16 instructions with an optional register-relative address, either an effective-address or a memory load obtaining the address. Note that these instructions have side effects and may use other registers; see additional information later on in this document.

    0000  JMP    OFF(%pd)
    0001  CALL   OFF(%pd)
    0010  LCALL  OFF(%pd)
    0011  TRAP   OFF(%pd)
    0100
    0101
    0110
    0111

    Change of control. Note that an absolute address or value may be specified using %p0, an effective address using normal addressing, or a load from memory can supply the address. These instructions are mostly used without an offset. TRAPs supply the address of the trap vector entry.

    1000  SLOCK    OFF(%pd)
    1001  SLOCKNB  OFF(%pd)
    1010  XLOCK    OFF(%pd)
    1011  XLOCKNB  OFF(%pd)
    1100  UNLOCK   OFF(%pd)
    1101
    1110
    1111

    Object locking instructions are so integral to Rune they need their own instructions in order to allow translators to heavily optimize their operations. Note that hard vs soft locking is handled by the threading runtime.
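As an illustration of the implA rules above, the following Python sketch decodes a 16-bit word of the form `010100xx iiiicccc` and enforces the requirement that `xx` be 00 when the instruction takes no PC-relative offset. It assumes bit patterns are written most-significant bit first; the table and function names are hypothetical.

```python
# Sketch only; assumes MSB-first bit patterns. Names are hypothetical.

IMPL_A = {
    0b0000: "Bcc", 0b0001: "CALLcc",
    0b0100: "LFENCE", 0b0101: "SFENCE", 0b0110: "MFENCE",
    0b1000: "REScc", 0b1001: "RETcc", 0b1010: "IRETcc", 0b1011: "RAISEcc",
    0b1100: "LREScc", 0b1101: "LRETcc", 0b1110: "LIRETcc", 0b1111: "LRAISEcc",
}
USES_PC_RELATIVE = {0b0000, 0b0001}  # only these take OFF(%pc)


def decode_impl_a(word):
    """Return (mnemonic, cccc, xx) for an implA16 word, or raise ValueError."""
    if not 0 <= word <= 0xFFFF or (word >> 10) != 0b010100:
        raise ValueError("not an implA16 word")
    xx = (word >> 8) & 0x3
    iiii = (word >> 4) & 0xF
    cccc = word & 0xF
    name = IMPL_A.get(iiii)
    if name is None:
        raise ValueError("reserved implA instruction code")
    if iiii not in USES_PC_RELATIVE and xx != 0:
        raise ValueError("xx must be 00 when no extension word is used")
    return name, cccc, xx


print(decode_impl_a(0x5101))  # Bcc with condition 0001 and a 16-bit offset word
print(decode_impl_a(0x5090))  # RETcc with condition 0000, no extension
```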
Core ALU Instructions

    0000.0000  MOVE   d = s
               NOP4   %r0 = %r0
               NOP6
               NOP8
               NOP12
    0000.0001  ADD    d = d + s
    0000.0010  CMP    (d - s) -> condition
    0000.0011  SUB    d = d - s
    0000.0100  AND    d = d & s
    0000.0101  OR     d = d | s
    0000.0110  XOR    d = d ^ s
    0000.0111  TEST   (d & s) -> condition
    0000.1000  ASL    d = d << s
    0000.1001  NEG    d = -s
    0000.1010  ASR    d = d >> s (signed)
    0000.1011  LSR    d = d >> s (unsigned)
    0000.1100  BSET   d = d | (1 << s)
    0000.1101  BCLR   d = d & ~(1 << s)
    0000.1110  BCOM   d = d ^ (1 << s)
    0000.1111  BTST   (d & (1 << s)) -> condition

    NOTE: The 16-bit MOVE forms should be used whenever possible. Use the 32-bit MOVE only to construct variable-width NOPs.

    NOTE: Shifts may mix register widths. The source register can be any integer width (8, 16, 32, 64). The shift value will be truncated to the target register width. For example, shifting an 8-bit register left by 8 bits is the same as shifting it by 0 bits; the target will be unchanged.

Conditional Long Form Instructions

    0001.cccc  (reserved)
    0010.cccc  SETcc   d = s iff cond else d = 0
    0011.cccc  MOVEcc  d = s iff cond

    NOTE: Only these instructions and the 16-bit implA instructions utilize condition bits.
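The shift-count rule can be modeled in a few lines. The sketch below reads "truncated to the target register width" as taking the count modulo the width, which is the interpretation that matches the 8-bit example given in the note; that reading, and the function names, are assumptions.

```python
# Sketch only: shift counts are reduced modulo the target width, which is
# one reading of the truncation rule. Names are hypothetical.

def _mask(width):
    return (1 << width) - 1


def asl(d, s, width):
    """Arithmetic shift left of a width-bit register by s."""
    count = s % width
    return (d << count) & _mask(width)


def lsr(d, s, width):
    """Logical (unsigned) shift right of a width-bit register by s."""
    count = s % width
    return (d & _mask(width)) >> count


def asr(d, s, width):
    """Arithmetic (signed) shift right of a width-bit register by s."""
    count = s % width
    d &= _mask(width)
    if d & (1 << (width - 1)):      # reinterpret the bit pattern as signed
        d -= 1 << width
    return (d >> count) & _mask(width)


print(hex(asl(0x81, 8, 8)))  # count 8 on an 8-bit target behaves like count 0
print(hex(asr(0x80, 1, 8)), hex(lsr(0x80, 1, 8)))
```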
Special Fast Integer and Mask instructions

    0100.0000  ADDC
    0100.0001  SUBC
    0100.0010  BMASK  d = (1 << s) - 1
    0100.0011  BIT    d = 1 << s
    0100.0100  ROL    d = d <<< s (lsb->carry after operation)
    0100.0101  ROR    d = d >>> s (msb->carry after operation)
    0100.0110  ROLC   d = d,c << s (rotate through carry)
    0100.0111  RORC   d = d,c >> s (rotate through carry)

Endian Conversion Instructions

    0100.1000  SWAP8   swap 8-bit pairs
    0100.1001  SWAP16  swap 16-bit pairs
    0100.1010  SWAP32  swap 32-bit pairs
    0100.1011  ESWAP   swap endian
    0100.1100  LEMOVE  move from/to little-endian
    0100.1101  BEMOVE  move from/to big-endian
    0100.1110
    0100.1111

Integer Multiply/Divide Instructions

    0101.0000  MULU     d = d * s (unsigned)
    0101.0001  MULS     d = d * s
    0101.0010  DIVU     d = d / s (unsigned)
    0101.0011  DIVS     d = d / s
    0101.0100  MODU     d = d % s (unsigned)
    0101.0101  MODS     d = d % s
    0101.0110
    0101.0111
    0101.1000  MULDIVU  d = d * s / g (unsigned)
    0101.1001  MULDIVS  d = d * s / g
    0101.1010  MULMODU  d = d * s % g (unsigned)
    0101.1011  MULMODS  d = d * s % g

    NOTE: MULDIV* and MULMOD* are only useful as a 3-operand instruction. The intermediate result is double-width internally and thus can be used to fractionally scale a 64-bit quantity if desired.

    NOTE: If both division and modulo are desired, both instructions must be executed, but the hardware and/or translator may be able to optimize that back into one native instruction.
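The value of the double-width intermediate in MULDIV* is easiest to see with numbers. The sketch below models the unsigned forms in Python, whose integers are arbitrary precision, and contrasts them with a product truncated to the register width. The function names are hypothetical, and the behavior for g = 0 is not specified in this document (the model simply raises ZeroDivisionError).

```python
# Sketch only; names are hypothetical. Division by zero is unspecified in
# the text, so this model just lets Python raise ZeroDivisionError.

def muldivu(d, s, g, width=64):
    """d * s / g with a double-width intermediate product (unsigned)."""
    mask = (1 << width) - 1
    product = (d & mask) * (s & mask)      # up to 2 * width bits, never truncated
    return (product // (g & mask)) & mask


def mulmodu(d, s, g, width=64):
    """d * s % g with a double-width intermediate product (unsigned)."""
    mask = (1 << width) - 1
    return ((d & mask) * (s & mask)) % (g & mask)


def truncated_muldivu(d, s, g, width=64):
    """What a plain MULU followed by DIVU would give: the product wraps first."""
    mask = (1 << width) - 1
    product = ((d & mask) * (s & mask)) & mask
    return product // (g & mask)


# Scale 2**63 by 3/4. The true product 3 * 2**63 does not fit in 64 bits.
print(muldivu(2**63, 3, 4))            # 3 * 2**61
print(truncated_muldivu(2**63, 3, 4))  # 2**61, because the product wrapped
```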
    0101.1100
    0101.1101
    0101.1110
    0101.1111

Integer Width Adjustment Instructions

    0110.00yy  EXTU    d = s  overrides source reg domain
               TRUNCU  d = s  overrides source reg domain
    0110.01yy  EXTS    d = s  overrides source reg domain
               TRUNCS  d = s  overrides source reg domain
    0110.10yy  EXTU    d = s  overrides target reg domain
               TRUNCU  d = s  overrides target reg domain
    0110.11yy  EXTS    d = s  overrides target reg domain
               TRUNCS  d = s  overrides target reg domain

    For these instructions the main zz bits always control the memory EA (source or destination) and the yy bits control the register EA. For reg,reg operations, the yy bits control the source register.

RNG, CRC, and Encryption

    0111.0000  CRYPT
    0111.0001  DECRYPT
    0111.0010  DCACHE  clears context caches for Crypto & CRC
    0111.0011
    0111.0100  CRC
    0111.0101
    0111.0110  RAND    d = (random_number) & s
    0111.0111

Special Memory Unit Instructions

    1000.0000
    1000.0001
    1000.0010
    1000.0011
    1000.0100  VMOVE   d = s (invalid->overflow)
    1000.0101  FLUSH   mem  Synchronize specific cache line
    1000.0110  RAHEAD  mem  Request shared cache line
    1000.0111  WAHEAD  mem  Request exclusive cache line

    NOTE: VMOVE, RAHEAD, and WAHEAD are opportunistic and will not generate an exception if the MMU says the memory location is invalid. However, exceptions are still generated for misaligned requests.

    1000.1000
    1000.1001  LOCK      %rs,OFF(ea) (specify lock type in %rs)
    1000.1010  RAISEIPL  raise user interruption priority (or NOP)
    1000.1011  LOWERIPL  lower user interruption priority (or NOP)
    1000.1100  SETIPL    set user interrupt priority
    1000.1101  WEVENT    write event bit (V=nospace)
    1000.1110  REVENT    read event bit
    1000.1111  REVENTNB  read event bit (non-blocking)

    NOTE: IPL functions operate on sea and store the prior value of the UPL in dea.

Media Unit (Vector)

The vector unit supports 8, 16, 32, 64, 128, 256, and 512-bit register domains, and instruction extensions (v-bits).
    1001.0000
    1001.0001
    1001.0010
    1001.0011
    1001.0100
    1001.0101
    1001.0110
    1001.0111
    1001.1000
    1001.1001
    1001.1010
    1001.1011
    1001.1100
    1001.1101
    1001.1110
    1001.1111
    1010.0000
    1010.0001
    1010.0010
    1010.0011
    1010.0100
    1010.0101
    1010.0110
    1010.0111
    1010.1000
    1010.1001
    1010.1010
    1010.1011
    1010.1100
    1010.1101
    1010.1110
    1010.1111

Media Unit (Floating)

The floating unit supports 32 (float), 64 (double), and 128-bit (ldouble). When doing immediate comparisons, immediate FP values may be specified as 16, 32 or 64-bit floats and will be expanded as necessary, or the value 0.0 (no extension). If taken from the v-bits, the v-bits are converted from an integer to floating point. If comparing a constant in the 128-bit domain the constant must be loaded from memory first.

    1100.0000  FMOVE
    1100.0001  FADD
    1100.0010  FSUB
    1100.0011  FMUL
    1100.0100  FDIV
    1100.0101  FCMP
    1100.0110
    1100.0111
    1100.1000
    1100.1001
    1100.1010
    1100.1011
    1100.1100
    1100.1101
    1100.1110
    1100.1111
    1101.0000  ITOFS    integer -> fp (signed)
    1101.0001  ITOFU    integer -> fp (unsigned)
    1101.0010  FMLOADS  integer -> fp (signed)
    1101.0011  FMLOADU  integer -> fp (unsigned)
    1101.0100  FTOIS    fp -> integer (signed)
    1101.0101  FTOIU    fp -> integer (unsigned)
    1101.0110  FMSAVES  fp -> mant/exp (signed)
    1101.0111  FMSAVEU  fp -> mant/exp (unsigned)
    1101.1000  FBUILDS  mant/exp -> fp (signed)
    1101.1001  FBUILDU  mant/exp -> fp (unsigned)
    1101.1010
    1101.1011
    1101.1100
    1101.1101
    1101.1110
    1101.1111

System Instructions

    1110.0000  MOVEFRU  d = s  move from user space
    1110.0001  MOVETOU  d = s  move to user space
    1110.0010
    1110.0011
    1110.0100
    1110.0101
    1110.0110
    1110.0111
    1110.1000
    1110.1001
    1110.1010
    1110.1011
    1110.1100
    1110.1101
    1110.1110
    1110.1111

Three-operand locked bus cycle instructions XXX

    1111.0000  ADD   sea,%rg,dea  (fetch and add)
    1111.0001  SUB   sea,%rg,dea  (fetch and sub)
    1111.0010  AND   sea,%rg,dea  (fetch and and)
    1111.0011  OR    sea,%rg,dea  (fetch and or)
    1111.0100  BSET  sea,%rg,dea  (fetch, test and set)
    1111.0101  BCLR  sea,%rg,dea  (fetch, test and clear)
    1111.0110  CMPXCHG  sea,%rg,dea  (fetch, compare and exchange)
    1111.0111  SWAP     sea,%rg,dea  (swap)

Three-operand scaled instructions XXX

    1111.1000  MOVE  sea,dea
    1111.1001  ADD   sea,dea
    1111.1010  SUB   sea,dea
    1111.1011  CMP   sea,dea
    1111.1100  AND   sea,dea
    1111.1101  OR    sea,dea
    1111.1110  XOR   sea,dea
    1111.1111  LEA   sea,dea  (EA=110 only)

    NOTE: There are no FP exceptions, but the V (overflow) bit will be set if an operation fails. Failed operations set the destination to NaN, which can also be tested, and failures accumulate in the thread status bits, which can also be tested. XXX I might change how this works; it may be too time consuming to translate the thread status bit part.

Library Call and Return

    CALL/LCALL - Saves %ap and (return) %pc in the source frame.
                 Sets %ap = %fp
                 Sets %fp = new_frame

                 For a library call, also saves %db and replaces it with %pd, where %pd is the register from the OFF(%pd) of the LCALL argument. Thus %pd in the call is expected to be the new library base.

    RET/LRET   - Sets %fp = %ap
                 Restores %ap and %pc from the source frame.

                 For a library call, also restores %db from the source frame.

    RES/LRES   - These instructions create a copy of the current frame in a new thread and return to the caller in the old thread. In the new thread, %ap is generally set to %tp and a RET or LRET vectors to code which exits the thread. The original %ap frame might have returned, so it is not accessible in the new thread.

    NOTE: Register save areas are out of band (in an area not included in either the negative or positive frame space), and thus the final translation stage might optimize them out or require a different amount of space.

    In calls, the new frame is not necessarily a stack push relative to the old frame, and not necessarily at a lower memory address.

    For translated executables the negative and positive frame sizes and reserved call space are typically added together, and the translator adjusts the offsets appropriately.
    Also note that the translator may adjust the size of the negative frame space, causing all post-translated offsets to change.

    In a direct implementation of the Rune machine, the return %pc and %ap are saved, %fp is copied to %ap, and %fp remains unchanged. The final translation pass is expected to insert code to adjust %fp downward (if a stack model is being used) in the target function.

Object ABI

In the object form, the coder can assume that all appropriate frame and register handling abstractions will be performed. This includes:

* The frame pointer allocation abstraction. The final translation step handles saving, shifting, and restoring %ap, %fp, %db, and %pc. This includes the allocation of the new frame space, which is specified via special relocations in objects, libraries, and binaries.

* The register space abstraction. All procedure calls fully abstract the entire register space, so each procedure can assume that all registers are persistent. The final translation step handles all required save and restore operations.

  Register passing is fully supported via the Memory object cache abstraction by explicitly storing a register into the negative (cache) frame space in the call and explicitly loading the register from the negative (cache) argument space in the call target.

  Register return is fully supported via the same mechanism. The target stores a result in the negative (cache) argument space and the caller loads the result from that space immediately after the call returns.

  The final translation step can optimize these actions into reg-args and reg-return, completely optimizing out the memory operations while still maintaining the register abstraction.

* The Memory object cache abstraction. The negative and positive %ap-relative and %fp-relative space is treated differently. The base offset (not the post-relocation) determines which space is being used. The spaces are *NOT* contiguous.
  The positive offset space represents fixed-memory objects and the layout is exactly as you use the space.

  The negative offset space represents register-cacheable memory objects. Any intermediate assembly, linking, or translation step can modify the offsets, optimize them out, or even add new offsets. Each offset represents an explicit object which must be read and written using the exact same operation size. Any intermediate step can decide to cache objects in this space in registers and can also overload offsets in the space when it is legal to do so.

  The programmer can overload offsets in the space through legal means. Essentially, simple code analysis must show that all possible code paths result in deterministic behavior with regard to loads and stores to a particular negative offset. The offsets must also be legally aligned.

* Interfacing via Library call interface instead of system calls. Interfacing with the target machine is done via the library interface rather than with a system call trap. The final translation pass or interpreter will convert the library calls to the appropriate mechanism. This saves us a lot of grief.

* Exception handling via RAISE. Exception handling is designed to avoid excessively complex code paths in the object model. See the description later on in this file.

Condition Codes

In Rune object code, all instructions set specific condition codes, and any remaining codes are set to undefined. Condition codes are only good for the next sequential instruction and are not good across calls or returns (that is, the CALL and RET instructions set all condition codes to invalid). This greatly reduces optimization and translation complexities.

Library-relative memory

Library-relative memory typically holds library-wide variables, which are basically global variables only accessible via a particular library base pointer.
The positive offset space may be used to hold this data, and like all positive offset spaces the layout is under the control of the programmer.

The negative offset space for a library contains the library descriptors for its library call vectors, which is what the LCALL points to. These descriptors are defined elsewhere but are typically big enough to hold actual code if the library function is short enough, resulting in a direct call, and otherwise will hold a JMP instruction which points to the actual library function. Library descriptors may contain other information as well.

Thread-relative memory

The negative offset space for %tp-relative accesses contains thread hardware and other fields generally defined by the translator and accessed through relocations. The translator may perform other actions when accessing such fields as the fields might represent hardware functions and actual memory.

The positive offset space for %tp-relative accesses contains per-thread data generally reserved using a special section.

Threading and Process ABI

The Rune object and machine model supports fine-grained threading. These threads are NOT unix processes or unix pthreads. They are RUNE threads defined by the %tp (thread pointer register), creating a separate very light-weight executable context.

Implementations can choose to execute RUNE threads via Unix-style threads (e.g. pthreads or some other shared address space mechanism), can manually switch between threads through some other mechanism to limit the number of unix-style threads, or can use some combination of the above. It should be noted that Rune programs often assume a very large number of threads to be available at low cost (tens of thousands or hundreds of thousands).

All thread ABI calls except those defined by specific instructions are defined via the machine interface library, which uses the RUNE library call mechanism but might be translated to inline or other specialized code.
Per-thread ABI features include:

* A global free-running timer with a per-thread match/wait capability, generating an event.
* Yield.
* Interruption management primitives for inter-thread interruption (does not have to disable target machine interrupts).
* Per-thread event queue for inter-thread management.
* Exception management (in addition to RAISE).
* Thread stop/wait/exit/wakeup primitives.

Exception Handling

Rune provides conditional-capable RAISE and LRAISE instructions. The actual instruction code is only used for out-of-band procedural returns. In-band jumps generate Bcc or JMP instructions instead. RAS automatically adjusts the instruction as needed without the emitter having to specify a PC-relative address. However, the emitter can elect to supply such an address if it desires, which will cause a Bcc or JMP to be emitted. The difference is that RAS may also generate an additional catch table for the run-time.

[L]RAISE determines where to jump based on the current .catch context. .catch pseudo instructions tell the assembler which catch target label within the procedure is effective at any given point. If there is no .catch context an actual [L]RAISE instruction is emitted.

The actual instruction code causes a [L]RET to occur but then enters the run-time instead of continuing at the caller's next instruction. The run-time uses .catch tables to calculate where to jump to in the caller. If no .catch context is present it will chain another [L]RAISE.

.catch is a pseudo-op which generates no actual code. You specify the catch target with it. The assembler will generate a catch table for the run-time.

Optimization and Translation Pass Requirements

All instruction sequences in Rune object code must be analyzable for execution path determinism to allow any pass to optimize/overload register and cache memory object use.
As a simple example, this means that any overloading of cache memory objects already in place must pass this deterministic analysis with regard to:

(1) Loads and stores of compatible object sizes only, with size changes due to overloading bounded by a deterministic store.

(2) Cache memory objects must be stored deterministically before they can be loaded. You cannot load an uninitialized cache memory object.

(3) Simple but powerful topological code path analysis is employed. The object code must pass on a procedure-by-procedure basis.

(4) No jumps or branches from inside to outside the current procedure. Only CALL, LCALL, and various RET-related instructions may be used at the procedure's execution boundaries.

(5) Rune makes machine-specific optimizations via the library call interface. The translation phase is allowed to replace particular well-defined library calls with machine-specific code. For example, a matrix or vector manipulation call might be replaced with an SSE instruction. Library writers are expected to fully implement the feature as Rune code, and then allow translators to optimize the code out for machine targets which understand the particular feature. This works for direct library calls.

(6) Rune is generally allowed to inline direct call targets. That is, the Rune language will generate calls (not do the actual inlining), and the translation phase will handle the inlining. This works for direct calls and direct library calls.

(7) Unreachable code can be optimized out. Unreachable exception code through call topology analysis can also be optimized out.

Locked bus cycle instructions

Because architectures implement locked bus cycle instructions differently, Rune provides a set of locked bus-cycle instructions using instruction codes 1111.xxxx, including nominal ADD, SUB, bitwise operators, CMPXCHG, and SWAP. These instructions implement full memory barriers and execute atomically.
Generally speaking the operation will execute with a locked bus cycle and the original contents of the memory location will be stored in the third operand register.

Object Locking

Rune has a set of object locking instructions. These instructions exist because object locking is integral to the language and must be heavily optimized at multiple stages to produce a high-performance machine target.

These instructions specify the memory address of a Rune lock structure and perform all operations necessary on the lock and related thread context, including potentially blocking.

Generally speaking, locks are integrated into the thread context. The thread keeps track of held locks and whether it is in soft or hard locking mode. In soft mode held locks can be temporarily lost when the thread voluntarily blocks (but not if it yields or is switched involuntarily), and will be reacquired when it unblocks. In hard mode held locks cannot be lost.

Since many locks may be held, Rune implements a lock-stealing mechanism whereby a held lock in soft mode can be stolen while the thread is blocked, and a mechanism to detect if any locks or a particular lock has been stolen. Detection of stolen locks during blocking is handled by testing a special stolen-lock-counter field in the thread structure.
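The stolen-lock check described above amounts to comparing a counter before and after a voluntary block. The following Python sketch models only that bookkeeping; it is not the Rune runtime. The class, its fields, and the functions are all hypothetical names, and blocking is simulated by calling a function rather than by real scheduling.

```python
# Sketch only: a toy model of soft/hard mode and the stolen-lock counter.
# Every name here is hypothetical; real blocking and reacquisition are omitted.

class RuneThread:
    def __init__(self, hard_mode=False):
        self.hard_mode = hard_mode       # hard mode: held locks cannot be lost
        self.held = set()                # locks currently held by this thread
        self.stolen_lock_counter = 0     # bumped each time a lock is stolen


def try_steal(owner, lock):
    """Steal `lock` from a blocked owner; only possible in soft mode."""
    if owner.hard_mode or lock not in owner.held:
        return False
    owner.held.discard(lock)
    owner.stolen_lock_counter += 1
    return True


def block(thread, while_blocked):
    """Stand-in for a voluntary block; reports whether any lock was stolen."""
    before = thread.stolen_lock_counter
    while_blocked()                      # other threads run here
    return thread.stolen_lock_counter != before


soft = RuneThread()
soft.held.add("lock_a")
print(block(soft, lambda: try_steal(soft, "lock_a")))   # a lock was stolen

hard = RuneThread(hard_mode=True)
hard.held.add("lock_a")
print(block(hard, lambda: try_steal(hard, "lock_a")))   # nothing can be stolen
```

A thread that sees the counter change would then need to revalidate whatever state the lost lock protected; how a particular lock is identified as stolen is not modeled here.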