2 Rune Object and Machine Model Instruction Set
4 Please refer to objmodel.txt for the reasoning behind the formulation
5 of the instruction set.
7 Instruction codes are one or two 16-bit words, plus an optional extension.
8 There are many 16-bit forms and one 32-bit form. Most forms may have
9 extension words representing 16, 32, or 64 bits of data, or no extension
10 words at all (representing 0).
12 32 and 64-bit extension words are 16-bit aligned in the instruction flow
13 but loaded at the bit locations of their natural alignment, which allows
14 processing code to avoid any shifts. For example, a 64-bit extension
15 misaligned by 32 bits can be loaded by taking the first aligned 64-bit
16 word masked against the low 32 bits and ORing that with the second aligned
17 64-bit word masked against the high 32 bits.
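
The mask-and-OR load described above can be sketched as follows. This is
only an illustration in Python: the function name and the sample words are
hypothetical, and the two inputs are assumed to be the already-fetched
aligned 64-bit words.

```python
LOW32 = 0x00000000_FFFFFFFF
HIGH32 = 0xFFFFFFFF_00000000

def load_ext64_misaligned32(first_word: int, second_word: int) -> int:
    """Combine two aligned 64-bit words into one 64-bit extension value.

    The extension straddles the two words but its halves already sit at
    their natural bit positions, so no shift is needed.
    """
    return (first_word & LOW32) | (second_word & HIGH32)

# Hypothetical sample data: the A/B nibbles are unrelated instruction bits.
first = 0xAAAAAAAA_11223344
second = 0x55667788_BBBBBBBB
value = load_ext64_misaligned32(first, second)
```

Only the masks differ between alignments; the combining step stays a
single OR.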
19 Most instruction forms combine the register number specification with
20 implied type and the operation size, supporting 16 registers x 8 domains
21 for 128 total registers. The nominal register save context is around
22 1080 bytes. A light-weight thread context inclusive of the register
23 save context is 2048 bytes and can also include part of the thread's
26 Generally speaking, all quick, immediate and offset values are
27 sign-extended. This is very important for instruction layout compression
28 when loading 32 or 64-bit registers and dealing with the 16, 32 or 64-bit
29 pointer register space. Built-in IMMQ4 fields range from -8 to +8 and
30 do not include 0 (bitcode 0000 = +8). Remember that %r0 always represents
31 0 and can also be used to form absolute addresses if desired.
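
A minimal sketch of IMMQ4 decoding, written in Python for illustration.
It assumes the non-zero bitcodes are ordinary 4-bit two's complement
values, which is consistent with the sign-extension rule above but is not
spelled out in the text; decode_immq4 is a hypothetical name.

```python
def decode_immq4(bits: int) -> int:
    """Decode a 4-bit quick field: -8..-1 and +1..+8, never 0."""
    if not 0 <= bits <= 0xF:
        raise ValueError("IMMQ4 field must be a 4-bit value")
    if bits == 0:
        return 8                  # bitcode 0000 stands for +8
    # Assumption: remaining codes are sign-extended 4-bit two's complement.
    return bits - 16 if bits & 0x8 else bits

all_values = sorted(decode_immq4(code) for code in range(16))
```

Giving up the zero encoding costs nothing because %r0 already supplies a
zero operand.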
33 Most long-form instructions have significant flexibility in the
34 interpretation of the built-in vvvvvvvv bits, including using the bits
35 to avoid needing additional extension words. This flexibility also
36 includes a three-operand mode and a scaled three-operand mode.
38 Condition status bits are only valid for the immediately following
39 instruction. You can think of it this way: All instructions always set
40 all condition bits. This is a huge translation and optimization aid.
42 000iiizz ssssdddd insn8 %rs,%rd
43 0010iizz ssssdddd insn4 %rs,%pd
44 0011iizz vvvvdddd insn4 $IMMQ4,%pd
45 010iiizz vvvvdddd insn8 $IMMQ4,%rd
46 011iiizz vvvvdddd OFF16 insn8 $IMMQ4,OFF16(%pd)
48 100iiizz ssssdddd OFF16 insn8 %rs,OFF16(%pd)
49 101iiizz ssssdddd OFF16 insn8 OFF16(%ps),%rd
50 1100iizz ssssdddd OFF16 insn4 %ps,OFF16(%pd)
51 1101iizz ssssdddd OFF16 insn4 OFF16(%ps),%pd
53 1110xx00 iiiicccc EXT* implA OFF(%pc) (EA w/cond)
54 1110xx01 iiiidddd EXT* implB OFF(%pd) (EA or mem)
55 1110xx10 iiiidddd EXT* implB *OFF(%pd) (ind)
56 1110xx11 ssssdddd EXT* LEA OFF(%ps),%pd (EA)
58 1111xxee ezzzdddd ssssvvvv iiiiiiii EXT* insn8 sea,dea
59 1111xxee ezzzdddd vvvvvvvv iiiiiiii EXT* insn8 sea,dea (eee=001,010)
61 1111xx11 0zzzdddd ssssvvvv ggggiiii EXT* insnx16 mem,%rg,%rd (eee=110)
62 1111xx11 0zzzdddd ssssvvvv gggg1111 EXT* LEA mem,%rg,%rd (eee=110)
63 1111xx11 1zzzdddd ssssvvvv ggggiiii EXT* insnx16 %rs,%rg,mem (eee=111)
64 1111xx11 1zzzdddd ssssvvvv gggg1111 EXT* (reserved)
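
As a rough illustration of how a decoder might pick the instruction form
from the table above, the following Python sketch classifies the first
byte of an instruction. It assumes the leftmost column of the table is the
most significant byte of the first 16-bit word; classify_form and the
table names are hypothetical.

```python
# Forms selected by the top three bits of the first byte.
FORMS_BY_TOP3 = {
    0b000: "insn8 %rs,%rd",
    0b010: "insn8 $IMMQ4,%rd",
    0b011: "insn8 $IMMQ4,OFF16(%pd)",
    0b100: "insn8 %rs,OFF16(%pd)",
    0b101: "insn8 OFF16(%ps),%rd",
}
# Forms that need the top four bits to disambiguate.
FORMS_BY_TOP4 = {
    0b0010: "insn4 %rs,%pd",
    0b0011: "insn4 $IMMQ4,%pd",
    0b1100: "insn4 %ps,OFF16(%pd)",
    0b1101: "insn4 OFF16(%ps),%pd",
    0b1111: "32-bit long form",
}
# 1110xxNN: the low two bits select the implied form.
IMPLIED_FORMS = (
    "implA OFF(%pc)",
    "implB OFF(%pd)",
    "implB *OFF(%pd)",
    "LEA OFF(%ps),%pd",
)

def classify_form(first_byte: int) -> str:
    """Return the instruction form named in the table for this first byte."""
    top4 = (first_byte >> 4) & 0xF
    if top4 == 0b1110:
        return IMPLIED_FORMS[first_byte & 0b11]
    if top4 in FORMS_BY_TOP4:
        return FORMS_BY_TOP4[top4]
    return FORMS_BY_TOP3[top4 >> 1]
```

For example, classify_form(0b11100011) selects the LEA form and
classify_form(0b11110000) selects the 32-bit long form.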
66 zzz: Memory operation size and register domain
68 000 8-bit integer domain
69 001 16-bit integer domain
70 010 32-bit integer domain
71 011 64-bit integer domain
72 100 pointer domain (16, 32, or 64 machine bits)
73 101 128-bit media domain
74 110 256-bit media domain
75 111 512-bit media domain
77 NOTE: The pointer domain registers are meant to be directly
78 translatable to target architecture machine pointer registers
79 and can be 16, 32, or 64 bits. For compatibility, the object
80 model has its own understanding of the pointer width which can
81 be different. Code generators must not make assumptions as to
82 the actual width of registers in the PTR domain (i.e. should not
83 use PTR domain registers to store integer values).
85 NOTE: When a register is used indirectly, the PTR domain is implied
86 regardless of the zzz bits. Some instruction forms also imply
87 the PTR domain for source or destination.
89 NOTE: Only the first four domains are available in 16-bit insn formats,
90 but there are special 16-bit %ps and %pd register direct
96 Register 0 in all domains always reads 0 and is a sink-null on write.
98 The top 5 registers in the pointer domain have a special meaning and
99 cannot be directly written to in usermode. This leaves 11 general
100 pointer registers available for code backends.
102 %db (%p11) Data & library base pointer (e.g. per-library)
103 %tp (%p12) Thread pointer
104 %ap (%p13) Argument pointer (caller frame)
105 %fp (%p14) Frame pointer
106 %pc (%p15) Program counter
110 (only applicable to related implied instruction)
114 0010 C BCS, BLO a < b (unsigned)
115 BVSU overflow (unsigned)
116 0011 Z | C BLS a <= b (unsigned)
117 0100 N BMI, BLT a < b (signed)
119 0110 N | Z BLE a <= b (signed)
120 0111 BVS overflow (signed)
124 1010 ~C BCC, BHS a >= b (unsigned)
125 BVCU no-overflow (unsigned)
126 1011 ~Z & ~C BHI a > b (unsigned)
127 1100 ~N BPL, BGE a >= b (signed)
129 1110 ~N & ~Z BGT a > b (signed)
130 1111 BVC no-overflow (signed)
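
The flag expressions listed above can be evaluated mechanically. The
Python sketch below covers only the codes whose flag expression appears in
the table (the overflow codes and the codes not shown are left out);
CONDITIONS and condition_holds are hypothetical names.

```python
# cccc -> predicate over the N, Z and C status bits, as listed in the table.
CONDITIONS = {
    0b0010: lambda n, z, c: c,                    # BCS, BLO
    0b0011: lambda n, z, c: z or c,               # BLS
    0b0100: lambda n, z, c: n,                    # BMI, BLT
    0b0110: lambda n, z, c: n or z,               # BLE
    0b1010: lambda n, z, c: not c,                # BCC, BHS
    0b1011: lambda n, z, c: not z and not c,      # BHI
    0b1100: lambda n, z, c: not n,                # BPL, BGE
    0b1110: lambda n, z, c: not n and not z,      # BGT
}

def condition_holds(cccc: int, n: bool, z: bool, c: bool) -> bool:
    """Evaluate one of the listed condition codes against the status bits."""
    return bool(CONDITIONS[cccc](n, z, c))
```

In the listed codes, setting the top bit of cccc yields the logical
complement of the condition, so a translator can invert a branch by
flipping one bit.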
132 vvvv: Immediate or Offset Quick value. Prescales by operation
133 size if used as an offset.
135 -8 to +8 (code 0000= +8)
137 NOTE: insnq $0,dea is automatically converted to insnq %r0,dea
139 vvvvvvvv: Immediate or Offset Quick value, signed 2's complement
140 (-128 to +127, 0 inclusive). Does NOT prescale by the
141 operation size either way, allowing an 8-bit relocation to
142 control the field if desired. Format used for immediate
145 xx: Encode extension word size
146 00 No extension word (IMM or OFF value from 'v' bits or imply 0)
147 01 16-bit extension word: IMM16 or OFF16
148 10 32-bit extension word: IMM32 or OFF32
149 11 64-bit extension word: IMM64 or OFF64
151 When 00 is specified there is no extension word and the IMM or OFF
152 value either takes on the value of the quick bits (v*), or takes on
153 the value of 0 if the quick bits are already being used for the other
154 operand. The extension word can be specified to be larger than the
155 operation size but will be truncated if used as an IMM constant or if
156 larger than the machine pointer width and used as an OFF.
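
A small Python sketch of how an extension word might be turned into an
immediate: the word is sign-extended from its encoded size and then
truncated to the operation size. The helper names are hypothetical, and
the quick-bit case for xx=00 is not modeled (it simply yields 0 here).

```python
EXT_BITS = {0b00: 0, 0b01: 16, 0b10: 32, 0b11: 64}

def sign_extend(raw: int, bits: int) -> int:
    """Interpret the low `bits` bits of raw as a two's complement value."""
    sign = 1 << (bits - 1)
    raw &= (1 << bits) - 1
    return (raw ^ sign) - sign

def immediate_value(xx: int, raw_ext: int, op_bits: int) -> int:
    """Return the IMM constant as an unsigned bit pattern of op_bits bits."""
    ext_bits = EXT_BITS[xx]
    if ext_bits == 0:
        return 0                                  # no extension word present
    value = sign_extend(raw_ext, ext_bits)
    return value & ((1 << op_bits) - 1)           # truncate to operation size
```

With these helpers a 16-bit extension of 0xFFFF loads a 32-bit register
with 0xFFFFFFFF, while a 32-bit extension of 0x12345678 used with a 16-bit
operation is truncated to 0x5678.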
158 16, 32, and 64-bit extension words are sign-extended 2s complement,
161 eee[.xx]: Effective Address Mode (32-bit insn only)
163 EA modes not involving an immediate or offset value may use the xx
164 bits for other purposes.
172 insn $IMMQ8,%rd (xx=00)
174 010 insn $IMMQ8,OFF(%pd)
175 insn $IMMQ8,(%pd) (xx=00)
177 011 insn $IMM,OFFQ8(%pd)
178 insn $0,OFFQ8(%pd) (xx=00)
180 100 insn OFF(%ps),%rd
181 insn OFFQ4*zzz(%ps),%rd (xx=00)
183 101 insn %rs,OFF(%pd)
184 insn %rs,OFFQ4*zzz(%pd) (xx=00)
186 110 insnx16 OFF(%ps + %pg*IMMQ4U),%rd
187 insnx16 (%ps + %pg*IMMQ4U),%rd (xx=00)
188 insnx16 OFF(%ps),%rg,%rd (iiii < 1000)
190 Special scaled or three-register memory mode. Maps to
191 instructions 1111iiii. 11111111 is LEA.
193 IMMQ4U ranges from 1..16 (0000=16)
195 111 insnx16 %rs,OFF(%pd + %pg*IMMQ4U)
196 insnx16 %rs,(%pd + %pg*IMMQ4U) (xx=00)
197 insnx16 %rs,%rg,OFF(%pd) (iiii < 1000)
199 Special scaled or three-register memory mode. Maps to
200 instructions 1111iiii. 11111111 is reserved.
202 IMMQ4U ranges from 1..16 (0000=16)
204 NOTE: Any unused fields must be coded as 0. Specifically, the vvvv
205 bits for EA=001,100,101 when an extension is specified (xx!=00).
207 NOTE: Absolute addressing is supported by using %p0 as the indirect
208 register. ABS mode ea is converted to OFF(%p0).
210 NOTE: If xx=00 the quick bits typically implement either OFFQ4 or
211 IMMQ4. Immediate mode instructions (EA=001,010,011) steal the
212 source register bits to feature OFFQ8 and IMMQ8. In situations
213 where both an IMM and an OFF element are present, the quick bits
214 always implement one and the extension bits always implement
215 the other, implying a value of 0 if xx=00 (EA=010,011).
217 NOTE: Memory operations are always read, write, or read-modify-write.
218 No instruction other than change-of-control insns operates on
219 more than one memory location.
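
The scaled modes (EA=110 and EA=111) compute an address of the form
OFF + base + index*IMMQ4U. A minimal Python sketch follows; the function
names are hypothetical, and wrapping the result to the pointer width is an
assumption rather than something stated above.

```python
def decode_immq4u(bits: int) -> int:
    """Unsigned quick scale: 1..16, with bitcode 0000 meaning 16."""
    if not 0 <= bits <= 0xF:
        raise ValueError("IMMQ4U field must be a 4-bit value")
    return 16 if bits == 0 else bits

def scaled_ea(off: int, base: int, index: int, scale_bits: int,
              ptr_bits: int = 64) -> int:
    """Effective address for OFF(%ps + %pg*IMMQ4U)."""
    mask = (1 << ptr_bits) - 1                    # assumed pointer-width wrap
    return (off + base + index * decode_immq4u(scale_bits)) & mask

# Absolute addressing falls out by using %p0 (always 0) as the base.
absolute = scaled_ea(0x2000, 0, 0, 1)
```

For example, scaled_ea(8, 0x1000, 3, 4) addresses element 3 of an array of
4-byte items that starts 8 bytes past the base.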
223 Implied Instructions A (conditional)
225 The implA instruction form encodes 16 instructions with an optional
226 PC-relative effective-address and condition. Not all insn combinations
227 are legal. xx must be 00 for instructions which do not use the PC-rel
231 xx.0001 CALLcc OFF(%pc)
232 xx.0010 (reserved) OFF(%pc)
233 xx.0011 (reserved) OFF(%pc)
249 Implied Instructions B (PTR source or target)
251 The implB instruction form encodes 16 instructions with an optional
252 register-relative address, either an effective-address or a memory
253 load obtaining the address.
255 Note that these instructions have side effects and may use other
256 registers; see the additional information later in this document.
258 xx.0000 JMP [*]OFF(%pd)
259 xx.0001 CALL [*]OFF(%pd)
260 xx.0010 LCALL [*]OFF(%pd)
261 xx.0011 TRAP [*]OFF(%pd)
267 Change of control and fences. Note that an absolute address or value
268 may be specified using %p0, an effective address using normal
269 addressing, or a load from memory can supply the address.
271 TRAPs supply the address of the trap vector entry.
273 xx.1000 SLOCK [*]OFF(%pd)
274 xx.1001 SLOCKNB [*]OFF(%pd)
275 xx.1010 XLOCK [*]OFF(%pd)
276 xx.1011 XLOCKNB [*]OFF(%pd)
277 xx.1100 UNLOCK [*]OFF(%pd)
282 Object locking instructions are so integral to Rune they
283 need their own instructions in order to allow translators
284 to heavily optimize their operations. Note that hard vs soft
285 locking is handled by the threading runtime.
288 Core ALU Instructions
290 NOTE: insn4 reflects the first 4 instructions (2 bits)
291 insn8 reflects the first 8 instructions (3 bits)
292 insn16 reflects the first 16 instructions (4 bits)
293 implA (see Implied instructions above)
294 implB (see Implied instructions above)
298 0000.0001 ADD d = d + s
299 0000.0010 CMP (d - s) -> condition
300 0000.0011 SUB d = d - s
302 0000.0100 OR d = d | s
303 0000.0101 AND d = d & s
304 0000.0110 XOR d = d ^ s
305 0000.0111 TEST (d & s) -> condition
307 0000.1000 ASL d = d << s
309 0000.1010 ASR d = d >> s (signed)
310 0000.1011 LSR d = d >> s (unsigned)
311 0000.1100 BSET d = d | (1 << s)
312 0000.1101 BCLR d = d & ~(1 << s)
313 0000.1110 BCOM d = d ^ (1 << s)
314 0000.1111 BTST (d & (1 << s)) -> condition
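
For reference, the shift and bit-modify formulas above behave as follows
on a fixed-width register. This Python sketch is illustrative only; the
helper names and the explicit width parameter are assumptions, and
condition-code updates are not modeled.

```python
def _mask(bits: int) -> int:
    return (1 << bits) - 1

def bset(d: int, s: int, bits: int) -> int:
    return (d | (1 << s)) & _mask(bits)

def bclr(d: int, s: int, bits: int) -> int:
    return d & ~(1 << s) & _mask(bits)

def bcom(d: int, s: int, bits: int) -> int:
    return (d ^ (1 << s)) & _mask(bits)

def lsr(d: int, s: int, bits: int) -> int:
    """Unsigned shift right: zeros enter from the top."""
    return (d & _mask(bits)) >> s

def asr(d: int, s: int, bits: int) -> int:
    """Signed shift right: the sign bit is replicated."""
    d &= _mask(bits)
    signed = d - (1 << bits) if d >> (bits - 1) else d
    return (signed >> s) & _mask(bits)
```

The difference between the two right shifts shows up on negative values:
with an 8-bit register, lsr(0x80, 1, 8) gives 0x40 while asr(0x80, 1, 8)
gives 0xC0.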
316 Conditional Long Form Instructions
319 0010.cccc SETcc d = s iff cond else d = 0
320 0011.cccc MOVEcc d = s iff cond
322 Special Fast Integer and Mask instructions
326 0100.0010 BMASK d = (1 << s) - 1
327 0100.0011 BIT d = 1 << s
328 0100.0100 ROL d = d <<< s (lsb->carry after operation)
329 0100.0101 ROR d = d >>> s (msb->carry after operation)
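
A Python sketch of the mask and rotate formulas above, for illustration.
The helper names are hypothetical; reducing the rotate count modulo the
register width is an assumption, and the carry is taken from the result as
the table describes (lsb after ROL, msb after ROR).

```python
def bmask(s: int, bits: int) -> int:
    """d = (1 << s) - 1, truncated to the register width."""
    return ((1 << s) - 1) & ((1 << bits) - 1)

def bit(s: int, bits: int) -> int:
    """d = 1 << s, truncated to the register width."""
    return (1 << s) & ((1 << bits) - 1)

def rol(d: int, s: int, bits: int):
    """Rotate left; returns (result, carry) with carry = lsb of result."""
    mask = (1 << bits) - 1
    s %= bits                                     # assumed count reduction
    d &= mask
    result = ((d << s) | (d >> (bits - s))) & mask
    return result, result & 1

def ror(d: int, s: int, bits: int):
    """Rotate right; returns (result, carry) with carry = msb of result."""
    mask = (1 << bits) - 1
    s %= bits                                     # assumed count reduction
    d &= mask
    result = ((d >> s) | (d << (bits - s))) & mask
    return result, result >> (bits - 1)
```

For an 8-bit register, rol(0x81, 1, 8) yields (0x03, 1) and
ror(0x01, 1, 8) yields (0x80, 1).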
333 Endian Conversion Instructions
335 0100.1000 SWAP8 swap 8-bit pairs
336 0100.1001 SWAP16 swap 16-bit pairs
337 0100.1010 SWAP32 swap 32-bit pairs
338 0100.1011 ESWAP swap endian
344 Integer Multiply/Divide Instructions
346 0101.0000 MULU d = d * s (unsigned)
347 0101.0001 MULS d = d * s
348 0101.0010 DIVU d = d / s (unsigned)
349 0101.0011 DIVS d = d / s
350 0101.0100 MODU d = d % s (unsigned)
351 0101.0101 MODS d = d % s
352 0101.0110 MULDIVU d = d * s / g (unsigned)
353 0101.0111 MULDIVS d = d * s / g
355 NOTE: MULDIV* is only useful as a 3-operand instruction. The
356 intermediate result is double-wide, so it can be used for
357 scaling a value. Register specification is %ra:%rd,dea.
359 MULDIV* internally implements a double-operation-size-wide
360 intermediate value and can thus be used to fractionally scale
361 a 64-bit quantity in the integer domain if desired.
363 NOTE: If both division and modulo are desired, both instructions must
364 be executed, but the hardware and/or translator may be able
365 to optimize that back into one native instruction.
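
The benefit of the double-wide intermediate in MULDIV* can be shown with a
short Python sketch. The function names are hypothetical, only the
unsigned variant is modeled, and the behaviour for a zero divisor is not
stated in the text, so the sketch simply raises an error.

```python
def muldivu(d: int, s: int, g: int, bits: int = 64) -> int:
    """d = d * s / g with a 2*bits-wide intermediate product (unsigned)."""
    mask = (1 << bits) - 1
    d, s, g = d & mask, s & mask, g & mask
    if g == 0:
        raise ZeroDivisionError("divisor is zero")  # behaviour assumed
    wide = d * s                                    # not truncated to bits
    return (wide // g) & mask

def mul_then_div(d: int, s: int, g: int, bits: int = 64) -> int:
    """Separate MULU and DIVU: the product is truncated before dividing."""
    mask = (1 << bits) - 1
    return (((d & mask) * (s & mask)) & mask) // (g & mask)

scaled = muldivu(1 << 63, 6, 4)      # scale 2^63 by 6/4
lossy = mul_then_div(1 << 63, 6, 4)  # the product overflows 64 bits first
```

Here scaled is 0xC000000000000000, while the two-instruction sequence
loses the high bits of the product and produces 0.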
376 Integer Width Adjustment Instructions
378 0110.00yy EXTU d = s overrides source reg domain
379 TRUNCU d = s overrides source reg domain
380 0110.01yy EXTS d = s overrides source reg domain
381 TRUNCS d = s overrides source reg domain
382 0110.10yy EXTU d = s overrides target reg domain
383 TRUNCU d = s overrides target reg domain
384 0110.11yy EXTS d = s overrides target reg domain
385 TRUNCS d = s overrides target reg domain
387 For these instructions the main zz bits always control
388 the destination and the yy bits always control the source.
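
The width adjustments can be sketched in Python as plain bit operations.
The helper names are hypothetical, source and destination widths are
passed explicitly in place of the yy and zz fields, and any difference
between TRUNCU and TRUNCS beyond dropping high bits is not modeled.

```python
def extu(value: int, src_bits: int, dst_bits: int) -> int:
    """Zero-extend a src_bits value into a dst_bits register."""
    return value & ((1 << src_bits) - 1) & ((1 << dst_bits) - 1)

def exts(value: int, src_bits: int, dst_bits: int) -> int:
    """Sign-extend a src_bits value into a dst_bits register."""
    value &= (1 << src_bits) - 1
    sign = 1 << (src_bits - 1)
    return ((value ^ sign) - sign) & ((1 << dst_bits) - 1)

def trunc(value: int, dst_bits: int) -> int:
    """Keep only the low dst_bits bits."""
    return value & ((1 << dst_bits) - 1)
```

For example, the 8-bit value 0xFF becomes 0x000000FF under EXTU and
0xFFFFFFFF under EXTS when widened to 32 bits.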
390 RNG, CRC, and Encryption
394 0111.0010 DCACHE clears context caches for Crypto & CRC
398 0111.0110 RAND d = (random_number) & s
401 Special Memory Unit Instructions
408 1000.0100 VMOVE d = s (invalid->overflow)
409 1000.0101 FLUSH mem Synchronize specific cache line
410 1000.0110 RAHEAD mem Request shared cache line
411 1000.0111 WAHEAD mem Request exclusive cache line
413 NOTE: VMOVE, RAHEAD, and WAHEAD are opportunistic and will not
414 generate an exception if the MMU says the memory location
415 is invalid. However, exceptions are still generated for
419 1000.1001 LOCK %rs,OFF(ea) (specify lock type in %rs)
420 1000.1010 RAISEIPL raise user interruption priority (or NOP)
421 1000.1011 LOWERIPL lower user interruption priority (or NOP)
422 1000.1100 SETIPL set user interrupt priority
423 1000.1101 WEVENT write event bit (V=nospace)
424 1000.1110 REVENT read event bit
425 1000.1111 REVENTNB read event bit (non-blocking)
427 NOTE: IPL functions operate on sea and store the prior value of
432 The vector unit supports 8, 16, 32, 64, 128, 256, and 512-bit
433 register domains, and instruction extensions (v-bits).
472 Media Unit (Floating)
474 The floating unit supports 32 (float), 64 (double), and
475 128-bit (ldouble) domains, and instruction extensions (v-bits).
494 1101.0000 ITOFS integer -> fp (signed)
495 1101.0001 ITOFU integer -> fp (unsigned)
496 1101.0010 FMLOADS integer -> fp (signed)
497 1101.0011 FMLOADU integer -> fp (unsigned)
498 1101.0100 FTOIS fp -> integer (signed)
499 1101.0101 FTOIU fp -> integer (unsigned)
500 1101.0110 FMSAVES fp -> mant/exp (signed)
501 1101.0111 FMSAVEU fp -> mant/exp (unsigned)
502 1101.1000 FBUILDS mant/exp -> fp (signed)
503 1101.1001 FBUILDU mant/exp -> fp (unsigned)
513 1110.0000 MOVEFRU d = s move from user space
514 1110.0001 MOVETOU d = s move to user space
530 Three-operand locked bus cycle instructions
532 1111.0000 ADD sea,%rg,dea (fetch and add)
533 1111.0001 SUB sea,%rg,dea (fetch and sub)
534 1111.0010 AND sea,%rg,dea (fetch and and)
535 1111.0011 OR sea,%rg,dea (fetch and or)
536 1111.0100 BSET sea,%rg,dea (fetch, test and set)
537 1111.0101 BCLR sea,%rg,dea (fetch, test and clear)
538 1111.0110 CMPXCHG sea,%rg,dea (fetch, compare and exchange)
539 1111.0111 SWAP sea,%rg,dea (swap)
541 Three-operand scaled instructions
543 1111.1000 MOVE sea,dea
544 1111.1001 ADD sea,dea
545 1111.1010 SUB sea,dea
546 1111.1011 CMP sea,dea
547 1111.1100 AND sea,dea
549 1111.1110 XOR sea,dea
550 1111.1111 LEA sea,dea (EA=110 only)
552 NOTE: There are no FP exceptions, but the V (overflow) bit will be
553 set if an operation fails. Failed operations set the destination
554 to NaN which can also be tested, and failures accumulate in the
555 thread status bits which can also be tested. XXX I might change
556 how this works, it may be too time consuming to translate the
557 thread status bit part.
559 Library Call and Return
561 CALL/LCALL - Saves %ap, and (return) %pc in the source frame.
565 For a library call, also saves %db and replaces it with
566 %pd, where %pd is the register from the OFF(%pd) of the
567 LCALL argument. Thus %pd in the call is expected to be
568 the new library base.
570 RET/LRET - Sets %fp = %ap
571 Restores %ap and %pc from the source frame.
573 For a library call, also restores %db from the source frame.
575 RES/LRES - These instructions create a copy of the current frame
576 in a new thread and return to the caller in the old thread.
578 In the new thread, %ap is generally set to %tp and a
579 RET or LRET vectors to code which exits the thread.
580 The original %ap frame might have returned so it is not
581 accessible in the new thread.
583 NOTE: Register save areas are out of band (in an area not included in
584 either the negative or positive frame space), and thus the final
585 translation stage might optimize them out or require a different
588 In calls, the new frame is not necessarily a stack push relative to the
589 old frame, and not necessarily at a lower memory address. For translated
590 executables the negative and positive frame sizes and reserved call
591 space are typically added together, and the translator adjusts the
592 offsets appropriately. Also note that the translator may adjust the
593 size of the negative frame space, causing all post-translated offsets
596 In a direct implementation of the Rune machine, the return %pc and %ap
597 are saved, %fp is copied to %ap, and %fp remains unchanged. The final
598 translation pass is expected to insert code to adjust %fp downward (if a
599 stack model is being used) in the target function.
603 In the object form, the coder can assume that all appropriate frame
604 and register handling abstractions will be performed. This includes:
606 * The frame pointer allocation abstraction.
608 The final translation step handles saving, shifting, and restoring
609 %ap, %fp, %db, and %pc. This includes the allocation of the new frame
610 space which is specified via special relocations in objects, libraries,
613 * The register space abstraction.
615 All procedure calls fully abstract the entire register space, so each
616 procedure can assume that all registers are persistent. The final
617 translation step handles all required save and restore operations.
619 Register passing is fully supported via the Memory object cache
620 abstraction by explicitly storing a register into the negative (cache)
621 frame space in the call and explicitly loading the register from the
622 negative (cache) argument space in the call target.
624 Register return is fully supported via the same mechanism. The target
625 stores a result in the negative (cache) argument space and the caller
626 loads the result from that space immediately after the call returns.
628 The final translation step can optimize these actions into reg-args
629 and reg-return, completely optimizing out the memory operations while
630 still maintaining the register abstraction.
632 * The Memory object cache abstraction.
634 The negative and positive %ap-relative and %fp-relative space is treated
635 differently. The base offset (not the post-relocation) determines which
636 space is being used. The spaces are *NOT* contiguous.
638 The positive offset space represents fixed-memory objects and the layout
639 is exactly as you use the space.
641 The negative offset space represents register-cacheable memory objects.
642 Any intermediate assembly, linking, or translation step can modify
643 the offsets, optimize them out, or even add new offsets. Each offset
644 represents an explicit object which must be read and written using
645 the exact same operation size. Any intermediate step can decide to
646 cache objects in this space in registers and can also overload offsets
647 in the space when it is legal to do so.
649 The programmer can overload offsets in the space through legal means.
650 Essentially, under simple code analysis, all possible code paths must
651 result in deterministic behavior with regards to loads and stores to
652 a particular negative offset. The offsets must also be legally aligned.
654 * Interfacing via Library call interface instead of system calls.
656 Interfacing with the target machine is done via the library interface
657 rather than with a system call trap. The final translation pass or
658 interpreter will convert the library calls to the appropriate mechanism.
659 This saves us a lot of grief.
661 * Exception handling via RAISE.
663 Exception handling is designed to avoid excessively complex code paths
664 in the object model. See the description later on in this file.
668 In Rune object code, all instructions set specific condition codes and
669 any remaining codes are set to undefined. Condition codes are only good
670 for the next sequential instruction and are not good across calls or
671 returns (that is, the CALL and RET instructions set all condition codes
674 This greatly reduces optimization and translation complexities.
676 Library-relative memory
678 Library relative memory typically holds library-wide variables, which
679 are basically global variables only accessible via a particular library
682 The positive offset space may be used to hold this data, and like all
683 positive offset spaces the layout is under the control of the programmer.
685 The negative offset space for a library contains the library descriptors
686 for its library call vectors, which is what the LCALL points to. These
687 descriptors are defined elsewhere but are typically big enough to hold
688 actual code if the library function is short enough, resulting in a
689 direct call, and otherwise will hold a JMP instruction which points to
690 the actual library function. Library descriptors may contain other
693 Thread-relative memory
695 The negative offset space for %tp-relative accesses contains thread
696 hardware and other fields generally defined by the translator and
697 accessed through relocations. The translator may perform other actions
698 when accessing such fields as the fields might represent hardware
699 functions and actual memory.
701 The positive offset space for %tp-relative accesses contains per-thread
702 data generally reserved using a special section.
704 Threading and Process ABI
706 The Rune object and machine model supports fine-grained threading. These
707 threads are NOT unix processes or unix pthreads. They are RUNE threads
708 defined by the %tp (thread pointer register), creating a separate very
709 light-weight executable context.
711 Implementations can choose to execute RUNE threads via Unix style threads
712 (e.g. pthreads or some other shared address space mechanism), can
713 manually switch between threads through some other mechanism to limit
714 the unix-style threads, or can use some combination of the above.
716 It should be noted that Rune programs often assume a very large number
717 of threads to be available at low cost (tens of thousands or hundreds
720 All thread ABI calls except those defined by specific instructions are
721 defined via the machine interface library, which uses the RUNE library
722 call mechanism but might be translated to inline or other specialized
725 Per-thread ABI features include:
727 * A global free-running timer with a per-thread match/wait
728 capability, generating an event.
732 * Interruption management primitives for inter-thread interruption
733 (does not have to disable target machine interrupts).
735 * Per-thread event queue for inter-thread management.
737 * Exception management (in addition to RAISE).
739 * Thread stop/wait/exit/wakeup primitives.
743 Rune provides a conditional-capable RAISE and LRAISE instruction. The
744 actual instruction code is only used for out-of-band procedural returns.
745 In-band jumps generate Bcc or JMP instructions instead. RAS automatically
746 adjusts the instruction as needed without the emitter having to specify
747 a PC-relative address. However, the emitter can elect to supply such an
748 address if it desires which will cause a Bcc or JMP to be emitted. The
749 difference is that RAS may also generate an additional catch table for the
752 [L]RAISE determines where to jump based on the current .catch context.
753 .catch pseudo instructions tell the assembler which catch target label
754 within the procedure is effective at any given point. If there is no
755 .catch context an actual [L]RAISE instruction is emitted.
757 The actual instruction code causes a [L]RET to occur but then enters the
758 run-time instead of continuing at the callers next instruction. The
759 run-time uses .catch tables to calculate where to jump to in the caller.
760 If no .catch context is present it will chain another [L]RAISE.
762 .catch is a pseudo-op which generates no actual code. You specify the
763 catch target with it. The assembler will generate a catch table for
766 Optimization and Translation Pass Requirements
768 All instruction sequences in Rune object code must be analyzable for
769 execution path determinism to allow any pass to optimize/overload
770 register and cache memory object use. As a simple example, this means
771 that any overloading of cache memory objects already in-place must
772 pass this deterministic analysis with regards to:
774 (1) Loads and stores of compatible object sizes only, with size changes
775 due to overloading bounded by a deterministic store.
777 (2) Cache memory objects must be stored deterministically before they can
778 be loaded. You cannot load an uninitialized cache memory object.
780 (3) Simple but powerful topological code path analysis is employed. The
781 object code must pass on a procedure-by-procedure basis.
783 (4) No jumps or branches from inside to outside the current procedure.
784 Only CALL, LCALL, and various RET-related instructions may be used
785 at the procedure's execution boundaries.
787 (5) Rune makes machine-specific optimizations via the library call
788 interface. The translation phase is allowed to replace particular
789 well-defined library calls with machine-specific code. For example,
790 a matrix or vector manipulation call might be replaced with an SSE
791 instruction. Library writers are expected to fully implement the
792 feature as Rune code, and then allow translators to optimize the
793 code out for machine targets which understand the particular feature.
794 This works for direct library calls.
796 (6) Rune is generally allowed to inline direct call targets. That is,
797 the Rune language will generate calls (not do the actual inlining),
798 and the translation phase will handle the inlining. This works for
799 direct calls and direct library calls.
801 (7) Unreachable code can be optimized out. Unreachable exception code
802 through call topology analysis can also be optimized out.
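
Requirement (2) above, that a cache memory object is stored on every path
before it is loaded, can be checked with a standard forward dataflow pass.
The Python sketch below is one possible formulation, not the analysis any
Rune tool is known to use: the procedure is given as hypothetical basic
blocks holding ("store", offset) and ("load", offset) operations, and
object sizes (requirement 1) are not modeled.

```python
def uninitialized_loads(blocks, edges, entry):
    """Return (block, offset) pairs where a load may precede any store.

    blocks: {name: [(kind, offset), ...]} with kind "store" or "load"
    edges:  {name: [successor names]}
    """
    universe = {off for ops in blocks.values() for _, off in ops}
    preds = {name: [] for name in blocks}
    for src, dsts in edges.items():
        for dst in dsts:
            preds[dst].append(src)

    # out[b] = offsets definitely stored on every path to the end of b.
    out = {name: set(universe) for name in blocks}

    def flow_in(name):
        if name == entry or not preds[name]:
            return set()                      # nothing stored at entry
        return set.intersection(*(out[p] for p in preds[name]))

    changed = True
    while changed:                            # iterate to a fixed point
        changed = False
        for name, ops in blocks.items():
            stores = {off for kind, off in ops if kind == "store"}
            new_out = flow_in(name) | stores
            if new_out != out[name]:
                out[name] = new_out
                changed = True

    bad = []
    for name, ops in blocks.items():
        known = flow_in(name)
        for kind, off in ops:
            if kind == "store":
                known.add(off)
            elif off not in known:
                bad.append((name, off))
    return bad

# Diamond-shaped procedure: only one arm stores offset -8 before the join.
edges = {"entry": ["a", "b"], "a": ["join"], "b": ["join"]}
bad_blocks = {"entry": [], "a": [("store", -8)], "b": [],
              "join": [("load", -8)]}
good_blocks = dict(bad_blocks, b=[("store", -8)])
```

uninitialized_loads(bad_blocks, edges, "entry") reports the load in the
join block, while the same call on good_blocks reports nothing.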
804 Locked bus cycle instructions
806 Because architectures implement locked bus cycle instructions differently,
807 Rune provides a set of locked bus-cycle instructions using instruction
808 codes 1111.xxxx, including nominal ADD, SUB, bitwise operators, CMPXCHG,
809 and SWAP. These instructions implement full memory barriers and execute
812 Generally speaking the operation will execute with a locked bus cycle
813 and the original contents of the memory location will be stored in the
814 third operand register.
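
The fetch-and-modify behaviour can be modeled in Python with an ordinary
mutex standing in for the locked bus cycle. This is a conceptual sketch
only: LockedCell and its methods are hypothetical, and the returned value
plays the role of the register that receives the original memory contents.

```python
import threading

class LockedCell:
    """One memory location whose updates are indivisible."""

    def __init__(self, value: int = 0, bits: int = 64):
        self._mask = (1 << bits) - 1
        self._value = value & self._mask
        self._lock = threading.Lock()   # stands in for the locked bus cycle

    def load(self) -> int:
        with self._lock:
            return self._value

    def fetch_add(self, operand: int) -> int:
        """Add operand to memory and return the original contents."""
        with self._lock:
            old = self._value
            self._value = (old + operand) & self._mask
            return old

    def swap(self, new_value: int) -> int:
        """Store new_value and return the original contents."""
        with self._lock:
            old = self._value
            self._value = new_value & self._mask
            return old

cell = LockedCell(5)
previous = cell.fetch_add(3)   # previous holds the value before the add
```

Because each update completes under the lock, concurrent fetch_add calls
from several threads never lose an increment.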
818 Rune has a set of object locking instructions. These instructions exist
819 because object locking is integral to the language and must be heavily
820 optimized at multiple stages to produce a high-performance machine target.
822 These instructions specify the memory address of a Rune lock structure
823 and perform all operations necessary on the lock and related thread
824 context, including potentially blocking.
826 Generally speaking locks are integrated into the thread context. The
827 thread keeps track of held locks and whether it is in soft or hard
828 locking mode. In soft mode held locks can be temporarily lost when
829 the thread voluntarily blocks (but not if it yields or is switched
830 involuntarily), and will be reacquired when it unblocks. In hard mode
831 held locks cannot be lost.
833 Since many locks may be held, Rune implements a lock-stealing mechanism
834 whereby a held lock in soft mode can be stolen while the thread is
835 blocked, and a mechanism to detect if any locks or a particular lock
836 has been stolen. Detection of stolen locks during blocking is handled
837 by testing a special stolen-lock-counter field in the thread structure.
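
One way to picture the stolen-lock counter is a snapshot-and-compare
around a blocking point. The Python sketch below is an assumption-laden
model, not the Rune runtime: ThreadContext, steal, and block_and_check are
hypothetical names, and real lock reacquisition is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    soft_mode: bool = True
    held: set = field(default_factory=set)
    stolen_count: int = 0          # bumped whenever a held lock is stolen

def steal(thread: ThreadContext, lock_id: str) -> bool:
    """Try to take a lock away from a blocked thread."""
    if not thread.soft_mode or lock_id not in thread.held:
        return False               # hard-mode locks cannot be lost
    thread.held.discard(lock_id)
    thread.stolen_count += 1
    return True

def block_and_check(thread: ThreadContext, while_blocked) -> bool:
    """Block, then report whether any lock was stolen in the meantime."""
    before = thread.stolen_count
    while_blocked(thread)          # stands in for the blocking operation
    return thread.stolen_count != before

soft = ThreadContext(held={"A"})
lost = block_and_check(soft, lambda t: steal(t, "A"))
```

A single counter comparison tells the thread whether it must revalidate
state protected by its locks, without inspecting each lock individually.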