docs/insn.txt

   1
   2                     Rune Object and Machine Model Instruction Set
   3
   4     Please refer to objmodel.txt for the reasoning behind the formulation
   5     of the instruction set.
   6
   7     Instruction codes are one or two 16-bit words, plus an optional extension.
   8     There are many 16-bit forms and one 32-bit form.  Most forms may have
   9     extension words representing 16, 32, or 64 bits of data, or no extension
  10     words at all (representing 0).
  11
  12     32 and 64-bit extension words are 16-bit aligned in the instruction flow
  13     but loaded at the bit locations of their natural alignment, which allows
  14     processing code to avoid any shifts.  For example, a 64-bit extension
  15     misaligned by 32 bits can be loaded by taking the first aligned 64-bit
  16     word masked against the low 32 bits and ORing it with the second aligned
  17     64-bit word masked against the high 32 bits.
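A minimal sketch of the shift-free load described above, assuming the two aligned 64-bit words have already been fetched as integers. The function and constant names are hypothetical, and which memory bytes land in the low or high half depends on the target byte order, which this document does not specify.

```python
LOW32 = 0x00000000FFFFFFFF   # mask for the low 32 bits of an aligned 64-bit word
HIGH32 = 0xFFFFFFFF00000000  # mask for the high 32 bits of an aligned 64-bit word

def load_ext64_misaligned32(first_word: int, second_word: int) -> int:
    """Combine two aligned 64-bit words into one 64-bit extension value.

    The two halves of the extension already sit at their natural bit
    positions, so only two masks and an OR are needed (no shifts).
    """
    return (first_word & LOW32) | (second_word & HIGH32)
```

The unrelated halves of each aligned word (neighbouring instruction bits) are simply masked away.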
  18
  19     Most instruction forms combine the register number specification with
  20     implied type and the operation size, supporting 16 registers x 8 domains
  21     for 128 total registers.  The nominal register save context is around
  22     1080 bytes.  A light-weight thread context inclusive of the register
  23     save context is 2048 bytes and can also include part of the thread's
  24     stack.
  25
  26     Generally speaking, all quick, immediate and offset values are
  27     sign-extended.  This is very important for instruction layout compression
  28     when loading 32 or 64-bit registers and dealing with the 16, 32 or 64-bit
  29     pointer register space.  Built-in IMMQ4 fields range from -8 to +8 and
  30     do not include 0 (bitcode 0000 = +8).  Remember that %r0 always represents
  31     0 and can also be used to form absolute addresses if desired.
  32
  33     Most long-form instructions have significant flexibility in the
  34     interpretation of the built-in vvvvvvvv bits, including using the bits
  35     to avoid needing additional extension words.  This flexibility also
  36     includes a three-operand mode and a scaled three-operand mode.
  37
  38     Condition status bits are only valid for the immediately following
  39     instruction.  You can think of it this way: All instructions always set
  40     all condition bits.  This is a huge translation and optimization aid.
  41
  42     000iiizz ssssdddd                           insn8 %rs,%rd
  43     0010iizz ssssdddd                           insn4 %rs,%pd
  44     0011iizz vvvvdddd                           insn4 $IMMQ4,%pd
  45     010iiizz vvvvdddd                           insn8 $IMMQ4,%rd
  46     011iiizz vvvvdddd OFF16                     insn8 $IMMQ4,OFF16(%pd)
  47
  48     100iiizz ssssdddd OFF16                     insn8 %rs,OFF16(%pd)
  49     101iiizz ssssdddd OFF16                     insn8 OFF16(%ps),%rd
  50     1100iizz ssssdddd OFF16                     insn4 %ps,OFF16(%pd)
  51     1101iizz ssssdddd OFF16                     insn4 OFF16(%ps),%pd
  52
  53     1110xx00 iiiicccc EXT*                      implA OFF(%pc)      (EA w/cond)
  54     1110xx01 iiiidddd EXT*                      implB OFF(%pd)      (EA or mem)
  55     1110xx10 iiiidddd EXT*                      implB *OFF(%pd)     (ind)
  56     1110xx11 ssssdddd EXT*                      LEA   OFF(%ps),%pd  (EA)
  57
  58     1111xxee ezzzdddd ssssvvvv iiiiiiii EXT*    insn8 sea,dea
  59     1111xxee ezzzdddd vvvvvvvv iiiiiiii EXT*    insn8 sea,dea  (eee=001,010)
  60
  61     1111xx11 0zzzdddd ssssvvvv ggggiiii EXT*    insnx16 mem,%rg,%rd  (eee=110)
  62     1111xx11 0zzzdddd ssssvvvv gggg1111 EXT*    LEA     mem,%rg,%rd  (eee=110)
  63     1111xx11 1zzzdddd ssssvvvv ggggiiii EXT*    insnx16 %rs,%rg,mem  (eee=111)
  64     1111xx11 1zzzdddd ssssvvvv gggg1111 EXT*    (reserved)
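The format table above is a prefix code, so the form can be picked from the leading four bits alone. The sketch below illustrates this, assuming the leftmost bit shown in the table is bit 15 of the first 16-bit word (the document does not state the bit ordering). `classify_form` is a hypothetical helper name.

```python
def classify_form(word: int) -> str:
    """Return the instruction form selected by the leading bits of a 16-bit word.

    Assumption: the leftmost bit in the format table is bit 15 of the word.
    """
    top4 = (word >> 12) & 0xF
    top3 = top4 >> 1
    if top3 == 0b000:
        return "insn8 %rs,%rd"
    if top4 == 0b0010:
        return "insn4 %rs,%pd"
    if top4 == 0b0011:
        return "insn4 $IMMQ4,%pd"
    if top3 == 0b010:
        return "insn8 $IMMQ4,%rd"
    if top3 == 0b011:
        return "insn8 $IMMQ4,OFF16(%pd)"
    if top3 == 0b100:
        return "insn8 %rs,OFF16(%pd)"
    if top3 == 0b101:
        return "insn8 OFF16(%ps),%rd"
    if top4 == 0b1100:
        return "insn4 %ps,OFF16(%pd)"
    if top4 == 0b1101:
        return "insn4 OFF16(%ps),%pd"
    if top4 == 0b1110:
        # 1110xxSS: the two bits after xx select the implied sub-form.
        sub = (word >> 8) & 0x3
        return ("implA", "implB", "implB indirect", "LEA")[sub]
    return "long form (32-bit)"  # 1111...
```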
  65
  66     zzz:        Memory operation size and register domain
  67
  68         000      8-bit integer domain
  69         001     16-bit integer domain
  70         010     32-bit integer domain
  71         011     64-bit integer domain
  72         100     pointer domain (16, 32, or 64 machine bits)
  73         101     128-bit media domain
  74         110     256-bit media domain
  75         111     512-bit media domain
  76
  77         NOTE: The pointer domain registers are meant to be directly
  78               translatable to target architecture machine pointer registers
  79               and can be 16, 32, or 64 bits.  For compatibility, the object
  80               model has its own understanding of the pointer width which can
  81               be different.  Code generators must not make assumptions as to
  82               the actual width of registers in the PTR domain (i.e. should not
  83               use PTR domain registers to store integer values).
  84
  85         NOTE: When a register is used indirectly, the PTR domain is implied
  86               regardless of the zzz bits.  Some instruction forms also imply
  87               the PTR domain for source or destination.
  88
  89         NOTE: Only the first four domains are available in 16-bit insn formats,
  90               but there are special 16-bit %ps and %pd register direct
  91               instruction codes.
  92
  93     ssss:       Source Register
  94     dddd:       Target Register
  95
  96         Register 0 in all domains always reads 0 and is a sink-null on write.
  97
  98         The top 5 registers in the pointer domain have a special meaning and
  99         cannot be directly written to in usermode.  This leaves 11 general
 100         pointer registers available for code backends.
 101
 102         %db (%p11)      Data & library base pointer (e.g. per-library)
 103         %tp (%p12)      Thread pointer
 104         %ap (%p13)      Argument pointer (caller frame)
 105         %fp (%p14)      Frame pointer
 106         %pc (%p15)      Program counter
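The register rules above can be sketched as a small model. The class and method names are hypothetical, and raising an exception on a usermode write to %p11..%p15 is an assumption; the document only says such writes are not allowed.

```python
class RegisterFile:
    """16 registers x 8 domains; register 0 reads 0 and discards writes."""

    NUM_DOMAINS = 8
    REGS_PER_DOMAIN = 16
    PTR_DOMAIN = 0b100
    FIRST_SPECIAL_PTR = 11  # %db; %p11..%p15 are the special pointer registers

    def __init__(self):
        self._regs = [[0] * self.REGS_PER_DOMAIN for _ in range(self.NUM_DOMAINS)]

    def read(self, domain: int, n: int) -> int:
        # Register 0 in every domain always reads as 0.
        return 0 if n == 0 else self._regs[domain][n]

    def write(self, domain: int, n: int, value: int, usermode: bool = True) -> None:
        if n == 0:
            return  # sink-null: the write is silently dropped
        if usermode and domain == self.PTR_DOMAIN and n >= self.FIRST_SPECIAL_PTR:
            # Assumed behaviour; the document does not define the failure mode.
            raise PermissionError("special pointer register not writable in usermode")
        self._regs[domain][n] = value
```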
 107
 108     cccc:       Condition code
 109
 110         (only applicable to related implied instruction)
 111
 112         0000    always                  BRA
 113         0001    Z                       BEQ             a == b
 114         0010    C                       BCS, BLO        a < b (unsigned)
 115                                         BVSU            overflow (unsigned)
 116         0011    Z | C                   BLS             a <= b (unsigned)
 117         0100    N                       BMI, BLT        a < b (signed)
 118         0101                            (reserved)
 119         0110    N | Z                   BLE             a <= b (signed)
 120         0111                            BVS             overflow (signed)
 121
 122         1000    never                   BNEVER
 123         1001    ~Z                      BNE             a != b
 124         1010    ~C                      BCC, BHS        a >= b (unsigned)
 125                                         BVCU            no-overflow (unsigned)
 126         1011    ~Z & ~C                 BHI             a > b  (unsigned)
 127         1100    ~N                      BPL, BGE        a >= b (signed)
 128         1101                            (reserved)
 129         1110    ~N & ~Z                 BGT             a > b (signed)
 130         1111                            BVC             no-overflow (signed)
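In the table above, each code with the top bit set is the logical negation of the code with the same low three bits. A sketch of an evaluator follows. The function name is hypothetical, and because the table leaves the flag expression for 0111/1111 blank, the `v` argument (a signed-overflow flag) is an assumption.

```python
def condition_holds(cccc: int, z: bool, c: bool, n: bool, v: bool = False) -> bool:
    """Evaluate a 4-bit condition code against the Z, C, N (and assumed V) flags."""
    base = cccc & 0b0111
    if base == 0b000:
        result = True          # always / never
    elif base == 0b001:
        result = z             # BEQ / BNE
    elif base == 0b010:
        result = c             # BCS / BCC
    elif base == 0b011:
        result = z or c        # BLS / BHI
    elif base == 0b100:
        result = n             # BMI / BPL
    elif base == 0b110:
        result = n or z        # BLE / BGT
    elif base == 0b111:
        result = v             # BVS / BVC (flag name assumed)
    else:
        raise ValueError("reserved condition code")
    # Bit 3 inverts the sense of the test.
    return (not result) if (cccc & 0b1000) else result
```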
 131
 132     vvvv:       Immediate or Offset Quick value.  Prescales by operation
 133                 size if used as an offset.
 134
 135         -8 to +8 (code 0000= +8)
 136
 137         NOTE: insnq $0,dea is automatically converted to insnq %r0,dea
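A sketch of decoding the 4-bit quick field. The document fixes only the range and the special case 0000 = +8. Treating the other fifteen codes as ordinary 4-bit two's complement, and treating prescaling as a multiply by the operation size in bytes for the integer domains, are both assumptions. The helper names are hypothetical.

```python
def decode_immq4(bits: int) -> int:
    """Decode a 4-bit quick value: -8..-1 and +1..+8, never 0."""
    bits &= 0xF
    if bits == 0:
        return 8               # code 0000 encodes +8 instead of 0
    return bits - 16 if bits & 0x8 else bits

def quick_offset(bits: int, zzz: int) -> int:
    """Prescale a quick offset by the operation size (integer domains only)."""
    if not 0 <= zzz <= 0b011:
        raise ValueError("sketch covers only the 8/16/32/64-bit integer domains")
    return decode_immq4(bits) * (1 << zzz)   # 1, 2, 4 or 8 bytes
```

For example, under these assumptions a quick field of 1111 on a 32-bit operation addresses the element 4 bytes below the base register.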
 138
 139     vvvvvvvv:   Immediate or Offset Quick value, signed 2's complement
 140                 (-128 to +127, 0 inclusive).  Does NOT prescale by the
 141                 operation size either way, allowing an 8-bit relocation to
 142                 control the field if desired.  Format used for immediate
 143                 instructions only.
 144
 145     xx:         Encode extension word size
 146         00      No extension word (IMM or OFF value from 'v' bits or imply 0)
 147         01      16-bit extension word:  IMM16 or OFF16
 148         10      32-bit extension word:  IMM32 or OFF32
 149         11      64-bit extension word:  IMM64 or OFF64
 150
 151         When 00 is specified there is no extension word and the IMM or OFF
 152         value either takes on the value of the quick bits (v*), or takes on
 153         the value of 0 if the quick bits are already being used for the other
 154         operand.  The extension word can be specified to be larger than the
 155         operation size but will be truncated if used as an IMM constant or if
 156         larger than the machine pointer width and used as an OFF.
 157
 158         16, 32, and 64-bit extension words are sign-extended 2's complement,
 159         inclusive of 0.
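The extension-word rules above can be sketched as follows for the immediate case. The names are hypothetical. The sketch covers only the "imply 0" reading of xx=00, since the quick-bit case is decoded separately, and it models truncation by keeping the low bits of the sign-extended value and reinterpreting them at the operation size.

```python
EXT_BITS = {0b00: 0, 0b01: 16, 0b10: 32, 0b11: 64}  # xx -> extension width

def sign_extend(raw: int, bits: int) -> int:
    """Interpret the low `bits` bits of raw as a two's complement integer."""
    if bits == 0:
        return 0
    raw &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (raw ^ sign) - sign

def decode_imm_extension(xx: int, raw: int, op_bits: int) -> int:
    """Sign-extend the extension word, then truncate to the operation size."""
    value = sign_extend(raw, EXT_BITS[xx])
    return sign_extend(value, op_bits)
```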
 160
 161     eee[.xx]:   Effective Address Mode  (32-bit insn only)
 162
 163         EA modes not involving an immediate or offset value may use the xx
 164         bits for other purposes.
 165
 166         000.00  insn    %rs,%rd
 167         000.01  insn    %rs,%pd
 168         000.10  insn    %ps,%rd
 169         000.11  (reserved)
 170
 171         001     insn    $IMM,%rd
 172                 insn    $IMMQ8,%rd              (xx=00)
 173
 174         010     insn    $IMMQ8,OFF(%pd)
 175                 insn    $IMMQ8,(%pd)            (xx=00)
 176
 177         011     insn    $IMM,OFFQ8(%pd)
 178                 insn    $0,OFFQ8(%pd)           (xx=00)
 179
 180         100     insn    OFF(%ps),%rd
 181                 insn    OFFQ4*zzz(%ps),%rd      (xx=00)
 182
 183         101     insn    %rs,OFF(%pd)
 184                 insn    %rs,OFFQ4*zzz(%pd)      (xx=00)
 185
 186         110     insnx16 OFF(%ps + %pg*IMMQ4U),%rd
 187                 insnx16 (%ps + %pg*IMMQ4U),%rd  (xx=00)
 188                 insnx16 OFF(%ps),%rg,%rd        (iiii < 1000)
 189
 190                 Special scaled or three-register memory mode.  Maps to
 191                 instructions 1111iiii.  11111111 is LEA.
 192
 193                 IMMQ4U ranges from 1..16 (0000=16)
 194
 195         111     insnx16 %rs,OFF(%pd + %pg*IMMQ4U)
 196                 insnx16 %rs,(%pd + %pg*IMMQ4U)  (xx=00)
 197                 insnx16 %rs,%rg,OFF(%pd)        (iiii < 1000)
 198
 199                 Special scaled or three-register memory mode.  Maps to
 200                 instructions 1111iiii.  11111111 is reserved.
 201
 202                 IMMQ4U ranges from 1..16 (0000=16)
 203
 204         NOTE: Any unused fields must be coded as 0.  Specifically, the vvvv
 205               bits for EA=001,100,101 when an extension is specified (xx!=00).
 206
 207         NOTE: Absolute addressing is supported by using %p0 as the indirect
 208               register.  ABS mode ea is converted to OFF(%p0).
 209
 210         NOTE: If xx=00 the quick bits typically implement either OFFQ4 or
 211               IMMQ4.  Immediate mode instructions (EA=001,010,011) steal the
 212               source register bits to feature OFFQ8 and IMMQ8.  In situations
 213               where both an IMM and an OFF element are present, the quick bits
 214               always implement one and the extension bits always implement
 215               the other, implying a value of 0 if xx=00 (EA=010,011).
 216
 217         NOTE: Memory operations are always read, write, or read-modify-write.
 218               No instruction other than change-of-control insns operates on
 219               more than one memory location.
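For the scaled modes (EA=110 and EA=111) described above, the address computation can be sketched like this. The helper names are hypothetical, and wrapping the result to the pointer width is an assumption.

```python
def decode_immq4u(bits: int) -> int:
    """Decode the unsigned 4-bit scale: 1..16, with code 0000 meaning 16."""
    bits &= 0xF
    return 16 if bits == 0 else bits

def scaled_ea(base: int, index: int, scale_bits: int, off: int = 0,
              ptr_bits: int = 64) -> int:
    """Compute OFF(%base + %index*IMMQ4U), wrapped to the pointer width."""
    mask = (1 << ptr_bits) - 1
    return (off + base + index * decode_immq4u(scale_bits)) & mask
```

With xx=00 there is no extension word, which corresponds to leaving `off` at 0.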
 220
 221                                 Instruction Set
 222
 223                         Implied Instructions A (conditional)
 224
 225     The implA instruction form encodes 16 instructions with an optional
 226     PC-relative effective-address and condition.  Not all insn combinations
 227     are legal.  xx must be 00 for instructions which do not use the PC-rel
 228     EA.
 229
 230     xx.0000     Bcc             OFF(%pc)
 231     xx.0001     CALLcc          OFF(%pc)
 232     xx.0010     (reserved)      OFF(%pc)
 233     xx.0011     (reserved)      OFF(%pc)
 234
 235     00.0100     LFENCE
 236     00.0101     SFENCE
 237     00.0110     MFENCE
 238     xx.0111     (reserved)
 239
 240     00.1000     REScc
 241     00.1001     RETcc
 242     00.1010     IRETcc
 243     00.1011     RAISEcc
 244     00.1100     LREScc
 245     00.1101     LRETcc
 246     00.1110     LIRETcc
 247     00.1111     LRAISEcc
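A sketch of decoding the implA form (1110xx00 iiiicccc) according to the table above, assuming the leftmost bit shown is bit 15 of the 16-bit word. The function and table names are hypothetical.

```python
IMPLA_OPS = [
    "Bcc", "CALLcc", None, None,            # 0000..0011 (0010, 0011 reserved)
    "LFENCE", "SFENCE", "MFENCE", None,     # 0100..0111 (0111 reserved)
    "REScc", "RETcc", "IRETcc", "RAISEcc",  # 1000..1011
    "LREScc", "LRETcc", "LIRETcc", "LRAISEcc",
]

def decode_impla(word: int):
    """Split an implA word into (mnemonic, xx, cccc), rejecting illegal codes."""
    if (word >> 12) & 0xF != 0b1110 or (word >> 8) & 0x3 != 0b00:
        raise ValueError("not an implA instruction word")
    xx = (word >> 10) & 0x3
    iiii = (word >> 4) & 0xF
    cccc = word & 0xF
    name = IMPLA_OPS[iiii]
    if name is None:
        raise ValueError("reserved implA instruction")
    # Only Bcc and CALLcc take a PC-relative EA; all others require xx=00.
    if iiii >= 0b0100 and xx != 0b00:
        raise ValueError("xx must be 00 when no PC-relative EA is used")
    return name, xx, cccc
```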
 248
 249                         Implied Instructions B (PTR source or target)
 250
 251     The implB instruction form encodes 16 instructions with an optional
 252     register-relative address, either an effective-address or a memory
 253     load obtaining the address.
 254
 255     Note that these instructions have side effects and may use other
 256     registers; see the additional information later in this document.
 257
 258     xx.0000     JMP             [*]OFF(%pd)
 259     xx.0001     CALL            [*]OFF(%pd)
 260     xx.0010     LCALL           [*]OFF(%pd)
 261     xx.0011     TRAP            [*]OFF(%pd)
 262     xx.0100
 263     xx.0101
 264     xx.0110
 265     xx.0111
 266
 267         Change of control and fences.  Note that an absolute address or value
 268         may be specified using %p0, an effective address using normal
 269         addressing, or a load from memory can supply the address.
 270
 271         TRAPs supply the address of the trap vector entry.
 272
 273     xx.1000     SLOCK           [*]OFF(%pd)
 274     xx.1001     SLOCKNB         [*]OFF(%pd)
 275     xx.1010     XLOCK           [*]OFF(%pd)
 276     xx.1011     XLOCKNB         [*]OFF(%pd)
 277     xx.1100     UNLOCK          [*]OFF(%pd)
 278     xx.1101
 279     xx.1110
 280     xx.1111
 281
 282         Object locking instructions are so integral to Rune they
 283         need their own instructions in order to allow translators
 284         to heavily optimize their operations.  Note that hard vs soft
 285         locking is handled by the threading runtime.
 286
 287
 288                             Core ALU Instructions
 289
 290     NOTE: insn4 reflects the first 4 instructions (2 bits)
 291           insn8 reflects the first 8 instructions (3 bits)
 292           insn16 reflects the first 16 instructions (4 bits)
 293           implA (see Implied instructions above)
 294           implB (see Implied instructions above)
 295
 296     0000.0000   MOVE            d = s
 297                 NOP             %r0 = %r0
 298     0000.0001   ADD             d = d + s
 299     0000.0010   CMP             (d - s) -> condition
 300     0000.0011   SUB             d = d - s
 301
 302     0000.0100   OR              d = d | s
 303     0000.0101   AND             d = d & s
 304     0000.0110   XOR             d = d ^ s
 305     0000.0111   TEST            (d & s) -> condition
 306
 307     0000.1000   ASL             d = d << s
 308     0000.1001   NEG             d = -s
 309     0000.1010   ASR             d = d >> s      (signed)
 310     0000.1011   LSR             d = d >> s      (unsigned)
 311     0000.1100   BSET            d = d | (1 << s)
 312     0000.1101   BCLR            d = d & ~(1 << s)
 313     0000.1110   BCOM            d = d ^ (1 << s)
 314     0000.1111   BTST            (d & (1 << s)) -> condition
 315
 316                         Conditional Long Form Instructions
 317
 318     0001.cccc
 319     0010.cccc   SETcc           d = s iff cond else d = 0
 320     0011.cccc   MOVEcc          d = s iff cond
 321
 322                     Special Fast Integer and Mask instructions
 323
 324     0100.0000   ADDC
 325     0100.0001   SUBC
 326     0100.0010   BMASK           d = (1 << s) - 1
 327     0100.0011   BIT             d = 1 << s
 328     0100.0100   ROL             d = d <<< s (lsb->carry after operation)
 329     0100.0101   ROR             d = d >>> s (msb->carry after operation)
 330     0100.0110
 331     0100.0111
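The mask and rotate operations above can be sketched for an arbitrary operation width. The names are hypothetical. Reducing the rotate count modulo the width, and reading "lsb->carry after operation" as "carry is the low bit of the result", are assumptions.

```python
def bmask(s: int, bits: int) -> int:
    """BMASK: d = (1 << s) - 1, truncated to the operation width."""
    return ((1 << s) - 1) & ((1 << bits) - 1)

def bit(s: int, bits: int) -> int:
    """BIT: d = 1 << s, truncated to the operation width."""
    return (1 << s) & ((1 << bits) - 1)

def rol(d: int, s: int, bits: int):
    """ROL: rotate left; returns (result, carry) with carry = lsb of result."""
    mask = (1 << bits) - 1
    d &= mask
    s %= bits
    result = ((d << s) | (d >> (bits - s))) & mask
    return result, result & 1

def ror(d: int, s: int, bits: int):
    """ROR: rotate right; returns (result, carry) with carry = msb of result."""
    mask = (1 << bits) - 1
    d &= mask
    s %= bits
    result = ((d >> s) | (d << (bits - s))) & mask
    return result, (result >> (bits - 1)) & 1
```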
 332
 333                         Endian Conversion Instructions
 334
 335     0100.1000   SWAP8           swap 8-bit pairs
 336     0100.1001   SWAP16          swap 16-bit pairs
 337     0100.1010   SWAP32          swap 32-bit pairs
 338     0100.1011   ESWAP           swap endian
 339     0100.1100   LEMOVE
 340     0100.1101   BEMOVE
 341     0100.1110
 342     0100.1111
 343
 344                     Integer Multiply/Divide Instructions
 345
 346     0101.0000   MULU            d = d * s (unsigned)
 347     0101.0001   MULS            d = d * s
 348     0101.0010   DIVU            d = d / s (unsigned)
 349     0101.0011   DIVS            d = d / s
 350     0101.0100   MODU            d = d % s (unsigned)
 351     0101.0101   MODS            d = d % s
 352     0101.0110   MULDIVU         d = d * s / g (unsigned)
 353     0101.0111   MULDIVS         d = d * s / g
 354
 355         NOTE: MULDIV* is only useful as a 3-operand instruction.
 356               The intermediate result is double-wide.  Can be used for
 357               scaling a value.  Register specification is %ra:%rd,dea.
 358
 359               MULDIV* internally implements a double-operation-size-wide
 360               intermediate value and can thus be used to fractionally scale
 361               a 64-bit quantity in the integer domain if desired.
 362
 363         NOTE: If both division and modulo are desired, both instructions must
 364               be executed, but the hardware and/or translator may be able
 365               to optimize that back into one native instruction.
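A sketch of why the double-wide intermediate in MULDIVU matters, using Python's unbounded integers to stand in for the wide product. The function name is hypothetical, only the unsigned variant is shown, and division by zero is left to raise `ZeroDivisionError` because the document does not define that case.

```python
def muldivu(d: int, s: int, g: int, bits: int = 64) -> int:
    """MULDIVU: d = d * s / g with a 2*bits-wide intermediate product."""
    mask = (1 << bits) - 1
    wide = (d & mask) * (s & mask)      # never truncated before the divide
    return (wide // (g & mask)) & mask  # final result truncated to bits
```

Scaling 2**63 by 6/4 fits in 64 bits, but a plain 64-bit multiply would first truncate 2**63 * 6 to 0 and then divide, yielding 0.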
 366
 367     0101.1000
 368     0101.1001
 369     0101.1010
 370     0101.1011
 371     0101.1100
 372     0101.1101
 373     0101.1110
 374     0101.1111
 375
 376                     Integer Width Adjustment Instructions
 377
 378     0110.00yy   EXTU            d = s   overrides source reg domain
 379                 TRUNCU          d = s   overrides source reg domain
 380     0110.01yy   EXTS            d = s   overrides source reg domain
 381                 TRUNCS          d = s   overrides source reg domain
 382     0110.10yy   EXTU            d = s   overrides target reg domain
 383                 TRUNCU          d = s   overrides target reg domain
 384     0110.11yy   EXTS            d = s   overrides target reg domain
 385                 TRUNCS          d = s   overrides target reg domain
 386
 387         For these instructions the main zz bits always control
 388         the destination and the yy bits always control the source.
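The value transformation behind the width adjustments can be sketched as below. Source and destination widths are passed directly rather than decoded from the yy and zz bits, and the function name is hypothetical. Treating TRUNCS as producing the same low bits as TRUNCU is an assumption; the document does not describe any overflow indication.

```python
def width_adjust(value: int, src_bits: int, dst_bits: int, signed: bool) -> int:
    """Return the dst_bits-wide bit pattern for an EXT*/TRUNC* style move."""
    src = value & ((1 << src_bits) - 1)
    if signed and src & (1 << (src_bits - 1)):
        src -= 1 << src_bits             # reinterpret the source as negative
    return src & ((1 << dst_bits) - 1)   # widen or truncate to the destination
```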
 389
 390                         RNG, CRC, and Encryption
 391
 392     0111.0000   CRYPT
 393     0111.0001   DECRYPT
 394     0111.0010   DCACHE          clears context caches for Crypto & CRC
 395     0111.0011
 396     0111.0100   CRC
 397     0111.0101
 398     0111.0110   RAND            d = (random_number) & s
 399     0111.0111
 400
 401                     Special Memory Unit Instructions
 402
 403     1000.0000
 404     1000.0001
 405     1000.0010
 406     1000.0011
 407
 408     1000.0100   VMOVE           d = s   (invalid->overflow)
 409     1000.0101   FLUSH           mem     Synchronize specific cache line
 410     1000.0110   RAHEAD          mem     Request shared cache line
 411     1000.0111   WAHEAD          mem     Request exclusive cache line
 412
 413         NOTE: VMOVE, RAHEAD, and WAHEAD are opportunistic and will not
 414               generate an exception if the MMU says the memory location
 415               is invalid.  However, exceptions are still generated for
 416               misaligned requests.
 417
 418     1000.1000
 419     1000.1001   LOCK            %rs,OFF(ea)  (specify lock type in %rs)
 420     1000.1010   RAISEIPL        raise user interruption priority (or NOP)
 421     1000.1011   LOWERIPL        lower user interruption priority (or NOP)
 422     1000.1100   SETIPL          set user interrupt priority
 423     1000.1101   WEVENT          write event bit (V=nospace)
 424     1000.1110   REVENT          read event bit
 425     1000.1111   REVENTNB        read event bit (non-blocking)
 426
 427         NOTE: IPL functions operate on sea and store the prior value of
 428               the UPL in dea.
 429
 430                                 Media Unit (Vector)
 431
 432     The vector unit supports 8, 16, 32, 64, 128, 256, and 512-bit
 433     register domains, and instruction extensions (v-bits).
 434
 435
 436     1001.0000
 437     1001.0001
 438     1001.0010
 439     1001.0011
 440
 441     1001.0100
 442     1001.0101
 443     1001.0110
 444     1001.0111
 445
 446     1001.1000
 447     1001.1001
 448     1001.1010
 449     1001.1011
 450     1001.1100
 451     1001.1101
 452     1001.1110
 453     1001.1111
 454
 455     1010.0000
 456     1010.0001
 457     1010.0010
 458     1010.0011
 459     1010.0100
 460     1010.0101
 461     1010.0110
 462     1010.0111
 463     1010.1000
 464     1010.1001
 465     1010.1010
 466     1010.1011
 467     1010.1100
 468     1010.1101
 469     1010.1110
 470     1010.1111
 471
 472                                 Media Unit (Floating)
 473
 474     The floating unit supports 32 (float), 64 (double), and
 475     128-bit (ldouble) domains, and instruction extensions (v-bits).
 476
 477     1100.0000   FMOVE
 478     1100.0001   FADD
 479     1100.0010   FSUB
 480     1100.0011   FMUL
 481     1100.0100   FDIV
 482     1100.0101   FCMP
 483     1100.0110
 484     1100.0111
 485     1100.1000
 486     1100.1001
 487     1100.1010
 488     1100.1011
 489     1100.1100
 490     1100.1101
 491     1100.1110
 492     1100.1111
 493
 494     1101.0000   ITOFS           integer -> fp   (signed)
 495     1101.0001   ITOFU           integer -> fp   (unsigned)
 496     1101.0010   FMLOADS         integer -> fp   (signed)
 497     1101.0011   FMLOADU         integer -> fp   (unsigned)
 498     1101.0100   FTOIS           fp -> integer   (signed)
 499     1101.0101   FTOIU           fp -> integer   (unsigned)
 500     1101.0110   FMSAVES         fp -> mant/exp  (signed)
 501     1101.0111   FMSAVEU         fp -> mant/exp  (unsigned)
 502     1101.1000   FBUILDS         mant/exp -> fp  (signed)
 503     1101.1001   FBUILDU         mant/exp -> fp  (unsigned)
 504     1101.1010
 505     1101.1011
 506     1101.1100
 507     1101.1101
 508     1101.1110
 509     1101.1111
 510
 511                             System Instructions
 512
 513     1110.0000   MOVEFRU         d = s   move from user space
 514     1110.0001   MOVETOU         d = s   move to user space
 515     1110.0010
 516     1110.0011
 517     1110.0100
 518     1110.0101
 519     1110.0110
 520     1110.0111
 521     1110.1000
 522     1110.1001
 523     1110.1010
 524     1110.1011
 525     1110.1100
 526     1110.1101
 527     1110.1110
 528     1110.1111
 529
 530                         Three-operand locked bus cycle instructions
 531
 532     1111.0000   ADD     sea,%rg,dea     (fetch and add)
 533     1111.0001   SUB     sea,%rg,dea     (fetch and sub)
 534     1111.0010   AND     sea,%rg,dea     (fetch and and)
 535     1111.0011   OR      sea,%rg,dea     (fetch and or)
 536     1111.0100   BSET    sea,%rg,dea     (fetch, test and set)
 537     1111.0101   BCLR    sea,%rg,dea     (fetch, test and clear)
 538     1111.0110   CMPXCHG sea,%rg,dea     (fetch, compare and exchange)
 539     1111.0111   SWAP    sea,%rg,dea     (swap)
 540
 541                         Three-operand scaled instructions
 542
 543     1111.1000   MOVE    sea,dea
 544     1111.1001   ADD     sea,dea
 545     1111.1010   SUB     sea,dea
 546     1111.1011   CMP     sea,dea
 547     1111.1100   AND     sea,dea
 548     1111.1101   OR      sea,dea
 549     1111.1110   XOR     sea,dea
 550     1111.1111   LEA     sea,dea         (EA=110 only)
 551
 552     NOTE: There are no FP exceptions, but the V (overflow) bit will be
 553           set if an operation fails.  Failed operations set the destination
 554           to NaN which can also be tested, and failures accumulate in the
 555           thread status bits which can also be tested. XXX I might change
 556           how this works, it may be too time consuming to translate the
 557           thread status bit part.
 558
 559                         Library Call and Return
 560
 561     CALL/LCALL  - Saves %ap, and (return) %pc in the source frame.
 562                   Sets %ap = %fp
 563                   Sets %fp = new_frame
 564
 565                   For a library call, also saves %db and replaces it with
 566                   %pd, where %pd is the register from the OFF(%pd) of the
 567                   LCALL argument.  Thus %pd in the call is expected to be
 568                   the new library base.
 569
 570     RET/LRET    - Sets %fp = %ap
 571                   Restores %ap and %pc from the source frame.
 572
 573                   For a library call, also restores %db from the source frame.
 574
 575     RES/LRES    - These instructions create a copy of the current frame
 576                   in a new thread and return to the caller in the old thread.
 577
 578                   In the new thread, %ap is generally set to %tp and a
 579                   RET or LRET vectors to code which exits the thread.
 580                   The original %ap frame might have returned so it is not
 581                   accessible in the new thread.
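The CALL/LCALL and RET/LRET register shuffling described above can be modelled with a small sketch. The class, the dictionary used as the out-of-band save area, and the use of opaque frame identifiers are all assumptions made for illustration. RES/LRES is not modelled.

```python
class RuneFrames:
    """Toy model of %pc, %ap, %fp and %db across calls and returns."""

    def __init__(self, pc, ap, fp, db):
        self.pc, self.ap, self.fp, self.db = pc, ap, fp, db
        self.saved = {}  # save area keyed by the source (caller) frame

    def call(self, target, new_frame, return_pc, lib_base=None):
        """CALL, or LCALL when lib_base (the %pd of the operand) is given."""
        save = {"ap": self.ap, "pc": return_pc}
        if lib_base is not None:
            save["db"] = self.db       # LCALL also saves and replaces %db
            self.db = lib_base
        self.saved[self.fp] = save     # stored in the source frame
        self.ap = self.fp
        self.fp = new_frame
        self.pc = target

    def ret(self):
        """RET, or LRET when the matching call saved %db."""
        self.fp = self.ap
        save = self.saved.pop(self.fp)
        self.ap, self.pc = save["ap"], save["pc"]
        if "db" in save:
            self.db = save["db"]
```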
 582
 583     NOTE: Register save areas are out of band (in an area not included in
 584           either the negative or positive frame space), and thus the final
 585           translation stage might optimize them out or require a different
 586           amount of space.
 587
 588     In calls, the new frame is not necessarily a stack push relative to the
 589     old frame, and not necessarily at a lower memory address.  For translated
 590     executables the negative and positive frame sizes and reserved call
 591     space are typically added together, and the translator adjusts the
 592     offsets appropriately.  Also note that the translator may adjust the
 593     size of the negative frame space, causing all post-translated offsets
 594     to change.
 595
 596     In a direct implementation of the Rune machine, the return %pc and %ap
 597     are saved, %fp is copied to %ap, and %fp remains unchanged.  The final
 598     translation pass is expected to insert code to adjust %fp downward (if a
 599     stack model is being used) in the target function.
 600
 601                                 Object ABI
 602
 603     In the object form, the coder can assume that all appropriate frame
 604     and register handling abstractions will be performed.  This includes:
 605
 606     * The frame pointer allocation abstraction.
 607
 608       The final translation step handles saving, shifting, and restoring
 609       %ap, %fp, %db, and %pc.  This includes the allocation of the new frame
 610       space which is specified via special relocations in objects, libraries,
 611       and binaries.
 612
 613     * The register space abstraction.
 614
 615       All procedure calls fully abstract the entire register space, so each
 616       procedure can assume that all registers are persistent.  The final
 617       translation step handles all required save and restore operations.
 618
 619       Register passing is fully supported via the Memory object cache
 620       abstraction by explicitly storing a register into the negative (cache)
 621       frame space in the call and explicitly loading the register from the
 622       negative (cache) argument space in the call target.
 623
 624       Register return is fully supported via the same mechanism.  The target
 625       stores a result in the negative (cache) argument space and the caller
 626       loads the result from that space immediately after the call returns.
 627
 628       The final translation step can optimize these actions into reg-args
 629       and reg-return, completely optimizing out the memory operations while
 630       still maintaining the register abstraction.
 631
 632     * The Memory object cache abstraction.
 633
 634       The negative and positive %ap-relative and %fp-relative space is treated
 635       differently.  The base offset (not the post-relocation) determines which
 636       space is being used.  The spaces are *NOT* contiguous.
 637
 638       The positive offset space represents fixed-memory objects and the layout
 639       is exactly as you use the space.
 640
 641       The negative offset space represents register-cacheable memory objects.
 642       Any intermediate assembly, linking, or translation step can modify
 643       the offsets, optimize them out, or even add new offsets.  Each offset
 644       represents an explicit object which must be read and written using
 645       the exact same operation size.  Any intermediate step can decide to
 646       cache objects in this space in registers and can also overload offsets
 647       in the space when it is legal to do so.
 648
 649       The programmer can overload offsets in the space through legal means.
 650       Essentially, simple code analysis must show that all possible code paths
 651       result in deterministic behavior with regards to loads and stores to
 652       a particular negative offset.  The offsets must also be legally aligned.
 653
 654     * Interfacing via Library call interface instead of system calls.
 655
 656       Interfacing with the target machine is done via the library interface
 657       rather than with a system call trap.  The final translation pass or
 658       interpreter will convert the library calls to the appropriate mechanism.
 659       This saves us a lot of grief.
 660
 661     * Exception handling via RAISE.
 662
 663       Exception handling is designed to avoid excessively complex code paths
 664       in the object model.  See the description later on in this file.
 665
 666                               Condition Codes
 667
 668     In Rune object code, all instructions set specific condition codes and
 669     any remaining codes are set to undefined.  Condition codes are only good
 670     for the next sequential instruction and are not good across calls or
 671     returns (that is, the CALL and RET instructions set all condition codes
 672     to invalid).
 673
 674     This greatly reduces optimization and translation complexities.
 675
 676                             Library-relative memory
 677
 678     Library relative memory typically holds library-wide variables, which
 679     are basically global variables only accessible via a particular library
 680     base pointer.
 681
 682     The positive offset space may be used to hold this data, and like all
 683     positive offset spaces the layout is under the control of the programmer.
 684
 685     The negative offset space for a library contains the library descriptors
 686     for its library call vectors, which is what the LCALL points to.  These
 687     descriptors are defined elsewhere but are typically big enough to hold
 688     actual code if the library function is short enough, resulting in a
 689     direct call, and otherwise will hold a JMP instruction which points to
 690     the actual library function.  Library descriptors may contain other
 691     information as well.
 692
 693                             Thread-relative memory
 694
 695     The negative offset space for %tp-relative accesses contains thread
 696     hardware and other fields generally defined by the translator and
 697     accessed through relocations.  The translator may perform other actions
 698     when accessing such fields as the fields might represent hardware
 699     functions and actual memory.
 700
 701     The positive offset space for %tp-relative accesses contains per-thread
 702     data generally reserved using a special section.
 703
 704                             Threading and Process ABI
 705
 706     The Rune object and machine model supports fine-grained threading.  These
 707     threads are NOT unix processes or unix pthreads.  They are RUNE threads
 708     defined by the %tp (thread pointer register), creating a separate very
 709     light-weight executable context.
 710
 711     Implementations can choose to execute RUNE threads via Unix-style threads
 712     (e.g. pthreads or some other shared address space mechanism), can
 713     manually switch between threads through some other mechanism to limit
 714     the number of Unix-style threads, or can use some combination of these.
 715
 716     It should be noted that Rune programs often assume a very large number
 717     of threads to be available at low cost (tens of thousands or hundreds
 718     of thousands).
 719
 720     All thread ABI calls except those defined by specific instructions are
 721     defined via the machine interface library, which uses the RUNE library
 722     call mechanism but might be translated to inline or other specialized
 723     code.
 724
 725     Per-thread ABI features include:
 726
 727         * A global free-running timer with a per-thread match/wait
 728           capability, generating an event.
 729
 730         * Yield.
 731
 732         * Interruption management primitives for inter-thread interruption
 733           (does not have to disable target machine interrupts).
 734
 735         * Per-thread event queue for inter-thread management.
 736
 737         * Exception management (in addition to RAISE).
 738
 739         * Thread stop/wait/exit/wakeup primitives.
 740
 741                                 Exception Handling
 742
 743     Rune provides conditional-capable RAISE and LRAISE instructions.  The
 744     actual instruction code is only used for out-of-band procedural returns.
 745     In-band jumps generate Bcc or JMP instructions instead.  RAS automatically
 746     adjusts the instruction as needed without the emitter having to specify
 747     a PC-relative address.  However, the emitter can elect to supply such an
 748     address if it desires, which will cause a Bcc or JMP to be emitted.  The
 749     difference is that RAS may also generate an additional catch table for
 750     the run-time.
 751
 752     [L]RAISE determines where to jump based on the current .catch context.
 753     .catch pseudo instructions tell the assembler which catch target label
 754     within the procedure is effective at any given point.  If there is no
 755     .catch context an actual [L]RAISE instruction is emitted.
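
         The lowering rule above can be sketched as a single pass over the
         source lines.  This is a toy model, not RAS itself: the tuple
         input format is invented for illustration, and using ".catch"
         with no label to clear the context is an assumption.

```python
def lower_raises(lines):
    """lines: list of (op, arg) pairs.  (".catch", label) sets the current
    catch context, (".catch", None) clears it (assumed), and
    ("raise", None) marks a RAISE site.  Returns what would be emitted."""
    catch_label = None
    emitted = []
    for op, arg in lines:
        if op == ".catch":
            catch_label = arg  # pseudo-op: changes context, emits no code
        elif op == "raise":
            if catch_label is not None:
                # In-band: the target is inside this procedure.
                emitted.append(("JMP", catch_label))
            else:
                # Out-of-band: a real RAISE returns to the caller.
                emitted.append(("RAISE", None))
    return emitted
```

         A conditional raise would yield a Bcc in place of the JMP; the
         sketch omits conditions to stay short.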
 756
 757     The actual instruction code causes a [L]RET to occur but then enters the
 758     run-time instead of continuing at the caller's next instruction.  The
 759     run-time uses .catch tables to calculate where to jump to in the caller.
 760     If no .catch context is present, it will chain another [L]RAISE.
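
         The chaining behaviour can be pictured as a walk up the return
         stack.  The sketch below assumes a catch table is a list of
         (start_pc, end_pc, target_pc) ranges per procedure; the real
         table format is not described here, so that layout, like the
         function name, is hypothetical.

```python
def find_handler(return_stack, catch_tables):
    """return_stack: list of (caller, return_pc), innermost caller last.
    catch_tables: dict mapping caller -> list of (start, end, target).
    Returns (caller, target_pc), or None if no caller has a handler."""
    for caller, return_pc in reversed(return_stack):
        for start, end, target in catch_tables.get(caller, ()):
            if start <= return_pc < end:
                return caller, target  # jump here instead of return_pc
        # No .catch context covers this return point: chain another
        # RAISE, which returns one more level up.
    return None
```

         What happens when the outermost frame has no handler is not
         covered by this section; the sketch simply returns None.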
 761
 762     .catch is a pseudo-op which generates no actual code.  You specify the
 763     catch target with it.  The assembler will generate a catch table for
 764     the run-time.
 765
 766                 Optimization and Translation Pass Requirements
 767
 768     All instruction sequences in Rune object code must be analyzable for
 769     execution path determinism to allow any pass to optimize/overload
 770     register and cache memory object use.  As a simple example, this means
 771     that any overloading of cache memory objects already in-place must
 772     pass this deterministic analysis with regard to:
 773
 774     (1) Loads and stores of compatible object sizes only, with size changes
 775         due to overloading bounded by a deterministic store.
 776
 777     (2) Cache memory objects must be stored deterministically before they can
 778         be loaded.  You cannot load an uninitialized cache memory object.
 779
 780     (3) Simple but powerful topological code path analysis is employed.  The
 781         object code must pass on a procedure-by-procedure basis.
 782
 783     (4) No jumps or branches from inside to outside the current procedure.
 784         Only CALL, LCALL, and various RET-related instructions may be used
 785         at the procedure's execution boundaries.
 786
 787     (5) Rune makes machine-specific optimizations via the library call
 788         interface.  The translation phase is allowed to replace particular
 789         well-defined library calls with machine-specific code.  For example,
 790         a matrix or vector manipulation call might be replaced with an SSE
 791         instruction.  Library writers are expected to fully implement the
 792         feature as Rune code, and then allow translators to optimize the
 793         code out for machine targets which understand the particular feature.
 794         This works for direct library calls.
 795
 796     (6) Rune is generally allowed to inline direct call targets.  That is,
 797         the Rune language will generate calls (not do the actual inlining),
 798         and the translation phase will handle the inlining.  This works for
 799         direct calls and direct library calls.
 800
 801     (7) Unreachable code can be optimized out.  Exception code that call
 802         topology analysis shows to be unreachable can also be optimized out.
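
         Requirement (2) is the kind of property a standard forward
         "must be initialized" dataflow pass can verify per procedure.
         The sketch below is that textbook analysis, not the analysis a
         Rune translator actually uses; the block and operation encoding
         is invented for illustration.

```python
def uninitialized_loads(blocks, succs, entry):
    """blocks: dict name -> list of ("store" | "load", obj).
    succs: dict name -> list of successor block names.
    Returns (block, obj) for every load not preceded by a store to the
    same object on every path from the entry block."""
    all_objs = {obj for ops in blocks.values() for _, obj in ops}
    preds = {name: [] for name in blocks}
    for name, targets in succs.items():
        for target in targets:
            preds[target].append(name)

    # out_sets[b]: objects definitely stored when b exits.  Start
    # optimistic and shrink until nothing changes.
    out_sets = {name: set(all_objs) for name in blocks}

    def in_set(name):
        if name == entry or not preds[name]:
            return set()
        return set.intersection(*(out_sets[p] for p in preds[name]))

    changed = True
    while changed:
        changed = False
        for name, ops in blocks.items():
            new_out = in_set(name) | {o for op, o in ops if op == "store"}
            if new_out != out_sets[name]:
                out_sets[name] = new_out
                changed = True

    bad = []
    for name, ops in blocks.items():
        defined = in_set(name)
        for op, obj in ops:
            if op == "load" and obj not in defined:
                bad.append((name, obj))
            elif op == "store":
                defined.add(obj)
    return bad
```

         In a diamond-shaped procedure where only one arm stores an object,
         a load after the join is reported, because one path reaches it
         without a store.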
 803
 804                           Locked bus cycle instructions
 805
 806     Because architectures implement locked bus cycle instructions differently,
 807     Rune provides a set of locked bus-cycle instructions using instruction
 808     codes 1111.xxxx, including nominal ADD, SUB, bitwise operators, CMPXCHG,
 809     and SWAP.  These instructions implement full memory barriers and execute
 810     atomically.
 811
 812     Generally speaking the operation will execute with a locked bus cycle
 813     and the original contents of the memory location will be stored in the
 814     third operation register.
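
         The "operate, then hand back the original contents" behaviour can
         be modelled as follows.  This is a sketch only: a Python lock
         stands in for the locked bus cycle, the class and method names
         are hypothetical, operand widths are ignored, and the CMPXCHG
         semantics shown are the conventional ones rather than anything
         this section spells out.

```python
import threading

class LockedCell:
    """One memory location whose updates are atomic."""

    def __init__(self, value=0):
        self.value = value
        self._bus = threading.Lock()  # stand-in for the locked bus cycle

    def locked_add(self, operand):
        with self._bus:
            original = self.value
            self.value = original + operand
            return original  # would land in the third register

    def locked_swap(self, new_value):
        with self._bus:
            original = self.value
            self.value = new_value
            return original

    def locked_cmpxchg(self, expected, new_value):
        with self._bus:
            original = self.value
            if original == expected:
                self.value = new_value
            return original  # caller compares against expected
```

         Returning the original value in every case lets the caller tell
         whether a CMPXCHG succeeded without a second memory access.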
 815
 816                                 Object Locking
 817
 818     Rune has a set of object locking instructions.  These instructions exist
 819     because object locking is integral to the language and must be heavily
 820     optimized at multiple stages to produce a high-performance machine target.
 821
 822     These instructions specify the memory address of a Rune lock structure
 823     and perform all operations necessary on the lock and related thread
 824     context, including potentially blocking.
 825
 826     Generally speaking locks are integrated into the thread context.  The
 827     thread keeps track of held locks and whether it is in soft or hard
 828     locking mode.  In soft mode held locks can be temporarily lost when
 829     the thread voluntarily blocks (but not if it yields or is switched
 830     involuntarily), and will be reacquired when it unblocks.  In hard mode
 831     held locks cannot be lost.
 832
 833     Since many locks may be held, Rune implements a lock-stealing mechanism
 834     whereby a held lock in soft mode can be stolen while the thread is
 835     blocked, and a mechanism to detect if any locks or a particular lock
 836     has been stolen.  Detection of stolen locks during blocking is handled
 837     by testing a special stolen-lock-counter field in the thread structure.
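
         A minimal model of soft-mode stealing and counter-based detection
         follows.  All names are hypothetical, contention on reacquisition
         is not modelled, and the real thread structure layout is defined
         elsewhere.

```python
class ThreadCtx:
    def __init__(self, hard=False):
        self.hard = hard        # hard mode: held locks cannot be lost
        self.held = set()
        self.lost = set()       # locks stolen while blocked
        self.stolen_count = 0   # the stolen-lock counter
        self.blocked = False

def steal(victim, lock):
    """Steal a lock from a thread; only works on a blocked soft-mode
    thread that holds it."""
    if victim.blocked and not victim.hard and lock in victim.held:
        victim.held.discard(lock)
        victim.lost.add(lock)
        victim.stolen_count += 1
        return True
    return False

def block(ctx, while_blocked):
    """Voluntarily block, run while_blocked() to stand in for other
    threads, then unblock.  Returns True if any lock was stolen."""
    before = ctx.stolen_count
    ctx.blocked = True
    while_blocked()
    ctx.blocked = False
    ctx.held |= ctx.lost        # reacquire on unblock
    ctx.lost.clear()
    return ctx.stolen_count != before
```

         Comparing the counter before and after blocking tells the thread
         that state protected by its locks may have changed, even though
         it holds every lock again once it resumes.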
 838