Microcode
This section describes the rationale for using a microcode, as well as its weaknesses, some terminology, and finally, the various microcode fields of Sentinel. The default microcode file is also included to study!
If you want the short version of the rationale before skipping ahead to the terminology:
Sentinel is a horizontally-microcoded
CPU design because I wanted to create a conforming RV32I_Zicsr CPU with M-Mode
that fits into ~1000 ICESTORM_LCs[1] on an iCE40 FPGA. It wasn’t- and still isn’t- clear
to me that I could hit these goals without using some of the FPGA Block RAM
for a microcode instead of LUTs for a hardwired control unit.
Some References
CPUs implementing a Reduced Instruction Set, like RISC-V, are best optimized for speed by using dedicated circuitry to control the internal components. Today, microcode is generally relegated to special cases. However, both hardwired and microcoded control fulfill the same general purpose: “drive 0s and 1s to a CPU’s various components to move and manipulate data”.
I designed a hardwired CPU for a class many years ago; it took me a while to wrap my head around how microcode works and find the right people to point me in the right direction. I found the following resources useful:
Over the years, the Wikipedia article has gotten a lot better.
The ZPU Stack Machine CPU has a microcode implementation to study.
In my opinion, the gold standard for microcode design is Bit-slice microprocessor design by John Mick and Jim Brick.
The book was written with the Am2900 series in mind, which are long out-of-production parts. However, the Am2900 series were building blocks designed for microcode. As the book teaches you to make a custom CPU datapath out of Am2900 parts, you by extension learn how to design a microcode.
Hardwired and Microcoded Rationale
Back in the 80s, microcode allowed for flexibility of implementation and quick iteration. If there was a bug or a design change, you might only have to swap out the microcode held in EPROMs or RAM, and the rest of your Am2900 building blocks/glue logic would remain untouched.
Today, FPGAs serve much the same purpose, and can swap out an entire CPU- not just microcode- in seconds. In most cases, designing a hardwired control unit on an FPGA will have acceptable iteration time and flexibility thanks to describing circuits in code. A hardwired implementation also makes it easier to implement speed optimizations like tight pipelining, which are difficult to reason about in microcode[2]. Thanks to changes in design process, there is probably no reason to do a performance-oriented microcode RISC-V implementation.
When optimizing for size, you still should probably hardwire your RISC-V core. There are several perfectly usable hardwired RISC-V CPUs in < 1000 LCs, such as:
I could probably design a minimal hardwired Sentinel that doesn’t have that much more circuitry than the current microcoded version. However, I set a goal of a complete RISC-V implementation including M-Mode in ~1000 LCs. It wasn’t clear to me- and still isn’t- that a non-microcode RISC-V implementation could fit in 1000 LCs without making concessions that I didn’t want to[3], such as:
Limit datapath width.
Remove IRQs and M-Mode.
Do not handle illegal instructions.
Even if I did a hardwired control unit, I knew that I was going to be multicycle and not tightly pipelined. My own experience is that a basic pipelined 32-bit RISC CPU takes at least 2000 iCE40 LCs[4] thanks to pipeline control logic. At that point, for any design meeting my requirements, I figured the speed of a microcoded and a hardwired RISC-V without pipeline control would be similar.
Around the same time in late 2020[5] was when I found out about Mick and Brick. I found the book fascinating (and still do), so I was already looking for an excuse to write a microcode for fun. Additionally, I realized I could put a microcode to good use by leveraging FPGA block RAM to hold the microcode program. This gave me more precious LUT breathing room that would otherwise be used by a hardwired implementation. Since I already wasn’t expecting my implementation to be fast, all of a sudden I found the potential space savings of a microcode very appealing.
Via creative uses of block RAM and microcoding, Sentinel itself reached my goal
of ~1000 LCs while implementing RV32I_Zicsr and M-Mode. However, it’s still
hit or miss whether a full SoC
fits into my target ICE40HX1K FPGA (1280 LCs max), depending on yosys
optimizations and Amaranth changes. We’ll see what the future holds.
Additionally, while not every application needs a super fast CPU, there is plenty of room for Sentinel and other small RISC-V implementations to coexist! Users need to decide for themselves if Sentinel fits their needs.
Terminology
Mick and Brick introduces some jargon that I use in Sentinel:
Note
This list is probably incomplete.
- Condition Code Multiplexer
A multiplexer of various conditional tests. The output of this multiplexer, selected by the microcode, becomes a test input to the Sequencer. The conditional test result can often be inverted by microcode to double the number of possible tests.
Test conditions used by Sentinel include:
Is the ALU output zero/nonzero?
Is a memory access complete/incomplete?
Did an exception occur/not occur?
Unconditionally true/false.
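As a rough illustration of the idea, the multiplexer and its invert bit can be modelled in a few lines of Python. This is a behavioural sketch only: the selector values mirror the `CondTest` field documented below, while the function name and its keyword arguments are made up for this example and are not part of Sentinel.

```python
from enum import IntEnum


class CondTest(IntEnum):
    """Test selectors, mirroring the CondTest microcode field."""
    EXCEPTION = 0
    CMP_ALU_O_ZERO = 1
    MEM_VALID = 2
    TRUE = 3


def condition_mux(sel: CondTest, invert: bool, *, exception: bool,
                  alu_o: int, mem_valid: bool) -> bool:
    """Return the test result handed to the sequencer this cycle.

    The keyword arguments are hypothetical stand-ins for datapath status lines.
    """
    raw = {
        CondTest.EXCEPTION: exception,
        CondTest.CMP_ALU_O_ZERO: alu_o == 0,
        CondTest.MEM_VALID: mem_valid,
        CondTest.TRUE: True,
    }[sel]
    # XOR with the invert bit: four selectable tests become eight.
    return raw != invert
```

With `invert` set, `CMP_ALU_O_ZERO` becomes an "ALU output is nonzero" test and `TRUE` becomes "never".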
- Macroinstruction
A unit of execution from the CPU’s instruction set, composed of microinstructions. In Sentinel’s case, macroinstructions are RISC-V instructions.
- Mapping (P)ROM
A (P)ROM which maps the macroinstruction opcode to microprogram jump targets. It is the hardware version of a jump table, where the jump index is retrieved from the macroinstruction opcode.
Each macroinstruction is a loop through the microprogram. The mapping (P)ROM jumps from microcode common to all macroinstructions to microcode specific to each (group of) macroinstruction(s).
In Sentinel, the Mapping “(P)ROM” is implemented in combinational logic.
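The jump-table idea can be sketched in Python. The base addresses below are read off the `origin 8` and `origin 0x80` directives in the default microcode listing, where the load and store entry points appear in `funct3` order. The function itself is hypothetical and covers only loads and stores; the real decoder is combinational logic and may be organized quite differently.

```python
# RISC-V major opcodes (insn[6:0]) for loads and stores.
LOAD, STORE = 0x03, 0x23

# Assumption: entry points start at these microcode addresses, as suggested by
# the "origin" directives in the default microcode (lb_1 at 8, sb_1 at 0x80).
BASE = {LOAD: 0x08, STORE: 0x80}


def map_address(insn: int):
    """Return a microcode entry address, or None if this sketch has no entry."""
    opcode = insn & 0x7F
    funct3 = (insn >> 12) & 0x7
    base = BASE.get(opcode)
    return None if base is None else base + funct3
```

For example, `lw x1, 0(x2)` encodes as `0x00012083`, and `map_address(0x00012083)` yields `0x0A`, the address where the `lw_1` label falls in the listing.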
- Microinstruction
A microprogram/microcode instruction. Macroinstructions are composed of multiple microinstructions. Each microinstruction takes one clock cycle.
- Microprogram Counter
Register whose value is the address of the microcode instruction which will execute on the next clock cycle, assuming the sequencer chooses to use it.
- Pipeline Register
In microcode, the pipeline register specifically refers to a holding register containing the bits of the currently-executing microinstruction.
In Sentinel, the pipeline register is part of the synchronous read port of the Block RAM holding the microprogram. The address input to the microprogram Block RAM is the output of the sequencer; the data for the microinstruction at this address appears on the read port on the next clock cycle.
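A toy Python model of that one-cycle read latency, assuming nothing beyond "address in this cycle, data out the next". The class and variable names are hypothetical.

```python
class SyncReadRom:
    """Toy model of a block RAM with a registered (synchronous) read port."""

    def __init__(self, words):
        self.words = list(words)
        self.read_port = None  # nothing has been clocked out yet

    def clock(self, address):
        """Model one active clock edge: sample the address, update the port."""
        self.read_port = self.words[address]
        return self.read_port


rom = SyncReadRom(["uinsn0", "uinsn1", "uinsn2"])
trace = []
for cycle, addr in enumerate([0, 1, 2]):
    executing = rom.read_port   # microinstruction active during this cycle
    trace.append((cycle, addr, executing))
    rom.clock(addr)             # clock edge at the end of the cycle
```

In the resulting trace, the microinstruction executing in a given cycle is always the one whose address the sequencer supplied one cycle earlier.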
- Sequencer
Component which supplies the address of the microinstruction which will be output on the next clock cycle. It chooses between various sources based on a test condition provided by the Condition Code Multiplexer.
Sources used by Sentinel include:
The microprogram counter.
An address constant in the microcode instruction.
A mapping PROM.
An implied constant 0.
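A minimal behavioural sketch of the address selection, following the jump-type comments in the default microcode file (`cont`, `map`, `direct`, `direct_zero`). It assumes `upc` is the address of the microinstruction currently executing and that addresses wrap at 256 entries. The function and parameter names are illustrative only.

```python
from enum import Enum


class JmpType(Enum):
    CONT = "cont"
    MAP = "map"
    DIRECT = "direct"
    DIRECT_ZERO = "direct_zero"


def next_address(jmp: JmpType, test: bool, *, upc: int, target: int,
                 mapped: int) -> int:
    """Pick the address of the microinstruction for the next cycle."""
    sequential = (upc + 1) & 0xFF       # microprogram counter source
    if jmp is JmpType.CONT:
        return sequential
    if jmp is JmpType.MAP:
        # Per the microcode comments: decoder address if the test fails,
        # otherwise an unconditional jump to target.
        return target if test else mapped
    if jmp is JmpType.DIRECT:
        return target if test else sequential
    return target if test else 0        # DIRECT_ZERO: implied constant 0
```

Each of the four address sources in the list above appears exactly once as a fallback or jump destination.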
Microcode Fields
Microcode field classes for main microcode file.
This file is used to avoid circular imports and to serve as a single source of truth for the meaning of microcode fields. Each variable defined in this module corresponds to an m5meta field in the default microcode file.
Microcode field order is determined by the microcode assembly file; the order of
fields in this module does not matter. However, for consistency, we try to match
the microcode.asm order.
The default/main microcode file is stored with the Sentinel package in the same
directory as this file, at microcode.asm.
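As a sanity check on the field list, the Python sketch below adds up the field widths declared in the default microcode file. It assumes that each enum field takes the minimum number of bits for its value count and that fields are packed upward from bit 0 in declaration order. Neither assumption is confirmed here; the total does come out to the declared word width of 48, which is at least consistent with them.

```python
def enum_bits(n_values: int) -> int:
    """Minimum number of bits needed to encode n_values distinct values."""
    return max(1, (n_values - 1).bit_length())


# (name, width) transcribed from the "fields block_ram" declaration.
# Enum fields are given by their number of distinct values.
FIELDS = [
    ("target", 8),
    ("jmp_type", enum_bits(4)),    # cont and nop share the value 0
    ("cond_test", enum_bits(4)),
    ("invert_test", 1),
    ("pc_action", enum_bits(3)),
    ("latch_a", 1),
    ("latch_b", 1),
    ("a_src", enum_bits(6)),
    ("b_src", enum_bits(8)),
    ("alu_op", enum_bits(9)),
    ("alu_i_mod", enum_bits(2)),
    ("alu_o_mod", enum_bits(3)),
    ("reg_read", 1),
    ("reg_write", 1),
    ("reg_r_sel", enum_bits(2)),
    ("reg_w_sel", enum_bits(2)),
    ("csr_op", enum_bits(3)),
    ("csr_sel", enum_bits(2)),
    ("mem_req", 1),
    ("mem_sel", enum_bits(4)),
    ("mem_extend", enum_bits(2)),
    ("latch_adr", 1),
    ("latch_data", 1),
    ("write_mem", 1),
    ("insn_fetch", 1),
    ("except_ctl", enum_bits(7)),
]


def layout(fields):
    """Return ({name: (lsb, width)}, total_width), packing from bit 0 upward."""
    offsets, lsb = {}, 0
    for name, width in fields:
        offsets[name] = (lsb, width)
        lsb += width
    return offsets, lsb


OFFSETS, TOTAL_WIDTH = layout(FIELDS)
```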
- sentinel_cpu.ucodefields.Target = unsigned(8)
Jump target supplied by the currently-executing microinstruction. Occasionally used to supply a constant value, like in CSRSel.
- class sentinel_cpu.ucodefields.JmpType(*values)
Type of jump to perform for this microinstruction.
- CONT = 0
On the next cycle go to the next sequential microinstruction (upc + 1).
- NOP = 0
An alias for CONT, meant to indicate that the target field is being used for something else.
- MAP = 1
Jump to the address supplied by the MappingROM if condition is not met. Otherwise, unconditionally jump to the address supplied by Target. This is generally used to jump to code specific to each macroinstruction, or start exception handling on an invalid instruction.
- DIRECT = 2
If condition is met, jump to the address supplied by Target. Otherwise, go to the next sequential microinstruction, as in CONT.
- class sentinel_cpu.ucodefields.OpType(*values)
ALU operation to perform this cycle.
On the next active edge, ALU output (O) will be equal to the result of the operation performed using its A and B inputs.
- class sentinel_cpu.ucodefields.CondTest(*values)
Conditional test to pass through to the Sequencer.
- EXCEPTION = 0
Set if an exception occurred this clock cycle.
When InvertTest is asserted, set if an exception did not occur this clock cycle.
- CMP_ALU_O_ZERO = 1
Set if the ALU output is 0 this clock cycle.
When InvertTest is asserted, set if the ALU output is nonzero this clock cycle.
- MEM_VALID = 2
Set if the contents of the memory bus are valid this cycle.
When InvertTest is asserted, set if the contents of the memory bus are not valid.
The memory bus is valid when sentinel_cpu.top.Top.bus.ack in Top is asserted.
- TRUE = 3
Unconditionally set/asserted. When InvertTest is asserted, the test unconditionally fails.
- sentinel_cpu.ucodefields.InvertTest = unsigned(1)
If set, invert the result of the conditional test on the output of CondTest this clock cycle.
- class sentinel_cpu.ucodefields.PcAction(*values)
Perform an action on the RISC-V Program Counter this cycle.
- LOAD_ALU_O = 2
Set the Program Counter to the value currently on the ALU output.
- sentinel_cpu.ucodefields.LatchA = unsigned(1)
- sentinel_cpu.ucodefields.LatchB = unsigned(1)
- class sentinel_cpu.ucodefields.ASrc(*values)
Select the source for the ALU A input.
The ALU A input is provided by the latched output of ASrcMux; this field is qualified by LatchA.
- ALU_O = 2
Feed back the ALU output into the input. Intended to facilitate chaining ALU ops together.
- class sentinel_cpu.ucodefields.BSrc(*values)
Select the source for the ALU B input.
The ALU B input is provided by the latched output of BSrcMux; this field is qualified by LatchB.
- PC = 1
Select the Program Counter register.
- DAT_R = 4
Select the unregistered Wishbone read data bus value. The read data bus is only valid when indicated by MEM_VALID.
- CSR_IMM = 5
Some RISC-V CSR instructions have an Immediate field that differs from IMM; select the CSR Immediate field instead.
- CSR = 6
Select CSR register that was read from the CSR reg file last cycle.
- MCAUSE_LATCH = 7
Select the current value of the MCAUSE latch.
- class sentinel_cpu.ucodefields.ALUIMod(*values)
Modify ALU inputs before performing ALU op.
This field modifies the ALU inputs A and B just before they are sent to the ALU. Set this field to a value other than NONE on the same cycle as the ALU op you wish to modify. The ALU output (O), modified or otherwise, will be available on the next active edge.
Modifying the inputs is useful to implement additional ALU operations, such as signed compare using an unsigned comparator.
- NONE = 0
Pass through A and B to the ALU unchanged.
- INV_MSB_A_B = 1
Invert the most-significant bit of A and B before performing OP.
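The signed-compare trick can be checked in Python. Inverting the most-significant bit of both operands maps the signed range onto the unsigned range in the same order, so an unsigned less-than comparator then gives the signed answer. The function names are illustrative: `cmp_ltu` borrows the name of the ALU op in the default microcode, and `cmp_lt` corresponds to its `CMP_LT` macro.

```python
MASK32 = 0xFFFF_FFFF
MSB = 0x8000_0000


def cmp_ltu(a: int, b: int) -> int:
    """Unsigned 32-bit less-than; returns 1 or 0."""
    return int((a & MASK32) < (b & MASK32))


def cmp_lt(a: int, b: int) -> int:
    """Signed less-than built from the unsigned comparator (INV_MSB_A_B)."""
    return cmp_ltu(a ^ MSB, b ^ MSB)


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & MSB else x
```

For instance, `0xFFFFFFFF` (that is, -1) compared against `1` is "not less" as unsigned, but `cmp_lt(0xFFFFFFFF, 1)` returns 1.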
- class sentinel_cpu.ucodefields.ALUOMod(*values)
Modify the result of the currently-executing ALU op.
This field modifies the raw ALU result just before storing the result in O on the next active edge. In other words, this field must be set on the same cycle as when the ALU op you wish to modify is taking place.
Modifying the output is useful for synthesizing additional ALU operations, such as “compare-greater-than-or-equal” or JALR targets.
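A small Python sketch of the two uses mentioned above. The modifier names `inv_lsb_o` and `clear_lsb_o` come from the `alu_o_mod` enum in the default microcode; the helper functions around them are hypothetical. Since a compare yields 0 or 1, flipping the least-significant bit turns "less than" into "greater than or equal", and clearing it gives the even target address that RISC-V requires for `JALR`.

```python
MASK32 = 0xFFFF_FFFF


def inv_lsb_o(result: int) -> int:
    """Invert bit 0 of the raw ALU result (same as XOR with 1)."""
    return (result ^ 1) & MASK32


def clear_lsb_o(result: int) -> int:
    """Force bit 0 of the raw ALU result to zero."""
    return result & ~1 & MASK32


def cmp_geu(a: int, b: int) -> int:
    """Unsigned greater-than-or-equal, synthesized from less-than."""
    lt = int((a & MASK32) < (b & MASK32))
    return inv_lsb_o(lt)


def jalr_target(rs1: int, imm: int) -> int:
    """JALR target: (rs1 + imm) with the least-significant bit cleared."""
    return clear_lsb_o((rs1 + imm) & MASK32)
```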
- sentinel_cpu.ucodefields.RegRead = unsigned(1)
If set, read from the register file this cycle. The results will be valid and available on the read port on the next active edge. The read value will stay valid on the read port until the subsequent active edge where RegRead is asserted or CSROp is not NONE.
The register file is transparent; a write and read to/from the same address on the same cycle will use the value to-be-written on the read port on the next active edge.
This field has no effect if CSROp is not NONE.
Todo
I need to verify what happens when we RegWrite to the same address with deasserted read-enable. Will it “blow away” the current read port value?
- sentinel_cpu.ucodefields.RegWrite = unsigned(1)
If set, write to the register file this cycle. The write will be valid on the next active edge.
- class sentinel_cpu.ucodefields.RegRSel(*values)
Select the register to read from the register file.
- class sentinel_cpu.ucodefields.RegWSel(*values)
Select the register to write in the register file.
- class sentinel_cpu.ucodefields.CSROp(*values)
Select operation on CSR file.
- NONE = 0
Do a read and/or write to the register file this cycle.
This variant qualifies RegRead, RegWrite, RegRSel, and RegWSel; CSRSel has no effect when this variant is selected.
- class sentinel_cpu.ucodefields.CSRSel(*values)
Select register from CSR file to read or write.
This field has no effect if CSROp is NONE.
- INSN_CSR = 0
Select the CSR register specified by the compressed CSR address, derived from the current instruction.
- TRG_CSR = 1
Select the CSR register specified by Target, using the compressed address encoding.
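The compressed encoding itself is not spelled out here. The Python sketch below is a guess: it shows one mapping that happens to reproduce the CSR constants in the default microcode (`MSTATUS` 0, `MIE` 0x4, `MTVEC` 0x5, `MSCRATCH` 0x8, `MEPC` 0x9, `MCAUSE` 0xA, `MIP` 0xC) from the standard RISC-V machine-mode CSR addresses. It is not taken from Sentinel's decoder, and the function name is hypothetical.

```python
# Standard RISC-V machine-mode CSR addresses.
RISCV_CSRS = {
    "mstatus": 0x300, "mie": 0x304, "mtvec": 0x305,
    "mscratch": 0x340, "mepc": 0x341, "mcause": 0x342, "mip": 0x344,
}


def compress_csr(addr: int) -> int:
    """Assumed encoding: CSR address bit 6 becomes bit 3; bits 2..0 are kept."""
    return ((addr >> 3) & 0x8) | (addr & 0x7)


COMPRESSED = {name: compress_csr(addr) for name, addr in RISCV_CSRS.items()}
```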
- sentinel_cpu.ucodefields.MemReq = unsigned(1)
If set, set Wishbone CYC_O and STB_O to the asserted state, indicating that a memory transfer is imminent. This signal also qualifies AddressAlign outputs.
As per the Wishbone spec, since Sentinel does not use wait states, tying CYC_O and STB_O to the same signal is sound. See Permission 3.40.
- class sentinel_cpu.ucodefields.MemSel(*values)
Select memory transfer type in progress.
This field indirectly controls the Wishbone SEL_O, DAT_I (for reads/loads that are not instruction fetches), and DAT_O lines (writes/stores). See sentinel_cpu.align for more information.
- AUTO = 0
Memory access is instruction fetch or none at all; data width and SEL_O are determined automatically.
- BYTE = 1
Memory access is 8-bit; only one of bits 0, 1, 2, and 3 of SEL_O is asserted. Read and write data will be shifted appropriately.
- HWORD = 2
Memory access is 16-bit; either bits 0 and 1 or 2 and 3 of SEL_O are asserted. Read and write data will be shifted appropriately.
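A sketch of how the byte-select pattern could be derived from the transfer type and the low address bits. It assumes little-endian byte lanes (lane n carries the byte at address offset n), an aligned halfword address, and that word-sized and automatic accesses select all four lanes. The function is illustrative and is not the `sentinel_cpu.align` implementation.

```python
def sel_o(mem_sel: str, byte_addr: int) -> int:
    """Return a 4-bit SEL_O pattern for the given transfer type and address."""
    if mem_sel == "byte":
        return 1 << (byte_addr & 0b11)       # exactly one lane
    if mem_sel == "hword":
        return 0b11 << (byte_addr & 0b10)    # lanes 0-1 or lanes 2-3
    return 0b1111                            # word or auto: all four lanes
```

For example, a byte access at offset 3 within a word selects only lane 3 (`0b1000`), while a halfword access at offset 2 selects lanes 2 and 3 (`0b1100`).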
- class sentinel_cpu.ucodefields.MemExtend(*values)
Extend read data to WORD width.
Sentinel CPU directly reads the DAT_I Wishbone signal when performing instruction fetches and loads. Fetches are always WORD sized, but loads can be variable-sized. RISC-V specifies that loads less than WORD width should have the unused bits filled/extended with either 0 (unsigned/signed) or 1 (signed).
This field will make sure BYTE and HWORD loads are properly extended before latching data for further use by Sentinel. It has no effect for WORD or AUTO loads.
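Zero and sign extension to 32 bits can be written out in Python as follows. This is a generic sketch of what the RISC-V load instructions require, with a made-up function name; it does not reflect how Sentinel wires the extension.

```python
MASK32 = 0xFFFF_FFFF


def extend(value: int, width: int, signed: bool) -> int:
    """Extend the low `width` bits of value to a 32-bit word."""
    low_mask = (1 << width) - 1
    value &= low_mask
    if signed and (value >> (width - 1)) & 1:
        value |= MASK32 & ~low_mask   # fill the upper bits with 1s
    return value                      # otherwise the upper bits stay 0
```

A signed byte load of `0x80` therefore yields `0xFFFFFF80` (as `LB` would), while the unsigned variant yields `0x00000080` (as `LBU` would).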
- sentinel_cpu.ucodefields.LatchAdr = unsigned(1)
If set, latch the ALU output into an internal register representing the raw byte address for an upcoming Wishbone memory transaction. This internal register indirectly controls the Wishbone ADR_O and SEL_O lines via AddressAlign. Used for both Wishbone reads and writes.
- sentinel_cpu.ucodefields.LatchData = unsigned(1)
If set, latch write data into an internal register which directly drives the Wishbone signal DAT_O. The data will be appropriately aligned for an upcoming Wishbone write, based upon the contents of the internal address register controlled by LatchAdr. Used only for Wishbone writes.
- sentinel_cpu.ucodefields.WriteMem = unsigned(1)
If set, set Wishbone WE_O to the asserted state, indicating a Wishbone write. Not used by other core components.
- sentinel_cpu.ucodefields.InsnFetch = unsigned(1)
If set, indicate that the current Wishbone transaction is an instruction fetch. Currently, this signal overrides address alignment behavior so that instruction fetches will succeed. In the future, this signal will also be used for a Wishbone tag of some sort.
Instruction decode begins automatically upon receipt of Wishbone ACK_I.
- class sentinel_cpu.ucodefields.ExceptCtl(*values)
Perform a variety of exception-handling related tasks.
- LATCH_DECODER = 1
Check Decode for exceptions and latch results into ExceptionRouter this cycle.
- LATCH_JAL = 2
Use ExceptionRouter to check whether a JAL triggered alignment exceptions this cycle. Valid only when the current instruction is in fact a JAL.
- LATCH_STORE_ADR = 3
Use ExceptionRouter to check whether a store triggered alignment exceptions this cycle. Valid only when the current instruction is in fact a store.
- LATCH_LOAD_ADR = 4
Use ExceptionRouter to check whether a load triggered alignment exceptions this cycle. Valid only when the current instruction is in fact a load.
Default Microcode Annotated Source
Many jump addresses are hardcoded by the mapping PROM. Since there is only room for 256 instructions, the remaining required jumps go to wherever there is extra room. With that said:
I try to keep instructions with similar functionality (“major opcode”) together.
I try to avoid backward jumps, except for jumping to the next macroinstruction, but they are sometimes unavoidable (see the beq and bne labels).
space block_ram: width 48, size 256;
space block_ram;
origin 0;
// Microcode fields in this space correspond to classes defined in
// ucodefields.py. The ordering of microcode fields is taken from this file.
// Width and enum field names are validated against the Amaranth source after
// assembly.
//
// Comments are included for convenience, and efforts are made to ensure
// they don't contradict comments in ucodefields.py. In case of conflict,
// ucodefields.py comments take priority.
fields block_ram: {
// Target field for direct jmp_type. The micropc jumps to here next
// cycle if the test succeeds.
target: width 8, origin 0, default 0;
// Various jump types to jump around the microcode program next cycle.
// cont: Increment upc by 1.
// nop: Same as cont, but indicate we are using the target field for
// something else.
// map: Use address supplied by decoder if test fails. Otherwise, unconditional
// direct.
// direct: Conditionally use address supplied by target field. Otherwise,
// cont.
// direct_zero: Conditionally use address supplied by target field. Otherwise,
// 0.
jmp_type: enum { cont = 0; nop = 0; map; direct; direct_zero; }, default cont;
// Various tests (valid current cycle) for conditional jumps:
// int: Is interrupt line high?
// exception: Illegal insn, EBREAK, ECALL, misaligned insn, misaligned ld/st?
// mem_valid: Is current dat_r valid? Did write finish?
// true: Unconditionally succeed
cond_test: enum { exception; cmp_alu_o_zero; mem_valid; true}, default true;
// Invert the results of the test above. Valid current cycle.
invert_test: bool, default 0;
// Modify the PC for the next cycle.
pc_action: enum { hold = 0; inc; load_alu_o; }, default hold;
// ALU src latch/selection.
latch_a: bool, default 0;
latch_b: bool, default 0;
a_src: enum { gp = 0; imm; alu_o; zero; four; thirty_one; }, default gp;
b_src: enum { gp = 0; pc; imm; one; dat_r; csr_imm; csr; mcause_latch }, default gp;
// Latch the A/B inputs into the ALU. Contents valid next cycle.
alu_op: enum { add = 0; sub; and; or; xor; sll; srl; sra; cmp_ltu; }, default add;
// Modify inputs and outputs to ALU.
alu_i_mod: enum { none = 0; inv_msb_a_b; }, default none;
alu_o_mod: enum { none = 0; inv_lsb_o; clear_lsb_o }, default none;
// Either read or write a register in the register file. _Which_ register
// to read/write comes from the decoded insn.
// Read contents will be on the data bus the next cycle. When insn_rs1 is
// paired with insn_fetch, the address sent to the reg file comes directly
// from bits 15 to 20 on the WB DAT_R bus. Otherwise, the address sent to the
// reg file is retrieved from a holding register for bits 15 to 20 of the
// previously-decoded instruction word.
reg_read: bool, default 0;
reg_write: bool, default 0;
reg_r_sel: enum { insn_rs1 = 0; insn_rs2 = 1; }, default insn_rs1;
reg_w_sel: enum { insn_rd = 0; zero = 1; }, default insn_rd;
// CSR regs can either be read or written in a given cycle, but not both.
// CSR ops override reg_ops. This is technically a union.
csr_op: enum { none = 0; read_csr; write_csr }, default none;
csr_sel: enum { insn_csr; trg_csr }, default insn_csr;
// Start or continue a memory request. For convenience, an ack will
// automatically stop a memory request for the cycle after ack, even if
// mem_req is enabled. Valid on current cycle.
mem_req: bool, default 0;
mem_sel: enum { auto = 0; byte = 1; hword = 2; word = 3; }, default auto;
mem_extend: enum { zero = 0; sign = 1}, default zero;
// Latch data address register from ALU output.
latch_adr: bool, default 0;
latch_data: bool, default 0;
write_mem: bool, default 0;
// Current mem request is insn fetch. Valid on current cycle. If set w/
// mem_req, mem_sel ignored/calculated automatically.
insn_fetch: bool, default 0;
except_ctl: enum { none; latch_decoder; latch_jal; latch_store_adr; \
latch_load_adr; enter_int; leave_int; }, default none;
};
#define INSN_FETCH insn_fetch => 1, mem_req => 1
#define INSN_FETCH_EAGER_READ_RS1 INSN_FETCH, READ_RS1
#define SKIP_WAIT_IF_ACK jmp_type => direct_zero, cond_test => mem_valid, target => check_int
#define JUMP_TO_OP_END(trg) cond_test => true, jmp_type => direct, target => trg
#define NOT_IMPLEMENTED jmp_type => direct, target => panic
#define NOP target => 0
#define READ_RS1 reg_read => 1, reg_r_sel => insn_rs1
#define READ_RS2 reg_read => 1, reg_r_sel => insn_rs2
#define WRITE_RD reg_write => 1
#define WRITE_RD_CSR csr_op => write_csr
#define READ_RS1_WRITE_RD READ_RS1, reg_write => 1, reg_w_sel => insn_rd
#define CMP_LT alu_op => cmp_ltu, alu_i_mod => inv_msb_a_b
#define CMP_GEU alu_op => cmp_ltu, alu_o_mod => inv_lsb_o
#define CMP_GE alu_op => cmp_ltu, alu_i_mod => inv_msb_a_b, alu_o_mod => inv_lsb_o
// The LT[U]/GE[U] tests will either return zero or one; this makes it fine
// to reuse the conditional meant for shift ops.
#define CONDTEST_ALU_ZERO cond_test => cmp_alu_o_zero
// HINT: alu_o_mod -> inv_lsb_o can be used to
// implement a check for ALU output being exactly one. Can
// this be utilized anywhere?
// Also, inv_lsb_o does the same as XOR 1. So ((A XOR 1)) XOR 1 is a no-op,
// if a bit convoluted.
#define CONDTEST_ALU_NONZERO invert_test => 1, cond_test => cmp_alu_o_zero
#define JUMP_TO_ZERO cond_test => true, invert_test=> true, jmp_type => direct_zero
#define STOP_MEMREQ_THEN_JUMP_TO_ZERO mem_req=>0, JUMP_TO_ZERO
// CSR Register addresses in private RAM
#define MSTATUS 0
#define MIE 0x4
#define MTVEC 0x5
#define MSCRATCH 0x8
#define MEPC 0x9
#define MCAUSE 0xA
#define MIP 0xC
fetch:
wait_for_ack: INSN_FETCH_EAGER_READ_RS1, invert_test => 1, cond_test => mem_valid, \
jmp_type => direct, target => wait_for_ack;
// Illegal insn or insn misaligned exception possible
check_int: jmp_type => map, a_src => gp, latch_a => 1, READ_RS2, \
except_ctl => latch_decoder, cond_test => exception, \
target => save_pc;
origin 2;
// Make sure x0 is initialized with 0. PC might not be valid, depending
// on which microcycle a reset or clock enable (if applicable) was
// asserted/deasserted. So reset PC to zero also.
// Additionally, MCAUSE CSR is nominally a copy of a latch, but it also
// should be 0 (for our implementation) after reset.
//
// Stale microcode exists on microcode ROM read port for one cycle after
// non-power-on-resets, since read port lags by one cycle except for
// after POR. The effects of stale microcode appear on the second cycle
// after reset. This has the following consequences which we exploit:
// * Spec mandates MSTATUS.MIE is zero after reset. The ALU output is
// initialized to 0 upon reset, so stale microcode on read port will
// never write a non-zero value to registers.
// * One full cycle after reset was deasserted, we can make no assumptions
// about ALU contents. So we must explicitly reinitialize the ALU to 0.
reset: latch_a => 1, latch_b => 1, b_src => one, a_src => zero;
alu_op => and;
alu_op => and, reg_write => 1, reg_w_sel => zero;
jmp_type => direct_zero, pc_action => load_alu_o, csr_op => write_csr, \
csr_sel => trg_csr, invert_test => 1, cond_test => true, \
target => MCAUSE;
origin 8;
lb_1: latch_b => 1, b_src => imm, pc_action => inc, jmp_type => direct, \
target => lb;
lh_1: latch_b => 1, b_src => imm, jmp_type => direct, target => lh;
lw_1: latch_b => 1, b_src => imm, jmp_type => direct, target => lw;
NOT_IMPLEMENTED;
lbu_1: latch_b => 1, b_src => imm, pc_action => inc, jmp_type => direct, \
target => lbu;
lhu_1: latch_b => 1, b_src => imm, jmp_type => direct, target => lhu;
lb: alu_op => add;
latch_adr => 1;
lb_wait: a_src => zero, b_src => dat_r, latch_a => 1, latch_b => 1, mem_req => 1, invert_test => 1, \
cond_test => mem_valid, mem_sel => byte, mem_extend => sign, jmp_type => direct, \
target => lb_wait;
alu_op => add, JUMP_TO_OP_END(fast_epilog);
lh: alu_op => add;
latch_adr => 1, except_ctl => latch_load_adr, mem_sel => hword, \
jmp_type => direct, cond_test => exception, target => save_pc;
lh_wait: a_src => zero, b_src => dat_r, latch_a => 1, latch_b => 1, mem_req => 1, invert_test => 1, \
cond_test => mem_valid, mem_sel => hword, mem_extend => sign, jmp_type => direct, \
target => lh_wait;
alu_op => add, pc_action => inc, JUMP_TO_OP_END(fast_epilog);
lw: alu_op => add;
latch_adr => 1, except_ctl => latch_load_adr, mem_sel => word, \
jmp_type => direct, cond_test => exception, target => save_pc;
lw_wait: a_src => zero, b_src => dat_r, latch_a => 1, latch_b => 1, mem_req => 1, invert_test => 1, \
cond_test => mem_valid, mem_sel => word, jmp_type => direct, \
target => lw_wait;
alu_op => add, pc_action => inc, JUMP_TO_OP_END(fast_epilog);
lbu: alu_op => add;
latch_adr => 1;
lbu_wait: a_src => zero, b_src => dat_r, latch_a => 1, latch_b => 1, mem_req => 1, invert_test => 1, \
cond_test => mem_valid, mem_sel => byte, jmp_type => direct, \
target => lbu_wait;
alu_op => add, JUMP_TO_OP_END(fast_epilog);
lhu: alu_op => add;
latch_adr => 1, except_ctl => latch_load_adr, mem_sel => hword, \
jmp_type => direct, cond_test => exception, target => save_pc;
lhu_wait: a_src => zero, b_src => dat_r, latch_a => 1, latch_b => 1, mem_req => 1, invert_test => 1, \
cond_test => mem_valid, mem_sel => hword, jmp_type => direct, \
target => lhu_wait;
alu_op => add, pc_action => inc, JUMP_TO_OP_END(fast_epilog);
origin 0x24;
// CSR ops take two cycles to decode. This is effectively a no-op in case
// there's an illegal CSR access or something.
csr_trampoline: READ_RS1, jmp_type => map, except_ctl => latch_decoder, \
cond_test => exception, target => save_pc;
csrro0_1: a_src => zero, b_src => one, latch_a => 1, latch_b => 1, pc_action => inc, \
jmp_type => direct, target => csrro0;
csrw_1: a_src => zero, b_src => gp, latch_a => 1, latch_b => 1, pc_action => inc, \
jmp_type => direct, target => csrwi;
csrrw_1: csr_op => read_csr, csr_sel => insn_csr, a_src => zero, latch_a => 1, \
b_src => gp, latch_b => 1, pc_action => inc, jmp_type => direct, \
target => csrrwi;
csrr_1: csr_op => read_csr, csr_sel => insn_csr, a_src => zero, latch_a => 1, \
pc_action => inc, jmp_type => direct, target => csrr;
csrrs_1: csr_op => read_csr, csr_sel => insn_csr, a_src => zero, latch_a => 1, \
pc_action => inc, jmp_type => direct, target => csrrs;
csrrc_1: csr_op => read_csr, csr_sel => insn_csr, a_src => zero, latch_a => 1, \
latch_b => 1, b_src => one, pc_action => inc, jmp_type => direct, \
target => csrrc;
csrwi_1: a_src => zero, b_src => csr_imm, latch_a => 1, latch_b => 1, pc_action => inc, \
jmp_type => direct, target => csrwi;
csrrwi_1: csr_op => read_csr, csr_sel => insn_csr, a_src => zero, b_src => csr_imm, \
latch_a => 1, latch_b => 1, pc_action => inc, jmp_type => direct, \
target => csrrwi;
csrrsi_1: csr_op => read_csr, csr_sel => insn_csr, a_src => zero, latch_a => 1, \
pc_action => inc, jmp_type => direct, target => csrrsi;
csrrci_1: csr_op => read_csr, csr_sel => insn_csr, a_src => zero, latch_a => 1, \
latch_b => 1, b_src => one, pc_action => inc, jmp_type => direct, \
target => csrrci;
origin 0x30;
misc_mem: pc_action => inc, jmp_type => direct, target => fetch;
csrro0: alu_op => and, JUMP_TO_OP_END(fast_epilog);
csrr: latch_b => 1, b_src => csr;
alu_op => add, JUMP_TO_OP_END(fast_epilog);
csrwi: alu_op => add, JUMP_TO_OP_END(fast_epilog_csr);
csrrwi: alu_op => add, latch_b => 1, b_src => csr; // Latch old CSR value, pass thru new.
WRITE_RD_CSR, alu_op => add, JUMP_TO_OP_END(fast_epilog);
csrrsi: latch_b => 1, b_src => csr;
alu_op => add, b_src => csr_imm, latch_b => 1;
csrrs_2: WRITE_RD, a_src => alu_o, latch_a => 1; // Feed back old CSR value.
alu_op => or, JUMP_TO_OP_END(fast_epilog_csr);
csrrci: latch_b => 1, b_src => csr, alu_op => sub; // Synthesize -1 on ALU_O
// TODO: Unlike GP reads, csr_ops are not sticky. Maybe they should be?
csr_op => read_csr, csr_sel => insn_csr, alu_op => add, a_src => alu_o, \
b_src => csr_imm, latch_a => 1, latch_b => 1;
csrrc_2: WRITE_RD, b_src => csr, latch_b => 1, alu_op => xor; // Bit Clear = A & ~B
a_src => alu_o, latch_a => 1;
alu_op => and, JUMP_TO_OP_END(fast_epilog_csr);
origin 0x40;
addi_1: latch_b => 1, b_src => imm, pc_action => inc, jmp_type => direct, \
target => addi;
slli_1:
// All the Shift-Immediates pass through to the Shift-Register logic;
// the AND with 31 is harmless for SLL and SRL, and required for SRA
// because of a hardcoded 1 in the imm12.
READ_RS1, a_src => thirty_one, latch_a => 1, b_src => imm, \
latch_b => 1, pc_action => inc, jmp_type => direct, \
target => sll;
slti_1: latch_b => 1, b_src => imm, pc_action => inc, jmp_type => direct, \
target => slti;
sltiu_1: latch_b => 1, b_src => imm, pc_action => inc, jmp_type => direct, \
target => sltiu;
xori_1: latch_b => 1, b_src => imm, pc_action => inc, jmp_type => direct, \
target => xori;
srli_1: READ_RS1, a_src => thirty_one, latch_a => 1, b_src => imm, \
latch_b => 1, pc_action => inc, jmp_type => direct, \
target => srl;
ori_1: latch_b => 1, b_src => imm, pc_action => inc, jmp_type => direct, \
target => ori;
andi_1: latch_b => 1, b_src => imm, pc_action => inc, jmp_type => direct, \
target => andi;
NOT_IMPLEMENTED; // 0b1000 subi?
csrrs: READ_RS1, latch_b => 1, b_src => csr;
alu_op => add, b_src => gp, latch_b => 1, jmp_type => direct, \
target => csrrs_2;
csrrc: READ_RS1, latch_b => 1, b_src => csr, alu_op => sub; // Synthesize -1 on ALU_O
csr_op => read_csr, csr_sel => insn_csr, alu_op => add, a_src => alu_o, \
b_src => gp, latch_a => 1, latch_b => 1, jmp_type => direct, target => csrrc_2;
srai_1: READ_RS1, a_src => thirty_one, latch_a => 1, b_src => imm, \
latch_b => 1, pc_action => inc, jmp_type => direct, \
target => sra;
origin 0x50;
auipc: latch_a => 1, latch_b => 1, a_src => imm, b_src => pc;
alu_op => add, pc_action => inc;
WRITE_RD, jmp_type => direct, cond_test => true, target => fetch;
addi: alu_op => add, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
slti: CMP_LT, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
sltiu: alu_op => cmp_ltu, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
xori: alu_op => xor, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
ori: alu_op => or, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
andi: alu_op => and, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
sll_loop:
// Subtract 1 from shift cnt, preliminarily save shift results
// in case we bail (microcode cannot be interrupted, so user
// will never see this intermediate result).
// Also write the previous shift, either from prolog or last
// loop iteration.
alu_op => sub, a_src => alu_o, latch_a => 1, WRITE_RD;
// Then, do the shift, and bail if the shift cnt reached zero.
alu_op => sll, a_src => alu_o, b_src => one, latch_a => 1, latch_b => 1, \
jmp_type => direct_zero, CONDTEST_ALU_NONZERO, target => sll_loop;
srl_loop:
// Same comments as sll_loop apply here.
alu_op => sub, a_src => alu_o, latch_a => 1, WRITE_RD;
alu_op => srl, a_src => alu_o, b_src => one, latch_a => 1, latch_b => 1, \
jmp_type => direct_zero, CONDTEST_ALU_NONZERO, target => srl_loop;
sra_loop:
alu_op => sub, a_src => alu_o, latch_a => 1, WRITE_RD;
// Then, do the shift, and bail if the shift cnt reached zero.
alu_op => sra, a_src => alu_o, b_src => one, latch_a => 1, latch_b => 1, \
jmp_type => direct_zero, CONDTEST_ALU_NONZERO, target => sra_loop;
origin 0x80;
sb_1: READ_RS2, latch_b => 1, b_src => imm, pc_action => inc, jmp_type => direct, \
target => sb;
sh_1: READ_RS2, latch_b => 1, b_src => imm, jmp_type => direct, target => sh;
sw_1: READ_RS2, latch_b => 1, b_src => imm, jmp_type => direct, target => sw;
predict_not_taken_neq:
// Old PC still available in ALU latches. Preemptively assume branch not
// taken and load new PC. Construct the jump target in case this was a bad
// assumption, and pass the old PC through.
pc_action => inc, a_src => zero, latch_a => 1, alu_op => add, \
CONDTEST_ALU_NONZERO, jmp_type => direct_zero, \
target => mispredict_branch_was_taken;
predict_not_taken_eq:
// Old PC still available in ALU latches. Preemptively assume branch not
// taken and load new PC. Construct the jump target in case this was a bad
// assumption, and pass the old PC through.
pc_action => inc, a_src => zero, latch_a => 1, alu_op => add, \
CONDTEST_ALU_ZERO, jmp_type => direct_zero, \
target => mispredict_branch_was_taken;
mispredict_branch_was_taken:
// If branch required, preemptively assume the address is good, and load
// the branch target into the PC. If this fails, the old PC will be
// available to roll back to before going to the exception handler.
alu_op => add, pc_action => load_alu_o, except_ctl => latch_jal, \
jmp_type => direct_zero, cond_test => exception, \
target => branch_exception_detected;
branch_exception_detected:
// Old PC is available on ALU output. We have an exception. Roll back the
// PC and begin the exception handler.
pc_action => load_alu_o, cond_test => true, jmp_type => direct, \
target => save_pc;
origin 0x88;
branch_ops:
beq_1: latch_b => 1, b_src => gp, jmp_type => direct, target => beq;
bne_1: latch_b => 1, b_src => gp, jmp_type => direct, target => bne;
NOT_IMPLEMENTED;
NOT_IMPLEMENTED;
blt_1: latch_b => 1, b_src => gp, jmp_type => direct, target => blt;
bge_1: latch_b => 1, b_src => gp, jmp_type => direct, target => bge;
bltu_1: latch_b => 1, b_src => gp, jmp_type => direct, target => bltu;
bgeu_1: latch_b => 1, b_src => gp, jmp_type => direct, target => bgeu;
beq: a_src => imm, b_src => pc, latch_a => 1, latch_b => 1, alu_op => sub, \
jmp_type => direct, target => predict_not_taken_eq;
bne: a_src => imm, b_src => pc, latch_a => 1, latch_b => 1, alu_op => sub, \
jmp_type => direct, target => predict_not_taken_neq;
blt: a_src => imm, b_src => pc, latch_a => 1, latch_b => 1, CMP_LT, \
jmp_type => direct, target => predict_not_taken_neq;
bge: a_src => imm, b_src => pc, latch_a => 1, latch_b => 1, CMP_GE, \
jmp_type => direct, target => predict_not_taken_neq;
bltu: a_src => imm, b_src => pc, latch_a => 1, latch_b => 1, alu_op => cmp_ltu, \
jmp_type => direct, target => predict_not_taken_neq;
bgeu: a_src => imm, b_src => pc, latch_a => 1, latch_b => 1, CMP_GEU, \
jmp_type => direct, target => predict_not_taken_neq;
origin 0x98;
jalr: b_src => imm, latch_b => 1;
jalr_shared:
// Bring in PC and prepare to construct PC + 4. Calculate jmp target.
latch_a => 1, latch_b => 1, a_src => four, b_src => pc, alu_op => add, \
alu_o_mod => clear_lsb_o;
// PC + 4 will be available on the next cycle, which fast_epilog will save
// into RD. If we had an exception, then we have to wait until the old PC
// is available, which is still latched in ALU B input.
// Preemptively load PC with jmp target.
a_src => zero, latch_a => 1, pc_action => load_alu_o, alu_op => add, \
except_ctl => latch_jal, jmp_type => direct, cond_test => exception, \
invert_test => 1, target => fast_epilog;
// Exception detected. Pass the old PC through, and then reload.
alu_op => add, jmp_type => direct, cond_test => true, \
target => branch_exception_detected;
sb: a_src => zero, b_src => gp, latch_a => 1, latch_b => 1, alu_op => add;
alu_op => add, latch_adr => 1;
mem_sel => byte, latch_data => 1;
sb_wait: mem_req => 1, invert_test => 1, cond_test => mem_valid, \
mem_sel => byte, write_mem => 1, jmp_type => direct, target => sb_wait;
STOP_MEMREQ_THEN_JUMP_TO_ZERO;
sh: a_src => zero, b_src => gp, latch_a => 1, latch_b => 1, alu_op => add;
alu_op => add, latch_adr => 1, except_ctl => latch_store_adr, mem_sel => hword, \
jmp_type => direct, cond_test => exception, target => save_pc;
mem_sel => hword, latch_data => 1, pc_action => inc;
sh_wait: mem_req => 1, invert_test => 1, cond_test => mem_valid, \
mem_sel => hword, write_mem => 1, jmp_type => direct, target => sh_wait;
STOP_MEMREQ_THEN_JUMP_TO_ZERO;
sw: a_src => zero, b_src => gp, latch_a => 1, latch_b => 1, alu_op => add;
alu_op => add, latch_adr => 1, except_ctl => latch_store_adr, mem_sel => word, \
jmp_type => direct, cond_test => exception, target => save_pc;
mem_sel => word, latch_data => 1, pc_action => inc;
sw_wait: mem_req => 1, invert_test => 1, cond_test => mem_valid, \
mem_sel => word, write_mem => 1, jmp_type => direct, target => sw_wait;
STOP_MEMREQ_THEN_JUMP_TO_ZERO;
origin 0xB0;
jal: a_src => imm, b_src => pc, latch_a => 1, latch_b => 1, \
jmp_type => direct, target => jalr_shared;
fast_epilog: INSN_FETCH_EAGER_READ_RS1, WRITE_RD, SKIP_WAIT_IF_ACK;
fast_epilog_csr: INSN_FETCH_EAGER_READ_RS1, WRITE_RD_CSR, SKIP_WAIT_IF_ACK;
origin 0xc0;
add_1: latch_b => 1, b_src => gp, pc_action => inc, jmp_type => direct, \
target => add;
// Re: READ_RS1... the reg values read out of the GP file are
// sticky, but as part of pipelining, we read out RS2's value
// during dispatch/check_int.
// We'll need RS1 again, so read it back.
sll_1: READ_RS1, a_src => thirty_one, latch_a => 1, b_src => gp, \
latch_b => 1, pc_action => inc, jmp_type => direct, \
target => sll;
slt_1: latch_b => 1, b_src => gp, pc_action => inc, jmp_type => direct, \
target => slt;
sltu_1: latch_b => 1, b_src => gp, pc_action => inc, jmp_type => direct, \
target => sltu;
xor_1: latch_b => 1, b_src => gp, pc_action => inc, jmp_type => direct, \
target => xor;
srl_1: READ_RS1, a_src => thirty_one, latch_a => 1, b_src => gp, \
latch_b => 1, pc_action => inc, jmp_type => direct, \
target => srl;
or_1: latch_b => 1, b_src => gp, pc_action => inc, jmp_type => direct, \
target => or;
and_1: latch_b => 1, b_src => gp, pc_action => inc, jmp_type => direct, \
target => and;
sub_1: latch_b => 1, b_src => gp, pc_action => inc, jmp_type => direct, \
target => sub; // 0b1000
NOT_IMPLEMENTED; // 0b1001
NOT_IMPLEMENTED; // 0b1010
NOT_IMPLEMENTED; // 0b1011
NOT_IMPLEMENTED; // 0b1100
sra_1: READ_RS1, a_src => thirty_one, latch_a => 1, b_src => gp, \
latch_b => 1, pc_action => inc, jmp_type => direct, \
target => sra;
origin 0xd0;
lui: a_src => zero, b_src => imm, latch_a => 1, latch_b => 1, pc_action => inc, \
jmp_type => direct, target => addi;
add: alu_op => add, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
slt: CMP_LT, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
sltu: alu_op => cmp_ltu, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
xor: alu_op => xor, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
or: alu_op => or, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
and: alu_op => and, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
sub: alu_op => sub, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
sll:
// Get first input to shift it. Restrict second input
// (shift count) to the range 0-31. Set up b_src for shift loop.
a_src => gp, latch_a => 1, b_src => one, latch_b => 1, alu_op => and;
// Do a shift, but also check if shift count was zero.
// If so, bail. Otherwise, we're all set for the main shift loop.
a_src => alu_o, latch_a => 1, alu_op => sll, \
jmp_type => direct, CONDTEST_ALU_NONZERO, target => sll_loop;
// Whoops, was a zero shift. Pass through original RS1 and write
// to dest!
a_src => zero, b_src => gp, latch_a => 1, latch_b => 1;
alu_op => add, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
srl:
// Same comments as sll apply here.
a_src => gp, latch_a => 1, b_src => one, latch_b => 1, alu_op => and;
a_src => alu_o, latch_a => 1, alu_op => srl, \
jmp_type => direct, CONDTEST_ALU_NONZERO, target => srl_loop;
a_src => zero, b_src => gp, latch_a => 1, latch_b => 1;
alu_op => add, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
sra:
// Same comments as sll apply here.
a_src => gp, latch_a => 1, b_src => one, latch_b => 1, alu_op => and;
a_src => alu_o, latch_a => 1, alu_op => sra, \
jmp_type => direct, CONDTEST_ALU_NONZERO, target => sra_loop;
a_src => zero, b_src => gp, latch_a => 1, latch_b => 1;
alu_op => add, INSN_FETCH, JUMP_TO_OP_END(fast_epilog);
// Interrupt handler.
origin 0xf0;
save_pc: except_ctl => enter_int, csr_op => read_csr, csr_sel => trg_csr, \
a_src => zero, b_src => pc, latch_a => 1, latch_b => 1, target => MTVEC;
// Latch MTVEC, pass thru PC.
alu_op => add, b_src => csr, latch_b => 1;
// Read mcause_latch, write MEPC, pass thru MTVEC.
alu_op => add, b_src => mcause_latch, latch_b => 1, csr_op => write_csr, \
csr_sel => trg_csr, target => MEPC;
// Write PC, pass thru mcause_latch.
alu_op => add, pc_action => load_alu_o;
// Write MCAUSE, and start exception handler.
INSN_FETCH, jmp_type => direct_zero, invert_test => 1, cond_test => true, \
csr_op => write_csr, csr_sel => trg_csr, target => MCAUSE;
origin 248;
mret: csr_op => read_csr, csr_sel => trg_csr, a_src => zero, latch_a => 1, target => MEPC;
// Latch MEPC
b_src => csr, latch_b => 1;
// Pass thru MEPC
alu_op => add;
// Write PC
pc_action => load_alu_o;
except_ctl => leave_int, INSN_FETCH, jmp_type => direct, target => fetch;
origin 254;
halt: jmp_type => direct, target => halt;
origin 255;
panic: jmp_type => direct, target => panic;
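
The sll, srl, and sra routines in the listing above mask the shift count to 0-31 and then shift one bit per loop iteration until the count reaches zero. The following Python is a behavioral sketch of that idea only; it is not cycle-accurate and does not model the ALU latches, the intermediate WRITE_RD, or the instruction fetch overlap. The names `shift_one` and `iterative_shift` are hypothetical and do not appear in Sentinel.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def shift_one(value, op):
    """Shift a 32-bit value by exactly one bit (hypothetical helper)."""
    if op == "sll":
        return (value << 1) & MASK32
    if op == "srl":
        return value >> 1
    if op == "sra":
        # Arithmetic shift: copy the sign bit into the vacated top bit.
        return (value >> 1) | (value & SIGN_BIT)
    raise ValueError(f"unknown shift op: {op}")


def iterative_shift(rs1, count, op):
    """Behavioral model: one single-bit shift per loop iteration."""
    value = rs1 & MASK32
    remaining = count & 31  # restrict the shift count to 0-31
    # A zero count falls straight through and returns RS1 unchanged,
    # mirroring the "was a zero shift" path in the microcode.
    while remaining != 0:
        value = shift_one(value, op)
        remaining -= 1
    return value


if __name__ == "__main__":
    print(hex(iterative_shift(1, 4, "sll")))
    print(hex(iterative_shift(0x80000000, 31, "sra")))
```

One consequence visible in the model: the number of loop iterations grows with the shift count, so a shift by 31 takes many more steps than a shift by 1. That is the cost of shifting one bit at a time instead of using a barrel shifter.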