: div ( ud udiv -- uqout umod )
com >r >r >a r> r> over 0 # +
/- /- /- /- /- /- /- /- /-
/- /- /- /- /- /- /- /-
nip nip a >r -cIF *+ r> ;
THEN 0 # + *+ $8000 # + r> ;
The next example is even more complicated, since I emulate a serial interface.
At 10 MHz, each bit has to take 87 clock cycles to get a fast 115200 baud serial line.
We add a second stop bit to allow the other side to resynchronize when
the next bit arrives.
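The figure of 87 cycles follows directly from the clock and baud rate given above. A minimal sketch of that arithmetic (the function names are illustrative and not part of the b16 sources):

```python
# Bit timing for a bit-banged serial line (an illustrative sketch).
CLOCK_HZ = 10_000_000   # 10 MHz clock, as stated in the text
BAUD = 115_200          # target baud rate, as stated in the text


def cycles_per_bit(clock_hz, baud):
    """Nearest whole number of clock cycles for one bit time."""
    return round(clock_hz / baud)


def baud_error(clock_hz, baud):
    """Relative error of the achieved baud rate against the target."""
    achieved = clock_hz / cycles_per_bit(clock_hz, baud)
    return (achieved - baud) / baud


if __name__ == "__main__":
    print(cycles_per_bit(CLOCK_HZ, BAUD))          # whole cycles per bit
    print(f"{baud_error(CLOCK_HZ, BAUD):+.3%}")    # rounding error
```

Rounding to whole cycles leaves the line slightly slow, by roughly 0.2 %, which is well within what an asynchronous receiver usually tolerates.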
: send-rest ( c -- c' ) *+
: wait-bit
1 # $FFF9 # BEGIN over + cUNTIL drop drop ;
: send-bit ( c -- c' )
\ delay at start
: send-bit-fast ( c -- c' )
$FFFE # >a dup 1 # and
IF drop $0001 # a@ or a!+ send-rest ;
THEN drop $FFFE # a@ and a!+ send-rest ;
: emit ( c -- )
\ 8N1, 115200 baud
>r 06 # send-bit r>
send-bit-fast send-bit send-bit send-bit
send-bit send-bit send-bit send-bit
drop send-bit-fast send-bit drop ;
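To make the bit order of emit easier to follow, here is a small model of the frame it puts on the line. It assumes standard asynchronous framing (start bit low, stop bits high) and that the data bits go out least significant bit first, which is what the test of bit 0 in send-bit-fast suggests; the function name is hypothetical.

```python
def frame_bits(char):
    """Bits in the order they would appear on the line for one character.

    Assumptions (not taken verbatim from the b16 sources): start bit is 0,
    data is sent LSB first, and two stop bits of 1 follow.
    """
    bits = [0]                                    # start bit
    bits += [(char >> i) & 1 for i in range(8)]   # 8 data bits, LSB first
    bits += [1, 1]                                # stop bit plus the extra one
    return bits


if __name__ == "__main__":
    print(frame_bits(ord("A")))
```

With 11 bits per character and 87 cycles per bit, one character occupies 957 clock cycles under these assumptions.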
Like in ColorForth, ; is just an EXIT, and : is used as a label.
If there is a call directly before ;, it is converted to a jump.
This saves return stack entries, time, and code space.
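The call-to-jump conversion described above can be sketched as a one-step peephole rule. This is only an illustration of the idea; the tuple representation and the function name are assumptions, and the real b16 assembler is not shown here.

```python
def assemble_exit(code):
    """Append an EXIT to a list of (opcode, operand) tuples.

    If the last instruction is a call, replace it with a jump to the same
    target instead of emitting a separate return.
    """
    if code and code[-1][0] == "call":
        target = code[-1][1]
        return code[:-1] + [("jmp", target)]
    return code + [("ret", "")]


if __name__ == "__main__":
    print(assemble_exit([("lit", "1"), ("call", "send-rest")]))
    print(assemble_exit([("drop", "")]))
```

The first case needs no return stack entry at all, since the callee's own EXIT returns to the original caller.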
\layout Section
The Rest of the Implementation
First comes the implementation file, with its header comment and the modules.
* b16 core: 16 bits,
* inspired by c18 core from Chuck Moore
`define L [l-1:0]
`define DROP { sp, T, N } <= { spinc, N, toN }
`timescale 1ns / 1ns
* Instruction set:
* 1, 5, 5, 5 bits
*      0    1    2    3    4    5    6    7
*   0: nop  call jmp  ret  jz   jnz  jc   jnc
*   /3      exec goto ret  gz   gnz  gc   gnc
*   8: xor  com  and  or   +    +c   *+   /-
*  10: A!+  A@+  R@+  lit  Ac!+ Ac@+ Rc@+ litc
*   /1 A!   A@   R@   lit  Ac!  Ac@  Rc@  litc
*  18: nip  drop over dup  >r   >a   r>   a
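As an illustration of the 1, 5, 5, 5 bit layout, the following sketch splits a 16-bit word into its four slots and looks up the base mnemonics from the table. It assumes that the 1-bit slot is the most significant bit and that slots are ordered from high bits to low bits; the variant rows marked /3 and /1 are ignored. All names are hypothetical.

```python
# Base mnemonics, indexed by 5-bit opcode (rows 0, 8, 10 and 18 hex).
MNEMONICS = [
    "nop", "call", "jmp", "ret", "jz", "jnz", "jc", "jnc",
    "xor", "com", "and", "or", "+", "+c", "*+", "/-",
    "A!+", "A@+", "R@+", "lit", "Ac!+", "Ac@+", "Rc@+", "litc",
    "nip", "drop", "over", "dup", ">r", ">a", "r>", "a",
]


def split_slots(word):
    """Split a 16-bit word into slots of 1, 5, 5 and 5 bits (MSB first)."""
    return [(word >> 15) & 0x01,
            (word >> 10) & 0x1F,
            (word >> 5) & 0x1F,
            word & 0x1F]


def disassemble(word):
    """Map each slot to its base mnemonic."""
    return [MNEMONICS[slot] for slot in split_slots(word)]


if __name__ == "__main__":
    print(disassemble(0x6D83))
```

Because the first slot is only one bit wide, it can hold just opcode 0 or 1, that is, nop or call.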
\layout Subsection
Top Level
\layout Standard
The CPU consists of several parts, which are all implemented in the same
Verilog module.
\layout Scrap
module cpu(clk, reset, addr, rd, wr, data, T,
intreq, intack, intvec);
always @(posedge clk or negedge reset)
endmodule // cpu
First, Verilog needs port declarations, so that it can know what is input
and what is output.
The parameters are used to configure other word sizes and stack depths.
\layout Scrap
parameter show=0, l=16, sdep=3, rdep=3;
input clk, reset;
output `L addr;
output rd;
output [1:0] wr;
input `L data;
output `L T;
input intreq;
output intack;
input [7:0] intvec; // interrupt jump vector
The ALU is instantiated with the configured width, and the necessary wires
are declared.
wire `L res, toN;
wire carry, zero;
alu #(l) alu16(res, carry, zero,
T, N, c, inst[2:0]);
\layout Standard
Since the stacks work in parallel, we have to calculate when a value is
pushed onto a stack (that is, when something is actually stored there).
\layout Scrap
reg dpush, rpush;
always @(clk or state or inst or rd)
  begin
     dpush <= 1'b0;
     rpush <= 1'b0;
     if(state[2]) begin
        dpush <= |state[1:0] & rd;
        rpush <= state[1] & (inst[1:0]==2'b10);
     end else
       casez(inst)
         5'b00001: rpush <= 1'b1;
         5'b11100: rpush <= 1'b1;
         5'b11?1?: dpush <= 1'b1;
       endcase // case(inst)
  end
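The push decoding above can be mirrored in a small software model, which makes it easy to check which opcodes push onto which stack. This is a sketch under the assumption that inst is a 5-bit opcode and that '?' in a pattern matches either bit value; the helper names are illustrative.

```python
def casez_match(pattern, value):
    """True if a 5-bit value matches a casez pattern; '?' is a wildcard."""
    bits = format(value & 0x1F, "05b")
    return all(p in ("?", b) for p, b in zip(pattern, bits))


def push_signals(state, inst, rd):
    """Return (dpush, rpush) for a 3-bit state, 5-bit opcode and rd flag."""
    dpush = rpush = 0
    if state & 0b100:                                   # state[2] set
        dpush = int((state & 0b011) != 0) & rd          # |state[1:0] & rd
        rpush = ((state >> 1) & 1) & int((inst & 0b11) == 0b10)
    elif casez_match("00001", inst):                    # call
        rpush = 1
    elif casez_match("11100", inst):                    # >r
        rpush = 1
    elif casez_match("11?1?", inst):                    # over, dup, r>, a
        dpush = 1
    return dpush, rpush


if __name__ == "__main__":
    print(push_signals(0b001, 0b11011, 0))   # dup
    print(push_signals(0b010, 0b00001, 0))   # call
```

In this model, only over, dup, r> and a match the wildcard pattern, which agrees with the bottom row of the instruction table.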
The stacks consist not only of the two stack modules, but also need an
incremented and a decremented stack pointer.
The return stack additionally allows writing the top of the return stack
without changing the return stack depth.
\layout Scrap
wire [sdep-1:0] spdec, spinc;
wire [rdep-1:0] rpdec, rpinc;
stack #(sdep,l) dstack(clk, sp, spdec,
dpush, N, toN);
stack #(rdep,l) rstack(clk, rp, rpdec,
rpush, toR, R);
assign spdec = sp-{{(sdep-1){1'b0}}, 1'b1};
assign spinc = sp+{{(sdep-1){1'b0}}, 1'b1};
assign rpdec = rp+{(rdep){(~state[2] | tos2r)}};
assign rpinc = rp+{{(rdep-1){1'b0}}, 1'b1};
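The expression for rpdec relies on the fact that adding an all-ones vector is the same as subtracting one in modular arithmetic, so a single replicated bit selects between "decrement" and "leave unchanged". A minimal sketch of that identity (function names are illustrative):

```python
def dec(ptr, width):
    """Decrement a stack pointer modulo 2**width."""
    return (ptr - 1) % (1 << width)


def inc(ptr, width):
    """Increment a stack pointer modulo 2**width."""
    return (ptr + 1) % (1 << width)


def add_replicated(ptr, bit, width):
    """Model ptr + {(width){bit}}: add all ones if bit is set, else zero."""
    ones = (1 << width) - 1 if bit else 0
    return (ptr + ones) % (1 << width)


if __name__ == "__main__":
    for p in range(8):
        print(p, add_replicated(p, 1, 3), add_replicated(p, 0, 3))
```

When the controlling bit is 0, the "decremented" pointer equals the current one, which is how a write can target the present top of the return stack without changing its depth.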
The basic core is the fully synchronous register update.
Each register needs a reset value, and depending on the state transition,
the corresponding assignments have to be coded.
Most of that has been covered above; only the instruction fetch and the
assignment of the next values (state and incby in the scrap below) remain
to be done.
\layout Scrap
if(!reset) begin
end else if(state[2]) begin
end else begin // if (state[2])
if(show) begin
if(nextstate == 3'b100)
{ addr, rd } <= { P, 1'b1 };
state <= nextstate;
incby <= (inst[4:2] != 3'b101);
end // else: !if(reset)
On reset, we initialize the CPU so that it is about to fetch the next
instruction from address 0.
The stacks are all empty, and the registers contain all zeros.
\layout Scrap
state <= 3'b011;
incby <= 1'b0;
P <= 16'h0000;
addr <= 16'h0000;
A <= 16'h0000;
T <= 16'h0000;
N <= 16'h0000;
I <= 16'h0000;
c <= 1'b0;
rd <= 1'b0;
wr <= 2'b00;
sp <= 0;
rp <= 0;
intack <= 0;
The transition to the next state (the NEXT within a bundle) is computed separately.
That is necessary, since the assignments of the other variables depend not
just on the current state, but partially also on the next state (for example,
when to fetch the next instruction word).
\layout Scrap
reg [2:0] nextstate;
always @(inst or state)
   if(state[2]) begin
      nextstate <= state[1:0] + { 2'b0, |state[1:0] };
   end else begin
      casez(inst)
        5'b00000: nextstate <= state[1:0] + 3'b001;
        5'b00???: nextstate <= 3'b100;
        5'b10???: nextstate <= { 1'b1, state[1:0] };
        5'b?????: nextstate <= state[1:0] + 3'b001;
      endcase // casez(inst)
   end // else: !if(state[2]) end
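The next-state rules can be tried out in a small software model. The sketch below follows the four casez patterns in order and assumes, as in Verilog, that the first matching pattern wins and that the additions are evaluated at 3 bits; the function names are illustrative.

```python
def casez_match(pattern, value):
    """True if a 5-bit value matches a casez pattern; '?' is a wildcard."""
    bits = format(value & 0x1F, "05b")
    return all(p in ("?", b) for p, b in zip(pattern, bits))


def next_state(state, inst):
    """Compute the 3-bit next state from the current state and opcode."""
    low = state & 0b011                       # state[1:0]
    if state & 0b100:                         # state[2] set
        return low + (1 if low else 0)        # + { 2'b0, |state[1:0] }
    if casez_match("00000", inst):            # nop: go to the next slot
        return low + 1
    if casez_match("00???", inst):            # call, jumps, ret
        return 0b100
    if casez_match("10???", inst):            # memory access opcodes
        return 0b100 | low
    return low + 1                            # everything else


if __name__ == "__main__":
    print(format(next_state(0b011, 0b01000), "03b"))   # xor in the last slot
    print(format(next_state(0b010, 0b10011), "03b"))   # lit in slot 2
```

In this model, finishing the last slot and taking any control-flow instruction both lead to state 100, which is the value the register update checks before starting an instruction fetch.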
\layout Standard
The ALU just computes the sum with possible carry-ins, the logical operations,
and a zero flag.
It would be possible to share common resources (the XORs of the full adder
could also compute the XOR operation, and the carry propagation logic could
compute OR and AND), but this optimization is left to the synthesis tool.
\layout Scrap
module alu(res, carry, zero, T, N, c, inst);
wire `L sum, logic;
wire cout;
assign { cout, sum } =
T + N + ((c | andor) & selr);
assign logic = andor ?
(selr ? (T | N) : (T & N)) :
T ^ N;
assign { carry, res } =
prop ? { cout, sum } : { c, logic };
assign zero = ~|T;
endmodule // alu
The ALU takes T and N, the carry in, and the lowest 3 bits of the instruction
as inputs, and produces a result, a carry out, and a test for zero as outputs.
\layout Scrap
parameter l=16;
input `L T, N;
input c;
input [2:0] inst;
output `L res;
output carry, zero;
wire prop, andor, selr;
assign #1 { prop, andor, selr } = inst;
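To see how the three control bits select the operation, the ALU equations can be transcribed into a short software model. This is a sketch that mirrors the assign statements above for an arbitrary width; it is not a replacement for the Verilog, and the function name is an assumption.

```python
def alu(T, N, c, inst, width=16):
    """Model of the ALU; returns (res, carry, zero).

    inst holds the lowest 3 opcode bits: prop = bit 2, andor = bit 1,
    selr = bit 0, matching { prop, andor, selr } = inst.
    """
    mask = (1 << width) - 1
    prop = (inst >> 2) & 1
    andor = (inst >> 1) & 1
    selr = inst & 1

    total = (T & mask) + (N & mask) + ((c | andor) & selr)
    cout, sum_ = (total >> width) & 1, total & mask

    if andor:
        logic = (T | N) if selr else (T & N)
    else:
        logic = T ^ N
    logic &= mask

    carry, res = (cout, sum_) if prop else (c, logic)
    zero = int((T & mask) == 0)               # zero tests T, not the result
    return res, carry, zero


if __name__ == "__main__":
    print(alu(0xFFFF, 0x0001, 0, 0b100))      # plain add with overflow
    print(alu(0x00F0, 0x0FF0, 1, 0b010))      # and, carry passes through
```

Note that the carry-in term is 0 for +, c for +c, 0 for *+ and 1 for /-, so the same adder serves all four arithmetic opcodes.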
\layout Standard
The stacks are modeled as block RAM in the FPGA.
They should therefore have only one port, since single-ported block RAMs are
available even in small FPGAs.
In an ASIC, this sort of stack is implemented with latches.
Here it's possible to separate read and write port (also for FPGAs that
support dual-ported RAM), and save the multiplexer for
\layout Scrap
module stack(clk, sp, spdec, push, in, out);
parameter dep=3, l=16;
input clk, push;
input [dep-1:0] sp, spdec;
input `L in;
output `L out;
reg `L stackmem[0:(1<<dep)-1];
Programs memory from addr with len data bytes
addr, len: Reads back len bytes from memory starting at addr
Execute the word at addr
These three commands are sufficient to program the b16 interactively.
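The effect of the three commands can be pictured with a tiny host-side model of the board's memory. Everything here is hypothetical: the class and method names, the memory size, and the way "execute" is recorded are assumptions for illustration, and the actual wire format of the commands is not modeled.

```python
class BoardModel:
    """Toy model of a target that understands program, read and execute."""

    def __init__(self, size=0x1000):
        self.mem = bytearray(size)      # assumed memory size
        self.executed = []              # addresses passed to execute()

    def program(self, addr, data):
        """Store the data bytes into memory starting at addr."""
        if addr < 0 or addr + len(data) > len(self.mem):
            raise ValueError("write outside of memory")
        self.mem[addr:addr + len(data)] = data

    def read_back(self, addr, length):
        """Return length bytes from memory starting at addr."""
        if addr < 0 or addr + length > len(self.mem):
            raise ValueError("read outside of memory")
        return bytes(self.mem[addr:addr + length])

    def execute(self, addr):
        """Record that the word at addr would be executed."""
        self.executed.append(addr)


if __name__ == "__main__":
    board = BoardModel()
    board.program(0x10, b"\x12\x34")
    print(board.read_back(0x10, 2).hex())
    board.execute(0x10)
```

Programming followed by a read-back of the same range gives a simple way to verify a download before executing it.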
On the host side, a few instructions are sufficient, too:
comp Compile to the end of line, and send the result to the evaluation board
eval Compile to the end of line, send the result to the evaluation board,
call the code, and set the RAM pointer of the assembler back to the original
\layout Description
\family default
, but execute the result with the simulator instead of using the evaluation
check ( addr u --- ) Reads a memory block from the evaluation board, and
displays it with
\layout Standard
More material is available from my home page
All sources are available under GPL.
Data for producing a board is available, too.
Hans Eckes
might make one for you, if you pay for it.
And if someone wants to use the b16 commercially, talk to me.
\bibitem {c18}
c18 ColorForth Compiler,
Chuck Moore
EuroForth Conference Proceedings, 2001
\bibitem {web}
b16 Processor,
Bernd Paysan
, Internet Home page,
\begin_inset LatexCommand \url[http://www.jwdt.com/~paysan/b16.html]{http://www.jwdt.com/~paysan/b16.html}