>=
\newline
: div ( ud udiv -- uqout umod )
\newline
com >r >r >a r> r> over 0 # +
\newline
/- /- /- /- /- /- /- /- /-
\newline
/- /- /- /- /- /- /- /-
\newline
nip nip a >r -cIF *+ r> ;
\newline
THEN 0 # + *+ $8000 # + r> ;
\newline
@
\layout Standard
The next example is even more complicated, since I emulate a serial interface.
At 10 MHz, each bit takes 87 clock cycles (10 MHz divided by 115200, rounded),
which gives a fast serial line at 115200 baud.
We add a second stop bit to allow the other side to resynchronize when the
next bit arrives.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<
>=
\newline
: send-rest ( c -- c' ) *+
\newline
: wait-bit
\newline
1 # $FFF9 # BEGIN over + cUNTIL drop drop ;
\newline
: send-bit ( c -- c' )
\newline
nop
\backslash
delay at start
\newline
: send-bit-fast ( c -- c' )
\newline
$FFFE # >a dup 1 # and
\newline
IF drop $0001 # a@ or a!+ send-rest ;
\newline
THEN drop $FFFE # a@ and a!+ send-rest ;
\newline
: emit ( c -- )
\backslash
8N1, 115200 baud
\newline
>r 06 # send-bit r>
\newline
send-bit-fast send-bit send-bit send-bit
\newline
send-bit send-bit send-bit send-bit
\newline
drop send-bit-fast send-bit drop ;
\newline
@
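\layout Standard
The figure of 87 cycles per bit can be checked with a few lines of Python.
This is only a sketch of the arithmetic, not part of the b16 sources; the
function names are made up for illustration, and the clock and baud rate
are the values from the text.

```python
# Bit timing for a software UART; clock and baud rate are taken from the text.
CLOCK_HZ = 10_000_000
BAUD = 115_200


def cycles_per_bit(clock_hz, baud):
    """Nearest whole number of clock cycles for one bit time."""
    return round(clock_hz / baud)


def timing_error(clock_hz, baud):
    """Relative deviation of the rounded bit time from the ideal bit time."""
    ideal = clock_hz / baud
    return (cycles_per_bit(clock_hz, baud) - ideal) / ideal
```

With these values, cycles_per_bit(CLOCK_HZ, BAUD) evaluates to 87, and the
rounding error stays well below one percent.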
\layout Standard
As in ColorForth,
\family typewriter
;
\family default
is just an EXIT, and
\family typewriter
:
\family default
is used as a label.
If there's a call before
\family typewriter
;
\family default
, this is converted to a jump.
This saves return stack entries, time, and code space.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
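\layout Standard
The call-to-jump conversion can be pictured as a small peephole step in the
assembler.
The following Python sketch is an illustration only: the list-of-tuples
representation and the name end_definition are assumptions, not the actual
b16 assembler.

```python
def end_definition(code):
    """Close a definition: turn a trailing call into a jump, else append ret.

    `code` is a list of (mnemonic, operand) tuples, a hypothetical
    representation chosen for this sketch.
    """
    if code and code[-1][0] == "call":
        target = code[-1][1]
        # Tail call: jump instead, so the callee returns to our caller.
        return code[:-1] + [("jmp", target)]
    return code + [("ret", None)]
```

For example, a definition ending in a call to send-rest ends in a jump to
send-rest after this step, and needs neither a return stack entry nor a
separate ret.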
\layout Section
The Rest of the Implementation
\layout Standard
First comes the implementation file, with its header comment and the modules.
\layout Scrap
<>=
\newline
/*
\newline
* b16 core: 16 bits,
\newline
* inspired by c18 core from Chuck Moore
\newline
*
\newline
<>
\newline
*/
\newline
\newline
`define L [l-1:0]
\newline
`define DROP { sp, T, N } <= { spinc, N, toN }
\newline
`timescale 1ns / 1ns
\newline
\newline
<>
\newline
<>
\newline
<>
\newline
@
\layout Scrap
<>=
\newline
* Instruction set:
\newline
* 1, 5, 5, 5 bits
\newline
* 0 1 2 3 4 5 6 7
\newline
* 0: nop call jmp ret jz jnz jc jnc
\newline
* /3 exec goto ret gz gnz gc gnc
\newline
* 8: xor com and or + +c *+ /-
\newline
* 10: A!+ A@+ R@+ lit Ac!+ Ac@+ Rc@+ litc
\newline
* /1 A! A@ R@ lit Ac! Ac@ Rc@ litc
\newline
* 18: nip drop over dup >r >a r> a
\newline
@
\newline
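\layout Standard
To make the 1, 5, 5, 5 bit layout concrete, here is a Python sketch that
splits a 16 bit word into its four slots and looks up the mnemonics from
the table above.
It assumes that the slots are packed from the most significant bit downwards,
and it ignores both the alternative slot meanings (the rows marked /3 and
/1) and the fact that jumps use the remaining bits as an address.
The function names are made up for illustration.

```python
# Opcode table in the order given in the instruction set comment (0x00..0x1F).
MNEMONICS = [
    "nop", "call", "jmp", "ret", "jz", "jnz", "jc", "jnc",
    "xor", "com", "and", "or", "+", "+c", "*+", "/-",
    "A!+", "A@+", "R@+", "lit", "Ac!+", "Ac@+", "Rc@+", "litc",
    "nip", "drop", "over", "dup", ">r", ">a", "r>", "a",
]


def split_slots(word):
    """Split a 16-bit word into its 1-, 5-, 5- and 5-bit slots (MSB first)."""
    word &= 0xFFFF
    return [word >> 15, (word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F]


def disassemble(word):
    """Map each slot to its mnemonic; the 1-bit slot can only be nop or call."""
    return [MNEMONICS[op] for op in split_slots(word)]
```

Under these assumptions, disassemble(0x6D83) yields nop, dup, + and ret.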
\layout Subsection
Top Level
\layout Standard
The CPU consists of several parts, which are all implemented in the same
Verilog module.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
module cpu(clk, reset, addr, rd, wr, data, T,
\newline
intreq, intack, intvec);
\newline
<>
\newline
<>
\newline
<>
\newline
<>
\newline
<>
\newline
<>
\newline
<>
\newline
<>
\newline
\newline
always @(posedge clk or negedge reset)
\newline
<>
\newline
\newline
endmodule // cpu
\newline
@
\layout Standard
First, Verilog needs port declarations, so that it can know what's input
and output.
The parameters are used to configure other word sizes and stack depths.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
parameter show=0, l=16, sdep=3, rdep=3;
\newline
input clk, reset;
\newline
output `L addr;
\newline
output rd;
\newline
output [1:0] wr;
\newline
input `L data;
\newline
output `L T;
\newline
input intreq;
\newline
output intack;
\newline
input [7:0] intvec; // interrupt jump vector
\newline
@
\layout Standard
The ALU is instantiated with the configured width, and the necessary wires
are declared.
\layout Scrap
<>=
\newline
wire `L res, toN;
\newline
wire carry, zero;
\newline
\newline
alu #(l) alu16(res, carry, zero,
\newline
T, N, c, inst[2:0]);
\newline
@
\layout Standard
Since the stacks work in parallel, we have to calculate when a value is
pushed onto the stack (thus
\series bold
only
\series default
if something is stored there).
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
reg dpush, rpush;
\newline
\newline
always @(clk or state or inst or rd)
\newline
begin
\newline
dpush <= 1'b0;
\newline
rpush <= 1'b0;
\newline
if(state[2]) begin
\newline
dpush <= |state[1:0] & rd;
\newline
rpush <= state[1] & (inst[1:0]==2'b10);
\newline
end else
\newline
casez(inst)
\newline
5'b00001: rpush <= 1'b1;
\newline
5'b11100: rpush <= 1'b1;
\newline
5'b11?1?: dpush <= 1'b1;
\newline
endcase // case(inst)
\newline
end
\newline
@
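\layout Standard
The casez patterns in the scrap above can be read off against the instruction
table.
The following Python sketch models only the branch for a cleared state[2];
the helper names are made up, and the comments naming the instructions
follow from the opcode table, not from additional sources.

```python
def casez_match(pattern, value):
    """True if a 5-bit value matches a casez pattern ('?' is don't care)."""
    bits = format(value & 0x1F, "05b")
    return all(p in ("?", b) for p, b in zip(pattern, bits))


def stack_pushes(inst):
    """Return (dpush, rpush) for an instruction outside the memory states."""
    dpush = rpush = False
    if casez_match("00001", inst):      # call
        rpush = True
    elif casez_match("11100", inst):    # >r
        rpush = True
    elif casez_match("11?1?", inst):    # over, dup, r>, a
        dpush = True
    return dpush, rpush
```

For instance, stack_pushes(0x1B) (dup) reports a data stack push, while
stack_pushes(0x19) (drop) reports no push at all.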
\layout Standard
The stacks consist not only of the two stack modules, but also need an
incremented and a decremented stack pointer.
The return stack also allows the top of the return stack to be written
without changing the return stack depth.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
wire [sdep-1:0] spdec, spinc;
\newline
wire [rdep-1:0] rpdec, rpinc;
\newline
\newline
stack #(sdep,l) dstack(clk, sp, spdec,
\newline
dpush, N, toN);
\newline
stack #(rdep,l) rstack(clk, rp, rpdec,
\newline
rpush, toR, R);
\newline
\newline
assign spdec = sp-{{(sdep-1){1'b0}}, 1'b1};
\newline
assign spinc = sp+{{(sdep-1){1'b0}}, 1'b1};
\newline
assign rpdec = rp+{(rdep){(~state[2] | tos2r)}};
\newline
assign rpinc = rp+{{(rdep-1){1'b0}}, 1'b1};
\newline
@
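\layout Standard
The pointer arithmetic wraps around at the configured stack depth, and the
replicated bit in rpdec adds either all ones (that is, minus one) or zero.
A Python sketch of this behaviour, with hypothetical function names, and
with tos2r treated as a plain input since it is defined elsewhere:

```python
def sp_step(sp, depth_bits, delta):
    """Add delta to a stack pointer of depth_bits bits, wrapping around."""
    return (sp + delta) & ((1 << depth_bits) - 1)


def rp_dec(rp, rdep, state2, tos2r):
    """Model of rpdec: rp - 1, or rp itself if state[2] is set and tos2r is clear."""
    all_ones = (1 << rdep) - 1
    # Replicating a single bit rdep times gives either all ones (-1) or zero.
    addend = all_ones if ((not state2) or tos2r) else 0
    return (rp + addend) & all_ones
```

With rdep=3, rp_dec(5, 3, True, False) stays at 5, so the write hits the
current top of the return stack, whereas rp_dec(5, 3, False, False) gives
4.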
\layout Standard
The basic core is the fully synchronous register update.
Each register needs a reset value, and depending on the state transition,
the corresponding assignments have to be coded.
Most of that was described above; only the instruction fetch and the assignment
of the next value of
\family typewriter
incby
\family default
remain to be done.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
if(!reset) begin
\newline
<>
\newline
end else if(state[2]) begin
\newline
<>
\newline
end else begin // if (state[2])
\newline
if(show) begin
\newline
<>
\newline
end
\newline
if(nextstate == 3'b100)
\newline
{ addr, rd } <= { P, 1'b1 };
\newline
state <= nextstate;
\newline
incby <= (inst[4:2] != 3'b101);
\newline
<>
\newline
end // else: !if(reset)
\newline
@
\layout Standard
On reset, we initialize the CPU so that it is about to fetch the next
instruction from address 0.
The stacks are empty, and all registers contain zero.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
state <= 3'b011;
\newline
incby <= 1'b0;
\newline
P <= 16'h0000;
\newline
addr <= 16'h0000;
\newline
A <= 16'h0000;
\newline
T <= 16'h0000;
\newline
N <= 16'h0000;
\newline
I <= 16'h0000;
\newline
c <= 1'b0;
\newline
rd <= 1'b0;
\newline
wr <= 2'b00;
\newline
sp <= 0;
\newline
rp <= 0;
\newline
intack <= 0;
\newline
@
\layout Standard
The transition to the next state (the NEXT within a bundle) is done separately.
That's necessary, since the assignments of the other variables depend not
only on the current state, but partly also on the next state
(e.g.
when to fetch the next instruction word).
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
reg [2:0] nextstate;
\newline
\newline
always @(inst or state)
\newline
if(state[2]) begin
\newline
<>
\newline
end else begin
\newline
casez(inst)
\newline
<>
\newline
endcase // casez(inst[0:2])
\newline
end // else: !if(state[2]) end
\newline
@
\layout Scrap
<>=
\newline
nextstate <= state[1:0] + { 2'b0, |state[1:0] };
\newline
@
\layout Scrap
<>=
\newline
5'b00000: nextstate <= state[1:0] + 3'b001;
\newline
5'b00???: nextstate <= 3'b100;
\newline
5'b10???: nextstate <= { 1'b1, state[1:0] };
\newline
5'b?????: nextstate <= state[1:0] + 3'b001;
\newline
@
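\layout Standard
The state transitions above can be traced with a small Python model.
It is a sketch that mirrors the three scraps as written; the function name
is made up, and the comments on what each opcode group means are my reading
of the instruction table.

```python
def next_state(state, inst):
    """Model of nextstate; state is 3 bits wide, inst is a 5-bit opcode."""
    slot = state & 0b011
    if state & 0b100:
        # state[2] set: advance by one, except from 100, which goes to slot 0.
        return (slot + (1 if slot else 0)) & 0b111
    if inst == 0b00000:                 # nop
        return (slot + 1) & 0b111
    if inst >> 3 == 0b00:               # call, jmp, ret, conditional jumps
        return 0b100                    # fetch the next instruction word
    if inst >> 3 == 0b10:               # memory and literal instructions
        return 0b100 | slot             # same slot number, with state[2] set
    return (slot + 1) & 0b111           # all other instructions
```

For example, a lit in slot 2 (state 010) leads to state 110, from which
the machine continues with slot 3 (state 011); any instruction from the
last two table rows in slot 3 leads to state 100, the fetch of the next
word.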
\layout Subsection
ALU
\layout Standard
The ALU just computes the sum with possible carry-ins, the logical operations,
and a zero flag.
It would be possible to share common resources (the XORs of the full adder
could also compute the XOR operation, and the carry propagation logic could
compute OR and AND), but this optimization is left to the synthesis tool.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
module alu(res, carry, zero, T, N, c, inst);
\newline
<>
\newline
\newline
wire `L sum, logic;
\newline
wire cout;
\newline
\newline
assign { cout, sum } =
\newline
T + N + ((c | andor) & selr);
\newline
assign logic = andor ?
\newline
(selr ? (T | N) : (T & N)) :
\newline
T ^ N;
\newline
assign { carry, res } =
\newline
prop ? { cout, sum } : { c, logic };
\newline
assign zero = ~|T;
\newline
\newline
endmodule // alu
\newline
@
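\layout Standard
For experimenting with the ALU without a Verilog simulator, its behaviour
can be mirrored in Python.
This is a sketch that follows the assignments of the module above; the
function name and the width argument are assumptions, and the one unit
delay of the decoder is not modeled.

```python
def alu(T, N, c, inst, width=16):
    """Return (res, carry, zero); inst holds the lowest 3 instruction bits."""
    mask = (1 << width) - 1
    T &= mask
    N &= mask
    prop, andor, selr = (inst >> 2) & 1, (inst >> 1) & 1, inst & 1
    # Adder path: the carry-in is c for +c (101) and forced to 1 for /- (111).
    total = T + N + ((c | andor) & selr)
    cout, total_low = total >> width, total & mask
    # Logic path: and/or when andor is set, xor otherwise.
    if andor:
        logic = (T | N) if selr else (T & N)
    else:
        logic = T ^ N
    # prop selects the adder result; logic operations pass the carry through.
    carry, res = (cout, total_low) if prop else (c, logic)
    zero = int(T == 0)                  # the zero flag tests T, not the result
    return res, carry, zero
```

As a usage example, alu(0xFFFF, 1, 0, 0b100) returns (0, 1, 0): the sum
wraps to zero, the carry is set, and the zero flag is clear because T itself
is not zero.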
\layout Standard
The ALU has ports T and N, carry in, and the lowest 3 bits of the instruction
as inputs, and a result, carry out, and a test for zero as outputs.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
parameter l=16;
\newline
input `L T, N;
\newline
input c;
\newline
input [2:0] inst;
\newline
output `L res;
\newline
output carry, zero;
\newline
\newline
wire prop, andor, selr;
\newline
\newline
assign #1 { prop, andor, selr } = inst;
\newline
@
\layout Subsection
Stacks
\layout Standard
The stacks are modeled as block RAM in the FPGA.
Therefore, they should have only one port, since these block RAMs are available
even in small FPGAs.
In an ASIC, this sort of stack is implemented with latches.
Here it's possible to separate the read and write ports (also for FPGAs that
support dual-ported RAM), and save the multiplexer for
\family typewriter
spset
\family default
.
\begin_inset ERT
status Collapsed
\layout Standard
\backslash
filbreak
\end_inset
\layout Scrap
<>=
\newline
module stack(clk, sp, spdec, push, in, out);
\newline
parameter dep=3, l=16;
\newline
input clk, push;
\newline
input [dep-1:0] sp, spdec;
\newline
input `L in;
\newline
output `L out;
\newline
reg `L stackmem[0:(1@<:
\emph default
Programs memory from
\emph on
addr
\emph default
with
\emph on
len
\emph default
data bytes
\layout Description
1
\emph on
addr, len:
\emph default
Reads back
\emph on
len
\emph default
bytes from memory starting at
\emph on
addr
\layout Description
2
\emph on
addr:
\emph default
Execute the word at
\emph on
addr
\layout Standard
These three commands are sufficient to program the b16 interactively.
On the host side, a few instructions are sufficient, too:
\layout Description
comp Compile to the end of line, and send the result to the evaluation board
\layout Description
eval Compile to the end of line, send the result to the evaluation board,
call the code, and set the RAM pointer of the assembler back to the original
value
\layout Description
sim Same as
\family typewriter
eval
\family default
, but execute the result with the simulator instead of using the evaluation
board
\layout Description
check ( addr u --- ) Reads a memory block from the evaluation board, and
displays it with
\family typewriter
dump
\layout Section
Outlook
\layout Standard
More material is available from my home page
\begin_inset LatexCommand \cite{web}
\end_inset
.
All sources are available under the GPL.
Data for producing a board is available, too.
\noun on
Hans Eckes
\noun default
might make one for you, if you pay for it.
If you want to use the b16 commercially, talk to me.
\layout Bibliography
\bibitem {c18}
\emph on
c18 ColorForth Compiler,
\emph default
\noun on
Chuck Moore
\noun default
,
\begin_inset Formula $17^{\mathrm{th}}$
\end_inset
EuroForth Conference Proceedings, 2001
\layout Bibliography
\bibitem {web}
\emph on
b16 Processor,
\emph default
\noun on
Bernd Paysan
\noun default
, Internet Home page,
\begin_inset LatexCommand \url[http://www.jwdt.com/~paysan/b16.html]{http://www.jwdt.com/~paysan/b16.html}
\end_inset
\the_end