Veljko Milutinovic
MPS:
Understanding the Issues
Advanced RISC Microprocessors
The DEC Alpha AXP
Digital Equipment Corporation.
The first product realizing the Alpha AXP architecture is labeled 21064.
The Alpha is a 64-bit RISC-type microprocessor.
The PowerPC Family
IBM, Motorola, and Apple.
The first PowerPC implementation is the PowerPC 601 microprocessor
(also called MPC 601 by Motorola, and PPC 601 by IBM).
The Sun SPARC Family
Sun Microsystems.
The name SPARC stands for scalable processor architecture.
The SPARC architecture follows the Berkeley RISC design philosophy.
The MIPS Rx000 Family
MIPS Computer Systems.
The MIPS acronym stands for microprocessor without interlocked pipeline stages.
The MIPS system originated at Stanford University in the early eighties.
The Intel i860/i960 Family
The i860 RISC was first announced in 1989.
It features on-chip FPU, dual cache, and graphics unit
(the first microprocessor with such a feature).
The Motorola M88000 Family
The first members of the M88000 family are the MC88100 and MC88200.
They were followed by the RISC MC88110,
which is a two-issue superscalar.
The HP Precision Architecture Family
The PA-RISC architecture was designed to be scalable
across technologies, cost ranges, performance ranges,
and to provide price-performance advantages.
Company | Internet URL of microprocessor family home page |
IBM | http://www.chips.ibm.com/products/ppc/DataSheets/techlibsheets.html |
Motorola | http://www.mot.com/SPS/PowerPC |
DEC | http://www.europe.digital.com/semiconductor/alpha.htm |
Sun | http://www.sun.com/sparc |
MIPS | http://www.sgi.com/MIPS/products/index.html |
Hewlett-Packard | http://hpcc920.external.hp.com/computing/framed/technology/micropro |
AMD | http://www.amd.com/K6 |
Intel | http://www.intel.com/english/PentiumII/zdn.htm |
[Stojanovic96] Stojanovic, M., "Advanced RISC Microprocessors," Internal Report, Department of Computer Engineering,
School of Electrical Engineering, University of Belgrade, Belgrade, Serbia, Yugoslavia, December 1995.
Microprocessors and Their Primary Manufacturers
(source: [Prvulovic97])
Microprocessor | Company |
PowerPC 601 | IBM, Motorola |
PowerPC 604e | IBM, Motorola |
PowerPC 620* | IBM, Motorola |
Alpha 21064* | Digital Equipment Corporation (DEC) |
Alpha 21164* | Digital Equipment Corporation (DEC) |
Alpha 21264* | Digital Equipment Corporation (DEC) |
SuperSPARC | Sun Microelectronics |
UltraSPARC-I* | Sun Microelectronics |
UltraSPARC-II* | Sun Microelectronics |
R4400* | MIPS Technologies |
R10000* | MIPS Technologies |
PA7100 | Hewlett-Packard |
PA8000* | Hewlett-Packard |
PA8500* | Hewlett-Packard |
MC88110 | Motorola |
AMD K6 | Advanced Micro Devices (AMD) |
i860 XP | Intel |
Pentium II | Intel |
Legend: * 64-bit microprocessors, all others are 32-bit microprocessors.
Microprocessor Technology
Microprocessor | Technology | Transistors | Frequency [MHz] | Package |
PowerPC 601 | 0.6 µm, 4 L, CMOS | 2,800,000 | 80 | 304 PGA |
PowerPC 604e | 0.35 µm, 5 L, CMOS | 5,100,000 | 225 | 255 BGA |
PowerPC 620 | 0.35 µm, 4 L, CMOS | 7,000,000 | 200 | 625 BGA |
Alpha 21064 | 0.7 µm, 3 L, CMOS | 1,680,000 | 300 | 431 PGA |
Alpha 21164 | 0.35 µm, 4 L, CMOS | 9,300,000 | 500 | 499 PGA |
Alpha 21264 | 0.35 µm, 6 L, CMOS | 15,200,000 | 500 | 588 PGA |
SuperSPARC | 0.8 µm, 3 L, CMOS | 3,100,000 | 60 | 293 PGA |
UltraSPARC-I | 0.4 µm, 4 L, CMOS | 5,200,000 | 200 | 521 BGA |
UltraSPARC-II | 0.35 µm, 5 L, CMOS | 5,400,000 | 250 | 521 BGA |
R4400 | 0.6 µm, 2 L, CMOS | 2,200,000 | 150 | 447 PGA |
R10000 | 0.35 µm, 4 L, CMOS | 6,700,000 | 200 | 599 LGA |
PA7100 | 0.8 µm, 3 L, CMOS | 850,000 | 100 | 504 PGA |
PA8000 | 0.35 µm, 5 L, CMOS | 3,800,000 | 180 | 1085 LGA |
PA8500 | 0.25 µm, ? L, CMOS | >120,000,000 | 250 | ? |
MC88110 | 0.8 µm, 3 L, CMOS | 1,300,000 | 50 | 299 |
AMD K6 | 0.35 µm, 5 L, CMOS | 8,800,000 | 233 | 321 PGA |
i860 XP | 0.8 µm, 3 L, CHMOS | 2,550,000 | 50 | 262 PGA |
Pentium II | 0.35 µm, ? L, CMOS | 7,500,000 | 300 | 242 SEC |
Legend:
x L—x-layer metal (x = 2, 3, 4);
PGA—pin grid array;
BGA—ball grid array;
LGA—land grid array;
SEC—single edge contact;
Microprocessor Architecture
Microprocessor | IU registers | FPU registers | VA | PA | EC Dbus | SYS Dbus |
PowerPC 601 | 32×32 | 32×64 | 52 | 32 | none | 64 |
PowerPC 604e | 32×32 +RB(12) | 32×64 +RB(8) | 52 | 32 | none | 64 |
PowerPC 620 | 32×64 +RB(8) | 32×64 +RB(8) | 80 | 40 | 128 | 128 |
Alpha 21064 | 32×64 | 32×64 | 43 | 34 | 128 | 128 |
Alpha 21164 | 32×64 +RB(8) | 32×64 | 43 | 40 | 128 | 128 |
Alpha 21264 | 32×64 +RB(48) | 32×64 +RB(40) | ? | 44 | 128 | 128 |
SuperSPARC | 136×32 | 32×32* | 32 | 36 | none | 64 |
UltraSPARC-I | 136×64 | 32×64 | 44 | 36 | 128 | 128 |
UltraSPARC-II | 136×64 | 32×64 | 44 | 36 | 128 | 128 |
R4400 | 32×64 | 32×64 | 40 | 36 | 128 | 64 |
R10000 | 32×64 +RB(32) | 32×64 +RB(32) | 44 | 40 | 128 | 64 |
PA7100 | 32×32 | 32×64 | 64 | 32 | ? | ? |
PA8000 | 32×64 +RB(56) | 32×64 | 48 | 40 | 64 | 64 |
PA8500 | 32×64 +RB(56) | 32×64 | 48 | 40 | 64 | 64 |
MC88110 | 32×32 | 32×80 | 32 | 32 | none | ? |
AMD K6 | 8×32 +RB(40) | 8×80 | 48 | 32 | 64 | 64 |
i860 XP | 32×32 | 32×32* | 32 | 32 | none | ? |
Pentium II | ? | 8×80 | 48 | 36 | 64 | 64 |
Legend:
IU—integer unit;
FPU—floating point unit;
VA—virtual address [bits];
PA—physical address [bits];
EC Dbus—external cache data bus width [bits];
SYS Dbus—system bus width [bits];
RB—rename buffer [size expressed in the number of registers];
* Can also be used as a 16×64 register file.
Microprocessor ILP Features
Microprocessor | ILP issue | LSU units | IU units | FPU units | GU units |
PowerPC 601 | 3 | 1 | 1 | 1 | 0 |
PowerPC 604e | 4 | 1 | 3 | 1 | 0 |
PowerPC 620 | 4 | 1 | 3 | 1 | 0 |
Alpha 21064 | 2 | 1 | 1 | 1 | 0 |
Alpha 21164 | 4 | 1 | 2 | 2 | 0 |
Alpha 21264 | 4 | 1 | 4 | 2 | 0 |
SuperSPARC | 3 | 0 | 2 | 2 | 0 |
UltraSPARC-I | 4 | 1 | 4 | 3 | 2 |
UltraSPARC-II | 4 | 1 | 4 | 3 | 2 |
R4400 | 1* | 0 | 1 | 1 | 0 |
R10000 | 4 | 1 | 2 | 2 | 0 |
PA7100 | 2 | 1 | 1 | 3 | 0 |
PA8000 | 4 | 2 | 2 | 4 | 0 |
PA8500 | 4 | 2 | 2 | 4 | 0 |
MC88110 | 2 | 1 | 3 | 3 | 2 |
AMD K6 | 6** | 2 | 2 | 1 | 1*** |
i860 XP | 2 | 1 | 1 | 2 | 1 |
Pentium II | 5** | ? | ? | ? | ? |
Legend:
ILP = instruction level parallelism;
LSU = load/store or address calculation unit;
IU = integer unit;
FPU = floating point unit;
GU = graphics unit;
* Superpipelined;
** RISC instructions, one or more of them are needed to emulate an 80x86 instruction;
*** MMX (multimedia extensions) unit.
Microprocessor Cache Memory
Microprocessor |
L1 Icache, Kbytes |
L1 Dcache, Kbytes |
L2 cache, Kbytes |
PowerPC 601 |
32, 8WSA, UNI |
— |
|
PowerPC 604e |
32, 4WSA |
32, 4WSA |
— |
PowerPC 620 |
32, 8WSA |
32, 8WSA |
—* |
Alpha 21064 |
8, DIR |
8, DIR |
—* |
Alpha 21164 |
8, DIR |
8, DIR |
96, 3WSA* |
Alpha 21264 |
64, 2WSA |
64, DIR |
—* |
SuperSPARC |
20, 5WSA |
16, 4WSA |
— |
UltraSPARC-I |
16, 2WSA |
16, DIR |
—* |
UltraSPARC-II |
16, 2WSA |
16, DIR |
—* |
R4400 |
16, DIR |
16, DIR |
—* |
R10000 |
32, 2WSA |
32, 2WSA |
—* |
PA7100 |
0 |
—** |
|
PA8000 |
0 |
—** |
|
PA8500 |
512, 4WSA |
1024, 4WSA |
— |
MC88110 |
8, 2WSA |
8, 2WSA |
— |
AMD K6 |
32, 2WSA |
32, 2WSA |
—* |
i860 XP |
16, 4WSA |
16, 4WSA |
— |
Pentium II |
16, ? |
16, ? |
512, ?*** |
Legend:
Icache—on-chip instruction cache;
Dcache—on-chip data cache;
L2 cache—on chip L2 cache;
DIR—direct mapped;
xWSA—x-way set associative;
UNI—unified L1 instruction and data cache;
* on-chip cache controller for external L2 cache;
** on-chip cache controller for external L1 cache;
*** L2 cache is in the same package, but on a different silicon die.
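The DIR and xWSA entries in the table describe how an address selects a place in the cache. A minimal sketch of that mapping follows. The struct and function names are hypothetical, and the line size is an assumption, because the table lists only capacity and associativity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cache description; line_bytes is an assumed parameter,
   since the table gives only capacity and associativity. */
struct cache_geom {
    uint32_t size_bytes;  /* total capacity, e.g. 8 * 1024 for "8, DIR" */
    uint32_t ways;        /* 1 for DIR, x for xWSA */
    uint32_t line_bytes;  /* assumed line size in bytes */
};

/* Number of sets = capacity / (ways * line size). */
static uint32_t cache_sets(const struct cache_geom *g)
{
    assert(g->ways > 0 && g->line_bytes > 0);
    return g->size_bytes / (g->ways * g->line_bytes);
}

/* Set selected by an address: drop the offset bits, then wrap by set count. */
static uint32_t cache_set_index(const struct cache_geom *g, uint64_t addr)
{
    return (uint32_t)((addr / g->line_bytes) % cache_sets(g));
}

/* Tag kept with the line: the address bits above the set index. */
static uint64_t cache_tag(const struct cache_geom *g, uint64_t addr)
{
    return (addr / g->line_bytes) / cache_sets(g);
}
```

With an assumed 32-byte line, an 8-Kbyte direct-mapped cache has 256 sets of one line each, while a 16-Kbyte two-way cache has 256 sets of two lines each. In the second case an address can reside in either way of its set.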
Miscellaneous Microprocessor Features
Microprocessor |
ITLB |
DTLB |
BPS |
PowerPC 601 |
256, 2WSA, UNI |
—* |
|
PowerPC 604e |
128, 2WSA |
128, 2WSA |
512 × 2BC |
PowerPC 620 |
128, 2WSA |
128, 2WSA |
2048 × 2BC |
Alpha 21064 |
12 |
32 |
4096 × 2BC |
Alpha 21164 |
48 ASOC |
64 ASOC |
ICS × 2BC |
Alpha 21264 |
? |
? |
? |
SuperSPARC |
64 ASOC, UNI |
? |
|
UltraSPARC-I |
64 ASOC |
64 ASOC |
ICS × 2BC |
UltraSPARC-II |
64 ASOC |
64 ASOC |
ICS × 2BC |
R4400 |
48 ASOC |
48 ASOC |
— |
R10000 |
64 ASOC |
64 ASOC |
? × 2BC |
PA7100 |
16 |
120 |
? |
PA8000 |
4 |
96 |
256 × 3BSR |
PA8500 |
160, UNI |
>256 × 2BC |
|
MC88110 |
40 |
40 |
? |
AMD K6 |
64 |
64 |
8192 × 2BC, 16 × RAS |
i860 XP |
64, UNI |
? |
|
Pentium II |
? |
? |
? |
Legend:
ITLB—translation lookaside buffer for code [entries];
DTLB—translation lookaside buffer for data [entries];
2WSA—two-way set associative; ASOC = fully associative;
UNI—unified TLB for code and data;
BPS—branch prediction strategy;
2BC—two-bit counter;
3BSR—three bit shift register;
RAS—return address stack;
ICS—instruction cache size (2BC for every instruction in the instruction cache);
* hinted instructions available for static branch prediction.
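The 2BC entries refer to two-bit counters. A minimal sketch of one such counter is given below, using the common textbook encoding (0 and 1 predict not taken, 2 and 3 predict taken). This encoding and the function names are assumptions; the individual processors in the table may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Counter values 0..3; the encoding is the usual textbook one (assumed). */
typedef uint8_t two_bit_counter;

/* Predict taken when the counter is in its upper half. */
static bool predict_taken(two_bit_counter c)
{
    assert(c <= 3);
    return c >= 2;
}

/* Move one step toward the observed outcome, saturating at 0 and 3. */
static two_bit_counter update_counter(two_bit_counter c, bool taken)
{
    assert(c <= 3);
    if (taken)
        return (two_bit_counter)(c < 3 ? c + 1 : 3);
    return (two_bit_counter)(c > 0 ? c - 1 : 0);
}
```

Because the counter saturates, a single mispredicted iteration (for example, a loop exit) does not flip the prediction of a branch that is otherwise strongly taken.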
intel®
PENTIUM™ PROCESSOR
The Pentium processor provides the new generation of power for high-end workstations and servers. The Pentium processor is compatible with the entire installed base of applications for DOS, Windows, OS/2, and UNIX. The Pentium processor’s superscalar architecture can execute two instructions per clock cycle. Branch prediction and separate caches also increase performance. The pipelined floating point unit of the Pentium processor delivers workstation level performance. Separate code and data caches reduce cache conflicts while remaining software transparent. The Pentium processor has |
Pentium™ Processor Pinout (Top View)
Figure MPSS1: Pentium pin layout (source: [Intel93])
Legend: Self-explanatory.
Architecture Overview:
Pin Functional Grouping
Function | Pins |
Clock | CLK |
Initialization | RESET, INIT |
Address Bus | A31–A3, BE7#–BE0# |
Address Mask | A20M# |
Data Bus | D63–D0 |
Address Parity | AP, APCHK# |
Data Parity | DP7–DP0, PCHK#, PEN# |
Internal Parity Error | IERR# |
System Error | BUSCHK# |
Bus Cycle Definition | M/IO#, D/C#, W/R#, CACHE#, SCYC, LOCK# |
Bus Control | ADS#, BRDY#, NA# |
Page Cacheability | PCD, PWT |
Cache Control | KEN#, WB/WT# |
Cache Snooping/Consistency | AHOLD, EADS#, HIT#, HITM#, INV |
Cache Flush | FLUSH# |
Write Ordering | EWBE# |
Bus Arbitration | BOFF#, BREQ, HOLD, HLDA |
Interrupts | INTR, NMI |
Floating Point Error Reporting | FERR#, IGNNE# |
System Management Mode | SMI#, SMIACT# |
Functional Redundancy Checking | FRCMC# (IERR#) |
TAP Port | TCK, TMS, TDI, TDO, TRST# |
Breakpoint/Performance Monitoring | PM0/BP0, PM1/BP1, BP3–2 |
Execution Tracing | BT3–BT0, IU, IV, IBT |
Probe Mode | R/S#, PRDY |
Figure MPSS2: Pentium pin functions (source: [Intel93])
Legend:
TAP—Processor boundary scan.
Pentium™ Processor Block Diagram
Figure MPSS3: Pentium block diagram (source: [Intel93])
Legend:
TLB—Translation Lookaside Buffer.
Intel486™ Pipeline Execution
PF | I1 | I2 | I3 | I4 | | | | |
D1 | | I1 | I2 | I3 | I4 | | | |
D2 | | | I1 | I2 | I3 | I4 | | |
EX | | | | I1 | I2 | I3 | I4 | |
WB | | | | | I1 | I2 | I3 | I4 |
Pentium™ Pipeline Execution
PF | I1 | I3 | I5 | I7 | | | | |
| I2 | I4 | I6 | I8 | | | | |
D1 | | I1 | I3 | I5 | I7 | | | |
| | I2 | I4 | I6 | I8 | | | |
D2 | | | I1 | I3 | I5 | I7 | | |
| | | I2 | I4 | I6 | I8 | | |
EX | | | | I1 | I3 | I5 | I7 | |
| | | | I2 | I4 | I6 | I8 | |
WB | | | | | I1 | I3 | I5 | I7 |
| | | | | I2 | I4 | I6 | I8 |
Figure MPSS4: Intel 486 pipeline versus Pentium pipeline (source: [Intel93])
Legend:
PF—Prefetch;
D1/2—Decoding 1/2;
EX—Execution;
WB—Writeback.
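The difference shown in Figure MPSS4 can be expressed as a simple clock count. The sketch below assumes ideal conditions (no stalls, and every pair of instructions satisfies the pairing rules); the function name is hypothetical.

```c
#include <assert.h>

/* Clocks needed to run n instructions through a pipeline with `stages`
   stages that accepts up to `width` instructions per clock.
   Assumes no stalls and perfect pairing (an idealization). */
static unsigned pipeline_clocks(unsigned n, unsigned stages, unsigned width)
{
    assert(stages > 0 && width > 0);
    if (n == 0)
        return 0;
    unsigned groups = (n + width - 1) / width;  /* ceil(n / width) */
    return stages + groups - 1;                 /* fill time plus one clock per group */
}
```

For the five-stage pipelines of the figure, `pipeline_clocks(4, 5, 1)` and `pipeline_clocks(8, 5, 2)` both give 8: in the same eight clocks the i486 completes four instructions and the Pentium completes eight.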
Instructions Prefetch:
until a branch is fetched.
Pipeline Stage D1 (Decode 1):
one per clock cycle;
base instruction is issued and paired with others,
after all prefixes have been issued.
Pipeline Stage D2 (Decode 2):
Pipeline Stage EX (Execute):
Pipeline Stage WB (Writeback):
Stall:
If u-pipe is stalled, the v-pipe is stalled, too.
If v-pipe is stalled, the u-pipe proceeds.
No successive instructions into EX
before both pipes advanced to WB.
Instruction Pairing Rules:
(a) Both must be "simple"
(b) No RAW or WAW dependencies
(c) Neither can contain both
a displacement and an immediate
(d) Instructions with prefixes
(other than the 0F prefix of Jcc)
can occur only in the u-pipe.
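Rules (a) through (d) can be read as a predicate over a candidate pair. The following is only a sketch: the `struct insn` summary and its fields are hypothetical, and the definition of a "simple" instruction is taken as a given flag rather than derived here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of a decoded instruction. */
struct insn {
    bool     simple;      /* rule (a): classified as "simple" */
    uint32_t reads;       /* bit mask of registers read */
    uint32_t writes;      /* bit mask of registers written */
    bool     has_disp;    /* carries a displacement */
    bool     has_imm;     /* carries an immediate */
    bool     has_prefix;  /* has a prefix other than 0F of Jcc */
};

/* May u (u-pipe) and v (v-pipe) issue together under rules (a)-(d)? */
static bool can_pair(const struct insn *u, const struct insn *v)
{
    assert(u != NULL && v != NULL);
    if (!u->simple || !v->simple)
        return false;                               /* (a) both simple */
    if ((u->writes & v->reads) != 0)
        return false;                               /* (b) RAW dependency */
    if ((u->writes & v->writes) != 0)
        return false;                               /* (b) WAW dependency */
    if ((u->has_disp && u->has_imm) || (v->has_disp && v->has_imm))
        return false;                               /* (c) disp and imm together */
    if (v->has_prefix)
        return false;                               /* (d) prefixed only in u-pipe */
    return true;
}
```

The check is asymmetric in rule (d): a prefixed instruction is acceptable as `u` but never as `v`.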
Branch Prediction:
by two simultaneous write misses
in the two instruction pipes.
Example:
for (k = i + prime; k <= SIZE; k += prime)
    flags[k] = FALSE;
Execution time:
Texe[Pentium (with branch prediction)]=2
Texe[i486]=6
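For context, the quoted loop has the shape of the inner loop of the classic sieve of Eratosthenes benchmark. The sketch below places it in a complete routine. The surrounding code, the value of SIZE, and the mapping of flags[i] to the odd number 2*i + 3 are assumptions, not taken from the slide; `bool` values replace the slide's FALSE constant.

```c
#include <assert.h>
#include <stdbool.h>

#define SIZE 8190              /* assumed; the slide does not give SIZE */

static bool flags[SIZE + 1];   /* flags[i] stands for the odd number 2*i + 3 */

/* Returns the number of odd primes found in 3 .. 2*SIZE + 3. */
static int sieve(void)
{
    int count = 0;
    for (int i = 0; i <= SIZE; i++)
        flags[i] = true;
    for (int i = 0; i <= SIZE; i++) {
        if (flags[i]) {
            int prime = i + i + 3;
            assert(prime > 0);
            /* The loop quoted in the slide: strike out odd multiples of prime. */
            for (int k = i + prime; k <= SIZE; k += prime)
                flags[k] = false;
            count++;
        }
    }
    return count;
}
```

The inner loop body is a store followed by an add, a compare, and a backward branch that is almost always taken, which is the situation where branch prediction pays off.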
External Event Synchronization:
Serializing Operations:
External Interrupt:
BUSCHK#
R/S#
FLUSH#
SMI#
INIT
NMI
INTR
Writeback Buffers:
Model Specific Registers:
RDMSR
WRMSR
Value | Register Name | Description |
00H | Machine Check Address (MCA) | Stores address of cycle causing the exception |
01H | Machine Check Type (MCT) | Stores cycle type of cycle causing the exception |
0EH | Test Register 12 (TR12) | New feature control |
Figure MPSS5: Model specific register manipulation (source: [Intel93])
Legend:
H—Hexadecimal.
Floating-Point Unit:
Floating-Point Pipeline Stages:
PF Prefetch;
D1 Instruction decode;
D2 Address generation;
EX Memory and register read;
X1 Floating-point execute stage # one;
X2 Floating-point execute stage # two;
WF Rounding and writing the floating-point result
to register file;
ER Error report + update status word.
On-Chip Caches:
Cache Organization:
CD=NW=0
CD=NW=1
Organization of Instruction
and Data Caches
[Diagram. Data cache: each set holds a TAG address and a MESI state for WAY 0 and for WAY 1, with an LRU field between the two ways. Instruction cache: each set holds a TAG address and a state bit (S or I) for WAY 0 and for WAY 1, with an LRU field between the two ways.]
Figure MPSS6: Organization of instruction and data caches (source: [Intel93])
Legend:
MESI—Modified/Exclusive/Shared/Invalid;
LRU—Least Recently Used.
PCD and PWT Generation
Figure MPSS7: Generation of PCD and PWT (source: [Intel93])
Legend:
PCD—a bit which controls cacheability on a page by page basis;
PWT—a bit which controls write policy for the second level caches;
PTRS—Pointers.
Page Cacheability:
PWT=1 (write through)
PWT=0 (write back)
PCD=0 (caching enabled)
PCD=1 (caching disabled)
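A minimal sketch of decoding these two bits from a page-table entry follows. The bit positions (PWT in bit 3, PCD in bit 4) are those of a 32-bit x86 page-table entry and are not stated on the slide; the helper names are hypothetical. The sketch covers only the per-page bits, not the CD and NW control bits that also take part in the generation of the PCD and PWT pins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in a 32-bit x86 page-table entry (not given on the slide). */
#define PTE_PWT (1u << 3)   /* 1 = write through, 0 = write back */
#define PTE_PCD (1u << 4)   /* 1 = caching disabled, 0 = caching enabled */

static bool page_is_cacheable(uint32_t pte)
{
    return (pte & PTE_PCD) == 0;
}

static bool page_is_write_through(uint32_t pte)
{
    return (pte & PTE_PWT) != 0;
}

/* Return pte with its PCD and PWT bits set for the requested policy. */
static uint32_t pte_with_policy(uint32_t pte, bool cacheable, bool write_through)
{
    pte &= ~(PTE_PCD | PTE_PWT);
    if (!cacheable)
        pte |= PTE_PCD;
    if (write_through)
        pte |= PTE_PWT;
    assert(page_is_cacheable(pte) == cacheable);
    return pte;
}
```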
Inquire Cycles:
Cache Flushing:
The MESI Protocol:
M - Modified: An M-state line is available in ONLY one cache,
and it is also MODIFIED (different from main memory).
An M-state line can be accessed (read/written to)
without sending a cycle out on the bus.
E - Exclusive: An E-state line is also available in only one cache in the system, but the line is not MODIFIED
(i.e., it is the same as main memory).
An E-state line can be accessed (read/written to)
without generating a bus cycle.
A write to an E-state line will cause the line to become MODIFIED.
S - Shared: This state indicates that the line is potentially shared
with other caches
(i.e., the same line may exist in more than one cache).
A read to an S-state line will not generate bus activity,
but a write to a SHARED line
will generate a write-through cycle on the bus.
The write-through cycle may invalidate this line in other caches. A write to an S-state line will update the cache.
I - Invalid: This state indicates that the line is not available in the cache.
A read to this line will be a MISS,
and may cause the Pentium processor to execute LINE FILL.
A write to an INVALID line causes the Pentium processor
to execute a write-through cycle on the bus.
Figure MPSS8: Definition of states for the MESI and the SI protocols (source: [Intel93])
Legend:
LINE FILL—Fetching the whole line into the cache from main memory.
Present State | Pin Activity | Next State | Description |
M | n/a | M | Read hit; |
E | n/a | E | Read hit; |
S | n/a | S | Read hit; |
I | CACHE# low AND KEN# low AND WB/WT# high AND PWT low | E | Data item does not exist in cache (MISS). |
I | CACHE# low AND KEN# low AND (WB/WT# low OR PWT high) | S | Same as previous read miss case |
I | CACHE# high AND KEN# high | I | KEN# pin inactive; |
Figure MPSS9: Data cache state transitions for UNLOCKED Pentium™ processor initiated read cycles*
(source: [Intel93])
Legend:
*—Locked accesses to the data cache will cause the accessed line to transition to the Invalid state.
Present State | Pin Activity | Next State | Description |
M | n/a | M | Write hit; update data cache. No bus cycle generated to update memory. |
E | n/a | M | Write hit; update cache only. No bus cycle generated; line is now MODIFIED. |
S | PWT low AND WB/WT# high | E | Write hit; data cache updated with write data item. A write-through cycle is generated on the bus to update memory and/or invalidate contents of other caches. The state transition occurs after the write-through cycle completes on the bus (with the last BRDY#). |
S | PWT low AND WB/WT# low | S | Same as above case of write to S-state line except that WB/WT# is sampled low. |
S | PWT high | S | Same as above cases of writes to S-state lines except that this is a write hit to a line in a write-through page; status of WB/WT# pin is ignored. |
I | n/a | I | Write MISS; a write-through cycle is generated on the bus to update external memory. No allocation done. |
Figure MPSS10: Data cache state transitions for UNLOCKED Pentium™ processor initiated write cycles* (source: [Intel93])
Legend:
WB/WT—Writeback/Writethrough.
Present State | Next State INV=1 | Next State INV=0 | Description |
M | I | S | Snoop hit to a MODIFIED line indicated by HIT# and HITM# pins low. Pentium™ processor schedules the writing back of the modified line to memory. |
E | I | S | Snoop hit indicated by HIT# pin low; |
S | I | S | Snoop hit indicated by HIT# pin low; |
I | I | I | Address not in cache; HIT# pin high. |
Figure MPSS11: Data cache state transitions during inquire cycles (source: [Intel93])
Legend:
INV—Invalid bit.
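The three transition tables (Figures MPSS9 to MPSS11) can be summarized as next-state functions. This is a sketch with hypothetical names: pin levels are passed as booleans, and any pin combination not listed in the read table is treated as a non-cacheable access, which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

enum mesi { MESI_M, MESI_E, MESI_S, MESI_I };

/* Unlocked read (Figure MPSS9). cacheable means CACHE# low AND KEN# low;
   all other combinations are treated as non-cacheable (assumption). */
static enum mesi mesi_read(enum mesi s, bool cacheable,
                           bool wbwt_high, bool pwt_high)
{
    assert(s <= MESI_I);
    if (s != MESI_I)
        return s;                      /* read hit: state unchanged */
    if (!cacheable)
        return MESI_I;                 /* miss without line fill */
    return (wbwt_high && !pwt_high) ? MESI_E : MESI_S;
}

/* Unlocked write (Figure MPSS10). */
static enum mesi mesi_write(enum mesi s, bool pwt_high, bool wbwt_high)
{
    assert(s <= MESI_I);
    switch (s) {
    case MESI_M:
    case MESI_E:
        return MESI_M;                 /* write hit, no bus cycle */
    case MESI_S:                       /* write-through cycle on the bus */
        return (!pwt_high && wbwt_high) ? MESI_E : MESI_S;
    default:
        return MESI_I;                 /* write miss, no allocation */
    }
}

/* Inquire cycle (Figure MPSS11); inv is the sampled INV pin. */
static enum mesi mesi_snoop(enum mesi s, bool inv)
{
    assert(s <= MESI_I);
    if (s == MESI_I)
        return MESI_I;                 /* address not in cache */
    return inv ? MESI_I : MESI_S;
}
```

The functions return only the next state; the bus activity listed in the Description column (line fill, write-through cycle, writeback of a modified line) is not modeled.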
Reference:
[Intel93] "Pentium Processor User’s Manual," Intel, Santa Clara, California, USA, 1993.
MPS:
State of the Art
Pentium MMX
New instructions of the Pentium MMX processor (source: [Intel97])
EMMS—Empty MMX state
MOVD—Move doubleword
MOVQ—Move quadword
PACKSSDW—Pack doubleword to word data (signed with saturation)
PACKSSWB—Pack word to byte data (signed with saturation)
PACKUSWB—Pack word to byte data (unsigned with saturation)
PADD—Add with wrap-around
PADDS—Add signed with saturation
PADDUS—Add unsigned with saturation
PAND—Bitwise And
PANDN—Bitwise AndNot
PCMPEQ—Packed compare for equality
PCMPGT—Packed compare greater (signed)
PMADD—Packed multiply add
PMULH—Packed multiplication
PMULL—Packed multiplication
POR—Bitwise Or
PSLL—Packed shift left logical
PSRA—Packed shift right arithmetic
PSRL—Packed shift right logical
PSUB—Subtract with wrap-around
PSUBS—Subtract signed with saturation
PSUBUS—Subtract unsigned with saturation
PUNPCKH—Unpack high data to next larger type
PUNPCKL—Unpack low data to next larger type
PXOR—Bitwise Xor
Legend:
MMX—MultiMedia eXtension.
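The list distinguishes wrap-around from saturating arithmetic. The sketch below emulates both on eight unsigned bytes packed into a 64-bit word, in plain C. It is an illustration of the arithmetic only, with hypothetical function names; it does not use MMX instructions or compiler intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Extract byte lane 0..7 from a packed 64-bit value. */
static uint8_t lane_of(uint64_t v, int lane)
{
    assert(lane >= 0 && lane < 8);
    return (uint8_t)(v >> (8 * lane));
}

/* Add eight packed bytes; each lane wraps modulo 256 (PADD-style). */
static uint64_t add_bytes_wrap(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t sum = (uint8_t)(lane_of(a, lane) + lane_of(b, lane));
        r |= (uint64_t)sum << (8 * lane);
    }
    return r;
}

/* Add eight packed unsigned bytes; each lane clamps at 255 (PADDUS-style). */
static uint64_t add_bytes_usat(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned sum = (unsigned)lane_of(a, lane) + lane_of(b, lane);
        if (sum > 255u)
            sum = 255u;
        r |= (uint64_t)sum << (8 * lane);
    }
    return r;
}
```

For pixel data, saturation is usually the wanted behavior: adding 0x20 to a brightness of 0xF0 clamps to 0xFF instead of wrapping to the dark value 0x10.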
Pentium Pro:
Basic features:
Pentium Pro Block Diagram
Legend:
AGU | Address generation unit | L2 | Level-2 cache |
BIU | Bus interface unit | MIS | Microinstruction sequencer |
BTB | Branch target buffer | MIU | Memory interface unit |
DCU | Data cache unit | MOB | Memory reorder buffer |
FEU | Floating-point execution unit | RAT | Register alias table |
ID | Instruction decoder | ROB | Reorder buffer |
IEU | Integer execution unit | RRF | Retirement register file |
IFU | Instruction fetch unit (with I-cache) | RS | Reservation station |
Pentium Pro and Pentium II Bus Structures
Legend:
SB—single independent bus;
DIB—dual independent bus;
CLC—control logic chipset;
L2—second level cache.
References:
[Papworth96] Papworth, D. B.,
"Tuning the Pentium Pro Microarchitecture,"
IEEE Micro, April 1996, pp. 8–16.
[Intel96] http://www.intel.com/procs/p6/p6white/index.html,
Intel, Santa Clara, California, USA, 1996.
Intel COO Craig Barrett’s Vision: 2000
Microprocessor Performance Trends
Process Technology: Delay Trends
Figure MPSS1: Microprocessor chip delay trends (source: [Sheaffer96])
Legend:
Metal 2 (2 mm)—Two level metal.
Process Technology: Area Trends
Silicon process technology | 1.5 µm | 1.0 µm | 0.8 µm | 0.6 µm | 0.35 µm | 0.25 µm |
Intel386™ DX Processor |
Intel486™ DX Processor |
Pentium® Processor |
Pentium® Pro Processor | est | est |
[The die-area graphics that filled the table cells are not reproduced.]
Figure MPSS2: Microprocessor chip area trends (source: [Sheaffer96])
Legend:
est—estimated.
Frequency of Operation
Figure MPSS3: Microprocessor chip operation frequency (source: [Sheaffer96])
Legend:
PP—Pentium Processor;
PPro—Pentium Pro Processor.
Microprocessor Complexity
Figure MPSS4: Microprocessor and memory complexity (source: [Sheaffer96])
Brainiacs and Speedemons
Figure MPSS5: Microprocessor sophistication (source: [Sheaffer96])
Current Trends in Design
Figure MPSS6: Microprocessor time budget (source: [Sheaffer96])
Legend:
L1/2—First/second level cache.
[Sheaffer96] Sheaffer, G.,
"Trends in Microprocessing,"
Keynote Address,
YU-INFO-96,
Brezovica, Serbia, Yugoslavia,
April 1996.
MPS:
IFACT
Ten Example Models of a RISC Design
Models:
References:
[Milicev97] Milicev, D., Petkovic, Z., Raskovic, D., Stefanovic, D., Zivkovic, M., Jelic, D., Robal, M., Jelisavcic, M., Milenkovic, A., Milutinovic, V.,
"Models of Modern Microprocessors,"
IEEE Transactions on Education, 1997.
[Milicev96] http://ubbg.etf.rs/~emiliced/,
University of Belgrade, Belgrade, Serbia, Yugoslavia, 1996.
Editor: Will Tracz, Loral Federal Systems, MD 0210, Owego, NY 13827; Internet, tracz@lfs.loral.com |
Ten lessons learned from a RISC design
Lessons can be learned anywhere on earth, and we’ve accumulated a few from our international project: a 64-bit RISC processor design using silicon compilation (with 2.5 million transistors) that took two years to complete. Project teams were located on three continents: a US company provided the hardware description language; a European group (the two of us) was responsible for generating the HDL-based model that correctly described all signals on all pins for each instruction and every operational mode; and a Japanese company generated over 10 Mbytes of tests. Our team’s task was then to pass these tests, after which another US company did the silicon compilation. Finally, another Japanese company did the fabrication. You can imagine the possibilities for complexity! Here are a few of the many lessons we learned.
Lesson #1: It’s tough for just one person to understand everything. A silicon compiler’s essential value is that it enables one person to fully understand a relatively complex design task; however, it’s extremely difficult for one person to manage every detail. In our case, the details were all signals on all pins for every instruction executed in each operational mode. It’s important that future HDL extensions contain language constructs to efficiently express such details!
Lesson #2: Coding rules for silicon compilation are underdeveloped. One nice thing about HDLs is that they let you adequately exploit the full parallelism at the lowest hardware levels for efficient programming. However, current silicon compilers get "confused" with too much parallelism, so the HDL programmer must serialize the description, which negatively affects programmer productivity. The solution? Develop design rules characterized by maximum parallelism yet without negative effects on synthesis efficiency!
Lesson #3: Don’t let silicon compiler warnings get you down. We’ve noticed that many silicon compilers generate correlated warnings. Consequently, a huge number of warnings results in a mere handful of coding rule violations. Therefore, the generated warnings must be orthogonalized!
Lesson #4: Be careful when naming variables. A silicon compiler shouldn’t specify how variable names are created. For example, our register variable names had to start with "r_." This can be confusing, especially if it is required of the HDL programmers after they’ve mostly completed their task.
Lesson #5: The environment keeps changing. The silicon compiler was fully developed by the time we started our work, but the programming rules that enabled synthesis were not. Consequently, creating rules was a trial-and-error experience. Also, the silicon-compilation design process is still lengthy, so the project requirements are likely to change during the design process. Nothing new!
Lesson #6: Testing is still the bottleneck. The first 90 percent of the project—design—was completed in six months, while the remaining 10 percent—testing—needed another 18 months!
Lesson #7: Beware the NIH problems. People who work in high tech tend to think very highly of themselves, and that characteristic caused some problems of the NIH ("not invented here") variety. When a test failed, entirely too much time was spent trying to determine who made the error rather than getting on with fixing it. The typical reaction was always to blame someone else for the error.
Lesson #8: Working on three continents is both pleasure and pain. If the phone woke you up in the early morning, you knew the call was from Japan. If the phone woke you up late at night, you knew the call was from the USA. After a while, you learned the best time to send e-mail to get a prompt response. Cultural differences, although a source of fun, can provoke misunderstandings that create hard feelings.
Lesson #9: Time to market is still an issue. A major driver of silicon compiler development is fast time to market. However, the goal has not yet been met to accelerate very sophisticated processor-logic designs adequately. There’s lots of research room for new methodologies in the over-one-million-transistor arena.
Lesson #10: We’re always more clever after the fact! As the saying goes, "hindsight is 20-20." Looking back, it’s obvious that better planning up front would have eliminated many problems (although, unfortunately, none of the above!). Better planning would definitely have reduced the 18 months it took to eliminate the last 10 percent of errors. Also, we’re all now older and wiser, with two years’ more experience!
OUR PROCESSOR DESIGN PROJECT—with all its lessons learned—was one of life’s special experiences. During the two long years of work, one of the project team members passed away, and another one received a beautiful new baby. Sometimes, life resembles engineering so much!
Veljko Milutinovic and Zvezdan Petkovic
University of Belgrade
emilutiv@ubbg.etf.rs