# FREESS: An Educational Simulator of a RISC-V-Inspired Superscalar Processor Based on Tomasulo's Algorithm

Roberto Giorgi giorgi@unisi.it University of Siena Siena, Italy

### **Abstract**

FREESS is a free, interactive simulator that illustrates instruction-level parallelism in a RISC-V-inspired superscalar processor. Based on an extended version of Tomasulo's algorithm, FREESS is intended as a hands-on educational tool for Advanced Computer Architecture courses. It enables students to explore dynamic, out-of-order instruction execution, emphasizing how instructions are issued as soon as their operands become available.

The simulator models key microarchitectural components, including the Instruction Window (IW), Reorder Buffer (ROB), Register Map (RM), Free Pool (FP), and Load/Store Queues. FREESS allows users to dynamically configure runtime parameters, such as the superscalar issue width, functional unit types and latencies, and the sizes of architectural buffers and queues.

To simplify learning, the simulator uses a minimal instruction set inspired by RISC-V (ADD, ADDI, BEQ, BNE, LW, MUL, SW), which is sufficient to demonstrate key pipeline stages: fetch, register renaming, out-of-order dispatch, execution, completion, commit, speculative branching, and memory access. FREESS includes three step-by-step, illustrated examples that visually demonstrate how multiple instructions can be issued and executed in parallel within a single cycle. Being open source, FREESS encourages students and educators to experiment freely by writing and analyzing their own instruction-level programs and superscalar architectures.

# **CCS** Concepts

• Computer systems organization  $\rightarrow$  Superscalar architectures; Reduced instruction set computing; • General and reference  $\rightarrow$  Evaluation; • Applied computing  $\rightarrow$  Computer-assisted instruction.

# **Keywords**

Superscalar, Simulator, RISC-V, Tomasulo


WCAE '25, Tokyo, Japan

© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-XXXX-X/2025/06



Figure 1: Structure of an Out-Of-Order Processor in FREESS.

#### 1 Introduction

Most modern microprocessors in medium- and high-end systems adopt *superscalar* architectures, building upon dynamic scheduling mechanisms such as *Thornton's scoreboard* [10] and *Tomasulo's algorithm* [11]. In this paper, we focus on Tomasulo's algorithm to illustrate how machine instructions can be executed *in parallel*, transparently to the user. These techniques are commonly known as *dynamic scheduling*, *out-of-order execution*, or *restricted dataflow* [4].

This topic is therefore a cornerstone for *Advanced Computer Architecture* courses, where *Instruction-Level Parallelism (ILP)* plays a key role in achieving *High-Performance Computing* and in accelerating *single-threaded* execution.

Superscalar architectures provide an elegant hardware-based solution to the problem of tracking control and data dependencies among instructions. While thread- and data-level parallelism are useful, executing multiple instructions per cycle can also significantly boost performance. Normally, one instruction takes several cycles to complete. A first performance improvement comes from *pipelining*, but a single pipeline can still issue at most one instruction per cycle, a limit also known as *Flynn's bottleneck*. This limitation is overcome by issuing multiple instructions to independent functional units, as done in *superscalar processors* (via hardware scheduling) or *VLIW architectures* (via software scheduling).

The superscalar approach based on Tomasulo's extended algorithm augments the pipeline with several hardware structures that dynamically track data dependencies and execute instructions according to the *dataflow* principle: firing them as soon as their operands are ready. The most relevant structures involved in tracking dependencies include: *Physical Registers (Px)*, the *Free Pool (FP)*, the *Register Map (RM)*, the *Instruction Window (IW)*, the *Reorder Buffer (ROB)*, and the *Load/Store Queues (LSQs)* (see Fig. 1).

[Screenshot of the FREESS console at cycle 0. From top to bottom: register map and physical registers (qi, vi); logical registers (xi, Pi, Qi, Vi); resource accounting (total slots, busy slots, and stalls for the stages F, D, P, I, X, W, C, the renamed stream, the instruction window, the reorder buffer, and the functional units A, M, L, S, B, F, X); cycle-by-cycle evolution of the first four instructions (LW x3,0(x4); LW x7,128(x5); MUL x7,x7,x3; ADDI x1,x1,-1), all entering the F stage at cycle 0.]

Figure 2: Cycle-by-cycle simulation progress: the screen shows the state of the superscalar's most relevant internal structures.

Several superscalar simulators have been proposed to aid in teaching these concepts. These simulators effectively show the internal behavior of these hardware components. However, the information presented is often difficult to reproduce with pencil and paper, as updates are either overwritten on screen or dispersed across multiple disconnected windows.

FREESS (Free Educational Superscalar Simulator) addresses these limitations by providing a *single-screen* visualization summarizing both the current state and the cycle-by-cycle evolution of instructions through the superscalar pipeline. This evolution matches exactly what can be written on a single sheet of paper (see Fig. 2) while solving one of the exercises used for student training and for exams. Moreover, the output of FREESS can be printed on paper for reviewing or studying, without the need to interact with a computer. This approach encourages students to bridge theoretical concepts with manual tracing and reinforces the broader course's emphasis on performance-critical architectural design.

The syntax of the code recalls RISC-V instructions and registers due to the popularity of this novel instruction set, which is not bound to a single manufacturer and is widely adopted in Computer Architecture classes. FREESS supports a small but sufficient set of seven instructions (ADD, ADDI, BEQ, BNE, LW, MUL, and SW), following a minimalistic approach in the spirit of other well-known educational tools like the LC3 simulator [7]. Although adding more instructions is very simple in FREESS, the simulator aims at studying the superscalar internals, not at exploring a wide range of instructions or at replacing debuggers or other types of simulators.

### **Contributions**

The main contributions of this work are:

- to illustrate a methodology for teaching superscalar execution, where students can manually trace execution on paper using a layout that mirrors the one provided by the tool;
- to present FREESS, a simulator that provides a cycle-accurate view of the key hardware structures used in a Tomasulo-based superscalar processor;
- to support the open source release of FREESS, along with guidance on writing and visualizing RISC-V-like programs.

The rest of the paper is organized as follows: Section 2 discusses related work; Section 3 describes the simulator; Section 4 presents illustrative examples; Section 5 analyzes the educational impact; and Section 6 concludes the paper.

### 2 Related Work

Several educational simulators have been developed to support the teaching of superscalar architecture concepts, each offering various degrees of visualization and interactivity.

SIMDE [1] is designed to support dynamic and static scheduling through Tomasulo's algorithm and scoreboarding. It enables students to explore the flow of instructions and the hazard resolution mechanisms, providing insight into out-of-order execution and register renaming. Although SIMDE effectively represents data hazards and scheduling logic, its visualization primarily focuses on instruction status tables and resource usage. It lacks a complete, unified cycle-by-cycle overview of the instruction pipeline.

SATSim [12] provides an interactive, GUI-based environment to understand superscalar architectures. It emphasizes visual tracing of individual instructions and the state of internal buffers, which is valuable for observing execution behavior. However, SATSim offers limited feedback on stall causes or pipeline-wide performance trends, and it does not model the Load and Store Queues.

PSATSim [9] builds on SATSim by incorporating power and performance metrics. It allows users to explore the effects of microarchitectural changes on energy efficiency and throughput. Despite its improvements, PSATSim lacks comprehensive per-cycle visualization of pipeline stages.

Jaros [5] introduces a Web-based RISC-V simulator with superscalar support. Although accessible and platform-independent, this simulator primarily focuses on instruction execution and register state. Its support for architectural configuration is limited and does not include detailed memory pipeline modeling or stall diagnostics.

Other simulators, such as Ripes [8] and several others [6], model RISC-V pipelines but lack detailed superscalar modeling.

In contrast, FREESS offers a more holistic approach by providing a cycle-by-cycle visualization of the full instruction pipeline, including the Instruction Window, Register Map, Reorder Buffer, and particularly the Load and Store Queues, which are often omitted in comparable tools. FREESS highlights instruction-level parallelism by showing how multiple instructions proceed concurrently through fetch, rename, issue, execute, complete, and commit stages. Crucially, it also reports stall conditions and their causes, such as structural, data, or control hazards, making it easier for students to understand pipeline bottlenecks and dependency resolution. This level of feedback helps bridge the gap between theoretical understanding and practical insight into superscalar processor behavior.

Figure 3: At start, FREESS generates the text of an exercise with specific hypotheses on the architecture of the superscalar. For example, here a 16-slot instruction window, 24 physical registers and other key parameters have been specified.

# 3 Description

# 3.1 Launching the simulator

FREESS is written in pure C code, compatible with GCC from version 2.7 (year 2009) to present (year 2025) without a single warning. This characteristic makes the tool available on many computing platforms, including, for example, MS-Windows via the Ubuntu shell and, of course, all Linux-based systems. The tool is launched on the command line, and it first generates the text of the exercise, indicating the working hypotheses, based on the default options or the optional command parameters (Fig. 3). This feature is also useful for teachers to generate new exercises and for students to explore and recall the simulated architecture's key parameters. Fig. 4 shows some architectural parameters on an architectural sketch, and Fig. 5 lists them with a more detailed description.



Figure 4: FREESS Parameters.

| FREESS parameter | Unit        | Meaning                                                         |
|------------------|-------------|-----------------------------------------------------------------|
| -fw number       | insts/cycle | Number of instructions that can be fetched in a cycle           |
| -dw number       | insts/cycle | Number of instructions that can be decoded in a cycle           |
| -iw number       | insts/cycle | Number of instructions that can be issued in a cycle            |
| -cw number       | insts/cycle | Number of instructions that can be committed in a cycle         |
| -wins size       | insts       | Number of slots in the Instruction Window (IW)                  |
| -robs size       | insts       | Number of slots in the Re-Order Buffer (ROB)                    |
| -rreg number     | num         | Number of Logical Registers                                     |
| -preg number     | num         | Number of Physical Registers                                    |
| -lqs size        | insts       | Number of slots in the load queue (LSQ)                         |
| -lfu number      | num         | Number of load functional units                                 |
| -liat latency    | cycles      | Number of cycles to perform a load operation                    |
| -sqs size        | insts       | Number of slots in the store queue (LSQ)                        |
| -sfu number      | num         | Number of store functional units                                |
| -slat latency    | cycles      | Number of cycles to perform a store operation                   |
| -afu number      | num         | Number of integer ALUs                                          |
| -alat number     | cycles      | Number of cycles to perform an ALU operation                    |
| -mfu number      | num         | Number of integer Multipliers                                   |
| -mlat number     | cycles      | Number of cycles to perform an integer multiplication operation |

Figure 5: In yellow, the most important simulator parameters (many more parameters are available).

In Fig. 2, from top to bottom, the state of the superscalar structures can be located and monitored cycle by cycle. The first group shows the register map: a star appears below each physical register that is allocated, qi indicates whether the register is free, and vi is the register content. The second group shows the logical registers (xi), the associated physical register (Pi), whether the value is current (Qi), and the value itself (Vi).

The third group represents the accounting of the resources in terms of slots in the buffer stages (F, D, P, I, X, W, C)<sup>1</sup>, in the renaming logic, in the instruction window, in the reorder buffer, in each of the functional units (A, M, L, S, B, F, X)<sup>2</sup>. The fourth group presents for each instruction: the dynamic program counter (PC), the cycle when the instruction enters a certain stage, and other info detailed in the paper's text and the next figures.

<sup>1</sup>The letters indicate the classic stages: F=Fetch, D=Decode/Renaming, P=Dispatch, I=Issue, X=Execute, W=Write-back, and C=Commit.

<sup>2</sup>The letters indicate the functional units: A=ALU, M=integer multiplier, L=Load, S=Store, B=branch, F=floating-point add/sub, X=floating-point mul/div.

**Table 1: FREESS instructions** 

| Mnemonic | Operation Code (opcode) |
|----------|-------------------------|
| ADD      | 1                       |
| ADDI     | 2                       |
| LW       | 3                       |
| SW       | 4                       |
| BEQ      | 5                       |
| BNE      | 6                       |
| MUL      | 7                       |

# 3.2 Defining machine-code programs

Currently, the simulator deliberately avoids the complexity of an assembly parser. Instead, the student manually codes a few assembly instructions. It is left to the open source community to pick up on this point and possibly add a more complete parser. Table 1 reports the opcodes for the seven supported instructions.

To write a program, students simply replace each mnemonic with its corresponding opcode, specify the register indices, and provide any required immediate values. The handling of branch instructions is particularly instructive: instead of using labels as in assembly, students must enter the immediate value representing the number of instructions to jump—positive for forward branches, negative for backward ones, effectively replacing the role of labels in BEQ and BNE instructions.
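As an illustration of this manual translation, the following Python sketch replaces mnemonics with the opcodes of Table 1 and turns a label into a signed instruction count. The tuple layout, the function names, and the convention that the count is relative to the instruction following the branch are assumptions made for this example; the actual FREESS input format may differ.

```python
# Opcodes taken from Table 1.
OPCODES = {"ADD": 1, "ADDI": 2, "LW": 3, "SW": 4, "BEQ": 5, "BNE": 6, "MUL": 7}


def branch_offset(branch_index, target_index):
    """Signed number of instructions to jump.

    Assumption: the count is relative to the instruction that follows
    the branch, so a branch back to itself would be -1.
    """
    return target_index - (branch_index + 1)


def encode(program, labels):
    """Translate (mnemonic, operands...) tuples into (opcode, operands...) tuples.

    A string in the last operand position of BEQ/BNE is treated as a label
    and replaced by the corresponding signed offset.
    """
    encoded = []
    for index, (mnemonic, *operands) in enumerate(program):
        if mnemonic in ("BEQ", "BNE") and isinstance(operands[-1], str):
            operands[-1] = branch_offset(index, labels[operands[-1]])
        encoded.append((OPCODES[mnemonic], *operands))
    return encoded


# Hypothetical two-instruction countdown loop; registers are plain indices.
program = [
    ("ADDI", 1, 1, -1),     # x1 <- x1 - 1
    ("BNE", 1, 0, "loop"),  # if x1 != x0, jump back to 'loop'
]
labels = {"loop": 0}
machine_code = encode(program, labels)
print(machine_code)
```

Under the stated convention, a backward branch placed at the end of a seven-instruction loop would carry the offset -7.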

```
loop: x3 <- mem(0+x4)      # load b(i)
      x7 <- mem(128+x5)    # load c(i)
      x7 <- x7 * x3        # b(i) * c(i)
      x1 <- x1 - 1         # decr. counter
      mem(256+x6) <- x7    # store a(i)
      x2 <- x2 + 8         # bump index
      P <- loop; x1!=0     # close loop
```

Figure 6: Example-1. A simple loop that multiplies two vectors element by element and stores the results in a third vector.

Assuming the code of Fig. 6 the resulting code is the following:

Since the processor can only execute machine code, a further benefit of FREESS is that it makes students realize this (or refresh the concept, if it was learned in a prerequisite course) by manually converting the program from assembly mnemonics to a specific (simple) machine code format.

Superscalar execution can be analyzed in depth with assembly programs of just a few instructions (e.g., 5 to 10 instructions) and in a loop to see the effect of the branch speculation. Therefore, it is not difficult to code the program. As shown in the next Section, some pre-built examples are provided to simplify this task.

# 3.3 Dynamic stream and status of superscalar hardware structures

By default, the simulator assumes that the program executes three iterations of a loop (a command line parameter can change the number of iterations) and that the branch is speculatively assumed to be taken (a detailed branch predictor is future work).

*3.3.1* Fetch stage. In Fig. 2, the output after the first cycle (cycle 0) is shown. The four vertical zeros under the 'F' indicate that the first four instructions are in the fetch stage. The simulator executes the next cycle every time the Enter key is pressed.

3.3.2 Decode/Rename stage. The next cycle (cycle 1) is shown in Fig. 7. The next four instructions should go to the fetch stage. However, the branch forces the fetch stage to break fetching and get only three instructions instead of four.
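The fetch behavior just described can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the last instruction of the loop is a backward branch predicted taken, a fetch group always ends at a branch, and no stall ever delays the fetch stage. The function name and data layout are hypothetical.

```python
def fetch_groups(program, fetch_width, iterations):
    """Return the list of static instruction indices fetched in each cycle.

    Assumptions: the last instruction of `program` is a backward branch that
    is predicted taken, and a fetch group always ends at a branch.
    """
    dynamic_stream = list(range(len(program))) * iterations
    groups, current = [], []
    for pc in dynamic_stream:
        current.append(pc)
        is_branch = program[pc] in ("BEQ", "BNE")
        if len(current) == fetch_width or is_branch:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


# Mnemonics of the seven-instruction loop used as the driving example.
loop = ["LW", "LW", "MUL", "ADDI", "SW", "ADDI", "BNE"]
groups = fetch_groups(loop, fetch_width=4, iterations=3)
for cycle, group in enumerate(groups):
    print(cycle, group)
```

With a fetch width of four, the first group holds instructions 0 to 3 and the second only instructions 4 to 6, because the branch closes the group.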



Figure 7: Rename stage. The renamed instruction stream uses the newly allocated physical registers (Px).

The first four instructions progress and go to the decode/rename stage. The columns 'Pi Pj Pk Pl' report the renamed stream and show how the logical registers are renamed to physical registers. The destination register is taken from the free pool (if not available, there will be a structural hazard), and the Register Map is updated accordingly, instruction after instruction, but within the same cycle. The Register Map (RM) tracks the assignment of the source logical to physical registers and is always visible at the top of the screen.
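A minimal Python sketch of this renaming step is given below. It is not the FREESS implementation: the function name and the tuple layout are hypothetical, and it assumes that a source register with no mapping yet also receives a physical register from the free pool. A real pipeline would stall on an empty free pool; the sketch simply raises an error.

```python
def rename(instructions, num_physical):
    """Rename logical registers to physical ones, in program order.

    Each instruction is (mnemonic, dest, sources) with logical names such as
    "x3"; dest is None for instructions that write no register.
    """
    free_pool = [f"P{i}" for i in range(1, num_physical + 1)]
    register_map = {}

    def allocate():
        if not free_pool:
            # Real hardware would stall the rename stage instead.
            raise RuntimeError("structural hazard: free pool is empty")
        return free_pool.pop(0)

    renamed = []
    for mnemonic, dest, sources in instructions:
        phys_sources = []
        for src in sources:
            if src not in register_map:       # assumption: map it on first use
                register_map[src] = allocate()
            phys_sources.append(register_map[src])
        phys_dest = None
        if dest is not None:
            phys_dest = allocate()
            register_map[dest] = phys_dest    # later readers see the new name
        renamed.append((mnemonic, phys_dest, phys_sources))
    return renamed, register_map


# First fetch group of the driving example.
first_group = [
    ("LW", "x3", ["x4"]),
    ("LW", "x7", ["x5"]),
    ("MUL", "x7", ["x7", "x3"]),
    ("ADDI", "x1", ["x1"]),
]
renamed, register_map = rename(first_group, num_physical=24)
for entry in renamed:
    print(entry)
```

Note how the MUL reads the physical register written by the second LW and then receives a fresh destination, so that later readers of x7 see the new name.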

Please note that the screen scrolls: the current screen presents a complete overview of the stream's evolution, while the previous screens remain available for double-checking and can be dumped into a file for further examination and study.

3.3.3 Dispatch stage. The dispatch stage receives instructions from the renaming stage. Depending on the implementation, it stores them into the structures that will hold them until they are ready to execute, i.e., the Instruction Window (IW) or Reservation Stations (RS). We call them just *IW-SLOTS*. At the same time, a ROB entry is allocated. If either IW or ROB is full, we have a structural hazard. Again, the number of times this happens is annotated in the accounting area and explained at the bottom of the screen.

The IW-SLOT is a record that contains the entry identifier (IW#), the opcode (OPCD), the destination register (Pi), the first source register (Pj), the second source register (Pk), and the immediate (I). To track the availability of the source register values, the "flags" Cj and Ck are used. Here we extend their meaning: a positive number indicates the cycle when the corresponding physical register (j or k) received its value in the IW<sup>3</sup>.

 $<sup>^3\</sup>mathrm{A}$  '-' means that the value is yet to be produced, so the instruction cannot be issued.

[Screenshot of the FREESS console at cycle 2. The first four instructions (LW, LW, MUL, ADDI) occupy IW slots 000 to 003 and the matching ROB slots 000 to 003; their renamed stream reads P2,0(P1); P4,128(P3); P5,P4,P2; P7,P6,-1. The load queue (LQ) is shown on the right. An annotation marks that the IW and ROB are populated.]

Figure 8: The Instruction Window and ROB entries are allocated in the dispatch stage.

In the example of Fig. 8, instructions 0, 1, and 3 (first LW, second LW, and ADDI) received their source values at cycle 2, i.e., when they entered the IW<sup>4</sup>.

When an IW-SLOT is allocated, a corresponding ROB-SLOT is allocated as well. This can be easily identified on the screen since the IW-SLOT and the ROB-SLOT are on the same row, one beside the other.

The corresponding ROB-SLOT is a record that contains the entry identifier (ROB#), the PC of the instruction, the destination logical register (xi), and, if present, the old physical register (oPi) associated with that register in the Register Map (RM). The latter is needed when an exception or mis-speculation occurs and the execution has to be rolled back. There are also three flags: 's' to record whether the operation is associated with a store, 'x' to signal an exception, and 'c' to indicate that the instruction has completed, i.e., it has written back its result. Fig. 8 shows that the IW and ROB are populated at cycle 2.
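The two records and the dispatch check can be sketched in Python as follows. The field names mirror the column labels described in the text, but the classes, the `ready` rule, and the `dispatch` helper are hypothetical illustrations rather than FREESS data structures; in particular, an unused source is assumed to be always ready.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IWSlot:
    iw_id: int                # IW#
    opcode: str               # OPCD
    pi: Optional[str]         # destination physical register
    pj: Optional[str]         # first source physical register
    pk: Optional[str]         # second source physical register
    imm: Optional[int]        # immediate (I)
    cj: Optional[int] = None  # cycle when Pj's value arrived; None plays the role of '-'
    ck: Optional[int] = None  # cycle when Pk's value arrived; None plays the role of '-'

    def ready(self):
        """Ready when every source in use has received its value."""
        return ((self.pj is None or self.cj is not None) and
                (self.pk is None or self.ck is not None))


@dataclass
class ROBSlot:
    rob_id: int               # ROB#
    pc: int
    xi: Optional[str]         # destination logical register
    old_pi: Optional[str]     # oPi, kept for rollback
    s: bool = False           # store
    x: bool = False           # exception
    c: bool = False           # completed (write-back done)


def dispatch(iw, rob, iw_size, rob_size, iw_slot, rob_slot):
    """Allocate both slots together; return False on a structural hazard."""
    if len(iw) >= iw_size or len(rob) >= rob_size:
        return False
    iw.append(iw_slot)
    rob.append(rob_slot)
    return True


iw, rob = [], []
lw = IWSlot(0, "LW", "P2", "P1", None, 0, cj=2)    # source P1 available at cycle 2
mul = IWSlot(2, "MUL", "P5", "P4", "P2", None)     # both sources still pending
ok_lw = dispatch(iw, rob, 16, 99, lw, ROBSlot(0, 0, "x3", None))
ok_mul = dispatch(iw, rob, 16, 99, mul, ROBSlot(2, 2, "x7", "P4"))
print(lw.ready(), mul.ready())
```

Allocating the two slots in a single step mirrors the on-screen layout, where each IW-SLOT and its ROB-SLOT share a row.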



Figure 9: The third area of the screen (the accounting area) shows the stall counters to further analyze the most critical resources. Other structures are updated.

*3.3.4* Issue stage. Once all source values of an instruction in the IW have reached their physical registers, the instruction can be issued. One or more instructions can be issued per cycle, up to the issue width (another simulation parameter). In this phase, the instructions are sent to the corresponding available Functional Units (FUs).

If the FU is occupied, we have a structural hazard, and the statistics on the screen are updated (see Fig. 9). The issued instructions are also marked with the character '>' beside the IW#. For example, in Fig. 9, the first LW and the ADDI are issued at cycle 3.
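The issue logic can be sketched as a simple selection loop. The Python code below is an illustration under assumptions, not the FREESS scheduler: it picks ready instructions oldest first, and the example assumes a single load unit, one ALU, and one multiplier. All names are hypothetical.

```python
def issue(ready_slots, free_units, issue_width):
    """Pick the instructions to issue in this cycle, oldest first.

    ready_slots: list of (iw_id, fu_type) for instructions whose sources are
                 all available, in program order.
    free_units:  dict mapping fu_type to the number of free functional units.
    Returns (issued iw_ids, number of structural stalls on busy units).
    """
    free = dict(free_units)  # work on a copy, leave the caller's dict intact
    issued, stalls = [], 0
    for iw_id, fu_type in ready_slots:
        if len(issued) == issue_width:
            break
        if free.get(fu_type, 0) > 0:
            free[fu_type] -= 1
            issued.append(iw_id)
        else:
            stalls += 1  # ready, but its functional unit is occupied
    return issued, stalls


# Two loads and one ADDI are ready; only one load unit is assumed.
ready = [(0, "L"), (1, "L"), (3, "A")]
issued, stalls = issue(ready, {"A": 1, "M": 1, "L": 1}, issue_width=4)
print(issued, stalls)
```

Under these assumptions the first LW and the ADDI are issued together, while the second LW waits for the load unit and is counted as one structural stall.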

Load and Store Queues. Loads and stores are queued in the load queue or store queue, respectively (shown in the right part of the screen), along with their PC, opcode (OP), and Effective Address (EFAD), which is calculated before queuing, as shown in Fig. 9. In the case of a load, the value read from the memory hierarchy is written in Pi. Ci indicates the cycle when the load is queued; later, it is updated with the cycle when the value is forwarded to the common data bus. In the case of a store, the value to be written may not have been produced yet, so Pl and Cl indicate, respectively, the physical register waiting for that value and the cycle when it arrives. The latter corresponds to the cycle when the store is ready to access memory in the order specified by the store queue (and in synchronization with the load queue).

3.3.5 Execute stage. During the execution stage (cycle 4 of this example), we can observe that some of the instructions have been fired, since their IW-SLOT identifier (IW#) becomes '----' (see Fig. 10). The related information remains on the screen for reference, while the corresponding ROB-SLOT is freed only once the instruction commits. Depending on the latency of the Functional Unit, the instruction continues to occupy the X stage for the corresponding number of cycles, assuming the unit is pipelined. In Fig. 1, the multiplication consists of four stages (X1, X2, X3, X4) for the sake of illustration.

For ALU operations and the EFAD calculation, the Issue (I) and Execute (X) stages often operate on the same cycle: this is our default behavior. However, as a simulation parameter, the user can also specify that I and X should always happen in different cycles.

3.3.6 Write-back (or Complete) stage. In our driving example at cycle 4, the ADDI operation completes, as can be seen, out-of-order (see Fig. 10). This is marked by c=1. However, the ROB-SLOT cannot be freed until all previous ROB-SLOTs have the c flag equal to 1, ensuring that the logical registers are updated in program order. The ROB is managed as a circular queue. In case of exceptions or mis-speculations, the operations must be undone by properly updating the Register Map and the Free Pool. Several techniques exist, but are beyond our scope here.



Figure 10: Issued instructions are annotated in the IW, and completed instructions (write-back done) are annotated in the ROB.

 $<sup>^4</sup>$ We also introduced a possible third source register Pl and the corresponding flag Cl for future extensions.

3.3.7 Commit stage. The student can follow the execution by observing the cycle in which each instruction enters each pipeline stage and the updates to the RM, FP, IW, ROB, LQ, and SQ. Once some instructions (up to the commit width, that is, at most 4 per cycle in our case) are completed and are at the head of the circular queue, they are committed.
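The in-order commit rule can be illustrated with a minimal circular-queue sketch. The class and method names below are hypothetical and the model keeps only the c flag of each ROB-SLOT; it is meant as a paper-exercise aid under these assumptions, not as a description of the FREESS source code.

```python
from typing import List


class ReorderBuffer:
    """Circular ROB that tracks only the completion (c) flag of each slot."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.completed = [False] * size
        self.head = 0    # oldest instruction in program order
        self.tail = 0    # next free slot
        self.count = 0   # number of busy slots

    def allocate(self) -> int:
        """Reserve the next slot in program order (a full ROB means a stall)."""
        if self.count == self.size:
            raise RuntimeError("ROB full: the pipeline must stall")
        slot = self.tail
        self.completed[slot] = False
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return slot

    def complete(self, slot: int) -> None:
        """Write-back done: set c=1, possibly out of order."""
        self.completed[slot] = True

    def commit(self, width: int) -> List[int]:
        """Free up to `width` completed slots, starting from the head only."""
        committed = []
        while (len(committed) < width and self.count > 0
               and self.completed[self.head]):
            committed.append(self.head)
            self.head = (self.head + 1) % self.size
            self.count -= 1
        return committed


rob = ReorderBuffer(size=4)
first, second, third = rob.allocate(), rob.allocate(), rob.allocate()
rob.complete(second)              # the younger instruction completes first
blocked = rob.commit(width=4)     # nothing commits: the head is not complete
rob.complete(first)
in_order = rob.commit(width=4)    # now both slots commit, in program order
```

Here `blocked` is empty even though one instruction has c=1, because the slot at the head of the queue has not completed yet.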

At the end of the simulation (Fig. 11), the whole evolution of the execution and the final statistics are still visible on the screen. In particular, the driving example has taken 20 cycles (CTOT) and the Instructions-Per-Cycle (IPC) value is 1.05, meaning that the pipeline has been able to achieve, on average, slightly more than one instruction per cycle, despite the data dependencies and the FU latencies.

# 4 Examples

The FREESS package has three examples; the first is described in detail in the previous section. For an easier start, the examples can be launched via a script ./run-exK.sh, where K is '1', '2', or '3'. Once students are more accustomed to the FREESS workflow, they can write their own programs (or modify the scripts) and study the effect of different superscalar architectural parameters by specifying them on the command line, as shown in Fig. 12.

```
$ ./freess -exe program-ex1 -pw 4 -wins 16 \
    -pregs 24 -robs 99 -lqs 3 -sqs 3 -llat 2 -afu 1
```

Figure 12: A command line that specifies: the name of the program (-exe), the dispatch width (-pw), the window size (-wins), the number of physical registers (-pregs), the number of entries in the reorder buffer (-robs), the number of entries in the load queue (-lqs) and store queue (-sqs), the load latency (-llat), the number of arithmetic functional units (-afu).
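When many configurations have to be compared, it can be convenient to generate such command lines from a table of parameters. The sketch below only builds the argument list; it uses the flags listed in Fig. 12 plus the `-iw` flag mentioned in Section 4.3. The Python key names and the helper `freess_command` are hypothetical and are not part of FREESS.

```python
import shlex
from typing import Dict, List, Union

# Flag names come from Fig. 12 (and -iw from Section 4.3);
# the dictionary keys are illustrative names chosen for this sketch.
FLAGS = {
    "program": "-exe",
    "dispatch_width": "-pw",
    "issue_width": "-iw",
    "window_size": "-wins",
    "physical_regs": "-pregs",
    "rob_size": "-robs",
    "load_queue_size": "-lqs",
    "store_queue_size": "-sqs",
    "load_latency": "-llat",
    "arith_units": "-afu",
}


def freess_command(config: Dict[str, Union[str, int]],
                   binary: str = "./freess") -> List[str]:
    """Build the argument list for one run, skipping unspecified parameters."""
    argv = [binary]
    for key, flag in FLAGS.items():
        if key in config:
            argv += [flag, str(config[key])]
    return argv


# The configuration of Fig. 12.
fig12 = {
    "program": "program-ex1", "dispatch_width": 4, "window_size": 16,
    "physical_regs": 24, "rob_size": 99, "load_queue_size": 3,
    "store_queue_size": 3, "load_latency": 2, "arith_units": 1,
}
command_line = shlex.join(freess_command(fig12))
print(command_line)
```

Varying a single key (for example `dispatch_width`) then yields a family of command lines that differ in exactly one architectural parameter.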

### 4.1 Example-2

The second example is a smaller loop consisting of only five instructions. Fig. 13 shows the auto-generated text of this example, and the green box highlights the major modifications of the architecture: 12 IW-SLOTS, 12 Physical Registers, 12 ROB-SLOTS.

Figure 13: Example-2. Here, a 12-slot instruction window, 12 physical registers, and 12 entries in the ROB are used.

As can be seen in the final output window (Fig. 14), in this case the superscalar processor achieves an IPC of 1.36 in 11 cycles of execution. This means that more instructions now run in parallel.

[Screenshot of the final FREESS screen for Example-2: the five-instruction loop (LW, ADDI, SW, ADDI, BNE) runs for three iterations (instructions 000 to 014), and the status line reports IC=15, CTOT=11, IPC=1.36.]

Figure 14: Example-2. Final output of the simulation of the second exercise. In this case, IPC is 1.36.

Comparing Fig. 11 and Fig. 14, we can observe that the second program generates fewer stalls due to lacking resources. In the first example (Fig. 11), we got 9 stalls due to renaming (D stage), 12 due to dispatch (P stage), 16 due to issue (I stage), and 7 due to commit (C stage). In the second example (Fig. 14), these numbers are respectively 0 (D), 0 (P), 6 (I), and 0 (C)$^5$.

### 4.2 Example-3

To confirm the conclusions of Example-2, another slightly different example is considered: Example-3. In this case, the program remains the same as in Example-2, but the superscalar width is reduced to simulate a 2-way superscalar. We omit the auto-generated text to save space, and we report the final statistics of the execution in Fig. 15. In this case, the stalls are 0 (D), 3 (P), 3 (I), and 0 (C). While the total number of stalls is the same (6) in Example-2 and Example-3, the ability to process and commit fewer instructions per cycle limits the IPC to 1.07, with 14 total executed cycles.



Figure 15: Example-3. For the same program as in Example-2, the IPC is now only 1.07 due to the limited width (2-way).
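The figures reported for the three examples can be cross-checked with a few lines of arithmetic, using IPC = instructions / total cycles. The instruction counts for Example-1 and Example-2 and all cycle and stall counts are taken from the text and figures; the count of 15 instructions for Example-3 is our inference from the statement that it runs the same program as Example-2.

```python
# Values from Fig. 11, Fig. 14, and the text; the Example-3 instruction
# count (15) is inferred from "same program as Example-2".
examples = {
    "Example-1": {"instructions": 21, "cycles": 20,
                  "stalls": {"D": 9, "P": 12, "I": 16, "C": 7}},
    "Example-2": {"instructions": 15, "cycles": 11,
                  "stalls": {"D": 0, "P": 0, "I": 6, "C": 0}},
    "Example-3": {"instructions": 15, "cycles": 14,
                  "stalls": {"D": 0, "P": 3, "I": 3, "C": 0}},
}


def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle over the whole run."""
    return instructions / cycles


summary = {
    name: {
        "ipc": round(ipc(data["instructions"], data["cycles"]), 2),
        "total_stalls": sum(data["stalls"].values()),
    }
    for name, data in examples.items()
}

for name, row in summary.items():
    print(f"{name}: IPC={row['ipc']:.2f}, stalls={row['total_stalls']}")
```

The summary makes the point of Example-3 explicit: Example-2 and Example-3 have the same total number of stalls, yet the narrower machine needs three more cycles for the same instructions.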

### 4.3 Analyzing the Stall Reasons

While the stall explanation is reported cycle by cycle in the bottom part of the screen (see Fig. 9 and Fig. 10), the same entries (all stalls) are also logged in the stall.log file for final reference (see Fig. 16). Each stall explanation reports the cycle when the stall happened, the reason for the stall, the instruction involved, and the stage where it happened. By analyzing these explanations, students gain deeper insight into whether stalls are caused by resource constraints or by particular instructions that could be optimized in the source program. By experimenting with architectural parameters, for example in Example-3, they can see how increasing the dispatch width, the issue width, and the number of ALUs (e.g., using -pw 3 -iw 3 -afu 2) eliminates dispatch-stage stalls. Comparing different configurations (such as 2-way versus 4-way superscalar) further highlights how architectural decisions directly affect performance. This exercise gives students a hands-on understanding of how hardware limitations influence stalls and hence performance.

 $<sup>^5</sup>$We are reporting only those statistics that are relevant for the comparison with the previous example.

[Screenshot of the final FREESS screen for the driving example: three iterations of the seven-instruction loop LW x3,0(x4); LW x7,128(x5); MUL x7,x7,x3; ADDI x1,x1,-1; SW x7,256(x6); ADDI x2,x2,8; BNE x1,x0,-7, with 24 physical registers, a 16-slot instruction window, a 99-slot reorder buffer, and stall counts of 9 (D), 12 (P), 16 (I), and 7 (C).]

Figure 11: The final execution screen shows a summary of what happened during the program's execution. In particular, the execution of 21 instructions achieved an IPC of 1.05 (slightly more than one instruction per cycle) in 20 cycles.

```
0003 stall due to no L-unit available
0003 stall due to NO SLOTS when trying to move instuction LW/001 from stage P to stage I.
0003 stall due to NO SLOTS when trying to move instuction MUL/002 from stage P to stage I.
0003 stall due to NO SLOTS when trying to move instuction BNE/006 from stage D to stage P.
0003 stall due to NO SLOTS when trying to move instuction ADDI/101 from stage F to stage D.
0004 stall due to NO SLOTS when trying to move instuction ADDI/101 from stage P to stage D.
0004 stall due to NO SLOTS when trying to move instuction LW/001 from stage P to stage I.
0004 stall due to NO SLOTS when trying to move instuction MUL/002 from stage P to stage I.
0004 stall due to NO SLOTS when trying to move instuction MUL/009 from stage D to stage P.
0004 stall due to NO SLOTS when trying to move instuction MUL/009 from stage D to stage P.
0004 stall due to NO SLOTS when trying to move instuction ADDI/1012 from stage F to stage D.
0004 stall due to NO SLOTS when trying to move instuction ADDI/1012 from stage F to stage D.
0005 stall due to NO SLOTS when trying to move instuction MUL/002 from stage F to stage D.
0005 stall due to NO SLOTS when trying to move instuction MUL/002 from stage P to stage I.
```

Figure 16: The log file that lists all the stall reasons that happened at a certain cycle (e.g., @003).
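Since the log is plain text, students can also summarize it with a short script. The sketch below assumes that every line has one of the two shapes visible in Fig. 16; the regular expression is our guess at the format (it tolerates an optional leading '@' and the spelling of "instuction" as printed), and the sample lines are copied from the figure.

```python
import re
from collections import Counter
from typing import Dict, Optional

# Assumed line shapes, inferred from Fig. 16 only:
#   <cycle> stall due to <reason>
#   <cycle> stall due to <reason> when trying to move <...> OP/PC from stage A to stage B.
LINE = re.compile(
    r"^@?(?P<cycle>\d+) stall due to (?P<reason>.+?)"
    r"(?: when trying to move inst\w* (?P<op>\w+)/(?P<pc>\d+)"
    r" from stage (?P<src>\w) to stage (?P<dst>\w)\.?)?$"
)


def parse_stall_line(line: str) -> Optional[Dict[str, object]]:
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    fields: Dict[str, object] = dict(match.groupdict())
    fields["cycle"] = int(match.group("cycle"))
    return fields


SAMPLE = """\
0003 stall due to no L-unit available
0003 stall due to NO SLOTS when trying to move instuction LW/001 from stage P to stage I.
0003 stall due to NO SLOTS when trying to move instuction MUL/002 from stage P to stage I.
0004 stall due to NO SLOTS when trying to move instuction LW/001 from stage P to stage I.
0004 stall due to NO SLOTS when trying to move instuction MUL/002 from stage P to stage I.
"""

records = [r for r in map(parse_stall_line, SAMPLE.splitlines()) if r is not None]
per_cycle = Counter(r["cycle"] for r in records)
per_transition = Counter((r["src"], r["dst"]) for r in records if r["src"])

print("stalls per cycle:", dict(per_cycle))
print("stalls per stage transition:", dict(per_transition))
```

Grouping by stage transition shows at a glance which stage boundary is the bottleneck, which is the same information that the STALLS row of the final screen aggregates over the whole run.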

# 5 Impact

Superscalar processors and out-of-order execution are key topics in modern computer architecture courses, forming the foundation of high-performance CPU design. These topics are commonly covered in graduate curricula using textbooks such as *Computer Architecture: A Quantitative Approach* by Hennessy and Patterson [3] and *Parallel Computer Organization and Design* by Dubois, Annavaram, and Stenström [2]. However, while basic pipelined execution is often supported by visual tools and exercises, the more advanced concepts of superscalar execution, register renaming, and instruction reordering are harder to teach and visualize.

FREESS (Free Educational Superscalar Simulator) addresses this pedagogical gap by offering a lightweight, open-source tool to support teaching Tomasulo-style, out-of-order superscalar execution. It provides a clear, cycle-by-cycle visualization of how instructions move through the pipeline, from fetch to commit, while showing internal structures such as the Instruction Window (IW), Reorder Buffer (ROB), Register Map (RM), Free Pool (FP), and Load/Store Queues (LSQs). FREESS displays the current state of each instruction and hardware structure and highlights the causes and frequency of stalls (e.g., structural, data, and control hazards), offering a practical and detailed perspective on pipeline bottlenecks.

The RISC-V community has grown rapidly, as has the demand for higher performance in RISC-V implementations for High-Performance Computing. Therefore, several RISC-V-based efforts worldwide are exploring superscalar designs to improve execution performance. By using a RISC-V-inspired instruction set, FREESS aligns with the increasing shift in academia from older ISAs like MIPS toward the open and modern RISC-V standard. The simplified encoding of instructions and the manual entry of opcodes and register indices help students understand the fundamentals of instruction encoding and control flow.

FREESS has been used in the Computer Architecture curriculum at our institution since 2010 and is distributed along with illustrative examples and configuration scripts. It enables students to engage with realistic yet manageable exercises that reflect the behavior of real-world superscalar processors. The tool supports active learning, allowing students to replicate the simulation outputs on paper for a deeper understanding. About 4 hours of lessons and 6 hours of practice are planned for teaching dynamic scheduling and superscalar concepts (plus 2 hours on branch prediction) at the University of Siena (course site: https://hpca.dii.unisi.it/). The slides used for teaching are available on request.

In our experience, students initially struggle to understand out-of-order execution, but after using FREESS to manually step through cycle-by-cycle outputs, they consistently report feeling more confident in tracing out-of-order execution. They appreciate how the textual interface matches what they do in paper-based exercises and that, in seconds, they can evaluate the effectiveness of different architectural choices in the structure of the superscalar processor.

FREESS also has limitations: it does not model every feature of current superscalar processors, but only the main structures typically addressed in an advanced Computer Architecture course. FREESS is not a production tool, so there is ample room to adapt and improve it for different needs. Because it is written in pure C and has a very limited size of about 2000 lines of code (including many comments and pretty-printing functions), it should be easy to extend.

Finally, FREESS's open-source nature encourages contributions and extensions. We envision a growing community of educators and students around FREESS, sharing new exercises, architectural variants, and features, further enhancing the tool's utility and educational reach.

# 6 Conclusions

FREESS (Free Educational Superscalar Simulator) provides an effective and accessible tool for teaching the principles of superscalar processors and out-of-order execution, which are fundamental to modern computer architecture education. By offering a cycle-accurate visualization of key hardware structures such as the Instruction Window (IW), Reorder Buffer (ROB), Register Map (RM), Free Pool (FP), and Load/Store Queues (LSQs), FREESS bridges the gap between theoretical concepts and practical understanding. Its unified, text-based interface allows students to manually trace execution steps, reinforcing their comprehension of dynamic scheduling and dependency resolution.

The simulator's minimalistic RISC-V-inspired instruction set simplifies learning while maintaining relevance to contemporary architectures. FREESS's ability to dynamically configure architectural parameters and log stall conditions enables students to explore the impact of resource limitations and pipeline hazards, fostering deeper insights into performance bottlenecks. The illustrative examples and the open-source availability further enhance its utility as an educational resource.

By abstracting away unnecessary complexity and focusing on a minimal but representative instruction set, FREESS lowers the barrier for understanding key pipeline stages while maintaining technical accuracy. Its visual and interactive approach encourages engagement, experimentation, and a deeper conceptual grasp of modern processor design. Its lightweight implementation in pure C ensures broad compatibility and ease of extension, inviting contributions from the educational community.

Avoiding the use of a GUI has a further advantage: the whole cycle-by-cycle evolution of the execution, or an interesting part of it, can simply be printed for off-line study.

Future work includes extending FREESS with additional RISC-V instructions and developing more sophisticated branch prediction strategies. These extensions will further increase its value as both a teaching aid and a platform for architectural prototyping.

Another future possibility is to convert the C code to a WebAssembly version for Web-based execution.

In summary, FREESS addresses the challenges of teaching superscalar execution by providing a hands-on, interactive, and adaptable platform. Its open-source nature and alignment with RISC-V position it as a valuable resource for educators and students, promoting active learning and experimentation in Computer Architecture.

# Acknowledgments

I want to thank the anonymous reviewers for their encouraging comments and Jonnatan Mendoza of BSC for the useful feedback. The European Commission partially supported this work under the projects: AXIOM H2020 (id. 645496), TERAFLUX (id. 249013), HiPEAC (id. 101069836) and EDGE-ME under the Next Generation EU via the Italian National Recovery and Resilience Plan M4C2-Inv.1.4, CUP J33C22001170001, (CN00000013 - partnership ICSC).

# References

- [1] I. Castilla, L. Moreno, C. González, J. Sigut, and E. González. 2007. SIMDE: An Educational Simulator of ILP Architectures with Dynamic and Static Scheduling. Comp. App. in Eng. Education 15, 4 (2007), 309–318. doi:10.1002/cae.20154
- [2] Michel Dubois, Murali Annavaram, and Per Stenström. 2012. Parallel Computer Organization and Design. Cambridge University Press, Cambridge.
- [3] John L. Hennessy and David A. Patterson. 2017. Computer Architecture, Sixth Edition: A Quantitative Approach (6th ed.). MKP Inc., San Francisco, CA, USA.
- [4] W. Hwu and Y. N. Patt. 1986. HPSm, a high performance restricted data flow architecture having minimal functionality. SIGARCH Comput. Archit. News 14, 2 (May 1986), 297–306. doi:10.1145/17356.17391
- [5] J. Jaros. 2024. Web-Based Simulator of Superscalar RISC-V Processors. In ICS'24. IEEE, Piscataway, NJ, USA, 1–6. doi:10.1109/SCW63240.2024.00209
- [6] G. Mariotti and R. Giorgi. 2022. WebRISC-V: A 32/64-bit RISC-V pipeline simulation tool. ELSEVIER SoftwareX 18 (May 2022), 1–7. doi:10.1016/j.softx.2022.101105
- [7] Yale Patt and Sanjay Patel. 2004. Introduction to Computing Systems: From Bits and Gates to C and Beyond (2 ed.). McGraw-Hill, New York, NY, USA.
- [8] Morten B. Petersen. 2021. Ripes: A Visual Computer Architecture Simulator. In ISCA-WCAE'21. IEEE, Virtual Conference, 1–8.
- [9] C. W. Smullen. 2006. PSATSim: An Interactive Graphical Superscalar Architecture Simulator for Power and Performance Analysis. In ISCA-WCAE'06. ACM, New York, NY, USA, 1–6. doi:10.1145/1275620.1275627
- [10] James E. Thornton. [n. d.]. Design of a Computer: The Control Data 6600. Scott, Foresman and Co., Glenview, IL, USA.
- [11] Robert M. Tomasulo. 1967. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development 11, 1 (1967), 25–33.
- [12] S. Wolff. 2000. SATSim: A Superscalar Architecture Trace Simulator Using Interactive Visualization. In ISCA-WCAE'00. ACM, New York, NY, USA, 1–7. doi:10.1145/1275240.1275249

# A Online Resources

The FREESS Educational Simulator for RISC-V inspired Superscalar Processors based on Tomasulo's Algorithm is available at this address: https://github.com/robgiorgi/freess

Received 19 May 2025