May 2026 – Ir. Dr. Nurul Hazlina Noordin

BHE3233 BTS4433 – Week 11 Project Design Optimisation

Hello everyone,

After weeks of writing, debugging, and integrating Verilog code, we have finally reached the pinnacle of digital system design: Stage 4 (Comparative Optimization).

In this phase, we move beyond simply asking “Does the code work?” to asking “Which architecture is mathematically and structurally superior?”

In FPGA design, there is rarely one perfect answer. Every design choice is a delicate balancing act between Area (Logic Elements and Flip-Flops), Speed ($f_{MAX}$), and Latency.

This week, our students put competing architectures head-to-head across four different projects to see how they perform under the rigorous scrutiny of Quartus’ Static Timing Analysis (STA) and Resource Utilization reports.

Here is a breakdown of the design comparisons for each project:

Project 1: AES Cryptography (LUT vs. Logic-Based S-Box)

For the AES S-Box, students integrated two completely different structural approaches onto the DE10-Lite FPGA and used a toggle switch to compare their performance.

1. Table-Based (ROM/LUT) Approach: This method acts like a cheat sheet, pre-calculating every possible answer and storing it in memory.
  1. While extremely fast (constant time), it consumes significant memory resources and scales poorly. It is well suited to high-speed cryptographic processors.
2. Logic-Based (Boolean) Approach: This method acts like a math formula, calculating Galois Field arithmetic in real time using layers of logic gates.
  1. It uses very little memory and is highly area-efficient, making it a strong choice for low-area IoT devices, though it suffers from a longer propagation delay due to the deep logic tree.
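The two S-Box styles can be compared in a small software model. The sketch below is Python rather than the Verilog used in the lab, and the function names (`gf_mul`, `gf_inv`, `sbox_logic`, `sbox_lut`) are illustrative, not taken from the project files. It assumes the standard AES definition: inversion in GF(2^8) modulo the polynomial 0x11B, followed by the affine transform with constant 0x63.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1
        if a & 0x100:            # reduce when the degree reaches 8
            a ^= 0x11B
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); AES maps 0 to 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):         # a^254 equals a^-1 because a^255 == 1
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_logic(a):
    """'Formula' style: compute the answer every time it is needed."""
    b = gf_inv(a)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


# 'Cheat sheet' style: 256 pre-computed bytes, one memory read per lookup.
# In hardware this table would be hard-coded ROM contents; it is generated
# here only to keep the sketch short.
SBOX_TABLE = [sbox_logic(a) for a in range(256)]


def sbox_lut(a):
    return SBOX_TABLE[a]


print(hex(sbox_lut(0x00)), hex(sbox_logic(0x53)))
```

Both functions return the same byte for every input; what differs on the FPGA is where the cost lands: 256 stored bytes for the table versus a deep chain of XOR/AND gates for the formula.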

Project 2: CPU Arithmetic (Sequential vs. Pipelined 16-bit Multipliers)

For the ALU design project, students escalated their 4-bit multipliers to 16-bit and compared architectures to see how digital systems handle complex arithmetic.

1. Sequential (Shift-and-Add) Multiplier: By mimicking manual long multiplication, this architecture reuses a single adder over multiple clock cycles.
  1. It dramatically saves on Logic Elements (LEs), but the cost is high latency, making it ideal for space-constrained, battery-powered handheld devices.
2. Pipelined Multiplier: To maximize performance, students inserted registers into the combinational logic “fences” to break up the deep mathematical tree.
  1. Like an assembly line, this allows a new multiplication operation to begin every single clock cycle.
  2. It costs far more registers, but drastically increases throughput and $f_{MAX}$, which is essential for applications executing millions of operations per second.
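The latency-versus-throughput difference can be seen by counting clock cycles. This is a rough Python model, not the lab's Verilog. It assumes the sequential multiplier spends one cycle per multiplier bit and that the pipeline has 4 stages; the stage count and the function names are assumptions made for illustration.

```python
def sequential_cycles(num_ops, width=16):
    """Shift-and-add: one cycle per bit, and operations cannot overlap."""
    return num_ops * width


def run_pipeline(operand_pairs, stages=4):
    """Push operand pairs through `stages` registers, one new pair per cycle."""
    pipe = [None] * stages                   # one register per pipeline stage
    pending = list(operand_pairs)
    results, cycles = [], 0
    while pending or any(slot is not None for slot in pipe):
        new_item = pending.pop(0) if pending else None
        pipe = [new_item] + pipe[:-1]        # clock edge: every stage advances
        cycles += 1
        if pipe[-1] is not None:             # a product leaves the last stage
            a, b = pipe[-1]
            results.append(a * b)
            pipe[-1] = None
    return results, cycles


ops = [(3, 5), (7, 9), (65535, 65535)]
products, pipe_cycles = run_pipeline(ops)
print(products, pipe_cycles, sequential_cycles(len(ops)))
```

Under these assumptions the pipeline finishes K operations in `stages + K - 1` cycles, while the sequential design needs `16 * K`. The pipeline's first result is no faster than its depth allows; the gain is in how closely results follow one another.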

Project 3: DSP and Sensors (Direct vs. Transposed FIR Filters)

In digital signal processing, the physical layout of your adders and multipliers can make or break your frequency limits. Students evaluated a 4-tap FIR filter using two mathematically identical, but structurally different, forms.

1. Direct Form: The standard approach where all multiplications happen in parallel, and the results are summed up in a large “Adder Tree”. Its major flaw is a long Critical Path: the signal must traverse a multiplier and the entire chain of adders before the clock cycle ends, severely limiting the maximum frequency.
2. Transposed Form: By strategically placing delay registers between the adders, students shortened the critical path so the signal only propagates through one multiplier and one adder per cycle. While this slightly increases Total Registers (FFs), it yields a substantially higher $f_{MAX}$ (often a 25% improvement), making it the superior architecture for high-speed 100 MHz digital audio processors.
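That the two forms are mathematically identical can be checked with a short behavioral model. The sketch below is Python, not the project's Verilog; the function names and the example coefficients are made up for illustration.

```python
def fir_direct(samples, coeffs):
    """Direct form: delay line on the input, then one big sum per sample."""
    delay = [0] * len(coeffs)            # delay[k] holds x[n-k]
    out = []
    for x in samples:
        delay = [x] + delay[:-1]         # shift the input history
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out


def fir_transposed(samples, coeffs):
    """Transposed form: registers sit between the adders."""
    n = len(coeffs)
    regs = [0] * (n - 1)                 # partial sums held between adders
    out = []
    for x in samples:
        products = [c * x for c in coeffs]   # every tap sees the same x[n]
        y = products[0] + (regs[0] if n > 1 else 0)
        new_regs = []
        for k in range(n - 1):
            upstream = regs[k + 1] if k + 1 < n - 1 else 0
            # each register input is just one product plus one neighbor
            new_regs.append(products[k + 1] + upstream)
        regs = new_regs
        out.append(y)
    return out


coeffs = [1, 2, 3, 4]                    # hypothetical 4-tap coefficients
samples = [1, 0, 0, 0, 0, 7, 2, 9]
print(fir_direct(samples, coeffs))
print(fir_transposed(samples, coeffs))
```

Both calls print the same sequence. In the transposed loop, each register input depends on only one multiplication and one addition, which is the software counterpart of the shorter critical path.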

Project 4: UART Controller (Binary vs. One-Hot FSM Encoding)

In the final project, students escalated their UART transmitter to handle 16-bit data frames with Even Parity and evaluated how Quartus encodes Finite State Machines (FSMs).

1. Binary Encoding: This style uses the absolute minimum number of Flip-Flops (e.g., 2 FFs for 4 states). While it saves physical area on the silicon, it requires heavier combinational logic to decode the states.
2. One-Hot Encoding: This style assigns exactly one Flip-Flop per state (e.g., 4 FFs for 4 states).
  1. Despite consuming more physical area, the decoding logic becomes much simpler. This translates to better Setup Slack and a faster $f_{MAX}$, proving that sometimes using more hardware actually makes your system perform better.
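The flip-flop counts quoted above follow directly from how the state codes are built. Here is a small Python sketch, assuming the four state names from the Stage 3 transmitter (IDLE, START, DATA, STOP). The specific code assigned to each state is illustrative, since Quartus chooses its own assignment during synthesis.

```python
STATES = ["IDLE", "START", "DATA", "STOP"]


def binary_encoding(states):
    """Pack the state index into the fewest possible bits."""
    width = max(1, (len(states) - 1).bit_length())
    return {s: format(i, f"0{width}b") for i, s in enumerate(states)}


def one_hot_encoding(states):
    """Give every state its own bit; exactly one bit is 1 at a time."""
    n = len(states)
    return {s: format(1 << i, f"0{n}b") for i, s in enumerate(states)}


def flip_flops(encoding):
    """Number of state flip-flops equals the code width."""
    return len(next(iter(encoding.values())))


binary = binary_encoding(STATES)
one_hot = one_hot_encoding(STATES)
print(binary, flip_flops(binary))
print(one_hot, flip_flops(one_hot))
```

With binary codes, recognizing a state means comparing every state bit, while with one-hot codes it means looking at a single flip-flop. That is why the one-hot decode logic is shallower.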

The Takeaway

Stage 4 proves that mastering digital system design isn’t just about writing Verilog that compiles. A true hardware engineer knows how to interpret the Fitter and Timing Analyzer reports to intelligently trade Area for Speed based on the exact needs of the industry application!

I look forward to your creativity in executing these projects. Please complete your submissions in KALAM =)


BHE3233 BTS4433 – Week 10 Project Semi Completed Programming

Hi everyone,

After successfully navigating code comprehension and hardware debugging in Stages 1 and 2, our journey through digital system design enters its most advanced phases. This week, we focused on Stage 3: Semi-Completed Programming and Stage 4: New Programming Task (Comparative Optimization).

These stages push us beyond merely “making it work” to actually architecting complete systems and evaluating trade-offs like true digital design engineers. Here is a breakdown of our milestones and the core learning outcomes for each project.

Stage 3: Semi-Completed Programming (Architectural Completion)

In Stage 3, we took foundational components and integrated them into complete, functioning architectures.

Project 1: AES Cryptography (The Logic-Optimized S-Box)

- The Challenge: Transitioning away from a memory-heavy Look-Up Table (LUT) to a logic-optimized approach using composite-field arithmetic over the Galois Field $GF(2^8)$. We had to complete the missing Boolean equations for multiplicative inversion.

- Learning Outcome: Mastering Mathematical Hardware. We learned how complex cryptography math (like Galois Fields) is synthesized into pure Boolean logic (XOR, AND, OR gates), demonstrating how a “calculation” approach saves memory at the cost of logic depth.

Project 2: The Multiplier (Pipelining for Speed)

- The Challenge: We upgraded a combinational multiplier by inserting registers into the middle of the logic “fences” to create a Pipelined Multiplier.
- Learning Outcome: Increasing Throughput. By breaking a long combinational path into smaller stages, we learned how pipelining allows a new multiplication to start every clock cycle, drastically increasing the system’s maximum frequency ($f_{MAX}$).

Project 3: FIR Filter (Structural Integration)

- The Challenge: We integrated our verified Multiply-Accumulate (MAC) units and Delay Lines to build a complete Direct Form FIR Filter.
- Learning Outcome: Preventing Arithmetic Overflow. The crucial lesson here was calculating exact bit-widths for signal growth. We learned that multiplying two 4-bit inputs yields an 8-bit output, and accumulating three of these 8-bit partial products requires a 10-bit final output to prevent overflow.
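The bit-width arithmetic above can be checked with a few lines of Python. This is a sketch that assumes unsigned operands and sizes the accumulator for the worst case, where every partial product uses the full 8-bit range; the helper names are illustrative.

```python
def bits_needed(max_value):
    """Smallest unsigned width that can hold max_value."""
    return max(1, max_value.bit_length())


def product_width(a_bits, b_bits):
    """Width of the largest product of two unsigned operands."""
    largest = ((1 << a_bits) - 1) * ((1 << b_bits) - 1)
    return bits_needed(largest)


def accumulator_width(term_bits, num_terms):
    """Width needed to add num_terms values of term_bits each."""
    return bits_needed(num_terms * ((1 << term_bits) - 1))


print(product_width(4, 4))        # 15 * 15 = 225
print(accumulator_width(8, 3))    # 3 * 255 = 765
print(accumulator_width(8, 4))    # 4 * 255 = 1020
```

The three calls give 8, 10, and 10 bits. Under the same unsigned assumption, a fourth 8-bit product still fits in 10 bits, because 1020 is below 1024.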

Project 4: UART Controller (The Transmitter FSM)

- The Challenge: Wrapping raw serial data into a standardized UART frame by building a Finite State Machine (FSM) that transitions through IDLE, START, DATA, and STOP states based on precise “ticks” from our Baud Rate Generator.
- Learning Outcome: Protocol Synchronization. We learned how to reliably sequence hardware operations using FSMs, ensuring that communication lines are held HIGH during idle and strictly synchronized to a predetermined baud rate for data integrity.
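The transmitter's state sequence can be sketched as a loop in which each pass stands for one baud tick. This is Python rather than the lab's Verilog. It assumes the usual UART convention of sending the least significant bit first, and the default `data_bits=8` is a placeholder because the frame width for this stage is not stated here.

```python
def uart_tx_fsm(data, data_bits=8):
    """Return the line level at each baud tick for one transmitted word."""
    state, bit_index, line = "IDLE", 0, []
    while True:
        if state == "IDLE":
            line.append(1)               # line is held HIGH while idle
            state = "START"
        elif state == "START":
            line.append(0)               # start bit pulls the line LOW
            state = "DATA"
        elif state == "DATA":
            line.append((data >> bit_index) & 1)   # LSB first (assumed)
            bit_index += 1
            if bit_index == data_bits:
                state = "STOP"
        else:                            # STOP
            line.append(1)               # stop bit returns the line HIGH
            return line


print(uart_tx_fsm(0x55))
```

For 0x55 the line reads HIGH, LOW, eight alternating data bits starting with 1, then HIGH again: one idle level plus the 10-bit frame.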

Moving on

Stage 4: Comparative Optimization (New Programming Tasks)

Stage 4 is where engineering design trade-offs shine. We escalated our designs and used Quartus tools (like the Timing Analyzer and Resource utilization reports) to scientifically compare competing architectures.

Project 1: AES Cryptography (LUT vs. Logic Trade-offs)

- The Challenge: We integrated both the Stage 1 (Table-based) and Stage 3 (Logic-based) architectures onto the DE10-Lite FPGA, using a hardware switch to “toggle” between them.
- Learning Outcome: Resource vs. Speed Optimization. By running a Static Timing Analysis (STA), we learned to scientifically deduce which architecture to use depending on the application—evaluating why a LUT is better for a high-speed cryptographic processor while Boolean logic is superior for a low-area IoT device.

Project 2: The Multiplier (16-bit Escalation)

- The Challenge: We escalated our multiplier from 4-bit to 16-bit and compared all three architectures: Behavioral, Sequential, and Pipelined.
- Learning Outcome: Evaluating Complex Architectural Trade-offs. We learned how expanding bit-widths rapidly enlarges and deepens the logic tree (a 16-bit multiplier generates 256 partial-product bits versus 16 for a 4-bit one). The outcome was understanding how to choose an architecture based on strict constraints (e.g., choosing sequential for space-constrained handhelds vs. pipelined for high-performance CPU ALUs).

Project 3: FIR Filter (Direct vs. Transposed Form)

- The Challenge: We redesigned our filter into the Transposed Form, a mathematically identical structure that places delay registers between the adders rather than at the input.
- Learning Outcome: Shortening the Critical Path. We learned a major DSP optimization technique: by separating combinational adders with registers, we shortened the critical path delay. Even though this uses slightly more Logic Elements, it dramatically boosts the $f_{MAX}$, making it ideal for high-speed audio/sensor processing.

Project 4: UART Controller (FSM Encoding & Parity)

- The Challenge: We expanded the UART to 16-bit with Even Parity error checking and compared two FSM architectures: Binary Encoding versus One-Hot Encoding.
- Learning Outcome: FSM Encoding Trade-offs. We gained hands-on experience in how the Quartus compiler assigns flip-flops. We discovered that Binary Encoding saves area (fewer flip-flops) but requires heavier decoding logic, whereas One-Hot Encoding uses more flip-flops but simplifies decoding, resulting in better setup slack and a faster maximum frequency.
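Even parity itself is a one-line XOR reduction. The Python sketch below models it for a 16-bit word. The frame layout (start bit, data LSB first, parity, stop bit) follows the common UART convention and is an assumption here, as are the function names.

```python
def even_parity_bit(data, width=16):
    """Parity bit that makes the total number of 1s even."""
    parity = 0
    for i in range(width):
        parity ^= (data >> i) & 1        # XOR of all data bits
    return parity


def frame_with_parity(data, width=16):
    """Start bit, data (LSB first), parity bit, stop bit."""
    bits = [0]
    bits += [(data >> i) & 1 for i in range(width)]
    bits.append(even_parity_bit(data, width))
    bits.append(1)
    return bits


def parity_ok(frame):
    """Receiver check: data bits plus parity must hold an even count of 1s."""
    payload = frame[1:-1]                # strip start and stop bits
    return sum(payload) % 2 == 0


frame = frame_with_parity(0x0001)
print(frame, parity_ok(frame))
```

Flipping any single payload bit makes `parity_ok` return `False`. Single-bit errors are what this scheme can detect, while two flipped bits cancel out and go unnoticed.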

BHE3233 BTS4433 – Week 9 Project Workout Simulation

This week in the lab, we moved forward in our hardware design journey by executing Stage 1: Workout Programming and Stage 2: Debugging Specific Malfunctions across four highly distinct FPGA projects. These stages forced us to transition from merely reading code to actively troubleshooting real-world hardware logic errors.

Here is a breakdown of what we accomplished:

Project 1: Cryptography on Silicon (AES S-Box)

We kicked things off by diving into the Advanced Encryption Standard (AES). In Stage 1, we analyzed a Memory-based (ROM) Look-Up Table (LUT) approach for an 8-bit AES S-Box. We used ModelSim to perform functional verification, proving that the hardware correctly maps inputs to standardized AES cipher outputs (like mapping 8'h00 to 8'h63).

The real challenge came in Stage 2 with the Inverse S-Box (used for decryption). We were handed a buggy Verilog script containing three intentional errors related to syntax, logic, and standard compliance. By deciphering Quartus compilation warnings like “Output port has no driver,” we successfully repaired the code to prove that feeding a cipher output back into the Inverse S-Box restores the original “plain” input byte.

Project 2: CPU Arithmetic (The Multiplier)

Our second project focused on resource-efficient ALU design. Stage 1 introduced a Behavioral Multiplier, where we let the Quartus synthesis engine decide whether to map the multiplication logic to 9-bit DSP blocks or Logic Elements (LUTs).

Stage 2 brought us down to the sequential level with a Shift-and-Add Multiplier, which mimics manual long multiplication to save area. However, the provided design failed for maximum 4-bit inputs (like 4'hF * 4'hF). We had to debug a logical flaw in the module’s bit-counter; it was only counting up to 3 instead of 4. By correcting the count limit and adjusting the counter’s bit-width, we enabled the hardware to successfully process all 4 bits.
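The effect of the short count is easy to reproduce in software. The Python sketch below models a shift-and-add multiplier in which each loop pass stands for one clock cycle. Treating the bug as "only three of the four multiplier bits get processed" is one plausible reading of the faulty counter, not the actual lab code.

```python
def shift_add_multiply(a, b, width=4, iterations=None):
    """Shift-and-add multiplication; one loop pass models one clock cycle."""
    if iterations is None:
        iterations = width               # correct design: one pass per bit
    product = 0
    multiplicand, multiplier = a, b
    for _ in range(iterations):
        if multiplier & 1:               # current LSB decides whether to add
            product += multiplicand
        multiplicand <<= 1               # next bit carries twice the weight
        multiplier >>= 1
    return product


print(shift_add_multiply(0xF, 0xF))                 # all 4 bits processed
print(shift_add_multiply(0xF, 0xF, iterations=3))   # top bit never processed
print((4).bit_length())                             # bits needed to count to 4
```

With only three passes, the top multiplier bit is dropped and 15 × 15 comes out as 105 instead of 225. Inputs whose top bit is 0 still give the right answer, so the fault only appears near the maximum values. The last line shows why the counter width had to grow as well: holding the value 4 takes 3 bits, while a 2-bit counter stops at 3.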

Project 3: DSP and IoT Sensors (4-Tap FIR Filter)

For our digital signal processing project, we explored how to clean up sensor data using a Finite Impulse Response (FIR) filter. In Stage 1, we verified the core building block: the Multiply-Accumulate (MAC) unit. A key takeaway was learning how to prevent arithmetic overflow by understanding bit-growth (e.g., multiplying two 4-bit inputs requires an 8-bit output).

Stage 2 tackled a classic hardware pitfall: the Race Condition. In our shift register (Delay Line), which holds the crucial “historical” samples ($x[n-1]$, $x[n-2]$), the buggy code used blocking assignments (=) inside a clocked block. This caused the input to immediately leak through all registers in a single cycle. We fixed this by rewriting the block with non-blocking assignments (<=), ensuring the signal properly shifts stage-by-stage with each clock pulse.
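The difference between the two assignment styles can be mimicked in Python. This sketch is not Verilog: it assumes the buggy block listed its assignments in input-to-output order (first register, then second, then third), which is the ordering that produces the leak.

```python
def shift_blocking(regs, x_in):
    """Blocking style: each statement sees the value written just before it."""
    regs = list(regs)
    regs[0] = x_in
    for i in range(1, len(regs)):
        regs[i] = regs[i - 1]            # reads the already-updated neighbor
    return regs


def shift_nonblocking(regs, x_in):
    """Non-blocking style: all right-hand sides are sampled before any update."""
    return [x_in] + list(regs[:-1])


start = [0, 0, 0]
print(shift_blocking(start, 5))
print(shift_nonblocking(start, 5))
print(shift_nonblocking(shift_nonblocking(start, 5), 9))
```

After one clock, the blocking model holds the new sample in every register (`[5, 5, 5]`), while the non-blocking model holds it only in the first (`[5, 0, 0]`) and moves it along one stage on the next clock (`[9, 5, 0]`).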

Project 4: Communication Protocols (UART Controller)

Our final project of the week centered on hardware interfacing. We started Stage 1 by executing Parallel-to-Serial Conversion, learning how a shift register takes a 4-bit wide parallel bus and sends it out bit-by-bit over a single serial wire.

In Stage 2, things got precise. Because UART transmitters and receivers don’t share a common clock, they rely on a precise Baud Rate. We had to debug a Baud Rate Generator designed to divide a 50 MHz FPGA clock down to 9600 bits per second.

The buggy counter was missing a reset condition and had a threshold comparison error (counting from 0 up to 5208 actually takes 5209 cycles). After fixing the logic, we used ModelSim to verify that our baud_tick pulses were 5208 clock cycles apart, about 104.16 microseconds at 50 MHz, closely matching the 9600 baud target.
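The off-by-one can be reproduced with a small cycle-by-cycle model. The Python sketch below assumes a counter that starts at 0 and resets on the cycle in which it emits a tick. The two comparison thresholds are a plausible form of the buggy and corrected logic, not the actual lab code.

```python
CLOCK_HZ = 50_000_000
BAUD = 9600
DIVISOR = CLOCK_HZ // BAUD               # 5208 clock cycles per bit
CLOCK_PERIOD_NS = 1_000_000_000 // CLOCK_HZ   # 20 ns


def tick_spacing(threshold, num_cycles=20_000):
    """Simulate the counter and return the cycles between the first two ticks."""
    counter, ticks = 0, []
    for cycle in range(num_cycles):
        if counter == threshold:
            ticks.append(cycle)          # baud_tick pulses on this cycle
            counter = 0
        else:
            counter += 1
    return ticks[1] - ticks[0]


buggy = tick_spacing(DIVISOR)            # compares against 5208
fixed = tick_spacing(DIVISOR - 1)        # compares against 5207
print(buggy, fixed, fixed * CLOCK_PERIOD_NS)
```

Comparing against 5208 gives a tick every 5209 cycles, because the count from 0 to 5208 includes both ends. Comparing against 5207 gives the intended 5208 cycles, which at 20 ns per cycle is 104160 ns.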

Moving from code comprehension in Stage 1 to active logic correction in Stage 2 has completely changed how we look at Verilog and Quartus. Next week, we will move on to Stage 3: Semi-Completed Programming, where we will integrate these components into larger, more complex architectures!