So first let’s bring in the CPU, the Central Processing Unit, the brains of the computer. Let’s also bring in memory, the RAM, which is where the CPU accesses the stored information it needs.
Now the CPU also has built-in memory, called the cache. The cache is considerably smaller than RAM, with sizes ranging from about 32 kilobytes to 8 megabytes. The purpose of the cache is to give the CPU the information it needs immediately.
The CPU and RAM are separate components, so when the CPU needs information, it takes time, a very small amount of time, to read the data from memory.
This time can add considerable delay to computer operation. Because the cache sits right on the CPU, it reduces this time to almost nothing. The reason you don’t need much cache storage is that it only needs to hold small pieces of important information that the CPU will need soon or has been using a lot recently.
There are various methods for determining what goes into the cache, what should be kept there, and when it should be written back to RAM. In a typical CPU there are several levels of cache, each with different read/write times and sizes, but for the sake of simplicity we’ll assume a single cache for our CPU.
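To see why the cache matters, here is a minimal sketch (assuming a 64-byte cache line and an array far too big to fit in any cache; exact timings vary by machine). Reading the same data sequentially is much faster than hopping across it, because sequential reads reuse cache lines that were just fetched:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 24;              // 16M ints (~64 MB), larger than any cache
    std::vector<int> data(N, 1);
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) sum += data[i];   // sequential: reuses each fetched line
    auto t1 = std::chrono::steady_clock::now();
    for (int s = 0; s < 16; ++s)                  // strided: a fresh cache line per access
        for (int i = s; i < N; i += 16) sum += data[i];
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](std::chrono::steady_clock::time_point a,
                 std::chrono::steady_clock::time_point b) {
        return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    printf("sequential: %lld ms, strided: %lld ms (sum=%lld)\n",
           ms(t0, t1), ms(t1, t2), sum);
    return 0;
}
```

Both loops read exactly the same N elements; only the access order differs, yet the strided version typically takes several times longer because nearly every read misses the cache.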
So now, with the basic components out of the way, let’s get into how the computer operates. When a CPU executes an instruction, there are five basic steps that need to be completed:
- Fetch: get the instruction from memory (and, in some cases, store it in the cache).
- Decode: get the appropriate operands needed for the execution of the instruction.
- Execute: compute the result of the instruction.
- Memory: for instructions that require a memory read/write operation.
- Write Back (WB): write the results of the instruction back into memory.
Nearly every instruction goes through the first three steps and the final one; only certain instructions, such as loads and stores, go through the memory step. But for the sake of simplicity, we’ll assume every instruction requires all five steps. Each step takes one clock cycle, which translates to a CPI, clock cycles per instruction, of five.
As a note: most modern processors can execute billions of clock cycles per second; for example, a 3.4 GHz processor executes 3.4 billion clock cycles per second. Now a CPI of 5 is very inefficient, meaning the resources of the CPU are wasted.
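To put rough numbers on that (a back-of-the-envelope figure assuming a single instruction stream and no stalls):

```latex
\text{instructions per second} \;=\; \frac{\text{clock rate}}{\text{CPI}} \;=\; \frac{3.4 \times 10^{9}}{5} \;=\; 680 \times 10^{6}
```

So the 3.4 GHz chip completes only 680 million instructions per second; at a CPI of 1, it would complete all 3.4 billion.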
This is why pipelining was introduced, bringing asynchronous operation into computing. Pipelining essentially makes it so each step is executed in a different clock cycle, so that one instruction completes every clock cycle: a CPI of 1.
Essentially, pipelining takes the segmented instruction and executes each step in its own clock cycle; since the segmented steps are smaller and less complex than a full instruction, you can execute steps of other instructions in the same clock cycle. For example, while one instruction is in its fetch step, you can begin decoding another, executing another, and so on, since the hardware involved in those steps is not being blocked.
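Here is a minimal sketch (illustrative, not any real instruction set) that prints which stage each instruction occupies in each cycle; each column is one clock cycle:

```cpp
#include <cstdio>

// Prints the stage occupied by each instruction in every clock cycle of an
// ideal five-stage pipeline with no stalls: instruction i enters the
// pipeline one cycle after instruction i-1.
int main() {
    const char* stages[] = {"IF", "ID", "EX", "MEM", "WB"};
    const int numStages = 5;
    const int numInstructions = 4;

    for (int i = 0; i < numInstructions; ++i) {
        printf("I%d: ", i + 1);
        for (int c = 0; c < i; ++c) printf("    ");   // idle until issue
        for (int s = 0; s < numStages; ++s) printf("%-4s", stages[s]);
        printf("\n");
    }
    return 0;
}
```

Four instructions finish in 8 cycles instead of 20; the longer the instruction stream, the closer the average gets to one instruction per cycle.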
A superscalar pipeline adds to this performance further. Think of a pipeline as a highway: a typical lane in the highway can complete one instruction per clock cycle. With superscalar processors, you add more lanes to the highway; for example, a 2-wide superscalar, also referred to as a dual-issue machine, has a theoretical CPI of 1/2, i.e., two instructions per clock cycle.
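In general, an ideal w-wide superscalar machine has:

```latex
\text{CPI}_{\min} = \frac{1}{w} \qquad\Rightarrow\qquad w = 2 \;\text{gives}\; \text{CPI}_{\min} = 0.5
```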
There are various other methods implemented to make the processor’s CPI more efficient, such as:
- Loop unrolling (see the sketch after this list)
- Very long instruction words (VLIWs), which are essentially multiple instructions wrapped into one larger instruction
- Compiler scheduling, which reorders instructions ahead of time so the pipeline stays full
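Loop unrolling is easy to see in code. A minimal sketch (the array contents and unroll factor are arbitrary): the unrolled loop computes the same sum but executes one loop-control branch per four additions instead of one per addition, giving the pipeline longer runs of useful work:

```cpp
#include <cstdio>

int main() {
    int a[1024];
    for (int i = 0; i < 1024; ++i) a[i] = 1;

    // Plain loop: one add plus one loop-control branch per element.
    int sum = 0;
    for (int i = 0; i < 1024; ++i) sum += a[i];

    // Unrolled by 4: same result, but only one branch per four adds,
    // exposing more independent work between branches.
    int sum4 = 0;
    for (int i = 0; i < 1024; i += 4) {
        sum4 += a[i];
        sum4 += a[i + 1];
        sum4 += a[i + 2];
        sum4 += a[i + 3];
    }

    printf("%d %d\n", sum, sum4);  // both print 1024
    return 0;
}
```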
There are also many issues that come along with pipelining that increase CPI, such as:
- Data hazards (see the sketch after this list)
- Memory hazards
- Structural hazards
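A data hazard in miniature (a C-level sketch; real hazards arise between machine instructions, but the dependency is the same):

```cpp
#include <cstdio>

int main() {
    int r2 = 5, r3 = 7, r5 = 2;
    // The second statement reads r1 right after the first produces it: a
    // read-after-write (RAW) dependency. In a pipeline, the dependent
    // instruction must stall, or the result must be forwarded, before its
    // execute stage can run.
    int r1 = r2 + r3;  // produces r1
    int r4 = r1 - r5;  // consumes r1 immediately
    printf("%d %d\n", r1, r4);  // 12 10
    return 0;
}
```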
Computing Parallelism
So at this point we know about the basic design of a CPU, how it communicates with memory, the stages in which it executes instructions, as well as pipelining and superscalar design.
Now, instead of imagining all of this as a single CPU, let’s take it further: all of this technology can be embedded in a single core of a multi-core processor. You take the performance of a single core and multiply it by the core count, by four in a quad-core, for example. As a side note, the cores also share a cache.
The use of superscalar pipelines as well as multiple cores is considered hardware-level parallelism.
After years of stagnation, the computer industry is now beginning to divert more focus to hardware-level parallelism by adding more cores to processors. This can be seen in consumer processors like AMD’s Threadripper line and Intel’s i9 line, with core counts ranging from 8 to 16 and 10 to 18 respectively.
While these may be the higher-end consumer processors, even the low- and mid-range i3, i5, and i7 processors are getting buffed, with core counts ranging from quad- to hexa- to octa-core.
As a side note, supercomputers are the best example of utilizing hardware parallelism. For example, Intel’s Xeon and AMD’s Epyc processors have core counts ranging from 24 to 72, and supercomputers combine tens of thousands of such processors.
Now there is one key component that is required in tandem with hardware parallelism to truly use all the resources efficiently.
Software-level parallelism: this leads us to the final topic in classical computing we’ll cover, multithreading, also referred to as hyperthreading. Instead of being implemented as hardware parallelism, this is parallelism expressed at the software level.
Think of a thread as a sequence of instructions. With single-threading, that sequence of instructions simply flows through the pipeline as normal. With multithreading, however, you can segment your application into many threads and specifically choose how you want to execute them.
Multi-threading can significantly increase computing performance by explicitly stating which CPU resources you want to utilize and when. For example, an application’s user interface (GUI) can be executed on one thread while the logic executes on another. This is just one of many instances where multi-threading can be used.
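That GUI/logic split might look like this in C++ with the standard std::thread facility (a minimal sketch; the function names and loop bodies are made up for illustration):

```cpp
#include <cstdio>
#include <thread>

// Minimal sketch: "UI" work and "logic" work run on separate threads, so
// neither has to wait for the other. Compile with -pthread on most platforms.
void uiLoop()    { for (int i = 0; i < 3; ++i) printf("drawing frame %d\n", i); }
void logicLoop() { for (int i = 0; i < 3; ++i) printf("updating state %d\n", i); }

int main() {
    std::thread ui(uiLoop);        // one thread for the interface
    std::thread logic(logicLoop);  // another for the application logic
    ui.join();                     // wait for both to finish before exiting
    logic.join();
    return 0;
}
```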
Now multi-threading can’t just be used for every application, since classical computing is not intrinsically parallel. There can be a lot of issues with concurrency, in other words, when multiple threads execute at the same time but depend on each other’s results; because of this, some applications end up being only single-threaded.
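Here is a minimal sketch of why concurrency is tricky (the counts are arbitrary): two threads update one shared counter, which is only safe here because std::atomic makes each increment indivisible:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// With a plain int this would be a data race and updates could be lost;
// std::atomic makes each increment indivisible, at the cost of the two
// threads coordinating on every update.
std::atomic<int> counter{0};

void work() {
    for (int i = 0; i < 100000; ++i) ++counter;  // safe only because it's atomic
}

int main() {
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    printf("%d\n", counter.load());  // always 200000; a plain int could print less
    return 0;
}
```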
However, many individuals and groups are working on ways to best utilize hardware parallelism through new software practices and by rewriting old software.
Some of the most computationally intensive tasks, such as video editing, rendering, and data processing, to list a few, excel at multi-threading by default. Also, as exemplified by the gaming industry, a lot of games are now moving toward multi-threaded performance.
Conclusion
So in summary, classical computing is asynchronous, not truly parallel. Instructions are still executed serially, but through the use of hardware- and software-level parallelism we maximize the utilization of the computer’s resources, making them execute extremely fast and giving the illusion of parallel operation.