Multi Processors, Parallel Programming, Threads and Threading <BR> Advantages and Disadvatages

Multi Processors, Parallel Programming, Threads and Threading
Advantages and Disadvatages

This document is mainly about: Multi Processors, Parallel Programming, Threads and Threading.
Also the following Wikipedia articles are discussed:

For a discussion about parallel programming with Visual Basic 2010 go here: Visual Basic 2010 Parallel Programming
For a detailed evaluation of Visual Basic 2010 go here: Visual Basic 2010 Evaluation

Introduction

Disc, Slow Memory, Fast Memory, Tasks, Task Manager, Registers and Arithmetic Unit.
In a standard PC all those components are at least once available.
In a PC with full dual performance, all those components should be available at least twice. That means two Arithmetic Units but also two Discs. In such a PC the performance is the same if you execute the same program once or twice simultaneous in parallel, assuming that each program runs in a different processor. This is true both when the application is highly CPU bound, meaning that it requires many arithmatic calculations and uses the Arithmetic Unit heavily, versus IO bound, meaning that the program uses the Disc heavily.
This is not necessarily the case if the Disc is not installed redundant. In that case the performance of two CPU bound programs will not change if you execute the program once or twice. On the other hand the performance of two IO bound programs will decrease. The reason is during execution both programs want to use the same resource (the disc) simultaneous, which is not possible, resulting in a loss in performance of both.

Detail
Now let us study each of the components in more detail.

The disc is the main primary storage unit which contains programs and data. Its performance is slow.
Slow Memory is secondary level of storage for programs and data. Its performance is medium. Programs that the user wants to execute are copied the disc to slow memory and when finished (modified) programs are copied from slow memory to disc.
Fast memory is a third level of storage for programs and data. Its performance is fast. Programs in fast memory are ready to be executed. The operating system takes care for control between fast and slow memory.
Tasks are the indivudual programs of a PC (in slow or fast memory). Each task is identified by what is called a task status buffer or TSB The task status buffer contains all necessary data to control and execute a program properly. In order to do that the TSB contains the start address, a continuation address and the time. Each task status buffers is link to next task status buffers.
The start address contains the location where the program begins. This is necessary in order to start the program. The continuation address is important to know when program execution is temporary interrupted. The time is important to show the next time when the program will be started.

This whole sequence of task status buffers forms what is called a chain. The same name for task is thread.
The Task Manager has three tasks:

To give control to the next task or (small) thread in the chain.
To update the information in the task status buffer when a program is delayed or when finished.
To add or remove the links between the task status buffers when tasks (small threads) are added or removed.

There are four types of registers:

Program Register or Instruction Counter. This register contains the memory adress of the Program. After each instruction executed this address is increased with one (assuming all instructions have the same length) or the Program Register gets the address as decided by a jump (goto or if) instruction.
The Task Manager initialy copies the start address from the TSB into the Program Register. The revers happens when the program is temporarily delayed.
Instruction Register. This register contains the instruction copied from fast memory at the Program Register. It is possible that in a single processor CPU the CPU has more instruction registers to improve performance.
Data Registers. Normally many. There are two types: Input data registers and output data registers.

The Arithmatic Unit is the most important part. This is the center of the CPU were the mathematical operations like: Add, Subtract, Multiply and Divide are performed. Input comes from the input registers and output is stored in the output registers.
This more or less describes all the units of a CPU.

Multi Processors and Hyper-Threading
In a single core CPU you have at least: fast memory, one set of Registers and one Aritmatic Unit. This is called one Pipe line. The maximum performance is 100%.
In a dual core CPU you have two complete separate Pipe Lines. The maximum performance of each pipeline is 50% of the full performance. The most important requirement is two separate Program Registers. When you have two separate Program Registers you can control two Programs simultaneous in order to solve two different problems. That means multi-processing. You can also use multi-programming in order to solve one problem. Multi-programming gives the possibility to divide one program into two tasks or threads and to execute both tasks or threads in parallel. Multiprogramming requires synchronisation between the tasks or threads.
In a Quad core processor CPU you have four complete separate Pipe Lines. The maximum performance of each pipe line is 25%. That means you can have 4 independent programs or 4 dependent programs solving 1 program using synchronisation.
In a "1 core"/"2 thread" CPU the "2 thread" stands for Hyper-Threading. Hyper-Threading means that only the Registers are implemented twice but not the Arithmatic Unit. This means that from the user (operating system) point of view this is a dual processor system and the first impression is that such a configuration is twice as powerfull. But this is not necessary true because there is only one Arithmatic Unit. Internally this is a single core processor.

The most benefit and improved performance you will have from such a configuration if in one hyper-thread the program is highly IO bound versus in the other hyper-thread highly CPU bound is. If first program needs the Disc the second program can immediate take over and use the Arithmatic Unit until the Disc operation of the first program is finished. That means such a usage depents highly on switching. A CPU with a lot of fast memory and at least all the registers implemented twice will outperform "in this case" a CPU which does not have those capabilities.
The least benefit you will have in such a configuration if in both hyper-threads both programs are highly CPU bound. The reason becomes easy because both programs perform mathematical calculations and require the Arithmatic Unit. That part is the bottle neck and because it is not implemented twice the overall performance will not improve.
The performance could even decrease because:

first in case you want to solve one physical application and use the CPU for 100% you have to use parallel programming. That means you have to divide your application in two independent parts. You also need extra synchronisation between both parts in order to find the solution. This synchronization requires extra CPU power which will decrease overall performance. Those independent parts become separate hyper-threads during program execution.
secondly because extra CPU hardware to support hyper-threads could even result in a slower clock speed. Extra hardware to support extra cores can als have the same effect.

In the above text the word Hyper-Threading is used, in order to describe an architecture were from the user point of view the CPU looks to have double the number of processors i.e. two arithmatic units but physically this is not the case. A much better name is Virtual Processors.
On a CPU i5 which is a 2 Core/4 Thread when I load the program PlanetS which is written in Visual Basic 2010 and which does not use the BackGroundworker the number of threads used by the task manager is 5. This is inaccordance with the number of event-handlers. This seams the case for all Visual Basic 2010 programs loaded independent of the number BackGroundworkers used. The number of processes that each program represent is 1.
On that same CPU when I end program PlanetPP (Also Visual Basic 2010) instantaneous which 5 BackGroundworkers active the total number of threads decreases with 10.
This tells me that a thread (in the context of the task manager) is much more a task (programme) and that the task of the task manager is to assign (or to remove) a task to one of the (in the case of an i5) 4 Virtual Processors. That means an i5 has 2 Processors and 4 Virtual Processors.

Core 1 Virt Proc 1 Virt Proc 2 ---------- ----------- Fast Pr Mem 1 Fast Pr Mem 2 | | P Reg 1 P Reg 2 | | Ins Reg 1 Ins reg 2 | | ------------------ | I1 I2 I3 | O1 O2 O3 | Artihmatic Unit 1 Add 1 Sub 1 Mul 1 Div 1 Move 1

Core 2 Virt Proc 3 Virt Proc 4 ----------- ----------- Fast Pr Mem 3 Fast Pr Mem 4 | | P Reg 3 P Reg 4 | | Ins reg 3 Ins Reg 4 | | ------------------ | I1 I2 I3 | O1 O2 O3 | Artihmatic Unit 2 Add 2 Sub 2 Mul 2 Div 2 Move 2

*********************** * Common Memory * * * * M1 M2 M3 M4 * *********************** ^ | | | <-------------

Figure 1: 2 Core / 4 Thread CPU

Figure 1 Shows a 2 Core / 4 Thread. That means:

there a 4 Virtual Processors. Each Virtual Processor contains one of:
Fast Program Memory, Program Register and Instruction Register
The Program Register is a pointer to the Fast Program Memory. The Instruction Register contains the Instruction at that Location.
There are 2 Arithmatic Units. Each Arithmatic Unit contains one of:
Logical Input Registers, Logical Output Registers, Adder, Subtracter, Multiplier, Divider and Move
In the case of a 2 Core / 2 Thread there are no Virtual Processors 2 and 4.
There is one Common Memory. Both Arithmatic Units can directly adress this Common Memory by means of a move operation (or something equivalent) With the move operation they can read and write data from Common Memory to and from their input and output registers.
In the case of parallel programming you reserve for each program a small area of this Common Memory to store data, identified as M1, M2, M3 and M4. One usage of this data is as a Command. In the programs PlanetPP and FibonacciPP this is implemented as the array STATE(i). For parallel programming to work properly with Visual Basic:

You have to subdivide your application into independent branches. Each branch is implemented as a Backgroundworker. Each Backgroundworker operates as a different thread in a different Virtual Processor.
One Backgroundworker services as the master. The other Backgroundworkers are slaves.
The master sets all Commands to a specific value inorder to inform the slaves what to do. Each slave investigate its own Command inorder to know what to do using either M2, M3 or M4. For example Backgroundworker 2 uses STATE(2)
When finished each slave clears its own Command.
Finally for the master this is an indication that one iteration is finished and that the next one can start.
By following this strategy there is no chance of race conditions.: The start address contains the location where the program begins. This is necessary in order to start the program. The continuation address is important to know when program execution is temporary interrupted. The time is important to show the next time when the program will be started.
This whole sequence of task status buffers forms what is called a chain. The same name for task is thread.
The Task Manager has three tasks:

To give control to the next task or (small) thread in the chain.
To update the information in the task status buffer when a program is delayed or when finished.
To add or remove the links between the task status buffers when tasks (small threads) are added or removed.

There are four types of registers:

Program Register or Instruction Counter. This register contains the memory adress of the Program. After each instruction executed this address is increased with one (assuming all instructions have the same length) or the Program Register gets the address as decided by a jump (goto or if) instruction.
The Task Manager initialy copies the start address from the TSB into the Program Register. The revers happens when the program is temporarily delayed.
Instruction Register. This register contains the instruction copied from fast memory at the Program Register. It is possible that in a single processor CPU the CPU has more instruction registers to improve performance.
Data Registers. Normally many. There are two types: Input data registers and output data registers.

The Arithmatic Unit is the most important part. This is the center of the CPU were the mathematical operations like: Add, Subtract, Multiply and Divide are performed. Input comes from the input registers and output is stored in the output registers.: Fast Program Memory, Program Register and Instruction Register
The Program Register is a pointer to the Fast Program Memory. The Instruction Register contains the Instruction at that Location. There are 2 Arithmatic Units. Each Arithmatic Unit contains one of: Logical Input Registers, Logical Output Registers, Adder, Subtracter, Multiplier, Divider and Move: Logical Input Registers, Logical Output Registers, Adder, Subtracter, Multiplier, Divider and Move

Wikipedia

Threads will often need to update some variable that is shared between them.
Why do you want to do that? One example could be that both are updating the same counter. That means you have two branches and in each branch you want to perform the statement: COUNT = COUNT + 1. The problem is here that you do want to perform both programs simultaneous except this statement. Next is written:: This is known as a race condition.
The race condition arises because the macro statement COUNT = COUNT + 1 in reality are three micro statements. (One get, One add and one put) Next is written: The programmer must use a lock to provide mutual exclusion. A lock is a programming language construct that allows one thread to take control of a variable and prevent other threads from reading or writing it, until that variable is unlocked.
That is true but the text does not explain how this is done in detail. How do you communicate between processors ? In the next line is explained that program A performs the statement "1A Lock variable V" and program B the statement: "1B Lock variable V" How do you know that both programs don't try to execute the same statements at the same time ? You don't know or it is not obvious that this does not happen. Next is written:: One thread will successfully lock variable V (1), while the other thread will be locked out—unable to proceed until V is unlocked again. This guarantees correct execution of the program.(2)
(1) Of course that is what you hope, but this is not that simple. Any sucessfull strategy requires that the two programs are not identical. One strategy is different priorities. A different strategy is a master slave concept which is used in the VB 2010 document. (2) is correct if (1) is correct, but how do you guarantee (1) ? IMO the whole concept of locking variables is not very practical and should not be used. Next is written:: Many parallel programs require that their subtasks act in synchrony. This requires the use of a barrier. Barriers are typically implemented using a software lock. One class of algorithms, known as lock-free and wait-free algorithms, altogether avoids the use of locks and barriers. However, this approach is generally difficult to implement and requires correctly designed data structures.
Synchronisation is a very important issue if you want to use parellel programming in order to solve one application. When you follow the approach outlined in the VB 2010 document you will see that this is not difficult to implement using Visual Basic 2010 and the BackGroundworker
2. Wikipedia Hyper-threading
In the first paragragh is written:: For each processor core that is physically present, the operating system addresses two virtual processors, and shares the workload between them when possible.
That is correct. In a 1 Core/2 Thread CPU the "2 Threads" mean two virtual processors. Next is written:: Intel recommends disabling HTT when using operating systems that have not been optimized for this chip feature. [1]
It is interesting that Intel mentions this. I want that. But if I ask to Microsoft how to do that there is no response. See: What is the reason that a pentium 4 3.0 behaves as a 1Core/2Thread In the paragragh called: 3 Performance is written:: The advantages of hyper-threading are listed as: (1) improved support for multi-threaded code, (2) allowing multiple threads to run simultaneously, (3) improved reaction and response time.
(1) This is only an advantage if the use of multi-threaded code has an advantage. What exactly is multi-threaded code ? (2) This is only an advantage if "multiple threads to run simultaneous" have an advantage. That this is allowed is not directly an advantage. (3) Reaction time is only an issue if your application requires a lot of man machine interaction. Response time has to do with duration and that is most often a function of the clock speed. Next is written.: Intel claims up to a 30% performance improvement compared with an otherwise identical, non-simultaneous multithreading Pentium 4.
My test do not support this claim: CPU performance P4 2.8 versus P4 3.0 Next:: Tomshardware.com states "In Some Cases a P4 running at 3.0 GHz with HT on can even beat a P4 running at 3.6 GHz without HT turned on".
Same comment. This is maybe only true if the duration of the test is less than 1 minute. Next:: The performance improvement seen is very application-dependent, however when running two programs that require full attention of the processor it can actually seem like one or both of the programs slows down slightly when Hyper Threading Technology is turned on.
I do not call a 50% decrease in performance slightly.
3. Wikipedia Simultaneous multithreading In the first paragraph is written:: SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.
That is maybe true, but this is no guarantee for better overall performance, compared with a single core / one thread CPU with the same clock speed. In the paragraph: 2.5 Disadvantage is written:: Simultaneous multithreading cannot improve performance if any of the shared resources are limiting bottlenecks for the performance. In fact, some applications run slower when simultaneous multithreading is enabled.
This is typical true in a 1C/2T versus a 1C/1T, but the text does not explain that. See also reflection.
4. Wikipedia Multithreading (computer architecture) This document contains a paragraph: 1.1 Advantages The question is if those advantages are applicable related to Visual Basic 2010. I doubt this. It should be remembered that you should compare a 1C/2T with a 1C/1T or a 2C/4T with a 2C/2T and explain the advantages strictly based on an increase in Virtual Processors (threads) not on Physical Processors (cores). In the paragraph: 1.2 Disadvantages is written:: Hardware support for multithreading is more visible to software, thus requiring more changes to both application programs and operating systems than Multiprocessing.
This gives the wrong impression particular because of the comparison with Multiprocessing. Both require in order to solve one physical application parallel programming and synchronization. The claim is only true in the case of a single core. In the next sentence we read:: Intel claims etc while a program performing a loop of non-optimized dependent floating-point operations actually gains a 100 percent speed improvement when run in parallel.
I doubt that. The question is also here do they compare a 1C/1T with a 1C/2T. I have done a similar test with the program VStest1. See CPU Performance and CPU Performance Pentium 4. What those tests show on a i5, which is an 2C/4T, that the total performance increase with a factor 2 strictly comes because this is a 2 Core and not because this is a 4 Thread.
5. Wikipedia Multi-core processor In the paragragh called: 2 Development is written:: Many applications are better suited to thread level parallelism (TLP) methods,
I doubt if that is true, specific in the sense of Virtual Processors. What is true that certain applications could benefit from multi core processors specific if they do not require parallel programming, that means if they can be divided into two parts which are independent. In the paragraph: 2.3 Advantage is written:: The largest boost in performance will likely be noticed in improved response-time while running CPU-intensive processes, like antivirus scans, ripping/burning media (requiring file conversion), or file searching.
This is not true. Antivirus scans and file searching are both applications which are highly IO bound. If you excute them in parallel each will heavily influence the performance of the other because they have to wait. The most benefit could come from applications which use the Arithmatic Unit for 100%.
In the paragraph: 2.4 Disadvantage is written:: Also, the ability of multi-core processors to increase application performance depends on the use of multiple threads within applications.
In stead of threads they should have written: independent branches

If you want to give a comment you can use the following form Comment form Created: 22 Februari 2011 Back to my home page Contents of This Document

Multi Processors, Parallel Programming, Threads and Threading
Advantages and Disadvatages

Introduction

Detail

Multi Processors and Hyper-Threading

Wikipedia

1. Wikipedia Parallel computing

2. Wikipedia Hyper-threading

3. Wikipedia Simultaneous multithreading

4. Wikipedia Multithreading (computer architecture)

5. Wikipedia Multi-core processor

Reflection

Multi Processors, Parallel Programming, Threads and Threading Advantages and Disadvatages

Introduction

Detail

Multi Processors and Hyper-Threading

Wikipedia

1. Wikipedia Parallel computing

2. Wikipedia Hyper-threading

3. Wikipedia Simultaneous multithreading

4. Wikipedia Multithreading (computer architecture)

5. Wikipedia Multi-core processor

Reflection

Multi Processors, Parallel Programming, Threads and Threading
Advantages and Disadvatages