SANTA CLARA UNIVERSITY

School of Engineering

COEN 001

Fall 2000

Midterm 2 Answers

 

Part A - Vocabulary

1. (16 points)

Give brief (two or three sentence) definitions, as used in Coen 1, of four terms from the following list.

dirty bit - an indication whether a block in a cache has been written into (and hence is "dirty") and must have its contents written back to main memory before a new main memory block can be loaded; doesn't apply to "write through" cache designs.

dry etching - literally, eating away part of the surface of a wafer (usually silicon dioxide, but sometimes metal or other components of an IC) by the chemical action of a gas plasma, as opposed to a liquid acid.

interrupt - a signal from a channel or peripheral device to the CPU that a requested operation has been completed. Interrupts allow the CPU to continue meaningful work while a channel or device independently carries out a transfer.

MAR - Memory Address Register, a register (storage location) in a CPU that contains the address of the memory location that is being read from or written into.

mask - a glass plate with transparent and opaque areas that provides a pattern to be imprinted on the surface of an integrated circuit.

MHz - megahertz, or millions of cycles per second. Used to measure the rate at which a CPU clock ticks.

program counter - a register in the control unit of a CPU that holds the address of the next instruction to be fetched from memory. By default the program counter is automatically incremented at each fetch to point to the next instruction.

speculation - a method used to try to speed up execution in a pipelined CPU. Based on previous program behavior, the CPU control circuits "guess" whether a conditional branch will be taken or skipped, and fetch instructions accordingly. A correct guess keeps the pipeline full; an incorrect guess forces the pipeline to be flushed. (A small sketch of one common guessing scheme follows this list.)

superscalar - a CPU architecture in which there are two or more separate physical execution paths, which allows the CPU to fetch more than one instruction per clock tick (up to the number of separate execution paths).
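To make the "speculation" entry concrete, here is a minimal sketch of one common prediction scheme, a 2-bit saturating counter. This particular scheme is an assumption made for illustration, not necessarily the one described in class.

```python
# A 2-bit saturating counter branch predictor (hypothetical example).
# States 0-1 predict "not taken"; states 2-3 predict "taken". The counter
# moves one step toward each actual outcome and saturates at 0 and 3.

def predict(counter):
    return counter >= 2          # True means "guess taken"

def update(counter, taken):
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)

counter = 2                      # start out weakly predicting "taken"
history = [True, True, False, True, True]   # hypothetical branch outcomes
correct = 0
for taken in history:
    if predict(counter) == taken:
        correct += 1             # correct guess: the pipeline stays full
    # a wrong guess would force the pipeline to be flushed
    counter = update(counter, taken)
print(f"{correct} of {len(history)} guesses were correct")
```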

Part B - Precise Answers

Answer each question in this section.

2. (15 points)

a. Suppose you were designing a CPU for a simple computer that would hold 4 million bytes of main memory, with each byte having its own address. How many bits long should the addresses in your computer be?

The addresses need to be able to identify 4,000,000 distinct byte locations. 2^21 = 2,097,152 is too small, but 2^22 = 4,194,304 is enough, so the addresses should be 22 bits long.

b. Assume for the same computer that the blocks in your cache memory are 16 bytes long. How many bits will the offset portion of the address use?

The offset must hold a value between 0 and 15 as a binary number. Since 2^4 = 16, four bits is enough to do that.

c. Suppose (again for the same computer) that you will use a cache that holds 1024 blocks and uses a 4-way set associative mapping. How many bits will the tag portion of the address use?

A 4-way set associative cache with 1024 blocks has 1024 / 4 = 256 sets, so the set number requires 8 bits. So 22 bits in the address minus 8 bits for the set number minus 4 bits for the offset leaves 10 bits for the tag.
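All three bit counts can be checked with a few lines of arithmetic. The sketch below simply restates the reasoning above; the variable names are ours, not anything from the exam.

```python
import math

memory_bytes  = 4_000_000   # byte-addressable locations (part a)
block_bytes   = 16          # bytes per cache block (part b)
cache_blocks  = 1024        # total blocks in the cache (part c)
associativity = 4           # 4-way set associative

address_bits = math.ceil(math.log2(memory_bytes))    # 22
offset_bits  = int(math.log2(block_bytes))           # 4
sets         = cache_blocks // associativity         # 256
set_bits     = int(math.log2(sets))                  # 8
tag_bits     = address_bits - set_bits - offset_bits # 10

print(f"address {address_bits}, offset {offset_bits}, "
      f"set {set_bits}, tag {tag_bits}")
```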

3. (5 points) Choose the one best answer. The Ariane 5 rocket blew up because:

a. of failure to control the fuel to the booster rocket.

b. of fixed point overflow problems.

c. it was sabotaged by people opposed to digital technology.

d. it didn't have a backup computer.

b (the guidance software converted a 64-bit floating point value to a 16-bit signed integer, and the value was too large to fit)

4. (5 points) We talked in class about the SETI (Search for Extraterrestrial Intelligence) effort to use many personal computers (via screen savers that perform calculations when the PC is not doing any other work) to analyze radio telescope data. This is an example of what category of parallel computer?

The computers (1.6 million of them) are connected by the Internet, but at any given moment each might be executing anywhere in the screen saver program, on a set of data unique to that machine. That's the MIMD (multiple instruction, multiple data) model for parallel computers.

You might think this is a SIMD parallel computer, but in SIMD machines each CPU executes the same instruction at the same time. In this case, the CPUs are executing the same program, but they may well be executing different machine language instructions in that program.

5. (10 points) Assume that a (small) cache holds only 4 blocks of data, numbered 1 to 4. Assume that the blocks were originally loaded into the cache in the order 1, 2, 3, 4, and that the CPU has since accessed the same data held in the blocks in the following block order: 1, 2, 3, 4, 2, 3, 4, 1, 2, 4. Now the CPU wants to access data that is not resident in the cache (a miss); identify which block will be replaced if the cache uses:

a. an LRU replacement algorithm.

Block 3 is the one whose most recent access is furthest in the past (blocks 4, 1, 2, and 4 have all been accessed since), so that is the one an LRU algorithm will replace.

b. a FIFO replacement algorithm.

The first block loaded was block 1. A first in, first out algorithm will replace block 1.
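Both answers can be verified with a short simulation of the access pattern. This is just a sketch using the block numbers and access order from the problem.

```python
# Replay question 5's access pattern and find the victim block under
# each policy. Blocks were loaded in the order 1, 2, 3, 4.

load_order = [1, 2, 3, 4]
accesses   = [1, 2, 3, 4, 2, 3, 4, 1, 2, 4]

# LRU: evict the block whose most recent use is oldest.
last_use = {block: t for t, block in enumerate(accesses)}
lru_victim = min(load_order, key=lambda b: last_use[b])

# FIFO: evict whichever block was loaded first, ignoring later use.
fifo_victim = load_order[0]

print(f"LRU replaces block {lru_victim}")    # prints 3
print(f"FIFO replaces block {fifo_victim}")  # prints 1
```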

Part C - Short Answer

Give short (3 to 5 sentence) answers to 3 of the 4 questions in this section. (10 points each)

6. What is the difference between "process" testing and "functional" testing of an integrated circuit?

Functional testing determines whether the integrated circuit correctly performs the operations it was designed to do. For example, whether bits can be written into and read from a memory location, whether a CPU can add, subtract, and compare two numeric values, etc. Process testing measures the correctness of the integrated circuit fabrication process (mask alignment, electrical resistance values, etc.).

7. We said in class that a VLIW architecture makes the instruction level parallelism in a program "explicit." What does this mean, and how does it differ from non-VLIW architectures?

The VLIW instruction is actually a composite of several basic instructions, and has one position for each basic instruction that may be issued simultaneously. So if the CPU had three instruction execution paths (2 floating point and 1 integer) the VLIW instruction would contain 3 parts (2 for floating point instructions and 1 for integer instructions). By looking at the VLIW instructions of a program, you can tell exactly how many instructions will be issued at each instruction fetch, so the parallelism is explicit.

Non-VLIW programs simply have sequences of the basic instructions, and the hardware tries to select several non-interfering instructions to issue at the same time. It's much harder to tell how much parallelism there will be before the program executes, so the parallelism is implicit.
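To illustrate, here is a sketch of what a VLIW program might look like for the hypothetical 3-path machine described above. The instruction mnemonics are made up for the example; the point is that the parallelism can be read directly off each bundle.

```python
# Hypothetical bundles for a machine with two floating-point slots plus
# one integer slot. "nop" marks a slot the compiler could not fill.

vliw_program = [
    # (fp slot 1,       fp slot 2,        integer slot)
    ("fadd f1,f2,f3",  "fmul f4,f5,f6",  "add r1,r2,r3"),   # 3 ops
    ("fadd f7,f1,f4",  "nop",            "sub r4,r1,r5"),   # 2 ops
    ("nop",            "nop",            "add r6,r4,r7"),   # 1 op
]

for bundle in vliw_program:
    issued = sum(1 for op in bundle if op != "nop")
    print(f"{bundle}: issues {issued} instruction(s)")
```

A non-VLIW version of the same program would just be the six useful instructions in sequence, and the hardware would have to rediscover, at run time, which of them can issue together.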

8. The following two diagrams represent two different microprocessor chips and the amount of space on each devoted to its various major functional sections.

[Diagrams A and B: each chip is divided into an Arithmetic & Logic Unit, Registers, and a Control Unit; the two differ in how much chip area each section occupies. In diagram A the Control Unit dominates the chip; in diagram B the Control Unit is small and the Registers take up the extra space.]

a. Identify which diagram represents a typical RISC architecture, and which is a typical CISC architecture.

b. Explain why you made the choice you did. That is, what characteristic(s) of a CISC or RISC architecture allow you to decide which is which?

Design A is a CISC design (because the more complex instructions require a complex, and therefore larger, control unit design). Design B is RISC (simpler instructions allow for a smaller and less complex control unit, and the extra space on the chip can be used for more general purpose registers).

9. What tradeoffs does a computer system architect face when designing a bus? What is the effect of different choices on the cost or the performance of the computer? There are at least three tradeoffs.

The main things that can vary are the width of the data lines in the bus, whether there are separate data and address lines, and the speed (clock ticks per second) of the bus. A fast bus will have a wide data path (4 bytes, 8 bytes) so that lots of information moves with each bus transfer; this, of course, is more expensive to implement. An inexpensive bus might have a narrow data path (say, 1 byte) and use several transfers each time a word has to be fetched, making it significantly slower. A really cheap bus might "multiplex" a single set of wires, using them one time to hold the address (or parts of the address if the paths are very narrow) and then later using the same wires to transfer the data. The third tradeoff is the bus clock speed: fast busses are expensive, and less expensive busses are slower.
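As a rough illustration, the sketch below compares the time to move a 64-byte block over buses of different designs. All of the widths, clock rates, and the one-cycle address overhead are hypothetical numbers chosen only to show the shape of the tradeoff.

```python
# Rough time to move a 64-byte block over buses of different designs.
# All parameters here are made-up illustrative values.

def transfer_ns(block_bytes, width_bytes, clock_mhz, address_cycles=1):
    data_cycles = -(-block_bytes // width_bytes)   # ceiling division
    return (address_cycles + data_cycles) * 1000.0 / clock_mhz

for width, mhz, label in [(1, 33, "narrow, slow (cheap)"),
                          (4, 66, "mid-range"),
                          (8, 100, "wide, fast (expensive)")]:
    print(f"{label:24s}: {transfer_ns(64, width, mhz):7.1f} ns per block")
```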

Part D - Longer Answer

Answer 2 of 3 questions in this section.

(15 points each)

10. We said that one of the best ways to speed up the processing performed by an integrated circuit (or a set of integrated circuits) was to increase the number of transistors on each chip. Explain why this works. What keeps chip makers from fully exploiting this approach to speed (that is, what things limit the number of transistors we can put on a chip)?

Computers operate by transmitting electrical signals (currents or voltage levels). It takes time for electricity to travel through a wire (1 nanosecond per foot). Increasing the number of transistors on a chip by making the transistors smaller reduces the distance that the signals must travel, hence increases the overall speed. Higher transistor density also allows more functions to be placed on the chip (e.g., a cache memory) which further reduces the distance signals must travel (on chip vs. off chip).

The price for this is greater heat generated by the chip (since there are more transistors to create heat) and potentially lower yield from processing, since packing devices more closely together makes them more susceptible to imperfections within the crystal lattice and makes feature sizes (e.g., metal connections) smaller (more likely to be broken) and closer together (more likely to have features accidentally merged). This requires greater precision from the fabrication process. Generally, early runs of a particular design for a given process are not as precise as later ones, so yield is low at first and improves over time.

Alternatively, we could try to increase the size of the chip, to get more transistors per chip without reducing transistor size. But as the size of the chip increases, the probability that there will be a flaw in the silicon lattice making up the chip increases. The flaw means the chip won't work, so larger chip sizes tend to cause lower yield. We can counteract this effect most effectively by striving to improve the process that creates the wafer in the first place, thus reducing the total number of flaws in the ingot.
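The chip-size effect on yield can be quantified with a standard textbook approximation, the Poisson yield model (yield ≈ e^(-defect density × die area)). The model and the numbers below are illustrative assumptions, not anything from the exam.

```python
import math

# Simple Poisson yield model: yield = exp(-D * A), where D is the defect
# density (defects per cm^2) and A is the die area (cm^2). The numbers
# are hypothetical, not data for any real process.

def poisson_yield(defects_per_cm2, die_area_cm2):
    return math.exp(-defects_per_cm2 * die_area_cm2)

D = 0.5   # hypothetical defects per square centimeter
for area in [0.5, 1.0, 2.0, 4.0]:
    print(f"die area {area:3.1f} cm^2 -> yield {poisson_yield(D, area):5.1%}")
```

Doubling the die area more than doubles the fraction of dies lost, which is why larger chips are disproportionately expensive.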

11. Assume you are a design engineer for a disk drive manufacturer, and you've been asked to reduce (make faster) the access speed for an existing disk drive. Your boss has made the following suggestions for improvements. For each, explain whether it will improve access time, which component of access time it is targeting, and how significant the improvement will be.

a. put in a motor to spin the disk drive faster.

This reduces rotational latency, and will improve access time significantly, depending on how much faster the new motor spins the disk. (A short sketch after part e quantifies the effect.)

b. improve the magnetic recording material so you can get greater recording density, and have more tracks on the disk

This is directed at seek time. Having more tracks on the disk increases capacity, but since the tracks are already so close together, squeezing them a little closer has little effect on seek time. Also, although I wouldn't expect you to know this, a large part of the seek time is starting the head moving and then stopping it when it gets to the desired track. Having tracks closer together doesn't affect this aspect of seek time.

c. improve the magnetic recording material so you can get greater recording density, and have more sectors on each track

This might be expected to reduce rotational latency, but it doesn't change latency at all. It will have a small effect on when the data are actually available to be used because it takes less of the rotational period to read from a smaller sector, but we didn't talk about that as a component of access time in class.

d. add a second platter to the disk drive

This won't affect access time at all. It can, however, eliminate some accesses entirely: if data on the disk are laid out in an efficient manner (so that related data sit on the same cylinder) and a lot of data are being read, the heads over the second platter can deliver data without any additional seek.

e. move from a 3.5 inch platter to a 2 inch platter

This won't affect average access time. If the magnetic recording material is not changed, the smaller platter will have fewer tracks, so maximum seek time will be reduced somewhat, but average seek time won't change. Rotational latency won't change, either.
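As promised in part a, here is a sketch quantifying rotational latency, the component a faster motor attacks. Average rotational latency is half a revolution; the RPM values are round numbers typical of drives of the era, not specs for any particular product.

```python
# Average rotational latency is the time for half a revolution.

def avg_rotational_latency_ms(rpm):
    seconds_per_rev = 60.0 / rpm
    return 0.5 * seconds_per_rev * 1000   # half a revolution, in ms

for rpm in [5400, 7200, 10000]:
    print(f"{rpm:5d} RPM -> {avg_rotational_latency_ms(rpm):4.2f} ms average latency")
```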

12. Having a hierarchy of memory and storage devices in a computer seems to add a lot of complexity. It would be simpler to just have the CPU and main memory. Why, then, do computers have a memory hierarchy? What is the minimum number of levels in the hierarchy and what are they? Why do some computers have more levels in their hierarchy?

Memory hierarchies are an attempt to satisfy many conflicting goals at minimum cost: to more closely match CPU and memory speeds, to provide a long-term, non-volatile storage mechanism for data at reasonable costs (e.g., disk drives), and to match speeds between memory and non-volatile storage. One way to solve these problems would be to have a very large main memory that didn't lose its contents when power was shut off (flash memory), but this would be impossibly expensive. At the very least there must be long-term storage of information. Thus the minimum number of levels is two: main memory and disk storage (some early computers used tape or cards instead of disk), which solves the non-volatile storage problem. One could also argue that registers are needed and thus the minimum number of levels is three. Note that while cache is almost always available in modern computers, it is not essential.

Computers with more than the minimum number of levels are attempting to satisfy some of the other constraints. One or more levels of cache memory helps match CPU speed and main memory speed, leading to faster execution of instructions. Use of a disk cache (on the disk controller or in main memory) helps match memory speed and disk access speed, reducing the time it takes to load programs or data into main memory, thus speeding up execution. A tape drive or magneto-optical disk drive provides long-term data storage at less cost than a standard hard disk drive.
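The benefit of an extra level can be seen in the standard weighted-average access time formula, t_avg = h × t_fast + (1 - h) × t_slow, where h is the hit rate at the faster level. The latencies and hit rates below are round illustrative numbers, not measurements of any particular machine.

```python
# Effective access time with and without a cache level, using the
# standard weighted-average formula. All numbers are illustrative.

def avg_access_ns(hit_rate, t_hit_ns, t_miss_ns):
    return hit_rate * t_hit_ns + (1 - hit_rate) * t_miss_ns

t_cache, t_main = 10, 100          # hypothetical latencies in ns
print(f"no cache:            {t_main} ns")
for hit_rate in [0.80, 0.95, 0.99]:
    t = avg_access_ns(hit_rate, t_cache, t_main)
    print(f"cache, {hit_rate:.0%} hit rate: {t:5.1f} ns")
```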