13 June, 2011

Optimum Resiliency

This post is about the optimum point of resiliency, in terms of redundancy. There are three concepts in that first line; let me explain them one by one:

(1) Resiliency: Resiliency is the ability of a system to handle problems while operating normally or near-normally. It's the extent to which the system can pass through unexpected, rare but huge problems intact. The normal repair system that solves day-to-day issues is another kind of resiliency, which I am not talking about here. I am talking about surviving problems that hit all of a sudden, when there is no time for repairs. I am talking about the first-class set of problems, the huge ones, the rare ones, like earthquakes, a country breaking up into parts, tsunamis, even nuclear attacks. These kinds of problems occur very rarely, but they do occur; even nuclear bombs have already been used twice. Since these huge problems are so rare, people tend to set aside few resources for them, and when they do occur, entire systems are simply wiped out.

(2) Optimum Point: Optimum in the sense of economics. Sure, you can build very resilient systems (factories, companies, political parties, countries etc.) if you don't have to worry about economics, but in the real world financial laws must be obeyed.

(3) Redundancy: Redundancy is setting aside extra sub-systems to replace a working sub-system when it fails. As long as the working sub-system doesn't fail, the redundant sub-system sits idle. One very simple way to achieve resiliency is through redundancy. There are other ways too, like making strong systems, but today I want to talk only about the kind of resiliency that comes from redundancy.

In the natural world we see this kind of resiliency everywhere. In our bodies there are two eyes, for example, while one is enough for day-to-day operations. In a family there is a father and a mother, even though for the most part only one is enough. In water supply we have rain water, river water, ground water etc. In a government we have district, provincial and federal levels. In the armed forces we have army, rangers, police. There are multiple engines in an aircraft. There are multiple food categories for each component of food: for carbohydrates we have grains, fruits, honey, sugar, vegetables etc., and for proteins we have meat, pulses, eggs, milk etc. And there are multiple items in each food category, for example multiple types of grain such as rice, wheat etc.

One important thing that we must note about resiliency is that the redundant sub-system must be able to work readily in place of the failed sub-system without any modification. That means it should be mission-ready all the time; we can't afford any adjusting time.

Another thing to note is that it's always a sub-system that is redundant, not the entire system. That's because in the middle of an operation it's hard enough to switch on a new sub-system, and sometimes we simply can't jump to a new system altogether. For example, in the middle of a flight, if one engine fails, we can switch on another engine, but shifting to another aircraft altogether is pretty hard. In the same way, it's hard enough to dig a well for water if a canal from a river fails, and it's near-impossible to squeeze water out of stored grains or leaves.

Another thing to consider is that there are levels of redundancy even at the sub-system level. There is primary redundancy, which can be switched on right away and start working, and then there is a secondary level of redundancy that needs adjustment before it can be switched on and start working. For example, in a family, a mother can take over most functions of a father, but an elder sister needs some time before she can perform the functions of a mother. Squeezing water out of grains, leaves or milk can be considered a secondary level of redundancy. In wartime, women can be sent to fight in extreme situations, but that requires deep training and mental adjustment.

So, what is the optimum level of redundancy? Before that, we must find out the minimum acceptable level of redundancy. In my opinion, it's a factor of two for primary redundancy. We see that everywhere in nature: mother and father as the engine of a family, two eyes, rain water and canal water etc. It's like flying a two-engine aircraft; would you feel safe in that? The answer is, it depends on the environment in which you are flying. In a clear sky, at low altitude and in peacetime you may be confident, but to be a bomber pilot in a night air raid on the enemy's capital you had better have a four-engine plane. So, in my opinion, the maximum level of primary redundancy in a sub-system that you should be asking for is 4, and you should be happy at level 2 for all but extreme situations.

To understand the secondary level of redundancy, we may say that it consists of those sub-systems that cannot perform a role in normal situations but have the potential to be transformed to perform it. For example, a toy factory in Stalingrad in 1942 was not intended to work as a weapons factory, but in the heat of the German invasion in World War 2 it was quickly transformed into a tank factory and worked in that role till the end of the war. It may also be considered as the role of a deputy: a deputy has the potential to work as the officer but in normal situations should not. Only once the officer becomes unavailable (due to death, injury, disappearance or retirement) should the deputy start operating as the officer.

So, how much redundancy is optimum at the secondary level? I think it's a strict 2. Having 2 deputies is good enough. In normal situations, having one deputy is enough; it's like having 2 redundant sub-systems at the primary level. In extreme situations, having two deputies is optimum and enough; it's like having 4 redundant sub-systems at the primary level. Let's look to nature for some lessons. There are two eyes while one is enough, so there is a primary-level redundancy of 2. There are also two ears that can work somewhat like eyes, so there is a secondary-level redundancy of 1. Note that I didn't say a secondary-level redundancy of 2, though there are two ears. That's because I am counting cumulatively, in the multiplication sense: there are two ears, but I am comparing them with the two eyes, as I have already taken into consideration the primary-level redundancy of 2. Also note that the nearest sensory organ to the eye is the ear, and the next nearest is too far to be considered.

So, what did we learn today? Let me summarize. Other than the basic concepts and definitions, we learned that there are classes of redundancy: the lowest is the economy class, then there is a crisis class and then there is a luxury class. When there is no class there is no redundancy, but there can still be time-consuming repair. So, the first line of defense is normal repair. This solves the expected day-to-day problems but requires downtime and a reduction in operations while the repair takes place. An example from a natural system is an eye; let's suppose there is only one eye in a human being. From time to time, this eye requires repair, for example when it gets red from overwork, or when it gets a minor disease, or when some dust gets in it. An example from human systems is keeping a guard; let's suppose we need a guard for only 8 hours a day. Even for that time we can't expect our guard to be available every day of the year. He may get sick and need downtime to recover. We can expect at most about 75% availability. Why 75%? Well, 365 days a year minus 15 days of gazetted holidays is 350 days, which means 50 weeks; one off day per week leaves 300 days; minus 30 days of casual leave leaves 270 working days, which is about 74% of 365, so roughly 75%.
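
The guard arithmetic can be checked with a few lines of Python. This is only a rough sketch: the function name and parameter names are made up for illustration, and the day counts are simply the ones assumed in the paragraph above.

```python
def guard_availability(days_in_year=365, gazetted_holidays=15,
                       weekly_off_days=50, casual_leave=30):
    """Return (working_days, availability) for a single guard.

    All defaults are the illustrative figures from the text,
    not data about any real workplace.
    """
    working_days = (days_in_year - gazetted_holidays
                    - weekly_off_days - casual_leave)
    return working_days, working_days / days_in_year


working_days, availability = guard_availability()
print(working_days)                  # 270
print(round(availability * 100, 1))  # 74.0, which the text rounds to 75%
```

The downtime is then roughly one day in four, which is the 25% figure used for a single sub-system.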

Moving forward, we can have an extra guard or an extra eye. I am not saying that the new guard works a different shift from the first guard. I am saying that the new guard is on duty at the same time as the first guard, so we have redundancy. This is a primary-level redundancy of 2; I call it the economy class of redundancy. This works great as long as there is no crisis like wartime, a dust storm or frequent robberies. The combined downtime of the two sub-systems now reduces to perhaps 1/16 of the total time; that is, instead of the 25% off-days of a single guard, we get 25% of 25% off-days for the entire team of guards. Why 1/16? Let's work that out mathematically:

Suppose we have a system of two balls, one small (let's say a tennis ball) and one large (let's say a football). Obviously these two represent the two sub-systems, the two guards or the two eyes. Now, let's suppose each of these balls can be any of four colors. Let's take any four colors, for example RED, GREEN, BLUE, YELLOW. OK, now let's suppose that one of these colors represents downtime and the other three represent available-to-work time. I will take YELLOW as the symbol of downtime. Now let's suppose that on any given day the probability of downtime is the same as on any other day of the year.

To start the experiment, let's put a large number of balls of each type and color in an opaque bag. Let's suppose we have 1000 balls and that every type and every color of ball is equally likely to be drawn. Now let's draw one tennis ball and one football, one after another. What is the probability that both balls are YELLOW? Since the two draws involve different types of balls (tennis and football), the draws are independent of each other. When a ball can be any of 4 colors, the probability of drawing one given color is 1/4. For two independent draws, the probability that both balls are YELLOW is 1/4 x 1/4, that is 1/16. That equals a downtime of 365/16 = about 23 days per year. Large but manageable.
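
The ball experiment can also be tried out in Python. The sketch below assumes what the text assumes: each ball is equally likely to be any of the four colors, and the color of the tennis ball tells us nothing about the color of the football. The function names, the number of trials and the random seed are arbitrary choices for illustration.

```python
import random
from fractions import Fraction

COLORS = ["RED", "GREEN", "BLUE", "YELLOW"]
DOWN = "YELLOW"  # the color chosen in the text to stand for downtime


def exact_both_down():
    """Exact probability that both balls are YELLOW."""
    p_down = Fraction(1, len(COLORS))
    return p_down * p_down


def simulate_both_down(trials=200_000, seed=1):
    """Estimate the same probability by repeating the two draws many times."""
    rng = random.Random(seed)
    both_down = 0
    for _ in range(trials):
        tennis = rng.choice(COLORS)
        football = rng.choice(COLORS)
        if tennis == DOWN and football == DOWN:
            both_down += 1
    return both_down / trials


exact = exact_both_down()
estimate = simulate_both_down()
print(exact)               # 1/16
print(round(365 * exact))  # 23 days of combined downtime per year
print(estimate)            # should land close to 0.0625
```

The simulated figure will wobble a little from run to run if the seed is changed, but it should stay near 1/16.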

Note that we are calculating the effect on downtime by considering primary-level redundancy alone. We are not considering the effect of secondary-level redundancy. We should not depend on the secondary-level of redundancy in our calculations.

The next class of redundancy is the crisis class, specially designed to handle crises. It means having 4 redundant sub-systems at the primary level when only one is needed. That's why large passenger aircraft and bombers have been built with 4 engines. The downtime is reduced to 1/4 x 1/4 x 1/4 x 1/4 = 1/256, or about 0.4%. Though we are considering resiliency through redundancy, it should be noted that we don't need 4 redundant sub-systems to get the same level of downtime. We can alternatively make our sub-systems extra strong, to endure twice as much pressure as is considered normal.
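
The downtime figures for the classes so far all follow one rule: multiply the single sub-system downtime by itself once per sub-system. A small Python sketch of that rule is below. It assumes each sub-system is down 1/4 of the time and that failures are independent of each other; the function name is made up for illustration.

```python
from fractions import Fraction


def combined_downtime(units, p_down=Fraction(1, 4)):
    """Fraction of time in which ALL `units` sub-systems are down at once,
    assuming each is down `p_down` of the time, independently of the others."""
    return p_down ** units


for name, units in [("no redundancy", 1),
                    ("economy class", 2),
                    ("crisis class", 4)]:
    downtime = combined_downtime(units)
    print(f"{name}: {units} sub-system(s), downtime {downtime} "
          f"({float(downtime):.2%})")
```

The independence assumption matters: if one event (say, a flood) can knock out all the sub-systems together, the real downtime will be higher than this multiplication suggests.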

The next class of redundancy is the luxury class. It is for extreme peace of mind. It means having 16 redundant primary-level sub-systems when only one is needed. Such a level of redundancy is very, very rarely seen in man-made systems and never seen in natural systems. With such a system, the downtime is reduced to 1/256 x 1/256 x 1/256 x 1/256 = 1/65536 x 1/65536, or about 1 in 4.3 billion. Such a level of redundancy is needed when we are making a starship that has to travel for, let's say, 30 years before reaching the nearest star system beyond the Sun, Alpha Centauri, about 4.2 light years away.

So, what level of resiliency is recommended for human-made systems? I think we should consider two ways: resiliency through quality and resiliency through redundancy. I have talked enough about resiliency through redundancy, and the conclusion there is either 2 or 4: 2 at the primary level is enough, with another 2 at the secondary level. Let me make my point clearer.

All in all, there are four ways to get resiliency:

(1) Repairability: There should be an abundance of spare parts, and repair mechanisms must be automatic and in place. Simply put, the system must be able to make the parts needed for repair on the fly. At this level, the supply of spare parts should be twice the number of parts currently in active use in the system.

(2) Quality: The next level of defense is quality. Only things of the best quality should be taken. If we sort everything made into three quality categories, half is of average quality, a quarter is of the worst quality and a quarter is of the best quality. The best quality stuff is usually twice as expensive as the average quality stuff and four times as expensive as the worst quality stuff. This can easily be seen in food: the best quality stuff is tastier, larger and more colorful. Here the cost is twice what it would be if average quality stuff were taken. Consequently, the working life would be double the average. The power output would also be double; it's like running a car on petrol rather than on CNG.

(3) Redundancy: Keep 2 of every sub-system instead of one.

(4) System: Keeping an entire system redundant. This is important when, for example, we are going to war: we must never engage more than one half of our forces in offense, because we don't know what level of counter-offense the enemy will mount. In the same way, we must keep an extra working spacecraft for every spacecraft we send into space.
