A vast amount of problems asked in "Shell Programming an Scripting" can be traced back to be an application of a basic algorithm called a Control Break. Every programmer - and script writers are programmers nonetheless - should immediately recognize problems of this sort and know how to deal with them. We will first discuss the problem in theory, then implement a shell script dealing with an example data set to show the ropes.
The Single Control Break
The most basic form is the single control break. It occurs when some record-oriented data is sortable by a key and all the records with an identical key value are to be processed somehow.
Too complicated? Perhaps, but in fact it is really easy: suppose you have a file of customers (the key) and their purchase values. The goal is to get the purchase totals for every customer. You build a sum (the processing) for all entries with the same customer ID (the identical key values). Lets see:
The first thing we have to do is to identify our key - the part we want to use to differentiate between customers, their names - and the data we need to do our processing - in this case the purchase values. This is easy in our example, but can be quite tricky in real-world applications
By the way: do not confuse "key" and "key value". Key is the part of the line we sort on. Here it is the first word. Key value is the value the key part holds in each record. The key value for line 1 is "Alan", while for line 5 it is "Charlie":
The next step is really easy for shell programmers, because there is a genuine UNIX command for it: we need to sort our input data set for the key we have identified. In our case this simply means: sort, without any options.
Data after applying "sort"
Suppose we read this, line by line. Because it is sorted we can rely on all the identical key values coming right one after the other. That is, once we read a line with "Bill", there will be no "Alan"s any more. Keep this in mind while we set up a simple read-loop for the file:
Script.version1.sh:
OK, we got the reading of the input correctly, as we have seen from the sample output. So let us come back on our last idea: every time the value of the key changes, one "group" (one certain customer) is finished and a new one begins.
Every time the key value changes from one to the next line we need to output the total. Let us see if we get the line where this happens correctly - we will just mark it, do nothing else:
Script.version2.sh:
Well - almost, yes? Two things which aren't quite right.
First, the first total is requested in the first line, which is nonsense. It happens because the value of "lastcustomer" is pre-set to "" which is of course different of the first customer name we actually read in. Still, there should be no total there.
Second, the last total, for "Charlie", is missing at all. The reason is that after the last record, which would end the group of "Charlie"s, the loop is simply left. So let us fix the code to take care of these two problems:
Script.version3.sh:
Very well! Now let us implement the total: we already know, where we need to output it. We need to sum up every value read in in a sum variable. Upon the "control break" happening, when the key value changes, we have to output the sum, then reinitialize the sum variable with zero again and continue. Let's do it:
That was really easy, wasn't it? In fact, that was all - we solved the problem! But suppose we would have had to calculate the average of the purchases instead of the total for each customer. You sure know by now how this works, no?
OK, don't read any further! Instead, do it yourself and compare your solution to mine:
Very well! But you see, when the "end processing" gets more and more complicated there is more and more redundant code to be written: once inside the main loop, once after it. It is therefore a good idea - at least for anything less trivial than summation - to move the end processing of each group to a function you can call:
Perfect! This was our final transformation and i promise you will be able to solve all kinds of simple control break type problems using this blueprint and adapting it. That's all!
The Multiple Control Break
You might already have sensed it from the innocent word "single": this wasn't really all there is. Where there is "single" there is "multiple" and the same is true here. So here is the multiple control break, which is of course more complex than the single one. But don't panic! The solution is really easy to grasp for experts of the single control break - you!
Suppose that every group you have to process consists of several subgroups you have to process too. Sounds complicated again? Well, an example will clear that up. This is a list of the cars our customers purchased:
file input2
You see, we have several models as well as several manufacturers. In the end i not only want to know how many of every model we sold, but also, how the manufacturers fared. So we need a sum over all the "Dodge"s and the "VW"s, but also a sub-sum for the Dodge Charger, one for the Dodge Dart, etc..
At first we start again with identifying the key(s): now, every key consists of a main key and a sub-key. If the main key value changes, we have a "primary control break", if only the sub-key value changes we have a "secondary control break". Instead of the single-layer control break we had in our first example we have now a two-layer hierarchy. Having anything else than a single layer - instead of the 2 levels here there could be several - means executing a multiple control break.
This time the sorting process is way more complex. I suggest you consult the man page of sort if you are not absolutely sure what the following command does:
As you have already gotten that far, i am sure the following code will be obvious to you. I changed the totalling function a bit to do either a "big" total (for the manufacturers) or a "small" total (for the models). But beware: i have introduced a very subtle weakness in the program and you might want to try to find it. Spoilsports will find it below, but if you want to try your debugging expertise - be my guest.
And now, without further ado, here is the "double control break":
Have you found the weakness? You probably need a hint. OK, the problem is in this line:
Still not sure? OK, here is the last hint: replace the input file "input2" with the following "input3" and let it run again:
Solution follows:
When you look at the sorted file, you will notice that the "last" model of "ManufacturerA" has the same name as the "first" model of "ManufacturerB". Because of this the "big" control break for the manufacturer is not executed. The weakness in the line is, that it implies a "big" group change to contain a "small" group change too. This does not necessarily have to be the case. Modify the line therefore like this:
and you will see the program can even process the "input3" file.
Happy shell programming.
...on Second Thoughts, or: Instead of an Epilogue
Since i have written the first version of this article we had a few threads where its logic was applicable. Sometimes in a straightforward manner, sometimes only after adapting the basic algorithm to fit the problem at hand. I have to admit that in my striving to clarify the inner workings of this common algorithm i failed to show where it fits into a greater picture.
Some have mentioned that the control break can be (or even: is already) implemented in other languages: SQL, awk and others. This is true. I tried to explain the basic workings of this algorithm in a general-purpose language and while i recognize some (specialized) languages have such a functionality already built in little would be gained in terms of understanding if i would be to refer only to "this keyword" and be done. But now, that we have learned what exactly is done and how we do it, i want to make good for my neglect by explaining how it can be done in other languages:
The control break in SQL
In short: you do not need it, because it is part of the language. In fact it is what the "GROUP BY" clause does.
The control break in awk
In this thread i suggested to apply this article to the problem and DonCragun mentioned - rightfully - that it would be easier to do in awk. I suggest you read his very elegant solution there.
Basically it burns down to the following: construct an "associative array" (1) where the index is your key and the array element is your number you want to sum up on. Using the file from the first example this would look like:
In the main part you just load this array, in the end part you cycle through it and print the sums. Suppose the first field to be the key and the second to be the value:
The advantage of this is: you do not have to sort the input before processing it. Elements of the associative array are created as they are needed. On the downside, you process the output file in memory: until you have read the complete input you have to hold the output (the array itself) in memory, only then you can process (=output) it. This is a very small concern in the overwhelming majority of cases, but you should be aware of this.
Thanks, Acknowledgements
I should mention that - being the awk ignorant that i am - i borrowed DonCraguns awk script and my only addition was to "dumb it down" to the bare minimum. I am additionally thankful for his valuable suggestions regarding this article. I barely know SQL and if it had not been for fpmurphys alertness this wouldn't have come to me that "GROUP BY" in SQL does exactly what i wrote about. He also mentioned XSLT, but i haven't used this language in my life. I am sure that, generous as he is, he will explain to you what i do not know if you ask him.
Errors, omissions and misrepresentations are still my own fault and if you find any dubious passages, including typos: do not take them with you but report them to me. I'll be glad to weed them out.
bakunin
________________
(1) In case you do not know this term: it is an array where the indexes are not numerical but strings. So, instead of
you'd have
Last edited by bakunin; 07-20-2013 at 12:38 PM..
These 6 Users Gave Thanks to bakunin For This Post:
I have to say that I have never heard this particular issue called the Control Break Algorithm before but a quick Internet search showed that the term is becoming somewhat common among educators. In SQL, this would be handled by the GROUP BY statement, in XSLT it is called grouping and is typically handled by the Muenchian Method.
Hello.
System : opensuse leap 42.3
I have a bash script that build a text file.
I would like the last command doing :
print_cmd -o page-left=43 -o page-right=22 -o page-top=28 -o page-bottom=43 -o font=LatinModernMono12:regular:9 some_file.txt
where :
print_cmd ::= some printing... (1 Reply)
Hi,
input:
1|Bob
2|Bob
Ref|Bob
3|Rick
7|Rick
Ref|Rick
I am trying to append a "*" field to the record containing "Ref" in $1, only if the previous $1 with a common $2 have a value <4.
ouput:
1|Bob
2|Bob
Ref|Bob|*
3|Rick
7|Rick
Ref|Rick (11 Replies)
Hi,
thanks to a precedent post, and thanks to the reply of derekludwig of the forum, I have convert my first awk command as :
test.txt is : AAAAAGHIJKLAjKMEFJKLjklABCDJkLEFGHIJKL
awk -f findstring.awk test.txt > textreturn.txtfindstring.awk is :
BEGIN{ SLENGTH = 3 }
{
... (3 Replies)
How to use "mailx" command to do e-mail reading the input file containing email address, where column 1 has name and column 2 containing “To” e-mail address
and column 3 contains “cc” e-mail address to include with same email.
Sample input file, email.txt
Below is an sample code where... (2 Replies)
Hi,
I have line in input file as below:
3G_CENTRAL;INDONESIA_(M)_TELKOMSEL;SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL
My expected output for line in the file must be :
"1-Radon1-cMOC_deg"|"LDIndex"|"3G_CENTRAL|INDONESIA_(M)_TELKOMSEL"|LAST|"SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL"
Can someone... (7 Replies)
Hello.
I am a french newbie in unix shell scripting (sorry if my english speaking is wrong).
I have a file with path and filenames in it. I want to limit the number of characters on each line and break the line if necessary. But the "break" should occur after a slash caracter "/".
Example of... (9 Replies)
Hi Friends,
Can any of you explain me about the below line of code?
mn_code=`env|grep "..mn"|awk -F"=" '{print $2}'`
Im not able to understand, what exactly it is doing :confused:
Any help would be useful for me.
Lokesha (4 Replies)
Not sure if I am using the correct terminology:
I have a user setup in a menu. I want to let the user browse a file using the pg.
I want to prevent the person from doing <ctrl>c and being able to exit the script to the command line.
How do I do this? (6 Replies)