

Filter on one column and then perform conditional calculations on another column with a Linux script


 
# 1  
Old 03-25-2015

Hi,
I have a file (stats.txt) with columns like those in the example below: destination IP address, timestamp, TCP packet sequence number and packet length.

Code:
destIP   time  seqNo  packetLength
1.2.3.4  0.01   123       500
1.2.3.5  0.03    44       1500
1.3.2.5  0.08    44       1500
1.2.3.4  0.44   123       500
1.2.3.4  0.48   123       500
1.2.3.4  0.52   124       800
1.2.3.4  0.72   124       800
1.2.3.5  0.83    45       80
...

I'm trying to come up with a way to derive some statistics from this file. Ideally, my Linux script would take stats.txt as input (it could contain tens of thousands of rows) and report the following per destination address (address 1.2.3.4 above is used to illustrate):

- For destination IP 1.2.3.4, there have been two retransmissions for sequence number 123 and one retransmission for sequence number 124. This means three packet errors in total.
- The time between the first and last packet with the same sequence number is 0.48-0.01=0.47 seconds and 0.72-0.52=0.20 seconds respectively.
- The number of successful packets to 1.2.3.4 is two (sequence numbers 123 and 124, assuming that 124 is OK since it's not retransmitted).
- The total number of successfully transmitted bytes to 1.2.3.4 is 500+800=1300 B.

And of course the same kind of stats for any other IP address.

My current approach is to first sort the file like this:

Code:
sort -u -k1,1 -k3,3 -k2,2 stats.txt > statsSorted.txt

Then I get this:
Code:
1.2.3.4  0.01   123       500
1.2.3.4  0.44   123       500
1.2.3.4  0.48   123       500
1.2.3.4  0.52   124       800
1.2.3.4  0.72   124       800
1.2.3.5  0.03    44       1500
1.3.2.5  0.08    44       1500
1.2.3.5  0.83    45       80
...

The next step is to use awk to extract the stats. I used the approach below to get started, but I get syntax errors on pretty much everything. It probably looks quite bad with the nested loops as well. Could someone give some advice on how to improve the syntax, or hints on how to make it work?

Code:
awk '
{	# Do-while criteria: as long as the IP address is the same
	do
		address[$1] = $1
		# Loop as long as sequence number is the same
		do
		
			# Is this the first time we see this sequence number?
			if (!($3 in c))
				# Set temporary min and max time and set retransmission counter to zero.
				tempMin=tempMax=$2
				retransmissions=0
			# If not the first time this sequence number occurs, increment retransmission and add time
			else
3			tempMax=$2
				retransmissions6+
		while ($3 in c)	
		averageTime[$1]=tempMax-tempMin
		retransmissions[$1]=retransmissions
	
	while ($1 in c)
END {	
	for(i in c)
		printf("%-17s %3d %5.1f \n", address[i], averageTime[i], retransmissions[i])
}' statsSorted.txt

Any hints welcome, even on how to form the basic syntax. Then I can try to pull it together myself.

Thanks!
/Z

Last edited by Zooma; 03-25-2015 at 11:15 PM.. Reason: Fixed one typo in the last code section.
# 2  
Old 03-26-2015
A few comments on your code:
  1. There is no do ... while loop in awk.
  2. I have no idea what you are trying to accomplish with the statement 3 tempMax=$2.
  3. You can't have an array and a scalar variable with the same name: retransmissions[$1]=retransmissions.
  4. If you have multiple statements to be processed in a loop, in an if, or in an else, you need to use braces ({ and }) to group those statements.
  5. The expression ($3 in c) is meaningless when you haven't created any elements in an array named c[].
  6. You don't calculate an average of n items by subtracting the lowest value from the highest value.
  7. Using sort -u deletes duplicate entries. Deleting duplicate entries makes it impossible to calculate an average of all values for any given IP address, or for an IP address and sequence # pair.
  8. A for(i in c) loop produces output in random (not necessarily sorted) order.
You didn't show what output you hope to produce from your sample input.
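
A minimal sketch of points 3, 4, 5 and 8 above, not a full solution. It assumes whitespace-separated input with a header line whose first field is `destIP`. The function name `per_seq_stats` and the array names `first`, `last` and `retrans` are made up for illustration. The input is not sorted and no `sort -u` is used, so repeated rows are still counted; only the output is sorted, because `for (key in first)` visits the keys in an unspecified order.

```shell
#!/bin/sh
# Hypothetical helper: reads the stats format on stdin and prints
# "IP seqNo firstTime lastTime retransmissions" for every (IP, seqNo) pair.
per_seq_stats() {
    awk '
    $1 == "destIP" || NF < 4 { next }      # skip the header and short lines such as "..."
    {
        key = $1 " " $3
        t = $2 + 0                         # force a numeric value
        if (!(key in first)) {             # "in" is only useful on an array the script fills
            first[key] = t                 # several statements in one branch need braces
            last[key] = t
            retrans[key] = 0               # array name differs from every scalar name
        } else {
            if (t < first[key]) first[key] = t
            if (t > last[key]) last[key] = t
            retrans[key]++
        }
    }
    END {
        for (key in first)
            printf("%s %.2f %.2f %d\n", key, first[key], last[key], retrans[key])
    }' | LC_ALL=C sort -k1,1 -k2,2n
}

# Usage with the sample rows from post #1.
per_seq_stats <<'EOF'
destIP   time  seqNo  packetLength
1.2.3.4  0.01   123       500
1.2.3.5  0.03    44       1500
1.3.2.5  0.08    44       1500
1.2.3.4  0.44   123       500
1.2.3.4  0.48   123       500
1.2.3.4  0.52   124       800
1.2.3.4  0.72   124       800
1.2.3.5  0.83    45       80
EOF
```

For the sample rows this should report two retransmissions for 1.2.3.4/123 (0.01 to 0.48) and one for 1.2.3.4/124 (0.52 to 0.72).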

You talked about reporting the number of bytes transmitted, but there is nothing in your code that seems to try to capture or print that data. (And, the following script doesn't either.)

You seem to be trying to print the average time as a decimal integer and the number of retransmissions as a floating point value printed with one decimal place. (Neither of these makes any sense to me.)

So, making lots of wild guesses (ignoring the output your script seemed to be trying to produce), the following might help as a starting point for a script that will do what you want:
Code:
#!/bin/ksh
sort -k1,1 -k3,3n stats.txt | awk '
BEGIN {	printf("%17s %7s %s %s\n",
		"destIP", "seqNO", "AverageTime", "retransmissions")
	printf("----------------- ------- ----------- ---------------\n")
}
$1 != lIP || $3 != lSeqNo {
	if(NR != 1)
		printf("%17s %7d %11.3f %15d\n",
			lIP, lSeqNo, tTime / cnt, cnt - 1)
	if($1 == "destIP")
		exit
	lIP = $1
	tTime = $2
	lSeqNo = $3
	cnt = 1
	if(debug) printf("input: %s\nlIP=%s, lSeqNo=%d, tTime=%f, cnt=%d\n",
		$0, lIP, lSeqNo, tTime, cnt)
	next
}
{	tTime += $2
	cnt++
	if(debug)printf("input: %s\nlIP=%s, lSeqNo=%d, tTime=%f, cnt=%d\n",
		$0, lIP, lSeqNo, tTime, cnt)
}'

which, with the sample input you provided, produces the output:
Code:
           destIP   seqNO AverageTime retransmissions
----------------- ------- ----------- ---------------
          1.2.3.4     123       0.310               2
          1.2.3.4     124       0.620               1
          1.2.3.5      44       0.030               0
          1.2.3.5      45       0.830               0
          1.3.2.5      44       0.080               0

# 3  
Old 03-26-2015
Quote:
Originally Posted by Don Cragun
A few comments on your code:
  1. There is no do ... while loop in awk.
[..]
Actually there is:
Code:
awk 'BEGIN{ do print "Hello" ++i; while(i<10) }'

Although it cannot be used like the OP uses it..
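
To illustrate the difference: awk already loops over the input records by itself, and a `do ... while` inside an action only sees the record that is current at that moment. A small sketch of a use that does work, looping over the fields of one record (the function name `show_fields` is made up for this example):

```shell
#!/bin/sh
# Hypothetical helper: prints every field of each input record, one per line.
show_fields() {
    awk '{
        i = 1
        do {                               # body runs at least once, then while the test holds
            printf("field %d = %s\n", i, $i)
            i++
        } while (i <= NF)
    }'
}

# Usage: one record with four fields gives four output lines.
printf '1.2.3.4 0.01 123 500\n' | show_fields
```

The loop cannot "wait" for the next line of the file, which is why the OP's nesting of `do ... while` around whole records does not work.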

Last edited by Scrutinizer; 03-26-2015 at 08:28 AM..
# 4  
Old 03-26-2015
Thanks guys. I appreciate it a lot. You've got some good questions, Don, and I realize I was fuzzy about the output and the description. So for the record I will describe it a bit better here. The desired output would look like this:

Code:
destIP    avgRetransTime  maxRetransTime  noRetrans  noSuccPack  transBytes
--------  --------------  --------------  ---------  ----------  ----------
1.2.3.4            0.335            0.47          3           2        1300
1.2.3.5            0.05             0.05          1           2        1580

So only one resulting line per destination IP with the following info:
1. IP address.

2. avgRetransTime: derived by finding the time difference between the first and last packet with the same IP and sequence number, summing those differences, and dividing by the number of sequence numbers that were retransmitted. Example: for 1.2.3.4, two sequence numbers were retransmitted. For 123, the time between the first and last packet is 0.47 seconds (0.48-0.01). For 124 it's 0.20 seconds (0.72-0.52). So the average time is (0.47+0.20)/2=0.335.

3. maxRetransTime: The longest time any sequence number took to retransmit. For 1.2.3.4 it's 123, which took 0.47 seconds.

4. noRetrans: All retransmissions counted. For 1.2.3.4, packet 123 has been sent 3 times (2 retransmissions) and packet 124 has been sent 2 times (1 retransmission). So a total of 3.

5. noSuccPack: The number of packets (per IP) that are considered delivered. For 1.2.3.4, both 123 and 124 are considered delivered unless the number of retransmissions for a single sequence number exceeds 5. Then the packet is considered "not delivered".

6. transBytes: Each time a packet delivery is successful (counted at the last time the sequence number is seen, provided the sequence number is not repeated more than 5 times), this parameter is incremented by the number of bytes.

Your code is very straightforward and useful. I think I will be able to adjust it to get what I need. Almost :-). What's missing is this:

The "loop" is repeated as long as the IP address and the seq No are the same. Given my desired output I want to sum up a few things, make some divisions and so on. Feels like I need a loop that knows if it's the last lap inside of the brackets so to say. "If this is the last time I see this combination of IP and seq No I should sum up things and divide etc.". That's why I tried to go for the do-while loop. Could you recommend how to approach this one?
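
One possible approach to the "last lap" question, as a sketch rather than a definitive answer: on sorted input, a group is known to be finished when the next record has a different key, and the very last group is finished in `END`. The function names `ip_stats`, `closeSeq` and `closeIP`, the variable names, and the `maxRetrans=5` threshold (taken from the description of noSuccPack above) are assumptions of this example. With the sample as posted, 1.3.2.5 is treated as its own address, so 1.2.3.5 shows no retransmissions here.

```shell
#!/bin/sh
# Hypothetical helper: reads the stats format on stdin, prints one line per destination IP:
# destIP avgRetransTime maxRetransTime noRetrans noSuccPack transBytes
ip_stats() {
    LC_ALL=C sort -k1,1 -k3,3n -k2,2n | awk -v maxRetrans=5 '
    # Called when the current (IP, seqNo) group is complete.
    function closeSeq(   span) {
        if (cnt > 0) {                      # this sequence number was retransmitted
            span = tLast - tFirst
            spanSum += span
            spanCount++
            if (span > spanMax) spanMax = span
        }
        retransTotal += cnt
        if (cnt <= maxRetrans) {            # assumed rule: more repeats means "not delivered"
            okPackets++
            okBytes += bytes
        }
    }
    # Called when the current IP is complete: print its line, reset the per-IP totals.
    function closeIP() {
        printf("%-15s %8.3f %8.3f %4d %4d %8d\n", ip,
               (spanCount ? spanSum / spanCount : 0), spanMax,
               retransTotal, okPackets, okBytes)
        spanSum = spanCount = spanMax = retransTotal = okPackets = okBytes = 0
    }
    $1 == "destIP" || NF < 4 { next }       # skip the header and short lines such as "..."
    $1 != ip || $3 != seq {                 # key changed: the previous group just ended
        if (ip != "") closeSeq()
        if (ip != "" && $1 != ip) closeIP()
        ip = $1; seq = $3
        tFirst = tLast = $2 + 0
        cnt = 0
        bytes = $4
        next
    }
    {                                       # same IP and seqNo again: a retransmission
        tLast = $2 + 0
        cnt++
        bytes = $4
    }
    END {                                   # the last group has no following record
        if (ip != "") { closeSeq(); closeIP() }
    }'
}

# Usage with the sample rows from post #1.
ip_stats <<'EOF'
destIP   time  seqNo  packetLength
1.2.3.4  0.01   123       500
1.2.3.5  0.03    44       1500
1.3.2.5  0.08    44       1500
1.2.3.4  0.44   123       500
1.2.3.4  0.48   123       500
1.2.3.4  0.52   124       800
1.2.3.4  0.72   124       800
1.2.3.5  0.83    45       80
EOF
```

For 1.2.3.4 this should give the values from the description: average 0.335, maximum 0.470, 3 retransmissions, 2 successful packets and 1300 bytes.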

Thanks!
/Z

Last edited by Zooma; 03-26-2015 at 04:36 PM.. Reason: Fixed code formatting.
# 5  
Old 03-30-2015
Hi,
Have created some code now that I think would do the trick if I didn't get syntax errors. Really appreciate any help.

Cheers!
/Z


Code:
#!/bin/ksh
sort -k1,1 -k3,3n -k2,2 stats.txt | awk 
BEGIN {	printf("%17s %7d %d %d %d %d\n",
		"destIP", "avgRetransTime", "maxRetransTime", "noRetrans", "noSuccPack", "transBytes")
	printf("---------- --------------- ---------------- -------------- -------------- ---------------\n")

}

# If the IP address found is not in the list
$1 != lIP {
	maxOverallTime = 0
	tempIp = $1
	noSuccPackPerIp=0
	transBytesPerIp=0
	
	while (tempIp == $1){
			transBytesPerIp=0
					
		$3 != lSeqNo{
			minTime = maxTime = $2
			cnt = 0
			
			# check this
			transBytesForSeqNo = $4
		
			while($3 == lSeqNo) {
				maxTime = $2
				cnt++
				next
			}
			
			if ((maxTime-minTime)>maxOverallTime){
				maxOverallTime=(maxTime-minTime)
			}
			
			if (count<10){
				noSuccPackPerIp++
				transBytesForSeqNo=0
			}
			transBytesPerIp += transBytesForSeqNo
			lSeqNo = $3
		}
	}
	printf("%17s %7d %11.3d %d %d %15d\n", lIP, (maxTime-minTime)/cnt, maxOverallTime, cnt, noSuccPackPerIp, transBytesForSeqNo)
}'

# 6  
Old 03-31-2015
This will become lengthy. Did you get any error message that would point you in some direction?

OK, let's start:
- The first single quote after awk is missing; it should introduce the 'program text'
- In the BEGIN section, you print 6 strings using 1 string but 5 integer format specifiers.
- The 17s don't match the underlining dashes.
- You don't modify/assign lIP, so $1 will rarely match.
- $3 != lSeqNo: you can't use pattern syntax within an action block. Use if (...)
- not sure if it is wise to leave a while loop with a next statement, but there's no other way, either. (BTW, it will never be entered as the entire block will be run only if $3 != lSeqNo)

... and then I'm lost, even though it looks like the counts of opening and closing brackets match.
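
As a small sketch of the first three points (the opening quote, one `%s` per string, and column widths that match the dashes), the header could be printed like this. The function names `print_header` and `dashes` and the chosen widths are assumptions for illustration only:

```shell
#!/bin/sh
# Hypothetical helper: prints a six-column header and a matching line of dashes.
print_header() {
    # The awk program starts at the opening single quote and ends at the closing one.
    awk '
    # Returns a string of n dashes, so the underline always matches the width.
    function dashes(n,   s) {
        while (n-- > 0) s = s "-"
        return s
    }
    BEGIN {
        fmt = "%-15s %14s %14s %9s %10s %10s\n"    # six strings, six %s specifiers
        printf(fmt, "destIP", "avgRetransTime", "maxRetransTime",
               "noRetrans", "noSuccPack", "transBytes")
        printf(fmt, dashes(15), dashes(14), dashes(14),
               dashes(9), dashes(10), dashes(10))
    }'
}

print_header
```

Reusing one format string for both lines keeps the headings and the dashes aligned.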
# 7  
Old 03-31-2015
Hi RudiC,
Thanks a lot for your comments. I also find the overall syntax (missing starting '{' etc.) a bit strange, but if you look at Don's example code further up, it's also missing the initial '{' and worked excellently anyway.

The errors I get are of the same type:
awk: line X: syntax error at or near {

I have counted all the brackets and they should be OK. Weird. Maybe I'll try C instead.
