awk data subsets manipulation


 
# 1  
Old 06-12-2012

Hi,
I'm working on a data file with the following structure:

Code:
val1,val2,flag
214.7332983,979.0259,1
12.87435571,205.7679,1
1.365976384,19.01616,1
44.08584096,205.7679,2
7.034721792,383.8778,2
189.5685503,979.0259,2
1.96352032,19.01616,2
[...]

where the field 'flag' identifies the different groups. I'd like to compute statistics on each group and save them in an output file. I'm not an expert user, and it is not clear to me whether awk can do the following: take the first three lines (flag 1), pipe their first two columns to another shell command that calculates the relevant statistic, print the output, and then move on to the next group (the last four lines, flag 2).
Any help? Thanks in advance.

Last edited by Scrutinizer; 06-12-2012 at 07:02 AM.. Reason: code tags
# 2  
Old 06-12-2012
Can you post the expected result?
# 3  
Old 06-12-2012
What I have in mind is something like the following:

Code:
zcat temp.csv.gz | gawk -F ',' '$3 == 1 {print $1, $2}' | STAT_CMD

where STAT_CMD computes a statistic on the first two columns, and the value '1' is replaced in turn by each value of the third field in the temp file, so that lines are grouped according to the flag.
In my example the output would be two numbers, one per flag value: the result of STAT_CMD (e.g. the correlation between the two columns) applied to each of these two sets of column pairs:

Code:
214.7332983,979.0259
12.87435571,205.7679
1.365976384,19.01616

and

Code:
44.08584096,205.7679
7.034721792,383.8778
189.5685503,979.0259
1.96352032,19.01616

Sorry if I'm not super clear.

Last edited by Scrutinizer; 06-12-2012 at 07:40 AM.. Reason: code tags
# 4  
Old 06-12-2012
You need to get the list of groups first.


Code:
# List the distinct flag values (skipping the header), then filter once per group.
awk -F, 'NR>1 {print $3}' temp.csv | sort -u | while IFS= read -r group
do
  awk -v g="$group" -F, '$3==g {print $1,$2}' temp.csv | STAT_CMD
done

Regarding redirection to a file, it depends on how the command "STAT_CMD" processes your data.

I might not have fully understood you.
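
If STAT_CMD reads stdin and writes its result to stdout, one way to save everything is to redirect the output of the whole loop. This is only a sketch under that assumption: the `STAT_CMD` function below is a hypothetical stand-in that just counts its input lines, and the sample file and scratch directory are created here for illustration.

```shell
# Sample data from post #1, written to a scratch directory
# (assumption: a plain, uncompressed temp.csv).
workdir=$(mktemp -d)
cat > "$workdir/temp.csv" <<'EOF'
val1,val2,flag
214.7332983,979.0259,1
12.87435571,205.7679,1
1.365976384,19.01616,1
44.08584096,205.7679,2
7.034721792,383.8778,2
189.5685503,979.0259,2
1.96352032,19.01616,2
EOF

# Hypothetical stand-in for STAT_CMD: prints the number of lines read from stdin.
STAT_CMD() { awk 'END {print NR}'; }

awk -F, 'NR>1 {print $3}' "$workdir/temp.csv" | sort -u | while IFS= read -r group
do
  # Prefix each result with its group so the output file is self-describing.
  printf '%s ' "$group"
  awk -v g="$group" -F, '$3==g {print $1,$2}' "$workdir/temp.csv" | STAT_CMD
done > "$workdir/stats.txt"   # one redirection for the whole loop

cat "$workdir/stats.txt"
```

Redirecting after `done` opens the output file once, instead of appending with `>>` on every iteration.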
This User Gave Thanks to clx For This Post:
# 5  
Old 06-12-2012
Thanks a lot, it works nicely since my STAT_CMD reads from stdin. I made only a minor modification:
Code:
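# uniq only collapses adjacent duplicates, so this assumes that the rows of
# each group are contiguous in temp.csv.
awk -F, 'NR>1 {print $3}' temp.csv | uniq | while IFS= read -r group
do
  awk -v g="$group" -F, '$3==g {print $1,$2}' temp.csv | STAT_CMD
done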

However it's very slow, and I have millions of lines. Any suggestions?
# 6  
Old 06-12-2012
It will indeed be slow like this, since temp.csv is read in full once for every group. What is in STAT_CMD?
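
A single pass over the file may help here. The sketch below assumes that the rows of each group are contiguous (as the `uniq` variant in post #5 already does) and lets awk pipe each group to the command, closing the pipe whenever the flag changes so that a fresh command is started for the next group. The command `grep -c .` is only a hypothetical stand-in for STAT_CMD that counts lines, and the sample file is created for illustration.

```shell
# Sample data from post #1 in a scratch directory (assumption: plain CSV).
workdir=$(mktemp -d)
cat > "$workdir/temp.csv" <<'EOF'
val1,val2,flag
214.7332983,979.0259,1
12.87435571,205.7679,1
1.365976384,19.01616,1
44.08584096,205.7679,2
7.034721792,383.8778,2
189.5685503,979.0259,2
1.96352032,19.01616,2
EOF

# Hypothetical stand-in for STAT_CMD; it must be an external command here,
# because awk runs it through the shell and cannot see shell functions.
stat_cmd='grep -c .'

result=$(awk -F, -v cmd="$stat_cmd" '
  NR == 1 { next }                        # skip the header line
  NR > 2 && $3 != prev { close(cmd) }     # flag changed: finish the previous group
  { prev = $3; print $1, $2 | cmd }       # feed the first two columns to the command
  END { close(cmd) }                      # finish the last group
' "$workdir/temp.csv")

printf '%s\n' "$result"
```

Because `close(cmd)` waits for the command to finish, the results come out in group order. For a compressed file the same awk program can read from `zcat temp.csv.gz |` instead of a file name.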

Last edited by Scrutinizer; 06-12-2012 at 10:21 AM..
# 7  
Old 06-12-2012
STAT_CMD is a C program that calculates the Spearman rank correlation coefficient between two columns. I've checked, and it remains slow even without piping the output of gawk to my STAT_CMD.

Thanks again for your help
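
If the rows of a group are not guaranteed to be contiguous, another possible approach is to split the file into one file per group in a single pass and then run the statistic on each file. This is a sketch only: the `group_<flag>.txt` file names are made up for illustration, and the `STAT_CMD` function is a hypothetical stand-in that counts lines.

```shell
# Sample data from post #1 in a fresh scratch directory (assumption: plain CSV).
workdir=$(mktemp -d)
cat > "$workdir/temp.csv" <<'EOF'
val1,val2,flag
214.7332983,979.0259,1
12.87435571,205.7679,1
1.365976384,19.01616,1
44.08584096,205.7679,2
7.034721792,383.8778,2
189.5685503,979.0259,2
1.96352032,19.01616,2
EOF

# Hypothetical stand-in for STAT_CMD: prints the number of lines read from stdin.
STAT_CMD() { awk 'END {print NR}'; }

# One pass: append columns 1 and 2 of every row to the file of its group.
# Only one output file is kept open at a time; ">>" lets a group be reopened
# later without truncating what was already written.
awk -F, -v dir="$workdir" '
  NR > 1 {
    out = dir "/group_" $3 ".txt"
    if (out != last) { if (last != "") close(last); last = out }
    print $1, $2 >> out
  }
' "$workdir/temp.csv"

# One STAT_CMD run per group file, all results collected in stats.txt.
for f in "$workdir"/group_*.txt
do
  printf '%s ' "$(basename "$f" .txt)"
  STAT_CMD < "$f"
done > "$workdir/stats.txt"

cat "$workdir/stats.txt"
```

The input is read once regardless of the number of groups, at the cost of temporary disk space for the per-group files.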