AWK Data Cleaning


 
# 1  
Old 09-25-2010

Hello,

I am trying to analyze data I recently collected, and the only efficient way to clean it up seems to be an awk script.
I am very new to awk and am having great difficulty with it. For example, I am trying to delete every row that has a 1 in $8 or $9.
I cannot find any tutorials that explain how to do this. I would greatly appreciate it if someone could point me in the right direction.

Code:
"RECORDING_SESSION_LABEL"	"DATA_FILE"	"TRIAL_LABEL"	"trial"	"word"	"SAMPLE_BUTTON"	"LEFT_PUPIL_SIZE"	"LEFT_IN_BLINK"	"LEFT_IN_SACCADE"

"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1872.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1874.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1873.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1873.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1873.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1872.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1872.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1872.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1871.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1870.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1872.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1875.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1880.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1882.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1886.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1888.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1888.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1887.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1886.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1885.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1883.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1881.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1880.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1880.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1880.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1880.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1879.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1879.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1878.00	0	1
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1877.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1878.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1879.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1881.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1883.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1885.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1885.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1885.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1886.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1887.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1890.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1892.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1893.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1893.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1893.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1892.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1891.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1890.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1892.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1892.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1893.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1893.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1895.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1897.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1897.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1896.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1895.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1895.00	0	0
"test"	"test.edf"	"Trial: 1"	1	>>>>>	.	1897.00	0	0


Last edited by Scott; 09-25-2010 at 04:36 AM.. Reason: Please use code tags
# 2  
Old 09-25-2010
That is probably because, with awk's default whitespace splitting, "Trial: 1" counts as two fields, so the columns you want are $9 and $10. You can try:
Code:
awk '$9!=1 && $10!=1' infile

If these values can only be 0 or 1, you could even do this:
Code:
awk '!$9 && !$10' infile
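
The field numbers depend on the field separator. Here is a minimal sketch of that, assuming the file is tab-separated like the sample above (the two rows and the temporary file are made up for illustration). With default splitting the flags are $9 and $10; with -F'\t' they are $8 and $9, matching the column headers.

```shell
# Build a two-row, tab-separated sample in a temporary file.
sample=$(mktemp)
printf '"test"\t"test.edf"\t"Trial: 1"\t1\t>>>>>\t.\t1872.00\t0\t0\n' >  "$sample"
printf '"test"\t"test.edf"\t"Trial: 1"\t1\t>>>>>\t.\t1871.00\t0\t1\n' >> "$sample"

# Default splitting on whitespace: "Trial: 1" becomes two fields.
nf_default=$(awk 'NR==1 {print NF}' "$sample")
# Tab splitting: one field per column.
nf_tab=$(awk -F'\t' 'NR==1 {print NF}' "$sample")
echo "$nf_default $nf_tab"

# Keep only rows where neither flag is 1, using tab-based numbering.
kept=$(awk -F'\t' '$8 != 1 && $9 != 1 {print $7}' "$sample")
echo "$kept"
rm -f "$sample"
```

Either numbering works as long as the separator and the field numbers agree.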

# 3  
Old 10-03-2010
Also, I am trying to insert a counter in the output that keeps track of each millisecond. My output should have three columns: Trial, MS, and Pupil Size. I am not sure how to include something that is not in my original data (each line is one millisecond). I am pretty lost and becoming frustrated.
# 4  
Old 10-03-2010
Something like this:
Code:
awk 'NR>3{print $3,$4,i++,$8}' infile

Code:
"Trial: 1" 0 1874.00
"Trial: 1" 1 1873.00
"Trial: 1" 2 1873.00
"Trial: 1" 3 1873.00
"Trial: 1" 4 1872.00
...
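
A variation on the same counter idea, as a sketch only: it assumes the file is tab-separated, that every data row is one millisecond sample, and that the header can simply be replaced. The output header names (Trial, MS, Pupil) and the temporary file are made up here.

```shell
# Header, one blank line, and two data rows, as in the posted sample.
data=$(mktemp)
printf '"RECORDING_SESSION_LABEL"\t"DATA_FILE"\t"TRIAL_LABEL"\t"trial"\t"word"\t"SAMPLE_BUTTON"\t"LEFT_PUPIL_SIZE"\t"LEFT_IN_BLINK"\t"LEFT_IN_SACCADE"\n\n' > "$data"
printf '"test"\t"test.edf"\t"Trial: 1"\t1\t>>>>>\t.\t1872.00\t0\t0\n' >> "$data"
printf '"test"\t"test.edf"\t"Trial: 1"\t1\t>>>>>\t.\t1874.00\t0\t0\n' >> "$data"

out=$(awk -F'\t' '
  NR == 1 { print "Trial\tMS\tPupil"; next }   # replace the original header
  NF < 9  { next }                              # skip blank or short lines
  { printf "%s\t%d\t%s\n", $3, ms++, $7 }       # ms counts data rows from 0
' "$data")
echo "$out"
rm -f "$data"
```

With tab splitting, the trial label is $3 and the pupil size is $7, so the quoted "Trial: 1" stays in one piece.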

# 5  
Old 10-03-2010
This is the code I have created thus far:

Code:
BEGIN {
FS="\t";RS="\n";
}
{if ($3)
printf "%s\t%s\t" $3,$7
{if ($3)
i=i+1
{if (i==4000){
 printf "%s\t%s\%s\n" $3,$7
}
}
}

However, I get this error:
Code:
awk: not enough args in printf(%s       %s      %s      %s      "TRIAL_LABEL")
 input record number 1, file pupil.txt
 source line number 5


Last edited by Scott; 10-03-2010 at 06:51 PM.. Reason: Please use code tags
# 6  
Old 10-03-2010
You must separate the format string from the parameter list with a comma:

Code:
printf "%s\t%s\t", $3,$7


Hope this helps a bit.
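
To see why the comma matters: in awk, two expressions placed side by side are concatenated, so without the comma the format string and $3 merge into one longer format and too few arguments remain. A small sketch with a made-up input line:

```shell
# With the comma, $3 and $7 are passed as arguments to the format.
line=$(printf 'a\tb\tTrial 1\tc\td\te\t1872.00\n' |
  awk -F'\t' '{ printf "%s\t%s\n", $3, $7 }')
echo "$line"
```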
# 7  
Old 10-03-2010
I keep running into minor problems. I am trying to get three different headers, one of which (milliseconds) is not in the data. The output I get does not have a constant millisecond count (each line is 1 millisecond). Is this because I am removing the 1's from fields $8 and $9?

Code:
BEGIN {
FS="\t";RS="\n";
}
{if ($3)
printf "%s\t%s\t%s\n",  $3, i++, $7
{if ($3)
i=i+1
{if (i==4000){
 printf "%s\t%s\n",  $3,$7
}
{if ($8!=1 && $9!=1){ printf "%s\t%s\t%s\n", $3, i++, $7
}
}
}
}
}

I thank you all for your help!
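
One likely cause, judging from the script as posted: the counter is advanced up to three times per record (i++ in the first printf, i=i+1, and i++ again in the last printf), and a record can be printed more than once. The sketch below advances the counter once per data row and prints only rows whose flags are not 1. It assumes tab-separated input; the header names, the sample rows, and the helper row() are made up for illustration, and the i==4000 branch is left out because its purpose is not clear from the thread.

```shell
data=$(mktemp)
# Helper that emits one data row: pupil size, blink flag, saccade flag.
row() { printf '"test"\t"test.edf"\t"Trial: 1"\t1\t>>>>>\t.\t%s\t%s\t%s\n' "$1" "$2" "$3"; }
{
  printf '"RECORDING_SESSION_LABEL"\t"DATA_FILE"\t"TRIAL_LABEL"\t"trial"\t"word"\t"SAMPLE_BUTTON"\t"LEFT_PUPIL_SIZE"\t"LEFT_IN_BLINK"\t"LEFT_IN_SACCADE"\n'
  row 1872.00 0 0
  row 1871.00 0 1
  row 1870.00 0 1
  row 1877.00 0 0
} > "$data"

cleaned=$(awk '
  BEGIN   { FS = "\t"; OFS = "\t" }
  NR == 1 { print "Trial", "MS", "Pupil"; next }  # hypothetical header names
  NF < 9  { next }                                 # ignore blank or short lines
  {
    ms++                                           # one tick per sample, kept or not
    if ($8 != 1 && $9 != 1)
      print $3, ms - 1, $7                         # print only unflagged samples
  }
' "$data")
echo "$cleaned"
rm -f "$data"
```

Because the counter ticks for every sample, the MS column skips over the removed rows, which keeps each value tied to its original time. If a continuous count of kept rows is wanted instead, move ms++ inside the if block.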

Last edited by radoulov; 10-05-2010 at 05:19 AM.. Reason: Code tags, please!
 