Removing repeating lines from a data frame (AWK)


 
# 8  
Old 07-19-2011
It's not working for me, maybe because I'm an idiot.

I'm writing my code in Notepad++ and running it through a shell. I'm using a .sh file to combine all my .dat files into one big .csv file, then a GAWK file to edit the format.

.sh file
Code:
#!/bin/bash

# input files from each day of data in june and combine into one big file
find /u/Picarro/DataLog/2011/june -type f -name "*dat" -exec cat {} > june.dat \;

#use new combined data as input file
IN_ALL='/u/Picarro/DataLog/2011/june/june.dat' 
	
# the csv file to create for all data called 'june.csv' in the respective directory
OUT_all='/u/Picarro/DataLog/2011/Awk/june.csv'		

# gawk files to create csv file
GAWK='/u/Picarro/DataLog/2011/Awk/Format_trial.csv.awk'

# produce the OUT file from the IN file(s)
"$GAWK" "$IN_ALL" > "$OUT_all"

GAWK file
Code:
#!/bin/gawk -f

# This file is to restructure the picarro data into the correct .csv columns for R

# create a header with same headings as variable in table
# also set other variables before parsing data
BEGIN   {
        OFS="," 	# tells awk that the output separator is a comma
        ORS=""  	# tells awk to not print newline after each print command so all records are
					# on the same line until we want a new line "\n"			
		getline}	# removes 1st line of input file ie header so we can replace it with correct one

# rearrange the yyyy-mm-dd | hh:mm:ss date and time to single date column of dd/mm/yyyy hh:mm needed for openair
{print substr($1,9,2) "/" substr($1,6,2) "/" substr($1,1,4) " " substr($2,1,5)}		

# print the rest of variables as columns 
{print (" ", $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20)
}


$1 $2 != prev {
    	
	{print "\n"}	# newline after each 5 seconds of data has been parsed
  
; prev=$1 $2}

I've tried putting the code you gave me into various places in the GAWK file but it doesn't seem to work.

Where am I going wrong?
# 9  
Old 07-19-2011
Don't append it to your code; just run that line on the file you need to clean headers from. Put this after the find line in your script:
Code:
awk '!/^\/\//' /u/Picarro/DataLog/2011/june/june.dat > /u/Picarro/DataLog/2011/june/june.dat.tmp
mv /u/Picarro/DataLog/2011/june/june.dat.tmp /u/Picarro/DataLog/2011/june/june.dat
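
To see what that one-liner does in isolation, here is a minimal sketch on made-up data (the sample rows and the variable names `sample` and `cleaned` are illustrative, not taken from the real june.dat). The pattern `/^\/\//` matches lines that begin with two slashes, and the leading `!` negates it, so only the non-matching lines are printed:

```shell
# Hypothetical sample: two header lines starting with "//" mixed in with data rows.
sample='// DATE TIME,a,b
01/06/2011 00:00,1,2
// DATE TIME,a,b
01/06/2011 00:05,3,4'

# Keep every line that does NOT start with "//".
cleaned=$(printf '%s\n' "$sample" | awk '!/^\/\//')

printf '%s\n' "$cleaned"
```

Because the pattern is anchored with `^`, a header that is preceded by a space or any other character is left in place.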

# 10  
Old 07-19-2011
Thanks for the help bartus, still not working. The headers are still interspersed throughout the data frame. I've tried putting the new code you gave me in different places too but it doesn't do anything to the file.

Any way around this?
# 11  
Old 07-19-2011
That smells like some Windows mess... Can you please post the output of this:
Code:
head -1 /u/Picarro/DataLog/2011/june/june.dat | od -c

which will show exactly what the header characters are.

---------- Post updated at 01:53 AM ---------- Previous update was at 01:44 AM ----------

And you could try this to strip the headers
Code:
awk  '!/\/\/ *DATE .*/' /u/Picarro/DataLog/2011/june/june.dat

Looking at your awk script, I think this might be the culprit:
Code:
# print the rest of variables as columns 
{print (" ", $3, ...

If you have a space there at the beginning, then the regex in awk will not match that line.
When you pipe the header through 'od', as I suggested, it will show whether there is one.

Last edited by mirni; 07-19-2011 at 09:09 AM..
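
A quick way to check the point about the leading space, as a sketch on a made-up line (the sample text and variable names are illustrative): the anchored pattern from the earlier post misses a header that starts with a space, while the unanchored pattern in this post still catches it.

```shell
# Hypothetical header line with a leading space, followed by one data row.
sample=' // DATE TIME,a,b
01/06/2011 00:00,1,2'

# Anchored: "^" requires "//" in column 1, so the indented header survives.
anchored=$(printf '%s\n' "$sample" | awk '!/^\/\//')

# Unanchored: matches "// DATE " anywhere on the line, so the header is dropped.
unanchored=$(printf '%s\n' "$sample" | awk '!/\/\/ *DATE .*/')

printf 'anchored keeps %s line(s)\n'   "$(printf '%s\n' "$anchored"   | wc -l | tr -d ' ')"
printf 'unanchored keeps %s line(s)\n' "$(printf '%s\n' "$unanchored" | wc -l | tr -d ' ')"
```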
# 12  
Old 07-19-2011
Hey mirni,

when I put in the code you gave me

Code:
head -1 /u/Picarro/DataLog/2011/june/june.dat | od -c

it returned
Code:
0000000   D   A   T   E                                                
0000020                                           T   I   M   E        
0000040                                                                
0000060                   F   R   A   C   _   D   A   Y   S   _   S   I
0000100   N   C   E   _   J   A   N   1                           F   R
0000120   A   C   _   H   R   S   _   S   I   N   C   E   _   J   A   N
0000140   1                               E   P   O   C   H   _   T   I
0000160   M   E                                                        
0000200           A   L   A   R   M   _   S   T   A   T   U   S        
0000220                                                   s   p   e   c
0000240   i   e   s                                                    
0000260                           s   o   l   e   n   o   i   d   _   v
0000300   a   l   v   e   s                                            
0000320   M   P   V   P   o   s   i   t   i   o   n                    
0000340                                           O   u   t   l   e   t
0000360   V   a   l   v   e                                            
0000400                   C   a   v   i   t   y   P   r   e   s   s   u
0000420   r   e                                                   C   a
0000440   v   i   t   y   T   e   m   p                                
0000460                                   W   a   r   m   B   o   x   T
0000500   e   m   p                                                    
0000520           E   t   a   l   o   n   T   e   m   p                
0000540                                                   D   a   s   T
0000560   e   m   p                                                    
0000600                           C   O   2   _   s   y   n   c        
0000620                                                                
0000640   C   O   2   _   d   r   y   _   s   y   n   c                
0000660                                           C   H   4   _   s   y
0000700   n   c                                                        
0000720                   C   H   4   _   d   r   y   _   s   y   n   c
0000740                                                           H   2
0000760   O   _   s   y   n   c                                        
0001000                                  \r  \n
0001012

Those are the headers that are interspersed throughout the combined .dat file. Is this bad?
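
The `\r  \n` at the end of that dump means the line ends with a carriage return followed by a newline, i.e. Windows-style line endings. Whether that is what defeats the header filter is not established here, but if you want to rule it out, a minimal sketch for stripping the carriage returns looks like this (the two-line input is made up and `stripped` is just an illustrative variable name):

```shell
# Remove a trailing carriage return from every line, then print the line.
stripped=$(printf 'DATE TIME\r\n2011-06-01 00:00:01\r\n' | awk '{ sub(/\r$/, "") } 1')

# Dump the result; the lines should now end in \n with no \r before it.
printf '%s\n' "$stripped" | od -c
```

`tr -d '\r'` does the same job if no carriage returns are wanted anywhere in the file.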
# 13  
Old 07-19-2011
I don't see the "//" characters in front of the header. Can you also post output of:
Code:
head -1 /u/Picarro/DataLog/2011/june/june.dat | cat -Te

# 14  
Old 07-19-2011
The "// DATE TIME" header is in the .csv file. I think it appears once I combine the two columns using the GAWK code I posted above.

The result from
Code:
head -1 /u/Picarro/DataLog/2011/june/june.dat | cat -Te

is:
Code:
DATE                      TIME                      FRAC_DAYS_SINCE_JAN1      FRAC_HRS_SINCE_JAN1       EPOCH_TIME                ALARM_STATUS              species                   solenoid_valves           MPVPosition               OutletValve               CavityPressure            CavityTemp                WarmBoxTemp               EtalonTemp                DasTemp                   CO2_sync                  CO2_dry_sync              CH4_sync                  CH4_dry_sync              H2O_sync                  ^M$

To clarify,

The headers shown above are in the .dat files (both single files and big combined file)

The "// DATE TIME" header (which merges the separate "DATE" and "TIME" headers into one column) only appears in the .csv file, after the .dat file has been 'GAWKed'.

Does that help?
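
One reading of the above, offered as a sketch rather than a confirmed diagnosis: the GAWK file applies its date rule to every input line, including the repeated header lines. On a header line `$1` is "DATE" and `$2` is "TIME", so `substr($1,9,2)` and `substr($1,6,2)` are empty and the rule prints `//DATE TIME`. If that is what is happening, the "//" never exists in june.dat, which would explain why a filter run on the .dat file finds nothing to remove. The snippet below reproduces this on a made-up two-line input and then skips header lines before the date rule runs; the `$1 == "DATE"` test and the sample values are assumptions, not part of the original script.

```shell
# Hypothetical input: one repeated header line and one data row.
sample='DATE TIME CO2_sync
2011-06-01 00:00:01.123 391.5'

# Same date expression as in the GAWK file, applied to every line.
before=$(printf '%s\n' "$sample" |
    awk '{ print substr($1,9,2) "/" substr($1,6,2) "/" substr($1,1,4) " " substr($2,1,5) }')

# Same expression, but header lines are skipped first.
after=$(printf '%s\n' "$sample" |
    awk '$1 == "DATE" { next }
         { print substr($1,9,2) "/" substr($1,6,2) "/" substr($1,1,4) " " substr($2,1,5) }')

printf 'before:\n%s\n' "$before"
printf 'after:\n%s\n' "$after"
```

Putting a rule like `$1 == "DATE" { next }` ahead of the other rules in the GAWK file would have the same effect there, assuming every header line starts with the word DATE as in the `head -1` output.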