Removing repeating lines from a data frame (AWK)

07-19-2011

It's not working for me, maybe cos I'm an idiot.

I'm writing my code in Notepad++ and running it through a shell. I'm using a .sh file to combine all my .dat files into one big .dat file, then a GAWK file to convert it into a .csv file in the right format.

.sh file

Code:

#!/bin/bash

# combined input file built from each day of data in June
IN_ALL='/u/Picarro/DataLog/2011/june/june.dat'

# the csv file to create for all data, called 'june.csv', in the respective directory
OUT_ALL='/u/Picarro/DataLog/2011/Awk/june.csv'

# gawk file that creates the csv file
GAWK='/u/Picarro/DataLog/2011/Awk/Format_trial.csv.awk'

# combine the daily files into one big file; the redirect belongs to find as a whole,
# and june.dat itself is excluded so the output is never read back in as input
find /u/Picarro/DataLog/2011/june -type f -name "*dat" ! -name 'june.dat' -exec cat {} + > "$IN_ALL"

# produce the OUT file from the IN file (variable names are case-sensitive)
"$GAWK" "$IN_ALL" > "$OUT_ALL"

GAWK file

Code:

#!/bin/gawk -f

# This file restructures the Picarro data into the correct .csv columns for R

# set output variables before parsing data
BEGIN {
    OFS = ","   # tells awk that the output field separator is a comma
    ORS = ""    # tells awk not to print a newline after each print command, so all
                # records stay on the same line until we want a new line "\n"
    getline     # skips the 1st line of the input file, i.e. the header, so it can be replaced
}

# rearrange the yyyy-mm-dd | hh:mm:ss date and time into a single date column of dd/mm/yyyy hh:mm for openair
{print substr($1,9,2) "/" substr($1,6,2) "/" substr($1,1,4) " " substr($2,1,5)}

# print the rest of variables as columns
{print (" ", $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20)}

# print a newline whenever the date and time differ from the previous record
$1 $2 != prev {
    print "\n"
    prev = $1 $2
}

I've tried putting the code you gave me into various places in the GAWK file but it doesn't seem to work.

Where am I going wrong?

gd9629

07-19-2011

Don't append it to your code, just run that line on the file that you need to clean headers from. Put this after the find line in your script:

Code:

awk '!/^\/\//' /u/Picarro/DataLog/2011/june/june.dat > /u/Picarro/DataLog/2011/june/june.dat.tmp
mv /u/Picarro/DataLog/2011/june/june.dat.tmp /u/Picarro/DataLog/2011/june/june.dat

bartus11

07-19-2011

Thanks for the help bartus, still not working. The headers are still interspersed throughout the data frame. I've tried putting the new code you gave me in different places too but it doesn't do anything to the file.

Any way around this?

gd9629

07-19-2011

That smells like some Windows mess... Can you please post the output of this:

Code:

head -1 /u/Picarro/DataLog/2011/june/june.dat | od -c

That will show exactly which characters the header contains.
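
For anyone following along, here is a minimal sketch of what od -c makes visible, using a made-up one-line sample rather than the real june.dat. A tab shows up as \t and a Windows line ending as \r followed by \n. The helper name show_bytes is hypothetical.

```shell
#!/bin/bash
# show_bytes is a hypothetical helper: dump stdin one character at a time
show_bytes() {
    od -c
}

# made-up sample header with a tab separator and a Windows (CRLF) line ending
printf 'DATE\tTIME\r\n' | show_bytes
```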

---------- Post updated at 01:53 AM ---------- Previous update was at 01:44 AM ----------

And you could try this to strip the headers

Code:

awk  '!/\/\/ *DATE .*/' /u/Picarro/DataLog/2011/june/june.dat

Looking at your awk script, I think this might be the culprit:

Code:

# print the rest of variables as columns 
{print (" ", $3, ...

If you have a space there at the beginning, then a regex anchored with ^ (such as ^\/\/) will not match that line.
When you pipe the header through 'od', as I suggested, it will show exactly which characters are there.
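
To illustrate the point about the leading space, here is a small sketch on made-up sample lines (not the real data). The anchored pattern ^\/\/ only drops lines that have // in the very first column, while a pattern that allows optional leading blanks catches both variants. Both function names are hypothetical.

```shell
#!/bin/bash
# hypothetical helpers, for illustration only

# drop lines that begin with "//" in the very first column
drop_anchored() {
    awk '!/^\/\//'
}

# drop lines that begin with "//" after optional leading spaces or tabs
drop_with_leading_blanks() {
    awk '!/^[ \t]*\/\//'
}

# made-up sample: one header with a leading space, one without, one data row
sample=' // DATE TIME
// DATE TIME
2011-06-01 00:00:05'

printf '%s\n' "$sample" | drop_anchored              # the " // DATE TIME" line survives
printf '%s\n' "$sample" | drop_with_leading_blanks   # only the data row survives
```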

Last edited by mirni; 07-19-2011 at 09:09 AM..

mirni

07-19-2011

Hey mirni,

when I put in the code you gave me

Code:

head -1 /u/Picarro/DataLog/2011/june/june.dat | od -c

it returned

Code:

0000000   D   A   T   E                                                
0000020                                           T   I   M   E        
0000040                                                                
0000060                   F   R   A   C   _   D   A   Y   S   _   S   I
0000100   N   C   E   _   J   A   N   1                           F   R
0000120   A   C   _   H   R   S   _   S   I   N   C   E   _   J   A   N
0000140   1                               E   P   O   C   H   _   T   I
0000160   M   E                                                        
0000200           A   L   A   R   M   _   S   T   A   T   U   S        
0000220                                                   s   p   e   c
0000240   i   e   s                                                    
0000260                           s   o   l   e   n   o   i   d   _   v
0000300   a   l   v   e   s                                            
0000320   M   P   V   P   o   s   i   t   i   o   n                    
0000340                                           O   u   t   l   e   t
0000360   V   a   l   v   e                                            
0000400                   C   a   v   i   t   y   P   r   e   s   s   u
0000420   r   e                                                   C   a
0000440   v   i   t   y   T   e   m   p                                
0000460                                   W   a   r   m   B   o   x   T
0000500   e   m   p                                                    
0000520           E   t   a   l   o   n   T   e   m   p                
0000540                                                   D   a   s   T
0000560   e   m   p                                                    
0000600                           C   O   2   _   s   y   n   c        
0000620                                                                
0000640   C   O   2   _   d   r   y   _   s   y   n   c                
0000660                                           C   H   4   _   s   y
0000700   n   c                                                        
0000720                   C   H   4   _   d   r   y   _   s   y   n   c
0000740                                                           H   2
0000760   O   _   s   y   n   c                                        
0001000                                  \r  \n
0001012

those are the headers that are interspersed throughout the combined .dat file. Is this bad?

gd9629

07-19-2011

I don't see the "//" characters in front of the header. Can you also post the output of:

Code:

head -1 /u/Picarro/DataLog/2011/june/june.dat | cat -Te
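
As a rough guide to reading that output, assuming GNU cat: -T prints each tab as ^I, and -e prints a carriage return as ^M and marks every line end with $. A small sketch on a made-up sample line (show_invisibles is a hypothetical name):

```shell
#!/bin/bash
# show_invisibles is a hypothetical wrapper around GNU cat:
#   -T prints tabs as ^I
#   -e prints carriage returns as ^M and marks each line end with $
show_invisibles() {
    cat -Te
}

# made-up sample header: tab-separated, with a Windows (CRLF) line ending
printf 'DATE\tTIME\r\n' | show_invisibles
```

A line that ends in ^M$ therefore has a Windows-style CRLF line ending.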

bartus11

07-19-2011

The "// DATE TIME" header is in the .csv file. I think it appears once I combine the two columns using the GAWK code I posted above.

The result from

Code:

head -1 /u/Picarro/DataLog/2011/june/june.dat | cat -Te

is:

Code:

DATE                      TIME                      FRAC_DAYS_SINCE_JAN1      FRAC_HRS_SINCE_JAN1       EPOCH_TIME                ALARM_STATUS              species                   solenoid_valves           MPVPosition               OutletValve               CavityPressure            CavityTemp                WarmBoxTemp               EtalonTemp                DasTemp                   CO2_sync                  CO2_dry_sync              CH4_sync                  CH4_dry_sync              H2O_sync                  ^M$

To clarify,

The headers shown above are in the .dat files (both single files and big combined file)

The "// DATE TIME" header (which replaces the separate "DATE" and "TIME" headers with a single column) appears in the .csv file, after the .dat file has been 'GAWKed'.

Does that help?
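
Going by the od and cat -Te output above, the repeated header lines in the .dat file start with DATE rather than //, and every line ends in a carriage return. A minimal sketch of one way to clean that up before the GAWK step, assuming every header line has DATE as its first field and TIME as its second. clean_dat is a hypothetical name and the sample rows are made up.

```shell
#!/bin/bash
# clean_dat is a hypothetical helper: strip Windows carriage returns and
# drop every header line (first field DATE, second field TIME) from stdin
clean_dat() {
    awk '{ sub(/\r$/, "") }                    # remove a trailing carriage return
         $1 == "DATE" && $2 == "TIME" { next } # skip repeated header lines
         { print }'
}

# made-up sample: two daily files concatenated, each starting with its own header
printf 'DATE  TIME  CO2_sync\r\n2011-06-01  00:00:05  390.1\r\nDATE  TIME  CO2_sync\r\n2011-06-02  00:00:05  391.4\r\n' | clean_dat
```

In the script this could be run on june.dat right after the find line, writing to a temporary file and moving it over the original, as in bartus11's earlier suggestion. With all header lines already gone, the getline in the BEGIN block would skip the first data row instead of a header, so it would probably need to be dropped as well.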

gd9629
