The UNIX and Linux Forums  

#1  06-17-2009
metronomadic (Registered User)
Sed or awk script to remove text / or perform calculations from large CSV files

I have large CSV files (e.g. 2 million records) and am hoping to do one of two things. I have been trying to use awk and sed, but I am a newbie and can't figure out how to get them to work. Any help you could offer would be greatly appreciated. I'm stuck trying to remove the colon and wildcards in sed, and the averaging sample I've found using awk is giving me values of around 4e08.

The CSV file looks like this:

Code:
Date,AIRCOMPRESSOR\FLARE_FLOW,AIRCOMPRESSOR\FLARE_TEMP
3/1/2008,1044.83215332,1090.88208008
3/1/2008 12:00:10 AM,1044.83215332,1090.88208008
3/1/2008 12:00:21 AM,1046.71142578,1090.88208008
3/1/2008 12:00:31 AM,1044.83215332,1090.88208008
3/1/2008 12:00:41 AM,1048.59057617,1083.96069336
3/1/2008 12:00:51 AM,1044.83215332,1083.96069336
I am hoping either to use sed or another script to remove the seconds portion of the data lines (i.e. remove ":10 AM" and all similar occurrences), or preferably to use awk to average the flow rates for each minute or each 15 minutes (i.e. the column right after the time).

Thanks in advance for any help you can offer.

#2  06-17-2009
vgersh99 (Moderator, Forum Staff)
Something to start with:
Code:
nawk -F, '$1~":" {match($1,/:[^:]*$/); $1=substr($1,1,RSTART-1)}1' OFS=, myFile
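
For anyone following along, here is a small self-contained run of the same idea on two of the sample rows. It is only a sketch: it assumes a plain awk is on the PATH, and the function name strip_seconds is made up for illustration. It drops everything from the last colon of the first field onward, so both the seconds and the AM/PM marker go away.

```shell
# Hypothetical helper: cut the last ":..." chunk (seconds plus AM/PM) from field 1.
strip_seconds() {
    awk -F, -v OFS=, '
        # Only touch rows whose first field contains a time.
        $1 ~ /:/ {
            match($1, /:[^:]*$/)            # find the last colon and what follows it
            $1 = substr($1, 1, RSTART - 1)  # keep everything before that colon
        }
        { print }
    ' "$@"
}

# Two sample rows from the first post, fed in on stdin.
printf '%s\n' \
    '3/1/2008,1044.83215332,1090.88208008' \
    '3/1/2008 12:00:10 AM,1044.83215332,1090.88208008' |
    strip_seconds
```

The first row has no time part and passes through unchanged; the second comes out as 3/1/2008 12:00,1044.83215332,1090.88208008.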

#3  06-17-2009
metronomadic (Registered User)
Thanks for your prompt response, but it looks like I don't have nawk (I'm running Mac OS X). I'll see if I can get it through MacPorts and try again, but if there's any help that can be offered using awk, sed, or tr I know that I have those at my disposal.

EDIT: Installed nawk, and it worked like a charm. Thank you very much.

Last edited by metronomadic; 06-17-2009 at 02:07 PM.

#4  06-17-2009
vgersh99 (Moderator, Forum Staff)
Try 'awk' instead of 'nawk'.

#5  06-17-2009
ahmad.diab (Registered User)
To remove the seconds (e.g. the ":10" in 12:21:10) together with the AM/PM marker, use the sed command below:

Code:
sed 's/:[0-9][0-9] [AP]M,/,/' file.txt
To get the total, use:


Code:
awk ' BEGIN{c=0} {a[$1]+=$2;b[$1]+=$3;c++} END{for (i in a) {print "Total", a[i]/c,b[i]/c} ' file.txt
BR

Last edited by ahmad.diab; 06-17-2009 at 02:48 PM.

#6  06-17-2009
metronomadic (Registered User)
Thanks Ahmad. I tried the awk code (which I think needs an extra } to close out the for loop?), but I think it might be calculating something else. I am trying to get the average flow (column three) for each minute (or each 15-minute span) of each day. I am not sure I understand the code, but from the output it looks like it is gathering each day's worth of records and dividing them by the number of days?

I don't mean to be a bother, but can you tell me if this is what is going on?
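
One way to check is to print the fields awk actually sees. The sketch below assumes the awk command from post #5 is run without a -F option, so awk splits on whitespace rather than on commas. The function name show_fields is made up for illustration.

```shell
# Hypothetical helper: show how awk splits a line with the default
# (whitespace) field separator, and what each field is worth as a number.
show_fields() {
    awk '{ printf "key=%s | f2=%s -> %d | f3=%s -> %d\n", $1, $2, $2 + 0, $3, $3 + 0 }'
}

printf '%s\n' '3/1/2008 12:00:10 AM,1044.83215332,1090.88208008' | show_fields
```

This prints key=3/1/2008 | f2=12:00:10 -> 12 | f3=AM,1044.83215332,1090.88208008 -> 0. So the arrays are keyed by the date alone, the values being summed are the hour and zero rather than flow and temperature, and each sum is divided by c, the count of all lines read, not by the number of lines for that key.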

#7  06-17-2009
ahmad.diab (Registered User)
Sorry, kindly add the -F"," option (shown in bold in the original post), along with the closing brace for the for loop:

Code:
awk -F"," ' BEGIN{c=0} {a[$1]+=$2;b[$1]+=$3;c++} END{for (i in a) {print "Total", a[i]/c,b[i]/c}} ' file.txt
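
For the per-minute (or per-15-minute) average asked about in the first post, something along these lines may be closer. It is a sketch under several assumptions, not a tested solution for the real files: timestamps look like "3/1/2008 12:00:10 AM", a bare date such as "3/1/2008" means 12:00:00 AM, the first line of each file is a header, and the flow is column 2. The function name avg_flow and its width argument are made up for illustration. Unlike the command above, it keeps a separate count per bucket instead of one global counter.

```shell
# Hypothetical helper: average column 2 (flow) per time bucket.
# Usage: avg_flow WIDTH_MINUTES [FILE...]   (1 = per minute, 15 = per quarter hour)
avg_flow() {
    width=$1; shift
    awk -F, -v w="$width" '
        FNR == 1 { next }                     # skip the header row of each file
        {
            # t[1]=date t[2]=hour t[3]=minute t[4]=second t[5]=AM/PM
            n = split($1, t, /[ :]/)
            if (n < 5) {                      # assumption: bare date means 12:00:00 AM
                t[2] = 12; t[3] = 0; t[5] = "AM"
            }
            m = int(t[3] / w) * w             # first minute of the bucket
            key = sprintf("%s %d:%02d %s", t[1], t[2], m, t[5])
            if (!(key in cnt)) order[++k] = key   # remember first-seen order
            sum[key] += $2
            cnt[key]++
        }
        END {
            for (i = 1; i <= k; i++)
                printf "%s,%.4f\n", order[i], sum[order[i]] / cnt[order[i]]
        }
    ' "$@"
}

# The sample rows from the first post, fed in on stdin.
printf '%s\n' \
    'Date,AIRCOMPRESSOR\FLARE_FLOW,AIRCOMPRESSOR\FLARE_TEMP' \
    '3/1/2008,1044.83215332,1090.88208008' \
    '3/1/2008 12:00:10 AM,1044.83215332,1090.88208008' \
    '3/1/2008 12:00:21 AM,1046.71142578,1090.88208008' \
    '3/1/2008 12:00:31 AM,1044.83215332,1090.88208008' \
    '3/1/2008 12:00:41 AM,1048.59057617,1083.96069336' \
    '3/1/2008 12:00:51 AM,1044.83215332,1083.96069336' |
    avg_flow 1
```

All six sample rows fall in the same minute, so this prints a single line, 3/1/2008 12:00 AM,1045.7718. On a real file it would be called as avg_flow 1 file.txt or avg_flow 15 file.txt. The width should divide 60 evenly so that buckets do not straddle an hour boundary. Memory use grows with the number of buckets rather than the number of records, which should be manageable for a couple of million rows.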