The UNIX and Linux Forums  
  #1 (permalink)  
caddyjoe77, 07-11-2006
Large Text Files

Hi All

I have approximately 10 files that are each 100+ MB in size. I am importing them into a DB to publish them on the web. First I need to clean the files up so I don't have unnecessary rows in the DB. Below is what the file looks like:

Ignore the <TAB> annotations; they just show where the tab characters are. Also ignore the Part A and Part B designations; they are descriptors to show the format of the CSV file.

Part A: (the header information)

"Report Type"<TAB>"This Report"

"Date: 200610"

"Report: All Files"

"more junk:" <TAB> "Even More Junk"

"FileName"<TAB>"FilePath"<TAB>"LastAccessed"<TAB>"LastModified"<TAB>"Owner"

Part B: (the actual data I want to scrunch together without blank lines)

"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
and on down the list for approximately 50 lines

Then "Some Report Execution Time"

Part A

Part B

Part A

Part B

Part A and Part B repeat over and over again, listing all the files on a drive.

What I want to do is get rid of Part A completely and only keep the first
"FileName"<TAB>"FilePath"<TAB>"LastAccessed"<TAB>"LastModified"<TAB>"Owner"

These are large files ranging from 100-500 MB in size, so I want something quick and efficient such as sed or awk, but I am unsure how to craft it.

I tried something like this in a sed script file and ran it with the GnuWin32 build of sed:

sed -f sedscript inputfile > outputfile

Here is what the sed script file looked like:

/^$/d #get rid of spaces

s/"Report Type"<TAB>"This Report"//g #globally replace these strings
s/"Date: 200610"//g
s/"Report: All Files"//g
s/"more junk:" <TAB> "Even More Junk"//g

But I got some strange results. Only some of the blank lines disappeared, and others were left behind that I thought should have been removed, so maybe there is some hidden ASCII character there that I can't see?

Basically, what I would like to know from you all: am I doing this the best way? Any syntax help would be appreciated. FYI, I have to do this on a Windows box, so I have to use either ActivePerl, the Perl that comes with Microsoft SFU, or the GnuWin32 tools gawk and sed. I have enough memory (4 GB), a dual-core Xeon, and plenty of disk space.

Thanks for the help/opinions.

Joe
  #2 (permalink)  
Hitori, 07-11-2006
Replace commands like

s/"Report Type"<TAB>"This Report"//g

with

/"Report Type"<TAB>"This Report"/ d

Then you'll remove the entire line (including its '\n')
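
A minimal sketch of a script file built on that advice, using `d` instead of `s///g`. The sample data and the file names `sample.txt`, `sedscript`, and `cleaned.txt` are made up for illustration. The patterns anchor on the first quoted field so no literal tab has to be typed into the script. The `[[:space:]]` pattern is only an assumption about why some "blank" lines survived `/^$/d` (stray spaces or carriage returns); the thread does not confirm the cause.

```shell
# Hypothetical sample input; \t stands in for the real tab characters.
printf '"Report Type"\t"This Report"\n\n"Date: 200610"\n\n"Report: All Files"\n \n"a.txt"\t"/docs"\n"b.txt"\t"/tmp"\n' > sample.txt

# Script file in the style of the original post, with d in place of s///g.
cat > sedscript <<'EOF'
/^"Report Type"/d
/^"Date: /d
/^"Report: /d
/^[[:space:]]*$/d
EOF

# Delete the matching header lines and any empty or whitespace-only lines.
sed -f sedscript sample.txt > cleaned.txt
cat cleaned.txt
```

Only the two data rows should remain. Note that this deletes every line matching a pattern, so a column header you want to keep once would need separate handling.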
  #3 (permalink)  
caddyjoe77, 07-11-2006
Quote:
Originally Posted by Hitori
Replace commands like

s/"Report Type"<TAB>"This Report"//g

with

/"Report Type"<TAB>"This Report"/ d

Then you'll remove the entire line (including its '\n')

What do you mean by "with '\n'"? Do you mean that I have to put \n in at the end of each line?

I also tried using /d, but sed kept telling me it was missing an argument.

So the way I understand it, it should be like this?

s/"Report Set:"//g
s/"All Files ERM"//g
s/"All Files"//g
s/"Object Name:"//g
s/"06\/29\/2006 11:18:12"//g
s/"Selection:"//g
s/ All Files//g
s/"Description: This Report *"//g

/^$/d

There appears to be some Unicode in there as well (characters like 尀). How do I get sed to see those when running the script? I put the following in:

s/尀//g

and it returned an error on line 1: "Unknown Command".

thanks!!
  #4 (permalink)  
Hitori, 07-12-2006
Code:
$ cat file
LINE1
LINE2
LINE3
$ cat file | sed -e 's/LINE2//g'
LINE1

LINE3
$ cat file | sed -e '/LINE2/ d'
LINE1
LINE3
$
Do you see the difference now?
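
The `d` command on its own drops every header line, while the original question also asked to keep the first "FileName" column header. A rough awk sketch of that, under two assumptions not confirmed in the thread: header and data rows are the only lines with exactly five tab-separated fields, and lines may end in a carriage return from Windows line endings. The sample rows and the file names `report.txt` and `report_clean.txt` are hypothetical.

```shell
# Hypothetical sample: two Part A/Part B cycles with CRLF line endings.
{
  printf '"Report Type"\t"This Report"\r\n\r\n'
  printf '"FileName"\t"FilePath"\t"LastAccessed"\t"LastModified"\t"Owner"\r\n\r\n'
  printf '"a.txt"\t"/docs"\t"2006-01-01"\t"2006-01-02"\t"joe"\r\n'
  printf '"Some Report Execution Time"\r\n'
  printf '"Report Type"\t"This Report"\r\n\r\n'
  printf '"FileName"\t"FilePath"\t"LastAccessed"\t"LastModified"\t"Owner"\r\n\r\n'
  printf '"b.txt"\t"/tmp"\t"2006-02-01"\t"2006-02-02"\t"ann"\r\n'
} > report.txt

awk -F'\t' '
{ sub(/\r$/, "") }                        # drop a trailing carriage return, if present
NF != 5 { next }                          # assumption: only header and data rows have 5 fields
$1 == "\"FileName\"" && seen++ { next }   # skip every column header after the first
{ print }
' report.txt > report_clean.txt
cat report_clean.txt
```

This reads the file once, line by line, so memory use should not grow with file size. If a junk line ever has five tab-separated fields, the `NF != 5` test would let it through and an explicit pattern would be needed.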
  #5 (permalink)  
caddyjoe77, 07-12-2006
Yes, thanks.

Trying that now.