Large Text Files


 
# 1  
Old 07-11-2006
Large Text Files

Hi All

I have approximately 10 files that are each at least 100 MB in size. I am importing them into a DB to output them to the web. What I need to do first is clean the files up so I don't have unnecessary rows in the DB. Below is what the files look like:

Ignore the <TAB> annotations; they are just there to show you where the tabs are in the file. Also ignore the Part A and Part B designations; they are descriptors to tell you what the format of the CSV file looks like.

Part A: (the header information)

"Report Type"<TAB>"This Report"

"Date: 200610"

"Report: All Files"

"more junk:" <TAB> "Even More Junk"

"FileName"<TAB>"FilePath"<TAB>"LastAccessed"<TAB>"LastModified"<TAB>"Owner"

Part B: (the actual data I want to scrunch together without blank lines)

"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
and so on down the list for approximately 50 lines.

Then "Some Report Exection Time"

Part A

Part B

Part A

Part B

Part A and Part B repeat over and over again, eventually listing all the files on a drive.

What I want to do is get rid of Part A completely and only keep the first
"FileName"<TAB>"FilePath"<TAB>"LastAccessed"<TAB>"LastModified"<TAB>"Owner"

These are large files, ranging from 100-500 MB in size, so I want something quick and efficient such as sed or awk, but I am unsure how to craft it.

I tried something like this in a sed script file and called it via the GnuWin32 sed tool:

sed -f sedscript inputfile > outputfile

Here is what the sed script file looked like:

# get rid of blank lines
/^$/d

# blank out these header strings wherever they appear
s/"Report Type"<TAB>"This Report"//g
s/"Date: 200610"//g
s/"Report: All Files"//g
s/"more junk:" <TAB> "Even More Junk"//g

But I got some strange results. Only some of the blank lines disappeared, and it left some blank lines that I didn't think it should have, so maybe there is some hidden ASCII character there that I can't see?

Basically, what I would like to know from you all is: am I doing this the best way? Any syntax help would be appreciated. FYI, I have to do this on a Windows box, so I have to use either ActivePerl, the Perl that comes with Microsoft SFU, or the GnuWin32 tools gawk and sed. I have plenty of memory (4 GB), a dual-core Xeon, and plenty of disk space.

Thanks for the help/opinions.

Joe
# 2  
Old 07-11-2006
Replace commands like

s/"Report Type"<TAB>"This Report"//g

with

/"Report Type"<TAB>"This Report"/ d

Then you'll remove the entire line (including its trailing '\n').
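
Applied to the header lines from your sample, the whole script could look roughly like this. This is only a sketch: the patterns are guessed from the text you posted, so adjust them to the real lines, and remember that <TAB> in your sample is just an ordinary tab character as far as sed is concerned.

Code:
# sedscript - delete the Part A junk lines entirely (line plus newline)
/^"Report Type"/d
/^"Date: /d
/^"Report: /d
/^"more junk:"/d
# delete blank or whitespace-only lines (spaces, tabs, stray carriage returns)
/^[[:space:]]*$/d

Then run it the way you already are (report.txt and cleaned.txt are just placeholder names):

Code:
sed -f sedscript report.txt > cleaned.txt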
# 3  
Old 07-11-2006
Quote:
Originally Posted by Hitori
Replace commands like

s/"Report Type"<TAB>"This Report"//g

with

/"Report Type"<TAB>"This Report"/ d

Then you'll remove the entire line (including its trailing '\n').
What do you mean by "with '\n'"? Do you mean that I have to put \n at the end of each line?

I tried using the /d form also, but it kept telling me that sed was missing an argument.

So the way I am understanding it, it should be like this?

s/"Report Set:"//g
s/"All Files ERM"//g
s/"All Files"//g
s/"Object Name:"//g
s/"06\/29\/2006 11:18:12"//g
s/"Selection:"//g
s/ All Files//g
s/"Description: This Report *"//g

/^$/d

There also appears to be some Unicode in there (characters like 尀). How do I get sed to see those when running the script? I put the following in:

s/尀//g

and it returned an error on line 1: "Unknown Command".

thanks!!
# 4  
Old 07-12-2006
Code:
$ cat file
LINE1
LINE2
LINE3
$ cat file | sed -e 's/LINE2//g'
LINE1

LINE3
$ cat file | sed -e '/LINE2/ d'
LINE1
LINE3
$

Do you see the difference now?
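
As for the blank lines that refused to go away: if these reports were produced on Windows, every line probably ends in a carriage return (\r) before the newline, so a line that looks empty is really just "\r" and /^$/d never matches it. You can check with sed's l command, which prints each line with non-printing characters shown as escapes, and then strip the carriage returns before deleting blanks. A sketch (GNU sed, which the GnuWin32 build is, understands \r; report.txt and cleaned.txt are placeholder names, and use double quotes instead of single quotes if you run this from the Windows cmd prompt):

Code:
# show the first 20 lines with invisible characters made visible
sed -n '1,20 l' report.txt

# strip DOS carriage returns, then delete the lines that are truly empty
sed -e 's/\r$//' -e '/^$/d' report.txt > cleaned.txt

The odd wide characters you mentioned will also show up in the l output as escape sequences, which tells you exactly which bytes are there if you still need to delete them.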
# 5  
Old 07-12-2006
Yes, thanks.

Trying that now.
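
The one part I still have to work out is keeping only the first "FileName" column-header line and throwing away the repeats. I am thinking a small gawk script along these lines might handle the whole cleanup in one pass (an untested sketch; the patterns are guessed from my sample above, and clean.awk / report.txt / cleaned.txt are just placeholder names):

Code:
# clean.awk - drop the Part A junk, blank lines, and repeated column headers
/^"Report Type"/        { next }
/^"Date: /              { next }
/^"Report: /            { next }
/^"more junk:"/         { next }
/Report Execution Time/ { next }
/^[ \t\r]*$/            { next }
# keep only the first "FileName"... header line
/^"FileName"\t/         { if (seen++) next }
# everything else is the Part B data - print it
{ print }

Code:
gawk -f clean.awk report.txt > cleaned.txt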