Large Text Files


 
# 1  
Old 07-11-2006
Large Text Files

Hi All

I have approximately 10 files that are each at least 100 MB in size. I am importing them into a DB to output them to the web. What I need to do first is clean the files up so I don't have unnecessary rows in the DB. Below is what the files look like:

Ignore the <TAB> annotations; they are just there to show you where the tabs are in the file. Also ignore the Part A and Part B designations; they are descriptors to tell you what the format of the CSV file looks like.

Part A: (the header information)

"Report Type"<TAB>"This Report"

"Date: 200610"

"Report: All Files"

"more junk:" <TAB> "Even More Junk"

"FileName"<TAB>"FilePath"<TAB>"LastAccessed"<TAB>"LastModified"<TAB>"Owner"

Part B: (the actual data I want to scrunch together without blank lines)

"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
and so on down the list for approximately 50 lines.

Then "Some Report Exection Time"

Part A

Part B

Part A

Part B

Part A and Part B repeat over and over again, eventually listing all the files on a drive.

What I want to do is get rid of Part A completely and only keep the first
"FileName"<TAB>"FilePath"<TAB>"LastAccessed"<TAB>"LastModified"<TAB>"Owner"

These are large files, ranging from 100-500 MB in size, so I want something quick and efficient such as sed or awk, but I am unsure how to craft it.

I tried something like this in a sed script file and called it via the GnuWin32 sed tool:

sed -f sedscript inputfile > outputfile

Here is what the sed script file looked like:

# get rid of blank lines
/^$/d

# blank out these header strings wherever they appear
s/"Report Type"<TAB>"This Report"//g
s/"Date: 200610"//g
s/"Report: All Files"//g
s/"more junk:" <TAB> "Even More Junk"//g

But I got some strange results. Only some of the blank lines disappeared, and it left some blank lines that I didn't think it should have, so maybe there is some hidden ASCII character there that I can't see?

Basically, what I would like to know from you all is: am I doing this the best way? Any syntax help would be appreciated. FYI, I have to do this on a Windows box, so I have to use either ActivePerl, the Perl that comes with Microsoft SFU, or the GnuWin32 tools gawk and sed. I have plenty of memory (4 GB), a dual-core Xeon, and plenty of disk space.

Thanks for the help/opinions.

Joe
# 2  
Old 07-11-2006
Replace commands like

s/"Report Type"<TAB>"This Report"//g

with

/"Report Type"<TAB>"This Report"/ d

Then you'll remove the entire line (including its trailing '\n').
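
Applied to the header lines from your sample, the whole script could look roughly like this. This is only a sketch: the patterns are guessed from the text you posted, so adjust them to the real lines, and remember that <TAB> in your sample is just an ordinary tab character as far as sed is concerned.

Code:
# sedscript - delete the Part A junk lines entirely (line plus newline)
/^"Report Type"/d
/^"Date: /d
/^"Report: /d
/^"more junk:"/d
# delete blank or whitespace-only lines (spaces, tabs, stray carriage returns)
/^[[:space:]]*$/d

Then run it the way you already are (report.txt and cleaned.txt are just placeholder names):

Code:
sed -f sedscript report.txt > cleaned.txt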
# 3  
Old 07-11-2006
Quote:
Originally Posted by Hitori
Replace commands like

s/"Report Type"<TAB>"This Report"//g

with

/"Report Type"<TAB>"This Report"/ d

Then you'll remove the entire line (including its trailing '\n').
What do you mean by "with '\n'"? Do you mean that I have to put \n at the end of each line?

I tried using the /d form also, but it kept telling me that sed was missing an argument.

So the way I am understanding it, it should be like this?

s/"Report Set:"//g
s/"All Files ERM"//g
s/"All Files"//g
s/"Object Name:"//g
s/"06\/29\/2006 11:18:12"//g
s/"Selection:"//g
s/ All Files//g
s/"Description: This Report *"//g

/^$/d

There also appears to be some Unicode in there (characters like 尀). How do I get sed to see those when running the script? I put the following in:

s/尀//g

and it returned an error on line 1: "Unknown Command".

thanks!!
# 4  
Old 07-12-2006
Code:
$ cat file
LINE1
LINE2
LINE3
$ cat file | sed -e 's/LINE2//g'
LINE1

LINE3
$ cat file | sed -e '/LINE2/ d'
LINE1
LINE3
$

Do you see the difference now?
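
As for the blank lines that refused to go away: if these reports were produced on Windows, every line probably ends in a carriage return (\r) before the newline, so a line that looks empty is really just "\r" and /^$/d never matches it. You can check with sed's l command, which prints each line with non-printing characters shown as escapes, and then strip the carriage returns before deleting blanks. A sketch (GNU sed, which the GnuWin32 build is, understands \r; report.txt and cleaned.txt are placeholder names, and use double quotes instead of single quotes if you run this from the Windows cmd prompt):

Code:
# show the first 20 lines with invisible characters made visible
sed -n '1,20 l' report.txt

# strip DOS carriage returns, then delete the lines that are truly empty
sed -e 's/\r$//' -e '/^$/d' report.txt > cleaned.txt

The odd wide characters you mentioned will also show up in the l output as escape sequences, which tells you exactly which bytes are there if you still need to delete them.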
# 5  
Old 07-12-2006
Yes, thanks.

Trying that now.
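
The one part I still have to work out is keeping only the first "FileName" column-header line and throwing away the repeats. I am thinking a small gawk script along these lines might handle the whole cleanup in one pass (an untested sketch; the patterns are guessed from my sample above, and clean.awk / report.txt / cleaned.txt are just placeholder names):

Code:
# clean.awk - drop the Part A junk, blank lines, and repeated column headers
/^"Report Type"/        { next }
/^"Date: /              { next }
/^"Report: /            { next }
/^"more junk:"/         { next }
/Report Execution Time/ { next }
/^[ \t\r]*$/            { next }
# keep only the first "FileName"... header line
/^"FileName"\t/         { if (seen++) next }
# everything else is the Part B data - print it
{ print }

Code:
gawk -f clean.awk report.txt > cleaned.txt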