Hi All
I have approximately 10 files, each at least 100 MB in size. I am importing them into a DB so the data can be published on the web. What I need to do first is clean the files up so I don't have unnecessary rows in the DB. Below is what the file looks like:
Ignore the <TAB> annotations; they just mark where the file has a tab character.
Also ignore the Part A and Part B designations; they are only labels describing the layout of the CSV file.
Part A: (the header information)
"Report Type"<TAB>"This Report"
"Date: 200610"
"Report: All Files"
"more junk:" <TAB> "Even More Junk"
"FileName"<TAB>"FilePath"<TAB>"LastAccessed"<TAB>"LastModified"<TAB>"Owner"
Part B: (the actual data I want to scrunch together, without blank lines)
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
"NameofFile"<TAB>"PathOfFiles"<TAB>"FileLastAccessed"<TAB>"FileLastModified"<TAB>"FileOwner"
and so on down the list for approximately 50 lines
Then "Some Report Execution Time"
Part A
Part B
Part A
Part B
Part A and Part B repeat over and over again, listing all the files on a drive.
What I want to do is get rid of Part A completely and keep only the first occurrence of the column header line:
"FileName"<TAB>"FilePath"<TAB>"LastAccessed"<TAB>"LastModified"<TAB>"Owner"
These are large files ranging from 100 to 500 MB in size, so I want something quick and efficient such as sed or awk, but I am unsure how to craft it.
I tried something like this in a sed script file and ran it with the GnuWin32 build of sed:
sed -f sedscript input filename >output filename
Here is what the sed script file looked like:
/^$/d #get rid of spaces
s/"Report Type"<TAB>"This Report"//g #globally replace these strings
s/"Date: 200610"//g
s/"Report: All Files"//g
s/"more junk:" <TAB> "Even More Junk"//g
But I got some strange results. Only some of the blank lines disappeared; others were left behind that I thought should have been removed, so maybe there is some hidden ASCII character there that I can't see?
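Two things could plausibly explain the leftover blank lines. First, sed applies the script's commands to each line in order, so /^$/d runs before the s///g substitutions; a line that a substitution empties afterwards is still printed as a blank line. Second, if the files have Windows CRLF line endings and sed sees the carriage return, a "blank" line holds one invisible character and never matches ^$. The sketch below deletes the junk lines outright and removes blank lines last. It assumes GNU sed (for \r in a pattern), and the names fix.sed, sed_sample.txt and sed_cleaned.txt are made up for illustration. It does not yet deal with the repeated column header line.

```shell
# Hypothetical file names: fix.sed, sed_sample.txt, sed_cleaned.txt.
cat > fix.sed <<'EOF'
# Strip a trailing carriage return, if present (GNU sed understands \r).
s/\r$//
# Delete whole junk lines instead of blanking them with s///.
/^"Report Type"/d
/^"Date: /d
/^"Report: /d
/^"more junk:"/d
# Delete lines that are empty or hold only whitespace, as the last step.
/^[[:space:]]*$/d
EOF

# Tiny sample with CRLF endings to exercise the script.
{
  printf '"Report Type"\t"This Report"\r\n'
  printf '"Date: 200610"\r\n'
  printf '\r\n'
  printf '"NameofFile"\t"PathOfFiles"\r\n'
} > sed_sample.txt

sed -f fix.sed sed_sample.txt > sed_cleaned.txt
```

Note that in the script file a <TAB> has to be a real tab character (or \t with GNU sed), not the literal text <TAB>.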
Basically, what I would like to know is: am I doing this the best way? Any syntax help would be appreciated. FYI, I have to do this on a Windows box, so I have to use either ActivePerl, the Perl that comes with Microsoft SFU, or the GnuWin32 tools gawk and sed. I have enough memory (4 GB), a dual-core Xeon, and plenty of disk space.
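For reference, the whole cleanup described above might be sketched as a single streaming pass in awk, which should cope with files of this size because only one line is held in memory at a time. This is only a sketch under assumptions: every data row is assumed to have exactly five tab-separated fields, every junk line fewer, and the names clean.awk, sample.txt and cleaned.txt are made up for illustration.

```shell
# Hypothetical file names: clean.awk, sample.txt, cleaned.txt.
cat > clean.awk <<'EOF'
# Fields are tab-separated.
BEGIN { FS = "\t" }
# Drop a trailing carriage return, if any (fields are re-split afterwards).
{ sub(/\r$/, "") }
# Skip every line that does not have exactly five fields
# (assumption: this covers blank lines and all report junk).
NF != 5 { next }
# Keep the column header line only the first time it appears.
$1 == "\"FileName\"" && seen++ { next }
# Whatever is left is the first header or a data row.
{ print }
EOF

# Small sample: two report blocks with CRLF endings and blank lines.
{
  printf '"Report Type"\t"This Report"\r\n'
  printf '"Date: 200610"\r\n'
  printf '"FileName"\t"FilePath"\t"LastAccessed"\t"LastModified"\t"Owner"\r\n'
  printf '"a.txt"\t"dirA"\t"20061001"\t"20060901"\t"joe"\r\n'
  printf '\r\n'
  printf '"b.txt"\t"dirB"\t"20061002"\t"20060902"\t"ann"\r\n'
  printf '"Some Report Execution Time"\r\n'
  printf '"Report Type"\t"This Report"\r\n'
  printf '"FileName"\t"FilePath"\t"LastAccessed"\t"LastModified"\t"Owner"\r\n'
  printf '"c.txt"\t"dirC"\t"20061003"\t"20060903"\t"bob"\r\n'
} > sample.txt

awk -f clean.awk sample.txt > cleaned.txt
```

With the GnuWin32 tools the same script file would presumably be run as gawk -f clean.awk input > output, which avoids quoting problems at the Windows command prompt.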
Thanks for the help/opinions.
Joe