Removing White spaces from a huge file


 
# 1  
Old 01-20-2017

I am trying to remove whitespace from a file containing data like this sample:
Code:
457 <EOFD> Mar  1 2007 12:00:00:000AM   <EOFD> Mar 31 2007 12:00:00:000AM   <EOFD>  system  <EORD> 458 <EOFD>    Mar  1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007  9:10:56:036PM    <EOFD>  prodiws<EORD>

These files are delimited extracts from a database, with <EOFD> as the field delimiter.
I am using the command below to remove the whitespace around each delimiter:
Code:
perl -pi -e 's/[[:space:]]*\<EOFD\>[[:space:]]*/\<EOFD\>/g' sample.dat

The above command is part of a shell script, t.ksh, which is invoked on the command line and runs the perl command internally.
It works fine for moderately large files, but once the file reaches a considerable size, say 2.2 to 3 GB, it empties the file and the script exits with the error below:

./t.ksh[249]: 700492 Memory fault(coredump)
Server Message: <my unix machine name>- Msg 208, Level 16, State 1:


The [249] refers to the line of t.ksh that runs the perl command. A large core dump file is also created in the directory from which I invoke t.ksh. The file system has adequate free space.

Can someone advise what is wrong here, or suggest a workaround?

Regards

Last edited by Scrutinizer; 01-22-2017 at 06:59 AM.. Reason: icode -> code tags
# 2  
Old 01-20-2017
It appears that your file is one giant line, yes? With -p, Perl reads its input one newline-terminated line at a time, so it is trying to load the entire 2 GB into memory at once.
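One way to check this diagnosis before changing anything is to count the newline characters in the file. The sketch below is only an illustration: count_newlines is a hypothetical helper name, and sample.dat stands for the file from the first post. A result of 0 or 1 on a multi-gigabyte file would suggest that perl -p really is treating the whole file as a single line.

```shell
# count_newlines FILE: print the number of newline characters in FILE.
# tr reads the input in small blocks, so memory use stays flat
# even when the file is several gigabytes long.
count_newlines() {
    tr -cd '\n' < "$1" | wc -c | tr -d ' '
}

# Hypothetical usage on the file from the first post:
# count_newlines sample.dat
```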
# 3  
Old 01-20-2017
Thanks, Corona688. Is there any workaround for this? I am currently trying to split the file into smaller chunks and process those, but that takes some effort: the file is delimited, so I have to make sure each split falls on a delimiter boundary.
# 4  
Old 01-20-2017
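If you tell awk what your "lines" are, it won't have to read 2 GB of data at once. The RS (input record separator) and ORS (output record separator) variables control this. They default to newline, but they can just as easily be <EOFD>. Note that a multi-character RS is an extension supported by gawk and mawk; POSIX leaves the behaviour unspecified.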

Code:
$ cat data
457 <EOFD> Mar  1 2007 12:00:00:000AM   <EOFD> Mar 31 2007 12:00:00:000AM   <EOFD>  system  <EORD> 458 <EOFD>    Mar  1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007  9:10:56:036PM    <EOFD>

$ awk '{ sub(/ +$/, ""); sub(/^ +/, ""); } 1' RS="<EOFD>" ORS="<EOFD>" data ; echo

457<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>system  <EORD> 458<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>agf<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>

$

The trailing "echo" is just there to move the cursor to the next line, since the awk output would not end with a newline otherwise.

Last edited by Corona688; 01-20-2017 at 03:35 PM..
# 5  
Old 01-20-2017
Thanks, Corona688. Just to confirm, my files are laid out as follows:
1. Each field is delimited by <EOFD>.
2. Each row is terminated by <EORD>.
Since <EORD> marks the end of a row, how should the above command be changed?
# 6  
Old 01-20-2017
Unless the output I showed is wrong somehow, the command I just showed you works, no?
# 7  
Old 01-20-2017
Yes, it works, but there is one thing: the whitespace between "system" and <EORD> remains ("system  <EORD>"). Except for the first row, <EORD> marks both the end of one row and the start of the next. Can you suggest how to trim the whitespace around it as well?
Also, the command echoes the whole file content to the console. Can I avoid that?