How to fix line breaks and format text for huge files?


 
# 1  
Old 01-10-2012

Hi,
I need to correct line breaks in huge files (more than 1 million records per file) and then format them properly.

Except the header and trailer, each record starts with 'D'.

Requirement: Scan the whole file except the header and trailer records and check whether any record starts with anything other than 'D'. In such cases, merge the broken line with the preceding line, inserting a space after the end of the previous line.

The input file is:

HEADER474687
D1356jkl ugbliuybikb 879870
898976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyh
kbygfluy9809
D8796870 kjlhuigiyig
TRAILER0008

Expected output file is:

HEADER474687
D1356jkl ugbliuybikb 879870 898976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyh kbygfluy9809
D8796870 kjlhuigiyig
TRAILER0008

I am using the following code to achieve it:
Code:
#!/bin/ksh
 
FILENAME=$1
 
echo Checking for Line-Breaks in a file...
grep -nv '^D' $FILENAME > RESULT.OUT
 
echo Verify any match for Line Breaks...
COUNT=$(wc -l <RESULT.OUT)
 
if [ $COUNT -gt 0 ]; then
 
TOTLINECOUNT=$(wc -l <$FILENAME)
 
set RECORDCOUNT=$TOTLINECOUNT - 1
 
awk 'NR > 1 && NR < "$RECORDCOUNT" /^D/ {printf "\n%s", $0 ; next } { printf " %s", $0 } END {print eof}' $FILENAME > $FILENAME.TEMP 
 
> $FILENAME
 
cat $FILENAME.TEMP > $FILENAME
 
rm $FILENAME.TEMP
 
rm RESULT.OUT
 
exit
 
fi
 
echo The file $FILENAME does not contain any Line Breaks
 
rm RESULT.OUT

Received output:

<space>HEADER474687
D1356jkl ugbliuybikb 87987089 8976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyhkb ygfluy9809
D8796870 kjlhuigiyig TRAILER0008

After using the above code, I am facing the following issues:

Although the awk command uses the condition NR > 1 && NR < $RECORDCOUNT:

a) I am unable to exclude the header and trailer records from the awk processing, so the trailer line is also merged into the previous line.

b) Also, a space is inserted before the first line. The first line becomes:
Code:
<space>HEADER474687

Because of this, I have to run a separate sed command (given below) right after the awk step to delete the leading space from the first line, which adds to the execution time of the whole process.
Code:
sed -e '1,1s/^[ \t]*//' $FILENAME.TEMP > $FILENAME
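Both symptoms can be traced to the script itself. In ksh, `set RECORDCOUNT=$TOTLINECOUNT - 1` does not assign a variable (it sets the positional parameters), and `"$RECORDCOUNT"` inside the single-quoted awk program is never expanded by the shell, so awk only sees a literal string. The leading space appears because the header line falls through to the `printf " %s"` rule. A minimal sketch of the same two-pass idea with those points fixed is below; the function name `merge_two_pass` is hypothetical, and it assumes the first line is the header, the last line is the trailer, and an awk that supports `-v` (on some older systems that may mean `nawk` rather than `awk`).

```shell
#!/bin/sh
# Sketch only: join continuation lines onto the preceding 'D' record.
# Assumes line 1 is the header and the final line is the trailer.
merge_two_pass() {
    file=$1
    # First pass: count lines so awk knows which line is the trailer.
    total=$(wc -l < "$file")
    # Second pass: the count is handed to awk with -v, because shell
    # variables are not expanded inside the single-quoted program.
    awk -v last="$total" '
        NR == 1 { print; next }          # header: copy unchanged
        NR == last + 0 {                 # trailer: flush pending record, then copy
            if (rec != "") print rec
            rec = ""
            print
            next
        }
        /^D/ {                           # new record starts: flush the previous one
            if (rec != "") print rec
            rec = $0
            next
        }
        {                                # continuation: append after one space
            if (rec == "") rec = $0
            else rec = rec " " $0
        }
        END { if (rec != "") print rec }
    ' "$file"
}
```

Because each record is printed with `print` only once it is complete, no leading space is produced and the extra sed step is not needed. It could be called as `merge_two_pass infile > infile.tmp && mv infile.tmp infile`, which replaces the file only if awk succeeded.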

I would really appreciate it if anyone could guide me in writing this piece of code using awk/sed/perl (whichever is most suitable, keeping in mind the huge file size).

Thanks a lot in advance.

Last edited by kikionline; 01-10-2012 at 10:14 AM..
# 2  
Old 01-10-2012
Hi kikionline,

One way using sed:
Code:
$ cat infile
HEADER474687
D1356jkl ugbliuybikb 879870
898976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyh
kbygfluy9809
D8796870 kjlhuigiyig
TRAILER0008
$ sed -ne '
   1 { p ; b};
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b };
   /^D/! { $! { H ; b }; };
   $ { H ; x ; s/^\n// ; p }; 
' infile
HEADER474687
D1356jkl ugbliuybikb 879870 898976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyh kbygfluy9809
D8796870 kjlhuigiyig
TRAILER0008

Regards,
Birei
# 3  
Old 01-11-2012
Hi Birei,

Thanks a lot for your quick reply.

When I use the sed command, I receive a "sed: command garbled" error. I have checked the command but did not find any issue. Can you please help?

Code:
# sed -ne '
   1 { p ; b};
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b };
   /^D/! { $! { H ; b }; };
   $ { H ; x ; s/^\n// ; p }; 
' temp.txt
sed: command garbled: 1 { p ; b};

Thanks a lot in advance.
# 4  
Old 01-11-2012
Try with this version (I added ';' before each '}')
Code:
$ sed -ne '
   1 { p ; b; };
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
   /^D/! { $! { H ; b; }; };
   $ { H ; x ; s/^\n// ; p; }; 
' infile

Regards,
Birei
# 5  
Old 01-11-2012
Hi Birei,

Thanks for your quick response.

However, I am still getting the same error even with the updated command.

Code:
# cat temp.txt
HEAD1767863 87987908798
D1565JHJKGKG;GUI69696Y9Y  UY6-96987
9YH HOIHOUHOUYUH9\
D87HUBOIUYGL79TY7G97GIUG HOUH UOHOU  ;UGH;U U9870877 9
87HYO8HYOH08HOHN089 9870978-987
D986UH;OUHGOUH98H80O8GOUG HO;IUH8O LIHOIH
TRAILER
 
# sed -ne '
   1 { p ; b; };
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
   /^D/! { $! { H ; b; }; };
   $ { H ; x ; s/^\n// ; p; };
   1 { p ; b; };.txt
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
   /^D/! { $! { H ; b; }; };
   $ { H ; x ; s/^\n// ; p; };
' temp.txt
sed: command garbled:    1 { p ; b; };

Thanks again for all your time!
# 6  
Old 01-11-2012
Your last sed command seems to have duplicated instructions. Did you run it like that, or was it an error when posting it here?

Regards,
Birei
# 7  
Old 01-11-2012
Hi Birei,

Sorry for the confusion. It was a typo.

I tried with the correct command.

Code:
# sed -ne '
> 1 { p ; b; };
> /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
> /^D/! { $! { H ; b; }; };
> $ { H ; x ; s/^\n// ; p; };
> ' temp.txt
sed: command garbled: 1 { p ; b; };

I have tried running it directly at the prompt and also as a ksh script (which takes the location and filename as parameters); however, the results are the same.

Code:
# ksh LineBreakChecker.ksh /code/CheckLineBreak temp.txt > temp.log
sed: command garbled:    1 { p ; b; };


Thanks

---------- Post updated at 08:06 AM ---------- Previous update was at 07:56 AM ----------

Hi Birei,

Can this be because we may be using different shells (though ideally it should not matter)?
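The shell is an unlikely cause, since the single-quoted script reaches sed unchanged. "command garbled" is the wording some older, non-GNU sed implementations use for a syntax error, and those implementations tend to be stricter about `;` as a command separator and about `{` or `}` sharing a line with other commands. Which sed is installed here is an assumption, as the thread does not say. A sketch that sidesteps both points by putting every command on its own line follows; it is a variant of the script above rather than Birei's exact commands (it also merges a continuation line that sits directly before the trailer), and `merge_with_sed` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: one sed command per line, no ';' separators.
# Assumes line 1 is the header and the last line is a trailer
# that does not start with 'D'.
merge_with_sed() {
    # Line 1: print the header and start the next cycle.
    # 'D' line: swap in the finished record from the hold space,
    #   turn embedded newlines into spaces, print it if non-empty.
    # Other lines except the last: append to the hold space.
    # Last line: flush the pending record, then print the trailer.
    sed -n '
1{
p
b
}
/^D/{
x
s/^\n//
s/\n/ /g
/./p
b
}
$!{
H
b
}
x
s/^\n//
s/\n/ /g
/./p
x
p
' "$1"
}
```

If this form still fails, trying another sed binary on the same machine (where one exists) would help tell a syntax problem apart from an implementation limit.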

Is there any alternative solution like using awk/perl to achieve the same thing?

Thanks.
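On the awk alternative asked about above: a single-pass sketch is possible that never needs the line count, which matters for files of this size because the data is read only once. It holds back the most recent non-'D' line until it knows whether another line follows; if nothing follows, that line is the trailer. The function name `merge_one_pass` is hypothetical, and the sketch assumes the first line is the header and the trailer does not start with 'D'.

```shell
#!/bin/sh
# Sketch only: single pass, at most one record kept in memory.
merge_one_pass() {
    awk '
        NR == 1 { print; next }          # header: copy unchanged
        held {                           # a held line was followed by more input,
            if (rec == "") rec = tail    # so it was a continuation, not the trailer
            else rec = rec " " tail
            held = 0
        }
        /^D/ {                           # new record starts: flush the previous one
            if (rec != "") print rec
            rec = $0
            next
        }
        { tail = $0; held = 1 }          # hold back: continuation or trailer
        END {
            if (rec != "") print rec     # last data record
            if (held) print tail         # trailer stays on its own line
        }
    ' "$1"
}
```

Called as `merge_one_pass infile > infile.tmp`, it writes the header, the merged records and the trailer in order, without the leading space or the merged trailer seen in the original attempt.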