How to fix line breaks and format text for huge files?

Tags

line-breaks, sed awk, shell scripts

01-10-2012

Hi,
I need to correct line breaks in huge files (more than 1 million records per file) and then format them properly.

Except for the header and trailer, every record starts with 'D'.

Requirement: Scan the whole file except the header and trailer records and check whether any record starts with anything other than 'D'. In such cases, merge the broken line into the preceding line, inserting a space after the end of the preceding line.

The input file is:

HEADER474687
D1356jkl ugbliuybikb 879870
898976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyh
kbygfluy9809
D8796870 kjlhuigiyig
TRAILER0008

Expected output file is:

HEADER474687
D1356jkl ugbliuybikb 879870 898976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyh kbygfluy9809
D8796870 kjlhuigiyig
TRAILER0008

I am using the following code to achieve it:

Code:

#!/bin/ksh
 
FILENAME=$1
 
echo Checking for Line-Breaks in a file...
grep -nv '^D' $FILENAME > RESULT.OUT
 
echo Verify any match for Line Breaks...
COUNT=$(wc -l <RESULT.OUT)
 
if [ $COUNT -gt 0 ]; then
 
TOTLINECOUNT=$(wc -l <$FILENAME)
 
set RECORDCOUNT=$TOTLINECOUNT - 1
 
awk 'NR > 1 && NR < "$RECORDCOUNT" /^D/ {printf "\n%s", $0 ; next } { printf " %s", $0 } END {print eof}' $FILENAME > $FILENAME.TEMP 
 
> $FILENAME
 
cat $FILENAME.TEMP > $FILENAME
 
rm $FILENAME.TEMP
 
rm RESULT.OUT
 
exit
 
fi
 
echo The file $FILENAME does not contain any Line Breaks
 
rm RESULT.OUT

Received output:

<space>HEADER474687
D1356jkl ugbliuybikb 87987089 8976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyhkb ygfluy9809
D8796870 kjlhuigiyig TRAILER0008

After using the above code, I am facing the following issues, even though the awk command uses NR > 1 && NR < $RECORDCOUNT:

a) I am unable to exclude the header and trailer records from the awk processing, so the trailer line is also merged into the previous line.

b) A space is inserted before the first line. The first line becomes:

Code:

<space>HEADER474687

Because of this, I have to run a separate sed command (given below) right after the awk step to delete the leading space from the first line, which adds to the execution time of the whole process.

Code:

sed -e '1,1s/^[ \t]*//' $FILENAME.TEMP > $FILENAME

I would really appreciate it if anyone could guide me in writing this piece of code using awk, sed or perl (whichever best suits the huge file size).

Thanks a lot in advance.

Last edited by kikionline; 01-10-2012 at 10:14 AM..

kikionline
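
On issue (a): the awk program sits inside single quotes, so the shell never expands $RECORDCOUNT there and awk only sees a literal string. In addition, "set RECORDCOUNT=..." does not assign a variable in ksh; it sets the positional parameters. On issue (b): the header falls through to the printf " %s" branch, which is where the leading space comes from. A minimal sketch of one possible fix follows, passing the line count in with awk's -v option. It assumes the trailer is always the last line and that the file ends with a newline; the function name merge_broken_lines and the variable name last are illustrative and not part of the original script.

```shell
# Hypothetical helper: merge broken lines, leaving header and trailer alone.
merge_broken_lines() {
    file=$1
    # Total line count; the trailer is assumed to be the last line.
    # The arithmetic expansion strips any padding that wc may add.
    last=$(( $(wc -l < "$file") ))
    # -v hands the shell value to awk; inside single quotes $last would not expand.
    awk -v last="$last" '
        NR == 1    { printf "%s", $0; next }     # header: no leading space or newline
        NR == last { printf "\n%s", $0; next }   # trailer: always starts its own line
        /^D/       { printf "\n%s", $0; next }   # a new record starts a new line
                   { printf " %s", $0 }          # continuation: append after one space
        END        { if (NR) printf "\n" }       # terminate the final line
    ' "$file"
}
```

This could be called as merge_broken_lines "$FILENAME" > "$FILENAME.TEMP", followed by mv "$FILENAME.TEMP" "$FILENAME" to replace the original without copying the data a second time. Some older awk binaries may not accept -v (/usr/bin/awk on Solaris is one such case); nawk is a common substitute there.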

01-10-2012

Hi kikionline,

One way using sed:

Code:

$ cat infile
HEADER474687
D1356jkl ugbliuybikb 879870
898976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyh
kbygfluy9809
D8796870 kjlhuigiyig
TRAILER0008
$ sed -ne '
   1 { p ; b};
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b };
   /^D/! { $! { H ; b }; };
   $ { H ; x ; s/^\n// ; p }; 
' infile
HEADER474687
D1356jkl ugbliuybikb 879870 898976098 9687680
D77656757 uhgliug liygoiygig
D98679hjh kjbgihguygfu ugliyh kbygfluy9809
D8796870 kjlhuigiyig
TRAILER0008

Regards,
Birei

birei

01-11-2012

Hi Birei,

Thanks a lot for your quick reply.

When I run the sed command, I get a "sed: command garbled" error. I have checked the command but could not find any issue. Can you please help?

Code:

# sed -ne '
   1 { p ; b};
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b };
   /^D/! { $! { H ; b }; };
   $ { H ; x ; s/^\n// ; p }; 
' temp.txt
sed: command garbled: 1 { p ; b};

Thanks a lot in advance.

kikionline

01-11-2012

Try this version (I added ';' before each '}'):

Code:

$ sed -ne '
   1 { p ; b; };
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
   /^D/! { $! { H ; b; }; };
   $ { H ; x ; s/^\n// ; p; }; 
' infile

Regards,
Birei

birei
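
The "command garbled" message is what some older System V style sed implementations print when they cannot parse a script (Solaris /usr/bin/sed is one example); GNU sed is more lenient about ';' separators and about '}' sharing a line with other commands. Assuming that is the cause here, a more conservative layout puts every command and every closing brace on its own line. The sketch below restates the same hold-space idea that way. The wrapper name fix_breaks_sed is illustrative; the script assumes the trailer is the last line and does not start with 'D', and it also joins continuation lines that sit directly before the trailer.

```shell
# Hypothetical wrapper around a one-command-per-line sed script.
# Line 1 (header) is printed as is. A line starting with D is swapped into
# the hold space and the previously collected record is printed. Any other
# line except the last is appended to the hold space. On the last line the
# pending record is flushed first, then the trailer is printed on its own.
fix_breaks_sed() {
    sed -n '
1{
p
b
}
/^D/{
x
s/^\n//
s/\n/ /g
/./p
b
}
$!{
H
b
}
x
s/^\n//
s/\n/ /g
/./p
x
p
' "$1"
}
```

Usage would be fix_breaks_sed infile > outfile. The script streams the file and keeps only one record in the hold space at a time, so memory use should stay small even for very large inputs.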

01-11-2012

Hi Birei,

Thanks for your quick response.

However, I am still getting the same error even with the updated command.

Code:

# cat temp.txt
HEAD1767863 87987908798
D1565JHJKGKG;GUI69696Y9Y  UY6-96987
9YH HOIHOUHOUYUH9\
D87HUBOIUYGL79TY7G97GIUG HOUH UOHOU  ;UGH;U U9870877 9
87HYO8HYOH08HOHN089 9870978-987
D986UH;OUHGOUH98H80O8GOUG HO;IUH8O LIHOIH
TRAILER
 
# sed -ne '
   1 { p ; b; };
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
   /^D/! { $! { H ; b; }; };
   $ { H ; x ; s/^\n// ; p; };
   1 { p ; b; };.txt
   /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
   /^D/! { $! { H ; b; }; };
   $ { H ; x ; s/^\n// ; p; };
' temp.txt
sed: command garbled:    1 { p ; b; };

Thanks again for all your time!

kikionline

01-11-2012

Your last sed command seems to contain duplicated instructions. Did you run it like that, or was it an error when posting it here?

Regards,
Birei

birei

01-11-2012

Hi Birei,

Sorry for the confusion. It was a typo.

I tried with the correct command.

Code:

# sed -ne '
> 1 { p ; b; };
> /^D/ { x ; s/^\n// ; s/\n/ /g ; /./ { p }; b; };
> /^D/! { $! { H ; b; }; };
> $ { H ; x ; s/^\n// ; p; };
> ' temp.txt
sed: command garbled: 1 { p ; b; };

I have tried running it directly at the prompt as well as executing it as a ksh script (which takes the location and filename as parameters); the result is the same either way.

Code:

# ksh LineBreakChecker.ksh /code/CheckLineBreak temp.txt > temp.log
sed: command garbled:    1 { p ; b; };

Thanks

---------- Post updated at 08:06 AM ---------- Previous update was at 07:56 AM ----------

Hi Birei,

Could this be because we are using different shells (although ideally that should not matter)?

Is there any alternative solution like using awk/perl to achieve the same thing?

Thanks.

kikionline
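
As a possible awk alternative, the sketch below makes a single pass and needs no line count: each line is held back one step, so that when the input ends the held line is known to be the trailer. It assumes the first line is the header and the last line is the trailer; the function name join_records and the awk variables held and rec are illustrative names, not from the thread.

```shell
# Hypothetical helper: join continuation lines in one pass, no line count needed.
join_records() {
    awk '
        NR == 1 { print; next }            # header goes out unchanged
        NR > 2 {
            # "held" is the previous line, now known not to be the last one
            if (held ~ /^D/) {
                if (rec != "") print rec   # flush the finished record
                rec = held                 # start a new record
            } else if (rec != "") {
                rec = rec " " held         # continuation: append after one space
            } else {
                rec = held                 # stray line before any D record
            }
        }
        { held = $0 }                      # hold the current line back one step
        END {
            if (rec != "") print rec       # last data record
            if (NR > 1) print held         # trailer, always on its own line
        }
    ' "$1"
}
```

Usage would be join_records "$FILENAME" > "$FILENAME.TEMP". Only the current record and one held line are kept in memory, and the header never passes through a printf with a leading space, so no follow-up sed step is needed.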
