Breaking long lines into (characters, newline, space) groups

05-14-2009

Registered User

16, 0

Join Date: May 2009

Last Activity: 27 November 2012, 3:46 PM EST

Posts: 16

Thanks Given: 0

Thanked 0 Times in 0 Posts


Hello,

I am currently trying to edit an LDIF file. The LDIF specification states that a newline followed by a space indicates that the next line is a continuation of the previous one. So, in order to search, replace, and edit the file properly, I open it in TextWrangler, search for "\r " and remove it, which turns all continued lines into single lines. That's the first step. I make my changes to the LDIF file at that point.
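The unfolding step described above can also be scripted instead of done by hand in an editor. A minimal Python sketch, assuming the whole file fits in memory; the function name `unfold` and the sample text are made up for illustration:

```python
def unfold(text):
    """Join continuation lines: a line break followed by one space is removed."""
    # Assumption: the file may use CRLF or bare CR line endings, so normalize
    # them to "\n" before looking for the continuation marker.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n ", "")


sample = "dn: cn=Example,\n dc=example,\n dc=com\ncn: Example\n"
print(unfold(sample))
```

Only the single space right after the line break is dropped, so any further spaces at the start of a continuation line stay part of the value.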

Now, after editing, I want to break any lines with more than 79 characters (some of which are hundreds of characters long) into this: 79 characters, newline, space, next 79 characters, newline, space, next 79 characters, newline, space, etc.

Using this simple sed command:

Code:

sed 's/./\
 /80' myfile > newfile

This works for the first 79 characters of line x and breaks it properly, but then sed moves on to the next line in the LDIF, leaving line x broken into: 79 characters, newline, space, then the remaining chunk of line x (hundreds of characters long), then the next line in the LDIF. Only partial success!

So here's the question. Is there a way to use sed to run this command at every 79th character until the end of the line? If not, should I use a loop in the script with some sort of conditional statement, like: if there are lines longer than 79 characters, rerun the sed command (so that it breaks the remaining hundreds of characters that were not broken in the original sed run), and continue looping until all lines are broken into (79 characters, newline, space) chunks? How could I set up that condition? I don't know how to search for lines longer than x characters.
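For the last part of the question, finding lines longer than x characters, here is a small Python sketch. The helper name `long_lines` and the sample data are hypothetical, and the line terminator is assumed not to count toward the length:

```python
def long_lines(lines, limit=79):
    """Yield (line_number, length) for every line longer than `limit`."""
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\n"))  # do not count the newline itself
        if length > limit:
            yield number, length


sample = ["short line\n", "x" * 80 + "\n", "y" * 79 + "\n"]
print(list(long_lines(sample)))
```

An open file object can be passed in place of the list, since it also yields one line at a time.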

Thanks a lot for any help on this!

rowie718


05-14-2009


I came up with this script. It seems hackish and very inefficient, but it works. I would love for someone to help me come up with a better way since this script takes almost 10 full minutes to parse a text file into less than 7000 lines.

Code:

#!/bin/ksh

echo "where is the ldif file located that you would like to parse?"
read response
ldiffile=$response

while read line
do

x=`echo $line | wc -c`

while [ $x -gt 79 ]
do

sed 's/./\
 /79' $ldiffile > /test.ldif
mv /test.ldif $ldiffile
x=$x-79

done

done < $ldiffile

I just realized this script is substituting the 79th character with the newline and space. From what I've been reading, I can add an ampersand before the newline escape in the sed replacement pattern. However, when I put an ampersand there, it ruins the LDIF file, cutting lines and inserting groups of blank lines. I've searched through a million forums, most of which suggest using escaped parentheses to remember a pattern and then \1 to recall it, with the newline after that. It doesn't work for me. Whichever way I try to recall the 79th character in the replacement string and add to it, I get this crazy blank line effect on my file. I am on OS X 10.4.11 Server. Frustrating! How do I make it so the newline comes after the 79th character and not as a substitute for it?
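The difference between replacing a character and inserting after it can be illustrated outside sed. The following Python sketch uses the `re` module and a deliberately small width of 3 so the effect is visible; the variable names and the sample string are made up, and it does not reproduce the exact sed invocation above:

```python
import re

line = "abcdefghij"
width = 3  # small width for illustration; the thread uses 79

# Consuming one extra character after each group and not putting it back
# loses that character (here "d" and "h").
lossy = re.sub(r"(.{%d})." % width, r"\1\n ", line)

# Capturing the group and writing it back with \1 loses nothing. The
# lookahead (?=.) avoids a dangling break when the length is an exact
# multiple of the width.
kept = re.sub(r"(.{%d})(?=.)" % width, r"\1\n ", line)

print(repr(lossy))
print(repr(kept))
```

Removing every newline-plus-space pair from `kept` gives back the original string, which is a quick way to check that no characters were dropped.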

Thanks again for any help you can offer!

Last edited by rowie718; 05-14-2009 at 07:53 PM..

rowie718


05-14-2009

Registered User

2,898, 136

Join Date: Mar 2007

Last Activity: 11 July 2016, 2:55 PM EDT

Location: Toronto, Canada

Posts: 2,898

Thanks Given: 0

Thanked 136 Times in 120 Posts

Quote:

Originally Posted by rowie718

I came up with this script. It seems hackish and very inefficient, but it works. I would love for someone to help me come up with a better way since this script takes almost 10 full minutes to parse a text file into less than 7000 lines.

For a file that size, you should use awk.

Quote:


Code:

#!/bin/ksh

echo "where is the ldif file located that you would like to parse?"
read response
ldiffile=$response

Why not simply:
Code:
read ldiffile

Quote:

Code:

while read line
do

x=`echo $line | wc -c`

You don't need an external command to get the length of a variable's contents:
Code:
x=${#line}

Quote:

Code:

while [ $x -gt 79 ]
do

sed 's/./\
 /79' $ldiffile > /test.ldif
mv /test.ldif $ldiffile
x=$x-79

done

done < $ldiffile

Code:

awk 'length > 79 { while ( length($0) > 79 ) {
    printf "%s\n ", substr($0,1,79)
    $0 = substr($0,80)
  }
  if (length) print
  next
}
{print}' "$FILE"

cfajohnson


05-15-2009


Thank you so much for your help cfajohnson!

I put together your suggestions and tested them. It's almost there, but there were two problems. The first I fixed fairly easily: the 79th character is the newline in the original LDIF, so I should've asked for 78 characters. I deducted 1 wherever I saw 79 or 80 in your awk command and that seemed to do the trick. The second problem is trickier. Take a 240-character line as an example. When the awk command breaks it and adds the space at the start of the second chunk, it does not take into account that the last character of that chunk should be at the same ending position as the first chunk. As currently written, all chunks after the first break end 1 character further to the right because of the space.

Example:

Code:

123456789012345678901234567890.....(240 character long string repeating)

currently breaks into :
123456789012345678901234567890123456789012345678901234567890123456789012345678
 901234567890123456789012345678901234567890123456789012345678901234567890123456
 789012345678901234567890123456789012345678901234567890123456789012345678901234
 567890

but should actually end up more like this, so that every line has 78 characters, 
plus newline (including the space we've added):

123456789012345678901234567890123456789012345678901234567890123456789012345678
 90123456789012345678901234567890123456789012345678901234567890123456789012345
 67890123456789012345678901234567890123456789012345678901234567890123456789012
 34567890

The script currently looks like this:

Code:

#!/bin/ksh

echo "where is the ldif file located that you would like to parse?"
read ldiffile

awk 'length > 78 { while ( length($0) > 78 ) {
    printf "%s\n ", substr($0,1,78)
    $0 = substr($0,79)
  }
  if (length) print
  next
}
{print}' $ldiffile > /out.txt
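The alignment rule from the example above (78 characters on the first line, then a space plus 77 characters on every continuation line) can be written down as a short Python sketch. The function name `fold_line` and the default width are assumptions for illustration, not part of the awk script:

```python
def fold_line(line, width=78):
    """Split one logical line into physical lines no wider than `width`.

    The first piece holds `width` characters; each continuation piece holds
    `width - 1` characters because its leading space counts toward the width.
    """
    if len(line) <= width:
        return [line]
    pieces = [line[:width]]
    rest = line[width:]
    step = width - 1
    for start in range(0, len(rest), step):
        pieces.append(" " + rest[start:start + step])
    return pieces


digits = "1234567890" * 24  # the 240-character example from the post
folded = fold_line(digits)
print([len(piece) for piece in folded])
print(folded[-1])
```

Stripping the leading space from each continuation piece and joining everything gives the original line back.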

Thanks again for your help, I really appreciate it.

rowie718


05-15-2009

Registered User

2,669, 20

Join Date: Sep 2006

Last Activity: 28 January 2015, 8:30 AM EST

Posts: 2,669

Thanks Given: 0

Thanked 20 Times in 20 Posts

If you have Python, here's an alternative solution:

Code:

import textwrap

# Wrap at 78 columns; continuation lines start with a single space.
t = textwrap.TextWrapper(subsequent_indent=" ", width=78)
for line in open("file"):
    for i in t.wrap(line):
        print(i)

Output:

Code:

# more file
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

# ./test.py
123456789012345678901234567890123456789012345678901234567890123456789012345678
 90123456789012345678901234567890123456789012345678901234567890123456789012345
 67890123456789012345678901234567890123456789012345678901234567890123456789012
 34567890

ghostdog74


05-15-2009


Thank you ghostdog,

First, let me state that I am totally unfamiliar with Python. However, if it solves this problem for me, I would be glad to learn a bit and use it. There are a few issues I noticed when trying the code you provided, ranked in order of importance:

1) The code seems to eliminate blank lines from the source text. I need it to not do that. Example:

1111

2222

becomes

1111
2222

2) I don't know how to output to a file rather than to standard output. I apologize for the rookie question here.

3) Ideally I would like a way to interactively input the location of the file so it doesn't need to be hardcoded. If this is too much to ask, though, I can live without it.

Generally I would prefer to use sed/awk since I have some familiarity with them and bash scripting, however I will use whatever solutions are presented that fully solve this problem. I really appreciate the assistance.
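The three points above (keep blank lines, write to a file, ask for the file location) can be sketched in Python as follows. This is only an illustration under some assumptions: it uses plain slicing instead of `textwrap`, because `textwrap.wrap` returns an empty list for a blank line and prefers to break at whitespace, and the names `fold`, `fold_file`, and `main` as well as the prompt texts are made up:

```python
def fold(line, width=78):
    """Yield physical lines: `width` chars first, then a space plus `width - 1`."""
    yield line[:width]
    for start in range(width, len(line), width - 1):
        yield " " + line[start:start + width - 1]


def fold_file(src_path, dst_path, width=78):
    """Copy src_path to dst_path, folding long lines and keeping blank lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for raw in src:
            line = raw.rstrip("\n")
            # A blank line yields one empty piece, so it is written back as is.
            for piece in fold(line, width):
                dst.write(piece + "\n")


def main():
    # Ask for both locations instead of hardcoding them.
    src_path = input("LDIF file to fold: ")
    dst_path = input("Write folded output to: ")
    fold_file(src_path, dst_path)
```

Calling `main()` at the end of the script runs it interactively. The output path should differ from the input path, since the input is still being read while the output is written.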

Cheers.

rowie718


05-15-2009


Code:

awk 'length > 79 {
    n=1
    while ( length($0) > 78 + n ) {
    printf "%s\n ", substr($0,1,78 + n)
    $0 = substr($0,79 + n)
    n=0
  }
  if (length) print
  next
}
{print}' "$FILE"

cfajohnson

