Delimited data contains line feeds where they shouldn't be


 
# 1  
Old 03-24-2011

I have some data in which each record (line) ends with a line feed (\n) and each field is pipe (|) delimited.
Code:
  1|short desc|long text|2001-01-01 01:01
  2|short desc| long
  text |2002-02-02 02:02
  3|short desc|  long  text  | 2003-03-03 03:03
  4|short desc
  |  long  text    | 2004-04-04 04:04

Note that IDs #2 and #4 have an extra line feed between the field delimiters. I know that awk can read multi-line data, but the examples I found are for very strictly structured multi-line data, such as addresses. In this case only a few rows out of a hundred thousand are bad. The data source somehow allows line feeds in some of the text columns, but for my purposes I don't want or need them.

I need to clean this up before I can load it into a database. The process I use to load into the database will trim any leading and trailing spaces, so they are not an issue for this cleanup. Unfortunately, I can't get that load process to recognize that some of the text columns might also contain a \n.

Any ideas? Is there a way to tell awk that I have x number of fields and that it should keep reading until it has that many, ignoring any line feeds until the actual end of the record data?

Thanks
Eric
# 2  
Old 03-24-2011
Code:
awk '{printf "%s%s", (/^ *[0-9]+\|/ ? (NR>1 ? RS : "") : FS), $0} END {print ""}' infile | awk -F "|" '{for (i=1;i<=NF;i++) {sub(/^ +/,"",$i);sub(/ +$/,"",$i)}}1' OFS="|"

1|short desc|long text|2001-01-01 01:01
2|short desc|long   text|2002-02-02 02:02
3|short desc|long  text|2003-03-03 03:03
4|short desc|long  text|2004-04-04 04:04

# 3  
Old 03-25-2011
Is this hardcoded logic? It does not seem to be working.
# 4  
Old 03-25-2011
Quote:
Originally Posted by dinjo_jo
Is this hardcoded logic? It does not seem to be working.
Modifying rdcwayx's idea a little (gensub requires GNU awk):
Code:
echo "  1|short desc|long text|2001-01-01 01:01
   2|short desc| long
   text |2002-02-02 02:02
   3|short desc|  long  text  | 2003-03-03 03:03
   4|short desc
   |  long  text    | 2004-04-04 04:04 "  | awk '
{$1=$1;l=(/ *[0-9]+\|/?" "RS:FS) $0;printf "%s", gensub(/[ \t]+$|[ \t]*(\|)[ \t]*/,"\\1","g",l)}' 
 
1|short desc|long text|2001-01-01 01:01 
2|short desc|long text|2002-02-02 02:02 
3|short desc|long text|2003-03-03 03:03 
4|short desc|long text|2004-04-04 04:04

Regards
# 5  
Old 03-25-2011
No, I mean: can it handle any number of columns, irrespective of data types?
# 6  
Old 03-25-2011
Code:
awk -F"|" 'NF<4{if(short==""){short=$0;next} else{print short$0;short="";next}}1' file

Just change the 4 to your number of columns.
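
A variation on the same idea, as a hedged sketch only: rather than assuming a broken record spans exactly two lines, keep appending lines until the buffer holds the expected number of fields. The function name join_records and its variable names are made up for illustration. It joins the pieces with a single space, and it cannot detect a line feed inside the last field, because the record already looks complete at that point.

```shell
# join_records N: read pipe-delimited text on stdin and glue physical lines
# together until each record has at least N fields.
# (join_records, buf and parts are illustrative names, not from the thread.)
join_records() {
  awk -v n="$1" '
    {
      # Append this physical line to the pending record, space-separated.
      buf = (buf == "" ? $0 : buf " " $0)
      # split() returns the number of pipe-delimited fields in the buffer.
      if (split(buf, parts, "|") >= n) {
        print buf
        buf = ""
      }
    }
    # Emit an incomplete trailing record as-is instead of dropping it.
    END { if (buf != "") print buf }
  '
}

# Demo on data shaped like the sample in the first post.
printf '%s\n' \
  '1|short desc|long text|2001-01-01 01:01' \
  '2|short desc| long' \
  'text |2002-02-02 02:02' \
  '4|short desc' \
  '|  long  text    | 2004-04-04 04:04' | join_records 4
```

Leading and trailing spaces inside fields are left alone here, on the assumption that the database load trims them anyway.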
# 7  
Old 03-25-2011
getting closer :)

Thanks Kato, that works just fine. Let me see if I can dissect this.

NF < 4 checks whether the line has fewer than four fields (columns).
If so, and nothing is saved yet, save that line and go read the next line.
Otherwise, print the saved line and the current line joined together, and go on to the next line.

What does the '1' at the end do? If I leave it out, only the two joined lines are printed. Is it a flag that just has to be there, whose value doesn't matter?
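
For what it's worth, the trailing 1 is not a flag. In awk a rule is "pattern { action }", and a pattern with no action gets the default action, which is to print the current line. Since 1 is always true, it prints every line that reaches it, that is, every line not already consumed by a next. A minimal illustration (the sample inputs are invented):

```shell
# A bare pattern with no { action } uses the default action: print $0.
# 1 is always true, so every input line is printed unchanged.
all=$(printf 'a\nb\n' | awk '1')

# Lines with fewer than 2 fields are swallowed by "next"; everything else
# falls through to the trailing 1 and is printed.
kept=$(printf 'x|y\nz\n' | awk -F'|' 'NF < 2 { next } 1')

printf '%s\n' "$all" "$kept"
```

Any non-zero number or non-empty string constant would behave the same way; 1 is simply the conventional spelling.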

So, to make it more robust as a script, I can do something like:

cat clean.awk
Code:
BEGIN {

  USAGE = "usage: awk -f clean.awk -v col_cnt=<#> <input file>"
  if ( length( col_cnt ) == 0 ) {

    print "col_cnt not defined" > "/dev/stderr"
    print USAGE > "/dev/stderr"
    exit 1

  }
  FS = "|"

} # BEGIN

NF < int( col_cnt ) {

  if ( short == "" ) {

    short = $0
    next

  } else { 

    print short$0
    short = ""
    next

  }
} 1

running it:

Code:
awk -f clean.awk -v col_cnt=4 < data.txt


I get the following mess. Changing it from a one-liner to a script somehow changed its behavior.

Code:
2|short desc| long
1|short desc|long text|2001-01-01 01:012|short desc| long
  text |2002-02-02 02:02
  text |2002-02-02 02:023|short desc|  long  text  | 2003-03-03 03:03
4|short desc
  |  long  text    | 2004-04-04 04:04
4|short desc  |  long  text    | 2004-04-04 04:04

Getting closer. The one-liner works, but I would like to document this as a script for a later programmer to maintain.

Thanks much
Eric
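
A possible explanation, inferred from the output rather than confirmed anywhere in the thread: the mess is what awk produces when the opening { ends up on its own line below the pattern NF < int( col_cnt ). A newline ends a rule, so awk then sees two rules: the bare pattern, which prints every short line, and a pattern-less block, which runs for every line and never lets the trailing 1 be reached. If the file on disk differs from the listing in that way, moving the brace back onto the pattern's line should fix it. A small sketch of the difference, on invented data:

```shell
data='a|b|c
d|e
f'

# Brace on the same line as the pattern: the block runs only when NF < 3,
# and complete lines fall through to the trailing 1.
same_line=$(printf '%s\n' "$data" | awk -F'|' '
NF < 3 { print "short: " $0; next }
1')

# Brace on the next line: awk sees two separate rules. "NF < 3" on its own
# prints the short lines, and the pattern-less block runs for EVERY line.
next_line=$(printf '%s\n' "$data" | awk -F'|' '
NF < 3
{ print "short: " $0; next }
1')

printf 'same line:\n%s\n\nnext line:\n%s\n' "$same_line" "$next_line"
```

In the second form every short line is echoed once by the bare pattern and then handled again by the block, which mirrors the doubled lines in the output above.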