Awk - Handling different types of newlines

03-21-2011

Registered User


Join Date: Oct 2008

Last Activity: 11 November 2015, 6:40 PM EST

Location: Orem, Utah

Posts: 176

Thanks Given: 16

Thanked 5 Times in 5 Posts


Hi. We have some data that's generated from a webpage. Part of it is well formatted, but part of it preserves newlines in a way that breaks record separation in awk. Here are two records, filtered through cat -e:

Code:

Jones,Bob,20,Q: What is your favorite ice cream?$
A: Butter Pecan$
Q: Do you like sprinkles?$
A: But of course$
cone^M$
Smith,Jane,18,Q: What is your favorite ice cream?$
A: Rocky Road$
Q: Do you like sprinkles?$
A: Yuck, no$
bowl^M$

So, you can see that there are two newline types here: a bare line feed, shown as $, and a carriage return plus line feed, shown as ^M$. How do I get awk to ignore the first and use only the second as the record separator? Many thanks in advance.

treesloth
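
One way to confirm the mix of line endings described above is to count them with awk itself. This is only a minimal sketch: make_sample is a hypothetical helper that rebuilds the two records from the post, assuming the ^M shown by cat -e is a real carriage return at the end of each record's last line.

```shell
# Hypothetical helper: recreates the two sample records.
# Inner lines end in LF only; the last line of each record ends in CR LF.
make_sample() {
    printf 'Jones,Bob,20,Q: What is your favorite ice cream?\nA: Butter Pecan\nQ: Do you like sprinkles?\nA: But of course\ncone\r\n'
    printf 'Smith,Jane,18,Q: What is your favorite ice cream?\nA: Rocky Road\nQ: Do you like sprinkles?\nA: Yuck, no\nbowl\r\n'
}

# awk splits on LF by default, so a CR LF ending leaves a trailing CR in $0.
counts=$(make_sample | awk '
/\r$/ { crlf++; next }    # line ended in CR LF
      { lf++ }            # line ended in LF only
END   { printf "CRLF=%d LF=%d\n", crlf, lf }')

echo "$counts"
```

For this sample the count comes out as CRLF=2 LF=8, which matches the two logical records of five physical lines each.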


03-21-2011

Registered User


Join Date: Feb 2004

Last Activity: 8 May 2020, 9:07 AM EDT

Location: NM

Posts: 11,728

Thanks Given: 903

Thanked 1,345 Times in 1,201 Posts

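You have to preprocess the file with dos2unix (called dos2ux on some systems). Many dos2unix versions convert a named file in place and print nothing, so feed the file on standard input when piping to awk:

Code:

dos2unix < somefile | awk '{awk program here}'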

jim mcnamara
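
A quick way to see what this preprocessing does to the record count is sketched below. It assumes tr -d '\r' as a rough stand-in for dos2unix (which may not be installed everywhere), and make_sample is a hypothetical helper that rebuilds the two records from the first post.

```shell
# Hypothetical helper: two logical records, LF inside, CR LF at the end of each.
make_sample() {
    printf 'Jones,Bob,20,Q: What is your favorite ice cream?\nA: Butter Pecan\nQ: Do you like sprinkles?\nA: But of course\ncone\r\n'
    printf 'Smith,Jane,18,Q: What is your favorite ice cream?\nA: Rocky Road\nQ: Do you like sprinkles?\nA: Yuck, no\nbowl\r\n'
}

# Strip every CR, then let awk count records with its default LF separator.
records=$(make_sample | tr -d '\r' | awk 'END { print NR }')

echo "$records"
```

This prints 10: once the carriage returns are gone, every physical line looks the same to awk, and the marker for the end of each logical record is lost.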


03-21-2011


Thanks for the reply. That does normalize the situation somewhat-- all of the newline characters are consistent-- but unfortunately doesn't make processing much simpler. Following the use of dos2unix, here's a single record from the human's perspective:

Code:

Jones,Bob,20,Q: What is your favorite ice cream?$
A: Butter Pecan$
Q: Do you like sprinkles?$
A: But of course$
cone$

From awk's perspective, though, that's 5 different records. My naive approach is to try to turn the above into:

Code:

Jones,Bob,20,Q: What is your favorite ice cream?  A: Butter Pecan, Q: Do you like sprinkles? A: But of course, cone$

That seems to require using ^M$ as a record separator, and just discarding $ (UNIX newline, as opposed to ^M$) so that the records join together. Any suggestions on how that might be done? This is a strangely sticky problem, but just researching it has been very informative. Anyway, again, thanks for the reply.

treesloth
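
The joining described above can be sketched without dos2unix by leaving the carriage returns in place and using them as the end-of-record marker. This is an illustrative sketch under stated assumptions, not a tested solution for the real data: make_sample is a hypothetical helper rebuilding the two records, and lines are joined with a single space rather than the commas shown in the desired output.

```shell
# Hypothetical helper: two logical records, LF inside, CR LF at the end of each.
make_sample() {
    printf 'Jones,Bob,20,Q: What is your favorite ice cream?\nA: Butter Pecan\nQ: Do you like sprinkles?\nA: But of course\ncone\r\n'
    printf 'Smith,Jane,18,Q: What is your favorite ice cream?\nA: Rocky Road\nQ: Do you like sprinkles?\nA: Yuck, no\nbowl\r\n'
}

joined=$(make_sample | awk '
{
    is_last = sub(/\r$/, "")                 # 1 if the line ended in CR (end of record)
    buf = (buf == "" ? $0 : buf " " $0)      # join physical lines with a space
    if (is_last) { print buf; buf = "" }     # emit the completed logical record
}
END { if (buf != "") print buf }             # flush a record that lacks a final CR LF
')

printf '%s\n' "$joined"
```

With gawk or mawk, setting RS to "\r\n" is a more direct alternative, but POSIX leaves a multi-character RS unspecified, so the sketch sticks to the default separator.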


03-21-2011

Registered User


Join Date: Jan 2009

Last Activity: 28 June 2018, 4:18 PM EDT

Location: Tegucigalpa, Honduras

Posts: 290

Thanks Given: 8

Thanked 37 Times in 36 Posts

Hi treesloth,

Please try the following, which works on the cat -e output (where ^M and $ are literal characters):

Code:

awk -F"\$" '{gsub(/\^M/,"\n",$0); for(i=1;i<NF;i++) printf "%s ", $i }' inputfile
Jones,Bob,20,Q: What is your favorite ice cream? A: Butter Pecan Q: Do you like sprinkles? A: But of course cone
 Smith,Jane,18,Q: What is your favorite ice cream? A: Rocky Road Q: Do you like sprinkles? A: Yuck, no bowl

Regards.

cgkmal


03-21-2011

Registered User


Join Date: Feb 2011

Last Activity: 24 March 2015, 6:12 AM EDT

Posts: 436

Thanks Given: 9

Thanked 107 Times in 106 Posts

Code:

dos2unix < file | awk '{printf "%s%s", $0, ($0 ~ /[AQ]:/ ? FS : RS)}'

yinyuemi
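
The one-liner above decides where a record ends from the line's content instead of its line ending. An expanded sketch of that idea follows, assuming tr -d '\r' in place of dos2unix and a hypothetical make_sample helper that rebuilds the two records from the first post.

```shell
# Hypothetical helper: two logical records, LF inside, CR LF at the end of each.
make_sample() {
    printf 'Jones,Bob,20,Q: What is your favorite ice cream?\nA: Butter Pecan\nQ: Do you like sprinkles?\nA: But of course\ncone\r\n'
    printf 'Smith,Jane,18,Q: What is your favorite ice cream?\nA: Rocky Road\nQ: Do you like sprinkles?\nA: Yuck, no\nbowl\r\n'
}

out=$(make_sample | tr -d '\r' | awk '
{
    # A line holding "Q:" or "A:" is treated as a continuation: end it with FS (a space).
    # Any other line (here "cone" or "bowl") closes the record: end it with RS (a newline).
    printf "%s%s", $0, ($0 ~ /[AQ]:/ ? FS : RS)
}')

printf '%s\n' "$out"
```

Because the test looks only for "Q:" or "A:", an answer that itself spans several lines would be cut at its first continuation line, so this heuristic fits data shaped like the sample and not much else.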


03-21-2011


treesloth,

This is an improved version of my first code. It was missing the conversion from DOS to Unix format and the commas in the output.
This version includes both:

Code:

awk '{gsub(/\r$/,""); gsub(/\$$/,",$"); gsub(/\?,\$/,"?$"); gsub(/\^M,\$/,"\n"); gsub(/\$/," "); printf "%s", $0}' inputfile
Jones,Bob,20,Q: What is your favorite ice cream? A: Butter Pecan, Q: Do you like sprinkles? A: But of course, cone
Smith,Jane,18,Q: What is your favorite ice cream? A: Rocky Road, Q: Do you like sprinkles? A: Yuck, no, bowl

Hope it helps,

Regards

cgkmal


03-22-2011

Registered User


Join Date: Mar 2011

Last Activity: 17 January 2013, 10:46 PM EST

Location: San Diego

Posts: 35

Thanks Given: 0

Thanked 7 Times in 6 Posts

This should do it:

Code:

tr '\r\n' '||' <in | sed 's/||/$/g' | tr '$' '\n'

You can pipe the result to awk (for example awk -F"|" '{print $0}', as in the session below) to confirm that the field separator is now "|" and the record separator is "\n". This assumes the data itself contains no "|" or "$" characters.

stefangr$ cat in

Jones,Bob,20
Q: What is your favorite ice cream?
A: Butter Pecan
Q: Do you like sprinkles?
A: But of course
cone
Smith,Jane,18
Q: What is your favorite ice
cream?
A: Rocky Road
Q: Do you like sprinkles?
A: Yuck, no
bowl

stefangr$ tr '\r\n' '||' <in | sed 's/||/$/g' | tr '$' '\n' | awk -F"|" '{print $0}'
Jones,Bob,20|Q: What is your favorite ice cream?|A: Butter Pecan|Q: Do you like sprinkles?|A: But of course|cone
Smith,Jane,18|Q: What is your favorite ice |cream?|A: Rocky Road|Q: Do you like sprinkles?|A: Yuck, no|bowl

sgruenwald
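
The same "|"-separated layout can also be produced in a single awk pass, which avoids the temporary "$" sentinel. The sketch below is hedged: make_sample is a hypothetical helper modelled on the in file shown above, and it assumes the carriage returns sit after "cone" and "bowl", which the listing itself does not show.

```shell
# Hypothetical helper modelled on the "in" file; CR LF is assumed after "cone" and "bowl".
make_sample() {
    printf 'Jones,Bob,20\nQ: What is your favorite ice cream?\nA: Butter Pecan\nQ: Do you like sprinkles?\nA: But of course\ncone\r\n'
    printf 'Smith,Jane,18\nQ: What is your favorite ice \ncream?\nA: Rocky Road\nQ: Do you like sprinkles?\nA: Yuck, no\nbowl\r\n'
}

# Pass 1: end each physical line with "|" unless it carried a CR, which ends the record.
# Pass 2: split the rebuilt records on "|" and report the first field and the field count.
summary=$(make_sample | awk '
{
    is_last = sub(/\r$/, "")
    printf "%s%s", $0, (is_last ? "\n" : "|")
}' | awk -F'|' '{ print $1 " has " NF " fields" }')

printf '%s\n' "$summary"
```

For this sample the second record reports seven fields instead of six, because the question that was split across two lines becomes two fields, just as in the session output above.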

