Help joining separate lines into a single line from an XML file

02-23-2018

Hi all,

I need help parsing this XML file, which has paragraphs broken across several lines that I would like to join into a single line.

I hope you can understand my explanation. Thanks for any help/direction.

The script could be in bash, awk, ruby, perl, or whatever, please.

In the output I want:
The values with font=9 as the initial line of each group.
The 1st value with font=8 immediately below the font=9 value as the 2nd line of each group.
The 2nd value with font=8 below the font=9 value as the first column of each group.
The values with font=10 as the second column of each group.
And finally, the font=8 values that belong to the preceding font=10 value joined into a single line.

input

Code:

	<text top="333" left="98" width="93" height="16" font="9"><b>OS family </b></text>
	<text top="350" left="98" width="192" height="16" font="8">Unix pk1</text>
	<text top="368" left="98" width="12" height="16" font="8">1 </text>
	<text top="365" left="112" width="5" height="11" font="10">1</text>
	<text top="368" left="118" width="308" height="16" font="8"> originally meant to be a </text>
	<text top="365" left="427" width="5" height="11" font="10">2</text>
	<text top="368" left="433" width="4" height="16" font="8"> </text>
	<text top="385" left="98" width="339" height="16" font="8">convenient platform</text>
	<text top="402" left="98" width="339" height="16" font="8"> for programmers</text>
	
	<text top="333" left="98" width="93" height="16" font="9"><b>Source model </b></text>
	<text top="350" left="98" width="192" height="16" font="8">Unix pk2</text>
	<text top="368" left="98" width="12" height="16" font="8">2 </text>
	<text top="365" left="112" width="5" height="11" font="10">1</text>
	<text top="368" left="118" width="308" height="16" font="8">Historically </text>
	<text top="368" left="118" width="308" height="16" font="8">closed-source </text>
	<text top="365" left="427" width="5" height="11" font="10">2</text>
	<text top="368" left="433" width="4" height="16" font="8"> </text>
	<text top="385" left="98" width="339" height="16" font="8">, while some Unix</text>
	<text top="402" left="98" width="339" height="16" font="8"> projects (including BSD family and Illumos)</text>
	<text top="402" left="98" width="339" height="16" font="8"> are open-source.</text>
	<text top="402" left="98" width="339" height="16" font="8"> Development started in 1969.</text>
	<text top="365" left="427" width="5" height="11" font="10">3</text>
	<text top="402" left="98" width="339" height="16" font="8">this is</text>
	<text top="402" left="98" width="339" height="16" font="8"> last paragraph.</text>

desired output

Code:

	OS family
	Unix pk1
	1 1 originally meant to be a
	1 2 convenient platform for programmers
	
	Source model 
	Unix pk2
	2 1 Historically closed-source 
	2 2 , while some Unix projects (including BSD family and Illumos) are open-source. Development started in 1969.
	2 3 this is last paragraph.

Last edited by Ophiuchus; 02-23-2018 at 02:34 AM..

Ophiuchus

02-23-2018

As always, it helps if we know what operating system you're using and what you have tried to solve this problem on your own.

By listing bash along with awk, ruby, and perl, are you saying that bash is the shell that you use?

We are here to help you learn how to use the tools available on your system to do things like this; not to act as your unpaid programming staff.

Don Cragun

02-24-2018

Hello Don,

My apologies for any misunderstanding.

I'm using Cygwin on Windows and Ubuntu 16.04.2 LTS on Windows.

Awk or ruby would be preferable, I think, so that I can understand whatever direction someone shares with me.

The code I've been able to construct so far is in awk, but its output is far from what I want.

Code:

awk '/font="9">/ {a = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     /font="8">/ {
     z++; 
     if(z==1){ b = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     if(z==2){ c = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     if(z>2 ){ d = d " " gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     }
     /font="10">/{d = ""; e = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )

     print a"\n"b"\n"c"\n"e,d; z=0}' input.xml

My current output is:

Code:

<b>OS family </b>
Unix pk1
1
1
<b>OS family </b>
 originally meant to be a
1
2
<b>Source model </b>

convenient platform
1
<b>Source model </b>
Historically
closed-source
2
<b>Source model </b>

, while some Unix
3

I hope someone can help with this.

Thanks in advance.

Ophiuchus

02-24-2018

Try

Code:

awk '
        {match ($0, /font="[^"]*"/)
         FNT = substr ($0, RSTART+6, RLENGTH-7)
         gsub (/^\t|<[^>]*>/, _)

         if (FNT ==  9) {LVL = 1
                         printf "%s%s" ORS, TRS, $0
                         TRS = ORS ORS
                        }
         if (FNT ==  8) {if (LVL == 3)   printf "%s ", $0
                         if (LVL == 2)  {LVL = 3
                                         GRP1 = $0
                                        }
                         if (LVL == 1)  {LVL = 2
                                         printf "%s", $0
                                        }
                         }
         if (FNT == 10) {GRP2 = $0
                         printf ORS "%s %s ", GRP1, GRP2
                        }
        }

END     {printf ORS
        }
' file
OS family 
Unix pk1
1  1  originally meant to be a  
1  2   convenient platform  for programmers 

Source model 
Unix pk2
2  1 Historically  closed-source  
2  2   , while some Unix  projects (including BSD family and Illumos)  are open-source.  Development started in 1969. 
2  3 this is  last paragraph.

EDIT: Looks like above can be simplified:

Code:

awk '
        {match ($0, /font="[^"]*"/)
         FNT = substr ($0, RSTART+6, RLENGTH-7)
         gsub (/^\t|<[^>]*>/, _)

         if (FNT ==  9) {LVL = 1
                         printf "%s%s" ORS, TRS, $0
                         TRS = ORS ORS
                        }
         if (FNT ==  8)  if (LVL++ == 2)        GRP1 = $0
                           else                 printf "%s ", $0
         if (FNT == 10)  printf ORS "%s %s ", GRP1, $0
        }

END     {printf ORS
        }
' file
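
For output closer to the desired sample (single spaces between fields, no trailing blanks), a possible variant is sketched below. It rests on a few assumptions that are not part of the script above: the wrapper name join_groups is made up for illustration, fragments are joined with exactly one space, and lines without a font attribute (such as the blank separator between groups) are skipped. It uses only POSIX awk features.

```shell
# join_groups is a hypothetical wrapper name; it reads the file(s) given
# as arguments, or standard input if there are none.
join_groups() {
    awk '
    # Squeeze runs of blanks to one space and trim both ends.
    function trim(s) {
        gsub(/[ \t]+/, " ", s)
        sub(/^ /, "", s)
        sub(/ $/, "", s)
        return s
    }
    # Print the paragraph collected for the current font=10 number, if any.
    function emit() {
        if (num != "") print grp, num, trim(para)
        num = ""
        para = ""
    }
    {
        # Skip lines without a font attribute, such as blank separators.
        if (!match($0, /font="[^"]*"/)) next
        fnt = substr($0, RSTART + 6, RLENGTH - 7)   # text between the quotes
        gsub(/<[^>]*>/, "")                         # remove every tag
        text = trim($0)

        if (fnt == "9") {                # group title
            emit()
            if (groups++) print ""       # blank line between groups
            print text
            lvl = 1
        } else if (fnt == "8") {
            if (lvl == 1)      { print text; lvl = 2 }   # 2nd line of the group
            else if (lvl == 2) { grp = text; lvl = 3 }   # first column
            else               para = para " " text      # paragraph fragment
        } else if (fnt == "10") {        # second column; starts a new paragraph
            emit()
            num = text
        }
    }
    END { emit() }
    ' "$@"
}
```

Usage would be join_groups file. Because each paragraph is collected first and printed only when the next font=10 or font=9 line (or the end of input) arrives, the blanks can be cleaned up before anything is written. On the sample input this should give the desired output with single spaces and nothing trailing.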

Last edited by RudiC; 02-24-2018 at 08:01 AM..

RudiC

02-24-2018

Hi RudiC,

Thanks for your help.

I see your script prints the desired output, but when I try it the output is different.

I get this output.

Code:

OS family
 nix pk1
  originally meant to be a
  for programmersorm

Source model
 nix pk2
 closed-source
  Development started in 1969.ly and Illumos)
  last paragraph.

Ophiuchus

02-24-2018

I was afraid of that when reading your system info. What awk version do you use? Are you sure you ran the script exactly as given? And on the data as given in the sample?

Please post the output of

Code:

awk '{match ($0, /font="[^"]*"/); LVL = substr ($0, RSTART+6, RLENGTH-7); gsub (/<[^>]*>/, _); print LVL, $0}' file

RudiC

02-24-2018

Hi RudiC,

Yes, I ran it exactly as given, with the same input as pasted in the forum.

On the Ubuntu system:

Code:

$ awk -W version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
Copyright (C) 1989, 1991-2015 Free Software Foundation.

In Cygwin:

Code:

$ awk -W version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5-p2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.

Code:

$ awk '{match ($0, /font="[^"]*"/); LVL = substr ($0, RSTART+6, RLENGTH-7); gsub (/<[^>]*>/, _); print LVL, $0}' input.xml
9       OS family
8       Unix pk1
8       1
10      1
8        originally meant to be a
10      2
8
8       convenient platform
8        for programmers

9       Source model
8       Unix pk2
8       2
10      1
8       Historically
8       closed-source
10      2
8
8       , while some Unix
8        projects (including BSD family and Illumos)
8        are open-source.
8        Development started in 1969.
10      3
8       this is
8        last paragraph.
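
The overwritten-looking lines in the earlier output (" nix pk1", "for programmersorm") are consistent with a carriage return at the end of each input line, as in a file saved with Windows (CRLF) line endings: the terminal returns to column 0 and the next fragment is written over the previous one. The listing above would still look clean, because each CR sits at the very end of a printed line. A minimal sketch for checking and removing them, assuming that really is the cause; the helper names count_cr_lines and strip_cr are made up for illustration:

```shell
# Print how many input lines end in a carriage return (0 if none).
count_cr_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$@"
}

# Remove one trailing carriage return from each line; pass the rest through.
strip_cr() {
    awk '{ sub(/\r$/, ""); print }' "$@"
}
```

If count_cr_lines input.xml prints a non-zero number, that would support the CR explanation, and strip_cr input.xml > clean.xml gives a file to rerun the script on. Adding sub(/\r$/, "") as the first statement of the awk script should have the same effect.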

Ophiuchus
