Help to join separate lines in a single one from xml file


 
# 1  
Old 02-23-2018

Hi all,

I need help parsing an XML file in which paragraphs are broken across several lines; I would like to join each paragraph into a single line.

I hope my explanation is clear. Thanks for any help or direction.

The script could be in bash, awk, ruby, perl, or whatever else works.

In the output I want:
The values with font=9 as the first line of each group.
The 1st font=8 value immediately below a font=9 value as the 2nd line of each group.
The 2nd font=8 value below a font=9 value as the first column of every following line in the group.
The font=10 values as the second column of those lines.
Finally, all font=8 values that follow a font=10 value joined into a single line after it.

input
Code:
	<text top="333" left="98" width="93" height="16" font="9"><b>OS family </b></text>
	<text top="350" left="98" width="192" height="16" font="8">Unix pk1</text>
	<text top="368" left="98" width="12" height="16" font="8">1 </text>
	<text top="365" left="112" width="5" height="11" font="10">1</text>
	<text top="368" left="118" width="308" height="16" font="8"> originally meant to be a </text>
	<text top="365" left="427" width="5" height="11" font="10">2</text>
	<text top="368" left="433" width="4" height="16" font="8"> </text>
	<text top="385" left="98" width="339" height="16" font="8">convenient platform</text>
	<text top="402" left="98" width="339" height="16" font="8"> for programmers</text>
	
	<text top="333" left="98" width="93" height="16" font="9"><b>Source model </b></text>
	<text top="350" left="98" width="192" height="16" font="8">Unix pk2</text>
	<text top="368" left="98" width="12" height="16" font="8">2 </text>
	<text top="365" left="112" width="5" height="11" font="10">1</text>
	<text top="368" left="118" width="308" height="16" font="8">Historically </text>
	<text top="368" left="118" width="308" height="16" font="8">closed-source </text>
	<text top="365" left="427" width="5" height="11" font="10">2</text>
	<text top="368" left="433" width="4" height="16" font="8"> </text>
	<text top="385" left="98" width="339" height="16" font="8">, while some Unix</text>
	<text top="402" left="98" width="339" height="16" font="8"> projects (including BSD family and Illumos)</text>
	<text top="402" left="98" width="339" height="16" font="8"> are open-source.</text>
	<text top="402" left="98" width="339" height="16" font="8"> Development started in 1969.</text>
	<text top="365" left="427" width="5" height="11" font="10">3</text>
	<text top="402" left="98" width="339" height="16" font="8">this is</text>
	<text top="402" left="98" width="339" height="16" font="8"> last paragraph.</text>

desired output

Code:
	OS family
	Unix pk1
	1 1 originally meant to be a
	1 2 convenient platform for programmers
	
	Source model 
	Unix pk2
	2 1 Historically closed-source 
	2 2 , while some Unix projects (including BSD family and Illumos) are open-source. Development started in 1969.
	2 3 this is last paragraph.


Last edited by Ophiuchus; 02-23-2018 at 02:34 AM..
# 2  
Old 02-23-2018
As always, it helps if we know what operating system you're using and what you have tried to solve this problem on your own.

By listing bash along with awk, ruby, and perl, are you saying that bash is the shell that you use?

We are here to help you learn how to use the tools available on your system to do things like this; not to act as your unpaid programming staff.
# 3  
Old 02-24-2018
Hello Don,

My apologies for any misunderstanding.

I'm using Cygwin on Windows and Ubuntu 16.04.2 LTS on Windows.

An answer in awk or ruby would be preferable, so that I can understand any direction someone shares.

The code I've been able to write so far is in awk, but its output is far from what I want.

Code:
awk '/font="9">/ {a = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     /font="8">/ {
     z++; 
     if(z==1){ b = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     if(z==2){ c = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     if(z>2 ){ d = d " " gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     }
     /font="10">/{d = ""; e = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )

     print a"\n"b"\n"c"\n"e,d; z=0}' input.xml

My current output is:
Code:
<b>OS family </b>
Unix pk1
1
1
<b>OS family </b>
 originally meant to be a
1
2
<b>Source model </b>

convenient platform
1
<b>Source model </b>
Historically
closed-source
2
<b>Source model </b>

, while some Unix
3

I hope someone can help with this.

Thanks in advance.
# 4  
Old 02-24-2018
Try
Code:
awk '
        {match ($0, /font="[^"]*"/)
         FNT = substr ($0, RSTART+6, RLENGTH-7)
         gsub (/^\t|<[^>]*>/, _)

         if (FNT ==  9) {LVL = 1
                         printf "%s%s" ORS, TRS, $0
                         TRS = ORS ORS
                        }
         if (FNT ==  8) {if (LVL == 3)   printf "%s ", $0
                         if (LVL == 2)  {LVL = 3
                                         GRP1 = $0
                                        }
                         if (LVL == 1)  {LVL = 2
                                         printf "%s", $0
                                        }
                         }
         if (FNT == 10) {GRP2 = $0
                         printf ORS "%s %s ", GRP1, GRP2
                        }
        }

END     {printf ORS
        }
' file
OS family 
Unix pk1
1  1  originally meant to be a  
1  2   convenient platform  for programmers 

Source model 
Unix pk2
2  1 Historically  closed-source  
2  2   , while some Unix  projects (including BSD family and Illumos)  are open-source.  Development started in 1969. 
2  3 this is  last paragraph.


EDIT: Looks like the above can be simplified:
Code:
awk '
        {match ($0, /font="[^"]*"/)
         FNT = substr ($0, RSTART+6, RLENGTH-7)
         gsub (/^\t|<[^>]*>/, _)

         if (FNT ==  9) {LVL = 1
                         printf "%s%s" ORS, TRS, $0
                         TRS = ORS ORS
                        }
         if (FNT ==  8)  if (LVL++ == 2)        GRP1 = $0
                           else                 printf "%s ", $0
         if (FNT == 10)  printf ORS "%s %s ", GRP1, $0
        }

END     {printf ORS
        }
' file
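
A possible variant of the script above, offered as a sketch and not something posted in this thread: it buffers each paragraph and collapses runs of blanks before printing, so the spacing is closer to the desired output. The names `join_text`, `trim`, and `emit` are made up for illustration, and the line that strips a trailing carriage return is an assumption about input edited on Windows.

```shell
# join_text: hypothetical wrapper name; reads the XML lines from files or stdin.
join_text() {
    awk '
    # Collapse runs of blanks and drop leading/trailing blanks.
    function trim(s) {
        gsub(/[ \t]+/, " ", s)
        sub(/^ /, "", s)
        sub(/ $/, "", s)
        return s
    }
    # Print the buffered paragraph, if any, and reset the buffer.
    function emit() {
        if (have) print trim(buf)
        have = 0
        buf = ""
    }
    {
        sub(/\r$/, "")                           # assumption: input may have CRLF endings
        if (!match($0, /font="[^"]*"/)) next     # skip lines without a font attribute
        font = substr($0, RSTART + 6, RLENGTH - 7)
        gsub(/<[^>]*>/, "")                      # remove all tags, including <b>
        text = $0
    }
    font == "9" {                                # group heading
        emit()
        if (seen++) print ""                     # blank line between groups
        print trim(text)
        level = 1
        next
    }
    font == "8" {
        if (level == 1)      { print trim(text); level = 2 }    # 2nd line of the group
        else if (level == 2) { group = trim(text); level = 3 }  # first column
        else                 { buf = buf " " text }             # paragraph fragment
        next
    }
    font == "10" {                               # second column starts a new paragraph
        emit()
        buf = group " " text
        have = 1
    }
    END { emit() }
    ' "$@"
}

# Usage (file name taken from the earlier posts):
#   join_text input.xml
```

Buffering until the next font=10 or font=9 line is a design choice here: it lets the whole paragraph be trimmed once, at the cost of holding one paragraph in memory.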


Last edited by RudiC; 02-24-2018 at 08:01 AM..
# 5  
Old 02-24-2018
Hi RudiC,

Thanks for your help.

I see that your script prints the desired output, but when I try it the output is different.

This is what I get:
Code:
OS family
 nix pk1
  originally meant to be a
  for programmersorm

Source model
 nix pk2
 closed-source
  Development started in 1969.ly and Illumos)
  last paragraph.

# 6  
Old 02-24-2018
I was afraid of that when reading your system info. What awk version do you use? Are you sure you ran the script exactly as given, and on the data as given in the sample?

Please post the output of
Code:
awk '{match ($0, /font="[^"]*"/); LVL = substr ($0, RSTART+6, RLENGTH-7); gsub (/<[^>]*>/, _); print LVL, $0}' file

# 7  
Old 02-24-2018
Hi RudiC,

Yes. I ran it exactly as given, with the same input as pasted in the forum.

On the Ubuntu system:
Code:
$ awk -W version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
Copyright (C) 1989, 1991-2015 Free Software Foundation.

In Cygwin:
Code:
$ awk -W version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5-p2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.


Code:
$ awk '{match ($0, /font="[^"]*"/); LVL = substr ($0, RSTART+6, RLENGTH-7); gsub (/<[^>]*>/, _); print LVL, $0}' input.xml
9       OS family
8       Unix pk1
8       1
10      1
8        originally meant to be a
10      2
8
8       convenient platform
8        for programmers

9       Source model
8       Unix pk2
8       2
10      1
8       Historically
8       closed-source
10      2
8
8       , while some Unix
8        projects (including BSD family and Illumos)
8        are open-source.
8        Development started in 1969.
10      3
8       this is
8        last paragraph.
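
The diagnostic output above looks correct, so the font detection and tag stripping appear to work on this system. The overlapping text in the earlier output (for example `for programmersorm`) is what a terminal typically shows when lines end in a carriage return, which is common for files created on Windows. That is an assumption, since the thread does not confirm it. A minimal sketch to check for and remove such characters, with hypothetical helper names `count_cr` and `strip_cr`:

```shell
# count_cr: hypothetical helper; prints how many lines end in a carriage return.
count_cr() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$@"
}

# strip_cr: hypothetical helper; removes one trailing carriage return per line.
strip_cr() {
    awk '{ sub(/\r$/, ""); print }' "$@"
}

# Example with two lines that have DOS line endings:
printf 'convenient platform\r\n for programmers\r\n' | count_cr             # prints 2
printf 'convenient platform\r\n for programmers\r\n' | strip_cr | count_cr  # prints 0
```

If `count_cr input.xml` reports a non-zero number, running `strip_cr input.xml > clean.xml` and feeding `clean.xml` (an illustrative file name) to the script from post #4 would be a reasonable next test.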
