Removing extra lines from file

01-05-2015

I have a file where data looks like this:

===

Code:

<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
</rs:data>
</xml>

<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>
</xml>

===

So basically the XML header and schema are repeated for every record. I want the header and schema to appear only once, followed by all of the data rows, like this:

Code:

<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
<rs:data>
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>
</xml>

Any ideas on how to do this?

Last edited by Don Cragun; 01-05-2015 at 10:01 PM.. Reason: Add CODE tags.

vx04

01-05-2015

Shouldn't:

Code:

<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
<rs:data>
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>

in the output be:

Code:

<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />
</rs:data>
<rs:data>
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>

or:

Code:

<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>

and, if so, which is it supposed to be?

Should the opening tag <rs:data> be matched by the closing tag </rs:data> or by </rs>?

Don Cragun

01-06-2015

Yes, you are right.

The data lines should look like this:

Code:

<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>

Thanks.

Moderator's Comments:

Please always use CODE tags when displaying sample input, output, and code. Without CODE tags all occurrences of adjacent spaces and tabs disappear (at the start of a line) or are coalesced into a single space in the middle of a line.

Last edited by Don Cragun; 01-06-2015 at 01:47 PM.. Reason: ADD CODE tags.

vx04

01-06-2015

Like so?

Code:

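awk     'NR==1,/<rs:data>/                      # print the first header, up to and including the first <rs:data>
         /<\/rs:data>/          {P=0            # end of a data block: stop printing rows
                                 delete TAIL    # and start collecting a fresh trailer
                                 CNT=0
                                }
         P                                      # inside a data block: print the row
         !P                     {TAIL[++CNT]=$0}        # outside a data block: buffer the line as a possible trailer
         /<rs:data>/            {P=1}           # start of a data block
         END                    {for (n=1; n<=CNT; n++) print TAIL[n]}  # print the trailer of the last block
        ' file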
<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>
</xml>

RudiC

01-06-2015

RudiC,
not all awks support delete array; some only support deleting a single element with delete array[idx]. The work-around is to clear the array with split("", array).
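
A minimal sketch of that work-around, assuming the same logic as the script above with only the array-clearing line changed. The function name merge_rowsets is hypothetical and is used here only so the command can be called on any file:

```shell
# Hypothetical helper: same logic as the script above, but the buffer is
# cleared with split("", TAIL), which also works in awks that only
# support "delete array[idx]".
merge_rowsets() {
    awk 'NR==1,/<rs:data>/
         /<\/rs:data>/ {P=0; split("", TAIL); CNT=0}
         P
         !P            {TAIL[++CNT]=$0}
         /<rs:data>/   {P=1}
         END           {for (n=1; n<=CNT; n++) print TAIL[n]}
        ' "$1"
}
```

It would be called as merge_rowsets file.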

vgersh99

01-06-2015

If delete array_name doesn't work in your version of awk and the trailing part of your XML isn't too large, you can also try the slightly more complex script below. It uses a string variable instead of an array, and it skips the header and trailer of every XML segment after the first complete one:

Code:

awk '
/<[?]xml/,/<rs:data>/ {
	CopyData = 1
	if(!TrailerDone) print
	next
}
/<\/rs:data>/,/<\/xml>/ {
	CopyData = 0
	if(!TrailerDone) Trailer = ((Trailer == "") ? $0 : (Trailer RS $0))
}
/<\/xml>/ {
	TrailerDone = 1
}
CopyData
END {	print Trailer
}' file.xml

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

Don Cragun

01-06-2015

You can now decide which trailer you want, should they differ: the first XML block's trailer, as in Don Cragun's suggestion, or the last block's, as in mine.
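
To see whether that choice matters for a given file, one could count how many distinct trailers it contains. This is only a sketch: it assumes a trailer is everything from </rs:data> through </xml>, as in the sample data, and the function name count_trailers is hypothetical.

```shell
# Hypothetical check: count the distinct trailers (the lines from
# </rs:data> through </xml>) found in the file given as $1.
count_trailers() {
    awk '/<\/rs:data>/,/<\/xml>/ { t = t $0 "\n" }   # collect the current trailer
         /<\/xml>/ {
             # a block just ended: remember its trailer if it is new
             if (t != "" && !(t in seen)) { seen[t] = 1; n++ }
             t = ""
         }
         END { print n + 0 }
        ' "$1"
}
```

If count_trailers file prints 1, every block has the same trailer, so the first and the last trailer are identical.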

RudiC
