Removing extra lines from file


 
# 1  
Old 01-05-2015

I have a file where data looks like this:

===

Code:
<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
</rs:data>
</xml>

<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>
</xml>

===

So basically the XML attribute definitions are repeated for every record. I want to get them once, followed by the data rows from every block:


Code:
<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
<rs:data>
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>
</xml>



Any ideas on how to do this?

Last edited by Don Cragun; 01-05-2015 at 10:01 PM.. Reason: Add CODE tags.
# 2  
Old 01-05-2015
Shouldn't:
Code:
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
<rs:data>
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>

in the output be:
Code:
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />
</rs:data>
<rs:data>
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>

or:
Code:
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>

and, if so, which is it supposed to be?

Should the opening tag <rs:data> be matched by the closing tag </rs:data> or by </rs>?
# 3  
Old 01-06-2015
Yes, you are right.

The data lines should look like this:

Code:
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>

Thanks.
Moderator's Comments:
Please always use CODE tags when displaying sample input, output, and code. Without CODE tags, all adjacent spaces and tabs disappear at the start of a line and are coalesced into a single space in the middle of a line.

Last edited by Don Cragun; 01-06-2015 at 01:47 PM.. Reason: ADD CODE tags.
# 4  
Old 01-06-2015
Like so?
Code:
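awk     '# Print the first header, up to and including the first <rs:data> line.
         NR==1,/<rs:data>/
         # A closing tag ends a data section: stop printing, restart the buffer.
         /<\/rs:data>/          {P=0
                                 delete TAIL
                                 CNT=0
                                }
         # Inside a data section: print the row.
         P
         # Outside a data section: buffer the line as a possible trailer.
         !P                     {TAIL[++CNT]=$0}
         # An opening tag starts a data section (effective from the next line).
         /<rs:data>/            {P=1}
         # Print whatever was buffered after the last data section.
         END                    {for (n=1; n<=CNT; n++) print TAIL[n]}
        ' file

Output:
Code: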
<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />  
    <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>
</xml>

# 5  
Old 01-06-2015
RudiC,
not all awks support deleting a whole array with delete array; some only support the delete array[idx] form. The workaround is: split("", array)
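A minimal sketch of that workaround. The function name clear_demo, the array name tail, and the "reset" marker line are all made up for illustration and do not come from the thread. POSIX awk deletes every element of the target array before splitting, so splitting an empty string leaves the array empty:

```shell
# clear_demo: buffer input lines in an array, emptying it at every "reset" line.
# split("", tail) removes all elements of tail, so it also works in awks that
# only accept the element-wise form: delete tail[idx]
clear_demo() {
    awk '
    /^reset$/ { split("", tail); cnt = 0; next }   # portable "delete tail"
              { tail[++cnt] = $0 }                 # buffer every other line
    END       { for (n = 1; n <= cnt; n++) print tail[n] }
    '
}

# Only the lines after the last "reset" survive: prints c, then d.
printf '%s\n' a b reset c d | clear_demo
```

In RudiC's script the same substitution would replace delete TAIL with split("", TAIL).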
# 6  
Old 01-06-2015
If delete array_name doesn't work in your version of awk and the trailing part of your XML isn't too large, you can also try the slightly more complex script below. It uses a string variable instead of an array and skips the header and trailer data in every XML segment after the first complete one:
Code:
awk '
/<[?]xml/,/<rs:data>/ {
	CopyData = 1
	if(!TrailerDone) print
	next
}
/<\/rs:data>/,/<\/xml>/ {
	CopyData = 0
	if(!TrailerDone) Trailer = ((Trailer == "") ? $0 : (Trailer RS $0))
}
/<\/xml>/ {
	TrailerDone = 1
}
CopyData
END {	print Trailer
}' file.xml

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
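One way to avoid editing the script for each system is to pick the interpreter at run time. This is only a sketch: it assumes the two /usr/xpg paths named above exist only on Solaris/SunOS, and the variable name AWK is made up for illustration.

```shell
# Prefer a POSIX-conforming awk where Solaris provides one; else use plain awk.
AWK=awk
for candidate in /usr/xpg6/bin/awk /usr/xpg4/bin/awk; do
    if [ -x "$candidate" ]; then
        AWK=$candidate
        break
    fi
done

# Quick check that the chosen awk handles a range pattern: prints one, then two.
printf '%s\n' one two three | "$AWK" '/one/,/two/'
```

The script above could then be started with "$AWK" in place of awk.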
# 7  
Old 01-06-2015
You can now decide which trailer you want, should they differ: the first XML block's trailer, as kept by Don Cragun's suggestion, or the last block's, as kept by mine.
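To see the difference, here is a sketch that runs condensed copies of both scripts on a made-up two-block sample whose trailers differ by one comment line. The sample file, the function names last_trailer and first_trailer, and the use of split() in place of delete TAIL are assumptions for illustration only:

```shell
# Hypothetical sample: two blocks whose trailers differ by one comment line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0"?>
<xml>
<rs:data>
   row 1
</rs:data>
<!-- first trailer -->
</xml>
<?xml version="1.0"?>
<xml>
<rs:data>
   row 2
</rs:data>
<!-- last trailer -->
</xml>
EOF

# RudiC's approach (split() used in place of "delete TAIL"): last trailer wins.
last_trailer() {
    awk 'NR==1,/<rs:data>/
         /<\/rs:data>/ {P=0; split("", TAIL); CNT=0}
         P
         !P            {TAIL[++CNT]=$0}
         /<rs:data>/   {P=1}
         END           {for (n=1; n<=CNT; n++) print TAIL[n]}' "$1"
}

# Don Cragun's approach: the trailer of the first block is kept.
first_trailer() {
    awk '/<[?]xml/,/<rs:data>/   {CopyData = 1; if(!TrailerDone) print; next}
         /<\/rs:data>/,/<\/xml>/ {CopyData = 0
                                  if(!TrailerDone) Trailer = ((Trailer == "") ? $0 : (Trailer RS $0))}
         /<\/xml>/               {TrailerDone = 1}
         CopyData
         END                     {print Trailer}' "$1"
}

last_trailer "$sample"  | grep trailer    # prints <!-- last trailer -->
first_trailer "$sample" | grep trailer    # prints <!-- first trailer -->
```

Both functions print the same header and both data rows; only the comment line in the trailer differs.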