reformatting xml file, sed or awk I think (possibly perl)


 
# 8  
Old 04-16-2011
Quote:
Originally Posted by bartus11
What should
Code:
<Name Network_0>

be transformed into?
Sorry, this should be just like the others,
<Name Name="Network_0">

LMHmedchem
# 9  
Old 04-16-2011
Try:
Code:
perl -pe 's/<(\w+) (\w+)>/<$1 $1="$2">/' file

# 10  
Old 04-16-2011
This is the final script that I have,
Code:
#!/bin/bash

infile="$1"
outfile="${infile}.mod.wts"

# add version and root begin tags to beginning of input file
echo '<?xml version="1.0" encoding="ASCII"?>' > temp1
echo "<net>" >> temp1
cat temp1 "$infile" > temp2

# format information tags (end with />)
sed 's/<Fmt  *\([a-zA-Z]*\) *>/<Fmt Fmt="\1"\/>/g'        temp2 | \
sed 's/<Name\ Network_0>/<Name\ Name="Network_0"\/>/g'          | \
sed 's/<Epoch  *\([0-9]*\) *>/<Epoch Epoch="\1"\/>/g'           | \

# format tags with arguments of different name than tag
sed 's/<UgUn  *\([^ ]\{1,9\}\) *>/<UgUn id="\1">/g'             | \
sed 's/<Cn  *\([0-9]*\) *>/<Cn num="\1">/g'                     | \

# format remaining tags where tag and arg have same value
perl -pe 's/<(\w+) (\w+)>/<$1 $1="$2">/'                        | \

# format tags with multiple args
awk '{ if ( $0 ~ /<Cg 0 Fm:Input>/ ) {
     printf( "%s\n", "<Cg Cg=\"0\">" );
     printf( "%s\n", "<Fm>Input</Fm>" );
     } else {
          print $0;
     }
}'                                                              | \
awk '{ if ( $0 ~ /<Cg 0 Fm:Hidden>/ ) {
     printf( "%s\n", "<Cg Cg=\"0\">" );
     printf( "%s\n", "<Fm>Hidden</Fm>" );
     } else {
          print $0;
     }
}'                                                              | \

# format multi row, multi arg data
perl -p0e 's/(<Un>\n)(.*)/$1<Bias>$2<\/Bias>/g'                 | \
perl -pe 's/(\d+) ([\d.-]+)/<C$1>$2<\/C$1>/ if /<Cn/../<\/Cn/'   > "$outfile"

# add root close tag to end of file
echo "</net>" >> "$outfile"

# cleanup
rm temp1 temp2

You can see that I had to more or less hard code the single-line tags at the beginning (format <stuff stuff="stuff"/>). These will always be present, so that isn't a big deal. The network name does not always have to be Network_0, but when I tried the last perl code you posted, it changed some other things that I didn't need changed. I guess these three substitutions need to look specifically for,
<Fmt
<Name
<Epoch

The script works, and quickly, but is sort of a mishmash of regex stuff. I guess that's not too unusual, but if you see anything that is a real issue, or could be much cleaner, I would appreciate a heads up.

I have attached the file this works on, and the converted file.
# 11  
Old 04-16-2011
Quote:
Originally Posted by LMHmedchem
The network does not always have to be Network_0, but when I tried the last perl code you posted, it changed some other things that I didn't need changed.
Can you post the list of tags that shouldn't be changed?
# 12  
Old 04-16-2011
Well, every tag that has no spaces should be left as it is,
Code:
<Ug>
<Un>

There are three tags at the beginning that have a single argument,
Code:
<Fmt TEXT>
<Name Network_0>
<Epoch 7300>

These need to be changed to,
Code:
<Fmt Fmt="TEXT"/>
<Name Name="Network_0"/>
<Epoch Epoch="7300"/>

where the second value in the original tag goes in the quotes, followed by the added "/". The second value could be anything and so needs to be read from the original tag.
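
One way to express this rule without hard coding the values is a single sed substitution. This is only a sketch: the function name `close_info_tags` is made up for illustration, and it assumes the argument never contains a space or a `>`.

```shell
# Hypothetical helper: turn <Fmt X>, <Name X> and <Epoch X> into
# self-closing tags, copying X from the input instead of hard coding it.
close_info_tags() {
    # -E enables extended regexes; using # as the delimiter means the
    # "/>" in the replacement needs no escaping.
    sed -E 's#<(Fmt|Name|Epoch) +([^ >]+) *>#<\1 \1="\2"/>#'
}
```

It reads standard input and writes standard output, so it can replace the three separate sed commands in a pipeline, for example `close_info_tags < temp2`.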

There are other tags that have a single argument, but do not get the trailing "/",
Code:
<Lay Input>
<UgUn 0 >
<Cn 36>

I have converted some of these so that the argument name is not the same as the tag name,
Code:
<UgUn id="0">
<Cn num="36">

but perhaps that is not such a good idea. I guess they can just be handled as two cases,
Code:
text tag,  space, text argument (<Lay Input>)
text tag,  space, int argument (<Cn 36>)

to be converted to,
Code:
<Lay Lay="Input">
<UgUn UgUn="0">
<Cn Cn="36">

Note that, for some reason, there is a trailing space between the argument and the > in <UgUn 0 >.
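
Both cases (text argument and integer argument) can probably be covered by one pattern, with an optional run of spaces before the `>` to absorb the trailing space. A rough sketch, where `quote_single_arg` is a hypothetical name and the argument is assumed to consist of letters, digits and underscores only. Tags that need the trailing `/` (Fmt, Name, Epoch) would have to be handled before this runs, since the pattern matches them too.

```shell
# Hypothetical helper: <Tag Arg> becomes <Tag Tag="Arg">.
# The " *" before ">" tolerates a trailing space as in "<UgUn 0 >".
quote_single_arg() {
    sed -E 's#<([A-Za-z]+) +([A-Za-z0-9_]+) *>#<\1 \1="\2">#'
}
```

Tags without a space (`<Ug>`, `<Un>`) and tags with more than one argument (`<Cg 0 Fm:Input>`) do not match and pass through unchanged.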

There are multi argument tags,
Code:
<Cg 0 Fm:Input>

to convert to,
Code:
<Cg Cg="0">
<Fm>Input</Fm>

I expect that these could be,
Code:
<Cg Cg="0" Fm="Input">
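
Either form can be produced without listing Input and Hidden separately. The sketch below shows both; the function names are hypothetical, and it assumes the tag sits on a line of its own, the first argument is an integer, and the second always has the shape `Fm:Word`.

```shell
# Two-line form: <Cg 0 Fm:Input> becomes <Cg Cg="0"> followed by <Fm>Input</Fm>.
split_cg_tag() {
    awk '
    /<Cg +[0-9]+ +Fm:[A-Za-z]+ *>/ {
        gsub(/[<>]/, "")        # drop the angle brackets, fields are re-split
        split($3, fm, ":")      # fm[1] is "Fm", fm[2] is the value
        printf "<Cg Cg=\"%s\">\n<Fm>%s</Fm>\n", $2, fm[2]
        next
    }
    { print }                   # every other line passes through unchanged
    '
}

# One-line form: <Cg 0 Fm:Input> becomes <Cg Cg="0" Fm="Input">.
merge_cg_tag() {
    sed -E 's#<Cg +([0-9]+) +Fm:([A-Za-z]+) *>#<Cg Cg="\1" Fm="\2">#'
}
```

The two-line form needs a matching `</Fm>` only on the new line, while the one-line form keeps everything in attributes, which may be easier to read back with an XML parser.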

And lastly there is the data that was corrected by the perl in your first post.
Code:
<Un>
-0.112027

converted to,
Code:
<Un>
<Bias>-0.112027</Bias>
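
The same result can be had line by line, without reading the whole file into memory, by remembering that the previous line was `<Un>`. A minimal awk sketch, assuming the bias value is always the single line directly after `<Un>`; the function name `wrap_bias` is made up for illustration.

```shell
# Hypothetical helper: wrap the line that follows <Un> in <Bias>...</Bias>.
wrap_bias() {
    awk '
    after_un        { print "<Bias>" $0 "</Bias>"; after_un = 0; next }
    /<Un>[ \t]*$/   { after_un = 1 }    # remember that the next line is the bias
    { print }
    '
}
```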

Code:
<Cn 10>
0 1.42767
1 1.16508
2 -0.56867
3 -0.272873
4 -0.14623
5 -0.053066
6 0.345557
7 -0.424821
8 -0.507607
9 -0.459116
</Cn>

to convert to,
Code:
<Cn Cn="10">
<C0>1.42767</C0>
<C1>1.16508</C1>
<C2>-0.56867</C2>
<C3>-0.272873</C3>
<C4>-0.14623</C4>
<C5>-0.053066</C5>
<C6>0.345557</C6>
<C7>-0.424821</C7>
<C8>-0.507607</C8>
<C9>-0.459116</C9>
</Cn>

The <Cn tag is listed twice, since it seems to fall into two cases.
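
The data rows between the opening and closing Cn tags can be handled with a small state flag. The sketch below is an awk version under the assumption that every data row is exactly two whitespace-separated fields (index, then value); `wrap_cn_rows` is a hypothetical name. It leaves the opening tag itself alone, so that tag still has to be converted by whichever single-argument rule is used.

```shell
# Hypothetical helper: inside <Cn ...> ... </Cn>, turn "index value"
# into <Cindex>value</Cindex>. Lines outside the block are untouched.
wrap_cn_rows() {
    awk '
    /<Cn/             { in_cn = 1; print; next }    # opening tag, converted or not
    /<\/Cn>/          { in_cn = 0; print; next }    # closing tag ends the block
    in_cn && NF == 2  { printf "<C%s>%s</C%s>\n", $1, $2, $1; next }
    { print }
    '
}
```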

LMHmedchem

Last edited by Franklin52; 04-17-2011 at 10:57 AM.. Reason: Please use code tags
# 13  
Old 04-17-2011
OK, but is everything working fine now? Because I really don't feel like analyzing a 1000+ line XML file. If something is not reformatted properly, then which particular tag is it?
# 14  
Old 04-17-2011
In the script posted above, I see a lot of sed and awk commands chained together inside a bash wrapper. As the file size increases, this approach is going to slow the processing down, since each stage is a separate process that the data has to pass through.

Have you considered writing it in perl, reading and processing the file line by line in a single process? That would be much faster than the current approach.