reformatting xml file, sed or awk I think (possibly perl)


 
# 15  
Old 04-17-2011
Quote:
Originally Posted by bartus11
OK, but is everything working fine now? Because I really don't feel like analyzing a 1000+ line XML file. If something is not reformatted properly, then which particular tag is it?
Sorry about that, I meant to post an abbreviated version of the ill-formed XML, which is only 65 lines. I have attached that here in case anyone wants to have a look. The .zip also includes two .doc files. The first is the ill-formed XML with comments describing the issues; the problem tags are in bold blue. The second is the revised version with comments describing the corrections; additions are in bold red. I hope this will be helpful to anyone looking to correct a similar issue.

The script is working, and I suppose it isn't overly kludgy. I am hard-coding the text encoding as ASCII. Is there an easy way in bash to identify the encoding of the input file?
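
One possible way to approach the encoding question, sketched under a few assumptions: the function name guess_encoding is made up for illustration, and the second branch assumes a file(1) that supports -b --mime-encoding (the common libmagic-based one does). Whatever file reports is a guess from the content, not a guarantee.

```shell
#!/bin/bash
# Hypothetical helper: print a guess at the character encoding of file $1.
guess_encoding() {
    local f=$1
    # If no byte falls outside printable ASCII and whitespace, call it ASCII.
    if ! LC_ALL=C grep -q '[^[:print:][:space:]]' "$f"; then
        echo "us-ascii"
    elif command -v file >/dev/null 2>&1; then
        # file(1) inspects the content; treat the answer as a heuristic.
        file -b --mime-encoding "$f"
    else
        echo "unknown"
    fi
}
```

Something like enc=$(guess_encoding "$infile") could then feed the encoding attribute of the XML declaration instead of the hard-coded value, after checking that the reported name is one the XML consumer accepts.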

I have made a couple of changes. I am naming all the created attributes the same as the tag name, since this is simpler and now the replacements are done with a more general rule. The exception is the UgUn tag, since there is an extra space to be dealt with, I guess. I have also changed to perl based on the format of the earlier posts.

Here is the current script
Code:
#!/bin/bash

infile=$1
outfile="$infile.mod.xml"

# add version and root begin tags to beginning of input file
echo '<?xml version="1.0" encoding="ASCII"?>' > temp1
echo "<net>" >> temp1
cat temp1 "$infile" > temp2

# format information tags (end with />)
# \d+ matches digits; the earlier (\n+) matched newlines, so Epoch never matched here
perl -pe 's/<Fmt\ (\w+)>/<Fmt\ Fmt="$1"\/>/'         temp2 | \
perl -pe 's/<Name\ (\w+)>/<Name\ Name="$1"\/>/'            | \
perl -pe 's/<Epoch\ (\d+)>/<Epoch\ Epoch="$1"\/>/'         | \

# format UgUn tag, whose argument may be followed by extra spaces
sed 's/<UgUn  *\([^ ]\{1,9\}\) *>/<UgUn UgUn="\1">/g'      | \

# format tags where the attribute takes the tag's name
# word | space | integer (\d is the digit class; \n would match newlines)
perl -pe 's/<(\w+)\ (\d+)>/<$1 $1="$2">/'                   | \
# word | space | word
perl -pe 's/<(\w+)\ (\w+)>/<$1 $1="$2">/'                   | \

# format tags with multiple args
awk '{ if ( $0 ~ /<Cg 0 Fm:Input>/ ) {
     printf( "%s\n", "<Cg Cg=\"0\" Fm=\"Input\">" );
     } else {
          print $0;
     }
}'                                                         | \
awk '{ if ( $0 ~ /<Cg 0 Fm:Hidden>/ ) {
     printf( "%s\n", "<Cg Cg=\"0\" Fm=\"Hidden\">" );
     } else {
          print $0;
     }
}'                                                         | \

# format multi-row, multi-arg data
perl -p0e 's/(<Un>\n)(.*)/$1<Bias>$2<\/Bias>/g'            | \
perl -pe 's/(\d+) ([\d.-]+)/<C$1>$2<\/C$1>/ if /<Cn/../<\/Cn/'   > "$outfile"

# add root close tag to end of file
echo "</net>" >> "$outfile"

# cleanup
rm temp1 temp2

The exceptions are the UgUn tag, which I can't get perl to find, and the tags with multiple args, which are still in awk. I am not sure about the current awk solution, since it only matches the specific arg values it searches for, and the args may have other values.

Quote:
Originally Posted by matrixmadhan
In the script posted above, I see a lot of sed and awk commands chained together and executed within a bash wrapper. As the file size increases, this approach is going to slow down the processing terribly, as it is going to keep spawning multiple processes.

Have you considered writing it in perl, reading and processing the file line by line? That will be way faster than the current approach.
I could switch to using more temp files if you think that would help.

I really don't know perl all that well. On the whole, I am pretty poor at regex stuff. I use it some, since it's so d**n convenient, but what I know is just bits and pieces of various interpreters and stream editors. I can modify stuff in perl and awk, but that is about it. I wish I knew this material better, but that is just on the long list of such things.

LMHmedchem

Last edited by LMHmedchem; 04-17-2011 at 04:18 PM..
# 16  
Old 04-17-2011
Quote:
Originally Posted by LMHmedchem
The perl substitute syntax seems much simpler than the sed. Is there a way to id word-number instead of word-word, (\w+) (\w+)?
The digit character class in Perl is \d, so I guess you want this: (\w+) (\d+)
Tomorrow I'll take a look at those .doc files.
# 17  
Old 04-17-2011
Quote:
Originally Posted by bartus11
The digit character class in Perl is \d, so I guess you want this: (\w+) (\d+)
Tomorrow I'll take a look at those .doc files.
I edited the post above and have changed most of the script to perl. I guessed and used (\w+) (\n+) and that seemed to work. I also added an escape, (\w+)\ (\d+), since I wanted to match word | space | number, but maybe that's implicit. The only thing I can't manage is the UgUn tag, where there is a space between the int and the >. I am also not sure what to do with the tags with multiple args, so those are still in awk.

Thanks very much for all your assistance.

LMHmedchem