Extract strings from XML files and create a new XML


 
# 8  
Old 06-16-2015
Hello RudiC,

Thank you for your reply! It doesn't work for me. I assume that the grep command you gave me is missing the XML file from where the information should be extracted.

Milano
# 9  
Old 06-16-2015
You assume incorrectly. The code RudiC provided does exactly what you asked for given the filenames you provided. But, of course we're making assumptions about the utilities you have installed on your system, the shell you're using, and the operating system you're using.

What operating system are you using?
What version of UNIX/Linux utilities are you using?
What shell are you using?
What output did RudiC's code produce on your system?
Are you sure that the filenames you provided contain data in the same format as your sample data? (For instance, does C:/temp/input.txt contain <carriage-return><newline> line terminators instead of the <newline> line terminators expected by UNIX and Linux system utilities?)
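One way to answer the line-terminator question is to count the lines that contain a carriage return and, if there are any, strip them with tr. This is only a sketch: the scratch files and their sample contents are made up for illustration and stand in for C:/temp/input.txt.

```shell
# Hypothetical sample file with DOS-style <carriage-return><newline> terminators.
sample=$(mktemp)
printf "'Belgian Waffles'\r\n'Berry-Berry American Pie'\r\n" > "$sample"

# Hold a literal carriage return in a variable so grep can search for it.
cr=$(printf '\r')

# Count the lines that contain a carriage return.
crlf_lines=$(grep -c "$cr" "$sample")
echo "lines with CR before cleanup: $crlf_lines"

# Strip every carriage return, leaving plain <newline> terminators.
clean=$(mktemp)
tr -d '\r' < "$sample" > "$clean"

# grep -c still prints 0 when nothing matches, but exits non-zero.
clean_cr_lines=$(grep -c "$cr" "$clean" || true)
echo "lines with CR after cleanup: $clean_cr_lines"

rm -f "$sample" "$clean"
```

If the first count is not zero, the patterns read from the file end in an invisible carriage return and will not match the XML lines.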
# 10  
Old 06-19-2015
Yes, this was my mistake. It works well for my example. I will test it on larger files too. Now I am trying to delete the extracted rows from the original XML. I hope I will manage it.
Thank you!

Milano

---------- Post updated at 05:19 AM ---------- Previous update was at 02:28 AM ----------

I tried this to remove the lines that were extracted from the XML file, but I got the error
Code:
unexpected end of file

when I try to run the script.

Code:
sed -i 'iE "$(tr -d "'" </home/qqomtws/kwom/Test_HY/test_hy.txt | tr '\n' '|')" /home/qqomtws/kwom/Test_HY/test_hy.xml' ./infile

I put the command just after the grep command that extracts the lines I am interested in. Any advice?

Thank you!
Milano


---------- Post updated at 04:22 AM ---------- Previous update was at 03:07 AM ----------

Quote:
Originally Posted by RudiC
Better, but still a bit vague. For EXACTLY your setup, this might work:
Code:
grep -iE "$(tr -d "'" <C:/temp/input.txt | tr '\n' '|')header|footer" C:/temp/output.txt
<Header>My favorite restaurant</Header>
         <name>Belgian Waffles</name>
         <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
         <name>Strawberry Belgian Waffles</name>
         <description>Light Belgian waffles covered with strawberries and whipped cream</description>
         <name>Berry-Berry American Pie</name>
         <description>Light American Pie covered with an assortment of fresh berries and whipped cream</description>
<Footer>My favorite restaurant</Footer>

Redirect to C:/temp/output.xml if happy.
Many thanks,
Milano
# 11  
Old 06-19-2015
You can't simply replace one command (grep) with another (sed), keep the identical parameter set, and hope that it works.
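The "unexpected end of file" message itself most likely comes from the quoting rather than from sed: wrapping the whole argument in single quotes means the single quote inside "'" closes that string early, and the last single quote on the line is left unmatched. A minimal sketch to confirm this, assuming a POSIX sh is available, is to let the shell parse the command without running it (the scratch file is hypothetical):

```shell
# Store the sed command from the question in a scratch file.
# The quoted 'EOF' delimiter keeps the shell from interpreting the line.
bad_script=$(mktemp)
cat > "$bad_script" <<'EOF'
sed -i 'iE "$(tr -d "'" </home/qqomtws/kwom/Test_HY/test_hy.txt | tr '\n' '|')" /home/qqomtws/kwom/Test_HY/test_hy.xml' ./infile
EOF

# sh -n only parses the script; nothing in it is executed.
if sh -n "$bad_script" 2>/dev/null; then
    quote_check="balanced"
else
    quote_check="unbalanced"
fi
echo "quotes are $quote_check"

rm -f "$bad_script"
```

The line contains seven single quotes, so one of them can never find a partner and the shell reads to the end of the file looking for it.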
To remove those selected lines from the original file, redirect the grep result to a temp file and try
Code:
grep -vfTMP C:/temp/output.txt
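Put together, the two steps might look like the sketch below. The scratch directory, menu.xml and its contents are invented for the example; only the TMP name and the grep options come from this thread.

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/menu.xml" <<'EOF'
<Header>My favorite restaurant</Header>
<name>Belgian Waffles</name>
<price>5.95</price>
<name>French Toast</name>
<Footer>My favorite restaurant</Footer>
EOF
printf "%s\n" "'Belgian Waffles'" > "$workdir/input.txt"

# Step 1: extract the wanted lines and save them in a temp file.
grep -iE "$(tr -d "'" < "$workdir/input.txt" | tr '\n' '|')header|footer" \
    "$workdir/menu.xml" > "$workdir/TMP"

# Step 2: print every line of the XML that does not match a line saved in TMP.
remaining=$(grep -vf "$workdir/TMP" "$workdir/menu.xml")
echo "$remaining"

rm -rf "$workdir"
```

Write the result of step 2 to a new file and rename it afterwards; redirecting it straight onto the file grep is still reading would truncate the input.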

# 12  
Old 06-19-2015
The obvious simple thing to do (to extract all of the lines that the 1st grep did NOT extract) would be to just rerun that script adding a -v option:
Code:
grep -viE "$(tr -d "'" <C:/temp/input.txt | tr '\n' '|')header|footer" C:/temp/output.txt

Or, you could use RudiC's suggestion (but I would add a -F in case some of the strings extracted by the first grep contain characters that are special in a BRE):
Code:
grep -vFfTMP C:/temp/output.txt
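A small sketch of what -F changes, using made-up data: without it, the period in a saved line is a BRE wildcard, so a line that was never extracted can be removed as well.

```shell
# Hypothetical data: two price lines that differ only in one character.
workdir=$(mktemp -d)
printf '%s\n' '<price>5.95</price>' '<price>5x95</price>' > "$workdir/menu.xml"

# TMP holds the one line that was "extracted".
printf '%s\n' '<price>5.95</price>' > "$workdir/TMP"

# Without -F the saved line is a BRE, and "." matches any character.
kept_bre=$(grep -vf "$workdir/TMP" "$workdir/menu.xml" || true)

# With -F the saved line is a fixed string and only matches itself.
kept_fixed=$(grep -vFf "$workdir/TMP" "$workdir/menu.xml" || true)

echo "kept without -F: [$kept_bre]"
echo "kept with -F:    [$kept_fixed]"

rm -rf "$workdir"
```

Here the BRE version keeps nothing, while the -F version keeps the 5x95 line that should have stayed.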

If you are going to be running this script regularly, I would seriously consider rewriting it to use awk instead of grep. If you use awk you could produce both output files in one pass without needing to run grep twice and without needing to read your input XML file twice:
Code:
awk -v ERE="$(tr -d "'" <C:/temp/input.txt | tr '\n' '|')header|footer" '
BEGIN {ERE = tolower(ERE)}
      {print > ((tolower($0) ~ ERE) ? "matched.xml" : "unmatched.xml")}
' C:/temp/output.txt

Change matched.xml and unmatched.xml to the pathnames of the files you want to contain the matched lines and the unmatched lines, respectively. I assume that you already know that neither of those output files can be the input file for this awk script!
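For a quick trial run, the same awk script can be pointed at throwaway files. In this sketch the scratch directory, the sample XML and the matched and unmatched awk variables are assumptions added for illustration; the matching logic is unchanged.

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/menu.xml" <<'EOF'
<Header>My favorite restaurant</Header>
<name>Belgian Waffles</name>
<price>5.95</price>
<name>French Toast</name>
<Footer>My favorite restaurant</Footer>
EOF
printf "%s\n" "'Belgian Waffles'" > "$workdir/input.txt"

# One pass over the XML: each line goes to exactly one of the two output files.
awk -v ERE="$(tr -d "'" < "$workdir/input.txt" | tr '\n' '|')header|footer" \
    -v matched="$workdir/matched.xml" \
    -v unmatched="$workdir/unmatched.xml" '
BEGIN {ERE = tolower(ERE)}
      {print > ((tolower($0) ~ ERE) ? matched : unmatched)}
' "$workdir/menu.xml"

matched_lines=$(cat "$workdir/matched.xml")
unmatched_lines=$(cat "$workdir/unmatched.xml")
echo "matched:";   echo "$matched_lines"
echo "unmatched:"; echo "$unmatched_lines"

rm -rf "$workdir"
```

Passing the output pathnames with -v keeps them out of the awk program text, so they can be changed without editing the script.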
# 13  
Old 06-22-2015
Thank you very much! It works very well. The advice about using awk was great, too.

Many thanks,
Milano

Last edited by milano.churchil; 06-22-2015 at 07:26 AM. Reason: writing mistake