.PDF and .TXT to .XML. Is it possible?


 
# 8  
Old 03-30-2011
Code:
for filename in `ls -l |awk -F'[\. ]' '/\.txt/||/\.pdf/{++a[$(NF-1)]}END{for(i in a) if(a[i]==2) print i}'`
do
echo -e "<file>\n<text>" `cat $filename.txt` "</text>\n<pdfcontent>"  `openssl base64 -in $filename.pdf` "</pdfcontent>\n</file>\n"  >$filename.xml
done


Last edited by yinyuemi; 03-30-2011 at 09:03 PM..
# 9  
Old 03-30-2011
@yinyuemi, a few things here:
  • a large number of files will overflow the for command line
  • the XML will fail to parse if the txt file contains "<" or "&"
  • a large txt or pdf file will overflow the command line of the echo command

Last edited by Chubler_XL; 03-30-2011 at 09:08 PM..
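
One way to address the second and third points is sketched below, assuming a POSIX shell with sed and the same openssl command used in the script above. The function names xml_escape and make_xml are made up for illustration. The text is escaped with sed instead of being pasted into an echo argument, and each piece is streamed straight to the output file, so file contents never become command-line arguments.

```shell
#!/bin/sh
# Hypothetical helper: escape the characters that break XML parsing.
# "&" is handled first so the entities added afterwards are not escaped again.
xml_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Hypothetical helper: build NAME.xml from NAME.txt and NAME.pdf.
# All output of the group goes to the file through a single redirection.
make_xml() {
    name=$1
    {
        printf '<file>\n<text>'
        xml_escape < "$name.txt"
        printf '</text>\n<pdfcontent>\n'
        openssl base64 -in "$name.pdf"
        printf '</pdfcontent>\n</file>\n'
    } > "$name.xml"
}
```

Calling make_xml report would read report.txt and report.pdf and write report.xml.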
# 10  
Old 03-30-2011
Quote:
Originally Posted by Chubler_XL
@yinyuemi, a few things here:
  • a large number of files will overflow the for command line
  • the XML will fail to parse if the txt file contains "<" or "&"
  • a large txt or pdf file will overflow the command line of the echo command
Good points, thanks Chubler_XL.
I have no idea how XML parsing works, so I followed your code.
How about this? Please let me know, as usual, if there is any problem.

Code:
for filename in `ls -l |awk -F'[. ]' '/\.txt/||/\.pdf/{++a[$(NF-1)]}END{for(i in a) if(a[i]==2) print i}'`
do
echo -e "<file>\n<text><![CDATA[" `sed 's/]]>/] ]>/g' $filename.txt ` "]]></text>\n<pdfcontent>"  `openssl base64 -in $filename.pdf` "</pdfcontent>\n</file>\n"  >$filename.xml
done


Last edited by yinyuemi; 03-30-2011 at 09:55 PM..
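
A note on the sed substitution above: replacing "]]>" with "] ]>" keeps the XML well formed but changes the text. A common lossless alternative is to close the CDATA section in the middle of that sequence and open a new one straight away. A minimal sketch, where cdata_wrap is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: wrap stdin in a CDATA section.
# Each literal "]]>" is split across two sections ("]]" ends the first,
# ">" starts the next), so a parser gives back the original text.
cdata_wrap() {
    printf '<![CDATA['
    sed 's/]]>/]]]]><![CDATA[>/g'
    printf ']]>'
}
```

It would be used as cdata_wrap < "$filename.txt" in place of the sed command substitution.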
# 11  
Old 03-30-2011
Ruby(1.9+)
Code:
#!/usr/bin/env ruby  

# xml template
xml=<<EOF
<file>
<text>%s</text>
<pdfcontent>%s</pdfcontent>
</file>
EOF

Dir["*.txt"].each do |file|
    filename=file.sub(/\.txt$/,"")
    pdf = filename+".pdf"
    xmlfile = filename+".xml"
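    # File.exist? replaces File.exists?, which was removed in Ruby 3.2
    if File.exist?( pdf )
        w = sprintf( xml , file, pdf )
        # the block form closes the file, so the output is flushed
        File.open(xmlfile,"w") { |f| f.write(w) }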
    end
end

# 12  
Old 03-31-2011
kurumi, how does that generate the base64 encoding of the PDF file?
# 13  
Old 03-31-2011
Quote:
Originally Posted by Chubler_XL
kurumi, how does that generate the base64 encoding of the PDF file?
Well, I missed that out, didn't I?

To generate the base64 encoding:

Code:
require 'base64'

# xml template
xml=<<EOF
<file>
<text>%s</text>
<pdfcontent>%s</pdfcontent>
</file>
EOF

Dir["*.txt"].each do |file|
    filename=file.sub(/\.txt$/,"")
    pdf = filename+".pdf"
    xmlfile = filename+".xml"
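    # File.exist? replaces File.exists?, which was removed in Ruby 3.2
    if File.exist?( pdf )
        # read the PDF in binary mode and base64-encode it
        b4=Base64.encode64( File.binread(pdf) )
        # as posted, <text> receives the .txt file name, not its contents
        w = sprintf( xml , file, b4 )
        # the block form closes the file, so the output is flushed
        File.open(xmlfile,"w") { |f| f.write(w) }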
    end
end
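
For comparison with the shell versions: Base64.encode64 inserts a line feed after every 60 encoded characters, while openssl base64 wraps its output at 64. Decoders that ignore whitespace accept either. If a single unbroken line is preferred inside <pdfcontent>, the -A option of openssl base64 produces one (the Ruby counterpart is Base64.strict_encode64). A small sketch, with encode_one_line as a made-up name:

```shell
#!/bin/sh
# Hypothetical helper: print the base64 encoding of file $1 on one line.
encode_one_line() {
    openssl base64 -A -in "$1"
    printf '\n'    # with -A, openssl prints no newline of its own
}
```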

---------- Post updated at 10:21 PM ---------- Previous update was at 10:18 PM ----------

Quote:
Originally Posted by yinyuemi
Good points, thanks Chubler_XL.
I have no idea how XML parsing works, so I followed your code.
How about this? Please let me know, as usual, if there is any problem.

Code:
for filename in `ls -l |awk -F'[. ]' '/\.txt/||/\.pdf/{++a[$(NF-1)]}END{for(i in a) if(a[i]==2) print i}'`
do
echo -e "<file>\n<text><![CDATA[" `sed 's/]]>/] ]>/g' $filename.txt ` "]]></text>\n<pdfcontent>"  `openssl base64 -in $filename.pdf` "</pdfcontent>\n</file>\n"  >$filename.xml
done

One problem I see is listing the files with ls -l. A simple shell glob expansion will do; there is no need for ls -l.
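
The glob approach could look like the sketch below (list_pairs is a hypothetical name, not something from the scripts above). The shell expands ./*.txt itself, so names containing spaces or extra dots survive, which the awk field splitting on ls -l output does not handle.

```shell
#!/bin/sh
# Hypothetical helper: print the base name of every .txt file in the
# current directory that has a matching .pdf next to it.
list_pairs() {
    for txt in ./*.txt; do
        [ -e "$txt" ] || continue    # the glob matched nothing
        name=${txt%.txt}             # strip the extension
        name=${name#./}              # strip the leading ./
        if [ -f "$name.pdf" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

It can then drive the conversion loop, for example: list_pairs | while IFS= read -r name; do echo "$name"; done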
# 14  
Old 03-31-2011
Thanks kurumi. It's nice to see a Ruby script that's more than a one-liner.