.PDF and .TXT to .XML. Is it possible?


 
# 1  
Old 03-30-2011

Hi!

I need to accomplish the following task.
In a folder I have files like these:
name1.txt
name1.pdf
name2.txt
name2.pdf
etc...

I want to scan this folder, match files with the same base name (name1.txt with name1.pdf, name2.txt with name2.pdf), and create name1.xml and name2.xml from them. That is, I want to create an .xml file with this structure:
<file>
<text>...</text>
<pdfcontent>...</pdfcontent>
</file>

where between the <text> tags I must put the content of the nameX.txt file,
and between the <pdfcontent> tags I must put the binary content of nameX.pdf, encoded as base64 or something like it.

Thanks.
# 2  
Old 03-30-2011
Okay: find lists all the .pdf and .txt files in the current directory, awk splits each path on "." and keeps the second field, turning ./filename.txt into /filename, and sort -u removes any duplicates from that list. From there you feed the base names into the shell one by one and match "./${BASE}".*, which expands to .//filename.* (equivalent to ./filename.*) and matches only that group of files. You loop through that group in turn and handle each file according to its type. Each time you do that, redirect the output into the new .xml file.

Code:
find . -maxdepth 1 -type f \( -iname '*.txt' -o -iname '*.pdf' \) |
        awk -v FS="." '{ print $2 }' | sort -u |
while read BASE
do
        ( echo "<file>"

        for FILE in "./${BASE}".*
        do
        case "$FILE" in
        *.pdf)
            printf "<pdfcontent>"
            openssl base64 < "$FILE"
            printf "</pdfcontent>\n"
            ;;
        *.txt)
            printf "<text>"
            cat "$FILE"
            printf "</text>\n"
            ;;
        *) # Do nothing for files of the wrong type, i.e. .xml
            ;;
        esac
        echo "</file>" ) > "./${BASE}".xml
done

# 3  
Old 03-30-2011
Here is how to do it:

Code:
#!/bin/bash

# Placeholder: set this to the directory that holds the .txt/.pdf pairs
dir="/path/to/directory"

cd "$dir" || exit 1
for ea_file in *.txt
do
    # Strip the .txt extension to get the base name
    fname=${ea_file%.txt}
    if [ -f "${fname}.pdf" ]; then
       echo "<file>" > "${fname}.xml"
       echo "<text>${fname}.txt</text>" >> "${fname}.xml"
       echo "<pdf>${fname}.pdf</pdf>" >> "${fname}.xml"
       echo "</file>" >> "${fname}.xml"
     else
       echo "WARNING: The file ${fname}.txt has no corresponding ${fname}.pdf"
     fi
done


Last edited by Scott; 03-30-2011 at 08:48 PM.. Reason: Please use code tags
# 4  
Old 03-30-2011
Corona688,
Sorry, but I get the following error:
Code:
alex@alex:~/123$ . test.sh
bash: test.sh: string 23: Syntax error: word unexpected (expecting ")")
bash: test.sh: string 23: `        echo "</file>" ) > "./${BASE}".xml'

dajon,
Thank you, but your script puts only the names of the files into the .xml file. I need to put the contents of the .txt and .pdf files into it.

Last edited by Scott; 03-30-2011 at 08:49 PM..
# 5  
Old 03-30-2011
Not sure where that went wrong; this one runs:
Code:
find . -maxdepth 1 -type f \( -iname '*.txt' -o -iname '*.pdf' \) |
        awk -v FS="." '{ print $2 }' | sort -u |
while read -r BASE
do
        ( echo "<file>"
        for FILE in "./${BASE}".*
        do
                case "$FILE" in
                *.txt)
                        printf "<text>"
                        cat "$FILE"
                        echo "</text>"
                        ;;
                *.pdf)
                        printf "<pdfcontent>"
                        openssl base64 < "$FILE"
                        echo "</pdfcontent>"
                        ;;
                *)
                        ;;
                esac
        done
        echo "</file>"

        ) > "./${BASE}.xml"
done
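
One caveat with the awk step in the script above: splitting on "." keeps only the text up to the first dot in the name, so a file such as report.v2.txt would yield /report. A minimal sketch of an alternative that strips only the final extension with shell parameter expansion follows. The names base_of and list_bases are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# base_of is a hypothetical helper: it prints its argument with only the
# final extension removed, so dots inside the name survive.
base_of() {
    printf '%s\n' "${1%.*}"
}

# list_bases (also hypothetical) prints each base name once for the
# .txt and .pdf files in the current directory.
list_bases() {
    for f in ./*.txt ./*.pdf
    do
        [ -f "$f" ] || continue     # skip unmatched globs and non-files
        base_of "$f"
    done | sort -u
}
```

The output of list_bases could be piped into the same while loop in place of the find, awk and sort stage. Unlike find with -iname, the globs here are case-sensitive, so NAME1.PDF would not be picked up.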

# 6  
Old 03-30-2011
How about this?
Code:
ls -l |awk -F'[\. ]' '/\.txt/||/\.pdf/{++a[$(NF-1)]}END{for(i in a) if(a[i]==2) print "<file>\n<text>"i".txt<\/text>\n<pdfcontent>"i".pdf<\/pdfcontent>\n<\/file>" >i".xml"}'

# 7  
Old 03-30-2011
dajon and yinyuemi: the actual file contents are required between <text> and </text>, and the base64 encoding of the PDF file between <pdfcontent> and </pdfcontent>.

Corona688, the text will need to be wrapped in a CDATA section, because the text file may contain "<" and "&", and these are illegal in XML character data.
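
As an alternative to CDATA, the special characters can be replaced with entity references before the text goes between the tags. A minimal sketch, assuming the text arrives on standard input; xml_escape is a hypothetical helper name, not something from the scripts in this thread.

```shell
#!/bin/sh
# xml_escape is a hypothetical helper: it reads text on stdin and writes it
# with "&", "<" and ">" replaced by entity references.
# "&" must be handled first, otherwise the "&" in "&lt;" would be escaped again.
xml_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}
```

For example, xml_escape < name1.txt could take the place of cat "$FILE" in the *.txt branch.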

optik77, I wonder whether it would be better to base64-encode the text file too.

Code:
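for pdffile in *.pdf
do
   # Derive the matching .txt and .xml names from the .pdf name
   txtfile=${pdffile%.pdf}.txt
   xmlfile=${pdffile%.pdf}.xml
   if [ -f "$pdffile" ] && [ -f "$txtfile" ]
   then
        {
        printf '<file>\n<text><![CDATA['
        # "]]>" would end the CDATA section early, so break it up
        sed 's/]]>/] ]>/g' "$txtfile"
        printf ']]></text>\n<pdfcontent>'
        openssl base64 < "$pdffile"
        echo "</pdfcontent>"
        echo "</file>"
        } > "$xmlfile"   # only written when both files exist
    fi
done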
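
Note that the sed line above changes the text slightly, because it inserts a space into every "]]>". If the text has to survive unchanged, one common approach is to close the CDATA section after "]]" and reopen it before ">", so that a parser joins the pieces back into "]]>". A sketch under that assumption; cdata_wrap is a hypothetical helper name.

```shell
#!/bin/sh
# cdata_wrap is a hypothetical helper: it reads text on stdin and writes it
# as one or more CDATA sections.
cdata_wrap() {
    printf '<![CDATA['
    # Close the section after "]]" and reopen it before ">".
    sed 's/]]>/]]]]><![CDATA[>/g'
    printf ']]>'
}
```

In the loop above, cdata_wrap < "$txtfile" would then stand in for the sed call and the CDATA opening and closing that the two printf lines around it emit.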


Last edited by Chubler_XL; 03-30-2011 at 09:44 PM..