'for LINE in $(cat file)' breaking at spaces, not just newlines


 
# 1  
Old 08-18-2011

Hello. I'm making a (hopefully) simple shell script xml parser that outputs a file I can grep for information. I am writing it because I have yet to find a command line utility that can do this. If you know of one, please just stop now and tell me about it. Even better would be one I can input the XML file and path to the data (below example would be "foo/bar/c") and get the value. I know this can easily be done in python, but I have no experience in that and I want to make it part of a bigger program.

My goal, if writing the full parser, is to have the input file, for example,
Code:
<foo a="1">
    <bar b="2" c="Some string">
    </bar>
</foo>

output something like
Code:
foo.a="1"
foo.bar.b="2"
foo.bar.c="Some string"

The way I thought about doing this was
Code:
for LINE in $(cat file.xml)
do
if [ -n "$(echo $LINE | grep '<.*>' | grep -ve '<\/.*>')"

followed by how to parse it if it's an open tag <tag> and an if statement for ending tags </tag>.

NOW HERE'S THE PROBLEM:
"cat file.xml" works perfectly in the terminal, but when I use it in a for loop, assign it to an array, or anything else, it breaks at spaces as well as newlines. So if I echo $LINE in the for loop, the output is
Code:
<foo
a="1">
<bar
b="2"
c="Some
string">
</bar>
</foo>

which works fine for numeric values, but some of my values are strings containing spaces, which breaks the code since the value is split in half.

How can I get $LINE to be the full line, not just what's between two spaces? Or what command line utility could extract a piece of information from an XML file?
# 2  
Old 08-18-2011
Code:
for LINE in $(cat file.xml)

This is a useless use of cat. Where did you learn this? We need to find and stop whoever's teaching it: new shell scripters always try it, and it's pretty much always wrong. You've discovered only one of the many problems with this programming pattern.

Try this instead -- faster, splits where you want, doesn't waste memory, and doesn't truncate large files.
Code:
while read LINE
do
...
done < file.xml
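
To see why the two loops behave differently: an unquoted $(cat file.xml) is split on every character in IFS (space, tab and newline by default), while read consumes one line per call. A minimal sketch, assuming the sample XML from the first post; the temporary file and the counter names are made up for the demonstration.

```shell
#!/bin/bash
# Recreate the sample XML from the first post in a temporary file.
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<foo a="1">
    <bar b="2" c="Some string">
    </bar>
</foo>
EOF

# for + $(cat ...): the unquoted substitution is split on spaces, tabs AND newlines.
for_count=0
for WORD in $(cat "$xml_file")
do
    for_count=$((for_count + 1))
done

# while read: one iteration per line of the file.
while_count=0
while read -r LINE
do
    while_count=$((while_count + 1))
done < "$xml_file"

echo "for loop iterations:   $for_count"
echo "while loop iterations: $while_count"
rm -f "$xml_file"
```

The for loop runs once per whitespace-separated word, the while loop once per line, which is the difference the first post ran into.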

Code:
if [ -n "$(echo $LINE | grep '<.*>' | grep -ve '<\/.*>')"

You don't always have to check a program's output text; you can check its exit status instead.
Code:
if echo "$LINE" | grep -q ...
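
Spelled out a little more, the exit-status version of the original test might look like the sketch below. The function name is_open_tag is an assumption chosen for illustration, and the two patterns simply mirror the ones from the first post.

```shell
#!/bin/bash
# is_open_tag (hypothetical helper): succeeds when the line contains a tag
# but no closing tag, judged only by grep's exit status.
is_open_tag() {
    printf '%s\n' "$1" | grep -q '<.*>' &&
        ! printf '%s\n' "$1" | grep -q '</.*>'
}

for LINE in '<foo a="1">' '</foo>' 'plain text'
do
    if is_open_tag "$LINE"
    then
        echo "open tag:  $LINE"
    else
        echo "other:     $LINE"
    fi
done
```

grep -q prints nothing and only sets the exit status, so there is no output text to capture or compare.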

The xmlstarlet utility can do lots of things with XML but is quite difficult to use. I don't know a good general-purpose XML solution because parsing XML is definitely not trivial in shell or any other language.
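
For the specific lookup asked about in the first post (the value at foo/bar/c), an xmlstarlet call can be fairly short. This is a sketch, assuming the program is installed under the name xmlstarlet (some systems install it as xml); the temporary file is only there to make the example self-contained.

```shell
#!/bin/bash
# Recreate the sample XML from the first post in a temporary file.
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<foo a="1">
    <bar b="2" c="Some string">
    </bar>
</foo>
EOF

if command -v xmlstarlet >/dev/null 2>&1
then
    # sel = select; -t starts a template, -v prints the value of an XPath.
    # /foo/bar/@c means: attribute c of element bar inside element foo.
    value=$(xmlstarlet sel -t -v '/foo/bar/@c' "$xml_file")
    echo "$value"
else
    echo "xmlstarlet is not installed" >&2
fi

rm -f "$xml_file"
```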

Working on something.
# 3  
Old 08-18-2011
Don't write XML parsers in shell. Not even the simplest one. Use the right tool for the job.
---
Code:
while read LINE; do
...
done <INPUTFILE

won't work well either, because:
1. It strips leading whitespace.
2. It skips the last line if that line doesn't end with a newline character.
3. It has quirks if a line contains backslashes:
Code:
% cat TESTFILE
<foo a="1">
    <bar b="2" c="Some string\n">
    </bar>
</foo>%
% while read -r LINE; do
  echo "$LINE"
done <TESTFILE
<foo a="1">
<bar b="2" c="Some string
">
</bar>

Yes, there are ways to deal with this, and shell is a programming language that lets you do anything. But sed is Turing-complete too.
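
One common way to deal with all three points is sketched below: IFS= keeps leading whitespace, -r keeps backslashes literal, and the extra [ -n "$LINE" ] test picks up a final line that has no trailing newline. printf is used instead of echo because some shells' echo expands \n by itself. The temporary file and the lines array are only there for the demonstration.

```shell
#!/bin/bash
# Build a test file like the one above: indented lines, a backslash,
# and NO newline after the last line.
test_file=$(mktemp)
printf '%s\n' '<foo a="1">' '    <bar b="2" c="Some string\n">' '    </bar>' > "$test_file"
printf '%s' '</foo>' >> "$test_file"

lines=()
# IFS=          -> do not strip leading/trailing whitespace
# -r            -> do not treat backslash as an escape character
# || [ -n ... ] -> still process a last line that lacks a newline
while IFS= read -r LINE || [ -n "$LINE" ]
do
    lines+=("$LINE")
    printf '%s\n' "$LINE"
done < "$test_file"

rm -f "$test_file"
```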

Last edited by yazu; 08-18-2011 at 11:00 PM.. Reason: Added the word "well" and the proof.
# 4  
Old 08-18-2011
Thanks! I heard of using read in a while loop, but I didn't know how to get the information into it. I'll try it and get back to you.
# 5  
Old 08-18-2011
Here's an awesomely ugly bash mini-xml parser:

Code:
#!/bin/bash

TAG=()
TPOS=0

# Turn <tag> <stuff> into < tag > \n < stuff >
sed 's#># >\n#g;s#<\(/*\)#<\1 #g;' > /tmp/$$

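# Read the tokenized XML one line at a time; -r keeps backslashes literal
while read -r LINE
do
	[ -z "$LINE" ] && continue

	# set -- "<" "stuff" ">" makes $1="<", $2="stuff", $3=">"
	# $LINE is deliberately unquoted so that it splits into words
	set -- $LINE

	while [ "$#" -gt 0 ]
	do
		case "$1" in
		"<")	# Opening tag: push its name onto the tag stack
			TAG[$((TPOS++))]=$2
			shift
			;;

		"</")	# Closing tag: pop the stack so stale names don't linger
			((TPOS--))
			unset "TAG[$TPOS]"
			shift
			;;

		">")	;;

		*)	# Anything else is treated as an attribute:
			# print it under the dot-joined tag path
			OLDIFS="$IFS"	;	IFS="."
			echo "${TAG[*]}.$1"
			IFS="$OLDIFS"
			;;
		esac

		shift
	done
done < /tmp/$$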

rm -f /tmp/$$

It's as full of holes as Swiss cheese but (almost) works for your test data. A "proper" parser probably isn't feasible.

Last edited by Corona688; 08-18-2011 at 10:59 PM..
# 6  
Old 08-18-2011
Quote:
Originally Posted by Corona688
Here's an awesomely ugly bash mini-xml parser:

Code:
#!/bin/bash

TAG=()
TPOS=0

# Turn <tag> <stuff> into < tag > \n < stuff >
sed 's#># >\n#g;s#<\(/*\)#<\1 #g;' > /tmp/$$
...
...

Wow you're right that looks ugly lol.

@yazu:
there won't be any random escapes or anything in the XML. I will be preprocessing it, so I know what will be in it and I can filter out anything I don't want. Also, this is highly application-specific. I plan on using it on one XML that just keeps being updated with different data.

After all this fuss, I think I should give you guys my point and see if you have better suggestions. I am running a Linux home media server. I use pianobar for Pandora, will soon be setting up a music player for my local audio, and was hoping to use something like this:

*turns out I can't post links yet. Google Search "script listen to iheartradio", it's the top result from the maemo site.*


to stream internet radio. My problems with it are:
*keeps spitting out song info because of "print" in a loop
*no way to change stations on the fly
*has a timeout
*doesn't have simple arguments like:
--scriptname channel <channelname>
--scriptname stop
--scriptname info
etc.

I knew if I could extract the url and song/artist info from the XML, I would be able to easily write a shell script that could accomplish all of this. That turned out to be a much tougher task than expected. If any of you have Python experience and can give some suggestions, please let me know. I would be teaching myself Python right now, but I'm too busy and, as you can see, I am still having trouble with bash.

Edit:

The main thing I want is to know what is going on. Please whoever answers following this, please include comments in code, descriptive variables, or an explanation. Thank you.

and @corona688:
my code looked surprisingly similar to yours once I analyzed what yours did

And yeah, I should have stated this all from the start. I just wanted a little help where I was stuck so I could try to figure the rest out myself, but now I do see how hard it is to work with XMLs in bash. I guess I was going completely in the wrong direction. Besides, with my code the way it was, it worked fine for anything without a space, but it took 30 seconds to parse. That is 3 times as long as the original python script's refresh interval!!!

Last edited by natedawg1013; 08-18-2011 at 11:35 PM..
# 7  
Old 08-18-2011
Quote:
Originally Posted by natedawg1013
The main thing I want is to know what is going on. Please whoever answers following this, please include comments in code, descriptive variables, or an explanation. Thank you.
I was too embarrassed by it to dwell on it, heh. And the one obvious way left to improve it so it works better with your data -- eval, so that the string a="some string" would be processed literally by the shell to set the variable a -- would open up gigantic security holes. Someone could name a song `rm -rf ~/` and it'd do it...
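
For what it's worth, attribute values can be pulled apart without eval. The sketch below is an illustration under assumptions, not a real parser: it uses bash's [[ =~ ]] regex match to peel name="value" pairs off a single tag line and stores them in an associative array (bash 4 or later), so a value is only ever treated as data. The names parse_attrs and ATTRS are made up.

```shell
#!/bin/bash
# ATTRS and parse_attrs are hypothetical names for this sketch.
declare -A ATTRS

# parse_attrs LINE: store every name="value" pair found in LINE in ATTRS.
# Values are copied as plain strings and never executed.
parse_attrs() {
    local rest=$1
    local pattern='([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"(.*)'
    while [[ $rest =~ $pattern ]]
    do
        ATTRS["${BASH_REMATCH[1]}"]=${BASH_REMATCH[2]}
        rest=${BASH_REMATCH[3]}    # keep scanning after this pair
    done
}

# The backticks in d are stored literally; nothing gets run.
parse_attrs '<bar b="2" c="Some string" d="`echo hacked`">'

for name in "${!ATTRS[@]}"
do
    printf '%s=%s\n' "$name" "${ATTRS[$name]}"
done
```

Because the value is never handed back to the shell for evaluation, a hostile attribute value stays an ordinary string, and values containing spaces survive intact.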