Need help with complicated script (reading directories, extracting data)


 
# 1  
Old 02-19-2007

Hi, people.
I was searching for script tutorials/examples and found this forum, and thought I'd ask for help, because there seem to be many wise people here.
(Try to bear with me, I'm a Windows user, so this stuff is somewhat strange to me, OK? :) )

Anyways, I need to create a script that does the following:

1) The script goes through all the subdirectories named "XML" in a given directory. The subdirectories can be many levels down the directory structure, or simply subdirectories of a specified starting directory.

2) In these subdirectories the script would need to open all the files with the extension *.xhtml.

3) For each xhtml file, the script would need to extract all the words between body tags (not any of the tags, but the text content of the page)

4) The script would need to report how many times each word occurs across all the files.

5) The script would need to produce a complete list of existing words and number of times they have been spotted.


With my very poor unix script skills, I haven't got very far with this...
I know some basic commands, and I know that in theory I should be using grep and uniq -c and other stuff, but I just can't put this all together.
And because this is actually the first step in the research I'm doing (the wordlist would be material for that), I'm pretty much stuck here, and can't do my real study work.

So help with any and all parts of the script is highly appreciated.

And links to existing examples that could be more or less easily converted to do the above would be a great help too.


OK, thanks.
(and don't shoot a Windows user... :) )
# 2  
Old 02-19-2007
not tested - don't have access to xhtml:

Code:
# mypath = the starting top-level directory
cd /path/to/mypath
find . -type d -name 'XML' > xml.lis
while read dir 
do
	find $dir -name '*.xhtml' -exec sed -n '<body>/,/\/body/p'
done < xml.lis  | sed 's/<body>//g' | sed 's/\</body>//g' > words.lis

awk '{ for(i=1;i<=NF;i++) { arr[$i]++} }
     END { for (i in arr) { print i, arr[i]}' words.lis > finalwordcount

# 3  
Old 02-19-2007
Wow, thanks Jim.
That looks very nice.

I must take time to actually go through that step by step to better understand what it really does, and as soon as I can access the unix server again, try it out.

I'm sure this will help a lot.
Thanks again.


If someone has a different approach or further suggestions to some part of the overall script, I'd be interested to see them too.
As far as the real task goes (studying words and key concepts on a website), any script that provides the wordlist is fine, but seeing different solutions might help me to better understand unix scripts.
# 4  
Old 02-20-2007
I ran the above script on the server, and it worked halfway...

This is what I got:

$ ./wordlist
sed: -e expression #1, char 12: unknown option to `s'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
awk: cmd. line:2: END { for (i in arr) { print i, arr[i]}
awk: cmd. line:2: ^ unexpected newline or end of string


It creates the files xml.lis, words.lis and finalwordcount.

In the xml.lis there are all the paths for all XML directories (though no new lines, so basically all the directory paths have been put together).
Both of the other two files are completely empty. They are created, but contain no information.

I tried to google some suggestions on how to change the script, but didn't find any answer.

However, I believe there is something missing (or something extra) in the sed delimiters?

And about the awk part, should there be a BEGIN somewhere to match the END? All the examples I found had them both (or neither).


OK, any and all help still very much appreciated.
I think this may need only some minor adjustment to start working.

(On a side note, I'm beginning to re-learn stuff here. Wondering about that sed thing reminded me that I actually learned the basic use of that command some 5 years ago. But being a Windows user, I forgot all about it, until now. :) )
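
For reference, each of the three error messages above points at a specific spot in the script from post #2:

- sed, "unknown option to `s'": in 's/\</body>//g' the slash of the closing tag is not escaped, so sed treats it as the end of the pattern and then trips over the extra "/" at character 12.
- find, "missing argument to `-exec'": the -exec command has no {} placeholder for the file name and no terminating \; (the sed address is also missing its leading "/").
- awk, "unexpected newline or end of string": the END block is missing its closing brace. A BEGIN block is not required to use END.

find printed its error eight times, once per line read from xml.lis, so that file most likely does hold one path per line; an editor that does not recognize Unix line endings would show the paths run together.

A corrected sketch follows. It is only checked against small hand-made files, and it assumes that XML directories are not nested inside one another, that no tag spans more than one line, and that words are whatever is separated by whitespace (punctuation and upper/lower case are not normalized). The function name wordlist and writing to standard output are illustrative choices, not part of the original script.

```shell
#!/bin/sh
# wordlist START_DIR
# Print "word count" lines for the body text of every *.xhtml file found
# under any directory named XML below START_DIR.
wordlist() {
    # list every directory named XML, one path per line
    find "$1" -type d -name 'XML' |
    while IFS= read -r dir
    do
        # -exec needs the {} placeholder and a terminating \;
        # the sed range prints the lines from <body ...> through </body>
        find "$dir" -type f -name '*.xhtml' \
            -exec sed -n '/<body/,/<\/body>/p' {} \;
    done |
    # replace every tag with a space; this pattern contains no literal /,
    # so the delimiters of the s command stay unambiguous
    sed 's/<[^>]*>/ /g' |
    # count each whitespace-separated word; END needs no matching BEGIN
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print w, count[w] }' |
    # awk returns array keys in no particular order, so sort the list
    sort
}
```

Saved in a script file, it could be called as wordlist /path/to/mypath > finalwordcount, which produces the same output file name as the original script.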