Shell Programming and Scripting: Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts and shell scripting languages here.
Need help with complicated script (reading directories, extracting data)
Hi, people.
I was searching for script tutorials and examples, found this forum, and thought I'd ask for help, because there seem to be many wise people here. (Try to bear with me: I'm a Windows user, so this stuff is somewhat strange to me, OK?)
Anyway, I need to create a script that does the following:
1) The script goes through all the subdirectories named "XML" in a given directory. The subdirectories can be many levels down the directory structure, or simply subdirectories of a specified starting directory.
2) In these subdirectories the script would need to open all the files with the extension *.xhtml.
3) For each xhtml file, the script would need to extract all the words between the body tags (not any of the tags, but the text content of the page).
4) The script would need to report how many times each single word has appeared across all the files.
5) The script would need to produce a complete list of the existing words and the number of times they have been spotted.
With my very poor unix script skills, I haven't got very far with this... I know some basic commands, and I know that in theory I should be using grep and uniq -c and other stuff, but I just can't put it all together. And because this is actually the first step in some research I'm doing (the wordlist would be material for that), I'm pretty much stuck here and can't do my real study work.
So help with any and all parts of the script is highly appreciated. Links to existing examples that could be more or less easily converted to do the above would be a great help too.
OK, thanks. (And don't shoot a Windows user...)
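The five steps above can be sketched as a single pipeline built around the `uniq -c` idea mentioned in the post. This is only a rough sketch under some assumptions: `count_words` is a made-up function name, the `<body ...>` and `</body>` tags are assumed to sit on lines of their own, and every tag is assumed to open and close on the same line. Entities such as `&amp;` are not decoded.

```shell
#!/bin/sh
# Sketch: word frequencies for the <body> text of every *.xhtml file
# that lives below a directory named XML.
# count_words is a hypothetical helper name, not something from the thread.
count_words() {
    top=${1:-.}
    # Steps 1-2: find every *.xhtml file somewhere below a directory named XML
    find "$top" -type f -path '*/XML/*' -name '*.xhtml' -exec cat {} + |
        # Step 3: keep only the lines from <body ...> through </body>
        sed -n '/<body/,/<\/body>/p' |
        # ...and replace every tag on those lines with a space
        sed 's/<[^>]*>/ /g' |
        # Split into one word per line (letters only) and lower-case them
        tr -cs '[:alpha:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        # Steps 4-5: count each distinct word, most frequent first
        sort | uniq -c | sort -rn
}
```

Calling `count_words /path/to/mypath > finalwordcount` would write one "count word" pair per line. Words are lower-cased before counting, so "Hello" and "hello" are treated as the same word; drop the second `tr` if that is not wanted.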
|
||||
|
Not tested (I don't have access to any xhtml files):
Code:
# mypath = the starting top-level directory
cd /path/to/mypath
find . -type d -name 'XML' > xml.lis
while read dir
do
find $dir -name '*.xhtml' -exec sed -n '<body>/,/\/body/p'
done < xml.lis | sed 's/<body>//g' | sed 's/\</body>//g' > words.lis
awk '{ for(i=1;i<=NF;i++) { arr[$i]++} }
END { for (i in arr) { print i, arr[i]}' words.lis > finalwordcount
|
|
||||
|
Wow, thanks Jim.
That looks very nice. I must take time to actually go through it step by step to better understand what it really does, and as soon as I can access the unix server again, try it out. I'm sure this will help a lot. Thanks again.
If someone has a different approach or further suggestions for some part of the overall script, I'd be interested to see them too. As far as the real task goes (studying words and key concepts on a website), any script that provides the wordlist is fine, but seeing different solutions might help me to better understand unix scripts.
|
||||
|
I ran the above script on the server, and it worked halfway...
This is what I got:
$ ./wordlist
sed: -e expression #1, char 12: unknown option to `s'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
find: missing argument to `-exec'
awk: cmd. line:2: END { for (i in arr) { print i, arr[i]}
awk: cmd. line:2: ^ unexpected newline or end of string
It creates the files xml.lis, words.lis and finalwordcount. In xml.lis there are the paths of all the XML directories (though with no new lines, so basically all the directory paths have been run together). The other two files are completely empty: they are created, but contain no information.
I tried to google some suggestions on how to change the script, but didn't find any answer. However, I believe there is something missing (or something extra) in the sed delimiters? And about the awk part, should there be a BEGIN somewhere to match the END? All the examples I found had both (or neither).
OK, any and all help is still very much appreciated. I think this may need only some minor adjustment to start working.
(On a side note, I'm beginning to re-learn stuff here. Wondering about that sed thing reminded me that I actually learned the basic use of that command some 5 years ago. But being a Windows user, I forgot all about it, until now.)
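Each of the three error messages reported above points at a small syntax slip in the posted script, so a corrected version might look like the sketch below. It is a sketch, not a tested fix on the original server: `wordcount` is a made-up wrapper name, and the intermediate files xml.lis and words.lis are replaced by pipes.

- sed: in `s/\</body>//g` the slash inside `</body>` ends the pattern early, so sed reads the rest as unknown flags. Using a different delimiter such as `|` avoids that.
- find: `-exec` needs the `{}` placeholder and a terminating `\;`. The sed address inside it was also missing its opening slash.
- awk: the END block was missing its closing brace. No BEGIN is needed; BEGIN and END are independent.

```shell
#!/bin/sh
# Corrected sketch of the script posted earlier in the thread.
# wordcount is a hypothetical wrapper name; the starting directory is $1.
wordcount() {
    top=${1:-.}
    # List every directory named XML, one path per line
    find "$top" -type d -name 'XML' |
    while read -r dir
    do
        # -exec takes the command, the {} placeholder, then an escaped semicolon
        find "$dir" -name '*.xhtml' -exec sed -n '/<body>/,/<\/body>/p' {} \;
    done |
    # Use | as the s delimiter so the / in </body> does not end the pattern
    sed -e 's|<body>||g' -e 's|</body>||g' |
    # Count every whitespace-separated field; END works without a BEGIN
    awk '{ for (i = 1; i <= NF; i++) arr[$i]++ }
         END { for (w in arr) print w, arr[w] }'
}
```

Running `wordcount /path/to/mypath > finalwordcount` would produce one "word count" pair per line. Like the original, this only removes the body tags themselves, so other tags such as `<p>` would still be counted as words, and a `<body>` tag carrying attributes would not match. The eight identical find errors suggest the loop ran once per directory, which means xml.lis most likely does contain one path per line.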