counting_word_excluding patterns


 
# 1  
Old 03-30-2011
counting_word_excluding patterns

Hi everyone,

I am new to this forum, so I apologize if my question is too basic. I am trying to find the number of words in a large number of XML files. Of course I do not want to count the XML tags (<.*?>), but I do not know how to do it. Is there an easy way? (By the way, I am working on a friend's Mac since I am not at home. Could that be causing my problems?) Please help me!
# 2  
Old 03-30-2011
Please post a sample input file and the desired output.
This User Gave Thanks to bartus11 For This Post:
# 3  
Old 03-30-2011
A sample of one of the many files whose words I want to count is below (the other files are like this one):
Code:
<intervention id='in16'>
<speaker>
<name>O'Brien, Bill</name>
<birth_date>19290125</birth_date>
<birth_place>UK</birth_place>
<status>Mr</status>
<gender>male</gender>
<institution>
<ni country="UK">HC</ni>
</institution>
<constituency country="UK" region="Normanton"/>
<affiliation>
<hc group="NA"/>
<national_party>Lab</national_party>
</affiliation>
</speaker>
<speech id='sp16' language="EN">When considering local government expenditure and finance, will my right hon. Friend
examine the major problem that is developing in many areas because health and social care is funded by the Department
of Health and local government? Will he take into consideration the need for local authorities properly to fund health
and social care?</speech>
</intervention>

<intervention id='in17'>
<speaker>
<name>Raynsford, Nick</name>
<status>Rt Hon</status>
<gender>male</gender>
<institution>
<ni country="UK">HC</ni>
</institution>
<constituency country="UK" region="Greenwich and Woolwich"/>
<affiliation>
<hc group="NA"/>
<national_party>Lab</national_party>
</affiliation>
<post>The Minister for Local and Regional Government</post>
</speaker>
<speech id='sp17' language="EN">My hon. Friend makes a fair point, but he will be aware that under recent
settlements there has been a sustained increase in local government funding, with a 33 per cent. increase
in real terms since 1997. Specifically, the funding that is targeted on social care has increased above the
average, so the Government are well aware of the need and are putting money into local government to ensure
that the needs of communities are met without imposing unreasonable council tax increases.</speech>
</intervention>

<intervention id='in18'>
<speaker>
<name>Spelman, Caroline</name>
<status>Mrs</status>
<gender>female</gender>
<institution>
<ni country="UK">HC</ni>
</institution>
<constituency country="UK" region="Meriden"/>
<affiliation>
<hc group="NA"/>
<national_party>Con</national_party>
</affiliation>
</speaker>
<speech id='sp18' language="EN">Since 1997, council tax has risen by 70 per cent. and average bills are set
to top £1,000--the highest ever. At the same time, council tax receipts to the Treasury have soared by 80
per cent. Does the Minister accept that the Office of the Deputy Prime Minister has been filling the Chancellor's
coffers by stealth and that the sooner it is gone, the sooner we can restore fairness and accountability to local
government? <omit>12 Jan 2005 : Column 284</omit></speech>
</intervention>

<intervention id='in19'>
<speaker>
<name>Raynsford, Nick</name>
<status>Rt Hon</status>
<gender>male</gender>
<institution>
<ni country="UK">HC</ni>
</institution>
<constituency country="UK" region="Greenwich and Woolwich"/>
<affiliation>
<hc group="NA"/>
<national_party>Lab</national_party>
</affiliation>
<post>The Minister for Local and Regional Government</post>
</speaker>
<speech id='sp19' language="EN">In terms of fairness and accountability, when the hon. Lady's party was in power
grants to local government were cut year after year and local authorities were faced with the real problem of
trying to meet local needs without adequate finance. Since this Government have been in power, the grant to local
government has increased by 33 per cent. in real terms, which has enabled councils to budget prudently. If she
were really worried about council tax, she would be talking to Conservative councils, because they had the
unenviable record last year of setting larger increases than Labour councils--5.4 per cent. compared with 4.7
per cent. Labour is leading the way on keeping council tax down.</speech>
</intervention>

I just want to get the number of words in this file (and in many other files like it), excluding the XML tags (<.*?>).

Please help!!

mc

Last edited by Scott; 03-30-2011 at 09:25 AM.. Reason: Please use code tags
# 4  
Old 03-30-2011
Code:
# sed "s/[]'-]/_/g;s/<[^>]*>/ /g;s/[[:blank:]][[:blank:]]*/ /g;s/^ //;s/ $//;" infile | wc -w

---------- Post updated at 02:44 PM ---------- Previous update was at 02:40 PM ----------

If you want to see what it looks like without the markup and without empty lines:

Code:
sed "s/[]'-]/_/g;s/<[^>]*>/ /g;s/[[:blank:]][[:blank:]]*/ /g;s/^ //;s/ $//;/^$/d" yourfile.html

Then just append | wc -w to count the words.
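For readers who find the one-liner dense, here is a minimal sketch that splits the same sed expressions into separate steps. The function name strip_tags and the sample lines are made up for illustration; the expressions themselves are the ones from the command above. It assumes every tag opens and closes on the same line, as in the posted sample.

```shell
# strip_tags (hypothetical name): print the text of the input with tags removed.
# Steps, in order:
#   1. replace ], ' and - with _
#   2. replace every <...> tag with a space
#   3. squeeze runs of blanks into a single space
#   4. drop a leading space
#   5. drop a trailing space
#   6. delete lines that are now empty
strip_tags() {
    sed -e "s/[]'-]/_/g" \
        -e 's/<[^>]*>/ /g' \
        -e 's/[[:blank:]][[:blank:]]*/ /g' \
        -e 's/^ //' \
        -e 's/ $//' \
        -e '/^$/d' "$@"
}

# Made-up sample: three lines of markup, two of which contain text.
printf '%s\n' \
    "<name>O'Brien, Bill</name>" \
    '<hc group="NA"/>' \
    '<speech id="sp1">Will he help?</speech>' | strip_tags
```

With no file argument the function reads standard input, so it can sit in a pipeline; with file names it behaves like the original command, e.g. strip_tags yourfile.html | wc -w.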

---------- Post updated at 02:50 PM ---------- Previous update was at 02:44 PM ----------

Code:
# sed "s/'/_/g;s/<[^>]*>/ /g;s/[[:blank:]][[:blank:]]*/ /g;s/^ //;s/ $//;/^$/d" tst
O_Brien, Bill
19290125
UK
Mr
male
HC
Lab
When considering local government expenditure and finance, will my right hon. Friend examine the major problem that is developing in many areas because health and social care is funded by the Department of Health and local government? Will he take into consideration the need for local authorities properly to fund health and social care?
Raynsford, Nick
Rt Hon
male
HC
Lab
The Minister for Local and Regional Government
My hon. Friend makes a fair point, but he will be aware that under recent settlements there has been a sustained increase in local government funding, with a 33 per cent. increase in real terms since 1997. Specifically, the funding that is targeted on social care has increased above the average, so the Government are well aware of the need and are putting money into local government to ensure that the needs of communities are met without imposing unreasonable council tax increases.
Spelman, Caroline
Mrs
female
HC
Con
Since 1997, council tax has risen by 70 per cent. and average bills are set to top £1,000--the highest ever. At the same time, council tax receipts to the Treasury have soared by 80 per cent. Does the Minister accept that the Office of the Deputy Prime Minister has been filling the Chancellor_s coffers by stealth and that the sooner it is gone, the sooner we can restore fairness and accountability to local government?  12 Jan 2005 : Column 284
Raynsford, Nick
Rt Hon
male
HC
Lab
The Minister for Local and Regional Government
In terms of fairness and accountability, when the hon. Lady_s party was in power grants to local government were cut year after year and local authorities were faced with the real problem of trying to meet local needs without adequate finance. Since this Government have been in power, the grant to local government has increased by 33 per cent. in real terms, which has enabled councils to budget prudently. If she were really worried about council tax, she would be talking to Conservative councils, because they had the unenviable record last year of setting larger increases than Labour councils--5.4 per cent. compared with 4.7 per cent. Labour is leading the way on keeping council tax down.

Code:
# sed "s/'/_/g;s/<[^>]*>/ /g;s/[[:blank:]][[:blank:]]*/ /g;s/^ //;s/ $//;/^$/d" tst  | wc -w
     374
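Since the original question was about many files, the same idea can be wrapped in a loop. This is only a sketch: count_words and count_all are hypothetical helper names, and the blank-squeezing steps are left out because wc -w ignores repeated whitespace anyway. It assumes no tag is split across lines.

```shell
# count_words FILE (hypothetical helper): number of whitespace-separated
# words left in FILE once every <...> tag has been replaced by a space.
count_words() {
    sed "s/'/_/g;s/<[^>]*>/ /g" "$1" | wc -w
}

# count_all FILE... (hypothetical helper): one "count filename" line per
# file, followed by a grand total.
count_all() {
    total=0
    for f in "$@"; do
        # $(( ... )) strips the padding that some wc versions add
        n=$(( $(count_words "$f") ))
        printf '%s %s\n' "$n" "$f"
        total=$((total + n))
    done
    printf '%s total\n' "$total"
}

# Usage (the glob is a placeholder for your own files):
# count_all *.xml
```

The per-file lines make it easy to spot a file whose count looks off before trusting the total.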

---------- Post updated at 02:57 PM ---------- Previous update was at 02:50 PM ----------

Should O'Brien be counted as one word or two?
Should £1,000--the be counted as one word or two?

This is easy to tweak:
if O'Brien is one word, replace ' with _
if £1,000--the or councils--5.4 must be split, replace - with a space

In that case we get 376 words:
Code:
# sed "s/'/_/g;s/-/ /g;s/<[^>]*>/ /g;s/[[:blank:]][[:blank:]]*/ /g;s/^ //;s/ $//;/^$/d" tst  | wc -w
     376
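A small check of how wc -w treats these tokens, using a made-up sample line (the pound sign is left out to keep the sample plain ASCII). wc -w splits on whitespace only, so O'Brien is already one word with or without the underscore; only turning the hyphens into spaces changes the count.

```shell
# Three whitespace-separated tokens modelled on the thread's examples.
sample="O'Brien 1,000--the councils--5.4"

printf '%s\n' "$sample" | wc -w                     # 3: hyphens kept
printf '%s\n' "$sample" | sed "s/'/_/g" | wc -w     # 3: apostrophe swap changes nothing
printf '%s\n' "$sample" | sed 's/-/ /g' | wc -w     # 5: each "--" now separates two words
```

This matches the jump from 374 to 376 above: two tokens containing "--" each become two words.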

# 5  
Old 03-30-2011
Code:
$ ruby -0777 -ne 'puts $_.gsub(/<.*?>|[[:punct:]]/,"").gsub(/\s+/," ").split.size' file
373
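One plausible reason for 373 here versus 374 from the sed version (not verified against the full file): the Ruby one-liner deletes punctuation before splitting, so the free-standing ":" in "12 Jan 2005 : Column 284" disappears, whereas wc -w counts it as a word. A shell sketch of that effect on the one line, using tr as a stand-in for the Ruby punctuation removal:

```shell
# The <omit> text from the sample, without its tags.
line='12 Jan 2005 : Column 284'

printf '%s\n' "$line" | wc -w                      # 6: the lone ":" counts as a word
printf '%s\n' "$line" | tr -d '[:punct:]' | wc -w  # 5: ":" is deleted before counting
```

Which count is "right" depends on whether a stray punctuation mark should count as a word.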

# 6  
Old 03-30-2011
Thanks

Thank you so much. You are fantastic. I am so glad I have registered here. I hope I can manage to help you at some point as well.

mc
# 7  
Old 03-30-2011
If you have money, you can help me.