Shell script to find longest phrase


 
# 1  
Old 02-23-2009

Hi Everyone,

I am trying to write a shell script that can find the longest phrase that appears at least twice in an online news article. The HTML has been parsed through an HTML parser, converted to XML and the article content extracted. I have put this article content in a text file to work from. Eventually, this will be read from the database.

Right now, I need to find the longest phrase that appears in this text file at least twice. I can extract bigrams and trigrams, but does anyone know of a way to sift through the file and find the longest repeated phrase, whatever its length n turns out to be?

Thanks. Much appreciated.
SG
# 2  
Old 02-23-2009
Maybe you can post a sample.
# 3  
Old 02-23-2009
The text file is nothing but a news article after parsing the HTML tags and extracting the content using XML. I use the following to extract and print the bigrams for now.

tr -sc 'a-zA-Z0-9.' '\012' < "$1" > bigrams1
tail -n+2 bigrams1 > bigrams2
paste bigrams1 bigrams2

Here $1 is the name of the file containing the actual text. I am passing this as an argument for now.

The thing is, detecting bigrams and trigrams is easy. Is there a way to detect the longest phrase that appears at least twice? It could be a bigram, a trigram or any n-gram.

Thanks.
SG
# 4  
Old 02-23-2009
As an idea, I was thinking of counting the total number of words in the text file and then running a for loop up to that number, checking for n-grams at each length.

I haven't tried this idea yet. Right now I am working out of text files. Creating 2 text files for bigrams is easy, but if I have to create n files in real time when checking for n-grams, how feasible is that in terms of both memory and CPU cycles?

This is why I am asking here, in case any of you can help me.

Thanks
SG
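The n-files concern above can be avoided by holding the words in an awk array and building each n-gram in memory. The following is only a sketch under stated assumptions: the function name `repeated_ngrams` is hypothetical, words are runs of letters and digits (so the period kept by the `tr` command in post #3 is dropped), matching is case-sensitive, and n-grams are allowed to run across line and sentence boundaries.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print every n-gram that
# occurs at least twice, preceded by its count.
# Usage: repeated_ngrams N FILE
repeated_ngrams() {
    n=$1
    file=$2
    # One word per line; assumption: a word is a run of letters/digits.
    tr -sc 'a-zA-Z0-9' '\n' < "$file" |
    awk -v n="$n" '
        NF { w[++total] = $0 }      # skip the empty line tr may emit first
        END {
            # Slide a window of n words over the array and count each n-gram.
            for (i = 1; i + n - 1 <= total; i++) {
                gram = w[i]
                for (j = 1; j < n; j++) gram = gram " " w[i + j]
                count[gram]++
            }
            for (g in count) if (count[g] >= 2) print count[g], g
        }'
}
```

Called as `repeated_ngrams 2 article.txt` (the file name is a placeholder), this lists the repeated bigrams without creating any intermediate files. Memory use grows with the number of distinct n-grams, which should be modest for a single news article.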
# 5  
Old 02-24-2009
I think you need to define what a 'phrase' is.
# 6  
Old 02-24-2009
A phrase is a sequence of consecutive words that appears, word for word, more than once in the text. It could be anything from 2 to n words long. So, for example, in a sample text:

The quick brown fox jumped over the ugly dog
The quick brown fox is fast asleep in a corner.

In the above sample, 'The quick brown fox' is the phrase I am looking for, as it occurs twice in the text. In another sample:

My name is John Doe
My name used to Alfred

The "phrase" here will be "My name", as that is the part that repeats.

I hope that makes a bit more sense?

Thanks
SG
# 7  
Old 02-24-2009
Makes sense, not that I want to figure out how to do it.
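
For what it is worth, the definition in post #6 suggests one possible approach, sketched here under stated assumptions rather than offered as a definitive answer. If no n-gram repeats at length n, nothing longer can repeat either, because any repeated (n+1)-gram contains a repeated n-gram. So n can be increased from 2 until the first length with no repeats. The function name `longest_repeated_phrase` is hypothetical; words are assumed to be runs of letters and digits, matching is case-sensitive, phrases may span line and sentence boundaries, and overlapping occurrences count as separate occurrences.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print one longest word
# sequence that occurs at least twice in FILE (nothing if none repeats).
# Usage: longest_repeated_phrase FILE
longest_repeated_phrase() {
    # One word per line; assumption: a word is a run of letters/digits.
    tr -sc 'a-zA-Z0-9' '\n' < "$1" |
    awk '
        NF { w[++total] = $0 }      # skip the empty line tr may emit first
        END {
            best = ""
            for (n = 2; n <= total; n++) {
                split("", count)    # portable way to empty the array
                found = ""
                for (i = 1; i + n - 1 <= total; i++) {
                    gram = w[i]
                    for (j = 1; j < n; j++) gram = gram " " w[i + j]
                    # Remember the first n-gram to reach two occurrences.
                    if (++count[gram] == 2 && found == "") found = gram
                }
                if (found == "") break   # no repeats at n, so none beyond
                best = found
            }
            if (best != "") print best
        }'
}
```

On the two samples from post #6 this prints "The quick brown fox" and "My name" respectively. When several phrases tie for the longest length, only the first one to reach two occurrences is printed. The work grows with the number of words times the length of the answer, which is fine for one article; a suffix-array approach would scale better for large inputs.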