Eliminating words from a file through ngrams stored in another file

01-23-2013

Hello,
I have a large data file which contains a huge amount of garbage, i.e. words which do not exist in the language. An example will make this clear:

Code:

kpaware
nlupset
rrrbring

In other words, these words are invalid in English and constitute garbage in the data.
I have identified such letter combinations (at least in the initial position) and have prepared a file of them which, for lack of a better term, I call bigrams and trigrams.
An example of such combos is given below:

Code:

nl
kp
rrr

Is there a script which could load the ngram file, check which words in the data file begin with one of these ngrams, and create two files: a clean file and an invalid file?
I am fully aware that this approach carries some risk, since combinations as short as two letters are involved, and a bigram such as

Code:

nl

could eliminate a word such as

Code:

nlong

Hence the request to store the rejected words in an invalid file for manual examination.
Many thanks in advance.

gimley

01-23-2013

Maybe a one-liner would help!

Code:

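# Words that do not start with a listed prefix
grep -Ev '^(nl|kp|rrr)' file > valid_file
# Words that do start with one (kept for manual review)
grep -E '^(nl|kp|rrr)' file > invalid_file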
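
If the list is too long to type into a single pattern, the same idea can probably be driven from the ngram file itself. The following is only a sketch under some assumptions: the file names ngrams.txt, words.txt and patterns.txt are placeholders, the words "bring" and "aware" are added purely as sample clean words, and the prefixes are assumed to contain only letters (regex metacharacters in them would be interpreted by grep).

```shell
# Work in a scratch directory so the sample files do not clutter anything.
workdir=$(mktemp -d) && cd "$workdir"

# Sample data: the prefixes and garbage words from the question, plus two
# made-up clean words. All file names here are placeholders.
printf '%s\n' nl kp rrr > ngrams.txt
printf '%s\n' kpaware nlupset rrrbring bring aware > words.txt

# Turn each prefix into an anchored pattern (nl -> ^nl), dropping blank
# lines and any carriage returns left over from Windows line endings.
tr -d '\r' < ngrams.txt | sed -e '/^$/d' -e 's/^/^/' > patterns.txt

# Words starting with a listed prefix go to invalid_file, the rest to valid_file.
tr -d '\r' < words.txt | grep -f patterns.txt > invalid_file
tr -d '\r' < words.txt | grep -v -f patterns.txt > valid_file
```

With the sample data, invalid_file should end up holding the three garbage words and valid_file the two clean ones. invalid_file can then be reviewed by hand for false positives.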

PikK45

01-23-2013

Many thanks, but the list is large, so the patterns would have to be read from a file rather than typed on the command line. I work under Windows, and egrep does not always give the expected results there.

gimley
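
For a large prefix list, a single awk pass might be worth trying instead of grep. This is only a sketch, and it rests on several assumptions: an awk port (for example gawk) is available on the Windows machine, the file names ngrams.txt and words.txt are placeholders, the word "bring" is added as a sample clean word, and the unexpected results come from Windows CRLF line endings, which is a guess and not something the thread confirms.

```shell
# Work in a scratch directory so the sample files do not clutter anything.
workdir=$(mktemp -d) && cd "$workdir"

# Sample data written with CRLF line endings to mimic files made on Windows.
# File names and the clean word "bring" are placeholders for illustration.
printf 'nl\r\nkp\r\nrrr\r\n' > ngrams.txt
printf 'kpaware\r\nnlupset\r\nrrrbring\r\nbring\r\n' > words.txt

awk '
    # Strip a trailing carriage return from every line of both files.
    { sub(/\r$/, "") }

    # First file: remember each prefix and the prefix lengths in use.
    NR == FNR {
        if ($0 != "") { prefix[$0] = 1; len[length($0)] = 1 }
        next
    }

    # Second file: a word is invalid if its first n letters form a listed
    # prefix, for any prefix length n seen in the ngram file.
    {
        bad = 0
        for (n in len)
            if (substr($0, 1, n) in prefix) { bad = 1; break }
        print > (bad ? "invalid_file" : "valid_file")
    }
' ngrams.txt words.txt
```

Each word costs only one table lookup per distinct prefix length (two or three for bigrams and trigrams), so the size of the list should matter little. The sketch assumes the ngram file is not empty, and it treats prefixes as plain text, not as regular expressions.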
