Eliminating words from a file through ngrams stored in another file

# 1  
Old 01-23-2013

Hello,
I have a large data file that contains a huge amount of garbage, i.e. words that do not exist in the language. An example will make this clear:
Code:
kpaware
nlupset
rrrbring

In other words, these words are invalid in English and constitute garbage in the data.
I have identified such letter combinations (at least in word-initial position) and have prepared a file of them which, for lack of a better term, I call bigrams and trigrams.
An example of such combinations is given below:
Code:
nl
kp
rrr

Is there a script that could load the ngram file, check which words in the data do not meet the requirement, and create two files: a clean file and an invalid file?
I am fully aware that this approach is fraught with a certain amount of danger, since two-letter combinations are involved, and a bigram such as
Code:
nl

could eliminate a word such as
Code:
nlong

Hence the request for storing the data in an invalid file for manual examination.
Many thanks in advance.

Last edited by Scrutinizer; 01-23-2013 at 01:22 AM.. Reason: quote tags -> code tags
# 2  
Old 01-23-2013
Maybe a one-liner would help!

Code:
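# Words that do NOT start with any of the listed prefixes
grep -Ev '^(nl|kp|rrr)' file > valid_file
# Words that DO start with a listed prefix, kept for manual review
grep -E '^(nl|kp|rrr)' file > invalid_file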

This User Gave Thanks to PikK45 For This Post:
# 3  
Old 01-23-2013
Many thanks, but the list is large, so the patterns would have to be read from a file. I work under Windows, and egrep does not always give the expected results there.
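
For a prefix list kept in its own file, one possible approach is sketched below with awk instead of egrep. It is only a sketch under a few assumptions: an awk is available on the Windows machine (for example through a Unix-like shell), both files hold one entry per line, and the prefixes are meant as literal strings rather than regular expressions. The function name split_by_prefix and the four file names are hypothetical.

Code:
```shell
#!/bin/sh
# split_by_prefix NGRAMS WORDS CLEAN INVALID
# Hypothetical helper: the name and all four file arguments are assumptions.
# Note: an empty NGRAMS file is not handled specially in this sketch.
split_by_prefix() {
    # Start with empty output files so stale results never survive a rerun.
    : > "$3"
    : > "$4"
    awk -v clean="$3" -v invalid="$4" '
        { sub(/\r$/, "") }               # tolerate Windows (CRLF) line endings
        NR == FNR {                      # first file: the ngram list
            if ($0 != "") {
                prefix[$0] = 1           # remember the prefix itself
                lens[length($0)] = 1     # and which prefix lengths occur
            }
            next
        }
        {                                # second file: the word list
            bad = 0
            for (n in lens) {            # try each prefix length once
                key = substr($0, 1, n)
                if (key in prefix) { bad = 1; break }
            }
            print > (bad ? invalid : clean)
        }
    ' "$1" "$2"
}
```

It could then be called as

Code:
```shell
split_by_prefix ngrams.txt words.txt clean_file invalid_file
```

Looking up only the leading 2, 3, ... characters of each word keeps the work per word proportional to the number of distinct prefix lengths, not to the size of the ngram list, which should matter when the list is large. A word such as nlong would still land in the invalid file, so that file remains the one to check by hand.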