Eliminating words from a file through ngrams stored in another file
Hello,
I have a large data file which contains a huge amount of garbage, i.e. words which do not exist in the language. An example will make this clear:
In other words, these words are invalid in English and constitute garbage in the data.
I have identified such combinations (at least in the initial position) and have prepared a file of such combos which, for lack of a better term, I call bigrams and trigrams.
An example of such combos is given below:
Is there a script which could load the ngram file, check which words in the database do not meet the requirement, and create two files: a clean file and an invalid file?
I am fully aware that this approach is fraught with a certain amount of danger, since two-letter combinations are involved, and it could be that a bigram such as
could eliminate a word such as
Hence the request for storing the data in an invalid file for manual examination.
Many thanks in advance.
Last edited by Scrutinizer; 01-23-2013 at 01:22 AM..
Reason: quote tags -> code tags
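One way the request above could be approached is sketched below in Python. It is only a sketch under stated assumptions, not a tested solution for the actual data: it assumes the ngram file holds one combination per line, that the data file holds one word per line, that matching is case-insensitive, and that a word counts as invalid only when it starts with one of the listed combinations (the "initial position" case described in the post). The function names and the default output names clean.txt and invalid.txt are made up for illustration.

```python
import sys
from pathlib import Path


def load_ngrams(path):
    """Read one n-gram per line (assumed layout); blank lines are skipped."""
    with open(path, encoding="utf-8") as handle:
        return tuple(sorted({line.strip().lower() for line in handle if line.strip()}))


def split_words(words, ngrams):
    """Return (clean, invalid); a word is invalid if it starts with any listed n-gram."""
    prefixes = tuple(ngrams)
    clean, invalid = [], []
    for word in words:
        # str.startswith accepts a tuple of prefixes and tests them all at once.
        if prefixes and word.lower().startswith(prefixes):
            invalid.append(word)
        else:
            clean.append(word)
    return clean, invalid


def main(ngram_file, data_file, clean_file="clean.txt", invalid_file="invalid.txt"):
    """Hypothetical driver: output file names are placeholders, not from the post."""
    ngrams = load_ngrams(ngram_file)
    with open(data_file, encoding="utf-8") as handle:
        words = [line.strip() for line in handle if line.strip()]
    clean, invalid = split_words(words, ngrams)
    Path(clean_file).write_text("".join(w + "\n" for w in clean), encoding="utf-8")
    Path(invalid_file).write_text("".join(w + "\n" for w in invalid), encoding="utf-8")


if __name__ == "__main__":
    # Usage (hypothetical): python split_by_ngrams.py ngrams.txt data.txt
    if len(sys.argv) >= 3:
        main(*sys.argv[1:5])
```

Because every rejected word is written to the invalid file rather than discarded, real words that happen to begin with a listed combination can still be recovered by manual examination, as the post asks.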
Hi experts,
I have two arrays: one has the file paths to be searched in, and the other has the files to be searched. For example,
searchfile.dat will have
abc303
xyz123
I have to search for files that could be abc303*.dat or, for that matter, have any extension, e.g. abc303*.dat.gz
The following code... (2 Replies)
Hello,
I have a few lines to be inserted in the file file_lines_to_insert.
In another file, final_file, I have to add the lines from the above file file_lines_to_insert before a particular pattern.
e.g.
$ cat file_lines_to_insert => contents are
abc
def
lkj
In another file final_file, before a... (6 Replies)
Dear all,
I am working with names and I have a large file of names in which some words are written together (up to 4 or 5) and their corresponding single forms are also present in the word-list.
An example would make this clear:
annamarie
mariechristine
johnsmith
johnjoseph smith
john
smith... (8 Replies)
Hi All,
I have a file which is like this:
rows.dat
1 2 3 4 5 6
3 4 5 6 7 8
7 8 9 0 4 3
2 3 4 5 6 7
1 2 3 4 5 6
I have another file with numbers like these (numbers.txt):
1
3
4
5
I want to read the numbers.txt file line by line. Then extract the row from rows.dat based on the... (3 Replies)
I was wondering about the "Show Package Contents" option in OS X. I have a keynote file that I'm looking at. Exactly where are these contents or its directory stored, because they aren't visible in the Finder window, unless I obviously right click and choose to view them. And I don't think I can... (2 Replies)
Hello,
I have a complex problem. I have a file in which words have been joined together:
Theboy ranslowly
I want to be able to correctly split the words using a lookup file in which all the words occur:
the
boy
ran
slowly
slow
put
child
ly
The lookup file which is meant for look up... (21 Replies)
Hi All,
I have written a script on this but it does not do the requisite job. My requirement is this:
1. I have two kinds of files each with different extensions. One set of files are *.dat (6000 unique DAT files all in one directory) and another set *.dic files (6000 unique DIC files in... (1 Reply)
I have a file like
1 0
2 0
3 1
3 0
4 0
6 1
6 0
. .
. .
. .
I need to eliminate the values 3 0 and 6 0; in the same way, there are such values in the whole file... but 3 1 and 6 1 should be present... (2 Replies)
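The teaser above is cut off, so the exact rule is not known. The Python sketch below assumes the rule is: drop a row whose second column is 0 whenever the same first-column value also appears with a 1, and keep everything else in its original order. The function name and the in-memory sample are illustrative only; the sample uses the visible rows from the snippet.

```python
def drop_shadowed_zeros(rows):
    """Drop (key, "0") rows when the same key also has a (key, "1") row (assumed rule)."""
    keys_with_one = {key for key, flag in rows if flag == "1"}
    return [
        (key, flag)
        for key, flag in rows
        if not (flag == "0" and key in keys_with_one)
    ]


# Rows visible in the snippet above, as (first column, second column) pairs.
sample = [("1", "0"), ("2", "0"), ("3", "1"), ("3", "0"),
          ("4", "0"), ("6", "1"), ("6", "0")]

for key, flag in drop_shadowed_zeros(sample):
    print(key, flag)
```

Under that assumed rule, the rows 3 0 and 6 0 are removed while 3 1 and 6 1 stay, and rows such as 1 0 that have no matching 1 row are left alone.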
Hi all, I made a C++ program in DOS (in Dev-C++) and uploaded it to a Solaris box. On opening that file with the 'vim' editor I found that there are some extra newlines after each written code line. I tried to find out whether the file is in DOS or in Unix format with the 'file' command, and I got "<file-name>.h:... (4 Replies)
Hi All,
I have a file
with values like
----
112
113
109
112
109
I have another file
cat supplierDetails.txt
-------------------------
112|MIMUS|krishnaveni@google.com
113|MIMIRE|krishnaveni@google.com
114|MIMCHN|krishnaveni@google.com
115|CEL|krishnaveni@google.com... (10 Replies)