Forum: Shell Programming and Scripting. Post 302759773 by gimley on Tuesday 22nd of January 2013 11:39:32 PM
01-23-2013
Eliminating words from a file through ngrams stored in another file

Hello,
I have a large data file that contains a huge amount of garbage, i.e. words which do not exist in the language. An example will make this clear:
Code:
kpaware
nlupset
rrrbring

In other words, these words are invalid in English and constitute garbage in the data.
I have identified such letter combinations (at least in word-initial position) and have prepared a file of them which, for lack of a better term, I call bigrams and trigrams.
An example of such combinations is given below:
Code:
nl
kp
rrr

Is there a script which could load the ngram file, check which words in the data file begin with one of these ngrams, and create two files: a clean file and an invalid file?
I am fully aware that this approach is fraught with a certain amount of danger, since two-letter combinations are involved, and it could be that a bigram such as
Code:
nl

could eliminate a word such as
Code:
nlong

Hence the request for storing the data in an invalid file for manual examination.
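A minimal sketch of the kind of filter described above, assuming one ngram per line and one word per line. The file names ngrams.txt, words.txt, clean.txt and invalid.txt are placeholders chosen for illustration, and the sample data is made up apart from the entries quoted in this post. The match is case-sensitive and only tests the start of each word.

```shell
#!/bin/sh
# Sample inputs (hypothetical file names; replace with the real files).
printf '%s\n' nl kp rrr > ngrams.txt
printf '%s\n' kpaware nlupset rrrbring nlong bring aware > words.txt

# Remove stale output from an earlier run.
rm -f clean.txt invalid.txt

# First file (NR == FNR): load the ngrams. Second file: test each word.
awk '
    NR == FNR { if ($0 != "") ngram[$0]; next }   # keep each non-empty ngram as an array key
    {
        bad = 0
        for (n in ngram)
            if (index($0, n) == 1) { bad = 1; break }   # ngram sits at the start of the word
        print > (bad ? "invalid.txt" : "clean.txt")
    }
' ngrams.txt words.txt
```

With this sample data, nlong ends up in invalid.txt together with the real garbage, which is exactly why the invalid file is kept for manual inspection rather than discarded.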
Many thanks in advance.

Last edited by Scrutinizer; 01-23-2013 at 01:22 AM.. Reason: quote tags -> code tags
 
