I don't want any code for this problem, just suggestions.
I have a huge collection of text files (around 300,000) which look like this:
1.fil
The entire text collection (referenced above) has about 1 billion words.
I have created another text file which contains some words like these:
junk.dat
The above text file has about 300,000 such words.
As you can see, the words I have filtered out are junk words that I want to remove from my text collection.
I have already written code for this in C, but when it runs it looks as though it will take days or even months to clean up the entire text data.
My algorithm:
1. Read the junk.dat file and make a hash table of it.
2. Read each of the *.fil files one by one and look up each word in the hash table.
3. If a word is present in junk.dat, leave it out and move on to the next word.
4. Keep doing this for the entire collection of 300,000 files.
I have even tried to minimize disk reads and writes. It is still very slow.
1. Stop using malloc()/calloc() and free() every time you need memory. Get ONE chunk of memory and reuse it. For example, pass a character buffer into a method instead of using calloc() to allocate a new one each and every time.
2. Fix your memory leaks - I spotted at least two: one in file_name_generator(), and one caused by the return value of file_name_generator() overwriting a malloc()'d pointer.
3. Don't EVER use fgetc().
4. Don't read files TWICE. Use something like fgets() and process each word as you read it. If you're using rewind(), you've done something wrong.
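If a concrete shape helps, here is a minimal sketch of how those four points could fit together. It is not your code and makes several assumptions: junk.dat and the *.fil files are plain text with whitespace-separated words, no line is longer than LINE_CAP bytes, and the names (slurp, add_junk_words, set_contains, filter_line, filter_stream, TABLE_SIZE, LINE_CAP) are made up for illustration. The junk list lives in one malloc()'d block that is split in place, the line buffers are allocated once and reused, and each file is read exactly once with fgets(). Runs of whitespace come out as single spaces.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE (1u << 20)  /* assumed: power of two, well above 300,000 words */
#define LINE_CAP   65536       /* assumed upper bound on the length of one line */

/* Open-addressing hash set; every slot points into the single junk pool. */
static const char *table[TABLE_SIZE];
static size_t used;

/* FNV-1a hash of len bytes, reduced to a table index. */
static size_t hash_bytes(const char *s, size_t len) {
    unsigned long long h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 0x100000001b3ULL;
    }
    return (size_t)(h & (TABLE_SIZE - 1));
}

/* Add a NUL-terminated word; the string must outlive the table. */
void set_add(const char *word) {
    size_t i = hash_bytes(word, strlen(word));
    while (table[i] != NULL) {
        if (strcmp(table[i], word) == 0) return;  /* already stored */
        i = (i + 1) & (TABLE_SIZE - 1);           /* linear probing */
    }
    assert(used < TABLE_SIZE / 2);                /* keep the table sparse */
    table[i] = word;
    used++;
}

/* Look up a word given as pointer + length (no NUL terminator needed). */
int set_contains(const char *w, size_t len) {
    size_t i = hash_bytes(w, len);
    while (table[i] != NULL) {
        if (strncmp(table[i], w, len) == 0 && table[i][len] == '\0') return 1;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}

/* Read a whole file into ONE allocation. fseek/ftell only query the size. */
char *slurp(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *buf = (size >= 0) ? malloc((size_t)size + 1) : NULL;
    if (buf) buf[fread(buf, 1, (size_t)size, f)] = '\0';
    fclose(f);
    return buf;
}

/* Split pool in place on whitespace and add each word to the set. */
void add_junk_words(char *pool) {
    char *p = pool;
    while (*p) {
        while (*p && isspace((unsigned char)*p)) p++;
        char *start = p;
        while (*p && !isspace((unsigned char)*p)) p++;
        if (*p) *p++ = '\0';
        if (*start) set_add(start);
    }
}

/* Copy the non-junk words of line into out (capacity >= strlen(line) + 2). */
size_t filter_line(const char *line, char *out) {
    size_t n = 0;
    const char *p = line;
    while (*p) {
        while (*p && isspace((unsigned char)*p)) p++;
        const char *start = p;
        while (*p && !isspace((unsigned char)*p)) p++;
        size_t len = (size_t)(p - start);
        if (len > 0 && !set_contains(start, len)) {
            if (n > 0) out[n++] = ' ';
            memcpy(out + n, start, len);
            n += len;
        }
    }
    out[n++] = '\n';
    out[n] = '\0';
    return n;
}

/* One pass over a file: fgets() into a reused buffer, filter, write out. */
void filter_stream(FILE *in, FILE *out) {
    static char line[LINE_CAP], kept[LINE_CAP + 2];
    while (fgets(line, (int)sizeof line, in)) {
        size_t n = filter_line(line, kept);
        fwrite(kept, 1, n, out);
    }
}
```

A driver would call slurp("junk.dat") and add_junk_words() once at start-up, then fopen() each *.fil file together with an output file and hand both to filter_stream(). Nothing inside the per-word loop allocates, which is the main point of item 1.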