I have a very large file with millions of entries identified by @M. I am using the following script to "extract" entries based on specific strings/patterns:
I believe it would be more cost-effective to reduce the size of the input file by eliminating the entries that have already been extracted during the earlier loops. That way, by the time the last strings are being searched, the processing time would be significantly reduced. I was wondering what would be the most efficient way to accomplish this, considering that I am dealing with a sizable infile?
Thanks in advance!
Read the input file once, keeping a running total in the variable tot. For the array count[], use a variable MID to decide which element to increment; index count by MID. Print the final totals in an END clause.
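That description could be sketched as a single awk pass. A minimal sketch, assuming the MID tag is the second field of each @M header line (the layout is illustrative, since the original script isn't shown):

```shell
# One pass over the file: tally records per MID plus a running total,
# then print the tallies in an END clause.
# Assumes each record header looks like "@M MID-1 ..." (illustrative layout).
awk '
/^@M/ {
    MID = $2          # hypothetical: the MID tag is the 2nd field of the header
    count[MID]++      # increment the element selected by MID
    tot++             # running total of all records
}
END {
    for (m in count) print m, count[m]
    print "total", tot
}' infile
```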
Since I do not understand why you use "^@" and "^@M" as search patterns on the same records you've already searched, I'm not happy doing an example.
Please get into the habit of providing decent context for your problem.
It is always helpful to support a request with system info like OS and shell, related environment (variables, options), preferred tools, and adequate (representative) sample input and desired output data and the logics connecting the two, to avoid ambiguities and keep people from guessing.
Totally seconding jim mcnamara, some hints on condensing your script into one single awk script, just trying to translate your code, no reasonable testing possible:
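One way such a condensed awk script could look, routing every record to its MID file in a single pass (the pattern list and the header-plus-following-lines record layout are assumptions, not the poster's actual code):

```shell
# Replace the per-pattern grep loops with one awk pass: pick the output
# file when a record header arrives, then route every line of that
# record to it.  Patterns and layout are illustrative.
awk '
BEGIN { for (i = 1; i <= 4; i++) pat[i] = "MID-" i }
/^@M/ {                              # new record header: choose the file
    out = ""
    for (i = 1; i <= 4; i++)
        if (index($0, pat[i])) { out = pat[i] ".txt"; break }
}
out != "" { print > out }            # header and following lines alike
' infile
```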
OS: Bio-Linux 8
Preferred tool: awk
Mini input file:
Output files:
log.txt file:
MID-1.txt file:
MID-2.txt file:
MID-3.txt file:
MID-4.txt file:
As I tried to explain, but obviously failed to convey, my bash script already outputs all the desired files (the log plus the MID files).
Now, what I would like to change is this part:
In my script, for each and every loop, the entire input file is scanned searching for the strings.
Ideally, the input file should shrink accordingly after each loop. For the second loop, the entries extracted during the first loop would not be read, reducing the processing time; for the third loop, the entries extracted in loops 1 and 2 would not be read either, and so on. As a result, the processing time for the last loops would be significantly smaller, since the file gets smaller with each loop.
I thought about including the following pieces in my loop:
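For what it's worth, the pieces being described could look roughly like this: each pass extracts one MID's records and rewrites a shrinking copy of the input. This is a sketch of the approach under consideration, not a recommendation, and the MID names and record layout are placeholders:

```shell
# Per-loop extract-and-shrink: each pass peels off one MID's records
# and rewrites the remainder for the next pass.
for mid in MID-1 MID-2 MID-3 MID-4; do
    : > tmp                                   # ensure tmp exists even if nothing is left
    awk -v m="$mid" '
        /^@M/ { keep = (index($0, m) == 0) }  # does this record match the current MID?
        keep == 0 { print > (m ".txt") }      # extracted record
        keep == 1 { print > "tmp" }           # carried over to the next loop
    ' infile
    mv tmp infile                             # the input shrinks after every pass
done
```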
However, considering that the original input file is pretty large, rewriting the input file during each loop, besides looking horrible in the script, might not save that much time. In a nutshell, I am trying to shrink the input file after each loop to save time during the later loops.
I hope this clarifies what I am trying to accomplish
Thanks!
Did you even consider what Jim McNamara said and what I tried to cast into some sample code? Reading AND WRITING a large file multiple times - even slightly reduced in size - is an unnecessary task and load on the system. Adapting (and even simplifying) that cited sample code to your sample input and output:
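Adapted to the sample MID files plus a log, such a single-read script might look like the following sketch. The record layout, output file names, and log format are assumptions drawn from the sample files named earlier in the thread:

```shell
# Single read: route each record to MID-<n>.txt by its header and
# collect per-file counts for log.txt.
# Assumes the file starts with an @M header line.
awk '
/^@M/ {
    out = "unmatched.txt"                     # fallback if no MID tag is found
    for (i = 1; i <= 4; i++)
        if (index($0, "MID-" i)) { out = "MID-" i ".txt"; break }
    count[out]++
}
{ print > out }
END {
    for (f in count) print f, count[f] > "log.txt"
}' infile
```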
seems to give exactly what you're after in ONE SINGLE read of the input file, however large it may be.
Hi Xterra,
I think you don't understand what is being suggested. If you have a file containing a million records, each of those records has a 1st line that is one of four values, and you want to create four output files where each of those output files contains all records that have the same 1st line; then you do not want to read that input file 4 times. You want to read it once and create all of your 4 output files in one pass. Doing this you read a million records, write a million records, and you're done.
What you are asking to do instead is:
read 1,000,000 records, write ~250,000 records to one file, and write ~750,000 records to another file;
then read ~750,000 records, write ~250,000 records to one file, and write ~500,000 records to another file;
then read ~500,000 records, write ~250,000 records to one file, and write ~250,000 records to another file;
and then read ~250,000 records, write ~250,000 records to one file, and write 0 records to another file.
Why would you want to read ~2.5 million records and write ~2.5 million records instead of reading 1 million records and writing 1 million records?
The code that you currently have is reading 4 million records and writing 1 million records (i.e., 5 million I/O operations). What you are asking to do would read 2.5 million records and write 2.5 million records (i.e., 5 million I/O operations). Even if we skip the last read and write and just rename one of the last two output files, your plan still has 4.5 million I/O operations instead of the 2 million I/O operations being proposed by RudiC and jim mcnamara.
Is there something else that you haven't told us about your data that would affect what I assume you are trying to do?
Jim, Rudy and Don
I deeply apologize! Indeed, I did not read carefully or fully understand the code and Jim's suggestion when they were first posted. I see the advantages over what I wrote and I am trying to dissect it. Quick question, and for a different application: if my infile has the actual sequence in the second line of the record, something like this:
And I would like to output the entire record using Rudi's code, e.g. for outfile file MID-1.txt:
I would need to change the RS to \n, correct? How could I modify Rudi's code so I can append the two other lines?
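If each record spans a fixed number of lines, one option is to keep the default RS and pull the extra lines in with getline. A sketch assuming three-line records and the MID-1 tag on the header line:

```shell
# Print the whole record whenever the header carries the wanted MID tag.
# Assumes well-formed three-line records (header, sequence, third line).
awk '
/^@M/ && /MID-1/ {
    print                 # the @M header line
    getline; print        # the sequence line
    getline; print        # the third line of the record
}' infile > MID-1.txt
```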