Reducing input file size after pattern search

04-23-2017

I have a very large file with millions of entries identified by @M. I am using the following script to "extract" entries based on specific strings/patterns:

Code:

#!/bin/bash
if [[ -f $1 ]]
then
	file=$1
else
	echo "Input_file passed as an argument $1 is NOT found." >&2
	exit 1
fi
MID=(NULL "string-1" "string-2" "string-3" "string-4" )
tot=$(grep -c "^@" < "$file" )
echo "Total " "$tot" > log.txt

for y in {1..4}
do
	awk -v search="${MID[$y]}" '$2 ~ search { print $0 }' $file > MID-$y.txt
	awk -v Id="MID-$y" -v pct="$tot" '/^@M/ {count++} END { print Id "\t" (count*100)/pct }' MID-$y.txt >> log.txt
done

I believe it would be more "cost-effective" to reduce the "size" of the input file by eliminating the entries that have already been "extracted" during the earlier loops. That way, by the time the last strings are searched, the processing time would be significantly reduced. What would be the most efficient way to accomplish this, considering that I am dealing with a sizable infile?
Thanks in advance!

Last edited by Xterra; 04-23-2017 at 04:12 PM..

Xterra

04-23-2017

One single awk script, no grep, nothing else.

Read the input file once and keep a running total in the variable tot. For the per-string counts, use an array count[] indexed by the elements of MID[]; the MID element a record matches decides which counter to increment.

Print the final totals in an END {} clause.

Since I do not get why you use "^@" and "^@M" for search patterns on the same records you've already searched, I'm not happy doing an example.

jim mcnamara

04-23-2017

Please get used to providing decent context info for your problem.
It is always helpful to support a request with system info like OS and shell, related environment (variables, options), preferred tools, and adequate (representative) sample input and desired output data and the logic connecting the two, to avoid ambiguities and keep people from guessing.

Fully seconding jim mcnamara, here are some hints on condensing your script into one single awk script. This is just an attempt to translate your code; no reasonable testing was possible:

Code:

awk -vSARR="string-1 string-2 string-3 string-4" '
BEGIN   {for (MX=n=split (SARR, TMP); n>0; n--) SRCH[TMP[n]]
        }
/^@/    {tot++
        }
        {for (s in SRCH) if (($2 ~ s) && /^@M/) count[s]++
        } 
END     {print "Total:", tot
         for (s in SRCH) print "MID" s "\t" count[s]/tot*100
        }
'  $1

RudiC

04-23-2017

Quote:

It is always helpful to support a request with system info like OS and shell, related environment (variables, options), preferred tools, and adequate (representative) sample input and desired output data and the logic connecting the two, to avoid ambiguities and keep people from guessing.

OS=biolinux 8
preferred tools: AWK

Mini input file:

Code:

@M03333 AGCTGTGAstring-1GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 AGCTGTGAstring-2GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 AGCTGTGAstring-3GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 AGCTGTGAstring-4GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 AGCTGTGAstring-1GATCAGTGCATGG
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 AGCTGTGAstring-1GATCAGCCCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 AGCTGTGAstring-2CCATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 AGCTAAGAstring-2GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.

Output files:
log.txt file:

Code:

Total  8
MID-1	37.5
MID-2	37.5
MID-3	12.5
MID-4	12.5

MID-1.txt file:

Code:

@M03333 AGCTGTGAstring-1GATCAGTGCATGA
@M03333 AGCTGTGAstring-1GATCAGTGCATGG
@M03333 AGCTGTGAstring-1GATCAGCCCATGA

MID-2.txt file:

Code:

@M03333 AGCTGTGAstring-2GATCAGTGCATGA
@M03333 AGCTGTGAstring-2CCATCAGTGCATGA
@M03333 AGCTAAGAstring-2GATCAGTGCATGA

MID-3.txt file:

Code:

@M03333 AGCTGTGAstring-3GATCAGTGCATGA

MID-4.txt file:

Code:

@M03333 AGCTGTGAstring-4GATCAGTGCATGA

What I tried to explain, but obviously failed to convey, is that my bash script already outputs all desired files (log plus MID files).
Now, what I would like to change is this part:

Code:

awk -v search="${MID[$y]}" '$2 ~ search { print $0 }' $file > MID-$y.txt

In my script, for each and every loop, the entire input file is scanned searching for the strings.
Ideally, the input file should be reduced accordingly after each loop. Thus, for the second loop, the entries "extracted" during the first loop will not be "read"; therefore, reducing the processing time. For the third loop, all entries extracted in loops 1 and 2 would not be read either. So on and so forth. As a result, the processing time for the last loops would be significantly smaller since the file is getting smaller with each loop.
I thought about including the following pieces in my loop:

Code:

	awk -v search="${MID[$y]}" '$2 !~ search { print $0 }' $file > New-$file
	mv New-$file $file

However, considering that the original input file is pretty large, rewriting the input file during each loop, besides looking horrible in the script, might not save that much time. In a nutshell, I am trying to shrink the input file after each loop to save time during the last loops.
I hope this clarifies what I am trying to accomplish.
Thanks!

Xterra

04-23-2017

Did you even consider what Jim McNamara said and what I tried to cast into some sample code? Reading AND WRITING a large file multiple times - even slightly reduced in size - is an unnecessary task and load on the system. Adapting (and even simplifying) that cited sample code to your sample input and output:

Code:

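awk -vSARR="string-1 string-2 string-3 string-4" '
# Map each search string to its 1-based position: SRCH["string-1"] = 1, ...
BEGIN   {for (MX=n=split (SARR, TMP); n>0; n--) SRCH[TMP[n]] = n
        }
# Header lines only: count the record, then test field 2 against every search string
/^@M/   {tot++
         for (s in SRCH) if ($2 ~ s)    {count[s]++
                                         print > ("MID-" SRCH[s] ".txt")
                                        }
        }
# Note: "for (s in SRCH)" visits the keys in an unspecified order
END     {print "Total:", tot
         for (s in SRCH) print "MID-" SRCH[s] "\t" count[s]/tot*100
        }
' file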
Total: 8
MID-1    37.5
MID-2    37.5
MID-3    12.5
MID-4    12.5

cf M*
MID-1.txt:
@M03333 AGCTGTGAstring-1GATCAGTGCATGA
@M03333 AGCTGTGAstring-1GATCAGTGCATGG
@M03333 AGCTGTGAstring-1GATCAGCCCATGA
MID-2.txt:
@M03333 AGCTGTGAstring-2GATCAGTGCATGA
@M03333 AGCTGTGAstring-2CCATCAGTGCATGA
@M03333 AGCTAAGAstring-2GATCAGTGCATGA
MID-3.txt:
@M03333 AGCTGTGAstring-3GATCAGTGCATGA
MID-4.txt:
@M03333 AGCTGTGAstring-4GATCAGTGCATGA

seems to give exactly what you're after in ONE SINGLE read of the input file - however large it may be.

Last edited by RudiC; 04-23-2017 at 04:58 PM..

RudiC

04-23-2017

Hi Xterra,
I think you don't understand what is being suggested. If you have a file containing a million records, each of those records has a 1st line that is one of four values, and you want to create four output files where each of those output files contains all records that have the same 1st line; then you do not want to read that input file 4 times. You want to read it once and create all of your 4 output files in one pass. Doing this you read a million records, write a million records, and you're done.

What you are asking to do instead is read a million records, write ~250000 records to one file, and write ~750000 records to another file; then you read ~750000 records, write ~250000 records to one file, and write ~500000 records to another file; then you read ~500000 records, write ~250000 records to one file, and write ~250000 records to another file; and then you read ~250000 records, write ~250000 records to one file and write 0 records to another file. Why would you want to read ~2.5 million records and write ~2.5 million records instead of reading 1 million records and writing 1 million records?

The code that you currently have is reading 4 million records and writing 1 million records (i.e., 5 million I/O operations). What you are asking to do would read 2.5 million records and write 2.5 million records (i.e., 5 million I/O operations). Even if we skip the last read and write and just rename one of the last two output files, your plan still has 4.5 million I/O operations instead of the 2 million I/O operations being proposed by RudiC and jim mcnamara.
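
The counts above can be checked with a few lines of shell arithmetic. This is only a sketch: it uses the illustrative figures from this post (one million records, four patterns, an even 25% split), and the variable names are made up for the example.

```shell
#!/bin/bash
# Illustrative figures only: 1,000,000 records split evenly over 4 patterns.
n=1000000
k=4
per=$(( n / k ))

# Current script: k full reads of the input, each record written once.
current=$(( k * n + n ))

# Shrinking plan: pass i reads whatever is left and writes all of it back out
# (matches go to MID-i.txt, the remainder goes to the reduced input file).
shrink=0
left=$n
for (( i = 1; i <= k; i++ )); do
    shrink=$(( shrink + 2 * left ))
    left=$(( left - per ))
done

# Single pass: every record is read once and written once.
single=$(( 2 * n ))

echo "current=$current shrinking=$shrink single_pass=$single"
```

With these figures the script prints current=5000000 shrinking=5000000 single_pass=2000000, matching the 5 million versus 2 million comparison.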

Is there something else that you haven't told us about your data that would affect what I assume you are trying to do?

Don Cragun

04-23-2017

Jim, Rudi and Don,
I deeply apologize! Indeed, I did not read well/understand the code and Jim's suggestion when they were first posted. I see the advantages over what I wrote and I am trying to dissect it. Quick question, and for a different application, if my infile has the actual sequence in the second line of the record, something like this:

Code:

@M03333 
AGCTGTGAstring-1GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 
AGCTGTGAstring-2GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 
AGCTGTGAstring-3GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 
AGCTGTGAstring-4GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 
AGCTGTGAstring-1GATCAGTGCATGG
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 
AGCTGTGAstring-1GATCAGCCCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 
AGCTGTGAstring-2CCATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 
AGCTAAGAstring-2GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.

And I would like to output the entire record using Rudi's code, e.g. for output file MID-1.txt:

Code:

@M03333 
AGCTGTGAstring-1GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 
AGCTGTGAstring-1GATCAGTGCATGG
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333 
AGCTGTGAstring-1GATCAGCCCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.

I would need to change the RS to \n, correct? How could I modify Rudi's code so I can append the two other lines?

Xterra
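
The thread as captured here ends without a reply to this last question, so the following is only a possible sketch, not an answer from any of the posters. RS does not need changing, since awk's default record separator is already the newline. The sketch assumes every record is exactly four lines (header, sequence, "+", quality), buffers the first three with NR % 4, and writes the whole record once the fourth line arrives. The names PAT, hdr, sq, sep, out and the reads.txt sample file are made up for the example, and a numerically indexed array replaces SRCH so that the log comes out in pattern order.

```shell
#!/bin/bash
# Sketch only: assumes every record is exactly four lines
# (header, sequence, "+", quality) with no blank lines in between.
workdir=$(mktemp -d)

# Hypothetical mini input in the four-line layout.
cat > "$workdir/reads.txt" <<'EOF'
@M03333
AGCTGTGAstring-1GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333
AGCTGTGAstring-2GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333
AGCTGTGAstring-3GATCAGTGCATGA
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
@M03333
AGCTGTGAstring-1GATCAGTGCATGG
+
CCCCCCCCCCCCCCGGGGGGGGG;;;;.,..,.
EOF

awk -v SARR="string-1 string-2 string-3 string-4" -v out="$workdir" '
BEGIN       { n = split(SARR, PAT) }     # PAT[1..n] keeps the search strings in order
NR % 4 == 1 { hdr = $0; next }           # line 1: header, kept for later
NR % 4 == 2 { sq = $0; next }            # line 2: sequence, the line that is searched
NR % 4 == 3 { sep = $0; next }           # line 3: "+" separator
            { tot++                      # line 4: quality, the record is now complete
              for (i = 1; i <= n; i++)
                  if (sq ~ PAT[i]) {
                      count[i]++
                      print hdr ORS sq ORS sep ORS $0 > (out "/MID-" i ".txt")
                  }
            }
END         { print "Total:", tot
              for (i = 1; i <= n; i++)
                  print "MID-" i "\t" (tot ? count[i] / tot * 100 : 0)
            }
' "$workdir/reads.txt" > "$workdir/log.txt"

cat "$workdir/log.txt"
```

On the four sample records this should report a total of 4 with MID-1 at 50 and MID-2 and MID-3 at 25 each, and MID-1.txt should hold two complete four-line records. If real data can contain wrapped sequences or blank lines, the fixed four-line assumption no longer holds and the record logic would need rethinking.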
