Issue when using egrep to extract strings (too many strings)


 
# 1  
Old 04-20-2016

Dear all,
I have data like the example below (about 400,000 rows) and I want to extract the rows that contain certain strings, using the code below. It works when there are not too many strings, for example fewer than 5,000, but I have 90,000 strings to extract. With that many strings, the egrep command below fails with this error:

error:
Code:
 /usr/bin/egrep: Argument list too long

data example:
Code:
 
 ILMN_167228 9.523 1.599 8.518
 ILMN_168228 8.823 2.599 8.518
 ILMN_169228 8.023 3.599 8.518
 ILMN_1751228 8.423 4.599 8.518
 ILMN_7751228 8.323 5.599 8.518
 ILMN_1881228 8.223 8.599 8.518

...

code example:
Code:
 
  
 egrep '(ILMN_2258774|ILMN_1700477|...|ILMN_1805992)' test1>test2

I get the error because I have too many strings (n=80,000) to extract.

error:
Code:
 /usr/bin/egrep: Argument list too long

Does anyone know how to fix this, or another way to handle my request? Thank you.
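
For context, the message comes from the operating system rather than from egrep itself: the whole alternation is passed as a single command-line argument, and the system limits how much argument data a new process may receive. A rough sketch of the comparison, assuming IDs of about 12 characters plus a "|" separator (the variable names are only for illustration, and some systems apply additional per-argument limits that this does not show):

```shell
# Upper bound (in bytes) on the combined size of arguments and environment
# that the system accepts when starting a new process.
limit=$(getconf ARG_MAX)

# Rough size of one alternation pattern: 90,000 IDs of about 12 characters,
# each followed by a "|" separator (figures taken from the post above).
pattern_bytes=$((90000 * 13))

echo "ARG_MAX: $limit bytes"
echo "pattern: about $pattern_bytes bytes"
```

Whatever the exact limit is on a given machine, reading the strings from a file avoids it, because the file contents never travel on the command line.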
# 2  
Old 04-20-2016
You could try putting those strings in a file, like so:

Code:
ILMN_2258774
ILMN_1700477
...
ILMN_1805992

Then you can extract like so:
Code:
grep -f stringfile test1>test2
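
A small self-contained run of the same idea, as a sketch only. The sample rows are taken from the first post; the choice of the two search strings and the scratch directory are assumptions for the demo. The -F option is an addition that is not in the command above: it tells grep to treat each line of stringfile as a fixed string instead of a regular expression, which is enough for plain IDs like these.

```shell
# Work in a scratch directory so no existing file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data in the same layout as the first post.
cat > test1 <<'EOF'
ILMN_167228 9.523 1.599 8.518
ILMN_168228 8.823 2.599 8.518
ILMN_169228 8.023 3.599 8.518
ILMN_1751228 8.423 4.599 8.518
EOF

# One search string per line. The list can be as long as needed because
# grep reads it from the file instead of from the command line.
printf '%s\n' ILMN_168228 ILMN_1751228 > stringfile

# -f reads the patterns from stringfile; -F (added here) matches them as
# fixed strings rather than regular expressions.
grep -F -f stringfile test1 > test2
cat test2
```

With this sample, test2 ends up with the ILMN_168228 and ILMN_1751228 rows only.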

For accuracy it would be better to anchor the patterns by putting a single space after each string (ILMN_ is unique enough that there does not need to be a ^ in front). This avoids possible false positives from substring matches, unless all strings have the same length:

Code:
ILMN_2258774 
ILMN_1700477 
...
ILMN_1805992
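
To see the substring problem and the effect of the trailing space, here is a minimal sketch. The ID ILMN_1672289 is made up for the demo so that one ID is a prefix of another, the file name stringfile_anchored is hypothetical, and -F (fixed-string matching) is an addition to the grep command shown earlier. The sed line is one way to append the space to an existing list without editing it by hand.

```shell
# Work in a scratch directory so no existing file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The second ID is hypothetical: it starts with the first ID.
cat > test1 <<'EOF'
ILMN_167228 9.523 1.599 8.518
ILMN_1672289 8.823 2.599 8.518
EOF

printf '%s\n' ILMN_167228 > stringfile

# Without anchoring, the search string also matches inside the longer ID.
unanchored=$(grep -F -c -f stringfile test1)
echo "unanchored matches: $unanchored"

# Append a single space to every search string.
sed 's/$/ /' stringfile > stringfile_anchored

# With the trailing space, only the exact ID followed by a space matches.
grep -F -f stringfile_anchored test1 > test2
cat test2
```

The unanchored search counts both rows, while the anchored list keeps only the ILMN_167228 row.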

--
On Solaris use /usr/xpg4/bin/grep

Last edited by Scrutinizer; 04-20-2016 at 08:23 PM..
# 3  
Old 04-21-2016
There might also be a performance problem, because grep might try the patterns one after the other, like a loop would, for each line. Anchoring makes each comparison only a little faster.
If you have a plain string list without RE wildcards and without spaces, your main file is space-separated, and the strings should match the first field, then a hash lookup is much faster. With awk:
Code:
awk 'NR==FNR {A[$1]; next} ($1 in A)' stringfile test1>test2
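
The same one-liner spread over several lines with comments, as a sketch of how the two-file idiom works. The sample rows come from the first post; the two search strings and the scratch directory are assumptions for the demo.

```shell
# Work in a scratch directory so no existing file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > test1 <<'EOF'
ILMN_167228 9.523 1.599 8.518
ILMN_168228 8.823 2.599 8.518
ILMN_169228 8.023 3.599 8.518
ILMN_1751228 8.423 4.599 8.518
EOF

printf '%s\n' ILMN_168228 ILMN_1751228 > stringfile

awk '
    # NR counts lines across all files, FNR restarts for each file, so
    # NR == FNR is true only while the first file (stringfile) is read.
    # Referencing A[$1] creates the key in the array; no value is needed.
    NR == FNR { A[$1]; next }

    # For the second file (test1): print the line if its first field
    # is one of the stored keys.
    $1 in A
' stringfile test1 > test2
cat test2
```

Because the test is an exact match on the first field, no trailing space or other anchoring is needed. One caveat: if stringfile is empty, NR == FNR stays true while test1 is read, so nothing is printed.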


Last edited by MadeInGermany; 04-21-2016 at 03:43 AM..
# 4  
Old 04-21-2016
Thank you guys. I tried the code below and it works.
Code:
awk 'NR==FNR {A[$1]; next} ($1 in A)' stringfile test1>test2

 