Pull out multiple lines with grep patternfile


 
# 8  
Old 03-23-2013
Cool would do that thanks!
# 9  
Old 03-23-2013
More info

Here is the top of the pattern file:

Chr1 711
Chr1 892
Chr1 956
Chr1 10904
Chr1 32210
Chr1 37388
Chr1 49438
Chr1 71326
Chr1 71348
Chr1 88300
Chr1 90571
Chr1 90606
Chr1 90809
Chr1 90864
Chr1 96770
Chr1 97473


Here is the top of the Master file I'm trying to grep out of

Chr1 3658 A 1 ^M. C
Chr1 3659 A 1 . C
Chr1 3660 A 1 . C
Chr1 3661 C 2 .^M. F@
Chr1 3662 A 2 .. F@
Chr1 3663 A 2 .. F@
Chr1 3664 A 2 .. F@
Chr1 3665 T 2 .. FD
Chr1 3666 A 2 .. HD
Chr1 3667 C 2 .. HD
Chr1 3668 A 4 ..^K,^K. H:.$
Chr1 3669 T 4 ..,. GA1*
Chr1 3670 A 4 ..,. HCD9
Chr1 3671 A 4 ..,. GDDJ
Chr1 3672 T 4 ..,. JFBI
Chr1 3673 C 4 ..,. JBDJ
Chr1 3674 G 6 ..,.^:,^:. IBDJ?=
Chr1 3675 G 6 ..,.,. J8DHDJ




So I have a 12GB file and I want to pull out the lines that correspond to about 350,000 specific positions on the chromosomes. Each is a separate line.

I'm using Ubuntu. The Master file is a .bcf made with samtools, and the coordinates (pattern file) were made from a table in a paper and are tab delimited. I added a third column containing just a tab so as not to pull out multiple entries beginning with the search pattern; e.g. I only want Chr1 3675 pulled out, not other entries such as Chr1 367567.

This starts to output the lines with the correct patterns to the terminal within a few seconds:

Code:
cat C2L4_all.bcf | grep -f lertest.txt

But this just greps indefinitely: it creates the output file, but the file stays empty. After about 10 min I kill the command.

Code:
cat C2L4_all.bcf | grep -f lertest.txt > plop.txt

I tried the script command to copy the terminal to a file, but it has all kinds of formatting info in it as well. There must be an easier way.

---------- Post updated at 01:29 AM ---------- Previous update was at 01:21 AM ----------

Should I let it run overnight? It's just that the version that doesn't write to a file only takes a few seconds!

I'm wondering if it scans the whole 12 GB file before writing to the file, whereas when writing to the terminal it displays a match as soon as it finds one?

Seems odd if this is the case though!
# 10  
Old 03-23-2013
1) I can't reproduce your problem with the sample data you gave above, as there's no match between patterns and data.
2) Are you sure that all those strange char sequences near the end of the lines are really in the data file?
3) Why don't you do some testing with much smaller, but still matching, files?
4) Use code tags, not icode tags.
# 11  
Old 03-23-2013
Thanks RudiC,
I just took the top 100 lines of my 12GB file and did a mock run with 10 patterns I know are entries. It worked fine both to the terminal and written to file. So it looks like the problem is the size of the file that is being searched for the patterns.

Can I tell grep to only look for one match? I guess that may make it faster.

FGPonce
# 12  
Old 03-23-2013
Reduce the pattern file to just one line, or give the pattern on the command line. You may want to switch off regex matching (which is compute intensive) with the -F (fixed string) option to grep.
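
A minimal sketch of the command-line variant, using a made-up two-line stand-in for the Master file (the temporary file and the variable names are assumptions for illustration only). It passes a single fixed-string pattern with real tabs, and also shows -m 1, which GNU and BSD grep offer (it is not in POSIX) to stop after the first matching line:

```shell
# Hypothetical two-line stand-in for the Master file.
master=$(mktemp)
printf 'Chr1\t3675\tG\nChr1\t367567\tT\n' > "$master"

# Build the pattern with printf so it contains real tabs. Command
# substitution strips only trailing newlines, so the trailing tab is kept.
pattern=$(printf 'Chr1\t3675\t')

# -F: fixed-string match; -e: pattern given on the command line.
one_hit=$(grep -F -e "$pattern" "$master")
printf '%s\n' "$one_hit"

# -m 1 (GNU and BSD grep, not POSIX) stops after the first matching line.
first_only=$(grep -F -m 1 -e 'Chr1' "$master")
printf '%s\n' "$first_only"

rm -f "$master"
```

Note that -m counts matching lines in total, not per pattern, so combined with -f and a long pattern file it would stop after the very first hit rather than finding one line per position.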
# 13  
Old 03-23-2013
Quote:
Originally Posted by FGPonce
Here is the top of the pattern file:

Code:
Chr1    711    
Chr1    892    
Chr1    956    
Chr1    10904    
Chr1    32210    
Chr1    37388    
Chr1    49438    
Chr1    71326    
Chr1    71348    
Chr1    88300    
Chr1    90571    
Chr1    90606    
Chr1    90809    
Chr1    90864    
Chr1    96770    
Chr1    97473

Here is the top of the Master file I'm trying to grep out of

Code:
Chr1    3658    A    1    ^M.    C
Chr1    3659    A    1    .    C
Chr1    3660    A    1    .    C
Chr1    3661    C    2    .^M.    F@
Chr1    3662    A    2    ..    F@
Chr1    3663    A    2    ..    F@
Chr1    3664    A    2    ..    F@
Chr1    3665    T    2    ..    FD
Chr1    3666    A    2    ..    HD
Chr1    3667    C    2    ..    HD
Chr1    3668    A    4    ..^K,^K.    H:.$
Chr1    3669    T    4    ..,.    GA1*
Chr1    3670    A    4    ..,.    HCD9
Chr1    3671    A    4    ..,.    GDDJ
Chr1    3672    T    4    ..,.    JFBI
Chr1    3673    C    4    ..,.    JBDJ
Chr1    3674    G    6    ..,.^:,^:.    IBDJ?=
Chr1    3675    G    6    ..,.,.    J8DHDJ



So I have a 12GB file and I want to pull out the lines that correspond to about 350,000 specific positions on the chromosomes. Each is a separate line.

I'm using Ubuntu. The Master file is a .bcf made with samtools, and the coordinates (pattern file) were made from a table in a paper and are tab delimited. I added a third column containing just a tab so as not to pull out multiple entries beginning with the search pattern; e.g. I only want Chr1 3675 pulled out, not other entries such as Chr1 367567.

This starts to output the lines with the correct patterns to the terminal within a few seconds:

Code:
cat C2L4_all.bcf | grep -f lertest.txt

But this just greps indefinitely: it creates the output file, but the file stays empty. After about 10 min I kill the command.

Code:
cat C2L4_all.bcf | grep -f lertest.txt > plop.txt

I tried the script command to copy the terminal to a file, but it has all kinds of formatting info in it as well. There must be an easier way.

---------- Post updated at 01:29 AM ---------- Previous update was at 01:21 AM ----------

Should I let it run overnight? It's just that the version that doesn't write to a file only takes a few seconds!

I'm wondering if it scans the whole 12 GB file before writing to the file, whereas when writing to the terminal it displays a match as soon as it finds one?

Seems odd if this is the case though!
Note 1: If your Master file contains any null bytes, the standards don't guarantee that you'll get the results you want. The behavior of grep is only specified when all of the input files are text files. By definition, text files have no lines longer than LINE_MAX bytes (including the terminating <newline> character) and do not contain any null bytes. Since you're seeing some output quickly, I doubt that this is your problem.

Note 2: I changed the ICODE tags to CODE tags in the quote from your posting above.

Note 3: The tabs in your input files have been converted to spaces (presumably by the tools used to copy and paste text from your files into this posting). You should do something like:
Code:
head C2L4_all.bcf lertest.txt | od -c

to verify that both of these files do have tabs rather than spaces separating fields. Note also that the last line you show above for your pattern file does NOT include a trailing tab (or spaces). (However, this could also just be a copy and paste artifact.) Again, since you're seeing some output quickly, this probably isn't the problem. But if some lines have tabs and others have spaces in either file (or both files), you could miss a lot of data you're trying to extract.
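
As a rough illustration of that check, here is a sketch on a made-up two-line pattern file (the temporary file and the variable names are hypothetical, not files from this thread). The first line uses tabs and the second uses a space; od -c shows tabs as \t, and grep -cv counts the lines that contain no tab at all:

```shell
# Hypothetical stand-in for the pattern file: line 1 is tab separated
# with a trailing tab, line 2 uses a space instead.
tmp=$(mktemp)
printf 'Chr1\t711\t\nChr1 892\n' > "$tmp"

# od -c prints each byte; tabs show up as \t and spaces as blanks.
head -n 2 "$tmp" | od -c

# Count the lines that do NOT contain a tab character.
tab=$(printf '\t')
no_tab=$(grep -cv "$tab" "$tmp")
echo "lines without a tab: $no_tab"

rm -f "$tmp"
```

On the real files, a non-zero count from the grep -cv step would point at lines that the tab-terminated patterns can never match.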

As RudiC mentioned, searching for fixed strings rather than basic regular expressions will be faster. And, as I mentioned before, getting rid of the unneeded cat will also be more efficient. Together, this would be:
Code:
grep -Ff lertest.txt C2L4_all.bcf > plop.txt
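
If substring matching ever turns out to be too loose, one alternative is to compare the first two fields exactly. The sketch below swaps grep for awk, which is not what was used in this thread; the sample data, file names and variable names are made up for illustration. It loads the wanted chromosome/position pairs into an array and prints only the Master lines whose first two fields match a pair exactly:

```shell
# Hypothetical miniature pattern and Master files.
workdir=$(mktemp -d)
printf 'Chr1\t3675\t\nChr1\t711\t\n' > "$workdir/positions.txt"
printf 'Chr1\t711\tA\nChr1\t3675\tG\nChr1\t367567\tT\nChr11\t711\tC\n' > "$workdir/master.txt"

# First file (NR == FNR): remember each chromosome/position pair.
# Second file: print lines whose first two fields form a remembered pair.
awk -F '\t' '
    NR == FNR { wanted[$1 FS $2]; next }
    ($1 FS $2) in wanted
' "$workdir/positions.txt" "$workdir/master.txt" > "$workdir/exact.txt"

cat "$workdir/exact.txt"
exact_count=$(wc -l < "$workdir/exact.txt" | tr -d ' ')

rm -rf "$workdir"
```

With this approach the extra trailing-tab column in the pattern file is not needed, since whole fields are compared rather than substrings.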

Note 4: You said that the command:
Code:
cat C2L4_all.bcf | grep -f lertest.txt

or more efficiently:
Code:
grep -Ff lertest.txt C2L4_all.bcf

starts writing output to the terminal within seconds. How much data did your command write to the terminal in ten minutes?

There is a good chance that the output produced by grep is being buffered when output is not directed to a terminal. If the amount of output you're seeing from the above command in ten minutes fits in less than BUFSIZ bytes on your system, you shouldn't expect to see anything in plop.txt in ten minutes. You should be able to find the value of BUFSIZ on your system in /usr/include/stdio.h or in a file that it #includes (assuming that your system includes the tools to build C or C++ applications and the headers are installed in the conventional location). From the data we've seen so far, I'm guessing this is your problem. If it is, letting the command run overnight may indeed solve your problem.
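
One way to test the buffering guess, assuming GNU grep (the default on Ubuntu): its --line-buffered option, which is not part of POSIX, flushes output after every matching line, so matches should appear in the output file as they are found. The sketch below uses made-up miniature files and hypothetical names rather than the real 12GB input:

```shell
# Hypothetical miniature pattern and Master files.
workdir=$(mktemp -d)
printf 'Chr1\t711\t\n' > "$workdir/patterns.txt"
printf 'Chr1\t711\tA\nChr1\t7110\tC\n' > "$workdir/master.txt"

# --line-buffered (GNU grep; an assumption about the installed grep)
# writes each matching line out immediately instead of waiting for a
# full stdio buffer.
grep -F --line-buffered -f "$workdir/patterns.txt" "$workdir/master.txt" > "$workdir/out.txt"

matched=$(wc -l < "$workdir/out.txt" | tr -d ' ')
echo "matching lines written: $matched"

rm -rf "$workdir"
```

Line buffering costs some throughput on large outputs, so it is probably better suited to confirming the diagnosis than to the final run.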
# 14  
Old 03-23-2013
Was cat the problem?

Hi,
I ran it overnight but found this error this morning:

Code:
grep: memory exhausted

However, I did go for gold and point a 350,000-pattern list at the 12 GB file.

I did a bit of playing: 300 patterns were searched for and written to file in 1 min 45 s, and 3,000 patterns in 2 min 39 s. With 30,000 patterns I stopped it after 60 min with 0 bytes of output. The best ratio was 5,000 patterns in 2 min 47 s.

I thought I'd have to dribble 5,000 patterns in at a time somehow.

Then I tried Don's version:

Code:
grep -Ff lertest.txt C2L4_all.bcf

It did all 350,000 patterns in about 5 minutes... nice one, Don! What does the -F do?

FGPonce
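
On the closing question: -F tells grep to treat every pattern as a fixed string rather than a regular expression, which is the "switch off regex matching" that RudiC suggested earlier. A small sketch of the difference, on a made-up three-line sample (the data and variable names are illustrative assumptions):

```shell
# Hypothetical sample; only the first line contains a literal "3.5".
sample=$(printf 'Chr1\t3.5\tA\nChr1\t325\tA\nChr1\t3x5\tC\n')

# Default (basic regular expression): "." matches any single character.
regex_hits=$(printf '%s\n' "$sample" | grep -c '3.5')

# -F: the pattern is a fixed string, so "." matches only a literal dot.
fixed_hits=$(printf '%s\n' "$sample" | grep -cF '3.5')

echo "regex: $regex_hits, fixed: $fixed_hits"
```

The regex version counts all three lines and the fixed-string version counts one. With plain chromosome/position patterns the matches are the same either way, but grep no longer has to build a regex matcher for 350,000 patterns, which is a plausible reason for both the speed-up and the disappearance of the "memory exhausted" error.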
 