How to control grep output intact for each matching line?


 
# 1  
Old 10-10-2018

I have about 80 files (some as large as 30 GB, with more than 1 billion lines!) to grep for a pattern, and I write all the matches to a single file. I have a 96-core machine, so I sent each grep job to the background to speed up the search. A sample of the input:
Code:
file1.tab
chr1A_part1    123241847    123241848
chr1A_part1    123241848    123241849
chr1A_part1    123241849    123241850
chr1A_part1    123241850    123241851
......

Every row of the input files has exactly 3 fields, and so should every row of the output file. This is the loop I ran:
Code:
for file in $(cat files.list); do 
grep -F chr1A ${file} >> subset_chr1A.tab &
done

But I found that some of the matching lines are broken, and the output file became a mess!
Code:
subset_chr1A.tab
chr1A_part1    123241847    123241848
chr1A_part1    123241848    123241849
chr1A_part1    1232
41849    123241850
ch1
chr1A_part1    12
3241850    
chr1A_part1    123441848    123441849
123541851
...

It seems to me the problem comes from writing the output, since 80 grep jobs for 80 files are all appending to the same file at the same time. By default grep prints whole matching lines, so I assumed each row would be written as a whole, but that is not what happened in my case.
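
Since every valid row should have exactly three fields, a rough way to gauge the damage is to count the rows that do not. The sketch below assumes whitespace-separated fields, as in the sample above; the function name `count_bad_rows` is made up for illustration. It can miss a line that happens to be cut exactly between fields, so treat the count as a lower bound.

```shell
#!/bin/sh
# count_bad_rows FILE  (hypothetical helper)
# Prints how many rows of FILE do not have exactly 3 whitespace-separated fields.
count_bad_rows() {
    # "bad + 0" forces a numeric 0 when no bad row was seen
    awk 'NF != 3 { bad++ } END { print bad + 0 }' "$1"
}
```

Usage would be something like `count_bad_rows subset_chr1A.tab`.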

What is wrong here?

Last edited by yifangt; 10-10-2018 at 02:20 PM.. Reason: typos
# 2  
Old 10-10-2018
Buffering will make a mess of this. Each grep collects its output into arbitrary-sized blocks and emits each block with one write, and those blocks don't care where lines begin and end, so writes from different jobs interleave mid-line. Long enough lines could conceivably take more than one write!

If you have GNU grep, --line-buffered may help, but will have a big performance cost.

You could also send the output to separate files and cat them together later.
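
A minimal sketch of that separate-files approach, assuming a list file with one input path per line (like files.list in the first post). The function name `subset_parallel` and the numbered `.part` file names are made up for illustration. Each background grep writes only to its own temporary file, so no two jobs ever share an output; the pieces are joined only after `wait` confirms every job has finished.

```shell
#!/bin/sh
# subset_parallel PATTERN LISTFILE OUTFILE  (hypothetical helper)
# Runs one background grep per file named in LISTFILE, each writing to its
# own temporary file, then concatenates the pieces in list order.
subset_parallel() {
    pattern=$1 list=$2 out=$3
    tmpdir=$(mktemp -d) || return 1
    n=0
    while IFS= read -r file; do
        n=$((n + 1))
        # one private output file per job: no interleaved writes
        grep -F -- "$pattern" "$file" > "$tmpdir/$n.part" &
    done < "$list"
    wait    # block until every background grep has exited
    : > "$out"
    i=1
    while [ "$i" -le "$n" ]; do
        cat "$tmpdir/$i.part" >> "$out"
        i=$((i + 1))
    done
    rm -rf "$tmpdir"
}
```

For the case in this thread the call would be `subset_parallel chr1A files.list subset_chr1A.tab`. Note that this starts all jobs at once, just as the original loop did.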
This User Gave Thanks to Corona688 For This Post:
# 3  
Old 10-10-2018
I will go with the second suggestion. Thanks!
# 4  
Old 10-10-2018
Why not forgo the loop?

Code:
# -h suppresses the "filename:" prefix that grep adds when given several files
grep -hF chr1A file*.tab > subset_chr1A.tab

# 5  
Old 10-11-2018
True, the limit is likely to be disk, not CPU.
# 6  
Old 10-12-2018
Thanks Rudic!
Before I try your method: does grep -F chr1A file*.tab load all 80 files (~2400 GB!) into memory first?
# 7  
Old 10-12-2018
I don't think it consumes much memory. grep processes each file as a stream: it checks each line and either outputs it or drops it, so memory use does not grow with file size.