Help with grep


 
# 1  
Old 05-02-2014
Help with grep

Hi guys,

I would like to ask for some help with grep or anything else that could do the trick :-)

Here's what I've got:

file1: a file with about 10 million lines and 10 columns.

Out of that file I need to cherry-pick about 600K lines.

I've made a file with the first field of the 600K lines and called it file2, then I used grep:

Code:
grep -f file2 file1 > output

It runs forever. I've waited overnight and it still didn't finish. After aborting it, the output file did have some lines, but of course not all of them.

Is there a faster way to do this, please?

Many thanks in advance!

Last edited by vbe; 05-02-2014 at 07:59 AM.. Reason: repl html by code tags...
# 2  
Old 05-02-2014
Could you please show us samples of file1 and file2, and tell us what you are grepping for?

Why do you want 600K lines? Please also show a sample of those 600K lines.
# 3  
Old 05-02-2014
Here is a sample of file1:

HTML Code:
CHR         SNP          BP  A1  A2     FRQ    INFO      OR      SE       P
  22  rs7    24000036   T   C  0.0080  1.1135      NA      NA      NA
  22  rs78    24000052   T   C  0.1807  0.9619  0.8719  0.1004  0.1719
  22 chr22_24000097_D    24000097  I5   D  0.5776  1.0069  1.0221  0.0795  0.7831
  22  rs8    24000122  I2   D  0.5865  0.9455  0.9655  0.0800  0.6605
  22   rs9    24000258   C   G  0.0924  1.0318  1.1276  0.1286  0.3501
  22 rs10    24000275  I7   D  0.2967  0.9806  0.9981  0.0836  0.9821
  22   rs5    24000486   T   C  0.4948  0.9869  1.1252  0.0768  0.1247
  22   rs96    24000528   T   G  0.1712  0.9309  0.9432  0.1040   0.574
  22   rs962    24000549   T   C  0.8080  0.9956  1.0710  0.1011  0.4976
  22   rs10    24000567   A   G  0.1826  0.9485  0.9107  0.1005  0.3518
etc
A sample of file2:

HTML Code:
rs113849440
rs5
rs9
rs10
rs11
rs12
rs13
rs65
etc
I need to cherry-pick the entries listed in file2 from file1, because I am only interested in the results for those entries.

Thanks for your help!

---------- Post updated at 06:42 AM ---------- Previous update was at 06:42 AM ----------

I am grepping file2 from file1
# 4  
Old 05-02-2014
Try:
Code:
#!/bin/bash

dir=/home/me
mkdir -p "$dir/tmp"

# Run one grep per field in the background
while IFS= read -r field
do
  grep -- "$field" "$dir/file1" > "$dir/tmp/${field}_out" &
done < "$dir/file2"

# Wait until all the background processes have completed
wait

[ -f "$dir/output" ] && mv "$dir/output" "$dir/output.old"

# Build the output file from all the results
for file in "$dir"/tmp/*_out
do
  cat "$file"
done > "$dir/output"

rm -rf "$dir/tmp"

Change dir to suit your needs.

Last edited by chacko193; 05-02-2014 at 09:23 AM.. Reason: typo
# 5  
Old 05-02-2014
If you use grep -f like that, there are a few things you need to know:
  • It is inaccurate: the match should only occur in the second field, but grep matches any occurrence on the line, including in other fields.
  • It is also inaccurate because, for example, rs7 would also match rs78.
  • There will be mismatches if there is any leading or trailing space in file2.
  • It uses regular expressions for matching, which can be expensive depending on your version of grep. It also needs to process the whole line.
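The substring problem from the list above can be reproduced on a tiny, made-up data set. This is only a sketch: the three-line file1 and one-line file2 are hypothetical, and the -F (fixed strings) and -w (whole words) options are assumed to be available, as they are in GNU and BSD grep. Note that -w still matches the word in any field, not just the second one.

```shell
# Hypothetical miniature data, not the poster's real files.
tmp=$(mktemp -d)
printf '22 rs7 24000036\n22 rs78 24000052\n22 rs9 24000258\n' > "$tmp/file1"
printf 'rs7\n' > "$tmp/file2"

# Plain pattern matching: rs7 also hits rs78 as a substring.
loose=$(grep -c -f "$tmp/file2" "$tmp/file1")

# -F treats patterns as fixed strings, -w only accepts whole-word matches.
strict=$(grep -c -F -w -f "$tmp/file2" "$tmp/file1")

echo "loose=$loose strict=$strict"   # prints: loose=2 strict=1
rm -rf "$tmp"
```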

I would try something like this:
Code:
awk 'NR==FNR{A[$1]; next} $2 in A' file2 file1

This uses an exact string match on the proper field and a couple of MB of memory.
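For anyone reading the one-liner for the first time, here is the same program spread out with comments and run on a small, made-up sample. The data and the array name wanted are illustrative assumptions; the logic is meant to be the same as in the one-liner.

```shell
# Hypothetical miniature data standing in for the real file1 and file2.
tmp=$(mktemp -d)
printf '22 rs7 1\n22 rs78 2\n22 rs9 3\n' > "$tmp/file1"
printf 'rs9\nrs7\n' > "$tmp/file2"

result=$(awk '
  # First file on the command line (file2): NR equals FNR only here.
  # Store each ID as an array key, then skip to the next line.
  NR == FNR { wanted[$1]; next }

  # Second file (file1): print the line if field 2 is a stored key.
  $2 in wanted
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

On this sample it prints the rs7 and rs9 lines and skips rs78, because the array lookup compares the whole field rather than a substring. The output keeps the order of file1.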



--
On Solaris, use /usr/xpg4/bin/awk or nawk.
If you happen to have mawk on your system, try that; it should be fastest.

Last edited by Scrutinizer; 05-02-2014 at 09:53 AM..
# 6  
Old 05-02-2014
@Chacko: That would mean 600000 background processes - I'm not sure any process table is that large. file2 would need to be split, and your program would then have to run once per chunk.

Are both files sorted, or can they be?
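One possible reason to ask about sorting is that sorted files can be matched with join. The sketch below is an assumption about where that question might lead, not something proposed in the thread, and the sample data is made up. It assumes a sort that accepts the b key modifier (GNU sort does) and pins the locale so that sort and join agree on ordering.

```shell
# Hypothetical miniature data standing in for the real file1 and file2.
tmp=$(mktemp -d)
printf '22 rs7 1\n22 rs78 2\n22 rs9 3\n' > "$tmp/file1"
printf 'rs9\nrs7\n' > "$tmp/file2"

# sort and join must agree on collation, so pin the locale for all three.
LC_ALL=C sort -k 1b,1 "$tmp/file2" > "$tmp/file2.sorted"
LC_ALL=C sort -k 2b,2 "$tmp/file1" > "$tmp/file1.sorted"

# Join field 1 of file2 with field 2 of file1.
joined=$(LC_ALL=C join -1 1 -2 2 "$tmp/file2.sorted" "$tmp/file1.sorted")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Two caveats: join prints the join field first, so the column order differs from file1, and a header line in file1 would be sorted in with the data and should be removed beforehand.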
# 7  
Old 05-02-2014
Thank you all for your help! The files can be sorted.

The solution posted by Scrutinizer works perfectly, and it only takes a few minutes! Many thanks!
 