Removing file lines that each match a different pattern


 
# 1  
Old 04-14-2010

I have a very large file (10,000,000 lines) that contains a sample id and a property of that sample on each line. I have another file of around 1,000,000 lines with the sample ids that I want to remove from the original file (that is, I want to create a new file without these lines).
I know how to do this in Perl, but it takes too long to run. I am aware that sed and awk should be able to complete this task much faster. I have tried to write commands that I thought would work, and I have consulted previous posts, but none seem to quite cover it. I also find it hard to debug, as the server I'm working on is set to French, so I don't understand the error messages my commands produce.

Could anyone please suggest a quick way of achieving this?

Here are examples of the files I'm dealing with.

Here is the tab-delimited file of sample ids and properties:
Code:
HELIUM:1:2:3      ABCDEF
HELIUM:1:2:4      ADEFBC
HELIUM:1:2:5      BDFACE
HELIUM:1:2:6      BEBACG
HELIUM:1:2:7      ABCDEF
HELIUM:1:2:8      ADEFBC
HELIUM:1:2:9      BDFACE
HELIUM:1:3:0      BEBACG

Here is a list of ids (the common prefix is missing) that I wish to remove:
Code:
:1:2:3
:1:2:5
:1:2:6
:1:2:9

Many thanks in advance for any help you can provide.

Last edited by Franklin52; 04-15-2010 at 08:47 AM.. Reason: Please use code tags!
# 2  
Old 04-14-2010
Have you tried using grep? Use it like this:

Code:
grep -v -f "ids.txt" "sample-id-property.txt" > remainder.txt

# 3  
Old 04-14-2010
Please check how much time it takes.
Code:
grep -v -f ids_to_remove_file original_file
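
A possible refinement of the grep suggestions above, offered as a sketch and not as something tested on the full 10,000,000-line file: the ids are literal strings, not regular expressions, so -F (fixed strings) lets grep skip regular-expression handling. A bare :1:2:3 would also match HELIUM:1:2:30, so the sketch first appends the tab separator to every id. The file names and the extra HELIUM:1:2:30 line are made up for the illustration.

```shell
# Sketch only: file names below are made up for this illustration.
work=$(mktemp -d)

# Tab-delimited sample lines from the first post, plus one extra line
# (HELIUM:1:2:30) to show why a tab is appended to each id.
printf 'HELIUM:1:2:3\tABCDEF\nHELIUM:1:2:4\tADEFBC\nHELIUM:1:2:5\tBDFACE\nHELIUM:1:2:30\tBEBACG\n' > "$work/sample-id-property.txt"
printf ':1:2:3\n:1:2:5\n' > "$work/ids.txt"

# Append a tab to every id so that ":1:2:3" cannot match ":1:2:30".
awk '{ print $0 "\t" }' "$work/ids.txt" > "$work/ids_tab.txt"

# -F: treat patterns as fixed strings; -f: read them from a file;
# -v: print only the lines that match none of them.
grep -v -F -f "$work/ids_tab.txt" "$work/sample-id-property.txt" > "$work/remainder.txt"

cat "$work/remainder.txt"
```

This still assumes that the prefix itself contains no colon-separated part that could look like an id; if it might, the awk approach with an exact key comparison is the safer choice.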

# 4  
Old 04-14-2010
If your second example, 1:2:3, is representative of the actual contents of the small file, i.e., it has no prefix and no suffixed data either, try:
Code:
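awk -F'[:\t]' ' # split on ":" and on the tab, so $4 of bigfile is the last id part only
               FILENAME=="smallfile" {sub(/^:/,""); arr[$1,$2,$3]++; next}   # accepts 1:2:3 or :1:2:3
               FILENAME=="bigfile" {if(($2,$3,$4) in arr) {next}; print $0 }
           ' smallfile bigfile >  newfile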
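
For reference, here is a sketch of the same idea run against the sample data from post #1, where the ids start with a colon and the big file is tab-delimited. It loads the ids into an awk array and then does one lookup per line of the big file. It assumes the prefix is everything before the first colon of the id field; the temporary directory is only there to make the example self-contained.

```shell
# Sketch only: builds small copies of the two files in a temp directory.
work=$(mktemp -d)
printf 'HELIUM:1:2:3\tABCDEF\nHELIUM:1:2:4\tADEFBC\nHELIUM:1:2:5\tBDFACE\nHELIUM:1:3:0\tBEBACG\n' > "$work/bigfile"
printf ':1:2:3\n:1:2:5\n' > "$work/smallfile"

awk -F'\t' '
    # First file (NR == FNR): remember every id, e.g. ":1:2:3".
    NR == FNR { drop[$1]; next }
    # Second file: strip the prefix up to the first colon from field 1,
    # then print the line only if the remaining id is not in the array.
    {
        id = substr($1, index($1, ":"))
        if (!(id in drop)) print
    }
' "$work/smallfile" "$work/bigfile" > "$work/newfile"

cat "$work/newfile"
```

Because the whole id is used as the array key, :1:2:3 can never be confused with :12:3 or :1:2:30.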

Also
Code:
export LC_ALL=C

may help your error message language problem.
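
If you would rather not change the locale for the whole session, the variable can also be set for a single command. A small sketch; the path is deliberately one that should not exist, so that an error message is produced:

```shell
# Set the locale for this one command only. The error message is then
# printed in the untranslated "C" locale instead of the server default.
# "|| true" just keeps the failing ls from aborting a script run with set -e.
msg=$(LC_ALL=C ls /no/such/directory 2>&1 || true)
printf '%s\n' "$msg"
```

The same prefix works for grep and awk, e.g. LC_ALL=C grep -v -f ids.txt bigfile, and in many setups the C locale also makes matching somewhat faster, although that depends on the grep implementation.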

Last edited by jim mcnamara; 04-14-2010 at 08:37 AM..
# 5  
Old 04-14-2010
Great, thank you!
The grep works, but I was afraid it would also run too slowly. I just did a sample that searched 10,000 lines for 1,000 ids and it worked in about 2 seconds. I'm rather happy with that. I just hope the large files don't add too much load.

@jim mcnamara. The awk works wonderfully, but how do I get the data into a new file rather than print?

Thanks a lot for the help with the language problem. I'll definitely use that.
# 6  
Old 04-14-2010
Add the code in red (the "> newfile" redirection at the end of the awk command).
# 7  
Old 04-14-2010
For comparison, the awk does the same task in 0.4 seconds.
Many thanks!
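
For anyone who wants to repeat this kind of comparison, here is a sketch that generates test data of the same size as the sample run mentioned above (10,000 lines, 1,000 ids) and checks that a grep and an awk variant agree. The id layout and file names are invented for the test and are not the real data.

```shell
# Sketch only: generated data, invented file names.
work=$(mktemp -d)

# 10,000 tab-delimited lines: HELIUM:1:<n>:0 <TAB> PROP<n>
awk 'BEGIN { for (n = 1; n <= 10000; n++) printf "HELIUM:1:%d:0\tPROP%d\n", n, n }' > "$work/bigfile"
# Every 10th id (1,000 in total) goes on the removal list, without the prefix.
awk 'BEGIN { for (n = 10; n <= 10000; n += 10) printf ":1:%d:0\n", n }' > "$work/smallfile"

# grep variant: fixed strings, with a tab appended so ids only match whole.
awk '{ print $0 "\t" }' "$work/smallfile" > "$work/ids_tab"
grep -v -F -f "$work/ids_tab" "$work/bigfile" > "$work/out_grep"

# awk variant: one array lookup per line of the big file.
awk -F'\t' 'NR == FNR { drop[$1]; next }
     !(substr($1, index($1, ":")) in drop)' "$work/smallfile" "$work/bigfile" > "$work/out_awk"

# Both outputs should be identical, with 9,000 lines remaining.
cmp "$work/out_grep" "$work/out_awk" && wc -l < "$work/out_awk"
```

Prefixing the grep and awk commands with time (in a shell that provides it) gives the timings; the numbers will of course differ from machine to machine.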