Match look up file and find result


 
# 1  
Old 09-10-2012
Match look up file and find result

Hi

I have a lookup file with seven words:
Code:
CD
HT
CAD
HT
T1D
T2D
BD

Another file contains data like this:

Code:
CHRM1    P11229    Pirenzepine    DAP000492    Peptic ulcer disease    Approved T2D
CHRM1    P11229    Glycopyrrolate    DAP001116    Anesthetic    Approved T2D
CHRM1    P11229    Clidinium    DAP001117    Abdominal/stomach pain    Approved T2D
CHRM1    P11229    Dicyclomine    DAP001118    Irritable bowel syndrome    Approved T2D
CHRM1    P11229    Ethopropazine    DAP001119    Parkinson's disease    Approved T2D
CHRM1    P11229    Cycrimine    DAP001120    Parkinson's disease    Approved T2D
CHRM1    P11229    Benztropine    DAP001121    Parkinson's disease    Approved T2D
CHRM1    P11229    Trihexyphenidyl    DAP001122    Parkinson's disease    Approved T2D
CHRM1    P11229    Propantheline    DAP001123    Excessive sweating (hyperhidrosis)    Approved T2D
CHRM1    P11229    Oxyphenonium    DAP001124    Spasm    Approved T2D
CHRM1    P11229    Biperiden    DAP001125    Parkinson's disease    Approved T2D
CHRM1    P11229    Talsaclidine isomer    DCL000268    Alzheimer's disease    Discontinued T2D
CHRM1    P11229    Sabcomeline hydrochloride    DCL000279    Cardiovascular diseases    Phase IIa T2D
CHRM1    P11229    Talsaclidine fumarate    DCL000303    Alzheimer's disease    Discontinued T2D
CHRM1    P11229    Xanomeline tartrate    DCL000328    Alzheimer's disease    Phase II T2D
CHRM1    P11229    GSK573719    DCL000381    Chronic Obstructive Pulmonary Disease (COPD)    Phase II T2D
CHRM1    P11229    GSK961081    DCL000397    Chronic Obstructive Pulmonary Disease (COPD)    Phase II completed T2D
CHRM1    P11229    GSK1034702    DCL000402    Schizophrenia, Dementia    Phase I completed T2D
CHRM1    P11229    Darotropium    DCL000514    COPD    Suspended in Phase II in GSK 2009 Report T2D
CHRM1    P11229    Darotropium + 642444    DCL000515    COPD    Phase III T2D
CHRM1    P11229    Revatropate    DCL000957    Chronic obstructive pulmonary disease    Discontinued in Phase I T2D
FLT1    P17948    Sorafenib    DAP000006    Advanced renal cell carcinoma    Launched CAD
FLT1    P17948    Sorafenib    DAP000006    Hepatocellular carcinoma, NSCLC, melanoma    Phase III CAD
FLT1    P17948    Sorafenib    DAP000006    Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer    Phase II CAD
FLT1    P17948    Ranibizumab    DAP001260    Age-related macular degeneration    Approved CAD
FLT1    P17948    Ranibizumab    DAP001260    Diabetic macular edema and retinal vein occlusion    Phase III CAD
FLT1    P17948    Telbermin    DCL001016    Diabetic foot ulcers    Discontinued in Phase II CAD
KDR    P35968    Sunitinib    DAP000005    Advanced renal cell carcinoma    Launched CAD,CD,CD
KDR    P35968    Sunitinib    DAP000005    Advanced renal cell carcinoma    Phase II CAD,CD,CD
KDR    P35968    Pazopanib HCl    DAP001550    Renal cell carcinoma    Approved CAD,CD,CD
KDR    P35968    CYC116    DCL000010    Solid Tumors    Terminated in Phase I CAD,CD,CD
KDR    P35968    XL999    DCL000011    Advanced Malignancies    Phase I CAD,CD,CD
KDR    P35968    CT-322    DCL000096    Cancer/Tumors    Phase I CAD,CD,CD
KDR    P35968    CT-322    DCL000096    Macular Degeneration    Preclinical CAD,CD,CD
KDR    P35968    XL647    DCL000263    Cancer    Phase I completed CAD,CD,CD
KDR    P35968    XL647    DCL000263    Carcinoma, Non-Small-Cell Lung    Phase II completed CAD,CD,CD
KDR    P35968    XL880    DCL000265    Solid Tumors    Phase I CAD,CD,CD
KDR    P35968    XL880    DCL000265    Gastric Cancer, Renal Cell Carcinoma, Squamous Cell Cancer of the Head and Neck    Phase II CAD,CD,CD
KDR    P35968    SU-6668    DCL000342    Advanced solid tumours    Discontinued CAD,CD,CD

I am using the following code:

Code:
awk -F'\t' 'FNR==NR{a[$0]=1;next} {
gsub(/Approved */,"",$6)
n=split($6,b,",")
$6=""
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile

But I am receiving seven files, and the output does not contain all the data from the second input file.

For example, one part of the output for the T2D file is:
Code:
CHRM1    P11229    Pirenzepine    DAP000492    Peptic ulcer disease        Approved
CHRM1    P11229    Glycopyrrolate    DAP001116    Anesthetic        Approved
CHRM1    P11229    Clidinium    DAP001117    Abdominal/stomach pain        Approved
CHRM1    P11229    Dicyclomine    DAP001118    Irritable bowel syndrome        Approved
CHRM1    P11229    Ethopropazine    DAP001119    Parkinson's disease        Approved
CHRM1    P11229    Cycrimine    DAP001120    Parkinson's disease        Approved
CHRM1    P11229    Benztropine    DAP001121    Parkinson's disease        Approved
CHRM1    P11229    Trihexyphenidyl    DAP001122    Parkinson's disease        Approved
CHRM1    P11229    Propantheline    DAP001123    Excessive sweating (hyperhidrosis)        Approved
CHRM1    P11229    Oxyphenonium    DAP001124    Spasm        Approved
CHRM1    P11229    Biperiden    DAP001125    Parkinson's disease        Approved
But, the expected output is
Code:
CHRM1    P11229    Pirenzepine    DAP000492    Peptic ulcer disease    Approved 
CHRM1    P11229    Glycopyrrolate    DAP001116    Anesthetic    Approved 
CHRM1    P11229    Clidinium    DAP001117    Abdominal/stomach pain    Approved 
CHRM1    P11229    Dicyclomine    DAP001118    Irritable bowel syndrome    Approved 
CHRM1    P11229    Ethopropazine    DAP001119    Parkinson's disease    Approved 
CHRM1    P11229    Cycrimine    DAP001120    Parkinson's disease    Approved 
CHRM1    P11229    Benztropine    DAP001121    Parkinson's disease    Approved 
CHRM1    P11229    Trihexyphenidyl    DAP001122    Parkinson's disease    Approved 
CHRM1    P11229    Propantheline    DAP001123    Excessive sweating (hyperhidrosis)    Approved 
CHRM1    P11229    Oxyphenonium    DAP001124    Spasm    Approved 
CHRM1    P11229    Biperiden    DAP001125    Parkinson's disease    Approved 
CHRM1    P11229    Talsaclidine isomer    DCL000268    Alzheimer's disease    Discontinued 
CHRM1    P11229    Sabcomeline hydrochloride    DCL000279    Cardiovascular diseases    Phase IIa 
CHRM1    P11229    Talsaclidine fumarate    DCL000303    Alzheimer's disease    Discontinued 
CHRM1    P11229    Xanomeline tartrate    DCL000328    Alzheimer's disease    Phase II 
CHRM1    P11229    GSK573719    DCL000381    Chronic Obstructive Pulmonary Disease (COPD)    Phase II 
CHRM1    P11229    GSK961081    DCL000397    Chronic Obstructive Pulmonary Disease (COPD)    Phase II completed 
CHRM1    P11229    GSK1034702    DCL000402    Schizophrenia, Dementia    Phase I completed 
CHRM1    P11229    Darotropium    DCL000514    COPD    Suspended in Phase II in GSK 2009 Report 
CHRM1    P11229    Darotropium + 642444    DCL000515    COPD    Phase III 
CHRM1    P11229    Revatropate    DCL000957    Chronic obstructive pulmonary disease    Discontinued in Phase I

So the output shows only those lines that contain the word "Approved" on the right-hand side, but the other lines should be there too.
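
A possible explanation, sketched under the assumption that the status and the tag share field 6, separated by a single space: `gsub` removes only the literal word "Approved", so on any other row the first element of `b` still carries the status text and never matches a lookup word. The row is copied from the sample data; the variable name `result` is only for this illustration.

```shell
# Run the same steps as the script on one row whose status is not "Approved".
result=$(printf 'CHRM1\tP11229\tSabcomeline hydrochloride\tDCL000279\tCardiovascular diseases\tPhase IIa T2D\n' |
awk -F'\t' '{
    gsub(/Approved */, "", $6)    # nothing to remove on this row
    n = split($6, b, ",")         # no comma, so field 6 stays in one piece
    print n ": [" b[1] "]"
}')

echo "$result"    # 1: [Phase IIa T2D]
```

Because `Phase IIa T2D` is not one of the lookup words, the `b[i] in a` test fails and the row is written to no file.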

---------- Post updated 09-10-12 at 05:29 AM ---------- Previous update was 09-09-12 at 11:56 PM ----------

Hi

I could probably get the result by editing the word "Approved" in the code, but I would have to list many other status words in the following code to make it worthwhile:

Code:
awk -F'\t' 'FNR==NR{a[$0]=1;next} {
gsub(/Approved */,"",$6)
n=split($6,b,",")
$6=""
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile


Last edited by manigrover; 09-10-2012 at 07:21 AM..
# 2  
Old 09-10-2012
If your ID tags are always the last thing in $6 and with no embedded spaces then you could split on space and take the last element. i.e. something like:
Code:
{
m=split($6,c," ");
$6=c[m];
n=split($6,b,",")
...
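
As a small illustration of the idea above (a sketch only, assuming the tag list is always the last space-separated word and has no embedded spaces), `split` returns the number of pieces, so indexing the array with that count gives the tag list. The two input values are field 6 entries from the sample data; the variable name `tags` is made up for the example.

```shell
# Feed two field-6 values to awk and print only their last space-separated word.
tags=$(printf '%s\n' 'Phase II completed T2D' 'Launched CAD,CD,CD' |
awk '{
    m = split($0, c, " ")    # m is the number of pieces
    print c[m]               # the last piece is the comma-separated tag list
}')

echo "$tags"
```

This prints `T2D` and `CAD,CD,CD` on separate lines, whatever status text comes before them.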


Last edited by CarloM; 09-10-2012 at 11:09 AM.. Reason: fixed array reference
# 3  
Old 09-10-2012
Request to check

Hi


Thanks for the reply. I tried the following but I am getting an error, and I am not sure how to solve it.

Code:
bash-3.2$ awk -F'\t' 'FNR==NR{a[$0]=1;next} {
n=split($6,c," ");
$6=c;
n=split($6,b,",")
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfie sarattdnewdruggene4.txt
awk: cmd. line:2: (FILENAME=sarattdnewdruggene4.txt FNR=1) fatal: attempt to use array `c' in a scalar context

# 4  
Old 09-10-2012
This error occurs because you are trying to assign an array to a non-array (scalar) field:
Code:
$6=c;

---------- Post updated at 05:41 PM ---------- Previous update was at 05:37 PM ----------

If your ID tags don't contain spaces, then try this (not tested):
Code:
awk 'FNR==NR{a[$0]=1;next} {
n=split($NF,b,",")
$NF=""
for(i=1;i<=n;i++)
 if(b[i] in a)
  print  > "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile

# 5  
Old 09-10-2012
Request to check

Hi Raj

Thanks for reply.

It is giving correct results, but the only issue, as you said, is the spacing.

When I paste the result into Excel, the spaces between words are treated as column separators.

The data below ends up in 9 or more columns, but it should come out in just 6 columns.
For example, for the first row:

column 1 for CHRM1
column 2 for P11229
column 3 for Pirenzepine
column 4 for DAP000492
column 5 for Peptic ulcer disease (not 3 different columns)
column 6 for Approved

Code:
CHRM1    P11229    Pirenzepine    DAP000492    Peptic ulcer disease    Approved 
CHRM1    P11229    Glycopyrrolate    DAP001116    Anesthetic    Approved 
CHRM1    P11229    Clidinium    DAP001117    Abdominal/stomach pain    Approved 
CHRM1    P11229    Dicyclomine    DAP001118    Irritable bowel syndrome    Approved 
CHRM1    P11229    Ethopropazine    DAP001119    Parkinson's disease    Approved 
CHRM1    P11229    Cycrimine    DAP001120    Parkinson's disease    Approved 
CHRM1    P11229    Benztropine    DAP001121    Parkinson's disease    Approved 
CHRM1    P11229    Trihexyphenidyl    DAP001122    Parkinson's disease    Approved 
CHRM1    P11229    Propantheline    DAP001123    Excessive sweating (hyperhidrosis)    Approved 
CHRM1    P11229    Oxyphenonium    DAP001124    Spasm    Approved 
CHRM1    P11229    Biperiden    DAP001125    Parkinson's disease    Approved 
CHRM1    P11229    Talsaclidine isomer    DCL000268    Alzheimer's disease    Discontinued 
CHRM1    P11229    Sabcomeline hydrochloride    DCL000279    Cardiovascular diseases    Phase IIa 
CHRM1    P11229    Talsaclidine fumarate    DCL000303    Alzheimer's disease    Discontinued 
CHRM1    P11229    Xanomeline tartrate    DCL000328    Alzheimer's disease    Phase II 
CHRM1    P11229    GSK573719    DCL000381    Chronic Obstructive Pulmonary Disease (COPD)    Phase II 
CHRM1    P11229    GSK961081    DCL000397    Chronic Obstructive Pulmonary Disease (COPD)    Phase II completed 
CHRM1    P11229    GSK1034702    DCL000402    Schizophrenia, Dementia    Phase I completed 
CHRM1    P11229    Darotropium    DCL000514    COPD    Suspended in Phase II in GSK 2009 Report 
CHRM1    P11229    Darotropium + 642444    DCL000515    COPD    Phase III 
CHRM1    P11229    Revatropate    DCL000957    Chronic obstructive pulmonary disease    Discontinued in Phase I
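
The extra columns are most likely a side effect of dropping `-F'\t'`: with the default field separator awk splits on any whitespace, and once a field is assigned the record is rebuilt with a tab between every word. A minimal check, assuming the input really is tab-separated (the variable names `row`, `loose` and `strict` are only for this illustration):

```shell
# One tab-separated row from the sample data.
row=$(printf 'CHRM1\tP11229\tPirenzepine\tDAP000492\tPeptic ulcer disease\tApproved T2D')

# Default separator: every run of blanks or tabs starts a new field.
loose=$(printf '%s\n' "$row" | awk '{ print NF }')

# Tab separator: multi-word values such as "Peptic ulcer disease" stay in one field.
strict=$(printf '%s\n' "$row" | awk -F'\t' '{ print NF }')

echo "whitespace-split fields: $loose, tab-split fields: $strict"
```

The first count is 9 and the second is 6, which matches the column counts seen in Excel.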

# 6  
Old 09-10-2012
Try this: your code with some modifications.
Code:
awk -F'\t' 'FNR==NR{a[$0]=1;next} {
n=split($6,b,",")
n1=split(b[1],c," ")
x=b[1]
b[1]=c[n1]
$6=""
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile
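
The scripts in this thread print a fixed "Approved" after an emptied column 6. A rough sketch of one way to keep each row's real status instead, assuming tab-separated input with the tag list as the last space-separated word of field 6: the sample files below are cut-down stand-ins for the real `lookupfile` and `mainfile`, and the `seen` array is an addition (not from the thread) so that a repeated tag such as `CAD,CD,CD` writes the row only once per file.

```shell
# Work in a scratch directory so the generated file_*.txt files are easy to inspect.
cd "$(mktemp -d)" || exit 1

# Cut-down sample inputs (assumption: real data has the same layout).
printf '%s\n' CD HT CAD T1D T2D BD > lookupfile
printf 'CHRM1\tP11229\tXanomeline tartrate\tDCL000328\tAlzheimer disease\tPhase II T2D\n' > mainfile
printf 'KDR\tP35968\tSunitinib\tDAP000005\tAdvanced renal cell carcinoma\tLaunched CAD,CD,CD\n' >> mainfile

awk -F'\t' -v OFS='\t' '
FNR == NR { a[$0] = 1; next }        # first file: remember every lookup word
{
    status = $6
    sub(/ *[^ ]+$/, "", status)      # status = field 6 without its last word
    m = split($6, c, " ")            # c[m] is the tag list
    n = split(c[m], b, ",")
    $6 = status                      # keep the real status in column 6
    split("", seen)                  # reset the per-row duplicate guard
    for (i = 1; i <= n; i++)
        if ((b[i] in a) && !(b[i] in seen)) {
            seen[b[i]] = 1
            print > ("file_" b[i] ".txt")
        }
}' lookupfile mainfile

cat file_CD.txt
```

The parentheses around the output file name are a precaution, since an unparenthesized concatenation after `>` is not parsed the same way by every awk.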

# 7  
Old 09-10-2012
Quote:
Originally Posted by manigrover
Hi


Thanks for the reply. I tried the following but I am getting an error, and I am not sure how to solve it.

Code:
bash-3.2$ awk -F'\t' 'FNR==NR{a[$0]=1;next} {
n=split($6,c," ");
$6=c;
n=split($6,b,",")
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfie sarattdnewdruggene4.txt
awk: cmd. line:2: (FILENAME=sarattdnewdruggene4.txt FNR=1) fatal: attempt to use array `c' in a scalar context

It should have been $6=c[m] (although that's actually an unnecessary step anyway):
Code:
carlo@host:/tmp -> cat x.awk
awk -F'\t' 'FNR==NR{a[$0]=1;next} {
   m=split($6,c," ")
   n=split(c[m],b,",")
   $6=""
   for(i=1;i<=n;i++)
      if(b[i] in a)
         print $0, "Approved" "::" "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile2
carlo@host:/tmp -> ./x.awk
CHRM1   P11229  Pirenzepine     DAP000492       Peptic ulcer disease            Approved::file_T2D.txt
CHRM1   P11229  Glycopyrrolate  DAP001116       Anesthetic              Approved::file_T2D.txt
CHRM1   P11229  Clidinium       DAP001117       Abdominal/stomach pain          Approved::file_T2D.txt
CHRM1   P11229  Dicyclomine     DAP001118       Irritable bowel syndrome                Approved::file_T2D.txt
CHRM1   P11229  Ethopropazine   DAP001119       Parkinson's disease             Approved::file_T2D.txt
CHRM1   P11229  Cycrimine       DAP001120       Parkinson's disease             Approved::file_T2D.txt
CHRM1   P11229  Benztropine     DAP001121       Parkinson's disease             Approved::file_T2D.txt
CHRM1   P11229  Propantheline   DAP001123       Excessive sweating (hyperhidrosis)              Approved::file_T2D.txt
CHRM1   P11229  Oxyphenonium    DAP001124       Spasm           Approved::file_T2D.txt
CHRM1   P11229  Biperiden       DAP001125       Parkinson's disease             Approved::file_T2D.txt
CHRM1   P11229  Talsaclidine isomer     DCL000268       Alzheimer's disease             Approved::file_T2D.txt
CHRM1   P11229  Sabcomeline hydrochloride       DCL000279       Cardiovascular diseases         Approved::file_T2D.txt
CHRM1   P11229  Talsaclidine fumarate   DCL000303       Alzheimer's disease             Approved::file_T2D.txt
CHRM1   P11229  GSK573719       DCL000381       Chronic Obstructive Pulmonary Disease (COPD)            Approved::file_T2D.txt
CHRM1   P11229  GSK961081       DCL000397       Chronic Obstructive Pulmonary Disease (COPD)            Approved::file_T2D.txt
CHRM1   P11229  GSK1034702      DCL000402       Schizophrenia, Dementia         Approved::file_T2D.txt
CHRM1   P11229  Darotropium     DCL000514       COPD            Approved::file_T2D.txt
CHRM1   P11229  Darotropium + 642444    DCL000515       COPD            Approved::file_T2D.txt
CHRM1   P11229  Revatropate     DCL000957       Chronic obstructive pulmonary disease           Approved::file_T2D.txt
FLT1    P17948  Sorafenib       DAP000006       Advanced renal cell carcinoma           Approved::file_CAD.txt
FLT1    P17948  Sorafenib       DAP000006       Hepatocellular carcinoma, NSCLC, melanoma               Approved::file_CAD.txt
FLT1    P17948  Sorafenib       DAP000006       Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer            Approved::file_CAD.txt
FLT1    P17948  Ranibizumab     DAP001260       Age-related macular degeneration                Approved::file_CAD.txt
FLT1    P17948  Ranibizumab     DAP001260       Diabetic macular edema and retinal vein occlusion               Approved::file_CAD.txt
FLT1    P17948  Telbermin       DCL001016       Diabetic foot ulcers            Approved::file_CAD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CAD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CAD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CD.txt
...etc...

(outputting to terminal for testing purposes...)

Last edited by CarloM; 09-10-2012 at 11:22 AM..