Find common entries


 
# 8  
Old 11-02-2012
Hi Mani,

OK.

Let us take a huge step backwards here and try to figure out what is going on.

In the first posting on this thread, input file 1 contained one word per line (a single field with no spaces). Input file 2 had 7 fields of one word each, with field 1 separated from field 2 by three spaces, two spaces between fields 2 and 3, two spaces between fields 3 and 4, and one space separating the remaining fields. In that posting you also showed that you wanted "(approved)" added to field 2 if field 2's original contents matched a line in file 1, and " (approved)" (note the leading space) added to field 6 if the contents of field 6 matched a line in file 1. (I assumed that the leading space was desired consistently, but I no longer have any idea whether this is true.)

Then in message #3, you showed us that the 1st file has lines containing one word (sometimes with a leading space) and, except for the last line, always a trailing space. So we now know that we can no longer match the even-numbered fields in file 2 against whole lines in file 1. That message also shows that in file 2, field 1 is separated from field 2 by five spaces, there are 4 spaces between fields 2, 3, and 4, and each line except the last has hundreds of spaces at the end.

Then in message #7, you tell us that we aren't dealing with text files at all. The real input files are binary files in the format produced by Microsoft's Word.

If the input files are binary files, the shell and most of the UNIX utilities aren't likely to be able to help you much. They're designed to work on text files, not binary files. If you won't accurately describe the input files that you want to process, there isn't much that we can do to help you.
# 9  
Old 11-04-2012
Request to check

Hi Don,

Sorry for any inconvenience. Both files are in text format on a Unix system. I put them in doc format myself to make them easy to upload to the website. So I am attaching the txt files this time.

Yes, the spacing between columns varies. I am sure that the entries are in different columns, but the spacing is not fixed; I checked this by pasting the data into an Excel file.

Sorry for any inconvenience. But I always upload files here in doc format.


Mani
# 10  
Old 11-06-2012
First. Please NEVER upload files in .doc format. Doing so only makes it impossible to figure out what the real format of your input data and desired output data actually looks like.

Second. Your files do not look like UNIX files at all. They look like oddly formatted DOS files. I say oddly because lines in both files end with <space><carriage-return><newline>. Furthermore, your fields are not consistently <tab> separated and they are not <space> separated; the separator between the 1st and 2nd fields in Secondfile.txt seems to be <space><tab>. So using a line from "first file.txt" as an entry to be found in "Secondfile.txt" will never match anything (since the space and carriage return at the end of the lines in the 1st file do not appear in the even fields in the 2nd file).

Third. The last line in Secondfile.txt contains 9 tab characters, the other lines in that file contain 77 tab characters.

Fourth. Even after cleaning up "first file.txt", there are no entries in that file that match any field in "Secondfile.txt".
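For anyone who wants to check for oddities like these in their own files, something along the following lines should work. This is only a sketch: sample.txt and its contents are made up here to mimic the problems described above; they are not the uploaded files.

```shell
# Hypothetical sample that mimics the oddities described above:
# "<space><tab>" after field 1 and "<space><carriage-return>" before each <newline>.
printf 'GENE1 \tDrugA\tBrandA \r\nGENE2 \tDrugB\tBrandB \r\n' > sample.txt

# Dump every byte so that tabs (\t), carriage returns (\r), and newlines (\n)
# become visible.
od -c sample.txt

# Count the tab characters on each line (gsub returns the number of
# replacements it made; "&" puts each matched tab back unchanged).
awk '{ printf "%d: %d tabs\n", FNR, gsub(/\t/, "&") }' sample.txt

# Count the lines that end with a carriage return (DOS-style line endings).
awk '/\r$/ { n++ } END { printf "%d lines end with CR\n", n }' sample.txt
```

Running the same od and awk commands on the real input files (instead of sample.txt) shows whether the separators and line endings are what you think they are.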

So, I added some code to the awk script to clean up both input files and used tab as the input and output field separators. The updated script is:
Code:
awk -F "\t" 'BEGIN {OFS = "\t"}
{       # Fix input oddities (globally change "<space><tab>" to "<tab>" and
        # delete any combination of <space>s, <tab>s, and <carriage-return>s at
        # the end of each input line).
        gsub(/ \t/, "\t")
        sub(/[ \t\r]+$/, "")
}
FNR == NR {
        # Save the "approved" list from the 1st file.
        c[$0]
        next
}
{       for(i = 2; i <= NF; i += 2)
                if($i in c)
                        # An entry in an even field in the 2nd file matched an
                        # item in the approved list; mark it approved.
                        $i = $i " (approved)"
        print
}' 'first file.txt' 'Secondfile.txt'

This script seems to do what you need, but (other than changing "<space><tab>" to "<tab>" as the field separator between all fields and throwing away all trailing <space>, <tab>, and <carriage-return> characters) the output is unchanged because no entries in "first file.txt" appear anywhere in "Secondfile.txt".
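To show what the script does when the approved list does contain matching entries, here is a small sketch using made-up data. The file names approved_demo.txt, data_demo.txt, and result_demo.txt and their contents are assumptions for illustration only; they are not the uploaded files.

```shell
# Hypothetical inputs: an approved list with "<space><carriage-return>" line
# endings, and a data line with "<space><tab>" after field 1.
printf 'Trospium \r\nHeparin \r\n' > approved_demo.txt
printf 'CHRM1 \tTrospium\tSanctura\tOxyphenonium\tAntrenyl \r\n' > data_demo.txt

awk -F '\t' 'BEGIN { OFS = "\t" }
{       # Clean up every line of both files the same way.
        gsub(/ \t/, "\t")
        sub(/[ \t\r]+$/, "")
}
FNR == NR {
        # 1st file: remember each cleaned line as an approved entry.
        c[$0]
        next
}
{       # 2nd file: mark even-numbered fields that are in the approved list.
        for (i = 2; i <= NF; i += 2)
                if ($i in c)
                        $i = $i " (approved)"
        print
}' approved_demo.txt data_demo.txt > result_demo.txt

cat result_demo.txt
```

With this sample, field 2 (Trospium) gets " (approved)" appended and field 4 (Oxyphenonium) is left alone, because only Trospium appears in the approved list once the trailing space and carriage return have been stripped.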
# 11  
Old 11-06-2012
Dear Don

Thanks for your help. I checked this, but there are several entries common to the first file and the second file.

In fact, somebody has even given me code that found matching entries between the two files, but it does not work on my system: the output is unchanged on my system, yet it works on his. A bit strange!

Below is the code and the output he provided, which I cannot reproduce on my system:


Code:
$ awk 'NR==FNR{X[$0]=$0;next}{s=$1;$1="";for(i in X){if($0 ~ i){gsub(i,i" (matched)",$0)}};$0=s""$0}1' file1 file2
FHIT Adenosine (matched) Monotungstate Not Available,T2D Ado-P-Ch2-P-Ps-Ado Not Available,
CHRM1 Trospium (matched) Sanctura T2D Oxyphenonium (matched) Antrenyl T2D
PDE3B 5r-6-4-2-3-Iodobenzyl-3-Oxocyclohex-1-En-1-YlAminoPhenyl-5-Methyl-4,5-Dihydropyridazin-32h-One Not Available,T1D Hg9a-9, Nonanoyl-N-Hydroxyethylglucamide Not Available,
HSP90AA19-Butyl-8-2,5-Dimethoxy-Benzyl-9h-Purin-6-Ylamine Not Available,T2D 8-2-Chloro-3,4,5-Trimethoxy-Benzyl-2-Fluoro-9-Pent-4-Ylnyl-9h-Purin-6-Ylamine Not Available,T2D
ESR1 Chlorotrianisene (matched) Anisene,BD Conjugated Estrogens (matched) Conestoral,BD
INS M-Cresol Not Available,
FAH Acetoacetic Acid Not Available,BD 4-Hydroxy-Methyl-Phosphinoyl-3-Oxo-Butanoic Acid Not Available,
LPL Tyloxapol (matched) Alevaire,
ADAM17 3S-1-4-BUT-2-YN-1-YLOXYPHENYLSULFONYLPYRROLIDINE-3-THIOL Not Available T2D 3-4-but-2-yn-1-yloxyphenylsulfonylpropane-1-thiol Not Available T2D
GUCY1A2 Nitric Oxide (matched) INOmax,RA Isosorbide Mononitrate (matched) Conpin,
B4GALT1 6-Aminohexyl-Uridine-C1,5'-Diphosphate Not Available,
LCK 4-2-Acetylamino-2-3-Carbamoyl-2-Cyclohexylmethoxy-6,7,8,9-Tetrahydro-5h-Benzocyclohepten-5ylcarbamoyl-Ethyl-2-Phosphono-Phenyl-Phosphonic Acid Not Available,T1D 4-2-Acetylamino-2-1-3-Carbamoyl-4-Cyclohexylmethoxy-Phenyl-Ethylcarbamoyl-Ethyl-2-Phosphono-Phenoxy-Acetic Acid Not Available,T1D
GMDS Guanosine-5'-Diphosphate-Rhamnose Not Available,
LCT D-Gluconhydroximo-1,5-Lactam Not Available T2D Gluconolactone Not Available T2D
CALM1 3''-Beta-Chloroethyl-2'',4''-Dioxo-3, 5''-Spiro-Oxazolidino-4-Deacetoxy-Vinblastine (matched) Not Available T2D Prenylamine Bismethin,
RET 4-BROMO-2-FLUORO-N-4E-6-METHOXY-7-1-METHYLPIPERIDIN-4-YLMETHOXYQUINAZOLIN-41H-YLIDENEANILINE Not Available,
CYP1A2 2-PHENYL-4H-BENZOHCHROMEN-4-ONE Not Available,
PPARA Clofibrate (matched) Amotril,CD Gemfibrozil (matched) Bolutol,
TGFBR1 4-3-Pyridin-2-Yl-1h-Pyrazol-4-YlQuinoline Not Available,T2D Naphthyridine Inhibitor Not Available,T2D
PPARD 11E-OCTADEC-11-ENOIC ACID Not Available T2D 2S-2-3-2-fluoro-4-trifluoromethylphenylcarbonylaminomethyl-4-methoxybenzylbutanoic acid Not Available T2D
CSNK1G3 2Z-4-AMINO-2-4-METHOXYPHENYLIMINO-2,3-DIHYDRO-1,3-THIAZOL-5-YL4-METHOXYPHENYLMETHANONE Not Available T2D 4-AMINO-2-3-CHLOROANILINO-1,3-THIAZOL-5-YL4-FLUOROPHENYLMETHANONE Not Available,
NR3C1 Flunisolide (matched) Aerobid T2D Diflorasone (matched) Florone T2D
CTSD 1h-Benoximidazole-2-Carboxylic Acid Not Available T2D N-Aminoethylmorpholine Not Available T2D
TLL2 Carbobenzoxy-Pro-Lys-Phe-YPo2-Ala-Pro-Ome Not Available,
TYR Monobenzone (matched) AgeRite Alba,
HSD11B1 3,3-dimethylpiperidin-1-yl6-3-fluoro-4-methylphenylpyridin-2-ylmethanone Not Available,RA 5S-2-1S-1-4-fluorophenylethylamino-5-1-hydroxy-1-methylethyl-5-methyl-1,3-thiazol-45H-one Not Available,RA
C5 Eculizumab (matched) Soliris,
FGF1 Sucrose Octasulfate Not Available T2D Naphthalene Trisulfonate Not Available T2D
SORD Cp-166572, 2-Hydroxymethyl-4-4-N,N-Dimethylaminosulfonyl-1-Piperazino-Pyrimidine Not Available,
EGFR Gefitinib (matched) Iressa,T2D Panitumumab (matched) Vectibix,T2D
EPHB4 N-5-chloro-1,3-benzodioxol-4-yl-6-methoxy-7-3-piperidin-1-ylpropoxyquinazolin-4-amine Not Available T2D N'-5-CHLORO-1,3-BENZODIOXOL-4-YL-N-3,4,5- TRIMETHOXYPHENYLPYRIMIDINE-2,4-DIAMINE Not Available T2D
TPR N-1s-4-Bis2-ChloroethylAmino-1-Methylbutyl-N-6-Chloro-2-Methoxy-9-AcridinylAmine Not Available T2D Trypanothione Not Available,
CCL5 Heparin (matched) Disaccharide I-S Not Available,T1D Heparin (matched) Disaccharide Iii-S Not Available,

I tried it on different machines at my place, and it still does not work.

I also found that Adenosine Monotungstate is not present in the first file, yet it is shown as matched because Adenosine is present. That is wrong in my case: I have to match a whole entry from the first file against the whole contents of a column in the second file.

Let me know if you come across any solution.
# 12  
Old 11-07-2012
Question

This is all very interesting, but it has absolutely nothing to do with the data that you uploaded in the file named "first file.txt" in message #9 in this thread. If you look at the "first file.txt" that you uploaded, you will find that not one of Adenosine, Trospium, Estrogens, Tyloxapol, Oxide, Mononitrate, Vinblastine, Clofibrate, Gemfibrozil, Flunisolide, Diflorasone, Monobenzone, Eculizumab, Gefitinib, Panitumumab, and Heparin is in that file (and all of them are part of a field that contains "(matched)" in the output shown in your last message).

Obviously the file1 and file2 used in this test are not the "first file.txt" and "Secondfile.txt" that you told me to use. Please try my code with the actual data that was used for this test. If you don't have the data files that produced the output you're showing in this message, don't be surprised when my script produces different results.
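As for the concern in message #11 that Adenosine is reported as matched inside a longer name: that comes from matching a regular expression against the whole line rather than looking up a whole field. A minimal sketch of the difference, using made-up files list_demo.txt and fields_demo.txt (assumptions for illustration, not your data):

```shell
# Hypothetical data: the list holds "Adenosine"; field 2 of the data line
# holds the longer name "Adenosine Monotungstate".
printf 'Adenosine\n' > list_demo.txt
printf 'FHIT\tAdenosine Monotungstate\tNot Available\n' > fields_demo.txt

# Regular-expression matching against the whole line: reports a match,
# because "Adenosine" occurs somewhere inside the line.
awk 'NR == FNR { x[$0]; next }
     { for (k in x) if ($0 ~ k) print "regex match: " k }' list_demo.txt fields_demo.txt

# Whole-field lookup: prints nothing here, because no even-numbered field
# is exactly "Adenosine".
awk -F '\t' 'NR == FNR { x[$0]; next }
     { for (i = 2; i <= NF; i += 2)
               if ($i in x) print "exact match: " $i }' list_demo.txt fields_demo.txt
```

The "$i in c" test in my script is the second kind of match, so an entry is only marked when a complete field is identical to a complete (cleaned) line of the first file.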

I wish you luck, but I will not be able to provide any more assistance on this topic.