Help with matching entries in multiple files Post: 302605169

Post 302605169 by Vavad on Tuesday 6th of March 2012 07:55:59 PM

03-06-2012

Help with matching entries in multiple files

Hi,

I am pretty new to Linux and I have a question.

I have 3 tab-delimited text files which look like this:

FileA:

PROTEINID DESCRIPTION PEPTIDES FRAMES

GB://115298678 _gi_115298678_ref_NP_000055.2_ complement C3 precursor [Homo sapiens] 45 55
GB://4502027 _gi_4502027_ref_NP_000468.1_ serum albumin preproprotein [Homo sapiens] 34 73
Entrez://strain 11128 / EHEC _tr_C8UFA3_C8UFA3_ECO1A Conserved predicted protein OS=Escherichia coli O111:H_ (strain 11128 / EHEC) GN=ygfY 26 31
GB://296080754 _gi_296080754_ref_NP_001171670.1_ fibrinogen beta chain isoform 2 preproprotein [Homo sapiens] 23 30
GB://4557871 _gi_4557871_ref_NP_001054.1_ serotransferrin precursor [Homo sapiens] 16 23
GB://70906439 _gi_70906439_ref_NP_068656.2_ fibrinogen gamma chain isoform gamma_B precursor [Homo sapiens] 16 20
GB://66932947 _gi_66932947_ref_NP_000005.2_ alpha_2_macroglobulin precursor [Homo sapiens] 15 17

FileB:

PROTEINID DESCRIPTION PEPTIDES FRAMES

GB://115298678 _gi_115298678_ref_NP_000055.2_ complement C3 precursor [Homo sapiens] 43 52
GB://4502027 _gi_4502027_ref_NP_000468.1_ serum albumin preproprotein [Homo sapiens] 33 71
Entrez://strain 11128 / EHEC _tr_C8UL96_C8UL96_ECO1A HCP oxidoreductase_ NADH_dependent OS=Escherichia coli O111:H_ (strain 11128 / EHEC) GN=hcr 22 24
GB://296080754 _gi_296080754_ref_NP_001171670.1_ fibrinogen beta chain isoform 2 preproprotein [Homo sapiens] 21 24
GB://4557871 _gi_4557871_ref_NP_001054.1_ serotransferrin precursor [Homo sapiens] 16 24
GB://66932947 _gi_66932947_ref_NP_000005.2_ alpha_2_macroglobulin precursor [Homo sapiens] 15 16
GB://70906439 _gi_70906439_ref_NP_068656.2_ fibrinogen gamma chain isoform gamma_B precursor [Homo sapiens] 14 18

FileC:

GB://115298678 _gi_115298678_ref_NP_000055.2_ complement C3 precursor [Homo sapiens] 43 55
GB://4502027 _gi_4502027_ref_NP_000468.1_ serum albumin preproprotein [Homo sapiens] 30 67
GB://296080754 _gi_296080754_ref_NP_001171670.1_ fibrinogen beta chain isoform 2 preproprotein [Homo sapiens] 25 28
Entrez://strain 11128 / EHEC _tr_C8UF29_C8UF29_ECO1A Protease III OS=Escherichia coli O111:H_ (strain 11128 / EHEC) GN=ptr 24 28
GB://4557871 _gi_4557871_ref_NP_001054.1_ serotransferrin precursor [Homo sapiens] 16 23
GB://70906439 _gi_70906439_ref_NP_068656.2_ fibrinogen gamma chain isoform gamma_B precursor [Homo sapiens] 15 20
GB://4557485 _gi_4557485_ref_NP_000087.1_ ceruloplasmin precursor [Homo sapiens] 15 19

Explanation of the format:
PROTEINID: GB://4557485
DESCRIPTION: _gi_4557485_ref_NP_000087.1_ ceruloplasmin precursor [Homo sapiens]
PEPTIDES: 15
FRAMES: 19
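
For illustration, a minimal Python sketch of how one such line breaks into the four fields. It assumes the columns really are separated by single tab characters (the listing above shows them as spaces), and the function name `parse_entry` is made up for this example:

```python
def parse_entry(line):
    """Split one tab-delimited line into its four fields."""
    # Assumption: exactly four tab-separated columns per data line.
    protein_id, description, peptides, frames = line.rstrip("\n").split("\t")
    return protein_id, description, int(peptides), int(frames)


line = (
    "GB://4557485\t"
    "_gi_4557485_ref_NP_000087.1_ ceruloplasmin precursor [Homo sapiens]\t"
    "15\t19\n"
)
print(parse_entry(line))
```

Splitting on tabs rather than on whitespace matters here, because both the description and some IDs (for example `Entrez://strain 11128 / EHEC`) contain spaces.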

I actually have 6 such files, each with thousands of entries. I want to output only the entries that are common to all three files, using the 1st column as the match criterion, and display everything that follows the matched entry in every file, separated by tabs. Can you please help me with this?

I know how to do this with just two files. Multiple files is something I haven't tried and I really need help with that!

Thanks.

Last edited by Vavad; 03-07-2012 at 04:19 PM..
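
One possible approach to the question above, sketched in Python rather than shell. This is only a sketch under stated assumptions, not a tested solution for the real data: it assumes the fields are separated by tabs, that a header row (if present) starts with `PROTEINID`, and that an ID occurs at most once per file. The function names `read_entries` and `common_entries` are made up for this example. It works for any number of files, so the same code covers 3 or 6.

```python
import sys


def read_entries(path):
    """Map the first column to the list of remaining columns for one file."""
    entries = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if not fields[0] or fields[0] == "PROTEINID":
                continue  # skip blank lines and the header row
            # Assumption: an ID appears once per file; a repeat overwrites.
            entries[fields[0]] = fields[1:]
    return entries


def common_entries(paths):
    """Return one row per ID found in every file, in first-file order."""
    tables = [read_entries(path) for path in paths]
    if not tables:
        return []
    rows = []
    for key in tables[0]:  # dicts keep insertion order
        if all(key in table for table in tables[1:]):
            row = [key]
            for table in tables:
                row.extend(table[key])  # columns from each file, in order
            rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical usage: python common.py FileA FileB FileC > common.tsv
    for row in common_entries(sys.argv[1:]):
        print("\t".join(row))
```

Each output line holds the shared ID followed by the remaining columns from every file, in the order the files were given. Note that in the sample data `Entrez://strain 11128 / EHEC` appears in all three files with a different description each time, so it counts as a match when only the 1st column is compared.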
