Full Discussion: comparing multiple files
Shell Programming and Scripting: post 302338295 by karla on Monday 27th of July 2009 01:04:26 PM

Hi, quick question: I have one file that pairs each file with its reference file.
It looks like this:
KB0000 KB207418
KB0001 KB244904
KB0002 KB215027
KB0003 KB215027
KB0004 KB215027
KB0005 KB204320
KB0006 KB207074
KB0007 KB215204
KB0008 KB223809
KB0009 KB236640
KB0010 KB244506
....
Then I have all of these files, which should be compared pairwise (each file against its reference), and any difference should be printed. The files look like this:
>KB0000 1658 amino acids
#
#
#
# Sequence # x Context Score Kinase Answer
# -------------------------------------------------------------------
# KB0000 10 S RRWASGSRG 0.978 unsp YES
# KB0000 10 S RRWASGSRG 0.637 PKA YES
# KB0000 10 S RRWASGSRG 0.528 RSK YES
# KB0000 10 S RRWASGSRG 0.519 cdc2 YES
# KB0000 10 S RRWASGSRG 0.468 CaM-II .
# KB0000 10 S RRWASGSRG 0.441 GSK3 .
# KB0000 10 S RRWASGSRG 0.416 DNAPK .
# KB0000 10 S RRWASGSRG 0.359 CKI YES
# KB0000 10 S RRWASGSRG 0.356 PKG .
# KB0000 10 S RRWASGSRG 0.281 p38MAPK .
# KB0000 10 S RRWASGSRG 0.252 ATM .
# KB0000 10 S RRWASGSRG 0.232 PKC .
# KB0000 10 S RRWASGSRG 0.223 CKII .
# KB0000 10 S RRWASGSRG 0.168 cdk5 .
# KB0000 10 S RRWASGSRG 0.147 PKB .
#
# KB0000 12 S WASGSRGAA 0.757 PKC YES



>KB207418 1658 amino acids
#
#
# Sequence # x Context Score Kinase Answer
# -------------------------------------------------------------------
# KB207418 10 S RRWASGSRG 0.978 unsp YES
# KB207418 10 S RRWASGSRG 0.637 PKA YES
# KB207418 10 S RRWASGSRG 0.528 RSK YES
# KB207418 10 S RRWASGSRG 0.519 cdc2 YES
# KB207418 10 S RRWASGSRG 0.468 CaM-II .
# KB207418 10 S RRWASGSRG 0.441 GSK3 .
# KB207418 10 S RRWASGSRG 0.416 DNAPK .
# KB207418 10 S RRWASGSRG 0.359 CKI .
# KB207418 10 S RRWASGSRG 0.356 PKG .
# KB207418 10 S RRWASGSRG 0.281 p38MAPK .
# KB207418 10 S RRWASGSRG 0.252 ATM .
# KB207418 10 S RRWASGSRG 0.232 PKC .
# KB207418 10 S RRWASGSRG 0.223 CKII .
# KB207418 10 S RRWASGSRG 0.168 cdk5 .
# KB207418 10 S RRWASGSRG 0.147 PKB .
#
# KB207418 12 S WASGSRGAA 0.757 PKC YES



So in this case the output should be:
# KB0000 10 S RRWASGSRG 0.359 CKI YES


Thanks in advance for the help!
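
One possible approach to the comparison described above, as a minimal sketch rather than a definitive answer. It assumes that each prediction is stored in its own file named after its ID (for example KB0000 and KB207418) in a single directory, and that the pair list is a plain two-column file. The function names compare_pair and compare_all are hypothetical. Because the two files can carry a different number of "#" header lines, the sketch does not diff line by line. It treats only rows of the form "# ID position residue context score kinase answer" as data, ignores the ID column, and prints every data row of the first file that has no identical row in its reference.

```shell
# compare_pair QUERY REF
# Print each data row of QUERY that has no counterpart in REF once the
# sequence ID column is ignored. File layout is assumed from the post.
compare_pair() {
    awk '
        # Data rows: "#", ID, position, residue, context, score, kinase, answer
        $1 == "#" && NF == 8 && $3 ~ /^[0-9]+$/ {
            key = $3 FS $4 FS $5 FS $6 FS $7 FS $8
            if (FILENAME == ARGV[1]) {
                ref[key] = 1          # remember every row of the reference
            } else if (!(key in ref)) {
                print                 # row only present in the query file
            }
        }
    ' "$2" "$1"
}

# compare_all DIR PAIRS
# PAIRS holds one "query reference" ID pair per line; lines that do not
# name two existing files in DIR (such as a "...." placeholder) are skipped.
compare_all() {
    while read -r query ref || [ -n "$query" ]; do
        [ -f "$1/$query" ] && [ -f "$1/$ref" ] || continue
        compare_pair "$1/$query" "$1/$ref"
    done < "$2"
}
```

With the definitions loaded, something like compare_all . pairs.txt would walk the whole list (pairs.txt is an assumed name for the pair file). The comparison is one-directional on purpose: only rows from the first file of each pair are reported, which matches the single line of output expected above.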
 

Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.