Comparing entries between files


 
# 1  
Old 09-07-2010
I have 2 files, each containing several hundred different IDs and sequences, like this:
File 1
Quote:
>Pat1_s60_c01_T400_s30_c08_0_1610
TAGATGTGCCCGTGGGTTTC
>Pat1_s60_c01_T400_s30_c08_10_3845
TTGATGTCGTGGGTTTCCCG
>Pat1_s60_c01_T400_s30_c08_23_28
TTGATGTGCCAGTTTCCCGT
>Pat1_s60_c01_T400_s30_c08_33_2588
TTGATGTGTCCCGTCGACAC
File 2
Quote:
>Pat1_s60_c01_T400_s30_c08_0_1610
TAGATGTGCCCGTGGGTTAC
>Pat1_s60_c01_T400_s30_c08_1_3845
TTGATGTCGTGGGTTTCCCG
>Pat1_s60_c01_T400_s30_c08_3_28
TTGATGTGCCAGTTTCCCGT
>Pat1_s60_c01_T400_s30_c08_53_2588
TTGATGTGTCCCGTCGACGC
I need to compare the sequences between the 2 files and generate a file containing the sequences that are shared by both. The new identifier of each shared sequence should list its ID in both files, so I will end up with something like this:
File Shared sequences
Quote:
>File 1 Pat1_s60_c01_T400_s30_c08_10_3845; File 2 Pat1_s60_c01_T400_s30_c08_1_3845
TTGATGTCGTGGGTTTCCCG
>File 1 Pat1_s60_c01_T400_s30_c08_23_28; File 2 Pat1_s60_c01_T400_s30_c08_3_28
TTGATGTGCCAGTTTCCCGT
I also need to generate 2 independent files containing the sequences that are unique to each file, like this:
File 1 UNIQUE
Quote:
>Pat1_s60_c01_T400_s30_c08_0_1610
TAGATGTGCCCGTGGGTTTC
>Pat1_s60_c01_T400_s30_c08_33_2588
TTGATGTGTCCCGTCGACAC
File 2 UNIQUE
Quote:
>Pat1_s60_c01_T400_s30_c08_0_1610
TAGATGTGCCCGTGGGTTAC
>Pat1_s60_c01_T400_s30_c08_53_2588
TTGATGTGTCCCGTCGACGC
This is way beyond my scripting capabilities and I really do not know how to go about it.
Any help will be greatly appreciated.
# 2  
Old 09-07-2010
A bash way (input files are named file1 and file2):
Code:
#!/bin/bash
# write headers to the files for readability.
# Remove or comment these 3 lines and uncomment the 3 others if you don't want headers
echo -e "\n-----------\nshared\n" > shared
echo -e "\n-----------\nunique1\n" > unique1
echo -e "\n-----------\nunique2\n" > unique2
# :>shared
# :>unique1
# :>unique2

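# Each record is two lines: a ">ID" header (P1) followed by its sequence (S).
while IFS= read -r P1
do
   IFS= read -r S
   # -F: fixed string, -x: match the whole line, -B1: also print the header above it
   P2=$(grep -FxB1 -- "$S" file2 | head -n1)
   if [ -n "$P2" ]
   then echo -e ">File 1 ${P1#>}; File 2 ${P2#>}\n$S" >> shared
   else echo -e "$P1\n$S" >> unique1
   fi
done < file1

# Second pass: keep the records of file2 whose sequence does not occur in file1.
while IFS= read -r P2
do
   IFS= read -r S
   grep -qFx -- "$S" file1 || echo -e "$P2\n$S" >> unique2
done < file2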

cat shared unique1 unique2 # to display the generated files
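
The loops above run grep once per record, so the work grows with the product of the two file sizes. For larger inputs, a single pass with awk may be worth trying. The following is only a sketch under some assumptions that are not stated in the question: every record is exactly two lines (header, then sequence), file1 is not empty, and a sequence occurs at most once per file. The sample data is a subset of the files in the question, and the script overwrites file1, file2, shared, unique1 and unique2 in the current directory, so try it in a scratch directory.

```shell
#!/usr/bin/env bash
# Sketch: compare two 2-line-per-record files in one awk pass.
# Assumptions: header line then sequence line, file1 not empty,
# and no sequence repeated within a file.

# Sample inputs (subset of the data in the question) so the sketch runs on its own.
cat > file1 <<'EOF'
>Pat1_s60_c01_T400_s30_c08_0_1610
TAGATGTGCCCGTGGGTTTC
>Pat1_s60_c01_T400_s30_c08_10_3845
TTGATGTCGTGGGTTTCCCG
EOF
cat > file2 <<'EOF'
>Pat1_s60_c01_T400_s30_c08_0_1610
TAGATGTGCCCGTGGGTTAC
>Pat1_s60_c01_T400_s30_c08_1_3845
TTGATGTCGTGGGTTTCCCG
EOF

# Start with empty output files, in case awk writes nothing to one of them.
: > shared; : > unique1; : > unique2

awk '
FNR % 2 { id = substr($0, 2); next }   # odd line: header, remember ID without ">"
FNR == NR {                            # even line of file1: map sequence -> ID
    id1[$0] = id
    order[++n] = $0                    # keep file1 order for the END block
    next
}
($0 in id1) {                          # even line of file2, sequence also in file1
    print ">File 1 " id1[$0] "; File 2 " id > "shared"
    print $0 > "shared"
    seen[$0] = 1
    next
}
{                                      # even line of file2, sequence only in file2
    print ">" id > "unique2"
    print $0 > "unique2"
}
END {                                  # file1 sequences never matched by file2
    for (i = 1; i <= n; i++)
        if (!(order[i] in seen)) {
            print ">" id1[order[i]] > "unique1"
            print order[i] > "unique1"
        }
}
' file1 file2

cat shared unique1 unique2
```

The trade-off is memory: all sequences of file1 are held in an awk array, which should be fine for a few hundred records but is worth keeping in mind for very large files.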


Last edited by frans; 09-07-2010 at 05:49 PM.. Reason: removed -B1 option in the second loop (not necessary) and added -q
# 3  
Old 09-07-2010
Hi.

Using some standard utilities and an awk script:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate sequence matching.

# Utility functions: print-as-echo, print-line-with-visual-space.
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }

FILE1=data1
FILE2=data2

# Remove debris from previous runs.
rm -f t1 t2 t3 shared uniq_1 uniq_2

# Paste lines together in both files, add source file indicator.
paste - - < $FILE1 |
sed "s/^/1 /" > t1
paste - - < $FILE2 |
sed "s/^/2 /" > t2

# Join the files based on sequence.
join -a 1 -a 2 -j 3 t1 t2 > t3

# Comb and distribute results.
awk '
NF > 3	{ print > "shared" ; next }
$2 == 1	{ print > "uniq_1" ; next }
$2 == 2	{ print > "uniq_2" ; next }
' t3

pl " Shared sequences:"
cat shared

pl " Sequences unique to file 1:"
cat uniq_1

pl " Sequences unique to file 2:"
cat uniq_2

exit 0

producing:
Code:
% ./s1

-----
 Shared sequences:
TTGATGTCGTGGGTTTCCCG 1 >Pat1_s60_c01_T400_s30_c08_10_3845 2 >Pat1_s60_c01_T400_s30_c08_1_3845
TTGATGTGCCAGTTTCCCGT 1 >Pat1_s60_c01_T400_s30_c08_23_28 2 >Pat1_s60_c01_T400_s30_c08_3_28

-----
 Sequences unique to file 1:
TAGATGTGCCCGTGGGTTTC 1 >Pat1_s60_c01_T400_s30_c08_0_1610
TTGATGTGTCCCGTCGACAC 1 >Pat1_s60_c01_T400_s30_c08_33_2588

-----
 Sequences unique to file 2:
TAGATGTGCCCGTGGGTTAC 2 >Pat1_s60_c01_T400_s30_c08_0_1610
TTGATGTGTCCCGTCGACGC 2 >Pat1_s60_c01_T400_s30_c08_53_2588
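
The listing above has one line per sequence, while the original request was for two-line records with both IDs in the header. If that layout is needed, the join output could be converted along these lines. This is a sketch only: the output names shared.fa and uniq_1.fa are made up for illustration, and the two sample lines are copied from the output above so that the snippet runs on its own (it overwrites shared and uniq_1 in the current directory).

```shell
#!/usr/bin/env bash
# Sketch: turn the one-line join output back into two-line records.
# shared.fa and uniq_1.fa are hypothetical output names.

# Sample lines in the format produced by the script above.
printf '%s\n' 'TTGATGTCGTGGGTTTCCCG 1 >Pat1_s60_c01_T400_s30_c08_10_3845 2 >Pat1_s60_c01_T400_s30_c08_1_3845' > shared
printf '%s\n' 'TAGATGTGCCCGTGGGTTTC 1 >Pat1_s60_c01_T400_s30_c08_0_1610' > uniq_1

# shared: fields are SEQ, 1, >ID1, 2, >ID2; substr(..., 2) drops the leading ">".
awk '{ print ">File 1 " substr($3, 2) "; File 2 " substr($5, 2); print $1 }' shared > shared.fa

# unique: fields are SEQ, file number, >ID; print the ID line, then the sequence.
awk '{ print $3; print $1 }' uniq_1 > uniq_1.fa

cat shared.fa uniq_1.fa
```

The same second awk command should work for uniq_2, since it has the same three fields.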

For a production run, if the input files are not already sorted on the sequence, they will need to be: join requires both inputs to be sorted on the join field.
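
One way to meet that requirement is to sort each intermediate file on its third field (the sequence) before the join. The following is a minimal sketch, not a tested production script: the sample records are taken from the question but deliberately placed out of sequence order, and setting LC_ALL=C is an assumption made here so that sort and join use the same collating order. It overwrites data1, data2, t1, t2 and t3 in the current directory.

```shell
#!/usr/bin/env bash
# Sketch: sort on the sequence field before joining.
export LC_ALL=C   # assumption: force one collation so sort and join agree

# Sample inputs, intentionally not sorted by sequence.
cat > data1 <<'EOF'
>Pat1_s60_c01_T400_s30_c08_33_2588
TTGATGTGTCCCGTCGACAC
>Pat1_s60_c01_T400_s30_c08_10_3845
TTGATGTCGTGGGTTTCCCG
EOF
cat > data2 <<'EOF'
>Pat1_s60_c01_T400_s30_c08_53_2588
TTGATGTGTCCCGTCGACGC
>Pat1_s60_c01_T400_s30_c08_1_3845
TTGATGTCGTGGGTTTCCCG
EOF

# Paste header and sequence onto one line, tag with the file number,
# then sort on field 3 (b: ignore the blanks in front of the key).
for n in 1 2; do
    paste - - < "data$n" | sed "s/^/$n /" | sort -k 3b,3 > "t$n"
done

# Same join as in the script above, now on sorted input.
join -a 1 -a 2 -j 3 t1 t2 > t3
cat t3
```

The remaining awk step of the script can then be applied to t3 unchanged.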

Best wishes ... cheers, drl

Last edited by drl; 09-07-2010 at 07:23 PM..