Post 302969875 by Xterra on Tuesday 29th of March 2016 01:23:36 PM
Matching string and assembling

I have been thinking about how to approach this task, but it is well beyond my knowledge.
I have a reference sequence, something like this:
Code:
>Reference
AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA

and an input.txt file with thousands of short sequences, something like this:
Code:
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
ACTCTCTCTCCAGGTCTCTCT
>read6 ori-rev-1-misma 499
ACTCTCTCTCCAGGTCTCTCC
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA
>read11 nomatch1 500
GGGGGGAAAAAGCGTGCGT
>read12 nomatch2 500
CCCCGGGATGACGATGACGATGACGATGACGATGAC
>read13 nomatch3 550
GGGGTGCGAAAAAACCCCCGGGGTGG
>read14 nomatch4 543
TTTTTTTTTTAAAAAGCCGCGCTTTTTTT
>read15 nomatch5 543
TTTTTTTTTTAAAAAGCCGCGCTTTTTAA
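
For anyone picking this up, the input above is plain FASTA-style text: a header line starting with ">" followed by the sequence. A minimal sketch of reading it into (header, sequence) pairs could look like the following. Python is used only for illustration (the thread itself asks about Perl), and the function name `read_fasta` is a hypothetical helper, not something from the original post.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may span several lines; upper-case for comparison.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Two records taken from the sample input, held in memory for the demo.
sample = io.StringIO(
    ">read1 ori 498\nAGAGAGACCTGGAGAGAGAGT\n"
    ">read5 ori-rev 532\nACTCTCTCTCCAGGTCTCTCT\n"
)
records = list(read_fasta(sample))
```

With a real file, `open("input.txt")` would take the place of the in-memory `sample` handle.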

The output file should contain the following:
1. All sequences that match the reference sequence 100% (in my example, sequences 1, 2, 7, 8 and 9).
2. If a sequence does not match the reference, it should be reversed and complemented (A=>T; T=>A; C=>G; G=>C) and compared against the reference a second time. If it then matches, it should be included in the output file as the reversed/complemented sequence (sequence 5).
3. All sequences containing 1 or 2 mismatches should be included without changes (sequences 3 and 4).
4. All sequences that contain 1 or 2 mismatches after being reversed and complemented should also be included, as reversed/complemented sequences (sequence 6).
5. All sequences missing 1 character relative to the reference (sequence 10).
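
The five rules above can be sketched roughly as follows. This is only one possible reading of them, written in Python for illustration: the rules are tried in the order listed, a "match" is taken to mean that the read lies fully inside the reference, and rule 5 is checked on the forward strand only. The function names and labels (`classify`, `forward-mismatch`, and so on) are hypothetical.

```python
from typing import Optional, Tuple

REFERENCE = "AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Complement each base (A<->T, C<->G), then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def min_mismatches(read: str, ref: str) -> Optional[int]:
    """Fewest mismatches over all ungapped placements of read inside ref."""
    n = len(read)
    if n == 0 or n > len(ref):
        return None
    return min(
        sum(a != b for a, b in zip(read, ref[i:i + n]))
        for i in range(len(ref) - n + 1)
    )


def has_single_deletion(read: str, ref: str) -> bool:
    """True if read equals some ref window of len(read)+1 minus one base."""
    n = len(read)
    for i in range(len(ref) - n):
        window = ref[i:i + n + 1]
        p = 0  # length of the common prefix
        while p < n and read[p] == window[p]:
            p += 1
        s = 0  # length of the common suffix
        while s < n and read[n - 1 - s] == window[n - s]:
            s += 1
        if p + s >= n:
            return True
    return False


def classify(read: str, ref: str = REFERENCE,
             max_mm: int = 2) -> Optional[Tuple[str, str]]:
    """Return (label, sequence to write out), or None to drop the read."""
    if not read:
        return None
    rc = revcomp(read)
    if read in ref:                       # rule 1
        return "forward-exact", read
    if rc in ref:                         # rule 2
        return "reverse-exact", rc
    for label, seq in (("forward-mismatch", read),   # rule 3
                       ("reverse-mismatch", rc)):    # rule 4
        mm = min_mismatches(seq, ref)
        if mm is not None and mm <= max_mm:
            return label, seq
    if has_single_deletion(read, ref):    # rule 5
        return "forward-deletion", read
    return None
```

One thing this sketch makes visible: read10 is already kept by rule 3, because dropping the "C" next to the last base looks like a single mismatch at the final position when the read is placed without a gap. `has_single_deletion` still recognises it as a one-base gap if it is called directly.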

This results in the following outfile:
Code:
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
AGAGAGACCTGGAGAGAGAGT
>read6 ori-rev-1-misma 499
GGAGAGACCTGGAGAGAGAGT
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA

The second outfile should be based on the first outfile. Here, I would like to assemble all sequences into a single sequence by overlapping the matching portions, and name the new reference after the input file. An "N" should be inserted at any variable position (one where the sequences disagree):
Code:
>input
NGAGAGACCTGGAGAGAGAGNGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA
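
A rough sketch of this assembly step, again in Python and under several assumptions that are not in the original post: each kept read is placed on the reference coordinates (exact placement first, then a placement with one missing base, then the placement with the fewest mismatches, up to 2), the bases seen at every column are collected, and a column is written as "N" when the reads disagree there or when no read covers it. The names `place` and `assemble` are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Optional, Tuple

REFERENCE = "AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA"


def place(read: str, ref: str,
          max_mm: int = 2) -> Optional[List[Tuple[int, str]]]:
    """Return (reference column, base) pairs for one read, or None."""
    n = len(read)
    start = ref.find(read)
    if start >= 0:  # exact placement
        return [(start + k, b) for k, b in enumerate(read)]
    for i in range(len(ref) - n):  # read lacks exactly one reference base
        window = ref[i:i + n + 1]
        p = 0
        while p < n and read[p] == window[p]:
            p += 1
        s = 0
        while s < n and read[n - 1 - s] == window[n - s]:
            s += 1
        if p + s >= n:
            # Column i + p is the skipped (missing) reference position.
            left = [(i + k, b) for k, b in enumerate(read[:p])]
            right = [(i + p + 1 + k, b) for k, b in enumerate(read[p:])]
            return left + right
    best = None  # (mismatch count, offset) of the best ungapped placement
    for i in range(len(ref) - n + 1):
        mm = sum(a != b for a, b in zip(read, ref[i:i + n]))
        if mm <= max_mm and (best is None or mm < best[0]):
            best = (mm, i)
    if best is None:
        return None
    return [(best[1] + k, b) for k, b in enumerate(read)]


def assemble(reads: Iterable[str], ref: str = REFERENCE) -> str:
    """Column-wise consensus; 'N' for disagreement or no coverage."""
    columns = [set() for _ in ref]
    for read in reads:
        placed = place(read, ref)
        if placed is None:
            continue
        for col, base in placed:
            columns[col].add(base)
    return "".join(next(iter(c)) if len(c) == 1 else "N" for c in columns)


# Sequences from the first outfile, already in reference orientation.
kept = [
    "AGAGAGACCTGGAGAGAGAGT", "AGAGAGACCTGGAGAGAGAGT",
    "GGAGAGACCTGGAGAGAGAGT", "TGAGAGACCTGGAGAGAGAGA",
    "AGAGAGACCTGGAGAGAGAGT", "GGAGAGACCTGGAGAGAGAGT",
    "AGAGAGAGTGACGATGAGCAG",
    "AGTGACGATGACGTACGATAGCAGTAGACGCA",
    "AGTGACGATGACGTACGATAGCAGTAGACGCA",
    "AGTGACGATGACGTACGATAGCAGTAGACGA",
]
header = ">" + Path("input.txt").stem
consensus = assemble(kept)
print(header)
print(consensus)
```

Placing the one-gap read before falling back to mismatches is a deliberate choice here: placed without a gap, read10 would disagree with reads 8 and 9 at the second-to-last column and force an extra "N" that the expected output above does not have.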

I know Perl will probably be the best way to go. However, my understanding of Perl is quite limited, and I do not think AWK would be the best way to solve this task.
Any help will be greatly appreciated.
 
