Clipping -awk

07-02-2015

I have this infile:

Code:

>Sample23
acagtagaca-atagacgatggagagacatgaggcccaaaattt
>Sample-123
--agtagacagatagacgatggagagacatgaggcccaattt--
>Sample23 Freq 45
acagtagacagatagacgat-gagagacatgaggcccaaaattt
>Reference
---------agatagacgatggagagacatgaggccca------
>Sample__23
acagtagacagatagacgatggagagacatgaggcccaaaa---

I need to trim all sequences based on the Reference entry. This is the desired outfile:

Code:

>Sample23
a-atagacgatggagagacatgaggccca
>Sample-123
agatagacgatggagagacatgaggccca
>Sample23 Freq 45
agatagacgat-gagagacatgaggccca
>Reference
agatagacgatggagagacatgaggccca
>Sample__23
agatagacgatggagagacatgaggccca

I am able to trim the right side but not the left.
Script:

Code:

awk 'NR==FNR{if($0 ~ /Reference/){getline; gsub("-","");x=length;}next;}{print substr($0,1,x);}' infile infile

I would like to modify this script so I can understand what I am doing wrong.
Thanks!

Xterra

07-02-2015

I could make a lot of wild guesses about what "trim all sequences based on the Reference entry" means based on a sample size of one sample. Instead of all of us making a lot of wild guesses, please explain to us in English what in a Reference entry determines what bits, or bytes, or characters in other entries are supposed to be trimmed.

Don Cragun

07-02-2015

In my files, the name following the > is the identifier. As you can see in my example above, the fourth entry is called Reference; that is the reference in the alignment. I need to trim all sequences in the alignment based on the length of that particular entry.

Xterra

07-02-2015

The length of your reference sample is the same as the length of all of your other samples. So, with the reference sample:

Code:

>Reference
-ag---atagac---gatgg--agaga-catga--ggccca---

you strip out all of the hyphens and say that you want to trim the leading and trailing characters off of the other samples. How do you determine how many characters are to be stripped off of the start of the other samples and how do you determine how many characters are to be stripped off of the end of the other samples?

Is it really so hard to state in English exactly what is supposed to be done to change your sample input into your desired sample output?

Please help us help you! Please stop making us guess at what is supposed to be done! Please, give us a clear specification of what you are trying to do!

Don Cragun

07-02-2015

Don
I apologize. In my defense, English is not my mother tongue; that is not an excuse, just an explanation.
You see, the input file is a DNA alignment. The alignment tool (mafft) aligns each nucleotide (ATGC) across all sequences (a multiple alignment). If one of the sequences is shorter, like the Reference in my example above, the alignment tool adds hyphens at the ends. I need to find the Reference sequence and scan it until I find the first nucleotide, in this case "a", and cut all sequences at that nucleotide position. Then I must scan the Reference sequence until I find the last nucleotide, in this case another "a", and cut all sequences at that very same position. Thus, all sequences will have the same length. Sometimes the DNA sequences have mismatches at the ends, making it nearly impossible to trim them based on a nucleotide pattern; mafft takes care of that, since mutations do not disturb the alignment.
I hope this explains better what I am trying to accomplish.
Thanks!
PS. The reference does not contain gaps (hyphens) in between nucleotides. Thus, the strings of hyphens in the flanks mark the beginning and end of the reference sequence.
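
The rule described above comes down to two numbers: the column where the Reference nucleotides start and how many of them there are. A minimal sketch of finding those two numbers, assuming one sequence line per header and a header that is exactly ">Reference" (the function name ref_span is made up for illustration):

```shell
# Hypothetical helper: print the start column and the length of the run of
# nucleotides on the line that follows the ">Reference" header.
ref_span() {
    awk '
    found {                       # this is the line right after ">Reference"
        match($0, /[^-]+/)        # first run of non-hyphen characters
        print RSTART, RLENGTH     # 1-based start column and run length
        exit
    }
    $1 == ">Reference" { found = 1 }
    ' "$1"
}
```

Run as ref_span infile on the sample above, this should print 10 29: the nucleotides start in column 10 and run for 29 characters. Those are the values to hand to substr() for every other sequence.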

Xterra

07-03-2015

You might try something like:

Code:

awk '
reffound == 1 {			# We are processing the reference data line...
	match($1, /[^-]+/)	# Find the string of non-hyphens and set the
				# RSTART and RLENGTH variables.
	reffound = 0		# Clear the Reference found flag.
	nextfile		# If your version of awk contains this function,
				# your script will run faster; if it does not
				# support the nextfile function, comment it out
				# or remove these four lines.
}
FNR == NR {			# We are reading the file for the first time...
	if($1 == ">Reference")	# Look for the Reference header...
		reffound = 1	# Found it; set the Reference found flag.
	next			# Skip remaining lines in this script and resume
				# processing with the next input line at the top
				# of the script.
}
				# We are reading the file the 2nd time now...
FNR % 2				# Print header lines (odd line numbers).
(FNR + 1) % 2 {			# Print selected part of data (even) lines.
	print substr($1, RSTART, RLENGTH)
}' infile infile

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.
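
If reading the file twice is not convenient (for example when the data comes from a pipe), the same idea can be sketched in a single pass by holding the lines in memory. This is only an alternative sketch, not the script above: it assumes strictly alternating header and sequence lines, a file small enough to keep in memory, and a made-up function name trim_to_reference.

```shell
# Hypothetical single-pass variant: store all lines, find the Reference
# span while reading, then print the trimmed alignment at the end.
trim_to_reference() {
    awk '
    { line[NR] = $0 }               # remember every input line
    prev == ">Reference" {          # line right after the Reference header
        match($0, /[^-]+/)          # run of nucleotides between the gap flanks
        start = RSTART
        len = RLENGTH
    }
    { prev = $1 }                   # first field of the line just read
    END {
        for (i = 1; i <= NR; i++) {
            if (i % 2)              # odd lines are headers: print unchanged
                print line[i]
            else                    # even lines are sequences: print the slice
                print substr(line[i], start, len)
        }
    }' "$1"
}
```

Called as trim_to_reference infile, it should give the desired outfile from the first post. If no ">Reference" header is present, start and len stay unset and every sequence line comes out empty, so check for that case before relying on the output.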

Don Cragun

07-04-2015

Since FASTA format (https://en.wikipedia.org/wiki/FASTA_format) records are delimited by ">", the header is a single line, and the sequence data can span multiple lines, it may be better to take this into account:

Code:

awk '
  {
    h=$1                                        # set header variable to $1 in the ">" terminated record
    sub(h FS,x)                                 # delete the header and the first newline
                                                # what remains is the sequence data
  } 
  NR==FNR {                                     # if the input file is read for the first time
    if(h=="Reference") {                        # if the reference is found
      match($0, /[^-]+/)                        # determine the relative start and length of the reference sequence
    }
    next                                        # process the next line 
  } 
  FNR>1 {                                       # When the file is read the second time, ignore the first empty record
    print RS h FS substr($0, RSTART, RLENGTH)   # Print RS, the header, a newline and the 
                                                # sequence, trimmed to the position and width from the reference
  }
' RS=\> ORS='\n' FS='\n' infile infile          # read the input twice and set RS to ">"
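
When the sequence data is wrapped over several lines, the newline characters inside a record count as positions in the approach above. One optional way around that, sketched here under the assumption that every header line starts with ">" and with a made-up function name unwrap_fasta, is to join each sequence into a single line first and then trim:

```shell
# Hypothetical pre-processing step: join wrapped sequence lines so that
# each record becomes one header line followed by one sequence line.
unwrap_fasta() {
    awk '
    /^>/ {                          # a header starts a new record
        if (seq != "") print seq    # flush the previous sequence, if any
        seq = ""
        print                       # print the header unchanged
        next
    }
    { seq = seq $0 }                # append this piece of the sequence
    END { if (seq != "") print seq }
    ' "$1"
}
```

The result can be written to a temporary file and passed to any of the scripts in this thread that expect one sequence line per header.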

Last edited by Scrutinizer; 07-04-2015 at 07:29 PM..

Scrutinizer
