Clipping -awk


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Clipping -awk
# 1  
Old 07-02-2015
Clipping -awk

I have this infile:
Code:
>Sample23
acagtagaca-atagacgatggagagacatgaggcccaaaattt
>Sample-123
--agtagacagatagacgatggagagacatgaggcccaattt--
>Sample23 Freq 45
acagtagacagatagacgat-gagagacatgaggcccaaaattt
>Reference
---------agatagacgatggagagacatgaggccca------
>Sample__23
acagtagacagatagacgatggagagacatgaggcccaaaa---

I need to trim all sequences based on the Reference entry. This is the desire outfile:
Code:
>Sample23
a-atagacgatggagagacatgaggccca
>Sample-123
agatagacgatggagagacatgaggccca
>Sample23 Freq 45
agatagacgat-gagagacatgaggccca
>Reference
agatagacgatggagagacatgaggccca
>Sample__23
agatagacgatggagagacatgaggccca

I am able to trim the right side but not the left
Script
Code:
awk 'NR==FNR{if($0 ~ /Reference/){getline; gsub("-","");x=length;}next;}{print substr($0,1,x);}' infile infile

I would like to modify this script so I can understand what I am doing wrong
Thanks!
# 2  
Old 07-02-2015
I could make a lot of wild guesses about what "trim all sequences based on the Reference entry" means based on a sample size of one sample. Instead of all of us making a lot of wild guesses, please explain to us in English what in a Reference entry determines what bits, or bytes, or characters in other entries are supposed to be trimmed.
# 3  
Old 07-02-2015
In my files, the name followed by > is the identifier. If you see in my example above, the forth entry is called Reference -that's my reference is the alignment. I need to trim all sequences in the alignment based on the length of that particular entry.
# 4  
Old 07-02-2015
Quote:
Originally Posted by Xterra
In my files, the name followed by > is the identifier. If you see in my example above, the forth entry is called Reference -that's my reference is the alignment. I need to trim all sequences in the alignment based on the length of that particular entry.
The length of your reference sample is the same as the length of all of your other samples. So, with the reference sample:
Code:
>Reference
-ag---atagac---gatgg--agaga-catga--ggccca---

you strip out all of the hyphens and say that you want to trim the leading and trailing characters off of the other samples. How do you determine how many characters are to be stripped off of the start of the other samples and how do you determine how many characters are to be stripped off of the end of the other samples?

Is it really so hard to state in English exactly what is supposed to be done to change your sample input into your desired sample output?

Please help us help you! Please stop making us guess at what is supposed to be done! Please, give us a clear specification of what you are trying to do!
# 5  
Old 07-02-2015
Don
I apologize -in my defense, English is not my mother tongue, not an excuse just and explanation.
You see, the input file is a DNA alignment. The aligment tool (mafft), aligns each nucleotide (ATGC) in all sequences (multiple alignment). If one of the sequences is shorter, like the Reference in my example above, the alignment tool adds hyphens at the ends. I need to find the References sequence and scan it till I find the first nucleotide, in this case "a", and cut all sequences at that nucleotide position. Then, I must scan the Reference sequence till I find the last nucleotide, in this case another "a", and cut all sequences at that very same position. Thus, all sequences will have the same length. Sometimes, the DNA sequences have mismatches at the ends maki g nearly impossible to trim them based on a nucleotide pattern -mafft takes care of that since mutations do not disturbe the alignment.
I hope this explains better what I am trying to accomplish
Thanks!
PS. The reference does not contain gaps (hyphens) in between nucleotides. Thus, the string of hyphens in the flanks mark the beggining and end of the reference sequence
# 6  
Old 07-03-2015
You might try something like:
Code:
awk '
reffound == 1 {			# We are processing the reference data line...
	match($1, /[^-]+/)	# Find the string of non-hyphens and set the
				# RSTART and RLENGTH variables.
	reffound = 0		# Clear the Reference found flag.
	nextfile		# If your version of awk contains this function,
				# your script will run faster; if it does not
				# support the nextfile function, comment it out
				# or remove these four lines.
}
FNR == NR {			# We are reading the file for the first time...
	if($1 == ">Reference")	# Look for the Reference header...
		reffound = 1	# Found it; set the Reference found flag.
	next			# Skip remaining lines in this script and resume
				# processing with the next input line at the top
				# of the script.
}
				# We are reading the file the 2nd time now...
FNR % 2				# Print header lines (odd line numbers).
(FNR + 1) % 2 {			# Print selected part of data (even) lines.
	print substr($1, RSTART, RLENGTH)
}' infile infile

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.
# 7  
Old 07-04-2015
Since https://en.wikipedia.org/wiki/FASTA_format records are > terminated and the header data is one line and the sequence data can be multiple lines, it may be better to take this into account:

Code:
awk '
  {
    h=$1                                        # set header variable to $1 in the ">" terminated record
    sub(h FS,x)                                 # delete the header and the first newline
                                                # what remains is the sequence data
  } 
  NR==FNR {                                     # if the input file is read for the first time
    if(h=="Reference") {                        # if the reference is found
      match($0, /[^-]+/)                        # determine the relative start and length of the reference sequence
    }
    next                                        # process the next line 
  } 
  FNR>1 {                                       # When the file is read the second line, ignore the first empty record
    print RS h FS substr($0, RSTART, RLENGTH)   # Print RS, the header, a newline and the 
                                                # sequence, trimmed to the position and width from the reference
  }
' RS=\> ORS='\n' FS='\n' infile infile          # read the input twice and set RS to ">"


Last edited by Scrutinizer; 07-04-2015 at 07:29 PM..
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk output yields error: awk:can't open job_name (Autosys)

Good evening, Im newbie at unix specially with awk From an scheduler program called Autosys i want to extract some data reading an inputfile that comprises jobs names, then formating the output to columns for example 1. This is the inputfile: $ more MapaRep.txt ds_extra_nikira_usuarios... (18 Replies)
Discussion started by: alexcol
18 Replies

2. Shell Programming and Scripting

Passing awk variable argument to a script which is being called inside awk

consider the script below sh /opt/hqe/hqapi1-client-5.0.0/bin/hqapi.sh alert list --host=localhost --port=7443 --user=hqadmin --password=hqadmin --secure=true >/tmp/alerts.xml awk -F'' '{for(i=1;i<=NF;i++){ if($i=="Alert id") { if(id!="") if(dt!=""){ cmd="sh someScript.sh... (2 Replies)
Discussion started by: vivek d r
2 Replies

3. Shell Programming and Scripting

HELP with AWK one-liner. Need to employ an If condition inside AWK to check for array variable ?

Hello experts, I'm stuck with this script for three days now. Here's what i need. I need to split a large delimited (,) file into 2 files based on the value present in the last field. Samp: Something.csv bca,adc,asdf,123,12C bca,adc,asdf,123,13C def,adc,asdf,123,12A I need this split... (6 Replies)
Discussion started by: shell_boy23
6 Replies

4. Shell Programming and Scripting

awk command to compare a file with set of files in a directory using 'awk'

Hi, I have a situation to compare one file, say file1.txt with a set of files in directory.The directory contains more than 100 files. To be more precise, the requirement is to compare the first field of file1.txt with the first field in all the files in the directory.The files in the... (10 Replies)
Discussion started by: anandek
10 Replies

5. Shell Programming and Scripting

Comparison and editing of files using awk.(And also a possible bug in awk for loop?)

I have two files which I would like to compare and then manipulate in a way. File1: pictures.txt 1.1 1.3 dance.txt 1.2 1.4 treehouse.txt 1.3 1.5 File2: pictures.txt 1.5 ref2313 1.4 ref2345 1.3 ref5432 1.2 ref4244 dance.txt 1.6 ref2342 1.5 ref2352 1.4 ref0695 1.3 ref5738 1.2... (1 Reply)
Discussion started by: linuxkid
1 Replies

6. Shell Programming and Scripting

Problem with awk awk: program limit exceeded: sprintf buffer size=1020

Hi I have many problems with a script. I have a script that formats a text file but always prints the same error when i try to execute it The code is that: { if (NF==17){ print $0 }else{ fields=NF; all=$0; while... (2 Replies)
Discussion started by: fate
2 Replies

7. Shell Programming and Scripting

Read content between xml tags with awk, grep, awk or what ever...

Hello, I trying to extract text that is surrounded by xml-tags. I tried this cat tst.xml | egrep "<SERVER>.*</SERVER>" |sed -e "s/<SERVER>\(.*\)<\/SERVER>/\1/"|tr "|" " " which works perfect, if the start-tag and the end-tag are in the same line, e.g.: <tag1>Hello Linux-Users</tag1> ... (5 Replies)
Discussion started by: Sebi0815
5 Replies

8. Shell Programming and Scripting

awk: assign variable with -v didn't work in awk filter

I want to filter 2nd column = 2 using awk $ cat t 1 2 2 4 $ VAR=2 #variable worked in print $ cat t | awk -v ID=$VAR ' { print ID}' 2 2 # but variable didn't work in awk filter $ cat t | awk -v ID=$VAR '$2~/ID/ { print $0}' (2 Replies)
Discussion started by: honglus
2 Replies

9. Shell Programming and Scripting

scripting/awk help : awk sum output is not comming in regular format. Pls advise.

Hi Experts, I am adding a column of numbers with awk , however not getting correct output: # awk '{sum+=$1} END {print sum}' datafile 2.15291e+06 How can I getthe output like : 2152910 Thank you.. # awk '{sum+=$1} END {print sum}' datafile 2.15079e+06 (3 Replies)
Discussion started by: rveri
3 Replies

10. Shell Programming and Scripting

Awk problem: How to express the single quote(') by using awk print function

Actually I got a list of file end with *.txt I want to use the same command apply to all the *.txt Thus I try to find out the fastest way to write those same command in a script and then want to let them run automatics. For example: I got the file below: file1.txt file2.txt file3.txt... (4 Replies)
Discussion started by: patrick87
4 Replies
Login or Register to Ask a Question