Count and search by sequence in multiple FASTA files

03-10-2014

[Solved] Count and search by sequence in multiple FASTA files

Hello,

I have 10 FASTA files of sequenced reads, with read lengths from 15 to 35 bp. I combined the reads, collapsed them into unique reads, and kept only the unique reads that are 18 to 26 bp long. Now I want to count how often each unique read appears in each of the FASTA files and build a table with sample names as columns and reads as rows. I tried counting the tags with grep -w "sequence name" "file name", but this takes a long time. Does anyone know how to do this faster?

empyrean

03-10-2014

Post example input and example output of what you want.

spacebar

03-10-2014

Sorry for the confusion. Here are the input files and the required output.

This is the input file containing the unique sequences. I have more than a million such unique sequences.
Query:

Code:

>tag1
TCGGA
>tag2
TCTCA
>tag3
TCTCGC

These are the sample files. I am showing 3 files as an example; I have more than 20 such files, and each one contains more than 10 million sequences.
File1:

Code:

>file1_id1
TCGGA
>file1_id1
TCGGAT
>file1_id2
TCTCA
>file1_id3
TCTCA

File2:

Code:

>file2_id1
TCTCA
>file2_id2
TCTCA
>file2_id3
TCTCACTA
>file2_id4
TCTCGC
>file2_id5
TCTCGCCTAT
>file2_id6
TCTCGC

File3:

Code:

>file1_id1
TCGGA
>file1_id1
TCGGAT
>file2_id4
TCTCGC
>file2_id5
TCTCGCCTAT
>file2_id6
TCTCGC

I need the following output. The counts must be based on exact matches only.
Output:

Code:

		sequence	file1	file2	file3
tag1	TCGGA		1		0		1
tag2	TCTCA		2		2		0
tag3	TCTCGC		0		2		2
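
To make the "exact" requirement concrete: in the sample data, the query TCGGA must not be counted for the read TCGGAT. The following is a minimal sketch, assuming a POSIX grep, that contrasts a substring match with a whole-line fixed-string match (`-x -F`). The temporary file is only a stand-in for File1 and is not part of the original post.

```shell
#!/bin/sh
# Sketch only: the temporary file stands in for File1 from the post.
f=$(mktemp)
printf '%s\n' '>file1_id1' TCGGA '>file1_id1' TCGGAT \
	'>file1_id2' TCTCA '>file1_id3' TCTCA > "$f"

# Substring match: also counts the read TCGGAT, which is not wanted.
loose=$(grep -c -F -- TCGGA "$f")

# Whole-line fixed-string match: counts only reads equal to TCGGA.
exact=$(grep -c -x -F -- TCGGA "$f")

echo "loose=$loose exact=$exact"
rm -f "$f"
```

Note that running one grep per query sequence per file re-reads every file for every sequence, which is why this style of counting becomes slow with a million query sequences.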

Moderator's Comments:

Please use CODE tags to mark sample code, sample input, and sample output; not your entire message.

Last edited by Don Cragun; 03-11-2014 at 03:17 AM. Reason: Fix CODE tags.

empyrean

03-11-2014

Without seeing your code, it is impossible to guess at whether or not something else is faster. You could try something like:

Code:

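awk '
# First file (Query): remember each tag name and its sequence.
FNR == NR {
	if($1 ~ /^>/)
		t[++tn] = substr($1, 2)	# tag name without the leading ">"
	else {
		s[$1] = tn		# sequence -> tag index
		seq[tn] = $1		# tag index -> sequence
	}
	next
}
# First line of each remaining file: advance the file (column) counter.
FNR == 1 {
	fn++
}
# Line equal to a query sequence: bump the count for that tag in this file.
$1 in s {
	tc[s[$1], fn]++
}
END {	# Print header:
	printf("\t\tsequence")
	for(i = 1; i <= fn; i++)
		printf("\tfile%d", i)
	printf("\n")
	# Print tag data:
	for(i = 1; i <= tn; i++) {
		printf("%s\t%s", t[i], seq[i])
		for(j = 1; j <= fn; j++)
			printf("\t\t%d", tc[i, j])
		printf("\n")
	}
}' Query File1 File2 File3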

which with your sample input files produces:

Code:

		sequence	file1	file2	file3
tag1	TCGGA		1		0		1
tag2	TCTCA		2		2		0
tag3	TCTCGC		0		2		2
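
For anyone who wants to try the idea on the sample data, here is a self-contained sketch of the same single-pass lookup. It is a variant, not the code posted above: it labels the columns with the real file names through awk's `FILENAME` variable instead of file1..fileN, and it uses single tabs between columns. The function name `count_tags` and the temporary directory are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: count_tags and the temporary directory are illustrative names.
count_tags() {
	# $1 is the query FASTA file; the remaining arguments are sample files.
	awk '
	FNR == NR {				# first file: load the query
		if ($1 ~ /^>/) t[++tn] = substr($1, 2)
		else { s[$1] = tn; seq[tn] = $1 }
		next
	}
	FNR == 1 { name[++fn] = FILENAME }	# new sample file: new column
	$1 in s { tc[s[$1], fn]++ }		# first field equals a query sequence
	END {
		printf("tag\tsequence")
		for (j = 1; j <= fn; j++) printf("\t%s", name[j])
		printf("\n")
		for (i = 1; i <= tn; i++) {
			printf("%s\t%s", t[i], seq[i])
			for (j = 1; j <= fn; j++) printf("\t%d", tc[i, j])
			printf("\n")
		}
	}' "$@"
}

# Recreate the sample input from the post in a scratch directory.
dir=$(mktemp -d) || exit 1
printf '%s\n' '>tag1' TCGGA '>tag2' TCTCA '>tag3' TCTCGC > "$dir/Query"
printf '%s\n' '>file1_id1' TCGGA '>file1_id1' TCGGAT \
	'>file1_id2' TCTCA '>file1_id3' TCTCA > "$dir/File1"
printf '%s\n' '>file2_id1' TCTCA '>file2_id2' TCTCA '>file2_id3' TCTCACTA \
	'>file2_id4' TCTCGC '>file2_id5' TCTCGCCTAT '>file2_id6' TCTCGC > "$dir/File2"
printf '%s\n' '>file1_id1' TCGGA '>file1_id1' TCGGAT '>file2_id4' TCTCGC \
	'>file2_id5' TCTCGCCTAT '>file2_id6' TCTCGC > "$dir/File3"

(cd "$dir" && count_tags Query File1 File2 File3)
rm -rf "$dir"
```

Only the query sequences are held in memory; each sample file is read once, line by line. Like the original, this sketch assumes the query file is not empty. An empty sample file gets no column here, because the `FNR == 1` rule never fires for it.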

Don Cragun

03-11-2014

A bash approach:

Code:

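#!/bin/bash
# Requires bash 4 or later for associative arrays.

query=query.txt
declare -A pattern

# Collect the sample files, skipping the query file itself.
file_list=()
for file in *.txt
do
  [[ $file == "$query" ]] || file_list+=("$file")
done

# Print the header row while counting every sequence line of each sample file.
printf "%-10s\t%-10s\t" "" "Sequence"
for file in "${file_list[@]}"
do
  printf "%-10s\t" "$file"
  while read -r line
  do
    [[ $line == ">"* ]] && continue
    ((pattern[$file,$line]+=1))
  done < "$file"
done
echo

# Print one row per query tag: name, sequence, then the count in each file.
while read -r line
do
  [[ $line == ">"* ]] && printf "%-10s\t" "${line#>}" && continue
  printf "%-10s\t" "$line"
  for file in "${file_list[@]}"
  do
    printf "%-10s\t" "${pattern[$file,$line]:-0}"
  done
  echo
done < "$query"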

ahamed101

03-13-2014

Thank you, Don. This works really well. It is easy to run and easy to understand as well.

ahamed, thank you for your response, but Don's version was faster and finished my query in less time.

empyrean
