Searching a large file for short tandem repeats

09-03-2013

Hello,
I am searching large (~25 GB) DNA sequence data in FASTA short-read format:

Code:

>ReadName
ACGTACGTACGT...[150charactersPerRead]

for short tandem repeats, meaning any 2-6 character motif that is repeated in tandem at least a number of times given as an input variable. It seems like a reasonably simple job, but I'm having trouble developing a regex that will work. As a start, I have:

Code:

cat infile.fasta | awk --posix  '{STR="([ACGT]{2,6})" ; if (substr($0,40,(length()-40)) ~ STR) print}'

The substring constraints have to do with downstream requirements. But I'm having trouble expressing in the regex that I want repeats of one discrete motif (for example, 5 or more copies of the same 2-6 bases), not just ANY run of 2-6 bases, which obviously matches every read.

Any ideas would be great, thanks for the help!

ljk

09-18-2013

This seems more like a job for Perl or C: C for maximum speed, with the input file mmap()'d. A 64-bit OS and compile is nice because it avoids remapping. It seems to be a 40-byte sliding-window check for repeats of the initial 2-6 characters, after which the window moves forward by one. Do you just want counts? If ABC is reported as repeating, do you want AB and BC reported, too? You may get more output than input if it is not just aggregates. Perhaps one process/thread to search and another to aggregate?
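As a rough illustration of the mmap() idea, here is a minimal Python sketch (not the C or Perl version suggested above). It assumes each read occupies exactly one sequence line after its ">" header, as in the example in the first post; the function name `iter_reads` is made up for this sketch, and an empty input file is not handled.

```python
import mmap


def iter_reads(path):
    """Yield (name, sequence) pairs from a one-line-per-read FASTA file.

    The file is memory-mapped, so the OS pages data in on demand
    instead of the script loading the whole file at once.
    """
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        name = None
        # mm.readline() returns b"" once the end of the map is reached.
        for line in iter(mm.readline, b""):
            line = line.rstrip(b"\r\n")
            if line.startswith(b">"):
                name = line[1:].decode("ascii")   # header line
            elif name is not None and line:
                yield name, line.decode("ascii")  # sequence line
```

Each yielded sequence could then be handed to whatever search routine is chosen, with a separate step doing the aggregation.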

DGPickett

09-19-2013

I'm sorry, but I don't understand what the problem is.
What's the pattern you are looking for? Your STR variable will match everything...
Tandem means "exactly two"?

Please give way more details.
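To make the "matches everything" point concrete, here is a small Python sketch, assuming "tandem" means the same motif repeated back to back at least N times. POSIX awk regexes have no backreferences, so the sketch uses Python's `re` module; the helper name `str_pattern` and the two example reads are invented for illustration.

```python
import re


def str_pattern(min_repeats, min_len=2, max_len=6):
    """Compile a regex for one motif repeated in tandem min_repeats times or more."""
    # \1 must match exactly the text captured by group 1, so the copies
    # that follow the first one are forced to be the same motif.
    return re.compile(
        r"([ACGT]{%d,%d})\1{%d,}" % (min_len, max_len, min_repeats - 1)
    )


naive = re.compile(r"([ACGT]{2,6})")   # the STR from the first post
tandem = str_pattern(5)

read_without = "ACGTTGCAAGCTTAGGCATC"  # no tandem repeat
read_with = "TTGACACACACACGGT"         # AC repeated 5 times

print(bool(naive.search(read_without)))   # True: any 2-6 bases match
print(tandem.search(read_without))        # None
m = tandem.search(read_with)
print(m.group(1), m.start(), len(m.group(0)) // len(m.group(1)))  # AC 3 5
```

If the downstream restriction from the original `substr($0,40,...)` is needed, the search could probably be started further into the read with `tandem.search(read, 39)`, though the exact bounds are the poster's to confirm.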

RudiC

09-19-2013

In pseudocode, I think he wants something like:

Code:

(
read 40 bytes to buffer (ensure full window)
for len in 6 5 4 3 2
do
 compare buffer + 0 for len bytes to substrings: buffer + len through buffer end - len, tracking current total file offsets.
 if a hit,
 then
  list the hit "Pattern offset1 offset2"
  move the window up by len
  loop back to restart the 'for'
 fi
done
 
move window up by 1 for no hits
exit (or special EOF adjust code) if EOF prevents full window
loop back to restart 'for'
) | sort | tee detail_file | cut -d ' ' -f 1 | uniq -c > summary_file

By searching for the longest pattern and the nearest copy first, you avoid duplicate reports of substrings (AB within ABC) and of three copies within the window (for three copies, you get two detail records: first to second and second to third).

You need some minimally intrusive code to deal with end of file, or you can pad the file with 38 bytes that are not capital letters, perhaps using "(...;echo...)|" as input.

Managing the window without repeatedly sliding bytes in the buffer is a bit tricky. Use either an oversized buffer, so slides are less frequent, or mmap64() of the entire file (not pipe friendly; use a padded file or special end-of-file code?).
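The scan above might look roughly like this in Python. This is a sketch under stated assumptions, not a translation: it works on one read at a time rather than a fixed 40-byte window, it requires the copies to be adjacent, and it takes a `min_repeats` threshold (the input variable from the first post). The names `find_tandem_repeats` and `summarize` are made up here; `summarize` stands in for the `cut | uniq -c` step.

```python
from collections import Counter


def find_tandem_repeats(seq, min_repeats, lengths=(6, 5, 4, 3, 2)):
    """Return (motif, start, copies) for each tandem run found in seq."""
    hits = []
    pos = 0
    while pos < len(seq):
        for n in lengths:                    # longest motif first
            motif = seq[pos:pos + n]
            if len(motif) < n:               # not enough sequence left
                continue
            copies = 1
            # Count adjacent copies of the motif that follow it.
            while seq.startswith(motif, pos + copies * n):
                copies += 1
            if copies >= min_repeats:
                hits.append((motif, pos, copies))
                pos += copies * n            # skip past the whole run
                break
        else:
            pos += 1                         # no hit: slide forward by one
    return hits


def summarize(hits):
    """Count how many runs were reported per motif."""
    return Counter(motif for motif, _start, _copies in hits)


hits = find_tandem_repeats("TTGACACACACACGGT", 5)
print(hits)             # [('AC', 3, 5)]
print(summarize(hits))  # Counter({'AC': 1})
```

Because each read is handled whole, no end-of-file padding is needed in this variant; whether trying the longest motif first gives the reporting the poster wants is still an open question.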

DGPickett
