Post 302849839 by ljk on Tuesday 3rd of September 2013 12:26:56 PM
Searching a large file for short tandem repeats

Hello,
I am searching a large (~25 GB) file of DNA sequence data in FASTA short-read format:

Code:
>ReadName
ACGTACGTACGT...[150charactersPerRead]

for short tandem repeats, meaning any motif of 2-6 characters that is repeated back to back at least a given number of times, with that number supplied as an input variable. It seems like a reasonably simple job, but I'm having trouble developing a regex that works. As a start, I have:

Code:
awk --posix '{ STR = "([ACGT]{2,6})"; if (substr($0, 40, length() - 40) ~ STR) print }' infile.fasta

The substring bounds come from downstream requirements. My problem is expressing in the regex that I want repeats of one discrete motif (for example, 5 or more back-to-back copies of the same 2-6 base unit), not 5 or more consecutive groups of ANY 2-6 bases, which obviously matches every read.

Any ideas would be great, thanks for the help!
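
One possible direction, offered as a sketch under stated assumptions rather than a definitive answer: awk regular expressions have no backreferences, so a pattern such as ([ACGT]{2,6}) cannot require that the same motif comes back. The repeat can instead be detected by periodicity: a k-base unit repeated n times in tandem is a stretch where every base equals the base k positions earlier, for k*(n-1) positions in a row. The function name find_str and the helper has_str below are hypothetical, the window (column 40 up to the second-to-last base) is copied from the command above, and the sketch assumes uppercase sequence on a single line per read and n of at least 2.

```shell
#!/bin/sh
# find_str MIN_REPEATS [FILE...]   (hypothetical helper, not an existing tool)
# Prints each FASTA record (header line + sequence line) whose sequence,
# within the window starting at column 40, contains a 2-6 base unit
# repeated at least MIN_REPEATS times in tandem.
find_str() {
    min_repeats=$1
    shift
    awk -v min="$min_repeats" '
        # Returns 1 if s holds some unit of 2-6 bases repeated n or more
        # times back to back, otherwise 0. Names after the gap are locals.
        function has_str(s, n,    len, k, i, run, c) {
            len = length(s)
            for (k = 2; k <= 6; k++) {
                run = 0
                for (i = k + 1; i <= len; i++) {
                    c = substr(s, i, 1)
                    # Count consecutive positions that match the base k
                    # positions back; only A, C, G and T are counted.
                    if (c == substr(s, i - k, 1) && index("ACGT", c) > 0) {
                        if (++run >= k * (n - 1)) return 1
                    } else {
                        run = 0
                    }
                }
            }
            return 0
        }
        /^>/ { header = $0; next }   # remember the read name
        # Same window as the original command: column 40 to length-1.
        has_str(substr($0, 40, length($0) - 40), min) { print header; print }
    ' "$@"
}
```

Usage would look like `find_str 5 infile.fasta > str_reads.fasta`. Note that homopolymer runs also qualify under this definition (ten A's are the unit AA repeated five times), so they would need a separate filter if unwanted. Where a tool with backreferences is available (for example Perl, or GNU grep as an extension), a pattern that refers back to the captured group may be shorter, but backreferences are not part of POSIX extended regular expressions.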


Last edited by zaxxon; 09-03-2013 at 02:32 PM.
 
