Searching a large file for short tandem repeats
Post 302849839 by ljk on Tuesday 3rd of September 2013 12:26:56 PM

Hello,
I am searching a large (~25 GB) DNA sequence data set in FASTA short-read format:

Code:
>ReadName
ACGTACGTACGT...[150charactersPerRead]

for short tandem repeats, meaning any motif of 2-6 characters that is repeated in tandem at least a given number of times, with that number supplied as an input variable. It seems like a reasonably simple job, but I'm having trouble developing a regex that will work. As a start, I have:

Code:
awk --posix '{ STR = "([ACGT]{2,6})"; if (substr($0, 40, length() - 40) ~ STR) print }' infile.fasta

The substring constraints come from downstream requirements. However, I'm having trouble expressing in the regex that I want repeats of one discrete motif (the same 2-6 bases repeated back to back), not 5 or more consecutive matches (for example) of ANY 2-6 bases, which obviously matches every read.

Any ideas would be great, thanks for the help!
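
One possible way to express the "same motif repeated" requirement, offered as a sketch under stated assumptions rather than a definitive answer: awk regular expressions have no backreferences, so a pattern alone cannot say "the same 2-6 bases again". The check can instead compare each base with the base k positions earlier, for k = 2 to 6. A motif of length k repeated n times in tandem produces k*(n-1) such matches in a row. The function name find_str, its two arguments (minimum repeat count, start column) and the two-line record layout (header, then one sequence line) are assumptions made for illustration. Unlike the substr() call above, the window here runs to the end of the read.

```shell
# find_str MINREP [START]: hypothetical helper, not from the original post.
# Reads FASTA on stdin, assuming two-line records (header, then sequence).
# Prints records whose sequence, from column START (default 40) onward,
# contains a 2-6 base motif repeated in tandem at least MINREP times.
find_str() {
    awk -v minrep="$1" -v start="${2:-40}" '
    # Return 1 if s holds a motif of length 2..6 repeated >= n times in a row.
    function has_str(s, n,    len, k, i, c, run) {
        len = length(s)
        for (k = 2; k <= 6; k++) {
            run = 0
            for (i = k + 1; i <= len; i++) {
                c = substr(s, i, 1)
                # Count consecutive bases equal to the base k positions back.
                if (c == substr(s, i - k, 1) && index("ACGT", c)) {
                    if (++run >= k * (n - 1)) return 1
                } else {
                    run = 0
                }
            }
        }
        return 0
    }
    /^>/ { header = $0; next }                      # remember the read name
    has_str(substr($0, start), minrep) { print header; print }
    '
}
```

Usage would look like `find_str 5 40 < infile.fasta > hits.fasta` (hits.fasta is a placeholder name). Note that a homopolymer run such as AAAAAAAAAA also counts (as AA repeated 5 times), and only upper-case A, C, G and T are considered. Where GNU grep is available, its backreference extension, as in `grep -E '([ACGT]{2,6})\1{4,}'`, may be a shorter route to the same idea.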

Moderator's Comments:
Use code tags, see your PM.

Last edited by zaxxon; 09-03-2013 at 02:32 PM.
 

Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.