Hello,
I am searching a large (~25 GB) set of DNA sequence data in FASTA short-read format:
for short tandem repeats, meaning any motif of 2-6 characters that is repeated in tandem a number of times given as an input variable. This seems like a reasonably simple job, but I'm having trouble developing a regex that works. As a start, I have:
The substring constraints come from downstream requirements. However, I'm having trouble expressing in the regex that I want repeats of one discrete motif, not ANY 5 or more repeats (for example) of ANY 2-6 bases, which obviously matches every read.
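One way to express "the same motif, repeated" is a backreference: capture the 2-6 base motif once, then require further copies of that exact capture. The following is only a minimal sketch in Python rather than the poster's own regex (which is not shown above); the function names `build_str_regex` and `find_strs` are made up for illustration, and the alphabet is assumed to be uppercase A, C, G, T.

```python
import re

def build_str_regex(min_repeats):
    """Compile a pattern for a 2-6 base motif repeated at least min_repeats times.

    The motif is captured once, and \\1 then demands more copies of that
    same captured text, so unrelated 2-6 base chunks do not match.
    The lazy {2,6}? tries the shortest motif first.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    return re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))

def find_strs(read, min_repeats):
    """Return (start offset, motif, copy count) for each tandem repeat in one read."""
    pattern = build_str_regex(min_repeats)
    hits = []
    for m in pattern.finditer(read):
        motif = m.group(1)
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits

if __name__ == "__main__":
    print(find_strs("TTGACACACACACACGG", 5))
```

Note that a homopolymer run such as AAAAAAAAAA also matches (as the motif AA), so those would need filtering afterwards if they are not wanted.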
This seems more like a job for Perl or C: C for maximum speed, with the input file mmap()'d. A 64-bit OS and compile is nice to avoid remapping. It seems to be a 40-byte sliding-window check for instances of the initial 2-6 characters, then move the window forward by one. Do you just want counts? If ABC is reported as repeating, do you want AB and BC reported, too? It seems like you may get more output than input if you want more than just aggregates. Perhaps one process/thread to search and another to aggregate?
I'm sorry, but I don't understand what the problem is.
What's the pattern you are looking for? Your STR variable will match everything...
Tandem means "exactly two"?
In pseudo code, I think he wants something like:
By searching the longest pattern and nearest string first, you avoid duplicates of substrings (AB in ABC) or three within the window (for 3, you get two detail records, first to second and second to third).
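The longest-first idea can be sketched roughly as follows. This is one interpretation in Python, not the poster's pseudocode (which is not shown above); the function name `scan_longest_first` and its parameters are hypothetical. At each position it tries motif lengths from 6 down to 2, counts back-to-back copies, and on a hit jumps past the whole run, so neither shorter substrings nor later copies in the same run are reported again.

```python
def scan_longest_first(seq, min_repeats, min_len=2, max_len=6):
    """Report (start, motif, copies) for tandem repeats, longest motif first."""
    hits = []
    i = 0
    n = len(seq)
    while i < n:
        for k in range(max_len, min_len - 1, -1):
            motif = seq[i:i + k]
            if len(motif) < k:
                continue  # not enough sequence left for a motif of this length
            count = 1
            j = i + k
            # Count adjacent copies of the motif immediately after the first.
            while seq.startswith(motif, j):
                count += 1
                j += k
            if count >= min_repeats:
                hits.append((i, motif, count))
                i = j  # skip the whole run so it is reported only once
                break
        else:
            i += 1  # no motif length qualified here; slide forward by one
    return hits

if __name__ == "__main__":
    print(scan_longest_first("GGABCABCABCTT", 3))
```

One side effect of trying the longest motif first: a long enough run of a short motif is reported under a multiple of its length (for example ACAC instead of AC), so the motif may need reducing to its shortest period afterwards.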
You need some minimally intrusive code to deal with end of file, or pad the file with 38 non-cap-letter bytes, perhaps using "(...;echo...)|" as input.
Managing the window without repeatedly sliding bytes in the buffer is a bit tricky, using either an oversized buffer so slides are less frequent or mmap64() of the entire file (not pipe friendly, use padded file or end of file special code?).
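For comparison, mapping the whole file avoids managing a window buffer by hand. The sketch below uses Python's mmap module together with a bytes regex, not the C/mmap64() approach described above, and it assumes each read sits on a single line in uppercase A, C, G, T. The function name `find_strs_in_file` is hypothetical. Because the pattern only matches base characters, a match cannot run across a newline or into a header line, so no padding is needed at end of file.

```python
import mmap
import os
import re

def find_strs_in_file(path, min_repeats):
    """Return (byte offset, motif, copies) for tandem repeats in a mapped file."""
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    if os.path.getsize(path) == 0:
        return []  # an empty file cannot be memory-mapped
    # Same backreference idea as for strings, but compiled as a bytes pattern.
    pattern = re.compile(rb"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in pattern.finditer(mm):
                motif = m.group(1)
                copies = len(m.group(0)) // len(motif)
                hits.append((m.start(), motif.decode("ascii"), copies))
    return hits
```

The offsets are byte positions in the file, not positions within a read; mapping them back to a read ID would take an extra pass or an index of line starts.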
Last edited by DGPickett; 09-19-2013 at 05:13 PM..