Searching a large file for short tandem repeats
Post 302855333 by DGPickett on Thursday 19th of September 2013 04:08:16 PM
In pseudo code, I think he wants something like:
Code:
(
read 40 bytes into buffer (ensure a full window)
loop:
 for len in 6 5 4 3 2
 do
  compare the len bytes at buffer + 0 to each substring from buffer + len through buffer end - len, nearest first, tracking absolute file offsets
  if a hit,
  then
   print the hit as "Pattern<TAB>offset1<TAB>offset2" (tab-separated so 'cut -f 1' works)
   move the window up by len
   continue loop (restart the 'for')
  fi
 done
 move the window up by 1 for no hits
 exit (or run special EOF adjustment code) if EOF prevents a full window
 continue loop (restart the 'for')
) | sort | tee detail_file | cut -f 1 | uniq -c > summary_file

Searching for the longest pattern first avoids duplicate reports for substrings (AB inside ABC). Searching for the nearest match first handles three copies within one window cleanly: for 3 copies you get two detail records, first to second and second to third.
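For anyone who wants to try the idea outside the shell, here is a minimal Python sketch of the loop above. It is only my reading of the pseudocode, not a tested tool: the function name find_repeats and its parameters are made up for illustration, the data is assumed to fit in memory, and it simply stops when EOF prevents a full window.

```python
def find_repeats(data: bytes, window: int = 40, lengths=(6, 5, 4, 3, 2)):
    """Return (pattern, offset1, offset2) hits, longest pattern and nearest match first.

    Hypothetical helper: 'data' is the whole input held in memory, which is an
    assumption made for brevity, not something the original post requires.
    """
    hits = []
    pos = 0  # absolute file offset of the window start
    # Stop as soon as EOF prevents a full window.
    while pos + window <= len(data):
        buf = data[pos:pos + window]
        step = 1  # no hit: move the window up by 1
        for n in lengths:
            pattern = buf[:n]
            # Nearest copy starting at or after buf + n and ending inside the window.
            k = buf.find(pattern, n)
            if k != -1:
                hits.append((pattern.decode("ascii", "replace"), pos, pos + k))
                step = n  # hit: move the window up by len
                break
        pos += step
    return hits
```

Printing each hit as tab-separated fields would then feed the sort | tee | cut | uniq -c pipeline unchanged. On the input ACGACGACG followed by 38 dots this reports ACG at offsets 0 and 3, then ACG at offsets 3 and 6, which is the "two detail records" case.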

You need some minimally intrusive code to deal with end of file, or you can pad the file with 38 non-cap-letter bytes (presumably the 40-byte window minus the shortest 2-byte pattern), perhaps using "(...;echo...)|" as input.
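If the scanner ends up in something like Python rather than the shell, the padding can be done on the way in. A small sketch, assuming a binary stream and a dot as the pad byte (any byte that is not a capital letter should do); padded_chunks is a hypothetical name, not an existing utility:

```python
import io


def padded_chunks(stream, pad_len=38, pad_byte=b".", chunk_size=65536):
    """Yield the stream's bytes in chunks, then pad_len pad bytes at EOF.

    Assumption: pad_byte never appears in the real data, so the padding
    cannot start a false hit.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means end of file
            break
        yield chunk
    yield pad_byte * pad_len


# Example: pad an in-memory "file" so the last real bytes still get a full window.
padded = b"".join(padded_chunks(io.BytesIO(b"ACGT")))
```

This keeps the main loop free of EOF special cases, at the cost of one extra short write at the end.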

Managing the window without repeatedly sliding bytes in the buffer is a bit tricky. Either use an oversized buffer so slides are less frequent, or mmap64() the entire file, which is not pipe-friendly and probably still needs a padded file or special end-of-file code.
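Here is a rough Python sketch of the oversized-buffer option. The class name SlidingBuffer and its capacity parameter are my own invention for illustration; the point is only that consumed bytes are dropped once per refill instead of once per window move, and that it works on pipes because it only calls read().

```python
import io


class SlidingBuffer:
    """Hypothetical helper: a 40-byte window over a stream, backed by a larger buffer."""

    def __init__(self, stream, window=40, capacity=65536):
        # Assumption: capacity >= window.
        self.stream = stream
        self.window = window
        self.capacity = capacity
        self.buf = bytearray()
        self.start = 0   # index of the window start inside buf
        self.offset = 0  # absolute file offset of the window start

    def current(self):
        """Return the full window as bytes, or None if EOF prevents a full window."""
        if len(self.buf) - self.start < self.window:
            # Slide once: drop everything already consumed, then refill.
            del self.buf[:self.start]
            self.start = 0
            while len(self.buf) < self.capacity:
                chunk = self.stream.read(self.capacity - len(self.buf))
                if not chunk:  # EOF
                    break
                self.buf += chunk
            if len(self.buf) < self.window:
                return None
        return bytes(self.buf[self.start:self.start + self.window])

    def advance(self, n):
        """Move the window up by n bytes (n no larger than the window)."""
        self.start += n
        self.offset += n
```

With the default capacity, the buffer slides roughly once per 64 KB of input rather than on every window move. The mmap route avoids the buffer bookkeeping entirely but only works on regular files.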

Last edited by DGPickett; 09-19-2013 at 05:13 PM.
 
