Extract distinct sequence of letters

10-14-2014


Hello,
I need to extract a distinct sequence of letters, for example from 136 to 193.
The files are quite big, so I would prefer not to use "fold -w1".
Thank you very much.

The input file looks like this:

Code:

       1 cttttacctt catgtgtttt tgcagatatt tgttcataat aacatcttct ttttaagtta
      61 ttaaaatctt ttttaaagtt attaacattt ttttgtcttt tagatcctat atagatccta
     121 aaagatccta aaagatccta aaagatcccc gtttttgtta aagcatatgt gataaggttt
     181 tatagtactt taagattcac tatagtcagt aaaacgttca ctatagtcag taaaacgttc

Last edited by Don Cragun; 10-14-2014 at 04:00 AM.. Reason: Add CODE tags.

kamcamonty


10-14-2014



I'm lost.

1. Please use CODE tags.
2. From 136 to 193 what? You have shown us lines with a variable-length digit string followed by 5 or 6 groups of 10 letters. The only thing you have shown us between 136 and 193 is the digit string 181 at the start of the last line of your sample.
3. What do you mean by "Files are quite big"? How long will the longest lines be in your input files? What is the maximum size (in bytes) of your input files?
4. What output are you expecting from the above sample input?
5. I understand that you don't want to use fold -w1, but I don't understand why you would say that. I don't see how using fold -w1 to put each character in your input files on a single line would help solve this problem. (But maybe that is just because I can't figure out what you're trying to do.)
6. What OS and shell are you using?

Last edited by Don Cragun; 10-14-2014 at 04:48 AM.. Reason: Add request for OS and shell information.

Don Cragun


10-14-2014


2. The input file is in fact one sequence of letters that is split into numbered lines (and each line is split into space-separated groups of 10 letters). If it would be better, I can first join the whole input file into one long line. I want to create a shorter sequence from each file such that the Nth letter (e.g. the third or the 136th) of the sequence will be the first letter of the new sequence and the Mth letter (e.g. the 196th) is the last one. (Just imagine all letters are numbered and I want all letters whose numbers are greater than or equal to 136 and smaller than 196.)
I use zsh, but it is no problem to use bash; OS: Biolinux (Ubuntu).
3. All lines are of this length (only the length of the number varies), but there can be about 1 000 000 lines in each file.
5. I wanted to a) remove the spaces and numbers, b) put each character on a new line, c) select the lines containing the characters I wanted using awk NR, and d) join all lines into one.
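
For readers following along: the four steps in item 5 do not need one line per character. A minimal sketch, assuming the layout shown in the first post (a position number followed by blank-separated groups of letters), keeps a running count of letters seen and cuts the wanted part out of each line with substr. The function name extract_stream and the temporary sample file are hypothetical, and the range is treated as inclusive at both ends.

```shell
#!/bin/sh
# extract_stream FIRST LAST FILE  (hypothetical helper, not from the thread)
# Prints letters FIRST..LAST (1-based, inclusive) of the sequence in FILE.
extract_stream() {
    awk -v fc="$1" -v lc="$2" '
    {
        $1 = ""                  # drop the leading position number
        gsub(/ /, "")            # drop blanks; $0 now holds letters only
        start = seen + 1         # sequence position of this line'"'"'s first letter
        seen += length($0)       # sequence position of this line'"'"'s last letter
        if (seen < fc) next      # wanted range has not started yet
        from = (fc > start) ? fc - start + 1 : 1
        to   = (lc < seen)  ? lc - start + 1 : length($0)
        printf "%s", substr($0, from, to - from + 1)
        if (seen >= lc) exit     # range complete; do not read the rest
    }
    END { print "" }             # terminate the output with a newline
    ' "$3"
}

# Sample data copied from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
       1 cttttacctt catgtgtttt tgcagatatt tgttcataat aacatcttct ttttaagtta
      61 ttaaaatctt ttttaaagtt attaacattt ttttgtcttt tagatcctat atagatccta
     121 aaagatccta aaagatccta aaagatcccc gtttttgtta aagcatatgt gataaggttt
     181 tatagtactt taagattcac tatagtcagt aaaacgttca ctatagtcag taaaacgttc
EOF

extract_stream 136 193 "$sample"
```

Because awk exits as soon as the last wanted letter has been printed, the remainder of a large file is never read, and nothing here depends on every line holding the same number of letters.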

kamcamonty


10-14-2014


Still not clear. If you want to remove the line count, all white space including new lines, and then extract some chars, this might help:

Code:

awk '{$1=""; gsub(/ /,"")}1' ORS="" file | dd bs=1 skip=135 count=58 2>/dev/null
tcctaaaagatccccgtttttgttaaagcatatgtgataaggttttatagtactttaa

This works for the sample you gave us; not sure how it behaves on large files.
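
The two numbers passed to dd follow from the requested range: skip is first - 1 (136 - 1 = 135) and count is last - first + 1 (193 - 136 + 1 = 58). A minimal sketch that wraps this arithmetic in a function is shown here; the name extract_range and the temporary sample file are hypothetical. It swaps dd bs=1 for tail -c and head -c on the assumption that copying one byte per read would be slow on large files, and head -c, while common on GNU and BSD systems, is not required by POSIX.

```shell
#!/bin/sh
# extract_range FIRST LAST FILE  (hypothetical helper, not from the thread)
# Prints letters FIRST..LAST (1-based, inclusive) of the sequence in FILE.
extract_range() {
    first=$1 last=$2 file=$3
    count=$((last - first + 1))     # e.g. 193 - 136 + 1 = 58
    # Blank the position number, delete all spaces, and join the lines
    # into one stream by using an empty output record separator.
    awk '{ $1 = ""; gsub(/ /, "") } 1' ORS="" "$file" |
        tail -c "+$first" |         # start output at byte FIRST
        head -c "$count"            # keep COUNT bytes, then stop
    echo                            # terminate the output with a newline
}

# Sample data copied from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
       1 cttttacctt catgtgtttt tgcagatatt tgttcataat aacatcttct ttttaagtta
      61 ttaaaatctt ttttaaagtt attaacattt ttttgtcttt tagatcctat atagatccta
     121 aaagatccta aaagatccta aaagatcccc gtttttgtta aagcatatgt gataaggttt
     181 tatagtactt taagattcac tatagtcagt aaaacgttca ctatagtcag taaaacgttc
EOF

extract_range 136 193 "$sample"
# prints: tcctaaaagatccccgtttttgttaaagcatatgtgataaggttttatagtactttaa
```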

RudiC


10-14-2014


You could also try something like:

Code:

#!/bin/ksh
IAm=${0##*/}
if [ $# -lt 3 ]
then	printf 'Usage: %s first_char# last_char# file...\n' "$IAm" >&2
	exit 1
fi
first="$1"
last="$2"
shift 2
awk -v fc="$first" -v lc="$last" '
BEGIN {	fl = int((fc - 1) / 60) + 1	# first line # containing data to copy
	ll = int((lc - 1) / 60) + 1	# last line # containing data to copy
	flc1 = fc % 60 ? fc % 60 : 60	# first character # to copy on line fl
	llcl = lc % 60 ? lc % 60 : 60	# last character # to copy on line ll
}
FNR >= fl && FNR <= ll {
	s = $2 $3 $4 $5 $6 $7
	printf("%s%s", substr(s, FNR == fl ? flc1 : 1,
		FNR == ll ? FNR == fl ? llcl - flc1 + 1 : llcl : 60),
		FNR == ll ? "\n" : "")
	if(FNR == ll) nextfile
}' "$@"

I prefer ksh over bash, but this script will work with either shell. This script allows you to specify the starting character number, the last character number, and a list of one or more files to process. It should work fine on any Linux system, but the awk nextfile command is an extension to the standards. If your version of awk does not have nextfile:

if you only want to process one file at a time, change nextfile to exit,
otherwise, remove the entire if(FNR == ll) nextfile line (the script will still produce correct output, but will run slower, especially on large files). Note that the && FNR <= ll test can be removed as long as this line remains in your code (with either exit or nextfile).

If someone else reading this thread wants to try this on a Solaris/SunOS system, change awk in the script to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or /usr/bin/nawk.
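
To see what the BEGIN block of the script computes, the same four formulas can be run on their own. This is only an illustration, assuming 60 letters per input line as in the sample; the function name line_bounds is hypothetical.

```shell
#!/bin/sh
# line_bounds FIRST LAST  (hypothetical helper for illustration only)
# Prints the line numbers and in-line character positions that the
# script above derives from a first and last character number.
line_bounds() {
    awk -v fc="$1" -v lc="$2" 'BEGIN {
        fl   = int((fc - 1) / 60) + 1     # first line holding wanted data
        ll   = int((lc - 1) / 60) + 1     # last line holding wanted data
        flc1 = fc % 60 ? fc % 60 : 60     # first wanted character on line fl
        llcl = lc % 60 ? lc % 60 : 60     # last wanted character on line ll
        printf "fl=%d ll=%d flc1=%d llcl=%d\n", fl, ll, flc1, llcl
    }'
}

line_bounds 136 193    # prints: fl=3 ll=4 flc1=16 llcl=13
line_bounds 61 120     # prints: fl=2 ll=2 flc1=1 llcl=60
```

For 136 to 193 the script therefore copies characters 16 through 60 of line 3 (45 letters) and characters 1 through 13 of line 4 (13 letters), 58 letters in total. The second call shows why the modulo results fall back to 60: a position that is an exact multiple of 60 is the last character of its line, not character 0 of the next.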

Don Cragun
