Outputting characters after a given string and reporting the characters in the row below --sed

01-14-2019

Making a few wild guesses... If the output you're trying to produce is:

Code:

Codon:	AAC	Quality Score:	,ED
Codon:	TCCAAG	Quality Score:	G7DCGG
Codon:	AAC	Quality Score:	GCC
Codon:	TCCAAG	Quality Score:	DGGCGG
Codon:	AAC	Quality Score:	GCC
Codon:	TCCAAG	Quality Score:	DGGCGG
Codon:	TTT	Quality Score:	+GG

you could try something like:

Code:

awk -v lengths="3 6" -v strings="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN {	nString = split(strings, String)	# strings to search for
	split(lengths, OutLen)			# characters to report after each string
	for(i = 1; i <= nString; i++)
		StringLen[i] = length(String[i])
}
/^@/ {	getline CodonLine			# 2nd line of the record: sequence
	getline					# 3rd line of the record: skipped
	getline QualityLine			# 4th line of the record: quality
	for(i = 1; i <= nString; i++)
		if(spot = index(CodonLine, String[i]))
			# same offset and length in both lines
			printf("Codon:\t%s\tQuality Score:\t%s\n",
			    substr(CodonLine, spot + StringLen[i], OutLen[i]),
			    substr(QualityLine, spot + StringLen[i], OutLen[i]))
}' file
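
To make the index()/substr() arithmetic easier to follow, here is a minimal sketch for a single search string. The function name `extract_after` and the one-record input are made up for illustration (they are not from the original data); the quality line uses distinct characters only so the offsets are easy to check by eye.

```shell
# Hypothetical helper: report n characters after the first occurrence of
# marker in the sequence line, plus the characters at the same positions
# in the quality line. Assumes well-formed 4-line records.
extract_after() {
	# $1 = marker string, $2 = number of characters to report
	awk -v marker="$1" -v n="$2" '
	/^@/ {	getline seq		# sequence line
		getline			# separator line, ignored
		getline qual		# quality line
		# index() returns 0 (false) when marker is absent
		if (spot = index(seq, marker))
			printf("Codon:\t%s\tQuality Score:\t%s\n",
			    substr(seq, spot + length(marker), n),
			    substr(qual, spot + length(marker), n))
	}'
}

# Made-up record: 2 leading bases, the 15-character marker, then AAC.
printf '%s\n' '@read1' 'TTGCATGAAAACATACAAACGG' '+' 'ABCDEFGHIJKLMNOPQ,EDRS' |
    extract_after GCATGAAAACATACA 3
```

Here the marker starts at position 3, so the reported characters start at position 3 + 15 = 18 in both lines, giving AAC and ,ED.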

Don Cragun

01-14-2019

It was interesting to try implementing this algorithm in sed:

Code:

#!/bin/sed -nrf
2~4 h
4~4 {
H;x
s/(.*GCATGAAAACATACA.{3})(.*)/\1\r\2/
}
/\r/ {
:1
s/^.(.{2}[^\r].*)/\1/
T2
s/(\n).(.*)/\1\2/
t1
:2
s/^(.{3}).*/\1/mg
s/(.*)\n(.*)/Codon:\t\1\tQuality Score:\t \2/p
}
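
For anyone unfamiliar with the addressing used above: `2~4` and `4~4` are GNU sed "first~step" addresses (a GNU extension, so this will not work with other seds), and `h`, `H` and `x` shuffle lines between the pattern space and the hold space. A small sketch of just that pairing step, with a hypothetical function name and made-up two-record input:

```shell
# Hypothetical helper: join each sequence line (lines 2, 6, 10, ...) with
# its quality line (lines 4, 8, 12, ...). Requires GNU sed.
pair_seq_and_qual() {
	# 2~4 h    : copy the sequence line into the hold space
	# 4~4 H;x  : append the quality line to it, then swap it into the
	#            pattern space, which now holds "sequence\nquality"
	sed -n '2~4h; 4~4{H;x;s/\n/ /p;}'
}

printf '%s\n' '@r1' 'ACGT' '+' 'IIII' '@r2' 'TTAA' '+' 'HHHH' |
    pair_seq_and_qual
```

This prints one "sequence quality" pair per record; the full script above then trims both halves to the wanted characters.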

Last edited by nezabudka; 01-14-2019 at 09:08 AM..

nezabudka

01-14-2019

Don,
I modified your script a bit to output the total count and add some formatting:

Code:

awk -v gene="gene-a gene-b" -v lengths="3 6" -v strings="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN {	nString = split(strings, String)
	split(lengths, OutLen)
	split(gene, Id)
	for(i = 1; i <= nString; i++)
		StringLen[i] = length(String[i])
}
/^@/ {	getline CodonLine
	getline
	getline QualityLine
	for(i = 1; i <= nString; i++)
		if(spot = index(CodonLine, String[i]))
			printf("Gene:\t"Id[i]"\tCodon:\t%s\t\tQuality Score:\t%s\t\n",
			    substr(CodonLine, spot + StringLen[i], OutLen[i]),
			    substr(QualityLine, spot + StringLen[i], OutLen[i]))
}' test.txt | awk '
{ count[$0]++ }
END {	print "\n\t\t\t\tSummary\n#############################################################################\nCount\t\tGene\t\tCodon\t\t\tQuality Score\n"
	for (gene in count)
		print count[gene] "\t" gene | "sort -k 3"
}'

With the above script I am getting the desired output:

Code:

                                Summary
#############################################################################
Count           Gene            Codon                   Quality Score

1       Gene:   gene-a  Codon:  AAC             Quality Score:  ,ED
2       Gene:   gene-a  Codon:  AAC             Quality Score:  GCC
1       Gene:   gene-a  Codon:  TTT             Quality Score:  +GG
2       Gene:   gene-b  Codon:  TCCAAG          Quality Score:  DGGCGG
1       Gene:   gene-b  Codon:  TCCAAG          Quality Score:  G7DCGG

However, when I tried to include the END step in your awk script, I failed miserably. How can I modify the script so I don't have to "stitch" together the two scripts as shown above?
Thanks!

Xterra

01-14-2019

Hi Xterra,
Maybe something like:

Code:

awk -v genes="gene-a gene-b" \
    -v lengths="3 6" \
    -v strings="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN {	nString = split(strings, String)
	split(lengths, SLen)
	split(genes, Id)
	for(i = 1; i <= nString; i++)
		StringLen[i] = length(String[i])
	sort_cmd = "sort -k3,3 -k5,5 -k8,8"
	print "\n\t\t\t\tSummary"
	print "#############################################################" \
	    "################"
	print "Count\t\tGene\t\tCodon\t\t\tQuality Score\n"
}
/^@/ {	getline CodonLine
	getline
	getline QualityLine
	for(i = 1; i <= nString; i++)
		if(spot = index(CodonLine, String[i]))
			count[sprintf( \
			    "Gene:\t%s\tCodon:\t%s\tQuality Score:\t%s",
			    Id[i],
			    substr(CodonLine, spot + StringLen[i], SLen[i]),
			    substr(QualityLine, spot + StringLen[i], SLen[i])) \
			]++
}
END {	for(i in count)
		printf("%d\t%s\n", count[i], i) | sort_cmd
}' test.txt

which produces the output:

Code:

				Summary
#############################################################################
Count		Gene		Codon			Quality Score

1	Gene:	gene-a	Codon:	AAC	Quality Score:	,ED
2	Gene:	gene-a	Codon:	AAC	Quality Score:	GCC
1	Gene:	gene-a	Codon:	TTT	Quality Score:	+GG
2	Gene:	gene-b	Codon:	TCCAAG	Quality Score:	DGGCGG
1	Gene:	gene-b	Codon:	TCCAAG	Quality Score:	G7DCGG

I'm sure you could write this as a 1-liner, but I much prefer something I can see on a screen (and debug).

If there's anything here you can't figure out, ask questions about what you don't understand.
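
In case the counting part is the unclear bit, here it is in isolation as a rough sketch: each distinct line becomes an array subscript, the element is incremented every time that line is seen, and the END block pipes the totals through sort. The function name `count_lines` and the sample input are made up for illustration.

```shell
# Hypothetical helper: count identical input lines and print
# "count<TAB>line", sorted on the text after the count.
count_lines() {
	awk '
	{ count[$0]++ }		# one counter per distinct input line
	END {	for (line in count)
			printf("%d\t%s\n", count[line], line) | "sort -k2"
	}'
}

printf '%s\n' 'gene-a AAC' 'gene-b TCC' 'gene-a AAC' | count_lines
```

The for (line in count) loop visits subscripts in no particular order, which is why the output goes through sort instead of being printed directly.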

Hope this helps,
Don

Don Cragun
