awk to change value in field according to another

11-09-2018

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts


I am trying to use awk to check whether each $2 in file1 falls between $2 and $3 of the file2 lines whose $4 matches. If it does, $5 of the output should be exon; if it does not, intron. I think the awk below will do that, but I am struggling to add a calculation so that if the difference is less than 10, $5 is splicing. I have added an example of line 1 as well.

The 5th line is an example of the splicing, because the $2 value in file1 is 2 away from the $2 value in file2. Thank you


file1

Code:

chr1	17345304	17345315 	SDHB	
chr1	17345516	17345524 	SDHB	
chr1	93306242	93306261 	RPL5	
chr1	93307262	93307291 	RPL5
chrx	153295819	153296875 	MECP2	
chrx	153295810	153296800 	MECP2

file2 tab-delimited

Code:

chr1	17345375	17345453	SDHB_cds_0_0_chr1_17345376_r	0	-
chr1	17349102	17349225	SDHB_cds_1_0_chr1_17349103_r	0	-
chr1	17350467	17350569	SDHB_cds_2_0_chr1_17350468_r	0	-
chr1	17354243	17354360	SDHB_cds_3_0_chr1_17354244_r	0	-
chr1	17355094	17355231	SDHB_cds_4_0_chr1_17355095_r	0	-
chr1	17359554	17359640	SDHB_cds_5_0_chr1_17359555_r	0	-
chr1	17371255	17371383	SDHB_cds_6_0_chr1_17371256_r	0	-
chr1	17380442	17380514	SDHB_cds_7_0_chr1_17380443_r	0	-
chr1	93297671	93297674	RPL5_cds_0_0_chr1_93297672_f	0	+
chr1	93298945	93299015	RPL5_cds_1_0_chr1_93298946_f	0	+
chr1	93299101	93299217	RPL5_cds_2_0_chr1_93299102_f	0	+
chr1	93300335	93300470	RPL5_cds_3_0_chr1_93300336_f	0	+
chr1	93301746	93301949	RPL5_cds_4_0_chr1_93301747_f	0	+
chr1	93303012	93303190	RPL5_cds_5_0_chr1_93303013_f	0	+
chr1	93306107	93306196	RPL5_cds_6_0_chr1_93306108_f	0	+
chr1	93307322	93307422	RPL5_cds_7_0_chr1_93307323_f	0	+
chrX	153295817	153296901	MECP2_cds_0_0_chrX_153295818_r	0	-
chrX	153297657	153298008	MECP2_cds_1_0_chrX_153297658_r	0	-
chrX	153357641	153357667	MECP2_cds_2_0_chrX_153357642_r	0	-

desired output tab-delimited

Code:

chr1	17345304	17345315 	SDHB	intron
chr1	17345516	17345524 	SDHB	intron	
chr1	93306242	93306261 	RPL5	intron	
chr1	93307262	93307291 	RPL5	intron
chrx	153295819	153296875	MECP2	exon
chrx	153295810	153296800	MECP2	splicing

awk

Code:

awk '
FNR==NR{
  a[$4];
  min[$4]=$2;
  max[$4]=$3;
  next
}
{
  split($4,array,"_");
  print $0,(array[1] in a) && ($2>=min[array[1]] && $2<=max[array[1]])?"exon":"intron"
}
' file1 OFS="\t" file2 > output

example of line 1

Code:

a[$4] = SDHB
min[$4] = 17345304
max[$4] = 17345315

array[1] = SDHB, 17345304 >= 17345375 && array[1] = SDHB, 17345315 <= 17345453 ---- intron

cmccabe


11-09-2018

Moderator

8,825, 1,112

Join Date: Feb 2005

Last Activity: 23 August 2021, 11:26 AM EDT

Location: Foxborough, MA

Posts: 8,825

Thanks Given: 579

Thanked 1,112 Times in 1,003 Posts

hmmm...
$4 in file1 is not unique, so the last line for each $4 wins.
Is that what you want?
Or would you rather determine min/max per $4 in file1 as you go?
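
To illustrate the point, here is a minimal sketch that feeds the first two SDHB lines of file1 from the question through the same `min[$4]=$2; max[$4]=$3` assignments. The `result` variable and the `END` loop are only there for the demonstration and are not part of the original script.

```shell
# Two file1 lines share the key SDHB; each assignment overwrites the previous one.
result=$(
  printf 'chr1\t17345304\t17345315\tSDHB\nchr1\t17345516\t17345524\tSDHB\n' |
  awk '{ min[$4] = $2; max[$4] = $3 }
       END { for (g in min) print g, min[g], max[g] }'
)
echo "$result"    # prints: SDHB 17345516 17345524
```

Only the values from the second SDHB line survive, so every later comparison would be made against that one line.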

Last edited by vgersh99; 11-09-2018 at 11:30 AM..


vgersh99


11-09-2018


The $4 value in file1 is not unique, but it is meant to ensure that, using line 1 as an example, only SDHB lines are searched or used in the comparison. There may be hundreds of lines in file1, but only a subset will match the $4 value. Thank you

cmccabe


11-10-2018

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by cmccabe

And, of that subset that matches the $4 value, we have to assume that the same subset or a smaller subset may be classified as "intron", the same subset or a smaller subset may be classified as "exon", and the same subset or a smaller subset may be classified as "splicing" when your criteria are applied. How are the lines in the subset that match the $4 value supposed to be combined or selected so that only one of the possible results is returned (presumably the one result that you want out of all the ones in the subset that do match)?


Don Cragun


11-10-2018

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

Quote:

Originally Posted by Don Cragun

may be classified as "intron",
may be classified as "exon"
may be classified as "splicing"

It certainly helps if one understands what this is all about, and since I happen to have a biological researcher at home who explained it to me, here it is (errors/omissions are due to my limited understanding; I was told this is already the kindergarten version of what is really going on):

"exon", short for "expressed region", is a unit of a gene which codes for something like a protein. Think of a "gene" as a text describing something; the "exon" would then be one complete sentence of this text. When DNA is read (so that what it codes for is actually produced) it is copied to "RNA" pieces. This process is called RNA-splicing*) and these pieces always contain several such whole exons.

"intron", short for "intragenic region", is the (more or less meaningless) part of the DNA between the exons. Think of it as some sort of punctuation and whitespace in the text. It is removed during RNA-splicing so that only the exons make it there.

*) RNA-splicing: the process of producing RNA from DNA works in several steps. First a complete DNA piece is copied, including the introns. Then the real RNA is made from that, omitting the introns and leaving only the exons. This, in fact, is the "splicing".

In the human genome about 1% is exons (so this in fact makes up the whole genetic information) and about 25% is introns. The rest is intergenic (that is: between genes and hence completely meaningless).

Thanks to my wife.

bakunin

Last edited by bakunin; 11-10-2018 at 05:05 AM..


bakunin


11-10-2018

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Quote:

Originally Posted by bakunin

. . . The rest is intergenic (that is: between genes and hence completely meaningless). Thanks to my wife. . . .

Out of sheer curiosity: is that meaningless intergenic rest the info that makes up the "genetic fingerprint" identifying individuals and revealing relationships like parent and child, or siblings? And regards to your wife for educating us.


RudiC


11-10-2018



file2 is a very large file of genes and all associated coding exons. Take the SDHB gene as an example: it is one of the ~22,000 genes in the human genome. A gene is made up of a variable number of exons, introns, and intragenic regions. File2 only lists the coding sequence of a gene, that is, what is currently known to code for a protein product and contribute to the human "genetic makeup".

File1 is created by a script that outputs all regions in a particular gene that may need to be interrogated further. The problem is that not all of those regions, defined by the $1, $2, and $3 values, may be important to know (there is still a lot unknown about the human genome). It's complex, as @bakunin kindly described (a big thanks to your wife).

The introns and intragenic regions regulate/affect exons (both coding and non-coding) but are still largely unknown. What is known is that coding exons (file2) and splicing (defined as +/- 10) are important.

The value in $4 of file1 is looked up in file2 to return the subset of lines for that gene. There may be multiple lines in each file for that gene, but the combination of $2 and $3 will define each line in file1 as intron, exon, or splicing.
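
One way this lookup could be written is sketched below. It is not a confirmed solution from the thread, and it rests on several assumptions: file2 is read first and every interval is stored per gene; the gene name is the text before the first "_" in $4 of file2 (as in the original attempt); only $2 of file1 is tested; "splicing" means $2 lies within 10 (inclusive) of either edge of an interval; exon takes precedence over splicing; and $1 is ignored, since the sample data has chrx in file1 but chrX in file2. The function name `classify`, the `win` variable, and the temporary demo files are hypothetical.

```shell
# Hypothetical helper: classify FILE2 FILE1
classify() {
    awk -v win=10 '
    BEGIN { OFS = "\t" }
    # First file (file2): remember every coding interval per gene.
    FNR == NR {
        split($4, part, "_")      # gene name is the text before the first "_"
        g = part[1]
        n[g]++
        lo[g, n[g]] = $2 + 0      # force numeric values
        hi[g, n[g]] = $3 + 0
        next
    }
    # Second file (file1): classify the start position ($2) of each region.
    {
        label = "intron"
        for (i = 1; i <= n[$4] + 0; i++) {
            s = lo[$4, i]; e = hi[$4, i]
            if ($2 >= s && $2 <= e) { label = "exon"; break }
            d = ($2 < s) ? s - $2 : $2 - e    # distance to the nearer edge
            if (d <= win) label = "splicing"
        }
        print $1, $2, $3, $4, label
    }' "$1" "$2"
}

# Small demo with a few lines taken from the sample data in the first post.
tmp=$(mktemp -d)
printf 'chr1\t17345375\t17345453\tSDHB_cds_0_0_chr1_17345376_r\t0\t-\n'    >  "$tmp/file2"
printf 'chrX\t153295817\t153296901\tMECP2_cds_0_0_chrX_153295818_r\t0\t-\n' >> "$tmp/file2"
printf 'chr1\t17345304\t17345315\tSDHB\n'    >  "$tmp/file1"
printf 'chrx\t153295819\t153296875\tMECP2\n' >> "$tmp/file1"
printf 'chrx\t153295810\t153296800\tMECP2\n' >> "$tmp/file1"
classify "$tmp/file2" "$tmp/file1"
```

On these three file1 lines the labels come out as intron, exon, and splicing, which agrees with the desired output in the first post. Default whitespace splitting is used on purpose so that the stray spaces and trailing tabs in file1 do not end up inside a field.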

I hope this helps, and I apologize for the long post, but I am excited to help share knowledge and really appreciate all the help. Thank you very much.

Last edited by cmccabe; 11-10-2018 at 02:03 PM..

cmccabe
