awk to change value in field according to another

11-11-2018


Quote:

Originally Posted by RudiC

is that meaningless intergenetic rest the info that makes up the "genetic fingerprint"

After asking her: no. They use so-called "RFLP"s for that purpose, and these are part(s) of an exon, if i have understood correctly. Here is a Wikipedia article:

https://en.wikipedia.org/wiki/Restri...ipedia-article

bakunin (enlightened by his wife)


11-11-2018


I'm sorry. I appreciate the lessons I'm getting in genomics, but I still don't understand your requirements.

From your description and examples, I'm guessing that even though you haven't said so:

- there will be no overlap in $2-$3 value ranges for any two lines in file2,
- all of the lines in file2 that are associated with a $4 value in file1 are adjacent,
- the strings in $4 in file1 and at the start of $4 in file2 are irrelevant to this problem (only the ranges specified by $2-$3 matter other than copying the $4 value in file1 into the output),
- if a $2 value in file1 is inside one of the $2-$3 ranges in file2, then a new 5th field added to file1 should be set to exon in the output (this comes from the examples, but conflicts with several statements in the English requirements),
- if a $2 value in file1 is not inside any $2-$3 range in file2 and the difference $2 on some line in file2 minus $2 on a line in file1 is greater than zero and less than eleven, then a 5th field added to file1 should be set to splicing in the output (this also comes from the examples, but conflicts with the stated English requirements), and
- otherwise, a 5th field added to file1 should be set to intron.

Please confirm whether or not my guesses are correct. And, if my guesses are not correct, please restate your requirements and give us an example where the stated requirements and the given examples are consistent with each other.

Note that if file2 is sorted on increasing values of field 2 (as in your example) and file1 were also sorted on increasing values of field 2, neither file would have to be loaded into memory and both files could be read one line at a time. (This would make the code more complex, but would reduce the amount of memory needed to run your program if that is an issue.) But, in your sample data, file1 is not sorted.

Don Cragun

11-11-2018


Quote:

there will be no overlap in $2-$3 value ranges for any two lines in file2

There could potentially be overlap in the $2-$3 value ranges; that is why $4, the gene ID, is used: the same $2-$3 values cannot exist in two different genes. To be extra sure, the combination of $4 and $1 can be used, so that only the gene in $4 on the chromosome in $1 is examined. That might be better, as it gives a unique lookup key for the search.

Quote:

all of the lines in file2 that are associated with a $4 value in file1 are adjacent

Yes, after the search key or lookup value is found in file2, all its associated lines will be adjacent, one on top of the other...

Code:

...... SDHB
...... SDHB
...... SDHB

Quote:

the strings in $4 in file1 and at the start of $4 in file2 are irrelevant to this problem (only the ranges specified by $2-$3 matter other than copying the $4 value in file1 into the output)

Yes, this is true... though the combination of $1 and $4 in file1 may be better to ensure a unique match and that values are found faster.

Quote:

- if a $2 value in file1 is inside one of the $2-$3 ranges in file2, then a new 5th field added to file1 should be set to exon in the output (this comes from the examples, but conflicts with several statements in the English requirements)
- if a $2 value in file1 is not inside any $2-$3 range in file2 and the difference $2 on some line in file2 minus $2 on a line in file1 is greater than zero and less than eleven, then a 5th field added to file1 should be set to splicing in the output (this also comes from the examples, but conflicts with the stated English requirements), and
- otherwise, a 5th field added to file1 should be set to intron.

Yes, this is correct. The conflicts in the English requirements have to do with the nature of the human genome: it is ever-changing and still full of unknowns. The test being performed also factors into it and can add additional complexity and conflicts.

Thank you very much for all of your help.

awk

Code:

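awk '
# First file (file1): remember chromosome, start and end for each gene in $4.
# Note: only the last line seen for a gene is kept.
FNR==NR{
  a[$4];
  chr[$4]=$1;
  min[$4]=$2;
  max[$4]=$3;
  next
}
# Second file (file2): the gene name is the part of $4 before the first "_".
{
  split($4,array,"_");
  g=array[1];
  print $0,((g in a) && $2>=min[g] && $2<=max[g] && $1==chr[g] ? "exon" : "intron")
}
' file1 OFS="\t" file2 > output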

Last edited by cmccabe; 11-11-2018 at 11:11 AM.. Reason: added awk

cmccabe


11-12-2018


We would really appreciate it if the data you post in your examples was consistent with itself and with the descriptions of the problems you present.

Note that numeric values with a trailing space are not always equivalent to numeric values without a trailing space.

Note also that string values (in this case gene names) are case sensitive and gene chrx in file1 does not match gene chrX in file2. Therefore, when you say that we should use $1 and $4 to match values between your two input files, there can never be a match for any gene chrx information in file2.
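The trailing-space point can be checked with a small sketch. The record below is hypothetical (it is not taken from the sample files): a tab-separated line whose first field ends in a space. Adding 0 forces a numeric comparison; what happens without it can differ between awk implementations, so that result is only displayed.

```shell
# Hypothetical two-field, tab-separated record; field 1 carries a trailing space.
line=$(printf '10 \t9')

# Forcing numeric context with "+ 0" makes the comparison independent of
# how a particular awk treats the trailing space.
forced=$(printf '%s\n' "$line" | awk -F '\t' '{ print (($1 + 0) > ($2 + 0)) }')

# Without "+ 0" the result may depend on the awk implementation,
# so it is only displayed here, not relied upon.
plain=$(printf '%s\n' "$line" | awk -F '\t' '{ print ($1 > $2) }')

echo "forced numeric: $forced   as read: $plain"
```

If the two values differ on your system, your awk compares such fields as strings unless they are forced to numbers.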

If I change your file1 contents to:

Code:

chr1	17345304	17345315 	SDHB	
chr1	17345516	17345524 	SDHB	
chr1	93306242	93306261 	RPL5	
chr1	93307262	93307291 	RPL5
chrX	153295819	153296875 	MECP2	
chrX	153295810	153296800 	MECP2

to match the gene names in your file2 (but leaving the trailing spaces in field #3), the following code:

Code:

#!/bin/ksh
awk -v d=$# '
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
			}
		}
	}
	if(i > c[$1, $4])
		$5 = "intron"
}
1' file2 file1

produces the output you said you wanted. Since you have extraneous non-numeric characters in some fields that should be numeric, this code includes safeguards to convert string values that may contain non-numeric values before performing comparisons. The debugging statements included helped me track down the conflict in your gene names that was keeping my code from producing the output you said you wanted. (To enable debugging, invoke the above script with an argument, any argument.) If you want to use case-insensitive comparisons on field #1 and field #4 values (which would be required to produce the output you say you want from the sample files you provided), I will leave it to you to update the code to do that. If you want to use case-insensitive comparisons you really need to say that in the description of your problem and not just hide it in inconsistent data in your sample input files.
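For readers who do want the case-insensitive variant, here is a minimal sketch of one way to build the lookup keys, assuming that upper-casing both fields with toupper() is acceptable. It is not the script above: the two miniature input files, the array names lo and hi, and the "no-match" label are made up for illustration, and the splicing/intron logic is left out.

```shell
# Hypothetical miniature inputs, not the poster's real data.
tmp=$(mktemp -d)
printf 'chrX\t100\t200\tMECP2_exon1\n' > "$tmp/file2"
printf 'chrx\t150\t160\tMECP2\n'       > "$tmp/file1"

result=$(awk '
BEGIN { FS = "[\t_]"; OFS = "\t" }
# First file: remember each range under an upper-cased chromosome/gene key.
FNR == NR {
	k = toupper($1) SUBSEP toupper($4)
	n = ++c[k]
	lo[k, n] = $2 + 0
	hi[k, n] = $3 + 0
	next
}
# Second file: look the record up with the same upper-cased key.
{
	k = toupper($1) SUBSEP toupper($4)
	label = "no-match"
	for (i = 1; i <= c[k]; i++)
		if (lo[k, i] <= $2 + 0 && $2 + 0 <= hi[k, i]) { label = "exon"; break }
	print $1, $2, $4, label
}' "$tmp/file2" "$tmp/file1")

echo "$result"
rm -rf "$tmp"
```

Here chrx in the second file finds the chrX range from the first file because both keys are upper-cased before the lookup.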

The above code was written and tested on macOS Mojave (Version 10.14.1) using the Korn shell. It should work with any shell that uses Bourne shell syntax. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

Don Cragun

11-15-2018


Thank you very much for your help, I did not realize that there were extra spaces but was able to fix that. Again thank you very much

cmccabe


12-07-2018


If I try to use a for loop on the above script (which I called exon.sh) to define $file I get empty output.

Code:

for file in path/to/*.txt ; do
     bname=$(basename $file)
     pref=${bname%%_*.txt}
     bash /path/to/exon.sh static $file > path/to/${pref}_output.txt
done

In the for loop, static will never change; only the $file variable will (it is always a .txt file). If I hardcode the files to use as part of the script, then the desired output is achieved. I did not mention this in the original post because I thought the for loop could be used. However, I seem to be using it incorrectly. Thank you.

cmccabe


12-07-2018


Quote:

Originally Posted by cmccabe

If I try to use a for loop on the above script (which I called exon.sh) to define $file I get empty output.

Please sit down, we need to talk. More specifically, i need to give you "the talk" - and, no, it is not about flowers and bees.... ;-))

What you do might look to you like some "quick hacks" to make your life easier. In fact it is full-fledged software development, and you will never be successful in this endeavour if you do not apply the tenets and procedures of software development. You will never be a successful biologist if, instead of following established good lab practice, you do whatever comes to your mind. I take it you learned your trade as a researcher and acquired all these established best practices. It is now time you do the same with this aspect of your research.

Instead of giving you an answer i'd like to show you how to methodically apply procedures to hunt down a bug and find out the answer yourself. Even if your code is only a few lines long you apply the same techniques.

Quote:

Originally Posted by cmccabe

Code:

for file in path/to/*.txt ; do
     bname=$(basename $file)
     pref=${bname%%_*.txt}
     bash /path/to/exon.sh static $file > path/to/${pref}_output.txt
done

The first, obvious thing that comes to mind is this discrepancy:

Code:

bash /path/to/exon.sh static $file > path/to/${pref}_output.txt

whereas Don Cragun's script from post #1 reads:

Code:

#!/bin/ksh

Now, called this way, bash interprets the script itself and treats the #! line as a mere comment. The script should work in bash too, but it is best practice to minimize every possible source of problems. Because the script states its command processor anyway, you can change the line to:

Code:

/path/to/exon.sh static $file > path/to/${pref}_output.txt

which will perhaps not remedy the problem. Let us get on! The next thing is: if you get empty output you may have empty input. We can see how the involved programs are called and what they are told to do, but some uncertainty remains, namely the variable contents: we suppose them to be correct, but better than being "sure" about something is testing it, so let us test it. For this we change the script a little bit. We do that in single steps, the way walking is done: you take one step at a time, because if you try to take several steps at once you hop up and down but won't go anywhere.

The first thing to test is the for-loop itself. Does it produce all the files we want it to produce? And, while we are at it:

- does it produce all the files we want?
- does it produce files we do NOT want ("false positives")?
- does it not produce files we do want ("false negatives")?

Code:

for file in path/to/*.txt ; do
     echo "$file"
     # bname=$(basename $file)
     # pref=${bname%%_*.txt}
     # bash /path/to/exon.sh static $file > path/to/${pref}_output.txt
done
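One thing this loop test can reveal is a pattern that matches nothing. A small sketch, assuming default shell settings (no nullglob or similar option) and using a scratch directory created only for the demonstration:

```shell
tmp=$(mktemp -d)          # empty scratch directory, so the pattern matches nothing
count=0
for file in "$tmp"/*.txt; do
	# With no match, the shell typically hands the unexpanded pattern to the loop once.
	[ -e "$file" ] || continue   # skip anything that is not an existing file
	count=$((count + 1))
done
echo "existing files seen: $count"
rmdir "$tmp"
```

Without the -e guard, the loop body would run once with the literal pattern as the "file name".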

What did you find? Often software does not do what it is supposed to do because of small things we easily overlook: e.g. "path/to/*.txt" lacks a leading "/" and so is not an absolute path. I understand this is not your real path, but maybe you made the same typo (or a similar one) there as you did here. If the correct list is produced, this test makes sure that is not the case, and this part is then "provenly correct". Let us assume it is and get on. The next thing we test is the variable expansion:

Code:

for file in path/to/*.txt ; do
     echo "$file"
     bname=$(basename $file)
     pref=${bname%%_*.txt}
     echo "bname: \"$bname\"   pref:\"$pref\""
     # bash /path/to/exon.sh static $file > path/to/${pref}_output.txt
done

The first thing i notice is the missing quoting of the variables. Your code will break when a filename contains a space. A line like:

Code:

variable=something

should, if you are not absolutely 101% sure about what "something" is (and even then, because it doesn't hurt and you should do it habitually right) be quoted:

Code:

variable="something"
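Why the quoting matters most when a variable is used as a command argument can be seen in a short sketch. The file name is hypothetical, and the default IFS (space, tab, newline) is assumed:

```shell
file="my data.txt"        # hypothetical name containing a space

set -- $file              # unquoted: the shell splits the value at the space
unquoted=$#

set -- "$file"            # quoted: the value stays one word
quoted=$#

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

A command such as basename would receive two separate arguments in the unquoted case.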

Therefore:

Code:

for file in path/to/*.txt ; do
     echo "$file"
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     echo "bname: \"$bname\"   pref:\"$pref\""
     # bash /path/to/exon.sh static "$file" > "path/to/${pref}_output.txt"
done

Now, run that. Do the variables all contain the expected values? To be honest, i am suspicious that they do not, for some reason. But you now have the tools to find out - the first step in correcting it.
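To see what the ${bname%%_*.txt} expansion does, here is a small sketch with two made-up file names; prefix_of is a helper invented for this illustration. The %% operator removes the longest suffix matching the pattern _*.txt, so a name without an underscore is left unchanged, extension included.

```shell
# prefix_of is a hypothetical helper that applies the same expansion as the loop.
prefix_of() {
	printf '%s\n' "${1%%_*.txt}"
}

for bname in sample1_run2.txt sample1.txt; do   # hypothetical file names
	echo "bname: \"$bname\"   pref: \"$(prefix_of "$bname")\""
done
```

In the second case the output file in the loop would be named sample1.txt_output.txt, which may or may not be what is wanted.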

The last thing, if the variables do produce the correct values, is to test the command itself: instead of running it we just display it. Notice that we need to escape the redirection:

Code:

for file in path/to/*.txt ; do
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     echo /path/to/exon.sh static "$file" \> "path/to/${pref}_output.txt"
done

Now, this should produce a list of commands. Copy and paste one of them to another window and let it run. There might be some diagnostic message (very common are "file not found", "path does not exist" and similar ones, also an attempt to write to some write protected place, full disks, ....) in case the command is what fails.
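As an illustration of how such a failure can lead to an empty output file, here is a sketch with a deliberately missing input. The file names are hypothetical and the awk program is a trivial stand-in, not the script from this thread:

```shell
tmp=$(mktemp -d)
status=0
# "no_such_file" is a deliberately missing, hypothetical input.
awk '1' "$tmp/no_such_file" > "$tmp/out.txt" 2> "$tmp/err.txt" || status=$?

out_size=$(wc -c < "$tmp/out.txt" | tr -d ' ')
err_size=$(wc -c < "$tmp/err.txt" | tr -d ' ')
echo "exit status: $status, bytes on stdout: $out_size, bytes on stderr: $err_size"
rm -rf "$tmp"
```

The redirection creates the output file before awk even starts, so the file exists but stays empty, while the only hint about the cause is the message on stderr and the non-zero exit status.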

In one sentence: we took a complex procedure which didn't work as expected and tested one step after the other until we found the culprit. This is how every scientist works and this is how software developers work. Here is a bonus: you can switch "tracing mode" in the shell on and off so that every command is displayed (on stderr) before it is executed. Try this modification:

Code:

set -xv
for file in path/to/*.txt ; do
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     echo /path/to/exon.sh static "$file" \> "path/to/${pref}_output.txt"
done
set +xv

set -xv switches on the trace, set +xv switches it off again. You could also trace only certain parts:

Code:

for file in path/to/*.txt ; do
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     set -xv
     /path/to/exon.sh static "$file" > "path/to/${pref}_output.txt"
     set +xv
done

This works for Korn shell (ksh) and bash alike.
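Because the trace is written to stderr, it does not end up in a file that captures only stdout. A minimal sketch, assuming the default trace prefix (PS4) and using a child sh with set -x only for the demonstration:

```shell
tmp=$(mktemp -d)
# Run a traced command in a child shell; stdout and stderr go to separate files.
sh -c 'set -x; echo hello' > "$tmp/out.txt" 2> "$tmp/trace.txt"
out=$(cat "$tmp/out.txt")
trace=$(cat "$tmp/trace.txt")
echo "stdout: $out"
echo "trace:  $trace"
rm -rf "$tmp"
```

So tracing the loop will not pollute the ${pref}_output.txt files; the trace appears on the terminal, or wherever stderr is redirected.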

I hope this helps.

bakunin

