awk to change value in field according to another

12-11-2018

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

I added a couple parameters in the script to pass from the for loop. It does look like the correct files are being passed to the script, however the output is not correct. Commenting out the printf statements causes syntax errors

Code:

awk: cmd. line:9: 		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
awk: cmd. line:9: 		  ^ syntax error
awk: cmd. line:9: 		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
awk: cmd. line:9: 		                 ^ syntax error
awk: cmd. line:9: 		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
awk: cmd. line:9: 		                                       ^ syntax error
awk: cmd. line:10: 		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
awk: cmd. line:10: 		                 ^ syntax error
awk: cmd. line:10: 		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
awk: cmd. line:10: 		                                       ^ syntax error
awk: cmd. line:16: 			i, m[$1, $4, i],
awk: cmd. line:16: 			 ^ syntax error
awk: cmd. line:16: 			i, m[$1, $4, i],
awk: cmd. line:16: 			               ^ syntax error
awk: cmd. line:17: 			i, M[$1, $4, i],

I guess I am not understanding if the parameters are being passed from the for loop what else I am missing. I added the code that allows the operands to be controlled by the parameters passed by the for loop. Thank you

Code:

for file in /home/cmccabe/folder/less/*.txt ; do
       bname=$(basename "$file")
       pref=${bname%%_*.txt}
       #echo "file:\"$file\"    bname:\"$bname\"    pref:\"$pref\""
       #echo "output will be directed to:\"/home/cmccabe/folder/less/${pref}_output.txt\""
       bash -x /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" > /home/cmccabe/folder/less/${pref}_output.txt
done

exon.sh

Code:

#!/bin/sh
awk -v d=$# '
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
		}
	}
}
	if(i > c[$1, $4])
		$5 = "intron"
}
1' "$1" "$2"

using the bash -x

Code:

+ awk -v d=2 '
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
		}
	}
}
	if(i > c[$1, $4])
		$5 = "intron"
}
1' /home/cmccabe/folder/less/all_cdsV2 /home/cmccabe/folder/less/11-1111_regions.txt

output
00-0000_output.txt (only a few lines)

Code:

m[chr1,ADC,1]=33547850,M[chr1,ADC,1]=33547955
m[chr1,ADC,2]=33549554,M[chr1,ADC,2]=33549728
m[chr1,ADC,3]=33557650,M[chr1,ADC,3]=33557823

11-1111_output.txt

Code:

m[chr1,ADC,1]=33547850,M[chr1,ADC,1]=33547955
m[chr1,ADC,2]=33549554,M[chr1,ADC,2]=33549728
m[chr1,ADC,3]=33557650,M[chr1,ADC,3]=33557823

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

12-11-2018

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

The output you have shown us in post #22 in the two output files is debugging output from exon.sh reading data from /home/cmccabe/folder/less/all_cdsV2.

If that is all of the output that is stored in those two output files, one might be inclined to believe that /home/cmccabe/folder/less/00-0000_regions.txt and /home/cmccabe/folder/less/11-1111_regions.txt are empty files.

What output do you get from the command:

Code:

ls -l /home/cmccabe/folder/less/*.txt

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

12-11-2018

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

Code:

ls -l /home/cmccabe/folder/less/*.txt
-rw-rw-r-- 1 cmccabe cmccabe 32254 Dec  7 17:14 /home/cmccabe/folder/less/00-0000_regions.txt
-rw-rw-r-- 1 cmccabe cmccabe 32254 Dec  7 17:14 /home/cmccabe/folder/less/11-1111_regions.txt

Those are only a few lines from /home/cmccabe/folder/less/all_cdsV2, that file is several thousands of lines.

Both 00-0000_regions.txt and 11-1111_regions.txt have the format below (only a few lines but all are similar)

Code:

chrX	153579249	153579469 	FLNA	
chrX	153579888	153580108 	FLNA	
chrX	153579904	153580124 	FLNA

If the debugging lines are being printed then each file is being read but not processed? Thank you very much

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

12-11-2018

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by cmccabe

... ... ...

If the debugging lines are being printed then each file is being read but not processed? Thank you very much Smilie

No. Debugging lines are being printed (along with your desired output) because you set d to a non-zero, non-null value in exon.sh. The fact that you implied that the output files only contained three lines of debugging information and none of the output you were expecting tells me that you didn't bother to try to understand how the script I gave you works. That is very discouraging. I guess I'm wasting my time here.

What output do you get if you run the command:

Code:

grep -E 'splicing|intron|exon' /home/cmccabe/folder/less/00-0000_output.txt

What output were you hoping to have exon.sh put into the file named /home/cmccabe/folder/less/00-0000_output.txt?

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

12-11-2018

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

I have been trying to understand your code and am just not understanding, but I am trying, I know it may seem like I am not but I assure yu that I am and will continue to do so. I added comments to each line and some questions. My understanding is not there completely but hopefully its a start. I apologize for the misleading output description, thousands of lines print, i only showed a few to keep the post short. Thank you

Code:

#!/bin/sh
awk -v d=$# '   # does this define d as non-zero
BEGIN {	FS = "[\t_]"    # define FS as tab or underscore
	OFS = "\t"      # define output as tab delimited
}
FNR == NR {   # process same line in file 2 as file1 and start processing file2 or /home/cmccabe/folder/less/all_cdsV2
	m[$1, $4, ++c[$1, $4]] = $2 + 0    # split $4 and store $1 and $2in array m, what does the + 0 do?
	M[$1, $4, c[$1, $4]] = $3 + 0      # split $4 and store $1 and $2in array M, what does the + 0 do?
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n", # debugging print
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],  # for m (m=min)?
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])  # for M (M=max)?
	next   # process next line
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)  # not sure what this doesI think it prints each line in $file?
	for(i = 1; i <= c[$1, $4]; i++) {     # start a loop using $4 and $1 value
		#if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n", # again not sure?
			i, m[$1, $4, i],  # loop through each m in /home/cmccabe/folder/less/all_cdsV2 for each $1 and $4 of $file
			i, M[$1, $4, i],  # loop through each M in /home/cmccabe/folder/less/all_cdsV2 for each $1 and $4 of $file
			$2)  # not sure what this does?
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {  # if the value of each matching m<=$2 and <=M then print 
			$5 = "exon"  #  exon in $5
			break    # break loop and move to next line
		} else {if(m[$1, $4, i] > $2 + 0) { # if the value of each matching m>=$2 and >=M then print
				if(m[$1, $4, i] - 10 <= $2 + 0) {  # if the value of each matching -10and <=$2 then print
					$5 = "splicing"
					break     # break loop and move to next line
				} else {$5 = "intron"   # print intron in $5
					break  # break loop and move to next line
				}
		}
	}
}
	if(i > c[$1, $4])     # what does this do hasn't intron already been printed?
		$5 = "intron"
}
1' "$1" "$2"   # parameters passed from for loop

Code:

  (only a few lines of the thousands to show the desired output results from the grep)
grep -E 'splicing|intron|exon' /home/cmccabe/folder/less/00-0000_output.txt
chr7	30062272	30062492 	FKBP14	splicing
chr7	30065867	30066087 	FKBP14	intron
chr7	30065964	30066184 	FKBP14	exon
chr7	94024268	94024488 	COL1A2	intron

Last edited by cmccabe; 12-12-2018 at 07:26 AM..

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

12-11-2018

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

Quote:

Originally Posted by cmccabe

Code:

#!/bin/sh
awk -v d=$# '   # does this define d as non-zero

$# is the number of arguments a script actually got when it was invoked. Thry this script:

Code:

#! /bin/bash
echo number of arguments is: $#
exit 0

and try the following invocations, note their results:

Code:

./script.sh
./script.sh one
./script.sh one two
./script.sh one "two two" three
./script.sh ""

I hope this helps.

bakunin

This User Gave Thanks to bakunin For This Post:

bakunin

View Public Profile for bakunin

Find all posts by bakunin

12-14-2018

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Here is an updated and commented version of your exon.sh script...

Code:

#!/bin/sh
# exon.sh gene_regions_file coding_exons_file
#
# This script reads the gene_regions_file and the coding_exons_file and
# produces output that classifies each line in the coding_exons_file as "exon",
# "intron", or "splicing".
#
# In addition, this script optionally prints debugging information showing:
#  1.	the minimum and maximum values found for each line read from the
#	gene_regions_file along with the field 1 and 4 values found on the
#	corresponding line and the number of lines seen with those field 1 and
#	4 values,
#  2.	the line number and contents of each line read from the
#	coding_exons_file, and
#  3.	while classifying an entry in the coding_exons_file; print the minimum
#	and maximum values, the sequence number of the minimum and maximum
#	values for the current entry's field 1 and 4 value; and the field 2
#	value of the current entry.
#
# An earlier version of this script turned on debugging output if any arguments
# were given to the script when it was invoked (by using awk -v d=$#).  This
# worked then because the filenames used by that version of the script were
# constants and built into the script rather than passed in as operands (as in
# this version of the script).  Since this version of the script now requires
# two operands, the same results could be achieved by using
# awk -v d=$(($# > 2)).  But, instead of doing that, this version of the
# script allows the user to set the debugging flag to a non-zero, non-null
# value before a file pathname operand if debugging printouts for that file
# are desired:
#	No debugging output:
#		exon.sh gene_regions_file coding_exons_file
#	Enable debugging output for both input files:
#		exon.sh d=1 gene_regions_file coding_exons_file
#	Enable debugging output only for 2nd input file:
#		exon.sh gene_regions_file d=1 coding_exons_file
#	Enable debugging output only for 1st input file:
#		exon.sh d=1 gene_regions_file d= coding_exons_file
#		or:
#		exon.sh d=1 gene_regions_file d=0 coding_exons_file

IAm=${0##*/}	# Save final component of the pathname of this script for
		# diagnostic messages.
if [ $# -lt 2 ]
then	# Print a diagnostic usage message and exit if we don't have at least
	# two arguments (the required input files).  (As explained above,
	# additional operands to enable or disable debugging may be included.
	# Verification of there being two file operands and only appropiate
	# debugging operands is left as an exercise for the reader.  For now,
	# assume that diagnostics from awk will be sufficient if inappropriate
	# arguments are supplied.)
	echo "Usage: $IAm gene-regions_file coding_exons_file" >&2
	exit 1
fi
awk '
BEGIN {	FS = "[\t_]"	# Set input field separators to <tab> and <underscore>.
	OFS = "\t"	# Set output field separator to <tab>.
}
FNR == NR {
	# When the current file record # is the same as the current record #
	# from all files, we are looking at a record from the 1st input file...

	# First, increment the number of entries seen for this pair of field 1
	# & 4 values (c[$1, $4]) and then save the minimum value found in field
	# 2 in this entry.  Adding 0 converts an alphanumeric field to a
	# numeric field by discarding any trailing non-numeric characters from
	# the field.
	m[$1, $4, ++c[$1, $4]] = $2 + 0

	# And, then save the corresonding maximum value found in field 3 in
	# this entry.
	M[$1, $4, c[$1, $4]] = $3 + 0
	# Note that this code will not work if the min-max ranges in the 1st
	# input file are not in increasing order for each field 1 and 4 pair
	# (although the entries for a given pair do not have to be adjacent).

	# If debugging is enabled, print the minimum and maximum values saved
	# from this entry.
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])

	# Skip the remaining steps in this script for this input record and
	# continue with the next input record.
	next
}
{	# If we are here, we have a record from the 2nd input file.

	# If debugging is enabled, print the record number in this file and the
	# contents of this record.
	if(d) printf("FNR=%d:\"%s\"\n", FNR, $0)

	# Loop through all of the entries for the field 1 & 4 pair values found
	# in the current input record.
	for(i = 1; i <= c[$1, $4]; i++) {
		# If debugging is enabled, print the minimum and maximum values
		# for this entry in the saved m[] and M[] arrays along with the
		# field 2 value from the current input record.
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		# If the field 2 value in the current input record is in range
		# for the entry being evaluated in this loop, set the 5th field
		# in this input record to "exon" and break out of this loop.
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {
			# If the minimum field 2 value in the current input
			# record is greater than the numeric value of the field
			# 2 value in the current input record...
			if(m[$1, $4, i] > $2 + 0) {
				# If the field 2 value in the current input
				# record is within 10 of the low end of the
				# range for the entry being evaluated in this
				# iteration of the loop, set the 5th field in
				# this input record to "splicing" and break
				# out of this loop.
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {# Otherwise, set the 5th field in this
					# input record to "intron" and break
					# out of this loop.
					$5 = "intron"
					break
				}
			}
		}
	}
	if(i > c[$1, $4])
		# If we fell through the loop (instead of breaking out of it),
		# we need to set the 5th field in this input record to
		# "intron".
		$5 = "intron"
}
1	# Print the updated current input record (with a field 5 value added).
' "$@"	# Mark end of the awk script operand and add the operands provided to
	# this script as operands to awk.

Although written to use /bin/sh as its interpreter, this script will not work with a pure Bourne shell (it uses some shell parameter expansions that are defined by the POSIX shell standards that were not present in the Bourne shell). It has been tested and works with /bin/sh, /bin/bash, and /bin/ksh on macOS Mojave version 10.14.2 with the sample input files provided in this thread with all occurrences of chrx in file2 changed to chrX.

If you want to run this script on a Solaris/SunOS system, change awk in this script to /usr/xpg4/bin/awk or nawk.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

Shell Programming and Scripting

awk to change value in field according to another

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk to change contents of field based on condition in same file

Discussion started by: cmccabe

2. Shell Programming and Scripting

awk to change value of field using multiple conditions

Discussion started by: cmccabe

3. Shell Programming and Scripting

awk :how to change delimiter without giving all field name

Discussion started by: Lakshman_Gupta

4. UNIX for Dummies Questions & Answers

change field separator only from nth field until NF

Discussion started by: beca123456

5. Shell Programming and Scripting

awk or sed? change field conditional on key match

Discussion started by: RascalHoudi

6. Shell Programming and Scripting

AWK: Pattern match between 2 files, then compare a field in file1 as > or < field in file2

Discussion started by: right_coaster

7. Shell Programming and Scripting

awk, comma as field separator and text inside double quotes as a field.

Discussion started by: kevintse

8. Shell Programming and Scripting

awk,cut fields by change field format

Discussion started by: jimmy_y

9. Shell Programming and Scripting

dynamically change awk Field Separator FS

Discussion started by: satnamx

10. Shell Programming and Scripting

change field content awk

Discussion started by: littleboyblu