awk to change value in field according to another


 
Post #28, 12-14-2018
Here is an updated and commented version of your exon.sh script...
Code:
#!/bin/sh
# exon.sh gene_regions_file coding_exons_file
#
# This script reads the gene_regions_file and the coding_exons_file and
# produces output that classifies each line in the coding_exons_file as "exon",
# "intron", or "splicing".
#
# In addition, this script optionally prints debugging information showing:
#  1.	the minimum and maximum values found for each line read from the
#	gene_regions_file along with the field 1 and 4 values found on the
#	corresponding line and the number of lines seen with those field 1 and
#	4 values,
#  2.	the line number and contents of each line read from the
#	coding_exons_file, and
#  3.	while classifying an entry in the coding_exons_file, the sequence
#	number and the minimum and maximum values of each saved range examined
#	for the current entry's field 1 and 4 values, and the field 2 value of
#	the current entry.
#
# An earlier version of this script turned on debugging output if any arguments
# were given to the script when it was invoked (by using awk -v d=$#).  This
# worked then because the filenames used by that version of the script were
# constants and built into the script rather than passed in as operands (as in
# this version of the script).  Since this version of the script now requires
# two operands, the same results could be achieved by using
# awk -v d=$(($# > 2)).  But, instead of doing that, this version of the
# script allows the user to set the debugging flag to a non-zero, non-null
# value before a file pathname operand if debugging printouts for that file
# are desired:
#	No debugging output:
#		exon.sh gene_regions_file coding_exons_file
#	Enable debugging output for both input files:
#		exon.sh d=1 gene_regions_file coding_exons_file
#	Enable debugging output only for 2nd input file:
#		exon.sh gene_regions_file d=1 coding_exons_file
#	Enable debugging output only for 1st input file:
#		exon.sh d=1 gene_regions_file d= coding_exons_file
#		or:
#		exon.sh d=1 gene_regions_file d=0 coding_exons_file

IAm=${0##*/}	# Save final component of the pathname of this script for
		# diagnostic messages.
if [ $# -lt 2 ]
then	# Print a diagnostic usage message and exit if we don't have at least
	# two arguments (the required input files).  (As explained above,
	# additional operands to enable or disable debugging may be included.
	# Verification of there being two file operands and only appropriate
	# debugging operands is left as an exercise for the reader.  For now,
	# assume that diagnostics from awk will be sufficient if inappropriate
	# arguments are supplied.)
	echo "Usage: $IAm gene_regions_file coding_exons_file" >&2
	exit 1
fi
awk '
BEGIN {	FS = "[\t_]"	# Set input field separators to <tab> and <underscore>.
	OFS = "\t"	# Set output field separator to <tab>.
}
FNR == NR {
	# When the current file record # is the same as the current record #
	# from all files, we are looking at a record from the 1st input file...

	# First, increment the number of entries seen for this pair of field 1
	# & 4 values (c[$1, $4]) and then save the minimum value found in field
	# 2 in this entry.  Adding 0 converts an alphanumeric field to a
	# numeric field by discarding any trailing non-numeric characters from
	# the field.
	m[$1, $4, ++c[$1, $4]] = $2 + 0

	# And then save the corresponding maximum value found in field 3 in
	# this entry.
	M[$1, $4, c[$1, $4]] = $3 + 0
	# Note that this code will not work if the min-max ranges in the 1st
	# input file are not in increasing order for each field 1 and 4 pair
	# (although the entries for a given pair do not have to be adjacent).

	# If debugging is enabled, print the minimum and maximum values saved
	# from this entry.
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])

	# Skip the remaining steps in this script for this input record and
	# continue with the next input record.
	next
}
{	# If we are here, we have a record from the 2nd input file.

	# If debugging is enabled, print the record number in this file and the
	# contents of this record.
	if(d) printf("FNR=%d:\"%s\"\n", FNR, $0)

	# Loop through all of the entries for the field 1 & 4 pair values found
	# in the current input record.
	for(i = 1; i <= c[$1, $4]; i++) {
		# If debugging is enabled, print the minimum and maximum values
		# for this entry in the saved m[] and M[] arrays along with the
		# field 2 value from the current input record.
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		# If the field 2 value in the current input record is in range
		# for the entry being evaluated in this loop, set the 5th field
		# in this input record to "exon" and break out of this loop.
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {
			# If the minimum value of the range being evaluated in
			# this iteration of the loop is greater than the
			# numeric value of field 2 in the current input
			# record...
			if(m[$1, $4, i] > $2 + 0) {
				# If the field 2 value in the current input
				# record is within 10 of the low end of the
				# range for the entry being evaluated in this
				# iteration of the loop, set the 5th field in
				# this input record to "splicing" and break
				# out of this loop.
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {# Otherwise, set the 5th field in this
					# input record to "intron" and break
					# out of this loop.
					$5 = "intron"
					break
				}
			}
		}
	}
	if(i > c[$1, $4])
		# If we fell through the loop (instead of breaking out of it),
		# we need to set the 5th field in this input record to
		# "intron".
		$5 = "intron"
}
1	# Print the updated current input record (with a field 5 value added).
' "$@"	# Mark end of the awk script operand and add the operands provided to
	# this script as operands to awk.
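
The d=1 operands described in the script's comments rely on the way awk handles a var=value operand: the assignment takes effect when awk reaches that operand, just before it opens the next file. Here is a minimal sketch of that behavior on its own. The temporary files, the show_flag function, and the demo_output variable are made up for this illustration and are not part of exon.sh; mktemp -d is assumed to be available.

```shell
#!/bin/sh
# Hypothetical demo (not part of exon.sh): awk applies a var=value operand
# only when it reaches that operand while walking its argument list.
tmpdir=$(mktemp -d) || exit 1
printf 'a\nb\n' > "$tmpdir/first"	# stands in for the 1st input file
printf 'c\n' > "$tmpdir/second"		# stands in for the 2nd input file

show_flag() {
	# Print each record followed by the value of d seen for that record.
	awk '{ print $0 " d=" d }' "$@"
}

# d is unset while "first" is read and is 1 while "second" is read.
demo_output=$(show_flag "$tmpdir/first" d=1 "$tmpdir/second")
printf '%s\n' "$demo_output"
# Prints:
#	a d=
#	b d=
#	c d=1

rm -rf "$tmpdir"
```

This is why exon.sh gene_regions_file d=1 coding_exons_file enables debugging output only for the 2nd input file, and why d= or d=0 before the 2nd file turns it back off.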

Although this script names /bin/sh as its interpreter, it will not work with a pure Bourne shell: it uses the ${0##*/} parameter expansion, which is defined by the POSIX shell standard but was not present in the Bourne shell. It has been tested with /bin/sh, /bin/bash, and /bin/ksh on macOS Mojave version 10.14.2, using the sample input files provided in this thread with all occurrences of chrx in file2 changed to chrX.
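
For anyone stuck with a pre-POSIX shell, the expansion used to set IAm can be replaced with the basename utility. A small sketch comparing the two, where the pathname and the variable names path, name, and name_via_basename are hypothetical and chosen only for this example:

```shell
#!/bin/sh
# Hypothetical pathname used only for illustration.
path=/usr/local/bin/exon.sh

# POSIX parameter expansion: remove the longest leading match of "*/",
# which leaves the final pathname component.  This is what exon.sh uses.
name=${path##*/}

# Alternative for a pure Bourne shell: run the external basename utility
# inside backquotes (the Bourne shell has neither ${var##pattern} nor $(...)).
name_via_basename=`basename "$path"`

printf '%s\n%s\n' "$name" "$name_via_basename"
# Prints exon.sh twice.
```

The parameter expansion avoids starting an extra process, which is probably why it is preferred when a POSIX shell can be assumed.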

If you want to run this script on a Solaris/SunOS system, change awk in this script to /usr/xpg4/bin/awk or nawk.
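
If the same copy of the script has to run on both Solaris and other systems, one possible approach is to pick the awk command at run time instead of editing the script. This is only a sketch: the AWK variable is a name invented here, and the awk call in exon.sh would have to be changed to "$AWK" for it to have any effect.

```shell
#!/bin/sh
# Sketch: choose which awk to run based on the operating system name.
# AWK is a hypothetical variable; exon.sh itself calls awk directly.
case `uname -s` in
SunOS)	# Prefer the standards-conforming awk on Solaris/SunOS; fall back
	# to nawk if /usr/xpg4/bin/awk is not installed.
	if [ -x /usr/xpg4/bin/awk ]
	then	AWK=/usr/xpg4/bin/awk
	else	AWK=nawk
	fi;;
*)	AWK=awk;;
esac

# Quick check that the chosen command runs.
"$AWK" 'BEGIN { print "awk selected" }'
```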
This User Gave Thanks to Don Cragun For This Post:
 