awk to change value in field according to another


 
# 22  
Old 12-11-2018
I added a couple of parameters to the script so the files are passed in from the for loop. The correct files do appear to be passed to the script, but the output is not correct. Commenting out the printf statements causes syntax errors:
Code:
awk: cmd. line:9: 		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
awk: cmd. line:9: 		  ^ syntax error
awk: cmd. line:9: 		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
awk: cmd. line:9: 		                 ^ syntax error
awk: cmd. line:9: 		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
awk: cmd. line:9: 		                                       ^ syntax error
awk: cmd. line:10: 		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
awk: cmd. line:10: 		                 ^ syntax error
awk: cmd. line:10: 		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
awk: cmd. line:10: 		                                       ^ syntax error
awk: cmd. line:16: 			i, m[$1, $4, i],
awk: cmd. line:16: 			 ^ syntax error
awk: cmd. line:16: 			i, m[$1, $4, i],
awk: cmd. line:16: 			               ^ syntax error
awk: cmd. line:17: 			i, M[$1, $4, i],

If the parameters are being passed from the for loop, I guess I don't understand what else I am missing. I added the code that allows the operands to be controlled by the parameters passed by the for loop. Thank you :).

Code:
for file in /home/cmccabe/folder/less/*.txt ; do
       bname=$(basename "$file")
       pref=${bname%%_*.txt}
       #echo "file:\"$file\"    bname:\"$bname\"    pref:\"$pref\""
       #echo "output will be directed to:\"/home/cmccabe/folder/less/${pref}_output.txt\""
       bash -x /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" > /home/cmccabe/folder/less/${pref}_output.txt
done

exon.sh
Code:
#!/bin/sh
awk -v d=$# '
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
		}
	}
}
	if(i > c[$1, $4])
		$5 = "intron"
}
1' "$1" "$2"

Trace from running it with bash -x:

Code:
+ awk -v d=2 '
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
		}
	}
}
	if(i > c[$1, $4])
		$5 = "intron"
}
1' /home/cmccabe/folder/less/all_cdsV2 /home/cmccabe/folder/less/11-1111_regions.txt

output
00-0000_output.txt (only a few lines)
Code:
m[chr1,ADC,1]=33547850,M[chr1,ADC,1]=33547955
m[chr1,ADC,2]=33549554,M[chr1,ADC,2]=33549728
m[chr1,ADC,3]=33557650,M[chr1,ADC,3]=33557823

11-1111_output.txt
Code:
m[chr1,ADC,1]=33547850,M[chr1,ADC,1]=33547955
m[chr1,ADC,2]=33549554,M[chr1,ADC,2]=33549728
m[chr1,ADC,3]=33557650,M[chr1,ADC,3]=33557823

# 23  
Old 12-11-2018
The output you have shown us in post #22 in the two output files is debugging output from exon.sh reading data from /home/cmccabe/folder/less/all_cdsV2.

If that is all of the output that is stored in those two output files, one might be inclined to believe that /home/cmccabe/folder/less/00-0000_regions.txt and /home/cmccabe/folder/less/11-1111_regions.txt are empty files.

What output do you get from the command:
Code:
ls -l /home/cmccabe/folder/less/*.txt

This User Gave Thanks to Don Cragun For This Post:
# 24  
Old 12-11-2018
Code:
ls -l /home/cmccabe/folder/less/*.txt
-rw-rw-r-- 1 cmccabe cmccabe 32254 Dec  7 17:14 /home/cmccabe/folder/less/00-0000_regions.txt
-rw-rw-r-- 1 cmccabe cmccabe 32254 Dec  7 17:14 /home/cmccabe/folder/less/11-1111_regions.txt

Those are only a few lines from /home/cmccabe/folder/less/all_cdsV2; that file is several thousand lines long.

Both 00-0000_regions.txt and 11-1111_regions.txt have the format below (only a few lines but all are similar)

Code:
chrX	153579249	153579469 	FLNA	
chrX	153579888	153580108 	FLNA	
chrX	153579904	153580124 	FLNA

If the debugging lines are being printed, then is each file being read but not processed? Thank you very much :).
# 25  
Old 12-11-2018
Quote:
Originally Posted by cmccabe
... ... ...

If the debugging lines are being printed then each file is being read but not processed? Thank you very much :).
No. Debugging lines are being printed (along with your desired output) because you set d to a non-zero, non-null value in exon.sh. The fact that you implied that the output files only contained three lines of debugging information and none of the output you were expecting tells me that you didn't bother to try to understand how the script I gave you works. That is very discouraging. I guess I'm wasting my time here.
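
To see what "non-zero, non-null" means for d in practice, here is a minimal sketch. The helper name debug_state is hypothetical and not part of exon.sh; it only reports whether awk treats a given value of d as true, which is what each `if(d)` test in the script depends on.

```shell
# Hypothetical helper (not part of exon.sh): report whether awk treats the
# value assigned to d with -v as true or false.
debug_state() {
    awk -v d="$1" 'BEGIN { if (d) print "debug on"; else print "debug off" }'
}

debug_state 2     # prints: debug on   (d=$# with two file operands gives d=2)
debug_state 0     # prints: debug off
debug_state ""    # prints: debug off
```

Since exon.sh is now always called with two operands, `-v d=$#` always sets d to 2, so the debugging printf statements always run.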

What output do you get if you run the command:
Code:
grep -E 'splicing|intron|exon' /home/cmccabe/folder/less/00-0000_output.txt

What output were you hoping to have exon.sh put into the file named /home/cmccabe/folder/less/00-0000_output.txt?
This User Gave Thanks to Don Cragun For This Post:
# 26  
Old 12-11-2018
I have been trying to understand your code and I am just not getting it yet. I know it may seem like I am not trying, but I assure you that I am and will continue to do so. I added comments to each line along with some questions. My understanding is not complete, but hopefully it's a start. I apologize for the misleading output description: thousands of lines print, and I only showed a few to keep the post short. Thank you :).

Code:
#!/bin/sh
awk -v d=$# '   # does this define d as non-zero
BEGIN {	FS = "[\t_]"    # define FS as tab or underscore
	OFS = "\t"      # define output as tab delimited
}
FNR == NR {   # process same line in file 2 as file1 and start processing file2 or /home/cmccabe/folder/less/all_cdsV2
	m[$1, $4, ++c[$1, $4]] = $2 + 0    # split $4 and store $1 and $2 in array m, what does the + 0 do?
	M[$1, $4, c[$1, $4]] = $3 + 0      # split $4 and store $1 and $2 in array M, what does the + 0 do?
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n", # debugging print
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],  # for m (m=min)?
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])  # for M (M=max)?
	next   # process next line
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)  # not sure what this does; I think it prints each line in $file?
	for(i = 1; i <= c[$1, $4]; i++) {     # start a loop using $4 and $1 value
		#if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n", # again not sure?
			i, m[$1, $4, i],  # loop through each m in /home/cmccabe/folder/less/all_cdsV2 for each $1 and $4 of $file
			i, M[$1, $4, i],  # loop through each M in /home/cmccabe/folder/less/all_cdsV2 for each $1 and $4 of $file
			$2)  # not sure what this does?
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {  # if the value of each matching m<=$2 and <=M then print 
			$5 = "exon"  #  exon in $5
			break    # break loop and move to next line
		} else {if(m[$1, $4, i] > $2 + 0) { # if the value of each matching m>=$2 and >=M then print
				if(m[$1, $4, i] - 10 <= $2 + 0) {  # if the value of each matching -10and <=$2 then print
					$5 = "splicing"
					break     # break loop and move to next line
				} else {$5 = "intron"   # print intron in $5
					break  # break loop and move to next line
				}
		}
	}
}
	if(i > c[$1, $4])     # what does this do hasn't intron already been printed?
		$5 = "intron"
}
1' "$1" "$2"   # parameters passed from for loop

Code:
  (only a few lines of the thousands to show the desired output results from the grep)
grep -E 'splicing|intron|exon' /home/cmccabe/folder/less/00-0000_output.txt
chr7	30062272	30062492 	FKBP14	splicing
chr7	30065867	30066087 	FKBP14	intron
chr7	30065964	30066184 	FKBP14	exon
chr7	94024268	94024488 	COL1A2	intron


Last edited by cmccabe; 12-12-2018 at 07:26 AM..
# 27  
Old 12-11-2018
Quote:
Originally Posted by cmccabe
Code:
#!/bin/sh
awk -v d=$# '   # does this define d as non-zero

$# is the number of arguments a script actually got when it was invoked. Try this script:

Code:
#! /bin/bash
echo number of arguments is: $#
exit 0

and try the following invocations, noting their results:

Code:
./script.sh
./script.sh one
./script.sh one two
./script.sh one "two two" three
./script.sh ""

I hope this helps.

bakunin
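
The same experiment can be run without creating a separate script file. This is a minimal sketch that assumes a shell function is an acceptable stand-in for script.sh: inside a function, $# is the number of arguments passed to that function. The name count_args is hypothetical.

```shell
# count_args is a hypothetical stand-in for script.sh.  Inside a function,
# $# counts the arguments given to the function, just as it counts the
# arguments given to a script.
count_args() {
    echo "number of arguments is: $#"
}

count_args                          # number of arguments is: 0
count_args one                      # number of arguments is: 1
count_args one two                  # number of arguments is: 2
count_args one "two two" three      # number of arguments is: 3
count_args ""                       # number of arguments is: 1
```

Note that a quoted string containing a space counts as one argument, and so does an empty quoted string.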
This User Gave Thanks to bakunin For This Post:
# 28  
Old 12-14-2018
Here is an updated and commented version of your exon.sh script...
Code:
#!/bin/sh
# exon.sh gene_regions_file coding_exons_file
#
# This script reads the gene_regions_file and the coding_exons_file and
# produces output that classifies each line in the coding_exons_file as "exon",
# "intron", or "splicing".
#
# In addition, this script optionally prints debugging information showing:
#  1.	the minimum and maximum values found for each line read from the
#	gene_regions_file along with the field 1 and 4 values found on the
#	corresponding line and the number of lines seen with those field 1 and
#	4 values,
#  2.	the line number and contents of each line read from the
#	coding_exons_file, and
#  3.	while classifying an entry in the coding_exons_file, the sequence
#	number and the minimum and maximum values of each saved range checked
#	for the current entry's field 1 and 4 values, and the field 2 value
#	of the current entry.
#
# An earlier version of this script turned on debugging output if any arguments
# were given to the script when it was invoked (by using awk -v d=$#).  This
# worked then because the filenames used by that version of the script were
# constants and built into the script rather than passed in as operands (as in
# this version of the script).  Since this version of the script now requires
# two operands, the same results could be achieved by using
# awk -v d=$(($# > 2)).  But, instead of doing that, this version of the
# script allows the user to set the debugging flag to a non-zero, non-null
# value before a file pathname operand if debugging printouts for that file
# are desired:
#	No debugging output:
#		exon.sh gene_regions_file coding_exons_file
#	Enable debugging output for both input files:
#		exon.sh d=1 gene_regions_file coding_exons_file
#	Enable debugging output only for 2nd input file:
#		exon.sh gene_regions_file d=1 coding_exons_file
#	Enable debugging output only for 1st input file:
#		exon.sh d=1 gene_regions_file d= coding_exons_file
#		or:
#		exon.sh d=1 gene_regions_file d=0 coding_exons_file

IAm=${0##*/}	# Save final component of the pathname of this script for
		# diagnostic messages.
if [ $# -lt 2 ]
then	# Print a diagnostic usage message and exit if we don't have at least
	# two arguments (the required input files).  (As explained above,
	# additional operands to enable or disable debugging may be included.
	# Verification of there being two file operands and only appropriate
	# debugging operands is left as an exercise for the reader.  For now,
	# assume that diagnostics from awk will be sufficient if inappropriate
	# arguments are supplied.)
	echo "Usage: $IAm gene_regions_file coding_exons_file" >&2
	exit 1
fi
awk '
BEGIN {	FS = "[\t_]"	# Set input field separators to <tab> and <underscore>.
	OFS = "\t"	# Set output field separator to <tab>.
}
FNR == NR {
	# When the current file record # is the same as the current record #
	# from all files, we are looking at a record from the 1st input file...

	# First, increment the number of entries seen for this pair of field 1
	# & 4 values (c[$1, $4]) and then save the minimum value found in field
	# 2 in this entry.  Adding 0 converts an alphanumeric field to a
	# numeric field by discarding any trailing non-numeric characters from
	# the field.
	m[$1, $4, ++c[$1, $4]] = $2 + 0

	# And, then save the corresponding maximum value found in field 3 in
	# this entry.
	M[$1, $4, c[$1, $4]] = $3 + 0
	# Note that this code will not work if the min-max ranges in the 1st
	# input file are not in increasing order for each field 1 and 4 pair
	# (although the entries for a given pair do not have to be adjacent).

	# If debugging is enabled, print the minimum and maximum values saved
	# from this entry.
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])

	# Skip the remaining steps in this script for this input record and
	# continue with the next input record.
	next
}
{	# If we are here, we have a record from the 2nd input file.

	# If debugging is enabled, print the record number in this file and the
	# contents of this record.
	if(d) printf("FNR=%d:\"%s\"\n", FNR, $0)

	# Loop through all of the entries for the field 1 & 4 pair values found
	# in the current input record.
	for(i = 1; i <= c[$1, $4]; i++) {
		# If debugging is enabled, print the minimum and maximum values
		# for this entry in the saved m[] and M[] arrays along with the
		# field 2 value from the current input record.
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		# If the field 2 value in the current input record is in range
		# for the entry being evaluated in this loop, set the 5th field
		# in this input record to "exon" and break out of this loop.
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {
			# If the minimum value of the entry being evaluated in
			# this loop is greater than the numeric value of the
			# field 2 value in the current input record...
			if(m[$1, $4, i] > $2 + 0) {
				# If the field 2 value in the current input
				# record is within 10 of the low end of the
				# range for the entry being evaluated in this
				# iteration of the loop, set the 5th field in
				# this input record to "splicing" and break
				# out of this loop.
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {# Otherwise, set the 5th field in this
					# input record to "intron" and break
					# out of this loop.
					$5 = "intron"
					break
				}
			}
		}
	}
	if(i > c[$1, $4])
		# If we fell through the loop (instead of breaking out of it),
		# we need to set the 5th field in this input record to
		# "intron".
		$5 = "intron"
}
1	# Print the updated current input record (with a field 5 value added).
' "$@"	# Mark end of the awk script operand and add the operands provided to
	# this script as operands to awk.
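
The d=1 operands described in the usage comments rely on awk evaluating each var=value operand at the point where it reaches it in the operand list. The following is a minimal sketch of that behavior, not part of exon.sh: the function name demo_operands and the file names first and second are hypothetical, and mktemp -d is assumed to be available.

```shell
# Hypothetical demo: show which input files see a true d when d=... operands
# are placed before the file operands, as in the usage comments above.
demo_operands() (
    dir=$(mktemp -d) || exit 1      # scratch directory for two tiny files
    cd "$dir" || exit 1
    printf 'a\n' > first
    printf 'b\n' > second
    # awk processes each var=value operand when it reaches it, so d can
    # differ from one input file to the next.
    awk '{ print FILENAME ": d=" (d ? "on" : "off") }' \
        d="$1" first d="$2" second
    cd / && rm -rf "$dir"
)

demo_operands 1 0     # first: d=on, second: d=off
demo_operands "" 1    # first: d=off, second: d=on
```

This is why `exon.sh d=1 gene_regions_file d=0 coding_exons_file` enables debugging output only while the first file is being read.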

Although written to use /bin/sh as its interpreter, this script will not work with a pure Bourne shell (it uses some shell parameter expansions that are defined by the POSIX shell standards that were not present in the Bourne shell). It has been tested and works with /bin/sh, /bin/bash, and /bin/ksh on macOS Mojave version 10.14.2 with the sample input files provided in this thread with all occurrences of chrx in file2 changed to chrX.
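
For reference, here is a minimal sketch of the two POSIX parameter expansions that appear in this thread: `${0##*/}` in the script above and `${bname%%_*.txt}` in the calling for loop. The helper names script_name and output_prefix are hypothetical and exist only for illustration.

```shell
# Hypothetical helpers mirroring two expansions used in this thread.
script_name() {
    printf '%s\n' "${1##*/}"        # remove the longest prefix matching */
}
output_prefix() {
    printf '%s\n' "${1%%_*.txt}"    # remove the longest suffix matching _*.txt
}

script_name /home/cmccabe/folder/less/exon.sh    # prints exon.sh
output_prefix 00-0000_regions.txt                # prints 00-0000
```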

If you want to run this script on a Solaris/SunOS system, change awk in this script to /usr/xpg4/bin/awk or nawk.
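
If the same script has to run on both Solaris and other systems, one possible approach is to choose the awk command at run time instead of editing the script. This is only a sketch under assumptions: the helper name pick_awk is hypothetical, and testing `uname -s` for SunOS is one way to detect Solaris that is not taken from the script above.

```shell
# Hypothetical helper: choose an awk suitable for the script.  The
# /usr/xpg4/bin/awk path and the nawk name come from the post above; the
# uname check and the fallback order are assumptions.
pick_awk() {
    case $(uname -s) in
    SunOS)
        if [ -x /usr/xpg4/bin/awk ]; then
            echo /usr/xpg4/bin/awk
        else
            echo nawk
        fi ;;
    *)  echo awk ;;
    esac
}

AWK=$(pick_awk)
"$AWK" 'BEGIN { print "awk command works" }'
```

The script could then invoke "$AWK" wherever it currently invokes awk.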
This User Gave Thanks to Don Cragun For This Post: