Hi redse171,
Thanks of rthe update. That gives us a better idea of what you are trying to do. Although the awk script you have shown us will not produce the output you showed us for the sample input you provided. (Your awk script doesn't copy the CD field to the output.)
I haven't dug into all of the details again yet, but I think that if we get answers to the following, we'll be able to help you write a script that will work:
Do you want the output to contain the "CD" field from the input?
Will all lines with the same combination of $5, $6, and $7 values be on contiguous lines in your input file? (The answer to this is "yes" for your sample input. Does it hold true for your real, huge input files?)
If the answer to #2 is no, does the order of lines in your output file matter?
This User Gave Thanks to Don Cragun For This Post:
Answers to Don Cragun's above question may kill the assumptions on which this is based. Try
Code:
awk '$2 != "CD" {next} # not a "CD" line -> no action
!($7 in LINE) {LINE[$7]=$0} # new $7? Keep line with first occurrence of $3/$4 in memory
{CNT[$7]++; E3[$7]=$3; E4[$7]=$4} # count $7 lines and keep last $3 and $4
END {for (i in LINE) if (CNT[i]>=2) { # for the lines recorded, if count = 1: discard
match (LINE[i],"[0-9]*\t[0-9]*\t[+-]") # search for $3 $4 +- pattern (you can use constants here if
# sure the file structure remains identical all over)
if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") { # take decision on + or -
POS=RSTART # where to replace
STR=E3[i]} # what to put in
else {POS=RSTART+8
STR=E4[i]}
print substr (LINE[i], 1, POS-2), # print first part of line, dep. on sign
STR, # replacement string
substr (LINE[i], POS+8) # last part
}
}
' FS="\t" OFS="\t" file
SP12.3 CD 2249762 2252075 - ID1_N006 ID2_N006T1
SP12.5 CD 3001307 3005025 + ID1_N01140 ID2_N01140T0
SP12.3 CD 2249762 2253117 - ID1_N006 ID2_N006T0
SP12.3 CD 2240806 2241681 + ID1_N003 ID2_N003T0
Hi redse171,
Thanks of rthe update. That gives us a better idea of what you are trying to do. Although the awk script you have shown us will not produce the output you showed us for the sample input you provided. (Your awk script doesn't copy the CD field to the output.)
I haven't dug into all of the details again yet, but I think that if we get answers to the following, we'll be able to help you write a script that will work:
Do you want the output to contain the "CD" field from the input?
Will all lines with the same combination of $5, $6, and $7 values be on contiguous lines in your input file? (The answer to this is "yes" for your sample input. Does it hold true for your real, huge input files?)
If the answer to #2 is no, does the order of lines in your output file matter?
Hi Don Crugan,
To answer your questions:-
1. Yes, i need to have "CD" field in my output file as shown in my sample output
2. Yes for my huge input files
thanks.
---------- Post updated at 10:12 AM ---------- Previous update was at 10:07 AM ----------
Quote:
Originally Posted by RudiC
Answers to Don Cragun's above question may kill the assumptions on which this is based. Try
Code:
awk '$2 != "CD" {next} # not a "CD" line -> no action
!($7 in LINE) {LINE[$7]=$0} # new $7? Keep line with first occurrence of $3/$4 in memory
{CNT[$7]++; E3[$7]=$3; E4[$7]=$4} # count $7 lines and keep last $3 and $4
END {for (i in LINE) if (CNT[i]>=2) { # for the lines recorded, if count = 1: discard
match (LINE[i],"[0-9]*\t[0-9]*\t[+-]") # search for $3 $4 +- pattern (you can use constants here if
# sure the file structure remains identical all over)
if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") { # take decision on + or -
POS=RSTART # where to replace
STR=E3[i]} # what to put in
else {POS=RSTART+8
STR=E4[i]}
print substr (LINE[i], 1, POS-2), # print first part of line, dep. on sign
STR, # replacement string
substr (LINE[i], POS+8) # last part
}
}
' FS="\t" OFS="\t" file
SP12.3 CD 2249762 2252075 - ID1_N006 ID2_N006T1
SP12.5 CD 3001307 3005025 + ID1_N01140 ID2_N01140T0
SP12.3 CD 2249762 2253117 - ID1_N006 ID2_N006T0
SP12.3 CD 2240806 2241681 + ID1_N003 ID2_N003T0
Hi RudiC,
Tried your codes and thanks so much for your explanations. It seems working for my real input file except that there are few lines a little bit weird. I am checking on it now and try play around with your codes. Will give the feedback asap. Thanks
---------- Post updated at 09:25 PM ---------- Previous update was at 10:12 AM ----------
Hi,
just to give feedback. The codes by RudiC is modified to suit my real data. The codes worked well with the sample data but there was an issue with the number and position of digits (values) in $3 and $4 in my real huge file. So, i split the LINE into segments and take the value from the segments (info from awk manual). Thanks to RudiC for the codes and explanations that help me to understand better. Below is the codes that being modified and i got the results that i wanted.
Code:
awk '$2 != "CD" {next}
!($7 in LINE) {LINE[$7]=$0}
{CNT[$7]++; E3[$7]=$3; E4[$7]=$4}
END {for (i in LINE) if (CNT[i]>=2) {
match (LINE[i],"[0-9]*\t[0-9]*\t[+-]")
if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") {
POS=RSTART
STR=E3[i]
split(LINE[i], seg, "\t")
print seg[1], seg[2],
STR,
seg[4], seg[5], seg[6], seg[7]
}
else {POS=RSTART+7
STR=E4[i]
split(LINE[i], seg, "\t")
print seg[1], seg[2], seg[3],
STR,
seg[5], seg[6], seg[7]
}
}
}
' FS="\t" OFS="\t" File1
My first code was not informative enough as i don't have any idea how to find the min and max from my input file and what i gave was just to extract all line with CD patterns. The help that i got here is awesome and help me to learn and understand better. thanks a lot! .
Hi redse171,
I'm very glad that RudiC was able to help you find a solution to your problem. Note that if you need to use split() to correctly group your fields, you don't need to also use match() and substr() to determine whether you have a + or - in field 5 (you can just look directly at seg[5]) after you call split(). You can then simplify your code to something like:
Code:
awk '$2 != "CD" {next}
!($7 in LINE) {LINE[$7]=$0}
{CNT[$7]++; E3[$7]=$3; E4[$7]=$4}
END {for (i in LINE) if (CNT[i]>=2) {
split(LINE[i], seg)
if (seg[5] == "-") {
print seg[1], seg[2], E3[i],
seg[4], seg[5], seg[6], seg[7]
} else {
print seg[1], seg[2], seg[3],
E4[i], seg[5], seg[6], seg[7]
}
}
}
' FS="\t" OFS="\t" File1
Hi redse171,
I'm very glad that RudiC was able to help you find a solution to your problem. Note that if you need to use split() to correctly group your fields, you don't need to also use match() and substr() to determine whether you have a + or - in field 5 (you can just look directly at seg[5]) after you call split(). You can then simplify your code to something like:
Code:
awk '$2 != "CD" {next}
!($7 in LINE) {LINE[$7]=$0}
{CNT[$7]++; E3[$7]=$3; E4[$7]=$4}
END {for (i in LINE) if (CNT[i]>=2) {
split(LINE[i], seg)
if (seg[5] == "-") {
print seg[1], seg[2], E3[i],
seg[4], seg[5], seg[6], seg[7]
} else {
print seg[1], seg[2], seg[3],
E4[i], seg[5], seg[6], seg[7]
}
}
}
' FS="\t" OFS="\t" File1
and get the same results.
Hope this helps,
Don
Hi Don,
It does help!.. It just that i need to add a tiny part (in blue) there at printing part or else it wont show $4 in my output.
Code:
awk '$2 != "CD" {next}
!($7 in LINE) {LINE[$7]=$0}
{CNT[$7]++; E3[$7]=$3; E4[$7]=$4}
END {for (i in LINE) if (CNT[i]>=2) {
split(LINE[i], seg)
if (seg[5] == "-") {
print seg[1], seg[2], seg[3]= E3[i],
seg[4], seg[5], seg[6], seg[7]
} else {
print seg[1], seg[2], seg[3],
seg[4]=E4[i], seg[5], seg[6], seg[7]
}
}
}
' FS="\t" OFS="\t" file1
Thanks a bunch
---------- Post updated at 09:32 AM ---------- Previous update was at 09:31 AM ----------
Quote:
Originally Posted by RudiC
Further simplification:
Code:
awk '$2 != "CD" {next}
!($7 in LINE) {LINE[$7]=$0}
{CNT[$7]++; E3[$7]=$3; E4[$7]=$4}
END {for (i in LINE) if (CNT[i]>=2) {
split(LINE[i], seg)
if (seg[5] == "-") seg[3] = E3[i]
else seg[4] = E4[i]
print seg[1], seg[2], seg[3], seg[4], seg[5], seg[6], seg[7]
}
}
' FS="\t" OFS="\t" file
I am trying to get a simple min/max script to work with the below input. Note the special character (">") within it.
Script
awk 'BEGIN{max=0}{if(($1)>max) max=($1)}END {print max}'
awk 'BEGIN{min=0}{if(($2)<min) min=($2)}END {print min}'
Input
-122.2840 42.0009
-119.9950 ... (7 Replies)
I need to find the max/min of columns 1 and 2 of a 2 column file what contains the special character ">".
I know that this will find the max value of column 1.
awk 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' input.file
But what if I needed to ignore special characters in the... (3 Replies)
aaa: 3 ms
aaa: 2 ms
aaa: 5 ms
aaa: 10 ms
..........
to get the 3 2 5 10 ...'s min avg and max
something like
min: 2 ms avg: 5 ms max: 10 ms (2 Replies)
Hi,
I have a file which looks like this:
FID IID MISS_PHENO N_MISS N_GENO F_MISS
12AB43131 12AB43131 N 17774 906341 0.01961
65HJ87451 65HJ87451 N 10149 906341 0.0112
43JJ21345 43JJ21345 N 2826 906341 0.003118I would... (11 Replies)
Hi guys,
I already search on the forum but i can't solve this on my own.
I have a lot of files like this:
And i need to print the line with the maximum value in last column but if the value is the same (2 in this exemple for the 3 last lines) i need get the line with the minimum value in... (4 Replies)
Hi guys!
I'm new to scripting and I need to write a script in awk.
Here is example of file on which I'm working
ATOM 4688 HG1 PRO A 322 18.080 59.680 137.020 1.00 0.00
ATOM 4689 HG2 PRO A 322 18.850 61.220 137.010 1.00 0.00
ATOM 4690 CD ... (18 Replies)
hi, i have an awk script and I managed to figure out how to search the max value but Im having difficulty in searching for the min field value.
BEGIN {FS=","; max=0}
NF == 7 {if (max < $6) max = $6;}
END { print man, min}
where $6 is the column of a field separated by a comma (3 Replies)
Hello every one, I have following data
***CAMPAIGN 1998 CONTRIBUTIONS***
---------------------------------------------------------------------------
NAME PHONE Jan | Feb | Mar | Total Donated
... (12 Replies)