But the result shows all lines containing the "CD" pattern, like below:
Code:
SP12.3 CD 2240806 2241254 + ID1_N003 ID2_N003T0
SP12.3 CD 2241471 2241681 + ID1_N003 ID2_N003T0
SP12.3 CD 2245127 2246953 + ID1_N005 ID2_N005T0
SP12.3 CD 2253090 2253117 - ID1_N006 ID2_N006T0
SP12.3 CD 2252492 2252908 - ID1_N006 ID2_N006T0
SP12.3 CD 2251730 2251882 - ID1_N006 ID2_N006T0
SP12.3 CD 2251591 2251664 - ID1_N006 ID2_N006T0
SP12.3 CD 2249887 2251530 - ID1_N006 ID2_N006T0
SP12.3 CD 2249762 2249821 - ID1_N006 ID2_N006T0
SP12.3 CD 2251730 2252075 - ID1_N006 ID2_N006T1
SP12.3 CD 2251591 2251664 - ID1_N006 ID2_N006T1
SP12.3 CD 2249887 2251530 - ID1_N006 ID2_N006T1
SP12.3 CD 2249762 2249821 - ID1_N006 ID2_N006T1
SP12.5 CD 3001307 3001397 + ID1_N01140 ID2_N01140T0
SP12.5 CD 3001572 3002765 + ID1_N01140 ID2_N01140T0
SP12.5 CD 3002821 3004797 + ID1_N01140 ID2_N01140T0
SP12.5 CD 3004855 3004929 + ID1_N01140 ID2_N01140T0
SP12.5 CD 3004994 3005025 + ID1_N01140 ID2_N01140T0
The output that I really want shows only the min and max values for lines where the "CD" pattern is found, and it depends on the value in $5. If $5 is "+", then the $3 value of the first "CD" line and the $4 value of the last "CD" line for each ID2 ($7) should be printed in $3 and $4 of the output file, respectively. If $5 is "-", then the $4 value of the first "CD" line and the $3 value of the last "CD" line for each ID2 ($7) should be printed in $4 and $3, respectively, like below:
Code:
SP12.3 CD 2240806 2241681 + ID1_N003 ID2_N003T0
SP12.3 CD 2249762 2253117 - ID1_N006 ID2_N006T0
SP12.3 CD 2249762 2252075 - ID1_N006 ID2_N006T1
SP12.5 CD 3001307 3005025 + ID1_N01140 ID2_N01140T0
If there is only one CD line for an ID2 ($7), that line should be omitted. I would appreciate it if you could help me with this. Thanks.
Last edited by redse171; 08-03-2014 at 06:20 PM..
Reason: for better sample and description
Thanks a lot for your quick response.
I am not really clear about your question above, but I am extracting information for gene features, and this is how to find the region of the coding sequence.
I tried your code, but it did not give accurate results on my real data. I tried changing and playing around with your code, but the result is still not correct. Below is the sample result that I got:
Code:
SP12.5 CD 2249762 2249821 - ID2_N006 ID2_N006T1
SP12.5 CD 3004994 3005025 + D2_N0114 ID2_N01140T0
SP12.5 CD 2249762 2249821 - ID2_N006 ID2_N006T0
SP12.5 CD 2241471 2241681 + ID2_N003 ID2_N003T0
If you don't mind, can you explain your code? The above data is just a sample. For $1, I have many different values, not only SP12.3. So I changed "print "SP12.3"" to print "$1", but the output is still wrong. Thanks.
But the result still shows all lines containing the "CD" pattern. The output that I really want shows only the min and max values based on $5 (blue for "+" and red for "-"), as below:
Code:
SP12.3 CD 2240806 2241681 + ID1 N003 ID2 N003T0
SP12.3 CD 2249762 2253117 - ID1 N006 ID2 N006T0
SP12.3 CD 2249762 2252075 - ID1 N006 ID2 N006T1
If there is only one CD line for an ID2 ($7), that line should be omitted. I would appreciate it if you could help me with this. Thanks.
It is no wonder that the results you are getting are not what you want. Your description of how to process the input is so vague that we do not understand what you want.
The code you showed us prints parts of every line with "CD" in the 2nd field. For those lines, it throws away fields 2, 8, and 9; and, if $5 is "+", it swaps fields 3 and 4 before printing the remainder of the line. But, the output you say you want shows every field (keeping fields 2, 8, and 9). And if fields 3 and 4 have been swapped, it isn't obvious to me.
You mentioned ID2 ($7), but it looks like you are looking for the minimum $3 value and the maximum $4 value for each different value in field 9 (not field 7). And from the data shown, I don't see that the + or - in field 5 makes any difference at all.
You have shown us data where fields 1, 6, and 8 are all constants. You have said that $1 may change, but you haven't given any indication of how, or if, that should affect the output produced.
Please give us a clear English description of what you are trying to do and explain what the meaning is for each of the fields in your file.
Also, lots of gene data that we're asked to help with has huge files to process. If that is the case here as well, any details you can give us about the data may help speed up the process considerably. For example, what you have shown us could be sorted with field 1, 5, or 9 as a primary sort key. If data is to be grouped using field 9 as a key and the input is sorted on field 9, we can produce any needed output every time the contents of field 9 changes (as opposed to accumulating all of the input into memory and processing everything at the end).
We also need to know up front whether or not it is important that the output be in the same order as the input.
And, finally: just saying that the code you were given didn't give you accurate results is useless information. Show us the output you got, the output you wanted, and explain why (based on your description of what you wanted) the output you got was wrong! Help us help you!
These 3 Users Gave Thanks to Don Cragun For This Post:
Thank you for your comments. Forgive me for the vague description. I just edited my question and sample above. I tried my best to explain my issue. My data is long and huge and has different conditions, and I tried my best to keep the sample simple, but it seems that this created more confusion. My mistake. Thanks.
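One way to read the edited description in the first post is sketched below in awk, wrapped in a shell function. It is only a sketch under several assumptions: the function name cd_span is made up for illustration, the input uses the seven-field layout of the edited sample (strand in $5, ID2 in $7), and all lines belonging to one ID2 are adjacent, so a group can be printed as soon as the ID2 value changes instead of holding the whole file in memory.

```shell
# cd_span: hypothetical helper. Reads whitespace-separated lines from the
# named files (or stdin) and prints one merged line per ID2 group.
cd_span() {
    awk -v k=7 '
    # Print the finished group, but only if it had more than one CD line.
    function emit(    i, nf, f, out) {
        if (n < 2) return
        nf = split(first, f)      # fields of the first CD line of the group
        f[3] = s; f[4] = e        # replace start and end with the merged span
        out = f[1]
        for (i = 2; i <= nf; i++) out = out " " f[i]
        print out
    }
    $2 != "CD" { next }           # ignore every non-CD line
    $k != key {                   # ID2 changed: close old group, open new one
        emit()
        key = $k; n = 0; first = $0; s = $3; e = $4
    }
    {
        n++
        if ($5 == "+") e = $4     # "+": keep first $3, take $4 of the last line
        else           s = $3     # "-": keep first $4, take $3 of the last line
    }
    END { emit() }
    ' "$@"
}

# Usage (input.txt is a placeholder file name):
#   cd_span input.txt
```

Run on the sample lines from the first post, this should give the four requested lines, with the single-CD group ID2_N005T0 dropped. If the lines for one ID2 are not adjacent in the real data, the same bookkeeping would have to be kept in arrays indexed by ID2 and printed in the END block instead.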