Help in awk/bash | Unix Linux Forums | Shell Programming and Scripting

    #1  
12-28-2012
bioinfo

Hi, I am new to awk and am trying to find a solution to my problem.

I have one reference file, 1.txt, with 2 columns, and I want to search 10 other files (a.txt, b.txt, ..., h.txt, each with 5 columns) for the values in the 2nd column of 1.txt. If a value in the 2nd column of 1.txt matches the value in the 4th column of any of the 10 files, then print the matching row as well as the file name.
Also, the 1st value in 1.txt, for example, is -191.632, but in a.txt it is originally -191.6318, so I also want to print rows whose values agree to two decimal places; the remaining digits can be anything.

1.txt:


Code:
1.35732	-191.632
1.36229	-190.8716
1.35503	-191.3254
1.35597	-191.2652

a.txt:


Code:
271640.000  0.49000  -0.0000036574  -191.6318   -183.82380
271650.000  0.49155   0.0000033909  -198.30111  -198.73140
271660.000  0.48775   0.0000014657  -191.3254   -199.84910
271670.000  0.48212  -0.0000004152  -195.48446  -193.15580

Please guide.
Thanks

Last edited by joeyg; 12-28-2012 at 08:44 PM.. Reason: Please wrap scripts and data in CodeTags
    #2  
12-28-2012
DGPickett
You can 'join' 1.txt to each of the [a-h].txt files in a 'for' loop and process the output with a shell 'while read'. The file name will be in the 'for' variable, and all of the file's columns will be available in the 'read' variables. You have to 'sort' every file on the key column using a bytewise ('binary') sort, that is, with LC_ALL=C exported, not a numeric sort. Hopefully the original line order is not critical; if it is, number the lines in a new field first. While you can do the join with a pile of awk or shell commands, this is cleaner.

Man Page for join (opensolaris Section 1) - The UNIX and Linux Forums

Man Page for sort (all Section 1) - The UNIX and Linux Forums
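
A minimal sketch of the join approach described above, assuming a POSIX shell with mktemp available. The function name match_with_join is hypothetical. Because join compares keys as exact strings, the sketch first uses awk to prefix every line with its key rounded to two decimal places; that is one way to meet the two-decimal requirement, not something the post specifies. It also assumes the reference file has exactly 2 columns, and it puts the while read inside the for loop so that the file name is still in scope.

```shell
# Hypothetical helper: match_with_join REFERENCE_FILE DATA_FILE...
match_with_join() {
    ref=$1
    shift
    tmp=$(mktemp -d) || return 1

    # Prefix each reference line with column 2 rounded to two decimals,
    # then sort bytewise on that new first field.
    awk '{ $1 = $1; printf "%.2f %s\n", $2, $0 }' "$ref" |
        LC_ALL=C sort -k1,1 > "$tmp/ref"

    for f in "$@"; do
        # Same preparation for the data file, keyed on column 4.
        awk '{ $1 = $1; printf "%.2f %s\n", $4, $0 }' "$f" |
            LC_ALL=C sort -k1,1 > "$tmp/data"

        # join prints: key, the 2 reference columns, then the data row.
        LC_ALL=C join "$tmp/ref" "$tmp/data" |
            while read -r key ref1 ref2 row; do
                printf '%s %s\n' "$row" "$f"
            done
    done

    rm -rf "$tmp"
}

# Usage (run in the directory that holds the files):
#   match_with_join 1.txt [a-h].txt
```

The rows come out in key order rather than in their original order, which is the trade-off the post mentions.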
    #3  
12-28-2012
bioinfo
Thanks for the reply.
Can you please help me write the code, as I am not an expert in awk?

Thanks again
    #4  
12-28-2012
Don Cragun
I'm not sure whether you want the values in column 2 of 1.txt and column 4 of [a-h].txt truncated or rounded to two decimal places (with your sample input, the results are the same), and I'm not sure why DGPickett thinks join and sort would be easier than awk, but here are two ways to use awk to do what I think you're requesting:

Code:
echo "awk with rounded values"
awk ' FNR == NR {v[sprintf("%.2f", $2)]}
sprintf("%.2f", $4) in v {print $0, FILENAME}' 1.txt [a-h].txt

echo "awk with truncated values"
awk '
function trunc(val) {
        split(val, a, /[.]/)
        return a[1] "." substr(a[2] "00", 1, 2)
}
FNR == NR {v[trunc($2)]}
trunc($4) in v {print $0, FILENAME}' 1.txt [a-h].txt

    #5  
12-28-2012
bioinfo
Thanks for the reply.
Can you please explain it a little?

Thanks again.
    #6  
12-28-2012
Don Cragun
Code:
1  echo "awk with rounded values"
2  awk ' FNR == NR {v[sprintf("%.2f", $2)]; next}
3  sprintf("%.2f", $4) in v {print $0, FILENAME}' 1.txt [a-h].txt
4
5  echo "awk with truncated values"
6  awk '
7  function trunc(val) {
8          split(val, a, /[.]/)
9          return a[1] "." substr(a[2] "00", 1, 2)
10 }
11 FNR == NR {v[trunc($2)]; next}
12 trunc($4) in v {print $0, FILENAME}' 1.txt [a-h].txt

I have added line numbers to aid in this discussion, but note that the line numbers cannot appear in the script when you run it.

Also note that I have added an awk next command to lines 2 and 11. With the given sample data it won't make any difference, but with other data or with different fields being checked, it could be important.

In the suggestion on lines 1-3, sprintf("%.2f", arg) converts the string specified by arg to a floating point value and produces a string that represents that value rounded to two digits after the decimal point. Line 2 uses that to create an array whose indices are the rounded values of the second field ($2) of each line in the first input file. The condition FNR == NR selects those lines: FNR is the record number within the current file and NR is the total number of records read so far, so the two are equal only while awk is reading the first file (as long as that file is not empty).
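
A small sketch of the two ideas in that paragraph, assuming a POSIX shell with mktemp; the file names first.txt and second.txt are made up for the illustration. The first command rounds three of the sample values, and the second prints FILENAME, FNR, NR and the result of FNR == NR for every line of two short files.

```shell
# Round some of the sample values to two decimal places.
rounded=$(printf '%s\n' -191.632 -191.6318 -191.3254 |
    awk '{ print sprintf("%.2f", $1) }')
printf '%s\n' "$rounded"
# -191.63
# -191.63
# -191.33

# FNR restarts at 1 for each file while NR keeps counting, so FNR == NR
# is true (1) only for lines of the first file.
tmpdir=$(mktemp -d)
printf 'x\ny\n' > "$tmpdir/first.txt"
printf 'p\nq\n' > "$tmpdir/second.txt"
counters=$(cd "$tmpdir" &&
    awk '{ print FILENAME, FNR, NR, (FNR == NR) }' first.txt second.txt)
printf '%s\n' "$counters"
# first.txt 1 1 1
# first.txt 2 2 1
# second.txt 1 3 0
# second.txt 2 4 0
rm -rf "$tmpdir"
```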

(The next command I added here causes awk to skip to the next record instead of checking whether any remaining commands in the script should be executed. Without the next, line 3 is evaluated for lines from all input files, including 1.txt. It doesn't affect processing here because there is no field 4 in the first file: the empty field 4 is converted to 0.00, and none of the strings in the second field of 1.txt is converted to 0.00.)
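
To see a case where the next does matter, here is a sketch with made-up data: a hypothetical reference file ref4.txt that has four columns, where field 4 happens to round to the same value as field 2. It assumes a POSIX shell with mktemp.

```shell
tmpdir=$(mktemp -d)
printf 'k1 7.251 x 7.249\n' > "$tmpdir/ref4.txt"   # $2 and $4 both round to 7.25
printf 'a b c 7.25 e\n'     > "$tmpdir/data.txt"

# Without next, the second rule also runs on the reference file's own line.
without_next=$(cd "$tmpdir" && awk '
    FNR == NR { v[sprintf("%.2f", $2)] }
    sprintf("%.2f", $4) in v { print $0, FILENAME }' ref4.txt data.txt)

# With next, lines of the reference file never reach the second rule.
with_next=$(cd "$tmpdir" && awk '
    FNR == NR { v[sprintf("%.2f", $2)]; next }
    sprintf("%.2f", $4) in v { print $0, FILENAME }' ref4.txt data.txt)

printf 'without next:\n%s\nwith next:\n%s\n' "$without_next" "$with_next"
rm -rf "$tmpdir"
```

The version without next also reports the line from ref4.txt; the version with next reports only the line from data.txt.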

Line 3 tests whether the same conversion used in line 2 produces a string that is an index in the array v (the expression index in array evaluates to TRUE if index is an index in the array named array). So, if $4 (rounded to two decimal places) in any file after the first matches $2 (rounded to two decimal places) from any line of the first file, the print command runs, printing the current input line ($0) and the name of the file containing the line (FILENAME).
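
A short sketch of how the array is used as a set: a bare reference such as v[key] creates the index (this is what line 2 relies on), while the in operator only tests for it and does not create anything. The keys are taken from the sample data.

```shell
membership=$(awk 'BEGIN {
    v["-191.63"]                 # a bare reference creates the index
    print ("-191.63" in v)       # 1: the index exists
    print ("-198.30" in v)       # 0: the index does not exist
    n = 0
    for (k in v) n++
    print n                      # 1: the "in" test did not add "-198.30"
}')
printf '%s\n' "$membership"
```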

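The 1.txt [a-h].txt on lines 3 and 12 specifies the input files to be processed by these awk scripts: 1.txt first, followed by every file in the current directory whose name matches [a-h].txt (at most eight files, a.txt through h.txt). If your ten files have other names, adjust the pattern so that it matches all of them.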
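
The shell expands the [a-h].txt pattern before awk starts, so it is easy to check which names it covers. The sketch below assumes a POSIX shell with mktemp and creates empty, hypothetical files a.txt through j.txt in a scratch directory.

```shell
tmpdir=$(mktemp -d)
for name in a b c d e f g h i j; do
    : > "$tmpdir/$name.txt"          # create an empty file
done

# echo shows the argument list that awk would receive for this pattern.
matched=$(cd "$tmpdir" && echo [a-h].txt)
printf '%s\n' "$matched"
# a.txt b.txt c.txt d.txt e.txt f.txt g.txt h.txt
rm -rf "$tmpdir"
```

Here i.txt and j.txt are not matched, so for ten files named a.txt through j.txt a pattern such as [a-j].txt would be needed.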

The suggestion on lines 5-12 uses the same logic as the 1st suggestion but truncates the strings to two decimal places instead of rounding them. Since the truncation logic is more complex than the single call to sprintf() used to perform the rounding, I wrote a function (lines 7-10) to convert the string to a string representing a floating point value with two decimal places.
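
Rounding and truncation agree on the sample data but not in general. The sketch below prints the input, the rounded value and the truncated value for one sample value and for one made-up value, -191.636, where the two differ. As a small variation on the posted function, a is declared as an extra parameter so that it is local to trunc.

```shell
compare=$(printf '%s\n' -191.6318 -191.636 | awk '
function trunc(val,    a) {          # extra parameter a acts as a local array
    split(val, a, /[.]/)
    return a[1] "." substr(a[2] "00", 1, 2)
}
{ print $1, sprintf("%.2f", $1), trunc($1) }')
printf '%s\n' "$compare"
# -191.6318 -191.63 -191.63
# -191.636 -191.64 -191.63
```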

The split() on line 8 creates an array of one or two elements, with the first element containing all of the characters before the "." and the second element containing all of the characters after the ".". If there is no "." in the input value, the first element of the array will contain the entire input string and the second element of the array will not be set (and when referenced will act as an empty string). The return command on line 9 returns a string that is the concatenation of the first element in the array, a decimal point, and the first two characters of the concatenation of the second element of the array followed by "00". (The concatenation with "00" takes care of cases where field 2 in the first file or field 4 in the remaining files has an integer value with no decimal point, and cases where the input field has a period but there are fewer than two digits after the decimal point.)
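
The intermediate values can be printed to make those cases concrete. This sketch runs the same split() and substr() expressions on three inputs (an integer, a value with one decimal digit, and a sample value) and prints, separated by "|", the input, the number of pieces, the two pieces and the result.

```shell
pieces=$(printf '%s\n' 12 12.5 -191.6318 | awk '
{
    n = split($1, a, /[.]/)                      # n is 1 when there is no "."
    result = a[1] "." substr(a[2] "00", 1, 2)    # same expression as line 9
    printf "%s|%d|%s|%s|%s\n", $1, n, a[1], a[2], result
}')
printf '%s\n' "$pieces"
# 12|1|12||12.00
# 12.5|2|12|5|12.50
# -191.6318|2|-191|6318|-191.63
```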

The logic on lines 11 and 12 is the same as the logic on lines 2 and 3.
    #7  
12-30-2012
bioinfo
Hi,
Thanks a lot, I have done it.
I got the following output for all the files (showing it for just one file and naming it o.txt):

Code:
100.000   0.51332  0.0000001923  -191.04738  a.txt
2000.000  0.49573  0.0000015512  -191.40071  a.txt
1000.000  0.51047  0.0000028339  -190.92254  a.txt

Further, I need your help with something else. I have 10 more files, all in the same format as 11.txt below, which shows 2 repeats from this file:

Code:
ATOM      1  N   SER A   1      35.092  83.194 140.076  1.00  0.00           N
ATOM      2  CA  SER A   1      35.216  83.725 138.725  1.00  0.00           C
ATOM      3  C   SER A   1      36.530  84.485 138.538  1.00  0.00           C
TER
ENDMDL
ATOM      1  N   SER A   1      35.683  81.326 139.778  1.00  0.00           N  
ATOM      2  CA  SER A   1      35.422  82.736 139.929  1.00  0.00           C  
ATOM      3  C   SER A   1      36.497  83.588 139.247  1.00  0.00           C  
TER
ENDMDL

ENDMDL occurs around 10000 times in each file. If I give 100 as input, from $1 of o.txt, then the output should be the first repeat from 11.txt, ending with ENDMDL.

Code:
ATOM      1  N   SER A   1      35.092  83.194 140.076  1.00  0.00           N
ATOM      2  CA  SER A   1      35.216  83.725 138.725  1.00  0.00           C
ATOM      3  C   SER A   1      36.530  84.485 138.538  1.00  0.00           C
TER
ENDMDL

So, for each value in the first column of o.txt, I want to retrieve repeat number $1/100 from 11.txt; i.e. if $1=2000, I want to retrieve the repeat that ends with the 20th ENDMDL.


Please guide me.

Thanks again

---------- Post updated at 10:40 PM ---------- Previous update was at 09:52 PM ----------

Please guide me. It's urgent.

Thanks
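
The thread was closed without a reply to this follow-up. One possible approach, as a sketch only: it reuses the FNR == NR idiom from the earlier posts, assumes that every repeat ends with a line whose first field is ENDMDL, and assumes that the values in column 1 of o.txt are multiples of 100. The function name print_models is hypothetical.

```shell
# Hypothetical helper: print_models LIST_FILE STRUCTURE_FILE
print_models() {
    awk '
        # First file: remember the wanted repeat numbers ($1 / 100).
        FNR == NR { want[int($1 / 100 + 0.5)]; next }
        # Second file: its first line belongs to repeat number 1.
        FNR == 1 { model = 1 }
        # Print every line of a wanted repeat, including its ENDMDL line.
        model in want { print }
        # A line starting with ENDMDL closes the current repeat.
        $1 == "ENDMDL" { model++ }
    ' "$1" "$2"
}

# Usage:
#   print_models o.txt 11.txt
```

The repeats are printed in the order in which they occur in the second file, not in the order of the rows in o.txt.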

Last edited by Scrutinizer; 12-31-2012 at 02:53 AM.. Reason: code tags