Extract columns where header matches a given string


 
# 15  
Old 03-18-2011
Quote:
Originally Posted by flotsam
Hi,
Given the following input files:

file1.txt

Code:
41109297 
41109706 
43162207
41109808
41109377
41110441
41111192
43163011
43162367

file2.txt

Code:
I Name    41109297 41109297 41109706 41109706 41110441 41110441 41111192 41111192 41112086 41112086 41113889 41113889 41114003 41114003 41114656 41114656 41115162 41115162 41115561 41115561

In the first case it seems like it's using file2 to order the output? Any thoughts on how I can keep the output in the order of file1?
Hi there, I think I understand what you mean. You want to get the relevant columns AND reorder them to follow the order defined in file1. E.g. in your pasted output, the 43162207 column should come before 41110441, right?

Then take a look at this and see if it is what you want (output included):


Code:
ArchT60 /tmp/test
kent$ awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' f2 |awk 'BEGIN{i=j=m=n=1;}
NR==FNR{a[$1]=$1;aa[i++]=$1;} 
NR>FNR{if ($1 in a ){b[n++]=$0;bb[m++]=$1;}} 
END{
la=length(a); lb=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa[i]) print b[j]; } } 
}' f1 - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' 
b=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa[i]) print b[j]; } } }' f1 - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' 

41109297 41109297 41109706 41109706 43162207 43162207 41110441 41110441 41111192 41111192 43163011 43163011 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
A A A A A A A A A A A A 
B B B B B B B B B B B B 
A A A A A B A B A B A A


Last edited by sk1418; 03-18-2011 at 07:47 PM.. Reason: reformat
# 16  
Old 03-18-2011
Hi,
Given the following input files:

file1.txt

Code:
41109297 
41109706 
43162207
41109808
41109377
41110441
41111192
43163011
43162367

file2.txt

Code:
I Name    41109297 41109297 41109706 41109706 41110441 41110441 41111192 41111192 41112086 41112086 41113889 41113889 41114003 41114003 41114656 41114656 41115162 41115162 41115561 41115561 41115979 41115979 41116248 41116248 41130607 41130607 41130611 41130611 41131240 41131240 41132167 41132167 41133800 41133800 41134462 41134462 41134623 41134623 42135335 42135335 42137664 42137664 42143490 42143490 42144170 42144170 42144339 42144339 42144650 42144650 42145389 42145389 42146088 42146088 42146090 42146090 42146879 42146879 42148154 42148154 43161219 43161219 43162207 43162207 43163011 43163011 43163878 43163878 43164830 43164830 43165768 43165768 43166228 43166228 43166330 43166330 43167557 43167557 43180900 43180900 43181675 43181675 43182287 43182287 43184255 43184255 43184401 43184401 
M 1080_COI    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 10668_CO    B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1218_ND    B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1546_CY    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1626_ND    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 1637_ND    B B B B B B B B B B B B B B B B B B B B B B B B A A B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M 5831_ND2    A A A A A A A A A A A A A A A A A A A A A A A A 0 0 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 
M 8472_CO2    B B B B B B B B B B B B B B B B B B B B B B B B 0 0 B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B 
M GGal006    A A A A A B A B A B A A A A A B A B A A A B A B A A A B A B A A A B B B B B A B A A B B A B A A A B A B A A A A B B A B A B A B A A B B A B A B B B A B A B A A A B A B A B A A

I get the following output:

@sk1418 & vgersh99

Code:
41109297 41109297 41109706 41109706 41110441 41110441 41111192 41111192 43162207 43162207 43163011 43163011 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
B B B B B B B B B B B B 
A A A A A A A A A A A A 
B B B B B B B B B B B B 
A A A A A B A B A B A A

@pravin27
Code:
41110441 41110441 41111192 41111192 43162207 43162207 43163011 43163011 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
B B B B B B B B 
A A A A A A A A 
B B B B B B B B 
A B A B A B A A


In the first case it seems like it's using file2 to order the output? Any thoughts on how I can keep the output in the order of file1?
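
To illustrate the difference between the two orderings, here is a minimal Python sketch. The function names and the shortened toy header are made up for this example and are not taken from any of the posted solutions. Walking the header first gives file2's order; walking the id list first gives file1's order.

```python
def positions_in_header_order(ids, header):
    # Walk the header left to right: the result follows file2's column order.
    wanted = set(ids)
    return [i for i, name in enumerate(header) if name in wanted]


def positions_in_id_order(ids, header):
    # Walk the id list first: the result follows file1's order.
    return [i for wanted in ids for i, name in enumerate(header) if name == wanted]


# Toy data (hypothetical, shortened from the files above).
ids = ["41109706", "43162207", "41110441"]
header = ["I", "Name", "41109706", "41109706", "41110441", "41110441",
          "43162207", "43162207"]

print(positions_in_header_order(ids, header))  # [2, 3, 4, 5, 6, 7]
print(positions_in_id_order(ids, header))      # [2, 3, 6, 7, 4, 5]
```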

---------- Post updated at 05:49 PM ---------- Previous update was at 05:41 PM ----------

Here is a Python script that does what I was after, if anyone is interested. It's cumbersome and will probably be incredibly slow, but it should work:

Code:
#!/usr/bin/env python3
import sys

USAGE = ('Takes id file and extracts matching columns from file 2 if headers '
         'match... USAGE: ./grabCol.py file1 file2 outfile')


def argChk(argv):
        # Expect exactly three arguments after the script name.
        if '-h' in argv or len(argv) != 4:
                sys.exit(USAGE)
        return argv


def getPos(ids, header):
        # Collect column positions in the order of the id file;
        # a header that appears more than once matches every copy.
        coLst = header.split()
        posLst = []
        for i in ids:
                for j, name in enumerate(coLst):
                        if i.strip() == name:
                                posLst.append(j)
        return posLst


############################MAIN#############################
if __name__ == '__main__':
        args = argChk(sys.argv)
        with open(args[1]) as idFile:
                ids = idFile.readlines()
        with open(args[2]) as colFile:
                col = colFile.readlines()
        # The first line of file2 holds the column headers.
        posLst = getPos(ids, col[0])
        with open(args[3], 'w') as oFile:
                for line in col:
                        iLst = line.split()
                        oFile.write(' '.join(iLst[j] for j in posLst) + '\n')
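
If speed on very large files is the worry, one possible variation is sketched below. It is only a sketch under assumptions: the names `header_index` and `select_columns` are invented here, fields are assumed to be whitespace-separated, and every row is assumed to have as many fields as the header. It looks header names up in a dictionary instead of rescanning the header for every id, and it writes each row as it is read rather than building one big string.

```python
import sys
from collections import defaultdict


def header_index(header_fields):
    # Map each header name to every column position where it occurs.
    index = defaultdict(list)
    for pos, name in enumerate(header_fields):
        index[name].append(pos)
    return index


def select_columns(id_lines, table_lines):
    # Yield each table row reduced to the wanted columns, in id-file order.
    table_lines = iter(table_lines)
    first = next(table_lines, None)
    if first is None:
        return  # empty table: nothing to select
    header = first.split()
    index = header_index(header)
    positions = []
    for line in id_lines:
        fields = line.split()
        if fields:
            positions.extend(index.get(fields[0], []))
    yield ' '.join(header[p] for p in positions)
    for line in table_lines:
        fields = line.split()
        yield ' '.join(fields[p] for p in positions)


if __name__ == '__main__' and len(sys.argv) == 4:
    # Usage (hypothetical script name): ./grabColFast.py file1 file2 outfile
    with open(sys.argv[1]) as ids, open(sys.argv[2]) as table, \
            open(sys.argv[3], 'w') as out:
        for row in select_columns(ids, table):
            out.write(row + '\n')
```

Ids that do not appear in the header are skipped silently, which matches the behaviour of the script above.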


Last edited by Franklin52; 03-21-2011 at 05:18 AM.. Reason: fixed code tagging
# 17  
Old 03-18-2011
Hi, have you tried my solution (the post just before your Python one)? Does it work for you?


Btw, your Python code lost all its indentation. :)
# 18  
Old 03-18-2011
@sk1418
Yeah that's exactly what I was after! Sorry I didn't articulate it well when I first posted. I tried to just copy and paste your example, but am coming up with an error message.

Is there something missing on the second line around b=length?

The Python script seems to be doing its job, but I think it's going to be very slow, and these are some very large files. Thanks again for all your help.

Code:
awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' file2.txt |awk 'BEGIN{i=j=m=n=1;}NR==FNR{a[$1]=$1;aa[i++]=$1;} NR>FNR{if ($1 in a ){b[n++]=$0;bb[m++]=$1;}} END{la=length(a); lb=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa[i]) print b[j]; } } }' file1.txt - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }'
b=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa[i]) print b[j]; } } }' file1.txt - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }'


-bash: syntax error near unexpected token `('


Last edited by Franklin52; 03-21-2011 at 05:16 AM..
# 19  
Old 03-18-2011
OK, I had just reformatted my code a little, so maybe something went wrong there. Here it is again as a one-line version. It looks ugly but should work. Just try it:

Code:
awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }' f2 |awk 'BEGIN{i=j=m=n=1;}NR==FNR{a[$1]=$1;aa[i++]=$1;} NR>FNR{if ($1 in a ){b[n++]=$0;bb[m++]=$1;}} END{la=length(a); lb=length(b); for(i=1;i<=la;i++){ for(j=1;j<=lb;j++){ if (bb[j] == aa[i]) print b[j]; } } }' f1 - |awk '{for (i=1;i<=NF;i++)a[i,NR]=$i; }END{for(i=1;i<=NF;i++) {for(j=1;j<=NR;j++)printf a[i,j]" ";print ""} }'
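
For anyone trying to follow the one-liner: it runs three stages. It transposes file2 so that each column becomes a row, keeps the rows whose first field is listed in file1 (in file1's order), and then transposes the result back. Below is a rough Python sketch of the same idea. It is not a translation of the awk code; the function names and the toy data are invented for illustration, and the table is assumed to be rectangular.

```python
def transpose(rows):
    # Turn rows into columns; zip() stops at the shortest row,
    # so this assumes every row has the same number of fields.
    return [list(col) for col in zip(*rows)]


def reorder_by_ids(ids, rows):
    # Keep rows whose first field is a wanted id, emitted in id order
    # (a repeated header keeps all of its rows).
    return [row for wanted in ids for row in rows if row[0] == wanted]


def extract(ids, table):
    # transpose -> filter and reorder -> transpose back
    return transpose(reorder_by_ids(ids, transpose(table)))


# Toy data (hypothetical, shortened from the files above).
table = [
    ["I", "Name", "41109706", "41110441", "43162207"],
    ["M", "GGal006", "A", "B", "A"],
]
ids = ["41109706", "43162207", "41110441"]

for row in extract(ids, table):
    print(" ".join(row))
# 41109706 43162207 41110441
# A A B
```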

# 20  
Old 03-18-2011
Works perfectly, thank you very much! I'll race it against the Python script!
# 21  
Old 03-19-2011
To keep the output in the order of file1:
Code:
 awk 'NR==FNR{a[++k]=$1;next} FNR==1{for(m=1;m<=k;m++)for(i=1;i<=NF;i++)if(a[m]==$i)c[++n]=i} {for(p=1;p<=n;p++)printf "%s ",$(c[p]);printf "\n"}' file1 file2
