Extract columns based on header


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Extract columns based on header
# 1  
Old 04-22-2013
Extract columns based on header

Hi to all,

I have two files. File1 has no header, two columns:


Code:
sample1 A
sample2 B
sample3 B
sample4 C
sample5 A
sample6 D
sample7 D

File2 has a header, except for the first 3 columns (chr,start,end). "sample1" is the header for the 4th ,5th ,6th columns, "sample2" is the header for 7th ,8th 9th columns and so on.

Code:
sample1 sample2 sample3 sample4 sample5 sample6 sample7
chr start end a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a6 b6 c6 a7 b7 c7
chr start end d1 e1 f1 d2 e2 f2 d3 e3 f3 d4 e4 f4 d5 e5 f5 d6 e6 f6 d7 e7 f7

If I would like to extract from File2 just those samples that in File1 have B and D, the output would be:


Code:
sample2 sample3 sample6 sample7
chr start end a2 b2 c2 a3 b3 c3 a6 b6 c6 a7 b7 c7 
chr start end d2 e2 f2 d3 e3 f3 d6 e6 f6 d7 e7 f7


Thanks,

---------- Post updated at 08:06 PM ---------- Previous update was at 06:32 PM ----------

thanks for editing my post and write it in code tags

Last edited by aec; 04-23-2013 at 10:53 AM.. Reason: Please wrap commands and data with CodeTags
# 2  
Old 04-22-2013
Quote:
Originally Posted by aec
Hi,

I have a file like (no header, two columns):

Code:
sample1 A
sample2 B
sample3 B
sample4 C
sample5 A
sample6 D
sample7 D
....

I have another file like (with header, multiple rows and cols, each sample has 3 values a,b,c):

(no header first 3 cols)
Code:
sample1 sample2 sample3 
chr start end a1 b1 c1 a2 b2 c2 a3 b3 c3 ....
chr start end ....

If second field in file1 matches B or D, then extract columns with that sample.

Example of output:
(no header first 3 cols)
Code:
sample2 sample3 sample6 sample7
chr start end a2 b2 c2 a3 b3 c3 a6 b6 c6 a7 b7 c7 
chr start end ....

Thanks,

---------- Post updated at 08:06 PM ---------- Previous update was at 06:32 PM ----------

thanks for editing my post and write it in code tags
I don't understand your sample. Your "another file" is described two ways:
Code:
(with header, multiple rows and cols, each sample has 3 values a,b,c)

and
Code:
(no header first 3 cols)

and I don't see any way to turn:
Code:
sample2 B
sample3 B
sample6 D
sample7 D

into:
Code:
sample2 sample3 sample6 sample7
chr start end a2 b2 c2 a3 b3 c3 a6 b6 c6 a7 b7 c7 
chr start end ....

nor why you describe this as:
Code:
(no header first 3 cols)

when the 1st line sort of looks like a header and the 2nd line has 15 columns.

The .... doesn't appear to be a literal string, but I can't figure out what additional data you want if the .... is intended to be an ellipsis indicating that more data is to follow.

Please try again to clearly explain how the input files are to be transformed into the output you want.
This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 04-22-2013
Yes, I also find it hard to understand exactly what you are trying to do. Smilie Please try again.
This User Gave Thanks to hanson44 For This Post:
# 4  
Old 04-23-2013
Hope now is clearer
# 5  
Old 04-23-2013
Yes, it is much clearer now.

I think the following shell and awk script does what you want. It was written and tested using a Korn shell, but should work with any other POSIX conforming shell (such as bash). If you are using a Solaris/SunOS system, change the script to use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk:
Code:
#!/bin/ksh
# Usage: tester key...
# This script will look for any "key" given on the command line as a value in
# field 2 of File1 and then extract data from File2 based on the headings
# specified by field 1 of File1.
if [ $# -lt 1 ]
then    echo "Usage; ${0##*/} key..." >&2
        exit 1
fi
errf=stderr.$$
if ! printf "%s\n" "$@" | awk -v errf="$errf" '
FNR == 1 {
        # Increment input file count.
        # f == 1 => List of keys.
        # f == 2 => File1
        # f == 3 => File2
        f++
}
f == 1 {keys[$0]        # Save the keys
        next
}
f == 2 && $2 in keys {  # Save headers to be extracted from File2.
        headers[$1]
        next
}
f == 3 && FNR == 1 {    # 1st line of File2; create list of sets of fields to
                        # extract from remaining lines
        fields[1]       # Always keep 1st three fields
        fields[2]
        fields[3]
        for(i = 1; i<= NF; i++) {
                if($i in headers) {
                        # We have a matching header, add corresonding fields to
                        # the extraction list.
                        fields[3 * i + 1]
                        fields[3 * i + 2]
                        fields[lf = 3 * i + 3]  # Save last field #.
                        printf("%s%s", nh++ ? " " : "", $i)
                }
        }
        if(nh == 0) {   # No headers found.  Print diagnostic message.  It is
                        # being saved in a file since there is no portable way
                        # to direct output from awk to stderr.  The calling
                        # shell script will copy the message to stderr and add
                        # the script name to the start of the diagnostic.
                print ": Headings identified by keys not found in File2." > errf
                exit 1
        }
        print ""        # Add a trailing newline to the output file headings.
        next
}
{       for(i = 1; i <= NF; i++)        # Print selected fields from remaining
                if(i in fields) # lines in File2.
                        printf("%s%s", $i, i == lf ? "\n" : " ")
}' - File1 File2 > File3
then    # awk failed, print diagnostic message.
        printf "%s" ${0##*/} >&2
        cat $errf >&2
        rm $errf
        exit 1
fi

If you save the above code in a file named tester, make it executable with:
Code:
chmod +x tester

and execute it with the desired keys (B and D) in your example:
Code:
./tester B D

or:
Code:
./tester D B

when File1 and File2 contain the data specified in your example; File3 will contain:
Code:
sample2 sample3 sample6 sample7
chr start end a2 b2 c2 a3 b3 c3 a6 b6 c6 a7 b7 c7
chr start end d2 e2 f2 d3 e3 f3 d6 e6 f6 d7 e7 f7

which matches the output you said you wanted except that there is no space at the end of the 2nd line of output here even though there was a trailing space at the end of the 2nd line of File3 in you example.

I hope the comments are sufficient to explain how it works. If it doesn't work or you still have questions, let us know what diagnostics are produced or what shell or awk commands you can't decipher.
This User Gave Thanks to Don Cragun For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

To extract certain columnns with header

Hi, I have data which has many columns but i need to extract only three column from below all records have one space in begining. i need to extract only column as below ,any tric awk or sed trick?, please advise. cloumn no : 3 : ORD NO coulmn no : 6: P A R T N U M B E R coulmn no : 9 :... (7 Replies)
Discussion started by: Riverstone
7 Replies

2. Shell Programming and Scripting

Find columns in a file based on header and print to new file

Hello, I have to fish out some specific columns from a file based on the header value. I have the list of columns I need in a different file. I thought I could read in the list of headers I need, # file with header names of required columns in required order headers_file=$2 # read contents... (11 Replies)
Discussion started by: LMHmedchem
11 Replies

3. UNIX for Beginners Questions & Answers

Keep only columns in first two rows based on partial header pattern.

I have this code below that only prints out certain columns from the first two rows (doesn't affect rows 3 and beyond). How can I do the same on a partial header pattern “G_TP” instead of having to know specific column numbers (e.g. 374-479)? I've tried many other commands within this pipe with no... (4 Replies)
Discussion started by: aachave1
4 Replies

4. Emergency UNIX and Linux Support

Average columns based on header name

Hi Friends, I have files with columns like this. This sample input below is partial. Please check below for main file link. Each file will have only two rows. ... (8 Replies)
Discussion started by: jacobs.smith
8 Replies

5. Shell Programming and Scripting

Add new columns based on existing columns

Hi all, I am kind of stuck with printing my desired output. Please help me if you know how it can work. My input file(tab separated): NW_0068.1 41,16 100,900 NW_0699.1 4,2,19 200,700,80 My Output file (desired): NW_0068.1 41,16 100,900 100 - 141 NW_0068.1 41,16 100,900 ... (3 Replies)
Discussion started by: sam_2921
3 Replies

6. Shell Programming and Scripting

Make copy of text file with columns removed (based on header)

Hello, I have some tab delimited text files with a three header rows. The headers look like, (sorry the tabs look so messy). index group Name input input input input input input input input input input input... (9 Replies)
Discussion started by: LMHmedchem
9 Replies

7. UNIX for Dummies Questions & Answers

Extract columns based on the first line of each column

Sorry to bother you guys again. I have a file1 with multiple columns like this:gga_miR_100 gga_miR_300 gga_miR_3500 gga_miR_4600 gga_miR_5600 gga_miR_30 gga_miR_500 kj rwg ghhh jy jyu we vf 5g 5hg h6 56h i8 45t 44r4 4bg 4r546 9lgtr (fer) 4fr f433 3feev f4 bf4 35g vfr ge 2rr ... (5 Replies)
Discussion started by: yuejian
5 Replies

8. Shell Programming and Scripting

Compare 2 csv files by columns, then extract certain columns of matcing rows

Hi all, I'm pretty much a newbie to UNIX. I would appreciate any help with UNIX coding on comparing two large csv files (greater than 10 GB in size), and output a file with matching columns. I want to compare file1 and file2 by 'id' and 'chain' columns, then extract exact matching rows'... (5 Replies)
Discussion started by: bkane3
5 Replies

9. UNIX for Dummies Questions & Answers

Printing columns with header

Hi Gurus, I want to extract certain columns from file 2 and combine with file 1. I am using the following script to extract the columns. $ awk 'FNR>1{print $2, $9, FILENAME}' *.lim > out1 However, this script does not print the titles of the columns 2 and 9. Can somebody help me in... (1 Reply)
Discussion started by: Unilearn
1 Replies

10. Shell Programming and Scripting

Extract columns where header matches a given string

Hi, I'm having trouble pulling out columns where the headers match a file of key ID's I'm interested in and was looking for some help. file1.txt I Name 34 56 84 350 790 1215 1919 7606 9420 file2.txt I Name 1 1 2 2 3 3 ... 34 34... 56 56... 84 84... 350 350... M 1 A A A A... (20 Replies)
Discussion started by: flotsam
20 Replies
Login or Register to Ask a Question