Find columns in a file based on header and print to new file


 
# 8  
Old 11-25-2016
Quote:
Originally Posted by RudiC
This works if the name column is added to the headers file:
Code:
awk '
NR == FNR       {T[$1] = NR
                 next
                }
FNR == 1        {MX = NR - 1
                 for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i
                }
                {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j == MX)?RS:"\t"
                }
' file2 file1

Thanks, that seems to work just fine.

For clarity, in your script above, "file2" is the file with the headers and "file1" is the file with the data.
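For anyone trying it, a tiny self-contained run (my sketch with made-up file names and data, not part of the original post) shows the roles of the two files:
Code:
```shell
# Hypothetical sample data: file2 lists the wanted headers, one per line;
# file1 is the space-delimited data file with a header row.
printf 'name\nage\n' > file2
printf 'id name age city\n1 alice 30 nyc\n2 bob 25 sfo\n' > file1

awk '
NR == FNR       {T[$1] = NR
                 next
                }
FNR == 1        {MX = NR - 1
                 for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i
                }
                {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j == MX)?RS:"\t"
                }
' file2 file1
```
This should print just the name and age columns, tab-separated, in the order given in file2.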

I used it like,
Code:
#!/bin/bash

# get_cols_by_header.sh 

# file with header names of required columns in required order
headers_file=$1

# space delimited file with data in columns
input_file=$2

# tab delimited output goes here
output_file='./temp_output.txt'

# use the column headers listed in headers_file to locate the columns
# in input_file and print to output_file
awk '
NR == FNR       {T[$1] = NR
                 next
                }
FNR == 1        {MX = NR - 1
                 for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i
                }
                {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j == MX)?RS:"\t"
                }
' "$headers_file"  "$input_file" > "$output_file"

To run the script,
Code:
./get_cols_by_header.sh  header_file.txt  input_file.txt

Time to extract and print 48 columns from an input file with 73 columns and 2500 rows,
Code:
real    0m0.141s
user    0m0.170s
sys     0m0.030s

so this is pretty fast.

Are there any particular size limitations on this method as far as the size of the input file, number of rows and cols, etc?

LMHmedchem
# 9  
Old 11-25-2016
I couldn't guess at the number of cols - that depends on your hardware situation (memory, swap) and possibly on system configuration (LINE_MAX(?), not sure...). Input file size or row count shouldn't matter, as there is always just one line operated upon at a time and then immediately printed.
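A quick way to convince yourself on the row-count side (a throwaway sketch with made-up file names) is to stream a generated input through the same logic and just count the output lines:
Code:
```shell
# Generate a header file and a 100000-row data file (hypothetical names).
printf 'a\nc\n' > headers
{ printf 'a b c\n'; seq 1 100000 | awk '{print $1, $1*2, $1*3}'; } > bigdata

# Memory use stays flat: awk holds only the small header table plus one
# data line at a time, never the whole file.
awk '
NR == FNR       {T[$1] = NR; next}
FNR == 1        {MX = NR - 1
                 for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i}
                {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j == MX)?RS:"\t"}
' headers bigdata | wc -l
```
That should report 100001 lines (header plus data) with no noticeable memory growth.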

Why don't you run it like
Code:
awk '...' $1 $2 ./temp_output.txt

saving all the interim variables that are used just once?

Last edited by RudiC; 11-25-2016 at 05:18 PM..
# 10  
Old 11-25-2016
Quote:
Originally Posted by RudiC
Why don't you run it like
Code:
awk '...' $1 $2 ./temp_output.txt

saving all the interim variables that are used just once?
I never thought of doing that, but it's a nice idea. Do you not need a redirect to the output file?

I am still not sure whether this will be a standalone script or part of a larger one, so I'm not yet sure where the values for the two input files and the output file will come from. For now, I wrote them out as bash variables, partly because $headers_file will make more sense to me than $1 when I look at this next week.

It's good to know about the resource limits. In theory, there could be millions of rows, but there will never be all that many columns (<100). I will test on some larger files and report back if I run into any limitations.

LMHmedchem
# 11  
Old 11-25-2016
Sorry, a lapse on my part - you of course need a redirection operator for the output.
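Putting the two points together, the trimmed-down version might look like this (a sketch wrapped in a function so it can be tried inline; the file names in the example call are hypothetical):
Code:
```shell
# Same awk, positional parameters, and the explicit redirection restored.
get_cols() {
    awk '
NR == FNR       {T[$1] = NR; next}
FNR == 1        {MX = NR - 1
                 for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i}
                {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j == MX)?RS:"\t"}
' "$1" "$2" > ./temp_output.txt
}

# Example call on throwaway files:
printf 'b\n' > h.txt
printf 'a b\n1 2\n3 4\n' > d.txt
get_cols h.txt d.txt
cat ./temp_output.txt
```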
# 12  
Old 11-26-2016
Hi.

This issue seems to be resolved, but for completeness on the sub-topic of csvtool, there are options for specifying the input and output field separators (see -t and -u below). I should have presented this version first:
Code:
#!/usr/bin/env bash

# @(#) s2       Demonstrate extraction of fields, csvtool.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dixf sed pass-fail
pe
dixf csvtool

FILE=${1-data1}
E=expected-output.txt
H=headers

pl " Input data file $FILE:"
head $FILE

pl " Header name file $H, name list:"
head $H
h1=$( paste -s -d, $H )
pe " Header list = $h1"

pl " Expected output:"
cat $E

pl " Results, extract spaced-separated fields, write TAB-spaced:"
csvtool namedcol $h1 -t " " -u TAB $FILE |
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

exit 0

produces the same output as in post #5.

In addition, there are perl modules that handle problematic embedded characters in the field strings of csv-formatted files. For example, a small script (about 15 lines of working code, surrounded by some 100 lines of supporting code for error detection, argument processing, etc., plus comments) adds quotes to fields, as in:
Code:
#!/usr/bin/env bash

# @(#) s3       Demonstrate short perl code to add quotes to csv files.

LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dixf
pe
dixf csvaq

FILE=${1-data3}

p=$( basename $0 ) t1="$Revision: 1.14 $" v=${t1//[!0-9.]/}
[[ $# -gt 0 ]] && [[ "$1" =~ -version ]] &&  { echo "$p (local) $v" ; exit 0 ; }

pl " Input data file $FILE:"
head $FILE

pl " Results, always add quotes:"
csvaq -q -t " " -u " " $FILE

exit 0

producing:
Code:
$ ./s3

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
dixf (local) 1.21

csvaq   Add quotes for Comma Separated Value format files. (what)
Path    : ~/bin/csvaq
Version : 1.3
Length  : 165 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Help    : probably available with [     ]-h,--help
Modules : (for perl codes)
 warnings       1.23
 strict 1.08
 diagnostics    1.34
 English        1.09
 Carp   1.3301
 Data::Dumper   2.151_01
 Getopt::Long   2.42
 feature        1.36_01
 experimental   0.007
 Text::CSV      1.32

-----
 Input data file data3:
cn0 name
cn,1 1,1,4,4-tetrabutylpiperazine
cn,2 N,N,N',N'-tetramethylguanidine
cn,3 1-(2-aminoethyl)piperazine

-----
 Results, always add quotes:
"cn0" "name"
"cn,1" "1,1,4,4-tetrabutylpiperazine"
"cn,2" "N,N,N',N'-tetramethylguanidine"
"cn,3" "1-(2-aminoethyl)piperazine"

Best wishes ... cheers, drl