Find columns in a file based on header and print to new file
Hello,
I have to fish out some specific columns from a file based on the header value. I have the list of columns I need in a different file. I thought I could read in the list of headers I need,
and then loop through the list to pick out each column I need,
The above awk does not work and even if it did it would overwrite the data from each previous column found. How do I find all the columns I need and then print all of them in the right order so they all end up in the output file?
The only thing I could think of was to read the header line from $input_file into another array and then loop through $headers_list making a note of the numerical position of the columns I need. In theory, I could use the list of numerical positions to cobble together a cut argument to get the columns I need. That seems like it would be horribly messy syntax and could probably be done with one line of awk from someone who knows what they are doing.
That means it's time to post and ask for help. I found a lot of topics like this one, but most of them seemed to find one column by the header value and print it.
In case it makes a difference, the input files I am working with have < 200 columns but may have almost any number of rows. The input file is space delimited and the output should be tab delimited, though I could replace space with tab after the fact if necessary.
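The position-lookup idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample data and the temporary directory are made up, and the file names input_file and headers_list simply mirror the variable names used in the post. One caveat: cut writes fields in the order they appear in the file, whatever the order of the -f list, so this route cannot reorder columns.

```shell
#!/bin/sh
# Made-up sample data; only the file names echo the thread.
workdir=$(mktemp -d)
printf '%s\n' 'id name mw logp' 'a1 foo 10 1.5' 'a2 bar 20 2.5' > "$workdir/input_file"
printf '%s\n' 'logp' 'id' > "$workdir/headers_list"

# Build a comma-separated list of column positions, one per wanted header.
positions=$(awk '
    NR == FNR { want[++n] = $1; next }          # first file: wanted header names
    FNR == 1 {                                  # second file: header line only
        for (i = 1; i <= NF; i++) pos[$i] = i
        for (j = 1; j <= n; j++)
            if (want[j] in pos)
                list = (list == "" ? "" : list ",") pos[want[j]]
        print list
        exit
    }
' "$workdir/headers_list" "$workdir/input_file")

# cut keeps the column order of the file, so "4,1" still prints id before logp.
cut_output=$(cut -d ' ' -f "$positions" "$workdir/input_file" | tr ' ' '\t')
printf '%s\n' "$cut_output"
rm -rf "$workdir"
```

With this sample the position list comes out as 4,1, yet the output has id first and logp second, which is why the cut route only works when the file order is acceptable.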
Note: The order in for (i in a) is arbitrary, so it cannot be used reliably to preserve order. An alternative would be to use a for(i=min;i<=max;i++) loop.
for example:
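A minimal sketch of the indexed-loop idea, not the code originally posted here. It assumes a file headers_file with one wanted header name per line and a whitespace-delimited input_file whose first line holds the headers; the sample contents and the temporary directory are made up for illustration.

```shell
#!/bin/sh
# Made-up sample data; only the file names echo the thread.
workdir=$(mktemp -d)
printf '%s\n' 'id name mw logp' 'a1 foo 10 1.5' 'a2 bar 20 2.5' > "$workdir/input_file"
printf '%s\n' 'logp' 'id' > "$workdir/headers_file"

ordered_output=$(awk '
    BEGIN { OFS = "\t" }
    NR == FNR { want[++n] = $1; next }      # first file: wanted names, in order
    FNR == 1 {                              # header line: map each name to its position
        for (i = 1; i <= NF; i++) pos[$i] = i
        for (j = 1; j <= n; j++)
            if (want[j] in pos) col[++m] = pos[want[j]]
    }
    {                                       # every line, header included
        for (k = 1; k <= m; k++)
            printf "%s%s", $(col[k]), (k < m ? OFS : ORS)
    }
' "$workdir/headers_file" "$workdir/input_file")

printf '%s\n' "$ordered_output"
rm -rf "$workdir"
```

Walking col[1] to col[m] with a counter, rather than for (k in col), is what keeps the output in the order of the header list. Headers that are not found in the input are silently skipped in this sketch.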
Hi.
From my perspective, this is a csv manipulation problem. Consequently, a simple csv-aware tool seems appropriate. The dataset is transformed to csv format, and the header-name lines are collected into a csv-like string. The named columns are extracted, and the file is converted from csv format to TAB-separated format -- which the OP required.
Collecting these all together in a script and using dataset from ripat:
producing:
The command csvtool can be found in the Debian repository or at github as noted.
@LMHmedchem: with 300 posts, you should know that posting data samples, expected output, and your computing environment will help make replies easier and more likely to be applicable to your situation. Please do that in your future posts.
Note: The order in for (i in a) is arbitrary, so it cannot be used reliably to preserve order. An alternative would be to use a for(i=min;i<=max;i++) loop.
for example:
Thank you all for the replies.
I can't seem to get the above working.
Here is some data, sorry this is hard to read but I thought it best to leave it in its original single space delimited format.
headers_file,
desired output (in most cases, some columns in the original input will not be in output)
When I run the script above I get,
The code suggestion posted by ripat has a similar issue but I haven't posted the results here because of the comment by Scrutinizer about the order of output.
Quote:
Originally Posted by drl
@LMHmedchem: with 300 posts, you should know that posting data samples, expected output, and your computing environment will help make replies easier and more likely to be applicable to your situation. Please do that in your future posts.
I certainly should have included an example with my post, sorry about that. I am currently running this under cygwin 2.3.1 but this will also run on openSuse 13.2 x86_64.
I know that the term csv is sometimes used to refer to generic delimited text data and not just comma separated data. I stay away from comma separation because many of the fields I use (chemical names) have commas ( 1,1,4,4-tetrabutylpiperazine ). The values in the name column could also have unmatched single quotes ( N,N,N',N'-tetramethylguanidine ) or parentheses ( 1-(2-aminoethyl)piperazine ). I think that code that replaces space with comma would be problematic in my particular case. Yet another reason why an example of real data would have been useful for me to post.
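A small sketch of the concern, assuming a made-up three-field record and a naive space-to-comma conversion with no quoting (a csv-aware tool may well quote such fields, so this only shows what a plain character swap would do):

```shell
#!/bin/sh
# Hypothetical record: an ID, a chemical name containing commas, and a value.
record="c1 1,1,4,4-tetrabutylpiperazine 7"

# Split on whitespace: the name stays a single field.
space_fields=$(printf '%s\n' "$record" | awk '{ print NF }')

# Swap spaces for commas without quoting, then split on commas:
# the commas inside the name now act as delimiters too.
comma_fields=$(printf '%s\n' "$record" | tr ' ' ',' | awk -F, '{ print NF }')

# Swap spaces for tabs instead: the field count is unchanged.
tab_fields=$(printf '%s\n' "$record" | tr ' ' '\t' | awk -F'\t' '{ print NF }')

echo "space: $space_fields  comma: $comma_fields  tab: $tab_fields"
```

The three-field record turns into six fields once commas become the delimiter, while the tab-delimited version keeps three.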
LMHmedchem
Last edited by LMHmedchem; 11-25-2016 at 02:56 PM..