Creating subset of a file based on specific columns


 
# 1  
Old 06-16-2013

Hello Unix experts,
I need help creating a subset of a file. I know that with the cut command it is easy to select many different columns, but my data file is big and I don't want to identify the column numbers or names manually. I am looking for a way to automate this.

For example, I have a file with about 1500 columns of TRFLP intensity data.

The column names look like this:
Code:
   [1] "Sample.Name"    "Marker"         "RE"             "Dye"            "Allele.1"       "Size.1"         "Height.1"       "Peak.Area.1"    "Data.Point.1"  
  [10] "Allele.2"       "Size.2"         "Height.2"       "Peak.Area.2"    "Data.Point.2"   "Allele.3"       "Size.3"         "Height.3"       "Peak.Area.3"   
  [19] "Data.Point.3"   "Allele.4"       "Size.4"         "Height.4"       "Peak.Area.4"    "Data.Point.4"   "Allele.5"       "Size.5"         "Height.5"      
  [28] "Peak.Area.5"    "Data.Point.5"   "Allele.6"       "Size.6"         "Height.6"       "Peak.Area.6"    "Data.Point.6"   "Allele.7"       "Size.7"        
  [37] "Height.7"       "Peak.Area.7"    "Data.Point.7"   "Allele.8"       "Size.8"         "Height.8"       "Peak.Area.8"    "Data.Point.8"   "Allele.9"      
  [46] "Size.9"         "Height.9"       "Peak.Area.9"    "Data.Point.9"   "Allele.10"      "Size.10"        "Height.10"      "Peak.Area.10"   "Data.Point.10"  .....

Suppose I want to create a subset containing all the columns named Peak.Area.1, Peak.Area.2, etc. (like the Unix pattern Peak.Area.*).
How can I do that in an easy way?
Thanks a lot for the help.
Best wishes,
Mitra

Moderator's Comments:
Mod Comment edit by bakunin: please use CODE-tags for data too, it makes for a better reading. Thank you.

Last edited by bakunin; 06-16-2013 at 08:26 AM..
# 2  
Old 06-16-2013
Please post sample input (using code tags) consisting of a few columns, together with the desired output for that data.
# 3  
Old 06-16-2013
My data looks like this, for example:
Code:
   Sample.Name                      Marker  RE Dye Allele.1 Size.1 Height.1 Peak.Area.1 Data.Point.1 Allele.2 Size.2 Height.2 Peak.Area.2 Data.Point.2
1       D71I1A  _Internal_Marker_Dye_Blue_ ALU   B        0     NA       NA          NA           NA        0     NA       NA          NA           NA
2       D71I1A _Internal_Marker_Dye_Green_ ALU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA
3       D71I1A  _Internal_Marker_Dye_Blue_ BSU   B        0     NA       NA          NA           NA        0     NA       NA          NA           NA
4       D71I1A _Internal_Marker_Dye_Green_ BSU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA
5       D71I1B  _Internal_Marker_Dye_Blue_ ALU   B        0     NA       NA          NA           NA        0  55.54       20         211         1576
6       D71I1B _Internal_Marker_Dye_Green_ ALU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA
7       D71I1B  _Internal_Marker_Dye_Blue_ BSU   B        0     NA       NA          NA           NA        0     NA       NA          NA           NA
8       D71I1B _Internal_Marker_Dye_Green_ BSU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA
9       D71I1C  _Internal_Marker_Dye_Blue_ ALU   B        0     NA       NA          NA           NA        0  55.38       18         192         1554
10      D71I1C _Internal_Marker_Dye_Green_ ALU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA

And I want output like this:
Code:
  Peak.Area.1 Peak.Area.2
1           NA          NA
2           NA          NA
3           NA          NA
4           NA          NA
5           NA         211
6           NA          NA
7           NA          NA
8           NA          NA
9           NA         192
10          NA          NA

But this is just an example. I want it for a big file with over 1000 columns, so I can't specify columns 8 and 13 by hand as in this example.
Instead I want to select by name: Peak.Area.1, Peak.Area.2, Peak.Area.3, etc., something like Peak.Area.*.
Thanks,
Mitra
# 4  
Old 06-16-2013
Try:
Code:
awk 'NR==1{for (i=1;i<=NF;i++) if ($i~/^Peak\.Area/) {printf "%s ",$i;a[i+1]=1};printf "\n"}
# a[i+1]: data rows carry a leading row-number field that the header lacks
NR>1{printf "%s ",$1;for (i=2;i<=NF;i++) if (i in a) printf "%s ",$i;printf "\n"}' file
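The command above relies on the data rows having exactly one more field than the header (the row number), which is why it marks a[i+1] rather than a[i]. A slightly more general sketch, assuming whitespace-separated fields with no embedded spaces, could measure that offset from the first data row instead. The function name subset_by_header is hypothetical, and the pattern uses [.] rather than \. to sidestep escape handling in awk -v assignments.

```shell
# Hypothetical helper: keep the columns whose header name matches a regex.
# Assumes whitespace-separated fields that contain no embedded spaces.
subset_by_header() {
    pattern=$1
    file=$2
    awk -v pattern="$pattern" '
        NR == 1 {                            # remember the header names
            nhead = NF
            for (i = 1; i <= NF; i++) name[i] = $i
            next
        }
        NR == 2 {                            # first data row reveals the offset
            offset = NF - nhead              # 1 for R-style row numbers, 0 otherwise
            line = ""
            for (i = 1; i <= nhead; i++)
                if (name[i] ~ pattern) {
                    keep[++n] = i + offset
                    line = line (n > 1 ? " " : "") name[i]
                }
            print line                       # header of the subset
        }
        {
            line = (offset > 0) ? $1 : ""    # keep the row label when present
            for (k = 1; k <= n; k++)
                line = line (line == "" ? "" : " ") $(keep[k])
            print line
        }' "$file"
}
```

It could then be called as subset_by_header '^Peak[.]Area[.]' file > subset.txt.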

# 5  
Old 06-16-2013
Hello bartus11,
I am trying this...
Code:
smitra:TRFLP-RawData smitra$ awk 'NR==1{for (i=1;i<=NF;i++) if ($i~"^Peak.Area") {printf $i" ";a[i+1]=1};printf "\n"}
NR>1{printf $1" ";for (i=2;i<=NF;i++) if (i in a) printf $i" ";printf "\n"}' TRF_raw_data_reactor1.txt > test1.txt
smitra:TRFLP-RawData smitra$

or with csv file:
Code:
smitra:TRFLP-RawData smitra$ awk 'NR==1{for (i=1;i<=NF;i++) if ($i~"^Peak.Area") {printf $i" ";a[i+1]=1};printf "\n"}
NR>1{printf $1" ";for (i=2;i<=NF;i++) if (i in a) printf $i" ";printf "\n"}' TRF_raw_data_reactor1.csv > test1.txt
smitra:TRFLP-RawData smitra$

But somehow it returns an empty file.
I have no idea what I did wrong.
# 6  
Old 06-16-2013
Post output of:
Code:
head -5 TRF_raw_data_reactor1.csv | cut -c 1-100
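Alongside the head/cut check, it may also help to look at the line endings, since awk only sees separate records when lines end in LF. The following is a minimal sketch, not something posted in the thread; the function name describe_line_endings is made up for illustration.

```shell
# Hypothetical helper: count LF and CR bytes in a file.
# Many CRs and no LFs would suggest old-Mac line endings, which awk reads as one long line.
describe_line_endings() {
    file=$1
    lf=$(tr -cd '\n' < "$file" | wc -c)
    cr=$(tr -cd '\r' < "$file" | wc -c)
    # Arithmetic expansion strips the padding some wc implementations add.
    echo "LF=$((lf)) CR=$((cr))"
}
```

For example: describe_line_endings TRF_raw_data_reactor1.csv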

# 7  
Old 06-16-2013
Code:
smitra:TRFLP-RawData smitra$ head -5 TRF_raw_data_reactor1.csv | cut -c 1-100
Sample Name,Marker,RE,Dye,Allele 1,Size 1,Height 1,Peak Area 1,Data Point 1,Allele 2,Size 2,Height 2
smitra:TRFLP-RawData smitra$

I also tried:
Code:
smitra:TRFLP-RawData smitra$ awk 'NR==1{for (i=1;i<=NF;i++) if ($i~"^Peak Area") {printf $i" ";a[i+1]=1};printf "\n"}
NR>1{printf $1" ";for (i=2;i<=NF;i++) if (i in a) printf $i" ";printf "\n"}' TRF_raw_data_reactor1.txt>test1.txt

but the result is still the same.
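Two things in the output above may explain the result. First, the CSV is comma-separated and its header names contain spaces ("Peak Area 1"), so with awk's default whitespace splitting no single field can ever match "^Peak Area"; the field separator would need to be a comma. Second, head -5 printed only one line, which could mean the file uses CR-only line endings, although that is a guess from the output and not something confirmed in the thread. A sketch that handles both cases, assuming a simple CSV without quoted fields that contain commas (the function name select_columns is hypothetical):

```shell
# Hypothetical helper: print every CSV column whose header starts with a prefix.
# Assumes a simple CSV: comma-separated, no quoted fields containing commas.
select_columns() {
    prefix=$1
    file=$2
    # Turn CR into LF so old-Mac files split into lines; CRLF files just gain
    # blank lines, which the awk program skips.
    tr '\r' '\n' < "$file" | awk -F, -v OFS=, -v prefix="$prefix" '
        NF == 0 { next }                      # skip blank lines
        !seen_header {                        # first non-blank line is the header
            seen_header = 1
            for (i = 1; i <= NF; i++)
                if (index($i, prefix) == 1)   # literal prefix match, no regex
                    keep[++n] = i
        }
        {
            line = ""
            for (k = 1; k <= n; k++)
                line = line (k > 1 ? OFS : "") $(keep[k])
            print line
        }'
}
```

Usage would look like: select_columns "Peak Area" TRF_raw_data_reactor1.csv > test1.txt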