Selecting random columns from large dataset in UNIX

07-05-2015

Dear folks

I have a large data set containing 400K columns, and I want to select 50K specific columns out of it. Is there a UNIX command that can do this for me? I should also mention that I have stored the IDs of all the wanted columns in one file, which may help in selecting them from the full 400K.

Regards
Saj

sajmar

07-05-2015

What operating system are you using?

Your large dataset clearly is not a text file. What type of file is it?

What delimits columns in your dataset?

What separates records in your dataset?

What is the format of column IDs?

What is the format of the file containing column IDs?

Don Cragun

07-05-2015

1. What operating system are you using?
Linux
2. Your large dataset clearly is not a text file. What type of file is it?
ASCII text
3. What delimits columns in your dataset?
One space delimits columns
4. What separates records in your dataset?
One space between each record
5. What is the format of column IDs?
All of the columns contain 0, 1 or 2
6. What is the format of the file containing column IDs?
Integer

sajmar

07-06-2015

I am making a lot of assumptions here, since your request is not clear at all.

If your data is space delimited, and the columns you want to extract are listed in a file with one column name per line, this is worth a try.

Save the following as selectcols.sh:

Code:

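#!/bin/bash
# Usage: selectcols.sh [datafile] [listfile]
# Assumes the first line of datafile is a header of column names.

dlf=${1:-data.txt}   # data file
clf=${2:-list.txt}   # wanted column names, one per line

awk  -v colsFile="$clf" '
   BEGIN {
     # Read the wanted column names in order
     j=1
     while ((getline < colsFile) > 0) {
        col[j++] = $1
     }
     n=j-1;
     close(colsFile)
     # s maps a column name to its rank in the list
     for (i=1; i<=n; i++) s[col[i]]=i
   }
   NR==1 {
     # Header line: c maps list rank to field number
     for (f=1; f<=NF; f++)
       if ($f in s) c[s[$f]]=f
     next
   }
   { sep=""
     for (f=1; f<=n; f++) {
       # %s prints the whole separator; %c would print only its first character
       printf("%s%s",sep,$c[f])
       sep=FS
     }
     print ""
   }
' "$dlf"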

Run it as follows, after adding the paths to the script and files:

Code:

selectcols.sh datafile listofcolsfile
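
For reference, the script above appears to assume that the first line of the data file is a header of column names and that the list file holds names rather than positions. It matches the names against that header and does not print the header line itself. A condensed sketch of the same idea follows; the function name select_named_cols and the column names snp1 to snp4 mentioned below are made up for illustration.

```shell
#!/bin/sh
# select_named_cols DATAFILE LISTFILE   (hypothetical helper, not from the thread)
# DATAFILE: space-delimited text whose first line is a header of column names.
# LISTFILE: one wanted column name per line.
select_named_cols() {
    awk '
        # First input (the list file): record the wanted names in order.
        FNR == NR { if (NF) name[++n] = $1; next }
        # Header line of the data file: map each column name to its field number.
        FNR == 1 { for (f = 1; f <= NF; f++) pos[$f] = f; next }
        # Data lines: print the wanted fields; names absent from the header are skipped.
        {
            sep = ""
            for (i = 1; i <= n; i++)
                if (name[i] in pos) { printf "%s%s", sep, $(pos[name[i]]); sep = OFS }
            print ""
        }
    ' "$2" "$1"
}
```

With a data file whose header is "snp1 snp2 snp3 snp4" and a list file containing snp3 and snp1, this would print the third and first column of every data row, in that order. A data file without a header line, or a list of positions instead of names, would not fit this layout.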

senhia83

07-20-2015

To senhia83:

Thanks for your awk script suggestion. But after running the code you mentioned, I did not get my expected result.

Suppose my "datafile" is:

Code:

1 2 1 0 2 0 1 0
2 2 2 1 1 1 0 0
1 1 0 0 0 2 2 2
2 2 2 1 1 0 0 0

and my "listofcolsfile" is:

Code:

1
4
8

my desired output is:

Code:

1 0 0
2 1 0
1 0 2
2 1 0

Regards
SAJ

sajmar

07-20-2015

Try

Code:

awk 'FNR==NR {C[++j]=$1;next} {for (i=1;i<=j;i++) printf "%s ", $C[i]; printf "\n"}' file2 file1
1 0 0 
2 1 0 
1 0 2 
2 1 0
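
A variant of the one-liner above, sketched under the assumptions stated in the thread: the ID file holds one 1-based column position per line and the data is space delimited. It writes the separator between fields rather than after each one, so the output lines carry no trailing blank. The function name select_cols is hypothetical.

```shell
#!/bin/sh
# select_cols IDFILE DATAFILE   (hypothetical helper, not from the thread)
# Print the columns of DATAFILE whose 1-based positions are listed in IDFILE
# (one integer per line), in the order they appear in IDFILE.
select_cols() {
    awk '
        # First input (the ID file): remember each wanted position in order.
        FNR == NR { if (NF) want[++n] = $1; next }
        # Data lines: print the wanted fields, with OFS between them and
        # ORS (a newline by default) after the last one.
        {
            for (i = 1; i <= n; i++)
                printf "%s%s", $(want[i]), (i < n ? OFS : ORS)
        }
    ' "$1" "$2"
}
```

Called as select_cols file2 file1 on the sample files from this thread, it prints the same four lines as above (1 0 0, 2 1 0, 1 0 2, 2 1 0). Printing field by field keeps memory use per line small, which may matter with 50K selected columns, though this has not been measured here.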

RudiC
