Selecting random columns from large dataset in UNIX


 
# 1  
Old 07-05-2015

Dear folks

I have a large data set containing 400K columns, and I want to select 50K specific columns out of those 400K. Is there a Unix command that can do this for me? I should also mention that I have stored the column IDs in one file, which may help in selecting those columns from the whole 400K.

Regards
Saj
# 2  
Old 07-05-2015
What operating system are you using?

Your large dataset clearly is not a text file. What type of file is it?

What delimits columns in your dataset?

What separates records in your dataset?

What is the format of column IDs?

What is the format of the file containing column IDs?
# 3  
Old 07-05-2015
1. What operating system are you using?
Linux
2. Your large dataset clearly is not a text file. What type of file is it?
ASCII text
3. What delimits columns in your dataset?
One space delimits columns
4. What separates records in your dataset?
One space between each record
5. What is the format of column IDs?
All of the columns contain 0, 1 or 2
6. What is the format of the file containing column IDs?
Integer
# 4  
Old 07-06-2015
Lots of assumptions here, since your request is not clear at all.

If your data is space delimited, its first line holds the column names, and the columns you want to extract are listed in a file with one column name per line, this is worth a try.

Save the following as selectcols.sh

Code:
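#!/bin/bash

# Data file and column-list file; defaults apply when no arguments are given.
dlf=${1:-data.txt}
clf=${2:-list.txt}

awk  -v colsFile="$clf" '
   BEGIN {
     # Read the wanted column names, one per line, in order.
     j=1
     while ((getline < colsFile) > 0) {
        col[j++] = $1
     }
     n=j-1;
     close(colsFile)
     # s maps each column name to its position in the output.
     for (i=1; i<=n; i++) s[col[i]]=i
   }
   NR==1 {
     # Header line: record which input field holds each wanted name.
     for (f=1; f<=NF; f++)
       if ($f in s) c[s[$f]]=f
     next
   }
   { # Data lines: print the wanted fields in list order.
     sep=""
     for (f=1; f<=n; f++) {
       printf("%s%s",sep,$c[f])
       sep=FS
     }
     print ""
   }
' "$dlf"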


Run it as follows, adding paths to the script and files as needed:
Code:
selectcols.sh datafile listofcolsfile

# 5  
Old 07-20-2015
To senhia83:

Thanks for your awk script suggestion. But after running the code you posted, I did not get the result I expected.

Suppose my "datafile" is:
Code:
1 2 1 0 2 0 1 0
2 2 2 1 1 1 0 0
1 1 0 0 0 2 2 2
2 2 2 1 1 0 0 0

and my "listofcolsfile" is:
Code:
1
4
8

my desired output is:
Code:
1 0 0
2 1 0
1 0 2
2 1 0


Regards
SAJ
# 6  
Old 07-20-2015
Try
Code:
awk 'FNR==NR {C[++j]=$1;next} {for (i=1;i<=j;i++) printf "%s ", $C[i]; printf "\n"}' file2 file1
1 0 0 
2 1 0 
1 0 2 
2 1 0
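
The same idea can be wrapped in a small shell function that joins the selected fields with the output separator, so no trailing space is printed. This is only a sketch under assumptions: the function name select_cols is made up for illustration, the list file is assumed to be non-empty with one 1-based column number per line, and file1/file2 are the sample files from this thread.

```shell
#!/bin/sh
# select_cols LISTFILE DATAFILE   (hypothetical helper name)
# Prints the columns of DATAFILE whose 1-based numbers appear in LISTFILE,
# one number per line, in the order given by LISTFILE.
select_cols() {
    awk '
        # First file: remember the wanted column numbers in order.
        FNR == NR { want[++n] = $1; next }
        # Second file: build each output line from the wanted fields.
        {
            out = $(want[1])
            for (i = 2; i <= n; i++)
                out = out OFS $(want[i])
            print out
        }
    ' "$1" "$2"
}

# Demo with the sample data from this thread.
tmp=$(mktemp -d)
printf '%s\n' '1 2 1 0 2 0 1 0' '2 2 2 1 1 1 0 0' \
              '1 1 0 0 0 2 2 2' '2 2 2 1 1 0 0 0' > "$tmp/file1"
printf '%s\n' 1 4 8 > "$tmp/file2"
select_cols "$tmp/file2" "$tmp/file1"
rm -rf "$tmp"
```

If the output columns may stay in their original file order, cut -d ' ' -f "$(paste -sd, file2)" file1 is a possible alternative. Note that cut always emits fields in file order, whatever the order of the list.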
