Random selection of subset of sample from file


 
# 1  
Old 11-29-2012

Hello,
Could you please help me find code that randomly selects 1224 lines from a file of 12240 lines and produces ten output files of 1224 lines each?
My input is a text file with 12240 lines like:
Code:
13474	999003507	0	0	2	-9
13475	999003508	0	0	2	-9
13476	999003509	0	0	1	-9
13477	999003510	0	0	1	-9

I would like 10 output files, each a random selection of 1224 lines from the input.
Thanks a lot.

Moderator's Comments:
Mod Comment Please use code tags.

Last edited by Scott; 11-29-2012 at 07:28 AM.. Reason: Code tags
# 2  
Old 11-29-2012
Code:
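awk -v hm=1224 -v ml=12240 'function randlines(howmany,maxlines,      a,i) {
 srand()                        # seed the generator from the current time
 for(i in arr) delete arr[i]    # empty the global set of chosen line numbers
 i=1
 while(i<=howmany)              # repeat until howmany distinct numbers are chosen
 {
  a=int(rand()*maxlines) + 1    # candidate line number in 1..maxlines
  if(!(a in arr))               # ignore numbers that were already chosen
  {
   i++
   arr[a]                       # referencing the element creates it
  }
 }
}
BEGIN{
randlines(hm,ml)
}
NR in arr' file                 # print only the chosen lines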
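
A possible single-pass alternative, offered only as a sketch: selection sampling keeps each line with probability (lines still needed) / (lines not yet read), which yields exactly the requested count without storing any line numbers. It assumes the total line count is known in advance. The names sample.txt and subset.txt are hypothetical, and the first command only builds demo input.

```shell
# Build a 12240-line demo input (hypothetical file name).
awk 'BEGIN { for (i = 1; i <= 12240; i++) print i }' > sample.txt

# Selection sampling: print exactly k of the n lines, in their original order.
awk -v k=1224 -v n=12240 '
BEGIN { srand() }
# Keep this line with probability k / (lines not yet read, this one included).
rand() * (n - NR + 1) < k { print; k-- }
' sample.txt > subset.txt
```

Once k reaches 0 nothing more is printed, and when the number of lines left equals k every remaining line is kept, so the output always has exactly k lines.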


Last edited by elixir_sinari; 11-29-2012 at 08:49 AM..
# 3  
Old 11-29-2012
Thank you.
Sorry, how do I execute the script? It gives an error:
Code:
awk: cmd. line:2:  srand
awk: cmd. line:2:       ^ unexpected newline or end of string

thanks

Last edited by radoulov; 11-29-2012 at 08:45 AM.. Reason: Code tags fixed.
# 4  
Old 11-29-2012
Replace
Code:
srand

with
Code:
srand()

.
# 5  
Old 11-29-2012
Hi.

See thread https://www.unix.com/unix-dummies-que...text-file.html for demonstrations of the commands shuf and rl. Some versions of sort can also arrange lines randomly; the result could then be used as input to head, for example. That might be useful for short files.
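
As a rough illustration of these tools, assuming GNU coreutils (shuf, and a sort that supports -R); sample.txt and the output names are hypothetical, and the first command only builds demo input:

```shell
# Build a 12240-line demo input (hypothetical file name).
awk 'BEGIN { for (i = 1; i <= 12240; i++) print i }' > sample.txt

# shuf -n picks 1224 distinct lines at random.
shuf -n 1224 sample.txt > subset_shuf.txt

# GNU sort -R orders lines by a random hash of the key; head keeps the first 1224.
sort -R -o shuffled.txt sample.txt
head -n 1224 shuffled.txt > subset_sort.txt
```

Note that sort -R places identical lines next to each other, so it is only a true shuffle when all lines are distinct.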

Best wishes ... cheers, drl
# 6  
Old 11-29-2012
Quote:
Originally Posted by elixir_sinari
Replace
Code:
srand

with
Code:
srand()

.
It works fine, thanks.
However, I need to run the code 10 times to split the file into 10 files, and the 10 files I get overlap. Could you please modify it to produce 10 random output files with no overlap between them?

Last edited by biopsy; 11-29-2012 at 09:54 AM..
# 7  
Old 11-29-2012
Code:
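awk 'BEGIN{srand()}
# at the start of each block of 10 lines, pick a random starting file number 1..10
NR%10 == 1{i=int(rand()*10+1)}
# write the line to file_<i>, wrapping from 10 back to 1
{i=(i>10)?1:i;print > (FILENAME "_" i++)}' file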

Is this OK?
It is not very random, though: each output file gets exactly one line from every block of 10 consecutive input lines.
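
If a fully random, non-overlapping split is needed, one possible sketch is to shuffle the whole file once and then cut it into consecutive 1224-line pieces. This is an assumption-laden illustration, not a tested replacement: it tags every line with a random key, sorts on that key, strips the key, and hands the result to split. The names sample.txt and random_part_ are hypothetical.

```shell
# Build a 12240-line demo input (hypothetical file name).
awk 'BEGIN { for (i = 1; i <= 12240; i++) print i }' > sample.txt

# Prefix each line with a random key, sort by the key, drop the key,
# then write consecutive chunks of 1224 lines to random_part_aa, _ab, ...
awk 'BEGIN { srand() } { printf "%.15f\t%s\n", rand(), $0 }' sample.txt |
  sort -n |
  cut -f2- |
  split -l 1224 - random_part_
```

Because every input line appears exactly once in the shuffled stream, the 10 output files cannot overlap and together they contain the whole input.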

Last edited by elixir_sinari; 11-29-2012 at 10:47 AM..