How to select lines randomly without replacement in UNIX?


 
# 8  
Old 09-13-2016
Quote:
Originally Posted by sajmar
Thank you Corona688 for your suggestion. I only presented a small example. In my case, I want to randomly choose 5000 lines out of 15000 lines. What should I do in this situation?
Hello sajmar,

Could you please try the following and let me know if it helps?
1st solution:
Code:
cat script.ksh
file=Input_file
lines_in_file=$(wc -l < "$file")
# RANDOM is 0..32767, so divide by 32768 to keep the line number within 1..lines_in_file
for val in $(seq 1 5000);  do line=$((RANDOM * lines_in_file / 32768 + 1)); sed "${line}q;d" "$file" >> output; done

OR

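cat script.ksh
file=Input_file
lines_in_file=$(wc -l < "$file")
for val in $(seq 1 5000)
do
	# RANDOM is 0..32767, so divide by 32768 to stay within 1..lines_in_file
	line=$((RANDOM * lines_in_file / 32768 + 1))
	# print only line number $line, then quit
	sed "${line}q;d" "$file" >> output
done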

2nd solution:
Code:
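cat script.ksh
file=Input_file
lines_in_file=$(wc -l < "$file")
# RANDOM is 0..32767, so divide by 32768 to keep the line number within 1..lines_in_file
for val in $(seq 1 5000);  do line=$((RANDOM * lines_in_file / 32768 + 1)); awk -v line="$line" 'FNR==line {print; exit}' "$file" >> output; done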

OR

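cat script.ksh
file=Input_file
lines_in_file=$(wc -l < "$file")
for val in $(seq 1 5000)
do
        # RANDOM is 0..32767, so divide by 32768 to stay within 1..lines_in_file
        line=$((RANDOM * lines_in_file / 32768 + 1))
        # print only line number $line, then stop reading
        awk -v line="$line" 'FNR==line {print; exit}' "$file" >> output
done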

Here $RANDOM is responsible for generating the random numbers, which range from 0 to 32767.
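
As a rough illustration of how a value in the 0 to 32767 range can be scaled to a line number, here is a minimal sketch. The helper name scale_random is hypothetical, and the draw is passed in as an argument (instead of reading $RANDOM directly) so that the edge cases can be checked by hand. It suggests that dividing by 32768 keeps the result within 1..total, while dividing by 32767 can overshoot by one line.

```shell
# scale_random DRAW TOTAL
# Map one $RANDOM-style draw (0..32767) to a line number in 1..TOTAL.
# scale_random is a hypothetical helper name used only for this sketch.
scale_random() {
    echo $(( $1 * $2 / 32768 + 1 ))
}

scale_random 0 15000                    # smallest possible draw -> 1
scale_random 32767 15000                # largest possible draw  -> 15000
echo $(( 32767 * 15000 / 32767 + 1 ))   # dividing by 32767 instead -> 15001
```

Since $RANDOM only has 32768 distinct values, this kind of scaling presumably cannot reach every line of a file longer than that, and it does nothing to prevent the same line being drawn twice.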
EDIT: The two solutions above can select the same line more than once, so the following may help instead.
Code:
awk 'FNR==NR {A[$1]; next} {if (FNR in A) print}' <(shuf -i 1-15000 -n5000) Input_file > output
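
The same idea can probably be written without process substitution and without hard-coding the line count. This is only a sketch: pick_lines is a hypothetical function name, it assumes GNU shuf is installed, and it assumes COUNT is at least 1 (with an empty list of line numbers the FNR==NR test would match the data file instead).

```shell
# pick_lines FILE COUNT
# Print COUNT distinct, randomly chosen lines of FILE in their original order.
# Hypothetical helper; assumes GNU shuf and COUNT >= 1.
pick_lines() {
    total=$(wc -l < "$1")
    # shuf -i emits COUNT distinct line numbers; awk reads them from stdin ("-")
    # first, remembers them, then prints the matching lines of FILE.
    shuf -i "1-$total" -n "$2" |
        awk 'FNR==NR {want[$1]; next} FNR in want' - "$1"
}
```

For the case discussed here it would be called as: pick_lines Input_file 5000 > output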

Thanks,
R. Singh

Last edited by RavinderSingh13; 09-13-2016 at 03:39 PM. Reason: Added one more solution.
# 9  
Old 09-14-2016
Thanks to all of you for your suggestions, but they still do not meet my requirement. As I said, I have a file with 15000 lines and I want to select 5000 lines five times. However, each of these five times, I want a different set of 5000 selected lines. In other words, I am looking for five different sets of 5000 randomly selected lines from the whole set of 15000.
# 10  
Old 09-15-2016
Quote:
Originally Posted by sajmar
Thanks to all of you for your suggestions, but they still do not meet my requirement. As I said, I have a file with 15000 lines and I want to select 5000 lines five times. However, each of these five times, I want a different set of 5000 selected lines. In other words, I am looking for five different sets of 5000 randomly selected lines from the whole set of 15000.
Actually, your specification has never been clear. First, you wanted two 3-line output files from a 10-line input file with no duplicates in either of the output files. Then you wanted a single 5000-line file from a 15000-line file. Then you wanted three 5000-line output files from a 15000-line input file. And now you want five 5000-line output files from a 15000-line input file. How do you randomly select 25000 lines from a 15000-line file without replacement?

If you mean that you want five 5000-line files, each of which has lines from a 15000-line file with no replacement within any one of the five output files, why doesn't:
Code:
shuf < 15000LineFile | head -n 5000 > 5000LineFile

give you what you want (or to get 5 output files):
Code:
for i in 1 2 3 4 5
do	shuf < 15000LineFile | head -n 5000 > 5000LineFile$i
done

And, of course, Corona688's suggestion would have given you three 5000-line files with no duplicates from your 15000-line file, and a second run would give you three more 5000-line files to choose from...
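
Corona688's post is not quoted here, so the following is only a guess at the kind of approach meant: shuffle the file once and cut the result into consecutive chunks, so that no input line lands in more than one output file. The function name split_disjoint and the output prefix are hypothetical, and GNU shuf and split are assumed.

```shell
# split_disjoint INPUT LINES_PER_FILE PREFIX
# Shuffle INPUT once, then cut the shuffled stream into chunks of
# LINES_PER_FILE lines named PREFIXaa, PREFIXab, ...
# Hypothetical helper; each input line ends up in exactly one chunk.
split_disjoint() {
    shuf "$1" | split -l "$2" - "$3"
}
```

With a 15000-line input, split_disjoint 15000LineFile 5000 part. would be expected to create part.aa, part.ab and part.ac, each holding 5000 lines.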

But, of course, all of these assume that there are no duplicated lines in 15000LineFile (or if there are duplicates, you don't mind them being duplicated in one of your output files as long as there aren't more than N duplicates in an output file if there are N duplicates in your input file). Is there a chance for duplicated lines in your input file? If so, do those duplicates have to be removed before creating output files?

If we had a clearer specification of how lines in one of the output files are related to lines in other output files and whether or not there could be duplicated lines in the input file (and, if so, how they are to be handled), all of the output files could be created by a single invocation of awk.
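
Pending that clarification, here is one way a single awk invocation might look. It is a sketch under stated assumptions: each output file only has to be free of repeats within itself (lines may recur across files), duplicate input lines get no special handling, and the whole input fits in memory. The name sample_sets and its arguments are hypothetical.

```shell
# sample_sets INPUT K SETS PREFIX
# Write SETS files named PREFIX1, PREFIX2, ..., each holding K lines drawn
# from INPUT without replacement. Hypothetical helper for illustration.
sample_sets() {
    awk -v k="$2" -v sets="$3" -v prefix="$4" '
        { line[NR] = $0 }                  # keep every input line in memory
        END {
            srand()                        # seed from the current time
            if (k > NR) k = NR             # cannot draw more lines than exist
            for (s = 1; s <= sets; s++) {
                out = prefix s
                # Partial Fisher-Yates shuffle: settle positions 1..k only.
                for (i = 1; i <= k; i++) {
                    j = i + int(rand() * (NR - i + 1))   # j is in i..NR
                    tmp = line[i]; line[i] = line[j]; line[j] = tmp
                    print line[i] > out
                }
                close(out)
            }
        }
    ' "$1"
}
```

For the case in this thread the call would be: sample_sets 15000LineFile 5000 5 5000LineFile. Because srand() with no argument seeds from the clock, two runs within the same second may produce identical output in some awk implementations.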

Knowing what operating system and shell you're using would also help for several possible script suggestions.
# 11  
Old 09-15-2016
Quote:
Originally Posted by sajmar
Thanks to all of you for your suggestions, but they still do not meet my requirement. As I said, I have a file with 15000 lines and I want to select 5000 lines five times. However, each of these five times, I want a different set of 5000 selected lines. In other words, I am looking for five different sets of 5000 randomly selected lines from the whole set of 15000.
5 * 5000 = 25000. There will unavoidably be duplicates.

If you don't care about duplicates, it's easy to create as many 5000-line shuffles as you want.

Code:
for N in 1 2 3 4 5
do
        shuf < inputfile | head -n 5000 > output.$N
done
