How to randomly select lines from a text file | Unix Linux Forums | UNIX for Dummies Questions & Answers

    #1  
Old 10-23-2012
evelibertine evelibertine is offline
Registered User
 
Join Date: May 2011
Last Activity: 22 August 2014, 5:12 AM EDT
Posts: 193
Thanks: 94
Thanked 0 Times in 0 Posts
How to randomly select lines from a text file

I have a text file with 1000 lines. I want to randomly select 200 lines from it and print them as output. How do I go about doing that? Thanks!
    #2  
Old 10-23-2012
Corona688 Corona688 is online now Forum Staff  
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 30 October 2014, 1:31 PM EDT
Location: Saskatchewan
Posts: 19,732
Thanks: 830
Thanked 3,372 Times in 3,159 Posts
Your time would be better spent learning awk than asking dozens of questions which could be trivially answered in it.


Code:
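awk 'NR==FNR { B++; next }       # First pass: count the lines
(NR != FNR) && (!Z) {            # Start of second pass: choose line numbers once
        srand();
        if(B < 200) exit(1); # Too few lines
        for(N=1; N<=200; )
        {
                V=sprintf("%d", (rand()*B)+1)+0;   # Random integer in 1..B
                if(!(V in A)) { A[V]=1; N++ }      # Count only new line numbers
        }
        Z=1
} FNR in A' inputfile inputfile  # FNR, not NR: NR keeps counting into the second pass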


Last edited by Corona688; 10-25-2012 at 11:34 AM.. Reason: Fixed typos
    #3  
Old 10-23-2012
gary_w gary_w is offline
Registered User
 
Join Date: Oct 2010
Last Activity: 28 October 2014, 11:01 AM EDT
Posts: 446
Thanks: 32
Thanked 96 Times in 88 Posts
For the fun of it, here's another way that does not use awk, although the awk version will be more efficient. This one has the overhead of creating the pipeline repeatedly, which should be avoided as a matter of good practice. Also, I believe the ksh RANDOM built-in has a limit of 32767, which must be considered if the file is large.

Code:
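$ cat x
#!/bin/ksh
##
## x nbr_of_lines_wanted  filename
##
## Lines are picked independently, so the same line can be printed more than once.

iterations=$1
file="$2"

lines_avail=$(wc -l < "$file")

while (( iterations > 0 )); do
  # RANDOM is 0..32767, so this yields a line number in 1..lines_avail
  head -n $(( RANDOM % lines_avail + 1 )) "$file" | tail -n 1
  (( iterations = iterations - 1 ))
done

exit 0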

This is actually a good example of how a seemingly simple solution for a small file can end up burning you on performance and system limitations should you need to run it on a much larger file or a system that may see increased load in the future. Typically, when you see a long command line or pipeline like this being run a large number of times (especially a user-enterable number of times), it should be a red flag that there is most likely a more efficient way of structuring the program.
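As one illustration of a more efficient structure, here is a minimal sketch that reads the input exactly once and starts a single process. It uses reservoir sampling, which is a different technique from the scripts in this thread, and the function name reservoir_sample is made up for this example. It keeps only the wanted number of lines in memory and prints distinct lines, in no particular order.

```shell
# Hypothetical helper (not from the thread): reservoir sampling in one pass.
# Usage: reservoir_sample COUNT < file
reservoir_sample() {
    awk -v k="$1" '
        BEGIN { srand() }
        # Fill the reservoir with the first k lines.
        NR <= k { R[NR] = $0; next }
        # Afterwards, line NR replaces a random slot with probability k/NR.
        {
            j = int(rand() * NR) + 1
            if (j <= k) R[j] = $0
        }
        END {
            m = (NR < k) ? NR : k      # Short input: print what there is
            for (i = 1; i <= m; i++) print R[i]
        }
    '
}
```

For the original question this would be run as: reservoir_sample 200 < inputfile. Because it reads standard input, it also works at the end of a pipeline, where a two-pass approach cannot reread the data.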

Last edited by gary_w; 10-23-2012 at 04:11 PM..
    #4  
Old 10-24-2012
drl drl is offline Forum Advisor  
Registered Voter
 
Join Date: Apr 2007
Last Activity: 29 October 2014, 11:00 PM EDT
Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris
Posts: 1,687
Thanks: 42
Thanked 197 Times in 179 Posts
Hi.

There are a number of commonly-available utilities to do this. Here is a demonstration of two:

Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate random selection of lines with rl, shuf.

pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C rl shuf

pl " Lines selected from:"
cat data0

# Prepare data file from single line of words.
tr ' ' '\n' < data0 > data1

pl " Results from shuf, 1:"
shuf -n 3 data1

pl " Results from shuf, 2:"
shuf -n 3 data1

pl " Results from rl, 1:"
rl -c 3 data1

pl " Results from rl, 2:"
rl -c 3 data1

exit 0

producing:

Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
rl 0.2.7
shuf (GNU coreutils) 6.10

-----
 Lines selected from:
foo bar baz qux quux corge grault garble warg fred plugh xyzzy thud

-----
 Results from shuf, 1:
quux
thud
qux

-----
 Results from shuf, 2:
thud
xyzzy
quux

-----
 Results from rl, 1:
corge
qux
grault

-----
 Results from rl, 2:
thud
corge
warg

You may need to install these from your distribution repository. See man pages for details.
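Applied to the original question, the shuf approach can be sketched as below. This assumes GNU coreutils shuf is installed. The function names sample_shuf and sample_shuf_ordered are invented for this example. The second variant is one possible way to keep the chosen lines in their original file order, by numbering the lines before shuffling and sorting on that number afterwards.

```shell
# Hypothetical helpers (names are illustrative only); they assume GNU shuf.

# Print COUNT distinct random lines of FILE, in random order.
# Usage: sample_shuf COUNT FILE
sample_shuf() {
    shuf -n "$1" -- "$2"
}

# Same selection, but printed in the order the lines appear in FILE.
# Usage: sample_shuf_ordered COUNT FILE
sample_shuf_ordered() {
    awk '{ print NR "\t" $0 }' "$2" |   # Prefix each line with its line number
        shuf -n "$1" |                  # Pick COUNT numbered lines at random
        sort -n |                       # Restore file order
        cut -f2-                        # Drop the line-number prefix
}
```

For example, sample_shuf 200 inputfile prints 200 of the 1000 lines. If the file has fewer lines than requested, shuf simply prints all of them.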

See also Algorithm::Numerical::Sample on search.cpan.org if a Perl module is desirable.

Best wishes ... cheers, drl
    #5  
Old 10-24-2012
Scrutinizer Scrutinizer is online now Forum Staff  
Moderator
 
Join Date: Nov 2008
Last Activity: 30 October 2014, 1:25 PM EDT
Location: Amsterdam
Posts: 9,558
Thanks: 286
Thanked 2,429 Times in 2,176 Posts
In the order of lines in the file, without all lines in memory:

Code:
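awk '
  NR==FNR { next }                 # First pass: only count lines
  FNR==1{                          # Start of second pass
    srand()
    n=NR-1                         # Number of lines in the file
    if(n<200) exit 1               # Too few lines; the loop below would never end
    for(i=1; i<=200; i++) {
      line=0
      while(!line || line in A) line=int(rand()*n)+1
      A[line]
    }
  } 
  FNR in A
' infile infile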


In the order of the selection, with all lines in the file in memory:

Code:
awk '
  { R[NR]=$0 }                     # Keep every line in memory
  END{
    srand()
    n=NR
    if(n<200) exit 1               # Too few lines; the loop below would never end
    for(i=1; i<=200; i++) {
      line=0
      while(!line || line in A) line=int(rand()*n)+1
      A[line]
      print R[line]
    }
  } 
' infile

    #6  
Old 10-24-2012
stunn3r stunn3r is offline
Registered User
 
Join Date: Sep 2012
Last Activity: 18 October 2014, 3:21 PM EDT
Location: India
Posts: 29
Thanks: 7
Thanked 0 Times in 0 Posts

Code:
cat file | head -n 500 | tail -n 200

A very simple idea, but not for random lines: this always prints lines 301 through 500.
    #7  
Old 10-24-2012
alister alister is offline
Registered User
 
Join Date: Dec 2009
Last Activity: 11 June 2014, 8:40 PM EDT
Posts: 3,231
Thanks: 179
Thanked 973 Times in 789 Posts
Quote:
Originally Posted by Corona688 View Post

Code:
awk 'NR==FNR { B++; next }
(NR != FNR) && (!Z) {
        srand();
        if(Z < 200) exit(1); # Too few lines
        for(N=1; N<=200; N++)
        {
                V=sprintf("%d", (rand()*B)+1)+0;
                if(!(V in A)) { A[V]=1; N++ }
        }
        Z=1
} NR in A' inputfile inputfile

Looks like there are a couple of bugs there. The first, a typo of Z for B, always exits before selecting anything. The second, the for-loop post-increment expression, advances N twice for every newly selected line, so the algorithm can yield anywhere between 1 and 100 lines instead of 200.

Probably no need to fix it since Scrutinizer's first example in post #5 is a correct implementation of the same approach.

Regards,
Alister

---------- Post updated at 06:28 PM ---------- Previous update was at 06:26 PM ----------

With regard to all of the AWK suggestions: without knowing exactly how the script is to be used, it's possible that all of the recommendations are inadequate. Nearly every awk srand implementation's default seed is the number of seconds since the epoch, so successive or simultaneous runs within the same second could yield identical results. That may or may not be an issue; we don't have sufficient information to make that determination. Just a heads-up for the OP.

If it is an issue, more information would be required to determine a robust seed expression.
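Purely as an illustration, and not a seed expression derived from the OP's requirements, one common workaround is to pass srand() a seed read from the system's random device. The sketch below assumes /dev/urandom and od are available. The function names random_seed and sample_seeded are invented for this example.

```shell
# Hypothetical helpers (not from the thread); they assume /dev/urandom exists.

# Print one unsigned 32-bit integer read from /dev/urandom.
random_seed() {
    od -An -N4 -tu4 /dev/urandom | tr -d ' \n'
}

# Print COUNT distinct random lines of FILE, in file order.
# Usage: sample_seeded COUNT FILE
sample_seeded() {
    awk -v k="$1" -v seed="$(random_seed)" '
        # Reduce the seed so it fits a signed 32-bit integer in any awk.
        BEGIN { srand(seed % 2147483647) }
        NR == FNR { n++; next }          # First pass: count the lines
        FNR == 1 {                       # Start of second pass
            if (n < k) exit 1            # Too few lines
            while (c < k) {
                line = int(rand() * n) + 1
                if (!(line in A)) { A[line]; c++ }
            }
        }
        FNR in A
    ' "$2" "$2"
}
```

With this, two runs started in the same second draw from different seeds, for example sample_seeded 200 inputfile run twice in a row. Whether such a seed is robust enough still depends on how the script is used.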

Regards,
Alister