A simpler way to do this (save a list of files based on part of their name)

08-01-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

If you just drop the eval you'll probably get the array you want, but creating an array sounds like an unnecessary complication for what you seem to want to do. If you give us a concrete example showing us exactly what you want to do, we can probably show you an easier way to do it just using a while read loop or a straight command substitution in a cp command line.
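For illustration only, a command substitution in a cp command line might look something like the sketch below. The copy_smallest function name, the scratch directory, and the file names are all invented for the example, and it assumes file names contain no whitespace or glob characters.

```shell
#!/usr/bin/env bash
# Copy the n files whose first underscore-separated field is numerically smallest.

# Arguments: source directory, destination directory, number of files.
copy_smallest() {
    local src=$1 dest=$2 n=$3
    mkdir -p "$dest"
    dest=$(cd "$dest" && pwd)               # make the destination absolute
    ( cd "$src" || exit 1
      # unquoted on purpose: each name must become a separate cp operand
      cp -p $(ls *.out.txt | sort -t_ -k1,1n | head -n "$n") "$dest"/ )
}

work=$(mktemp -d)                           # hypothetical scratch directory
mkdir "$work/f0"
touch "$work/f0/3.5_a.out.txt" "$work/f0/1.2_b.out.txt" "$work/f0/2.8_c.out.txt"
copy_smallest "$work/f0" "$work/keep" 2
ls "$work/keep"
rm -rf "$work"
```

No array is needed here because the shell splits the output of the pipeline into separate operands for cp.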

As a general rule, ALWAYS determine what you want to do first and then figure out how to do it. If you start with the assumption that you need to use an array before you decide what you want to do, you'll frequently miss simpler or more efficient ways to get things done.

Don Cragun

08-01-2013

Registered User

362, 16

Join Date: Mar 2010

Last Activity: 3 March 2020, 10:38 PM EST

Location: Boston

Posts: 362

Thanks Given: 193

Thanked 16 Times in 15 Posts

This is the full script I have that does what I want.

Code:

#!/bin/bash


# loop through all files and copy the top 5 EV and CV to continue w/ random weight files

FOLDS=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)

for FOLD in "${FOLDS[@]}"
do

   # get directory list
   FILES='./'$FOLD'/'*'out.txt'

   # reinitialize
   FILENAME=""
   EV_MAE_VALUE=0
   CV_MAE_VALUE=0

   EV_MAE_0=1000.0
   EV_MAE_1=1000.0
   EV_MAE_2=1000.0
   EV_MAE_3=1000.0
   EV_MAE_4=1000.0

   EV_FILES=(NULL0 NULL1 NULL2 NULL3 NULL4)

   CV_MAE_0=1000.0
   CV_MAE_1=1000.0
   CV_MAE_2=1000.0
   CV_MAE_3=1000.0
   CV_MAE_4=1000.0

   CV_FILES=(NULL0 NULL1 NULL2 NULL3 NULL4)

   for INFILE in $FILES
   do

   #  remove directory from path
      FILENAME=`echo $INFILE | awk 'BEGIN {FS="/"} {print $3}'`
   #  find ev mae value
      EV_MAE_VALUE=`echo $FILENAME | awk 'BEGIN {FS="_"} {print $1}'`
   #  find cv mae value
      CV_MAE_VALUE=`echo $FILENAME | awk 'BEGIN {FS="_"} {print $3}'`

   # save the names of the top 5 EV files
      if (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_0") == 1 ))
      then
         #bump down current list items
         EV_FILES[4]=${EV_FILES[3]}; EV_MAE_4=$EV_MAE_3
         EV_FILES[3]=${EV_FILES[2]}; EV_MAE_3=$EV_MAE_2
         EV_FILES[2]=${EV_FILES[1]}; EV_MAE_2=$EV_MAE_1
         EV_FILES[1]=${EV_FILES[0]}; EV_MAE_1=$EV_MAE_0
         EV_FILES[0]=$FILENAME
         # assign EV_MAE_VALUE to top value
         EV_MAE_0=$EV_MAE_VALUE

      elif (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_1") == 1 ))
      then
         EV_FILES[4]=${EV_FILES[3]}; EV_MAE_4=$EV_MAE_3
         EV_FILES[3]=${EV_FILES[2]}; EV_MAE_3=$EV_MAE_2
         EV_FILES[2]=${EV_FILES[1]}; EV_MAE_2=$EV_MAE_1
         EV_FILES[1]=$FILENAME
         EV_MAE_1=$EV_MAE_VALUE

      elif (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_2") == 1 ))
      then
         EV_FILES[4]=${EV_FILES[3]}; EV_MAE_4=$EV_MAE_3
         EV_FILES[3]=${EV_FILES[2]}; EV_MAE_3=$EV_MAE_2
         EV_FILES[2]=$FILENAME
         EV_MAE_2=$EV_MAE_VALUE

      elif (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_3") == 1 ))
      then
         EV_FILES[4]=${EV_FILES[3]}; EV_MAE_4=$EV_MAE_3
         EV_FILES[3]=$FILENAME
         EV_MAE_3=$EV_MAE_VALUE

      elif (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_4") == 1 ))
      then
         EV_FILES[4]=$FILENAME
         EV_MAE_4=$EV_MAE_VALUE

      fi

   # save the names of the top 5 CV files
      if (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_0") == 1 ))
      then
         #bump down current list items
         CV_FILES[4]=${CV_FILES[3]}; CV_MAE_4=$CV_MAE_3
         CV_FILES[3]=${CV_FILES[2]}; CV_MAE_3=$CV_MAE_2
         CV_FILES[2]=${CV_FILES[1]}; CV_MAE_2=$CV_MAE_1
         CV_FILES[1]=${CV_FILES[0]}; CV_MAE_1=$CV_MAE_0
         CV_FILES[0]=$FILENAME
         # assign CV_MAE_VALUE to top value
         CV_MAE_0=$CV_MAE_VALUE

      elif (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_1") == 1 ))
      then
         CV_FILES[4]=${CV_FILES[3]}; CV_MAE_4=$CV_MAE_3
         CV_FILES[3]=${CV_FILES[2]}; CV_MAE_3=$CV_MAE_2
         CV_FILES[2]=${CV_FILES[1]}; CV_MAE_2=$CV_MAE_1
         CV_FILES[1]=$FILENAME
         CV_MAE_1=$CV_MAE_VALUE

      elif (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_2") == 1 ))
      then
         CV_FILES[4]=${CV_FILES[3]}; CV_MAE_4=$CV_MAE_3
         CV_FILES[3]=${CV_FILES[2]}; CV_MAE_3=$CV_MAE_2
         CV_FILES[2]=$FILENAME
         CV_MAE_2=$CV_MAE_VALUE

      elif (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_3") == 1 ))
      then
         CV_FILES[4]=${CV_FILES[3]}; CV_MAE_4=$CV_MAE_3
         CV_FILES[3]=$FILENAME
         CV_MAE_3=$CV_MAE_VALUE

      elif (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_4") == 1 ))
      then
         CV_FILES[4]=$FILENAME
         CV_MAE_4=$CV_MAE_VALUE

      fi

   done

   # copy list of filenames and corresponding ini weight sets to continue
   RAND_SET=""
   for I in "${EV_FILES[@]}"
   do
      # copy file to continue
      cp -p './'$FOLD'/'$I './'$FOLD'/'$FOLD'_continue/EV/'$I
      #  find random ini set number
      RAND_SET=`echo $I | awk 'BEGIN {FS="_"} {print $5}'`
      # copy random ini weight file to continue
      cp -p './rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'/'$FOLD'_continue/EV/'
   done

   for I in "${CV_FILES[@]}"
   do
      # copy file to continue
      cp -p './'$FOLD'/'$I './'$FOLD'/'$FOLD'_continue/CV/'$I
      #  find random ini set number
      RAND_SET=`echo $I | awk 'BEGIN {FS="_"} {print $5}'`
      # copy random ini weight file to continue
      cp -p './rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'/'$FOLD'_continue/CV/'
   done

   #move fold output files to stats folder
   mv './'$FOLD'/'*'out.txt' './'$FOLD'/'$FOLD'_stats/'

done

I thought it was overly long to post since my question was about the first part. This does the job that the first one I posted did, except that it does it twice. It loops through a set of sub folders f0-f9, finds the files with the top 5 EV MAE values (5 smallest field 1) and copies those files and a corresponding set of files to f*/f*_continue/EV/. Then it does the same thing for the top CV MAE values (5 smallest field 3) and copies to f*/f*_continue/CV/.

I have attached a new test dir with the script and supporting files. I have edited the script so that it is only working with f0, f1, f2 to help simplify things. The script will find the top EV MAE values by reading the first field, and then find the .wts file that goes with it. Both will be copied to the continue folder.

For example, the top EV MAE file for f0 is,
53.96_E3000_50.19_E2200_35_ri_OA_f0_S1C_v17_52.26.1_4_ON_0.25lr.out.txt
so this will be copied to ./f0/f0_continue/EV/

The .wts file associated with this is 35 (field 5), so the script will also copy,
./rnd_ini/f0/ri_35*.wts
to ./f0/f0_continue/EV/

The top CV MAE file for f0 is,
54.90_E3000_48.65_E4300_23_ri_OA_f0_S1C_v17_52.26.1_4_ON_0.25lr.out.txt
so this will be copied to ./f0/f0_continue/CV/

The .wts file associated with this is 23 (field 5), so the script will also copy,
./rnd_ini/f0/ri_23*.wts
to ./f0/f0_continue/CV/

After a f* directory is processed, the script moves all the .out.txt files to ./f*/f*_stats/.

Since the version you posted accepts arguments, there would be no need to do both CV and EV in the same run. The script could be called twice with the proper arguments for each. It is also very nice that your method allows you to pick any number of files to collect. It will probably end up being about 20, but I'm not sure yet.

Thanks for all the help.

LMHmedchem

test_copy2.zip (384.4 KB)

08-05-2013

This is why I often don't post what I want to do in its entirety. It seems that most of the time when I post a long script, no one seems to want to wade into it (which is quite understandable). I guess I still need to work on making posts that are long enough to convey what I am asking and get a workable solution, but short enough that they will actually be read.

I modified the code that you posted and have something that gives me what I need. This is the modified code,

Code:

#!/bin/bash

# argument $1 is the field to sort on based on file names as below

# 53.96_E3000_50.19_E2200_35_ri_OA_f0_S1C_v17_52.26.1_4_ON_0.25lr.out.txt
#     1     2     3     4  5

# argument $2 is the file count, meaning the number of files to find and copy
# argument $3 is the set type EV/CV

# for the top 10 EV outcomes call as ./01_copy_top_outcomes.sh 1 10 EV
# for the top 10 CV outcomes call as ./01_copy_top_outcomes.sh 3 10 CV

USAGE="./script_name  sort_field   file_count   set_type"

# field to sort on
KEY_FIELD=$1
# number of files to find
FILE_COUNT=$2
# processing set type EV/CV
SET_TYPE=$3

# make sure there are 3 arguments and the first two are numbers
if [ $# -ne 3 ] || [ "$1" != "${1%*[^0-9]*}" ] || [ "$2" != "${2%*[^0-9]*}" ]
then    echo "$USAGE"
        exit 1
fi

# loop on all folds
FOLDS=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)

for FOLD in ${FOLDS[@]}
do

#  check if the directory exists, this should never throw.
   if [[ ! -d "$FOLD" ]]
   then  echo 'directory' $FOLD'/ does not exist, exit script'
         exit 1
   fi

   # change directory to current fold
   cd $FOLD
   echo "processing" $FOLD

   #re-initialize
   OUTPUT=""
   FILE_TEMP=""
   FILE_NAME=""
   RAND_SET=""

   # sort the list of filenames and output the top number "n" as specified in argument $2
   FILE_LIST=( $(df -h | ls *.out.txt | sort -t_ -k$KEY_FIELD,$KEY_FIELD'n' | awk -F_ -v f=$KEY_FIELD -v c=$FILE_COUNT 'NR > c {exit} {printf("%s", $0)}') )

   # loop up to file count to parse output and copy files that were found by sort
   for (( LOOP_CT=1; LOOP_CT<=$FILE_COUNT; LOOP_CT++ ))
   do

      # parse output string on .out.txt to locate individual files
      FILE_TEMP=`echo $FILE_LIST | awk -v N=$LOOP_CT 'BEGIN {FS=".out.txt"} {print $N}'`
      # restore file extension
      FILE_NAME=$FILE_TEMP'.out.txt'

      echo $FILE_NAME

      # copy file and corresponding ini weight set to continue
      # copy file to continue
      cp -p './'$FILE_NAME './'$FOLD'_continue/'$SET_TYPE'/'$FILE_NAME

      #  find random ini set number
      RAND_SET=`echo $FILE_NAME | awk 'BEGIN {FS="_"} {print $5}'`
      # copy random ini weight file to continue
      cp -p '../rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'_continue/'$SET_TYPE'/'

   done

   # return to start directory
   cd ../

done

The entire list of files that is found ends up in the variable FILE_LIST, so that gets parsed into individual file names. Those files are copied to the proper location and an associated file is also located and copied. This loops through all sub folders f0-f9, so that is no longer an argument.

This seems reasonable and works, but I don't know awk well enough to see if there are any hidden problems. There is probably an easier way to copy the files I need, but I don't know how to copy in awk, so I needed to get the file names in bash variables that I know how to manipulate to some extent.

Do you see anything dreadfully wrong here? This does give me the ability to specify the sort field and the number of files I want, which is a big improvement over what I first posted.

LMHmedchem

08-06-2013

The goal of The UNIX and Linux Forums is to help you learn how to do "stuff" on your own; not to write programs for you. I gave you a sample script to get you started, and from your message #9 in this thread it sounded like you were well on your way to getting a working solution. (And posting a 384 KB zipped archive that expands to over 1 MB without a clear indication of the desired outcome of processing it takes more space and time than most volunteers are willing to donate.)

From what you have shown here in message #10, you are learning quickly.

I will make a few more comments that may help you speed this up a little bit: First, in the pipeline:

Code:

df -h | ls *.out.txt | sort -t_ -k$KEY_FIELD,$KEY_FIELD'n' | awk -F_ -v f=$KEY_FIELD -v c=$FILE_COUNT 'NR > c {exit} {printf("%s", $0)}'

what would happen if you removed the leading "df -h |" (the part shown in red in the original post)? The ls utility doesn't read from standard input, so the df command should make no difference to the output of this pipeline. (It will just make the pipeline run slower.)
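A quick way to convince yourself of this is to compare the output of ls with and without something piped into it. The sketch below is only an illustration: the scratch directory and the names a.out.txt and b.out.txt are made up, and printf stands in for df -h.

```shell
#!/usr/bin/env bash
# ls ignores standard input, so piping another command into it changes nothing.

demo_dir=$(mktemp -d)                      # hypothetical scratch directory
touch "$demo_dir/a.out.txt" "$demo_dir/b.out.txt"

plain=$(cd "$demo_dir" && ls *.out.txt)
piped=$(cd "$demo_dir" && printf 'ignored input\n' | ls *.out.txt)

if [ "$plain" = "$piped" ]; then
    echo "same output with or without the pipe"
fi

rm -rf "$demo_dir"
```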

Second, you seem to go to a lot of effort to store the output of this pipeline in an array and then spend a lot of time trying to extract individual file names from the array. It looks like the array will only have one element because the printf in your awk command doesn't put a space between the names of the files it prints. If you would change the printf statement from:

Code:

printf("%s", $0)

to:

Code:

printf(" %s", $0)

you could reference filenames in the array more simply by using ${FILE_LIST[0]} through ${FILE_LIST[$((FILE_COUNT-1))]}.
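To see the difference the space makes, here is a small sketch that counts the array elements produced by each printf format. The two file names are invented for the demonstration, and it assumes names without whitespace.

```shell
#!/usr/bin/env bash
# Compare how many array elements result from the two printf formats.

names=$(printf '%s\n' 1_a.out.txt 2_b.out.txt)   # hypothetical file names

# No separator: awk glues every name into one long word.
joined=( $(printf '%s\n' "$names" | awk '{printf("%s", $0)}') )

# Leading space: each name becomes its own word, hence its own element.
split=( $(printf '%s\n' "$names" | awk '{printf(" %s", $0)}') )

echo "without space: ${#joined[@]} element(s): ${joined[0]}"
echo "with space:    ${#split[@]} element(s): ${split[0]} and ${split[1]}"
```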

But why have an array at all? Why not just process the files one at a time as they come out of awk? As an example, what would happen if you replaced:

Code:

   # sort the list of filenames and output the top number "n" as specified in argument $2
   FILE_LIST=( $(df -h | ls *.out.txt | sort -t_ -k$KEY_FIELD,$KEY_FIELD'n' | awk -F_ -v f=$KEY_FIELD -v c=$FILE_COUNT 'NR > c {exit} {printf("%s", $0)}') )

   # loop up to file count to parse output and copy files that were found by sort
   for (( LOOP_CT=1; LOOP_CT<=$FILE_COUNT; LOOP_CT++ ))
   do

      # parse output string on .out.txt to locate individual files
      FILE_TEMP=`echo $FILE_LIST | awk -v N=$LOOP_CT 'BEGIN {FS=".out.txt"} {print $N}'`
      # restore file extension
      FILE_NAME=$FILE_TEMP'.out.txt'

      echo $FILE_NAME

      # copy file and corresponding ini weight set to continue
      # copy file to continue
      cp -p './'$FILE_NAME './'$FOLD'_continue/'$SET_TYPE'/'$FILE_NAME

      #  find random ini set number
      RAND_SET=`echo $FILE_NAME | awk 'BEGIN {FS="_"} {print $5}'`
      # copy random ini weight file to continue
      cp -p '../rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'_continue/'$SET_TYPE'/'

   done

with the much simpler:

Code:

      ls *.out.txt | sort -t_ -k$KEY_FIELD,${KEY_FIELD}n |
      awk -F_ -v c="$FILE_COUNT" '
        NR > c {exit}
        {print $0, $5}' |
      while read FILE_NAME RAND_SET
      do
        # copy files that were found by sort
        echo "file_name: $FILE_NAME rand_set: $RAND_SET"

        # copy file and corresponding ini weight set to continue
        # copy file to continue
        cp -p './'$FILE_NAME './'$FOLD'_continue/'$SET_TYPE'/'$FILE_NAME

        # copy random ini weight file to continue
        cp -p '../rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'_continue/'$SET_TYPE'/'
      done

Note that there is no array here, there is only one invocation of awk (instead of n+1 invocations to process n files), and RAND_SET is pulled from the file name at a time when we already have the fields in the file name split out (so we only have to split the name once). You can also get rid of some unneeded temporary variables since OUTPUT was not (and still is not) referenced after being set, and FILE_TEMP is no longer used.
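If you want to try this pipeline without touching your real files, something like the following dry run might help. It is only a sketch: the top_files function name, the scratch directory, and the three file names are invented for the demonstration.

```shell
#!/usr/bin/env bash
# Dry run of the sort | awk | while read pipeline on made-up file names.

# Print "file_name rand_set" for the c files with the smallest key field.
# Arguments: directory, key field number, file count.
top_files() {
    ( cd "$1" || exit 1
      ls *.out.txt | sort -t_ -k"$2,${2}n" |
      awk -F_ -v c="$3" 'NR > c {exit} {print $0, $5}' )
}

demo_dir=$(mktemp -d)                       # hypothetical scratch directory
touch "$demo_dir/53.96_E3000_50.19_E2200_35_x.out.txt" \
      "$demo_dir/54.90_E3000_48.65_E4300_23_x.out.txt" \
      "$demo_dir/60.10_E3000_55.00_E1000_7_x.out.txt"

# pick the 2 files with the smallest value in field 3
top_files "$demo_dir" 3 2 |
while read FILE_NAME RAND_SET
do
    # report what would be copied without copying anything
    echo "would copy $FILE_NAME and ri_${RAND_SET}_*.wts"
done

rm -rf "$demo_dir"
```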

Don Cragun

08-11-2013

Sorry for the delay, it has been an unexpectedly busy end of the week.

Quote:

Originally Posted by Don Cragun

The goal of The UNIX and Linux Forums is to help you learn how to do "stuff" on your own; not to write programs for you.

I completely understand and agree with this. I always try to start with a post that contains at least some kind of a working script. This is to do as much as I can on my own and let the other users here know that I am working to solve the problem, not expecting others to do it for me. I also think that compared to a text explanation, programming code is easier to read in terms of understanding what a person is after. I read long prose explanations of code algorithms when I am having trouble falling asleep at night. I still haven't consistently found the sweet spot when it comes to exactly how much to post. It appears that my first attempt was too short to explain all I was trying to do, and my second was way too long to bother wading into.

It turns out that I didn't end up using an array. Everything that came out of the code you posted ended up in a single long string variable ($FILE_LIST). I just looped on the number of files I was expecting to find and parsed the long string to pull out the file name for each iteration of the loop.

Quote:

Originally Posted by Don Cragun

But, why have an array at all. Why not just process the files one at a time as they come out of awk?

The simple explanation for this is that I don't know awk very well at all. I can more or less use it to parse things, but only in the simplest implementations. I did spend some time trying to take the output from awk and work with it, but I think I was using a redirect instead of a pipe.

So as I read this now, ls is passing all .out.txt to sort, sort is sorting on the key field and passes the sorted list to awk. Awk processes each item in the list and outputs the specified fields. Then it looks like the output of awk is passed to read, which dumps the output into the vars $FILE_NAME and $RAND_SET. Once you are there, the rest is straightforward. I am not familiar with read, so that is something new to me. I would not have known how to get awk to output two variables and get them into something that I could use with cp. I am also not quite clear about how awk knows when it has read in enough lines to get to $FILE_COUNT. Is "NR" an implicit running counter of some kind so that when NR > $FILE_COUNT awk quits (I see that you have passed $FILE_COUNT to awk as c)?

From time to time I think I am getting better at this, then I try to do something new and find out how much I still don't know. I do really appreciate the help that is available here.

LMHmedchem

08-11-2013

Quote:

Originally Posted by LMHmedchem

Sorry for the delay, it has been an unexpectedly busy end of the week.

... ... ...

Quote:

Originally Posted by Don Cragun
But, why have an array at all. Why not just process the files one at a time as they come out of awk?

It turns out that I didn't end up using an array. Everything that came out of the code you posted ended up in a single long string variable ($FILE_LIST). I just looped on the number of files I was expecting to find and parsed the long string to pull out the file name for each iteration of the loop.

When you assign a value to a shell variable using the syntax:

Code:

var=( list_of_values )

and your shell is a recent bash or ksh, you are defining var to be an array. So, the way you initialized FILE_LIST, it was an array containing only one element.
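A minimal sketch of this, using made-up variable names, shows why FILE_LIST could be treated like a plain string: in bash, referencing an array without an index gives you element 0.

```shell
#!/usr/bin/env bash
# var=( ... ) always creates an array, even when only one word is inside.

single=( one_long_string )          # array with exactly one element
several=( f0 f1 f2 )                # array with three elements

echo "single has ${#single[@]} element, several has ${#several[@]}"

# Referencing an array without an index gives element 0 only, which is why
# $FILE_LIST behaved like a plain string in the earlier script.
echo "$single"
echo "$several"
```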

Quote:

Originally Posted by LMHmedchem

Quote:

Originally Posted by Don Cragun
But, why have an array at all. Why not just process the files one at a time as they come out of awk?

Yes, you correctly interpreted what ls, sort, awk, and read are doing.

I suggest that you look at the read(1) man page. The read utility built into your shell will probably have additional options, but the POSIX description in the link above is all you need to understand what is going on in the simple script suggestions I provided. You might also want to read the awk(1) man page. In the awk command:

Code:

      awk -F_ -v c="$FILE_COUNT" '
        NR > c {exit}
        {print $0, $5}'

The -F_ sets the input field separator to the underscore character, -v c="$FILE_COUNT" sets the awk variable c to the expansion of the shell variable FILE_COUNT, NR > c {exit} exits awk if the current number of input records read from all input files is greater than the awk variable c, and {print $0, $5} prints the entire current input record followed by a space followed by the 5th field from the current input record followed by a newline character. And then the:

Code:

while read FILE_NAME RAND_SET
do
 ... ... ...
done

reads one line at a time in a loop until end-of-file is detected on the input pipe, setting the shell variables FILE_NAME and RAND_SET to the two values written by awk on each line. And, as you said, the loop processes each line of output from awk to copy the appropriate files into their desired places.
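The two pieces can be tried on their own with a toy input. In this sketch the first_n function and the words fed to it are invented for the example; NR is awk's built-in count of records read so far, and read splits each line on whitespace into the named variables.

```shell
#!/usr/bin/env bash
# NR is awk's running count of input records; read splits a line into variables.

# Print the first c lines of standard input, each prefixed by its record number.
first_n() {
    awk -v c="$1" 'NR > c {exit} {print NR, $0}'
}

printf '%s\n' alpha beta gamma delta | first_n 2 |
while read NUMBER WORD
do
    echo "record $NUMBER is $WORD"
done
```

Only the first two lines reach the loop because awk exits as soon as NR exceeds c.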

Quote:

Originally Posted by LMHmedchem

From time to time I think I am getting better at this, then I try to do something new and find out how much I still don't know. I do really appreciate the help that is available here.

LMHmedchem

That's what we're here for. Don't be afraid to experiment. If you have a loop like this and want to see what it will do without actually copying files, put an echo in front of the cp to have the script show you what it will do when you remove the echos.
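One way to do this, sketched below, is to keep the echo in a variable so it can be switched off later. The DRY_RUN variable, the copy_all function, and the file names are hypothetical, not part of your script.

```shell
#!/usr/bin/env bash
# Prefix commands with an optional echo so a loop can be previewed safely.

DRY_RUN=echo          # hypothetical switch: set to an empty string to really copy

# Copy (or, in dry-run mode, just print the command for) each named file.
# Arguments: destination directory, then one or more file names.
copy_all() {
    local dest=$1 name
    shift
    for name in "$@"
    do
        $DRY_RUN cp -p "./$name" "$dest/$name"
    done
}

copy_all f0_continue/EV first.out.txt second.out.txt
```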

Get used to using:

Code:

set -xv
    code to trace
set +xv

to surround segments of shell code that you don't understand so you can see what commands are being called and what operands are being passed to them.
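As a small illustration, the sketch below traces a made-up function. It uses only -x, which prints each command after expansion; -v additionally echoes the script lines as the shell reads them. The trace is written to standard error.

```shell
#!/usr/bin/env bash
# Trace a small piece of code with set -x; the trace goes to standard error.

traced_sum() {
    set -x                      # start tracing
    local total=$(( $1 + $2 ))
    echo "total=$total"
    set +x                      # stop tracing
}

traced_sum 2 3
```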

Don Cragun
