A simpler way to do this (save a list of files based on part of their name)
# 1  
Old 08-01-2013

Hello,

I have a script that checks every file with a specific extension in a specific directory. The file names contain some numerical output and I am recording the file names with the best n outcomes.

The script finds all files in the directory with the extension .out.txt and uses awk to parse the filename on underscore. In this case, I am reading the first field and looking for the smallest three values across the set of files. In other cases, I may be reading the third field. I understand that in this simple case, all I would have to do is take the first three files, but there will be other cases where that would not work.

This is the script at this point and there is sample input in the attached zip. The input file names look like,

48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
50.36_E3400_55.09_E3000_35_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
50.62_E1700_51.92_E300_8_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt

Code:
#!/bin/bash
# loop through all files and save the top 3 filenames

   # initialize
   FILENAME=""
   CURRENT_MAE_VALUE=0
   # these are initialized to an arbitrarily large value
   EV_MAE_0=1000.0
   EV_MAE_1=1000.0
   EV_MAE_2=1000.0

   EV_FILES=(NULL0 NULL1 NULL2)

   # set fold value
   FOLD=f0

   # get directory list
   FILES='./'$FOLD'/'*'out.txt'

   for INFILE in $FILES
   do

   #  remove directory from path
      FILENAME=`echo $INFILE | awk 'BEGIN {FS="/"} {print $3}'`
   #  find ev mae value
      CURRENT_MAE_VALUE=`echo $FILENAME | awk 'BEGIN {FS="_"} {print $1}'`

   # save the names of the top 3 EV files and EV values
      if (( $(bc <<< "$CURRENT_MAE_VALUE < $EV_MAE_0") == 1 ))
      then
         #bump down current list items
         EV_FILES[2]=${EV_FILES[1]}; EV_MAE_2=$EV_MAE_1
         EV_FILES[1]=${EV_FILES[0]}; EV_MAE_1=$EV_MAE_0
         EV_FILES[0]=$FILENAME
         # assign EV_MAE_VALUE to top value
         EV_MAE_0=$CURRENT_MAE_VALUE

      elif (( $(bc <<< "$CURRENT_MAE_VALUE < $EV_MAE_1") == 1 ))
      then
         #bump down current list items
         EV_FILES[2]=${EV_FILES[1]}; EV_MAE_2=$EV_MAE_1
         EV_FILES[1]=$FILENAME
         # assign EV_MAE_VALUE to second value
         EV_MAE_1=$CURRENT_MAE_VALUE

      elif (( $(bc <<< "$CURRENT_MAE_VALUE < $EV_MAE_2") == 1 ))
      then
         #bump down current list items
         EV_FILES[2]=$FILENAME
         # assign EV_MAE_VALUE to third value
         EV_MAE_2=$CURRENT_MAE_VALUE

      fi

   done

# print results
   echo "1st EV file"
   echo ${EV_FILES[0]}
   echo "EV MAE 0"  $EV_MAE_0
   echo""
   echo "2nd EV file"
   echo ${EV_FILES[1]}
   echo "EV MAE 1"  $EV_MAE_1
   echo""
   echo "3rd EV file"
   echo ${EV_FILES[2]}
   echo "EV MAE 2"  $EV_MAE_2
   echo""

My main question is about how to keep a running record of the file names of the best three values as I loop through the file names. This script does it by brute force and works alright, but I may need to save the top 20 or 50, and I don't look forward to coding that up with the method I used above.

Any suggestions?

LMHmedchem
# 2  
Old 08-01-2013
Seems like an egrep would work, where the output of your grep would include the filename and the particular field you wanted, if the value you're interested in is actually in the file. Then you would sort by numeric value on that particular field, then use head or tail depending upon your sort and boom...done. I am not clear on whether you're using the filenames to extract the values yet, but in any case it will be similar. I will look at your data and script and post an example shortly. Someone will probably post a solution if I don't in a short time.

---------- Post updated at 12:50 PM ---------- Previous update was at 12:40 PM ----------
Based on filename approach...
Something like this
Code:
ls *.out.txt | sort -k1,1 -t\_ -n -r | tail -3
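The same idea can also be written with an ascending sort and head, so the smallest value is listed first. This is only a minimal sketch: the function name lowest_by_field and its three parameters are made up for illustration, and it assumes at least one *.out.txt file exists in the directory and that no file name contains a newline.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): print the names of the
# `count` files whose underscore-delimited field `field` is smallest.
lowest_by_field() {
    local dir=$1 field=$2 count=$3
    # Run in a subshell so the cd does not change the caller's directory.
    (
        cd "$dir" || exit 1
        # Let the shell expand the glob and print one name per line.
        # If nothing matches, the literal pattern is printed instead.
        printf '%s\n' *.out.txt |
            # LC_ALL=C makes "." the decimal point for the numeric sort.
            LC_ALL=C sort -t_ -k"$field","$field"n |
            head -n "$count"
    )
}
```

For example, lowest_by_field f0 1 3 would list the three names with the smallest first field, one per line.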

# 3  
Old 08-01-2013
If I was doing this in cpp, I would definitely use some kind of sort, but I'm not at all familiar with how to do this in a shell. The key value is in the file, but not somewhere where it can be easily found (not in the same place in every file). I have already processed these files and added the value I am interested in to the file name so it will be easier to access. It's easy enough to grab the value out of the filename, but I don't know if that's compatible with your solution.

LMHmedchem
# 4  
Old 08-01-2013
Based on the data in your zip file and your current bash script, here is another bash script that seems to do what you want, but instead of hard coding the directory, field number, and number of files to be listed, it takes them as parameters:
Code:
#!/bin/bash
IAm=${0##*/}
Usage="Usage: $IAm directory field_number count"
if [ $# -ne 3 ] || ! cd "$1" > /dev/null || [ "$2" != "${2%*[^0-9]*}" ] ||
        [ "$3" != "${3%*[^0-9]*}" ]
then    echo "$Usage"
        exit 1
fi
ls *.out.txt | sort -t_ -k$2,$2n | awk -F_ -v f=$2 -v c=$3 '
NR > c {exit}
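{       # choose the ordinal suffix; the % tests keep 11th-13th and 21st, 22nd, ... correct
        if(NR % 10 == 1 && NR % 100 != 11) s = "st"
        else if(NR % 10 == 2 && NR % 100 != 12) s = "nd"
        else if(NR % 10 == 3 && NR % 100 != 13) s = "rd"
        else s = "th"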
        printf("%d%s EV file\n%s\nEV MAE %d %s\n\n", NR, s, $0, NR - 1, $f)
}'

This script was tested using both bash and ksh, but should work with any POSIX conforming shell.

If you save this in a file named test2_copy.sh, make it executable with:
Code:
chmod +x test2_copy.sh

and execute it with:
Code:
./test2_copy.sh f0 1 3

you get the same output as you get if you run ./test_copy.sh:
Code:
1st EV file
48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 0 48.93

2nd EV file
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 1 49.15

3rd EV file
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 2 49.16

but you can also run it with:
Code:
./test2_copy.sh f0 3 5

to produce:
Code:
1st EV file
50.62_E1700_51.92_E300_8_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 0 51.92

2nd EV file
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 1 51.98

3rd EV file
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 2 52.54

4th EV file
50.36_E3400_55.09_E3000_35_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 3 55.09

5th EV file
48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 4 55.94

which gives you data sorted on the 3rd underscore delimited field and limited to the 1st 5 matching files. The color was added only to highlight the sort field; the actual output will not have red text. So, you could sort on the 5th field with:
Code:
./test2_copy.sh f0 5 5

to get:
Code:
1st EV file
50.62_E1700_51.92_E300_8_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 0 8

2nd EV file
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 1 32

3rd EV file
48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 2 34

4th EV file
50.36_E3400_55.09_E3000_35_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 3 35

5th EV file
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 4 44

Note, however, that it is doing a numeric sort, so the results are unspecified if you select a field that isn't entirely a number.
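One way to guard against that is to check the chosen field before sorting. The following is only a sketch under stated assumptions: the function name field_is_numeric is made up, it reads file names on standard input, and it treats "numeric" as an unsigned integer or decimal such as 8 or 48.93.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): succeed only if field
# number $1 (underscore-delimited) is a plain number in every name
# read from standard input.
field_is_numeric() {
    local field=$1
    awk -F_ -v f="$field" '
        # Flag any name whose field is not digits with an optional
        # decimal part; a missing field is empty and is flagged too.
        $f !~ /^[0-9]+(\.[0-9]+)?$/ { bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}
```

With the sample names, fields 1, 3, 5, and 12 would pass, while a field such as E3200 or 52.26.1 would be rejected.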
This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 08-01-2013
Thanks, I will go over this and see if I can get it working. At the end of the day, I will be doing a cp of each file in the list to another directory. One of the problems I have is that I will probably want the top 20 out of 50 or so (not the top 3), so you can see why my method wasn't going to be practical.

It's not entirely clear to me what arguments 2 and 3 are. Is argument 3 the number of files being processed and argument 2 the field being sorted on?

LMHmedchem
# 6  
Old 08-01-2013
I'm sorry for not explaining it better. I thought the usage message comment was sufficient documentation along with the examples I gave. The arguments are:
  1. A pathname of the directory containing the files to be processed.
  2. The field to be used as your sort key.
  3. The maximum number of files you want to list.
Your file names are of the form:
Code:
field1_field2_field3_field4_field5_field6_field7_field8_field9_field10_field11_field12_field13_field14

where field14 always ends with the string .out.txt. I showed you examples using fields 1, 3, and 5 as the sort key since they were the only numeric fields in the names of the files you used in your example that had values that were not a constant. The 12th field was numeric but all filenames had 4 in field12 so sorting on it didn't seem useful.

The count (3rd operand) in my examples was 3 and 5 since you used 3 in your example and you only had 5 files in your example. You can put any number you want there to specify the number of files you want listed. It is happy with 1; it is happy with 32000. Pick the number you want.
# 7  
Old 08-01-2013
After spending some more time looking through this, you did explain it quite well. I just didn't read it as well as you explained it.

I am in the process of trying to copy the files that are found by this to a different location and not having much success. Probably the best solution would be to dump the sorted list into a bash array. Then I can do all the rest I need to do.

This is my attempt to do this (I didn't include the parsing and exception code here, but will post the entire working script once it is...)
Code:
eval array=( $(df -h | ls *.out.txt | sort -t_ -k$2,$2n | awk -F_ -v f=$2 -v c=$3 'NR > c {exit} {printf("%s", $0)}') )

This is mildly successful in that it does capture the file names in an array, but all of them are in the first array element. I suppose I could parse array[0] on out.txt, or something kludgey like that, but I am guessing there is a better way.

I know there are some ways to copy in awk, and also with system, but I need to extract some additional information from the filename to locate an additional file, and the only way I know how to do that is in bash.

LMHmedchem
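A likely cause of the single-element result is that printf("%s", $0) prints the names with nothing between them, so the shell sees one long word. Printing one name per line and reading the lines with mapfile should give one array element per file. This is only a sketch: it assumes bash 4 or later (for mapfile), at least one matching file, and no newlines in file names. The names load_top_files, copy_top_files, and TOP_FILES are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: fill the global array TOP_FILES with the
# `count` file names whose underscore-delimited `field` is smallest.
load_top_files() {
    local dir=$1 field=$2 count=$3
    # mapfile -t stores one input line per array element and strips
    # the trailing newline from each.
    mapfile -t TOP_FILES < <(
        cd "$dir" || exit 1
        printf '%s\n' *.out.txt |
            LC_ALL=C sort -t_ -k"$field","$field"n |
            head -n "$count"
    )
}

# Hypothetical helper: copy every file named in TOP_FILES from the
# source directory to the destination directory.
copy_top_files() {
    local src=$1 dest=$2 f
    mkdir -p "$dest" || return 1
    for f in "${TOP_FILES[@]}"; do
        # Quoting keeps names with spaces intact; -- ends option parsing.
        cp -- "$src/$f" "$dest/" || return 1
    done
}
```

After load_top_files f0 1 20, each name is available as "${TOP_FILES[i]}", so any extra parsing of the name (for example, to locate a companion file) can be done in bash inside the loop.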