FILE_ID extraction from file name and save it in CSV file after looping through each folders

09-13-2012

My files are on a UNIX server. I want to extract the file_id and file_name from each file name and save them in a CSV file. How do I do that?
The folders are structured as follows:
year folder -> 12 month folders inside it -> 30/31 day folders inside each month

Running ls in the top folder lists the year folders:
2009 2010 2011 2012
I ran cd for the year 2012:

Code:

$ cd 2012

I ran ls in the 2012 folder:

Code:

$ ls 
01 02 03 04 05 06 07 08 09

Then I ran these commands for September:

Code:

$ cd 09 
$ ls 
01 02 03 04 05 06 07 08 09 10 11 12 13 
$ cd 13 
$ ls 
sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz 
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz

There is a folder for each year (2009, 2010, 2011 and 2012),
each year folder has 12 month folders (01, 02, ... 12),
and each month folder has up to 31 day folders (01, 02, ... 31).

Each day folder contains files with names like these:
sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz
I want one CSV file with two columns: the first for file_id and the second for file_name.
To get the file_id, loop through every folder, read each file name, take the substring between "sasmm_fsbc_durds_id000" and "_t", and store it in the file_id column; store the file name in the file_name column.

For example, for the file sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz,
cut out 20532 and save it in the file_id column, and save the file name in the second column as sasmm_fsbc_durds_id00020532_t20100313192606.dat

The CSV file will look like this:

Code:

file_id file_name 
20532 sasmm_fsbc_durds_id00020532_t20100313192606.dat 
20513 sasmm_fsbc_durds_id00020513_t20120913003312.dat

The file_id is cut from the file name: it is the number that follows id000. In the examples above the file_ids are 20532 and 20513.

How do I loop through the year 2012, its 12 month folders, and the day folders inside them, and create a CSV file with the data shown above?
I am very new to UNIX, so please help me out. If you could provide code, that would be great.
Thanks.

Do we need to search for files recursively, or go down into each day folder one by one?

Moderator's Comments:

edit by bakunin: Please view this code tag video for how to use code tags when posting code and data.

In addition, please do not use all-caps routinely. All-caps is like spice: use it to make SOMETHING STAND OUT, but overdo it and it's tasteless.

Last edited by bakunin; 09-14-2012 at 07:02 AM..

princetd001


09-13-2012


First you say that filenames in the directory 2012/09/13 are:

Code:

sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz 
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz

and then you say you want the entire filename to be the second field in your output file and say that that field should be:

Code:

sasmm_fsbc_durds_id00020532_t20100313192606.dat
sasmm_fsbc_durds_id00020513_t20120913003312.dat

What happened to the .trnsfr.gz at the end of the filenames?

Is the file_id field always supposed to be a string of decimal digits, or could other characters appear in the file_id?

Is there any chance that there will be more than one occurrence of _t in a filename after sasmm_fsbc_durds_id000?

Should an error be reported if other files exist under 2???/[01][0-9]/[0-3][0-9] with filenames that don't start with sasmm_fsbc_durds_id000 and contain _t after that?

Don Cragun


09-13-2012


A quick Ksh script that assumes the current directory contains the year directories:

Code:

#!/usr/bin/env ksh
find 20[0-1][0-9] -type f | while read path
do
    name=${path##*/}
    name=${name%.trns*}
    id=${name%_*}
    id=${id##*_}
    id=${id:2}
    echo  ${id/~(+E)^[0]+/} $name
done >output-file

Requires Kshell, and there are probably more efficient ways to do this.

agama


09-14-2012


Quote:

Code:

sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz

and then you say you want the entire filename to be the second field in your output file and say that that field should be:

Code:

sasmm_fsbc_durds_id00020532_t20100313192606.dat
sasmm_fsbc_durds_id00020513_t20120913003312.dat

What happened to the .trnsfr.gz at the end of the filenames?

Yes, in each folder the file names end with .dat.trnsfr.gz, but the name written to the file_name column of the CSV file should omit .trnsfr.gz.

The file_id is a number, and it should be extracted from the file name itself.

Your code does not specify that the output file is a CSV file.
Are you looping through all files inside all folders in a year?
Which part of the code extracts the id from the file name?

How do you specify the column names in the output file?

Do you know how to write the same logic as a simple shell script?

---------- Post updated at 10:18 PM ---------- Previous update was at 09:59 PM ----------

If I use this loop, will it loop through all folders?

FILES=`ls -1`
for FILE in $FILES
do

---------- Post updated at 10:27 PM ---------- Previous update was at 10:18 PM ----------

I ran your script, and it reports this error:

[/work/users/po/prince]$ ./testSBI.sh
./testSBI.sh[8]: id=${id:2}: bad substitution

Your code:

Quote:

#!/usr/bin/env ksh
find 20[0-1][0-9] -type f | while read path
do
name=${path##*/}
name=${name%.trns*}
id=${name%_*}
id=${id##*_}
id=${id:2}
echo ${id/~(+E)^[0]+/} $name
done >output-file

---------- Post updated at 10:39 PM ---------- Previous update was at 10:27 PM ----------

I removed the line of code that caused the error and ran your script again without it. It threw another error:

./testSBI.sh[10]: ${id/~(+E)^[0]+/}: bad substitution

princetd001


09-14-2012


Quote:

Originally Posted by princetd001

Your code does not specify that the output file is a CSV file.
Are you looping through all files inside all folders in a year?
Which part of the code extracts the id from the file name?

You can set the output file name however you want. Replace output-file with CSV, or whatever you want the output filename to be. The find command lists all files under all directories whose names match 20[0-1][0-9], that is 2000 through 2019, so yes, in a way we are looping through all files, but letting find do the work rather than the script.

The code that extracts the ID from the name is:

Code:

id=${name%_*}    # delete from last underbar to the end, and assign to variable id
id=${id##*_}    # delete from front of the string to the last underbar and reassign to id
id=${id:2}   # extract the number (portion of string starting at offset 2, which skips the leading "id")

The leading zeros are removed as the variable is expanded in the echo:

Code:

${id/~(+E)^[0]+/}
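
To make the steps above concrete, here is a minimal trace on one of the sample file names. It is only a sketch: it sticks to POSIX expansions, so `${id#id}` stands in for `${id:2}` (an assumption made so the trace also runs in shells that report "bad substitution" for the substring form), and the directory part of the path is just illustrative.

```shell
#!/bin/sh
# Trace of the expansions on one sample path (the directory part is illustrative).
path=2012/09/13/sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz

name=${path##*/}      # drop the directory part
name=${name%.trns*}   # drop .trnsfr.gz, leaving the name that ends in .dat
id=${name%_*}         # drop from the last underscore to the end
id=${id##*_}          # drop everything through the last remaining underscore
id=${id#id}           # portable stand-in for ${id:2}: drop the literal "id"

printf '%s\n' "$name" "$id"
```

At this point the id still carries its leading zeros; removing them is the separate step done in the echo.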

Quote:

How do you specify the column names in the output file?

You made no mention of column names, only that the ID was to be first and the filename was to be second. The code prints ID followed by filename. Per your example there is no comma; I was a bit confused with your initial post as you indicated that the file was comma separated values (csv) yet you didn't indicate that the columns should be separated that way.

Quote:

Do you know how to write the same logic as a simple shell script?

The code I posted is a simple shell script.

Quote:

if i use this loop, will it loop through all folders?

FILES=`ls -1`
for FILE in $FILES
do

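Yes, it will loop over the entries in the current directory, although it does not descend into subdirectories, and it's bad form if you ask me. Something like this would be better:

Code:

ls | while read -r file
do
   echo "$file"
done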
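
If you would rather have the shell walk the year/month/day folders itself instead of parsing ls output, a glob can do it. This is a rough sketch, assuming the folders are zero-padded as in the ls listings earlier in the thread; the function name list_day_files and the throwaway demo tree are made up for illustration.

```shell
#!/bin/sh
# list_day_files ROOT: print every regular file under ROOT/<year>/<month>/<day>.
list_day_files() {
    for f in "$1"/2[0-9][0-9][0-9]/[01][0-9]/[0-3][0-9]/*; do
        [ -f "$f" ] || continue   # also skips the unexpanded pattern when nothing matches
        printf '%s\n' "$f"
    done
}

# Demonstration on a throwaway tree (these paths exist only for the example).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/2012/09/13" "$demo/2012/09/14"
: > "$demo/2012/09/13/sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz"
: > "$demo/2012/09/14/sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz"

list_day_files "$demo"
```

Unlike find, this only looks exactly three levels down, so files sitting directly in a year or month folder are not listed.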

Quote:

I ran your script, it says error message

[/work/users/po/prince]$ ./testSBI.sh
./testSBI.sh[8]: id=${id:2}: bad substitution

Were you using ksh (Korn Shell)? Bash cannot handle the last substitution which eliminates the leading zeros from the ID. If you cannot use ksh, then you'll need to change the echo and delete the zeros with sed or some other mechanism.
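
For shells without the ksh93 pattern, here is a small sketch of the "sed or some other mechanism" route. The helper names strip_zeros and strip_zeros_builtin are hypothetical; the second one is an alternative that uses only POSIX parameter expansion and no external command.

```shell
#!/bin/sh
# strip_zeros VALUE: print VALUE without its leading zeros, using sed.
strip_zeros() {
    printf '%s\n' "$1" | sed 's/^0*//'
}

# Same result with parameter expansion only:
# ${1%%[!0]*} is the run of leading zeros, which is then removed as a prefix.
strip_zeros_builtin() {
    printf '%s\n' "${1#"${1%%[!0]*}"}"
}

strip_zeros 00020532
strip_zeros_builtin 00020532
```

In the loop, the echo line would then become something like echo "$(strip_zeros "$id") $name". Note that a value made up entirely of zeros comes out empty with either helper.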

agama


09-14-2012


The following seems to do what you requested. You say that you want to create a CSV file, but by definition a CSV file has fields that are separated by commas. You don't show any commas in any of your sample output. This script uses a tab to separate output fields to get the headers to line up with the following data. Although it is written using ksh, it should also work with at least bash and sh:

Code:

#!/bin/ksh
printf "file_id\tfile_name\n"
find 2[0-9][0-9][0-9] -name 'sasmm_fsbc_durds_id000[0-9]*_t?*' | while read path
do
        file=$(basename "$path" .trnsfr.gz)
        id=${file#sasmm_fsbc_durds_id000}
        id=${id%%_t*}
        printf "%s\t%s\n" "$id" "$file"
done

Note that this will ignore any files found in and under the year directories that don't match your filename specifications.

To run it, save the above code in a file (e.g., extract) in the same directory where the year directories reside, make it executable by issuing the command:

Code:

chmod +x extract

and then issue the command:

Code:

./extract > output_file

If you leave off > output_file, the output will be written to your terminal. If you want to save the output in a file with a name other than output_file, replace it with any name you want.
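
If a true comma-separated file is wanted, the same approach can print commas instead of tabs. The following is only a sketch under a few assumptions: the function name write_csv is made up, it takes the directory holding the year folders as an argument instead of using the current directory, it adds -type f and sort for stable output, and it assumes no file name contains a comma or a newline.

```shell
#!/bin/sh
# write_csv ROOT: print a header, then one "file_id,file_name" line per matching file.
write_csv() {
    printf 'file_id,file_name\n'
    find "$1" -type f -name 'sasmm_fsbc_durds_id000[0-9]*_t?*' | sort |
    while IFS= read -r path
    do
        file=$(basename "$path" .trnsfr.gz)   # strip directory and .trnsfr.gz
        id=${file#sasmm_fsbc_durds_id000}     # drop the fixed prefix
        id=${id%%_t*}                         # drop from the first _t onward
        printf '%s,%s\n' "$id" "$file"
    done
}

# Demonstration on a throwaway tree (these paths exist only for the example).
demo_csv=$(mktemp -d)
trap 'rm -rf "$demo_csv"' EXIT
mkdir -p "$demo_csv/2012/09/13"
: > "$demo_csv/2012/09/13/sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz"
: > "$demo_csv/2012/09/13/sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz"

write_csv "$demo_csv"
```

Redirect the output the same way as before, for example write_csv . > output.csv when run from the directory that holds the year folders.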

Don Cragun


09-14-2012


My script:

Code:

#!/usr/bin/env ksh
OUTFILE=test.txt
find 20[0-1][0-9] -type f | while read path
 do
   name=${path##*/}
   name=${name%.trns*}   
   id=${name%_*}
   id=${id##*_}
   id=${id##*000}
   echo "id: $id"
   echo "file name: $name"
  done  > ${OUTFILE}
exit

My script's output:

Code:

id: 20532
file name: sasmm_fsbc_durds_id00020532_t20120112192606.dat
id: 20533
file name: sasmm_fsbc_durds_id00020533_t20120212192606.dat
id: 20534
file name: sasmm_fsbc_durds_id00020534_t20120312192606.dat
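
The output above puts each id and file name on separate lines rather than in two columns. One possible way to turn it into one comma-separated line per file, without touching the loop, is sketched below; the function name pairs_to_csv is hypothetical, and it assumes the lines strictly alternate between "id:" and "file name:".

```shell
#!/bin/sh
# pairs_to_csv: read alternating "id: ..." / "file name: ..." lines on stdin
# and print one "id,name" line per pair.
pairs_to_csv() {
    sed -e 's/^id: //' -e 's/^file name: //' | paste -d, - -
}

printf '%s\n' \
    'id: 20532' \
    'file name: sasmm_fsbc_durds_id00020532_t20120112192606.dat' \
    'id: 20533' \
    'file name: sasmm_fsbc_durds_id00020533_t20120212192606.dat' |
pairs_to_csv
```

Printing both values with a single command inside the loop, for example printf '%s,%s\n' "$id" "$name", would avoid the post-processing altogether.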

Last edited by Corona688; 09-14-2012 at 01:00 PM..

princetd001
