FILE_ID extraction from file name and save it in CSV file after looping through each folders


 
# 1  
Old 09-13-2012

My files are located on a UNIX server. I want to extract the file_id and file_name from each file name and save them in a CSV file. How do I do that?
The folders are structured as follows:
year folder -> 12 month folders inside -> 30/31 day folders inside each month

Running ls in the top-level folder lists the years:
2009 2010 2011 2012
I ran cd for the year 2012:
Code:
$ cd 2012

I ran ls in the 2012 folder:
Code:
$ ls 
01 02 03 04 05 06 07 08 09

Then I ran the same commands for September (09) and for day 13:
Code:
$ cd 09 
$ ls 
01 02 03 04 05 06 07 08 09 10 11 12 13 
$ cd 13 
$ ls 
sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz 
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz

There is a folder for each year: 2009, 2010, 2011 and 2012.
Each year folder has 12 month folders: 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12.
Each month folder has up to 31 day folders: 01, 02, 03, ... 29, 30, 31.

Each day folder contains files whose names look like this:
sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz
I want one CSV file with two columns: the first for file_id and the second for file name.
To obtain the file_id, loop through each folder, read each file name, take the substring between "sasmm_fsbc_durds_id000" and "_t", and store it in the file_id column; store the file name in the file_name column.

For example, from the file name sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz,
cut 20532 and save it in the file_id column, and save the file name in the second column as sasmm_fsbc_durds_id00020532_t20100313192606.dat

The CSV file will look like this:

Code:
file_id file_name 
20532 sasmm_fsbc_durds_id00020532_t20100313192606.dat 
20513 sasmm_fsbc_durds_id00020513_t20120913003312.dat

The file_id is cut from the file name. If you look at the file names closely, the file_id is the number after "000"; in the examples above, the file_ids are 20532 and 20513.

How do I loop through the year 2012, its 12 month folders and the day folders inside them, and create a CSV file with the data shown above?
I am very new to UNIX, so please help me out. If you could provide code, that would be great.
Thanks.



Do we need to search for files recursively in each folder, or go down to each day folder?

Moderator's Comments:
Mod Comment edit by bakunin: Please view this code tag video for how to use code tags when posting code and data.

In addition, please do not use all-caps routinely. All-caps is like spice - use it to make SOMETHING STAND OUT, but overdo it and it's tasteless.

Last edited by bakunin; 09-14-2012 at 07:02 AM..
# 2  
Old 09-13-2012
First you say that filenames in the directory 2012/09/13 are:
Code:
sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz 
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz

and then you say you want the entire filename to be the second field in your output file and say that that field should be:
Code:
sasmm_fsbc_durds_id00020532_t20100313192606.dat
sasmm_fsbc_durds_id00020513_t20120913003312.dat

What happened to the .trnsfr.gz at the end of the filenames?

Is the file_id field always supposed to be a string of decimal digits or could other characters appear in the file_id?

Is there any chance that there will be more than one occurrence of _t in a filename after sasmm_fsbc_durds_id000?

Should an error be reported if other files exist under 2???/[01][0-9]/[0-3][0-9] with filenames that don't start with sasmm_fsbc_durds_id000 and contain _t after that?
# 3  
Old 09-13-2012
A quick Ksh script that assumes the current directory contains the year directories:

Code:
#!/usr/bin/env ksh
find 20[0-1][0-9] -type f | while read path
do
    name=${path##*/}
    name=${name%.trns*}
    id=${name%_*}
    id=${id##*_}
    id=${id:2}
    echo  ${id/~(+E)^[0]+/} $name
done >output-file

Requires Kshell, and there are probably more efficient ways to do this.
# 4  
Old 09-14-2012
Quote:
Code:
sasmm_fsbc_durds_id00020532_t20100313192606.dat.trnsfr.gz
sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz

and then you say you want the entire filename to be the second field in your output file and say that that field should be:
Code:
sasmm_fsbc_durds_id00020532_t20100313192606.dat
sasmm_fsbc_durds_id00020513_t20120913003312.dat

What happened to the .trnsfr.gz at the end of the filenames?

Yes, in each folder the file names end with
.dat.trnsfr.gz
but when a name is entered in the CSV file under the file_name column, it should omit
.trnsfr.gz

The file_id is a number; it should be extracted from the file name itself.

In your code, you have not specified the output file as CSV.
Are you looping through all files inside all folders in a year?
Which code extracts the id from the file name?

How do you specify the column names in the output file?

Do you know how to write the same logic as a simple shell script?

---------- Post updated at 10:18 PM ---------- Previous update was at 09:59 PM ----------

If I use this loop, will it loop through all folders?

FILES=`ls -1`
for FILE in $FILES
do

---------- Post updated at 10:27 PM ---------- Previous update was at 10:18 PM ----------

I ran your script and got this error message:

[/work/users/po/prince]$ ./testSBI.sh
./testSBI.sh[8]: id=${id:2}: bad substitution


Your code:

Quote:
#!/usr/bin/env ksh
find 20[0-1][0-9] -type f | while read path
do
name=${path##*/}
name=${name%.trns*}
id=${name%_*}
id=${id##*_}
id=${id:2}
echo ${id/~(+E)^[0]+/} $name
done >output-file


---------- Post updated at 10:39 PM ---------- Previous update was at 10:27 PM ----------

I removed the line of code that causes the error
and ran your script without it; it threw another error:

./testSBI.sh[10]: ${id/~(+E)^[0]+/}: bad substitution
# 5  
Old 09-14-2012
Quote:
Originally Posted by princetd001
In your code, you have not specified the output file as CSV.
Are you looping through all files inside all folders in a year?
Which code extracts the id from the file name?
You can set the output file name however you want. Replace output-file with CSV, or whatever you want the output filename to be. The find command will list all files under all directories whose names match 20[0-1][0-9], that is 2000 - 2019, so yes, in a way we are looping through all files, but letting find do the work rather than the script.

The code that extracts the ID from the name is:
Code:
id=${name%_*}    # delete from last underbar to the end, and assign to variable id
id=${id##*_}    # delete from front of the string to the last underbar and reassign to id
id=${id:2}   # keep the substring starting at offset 2 (counting from 0), dropping the leading "id"

The leading zeros are removed as the variable is expanded in the echo:

Code:
${id/~(+E)^[0]+/}
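For a shell that rejects these two expansions with "bad substitution", the same result can probably be reached with plain POSIX parameter expansion. The following is only a sketch: strip_id is a hypothetical helper name, and it assumes the value always starts with the literal text "id" followed by digits, as in the sample file names.

```shell
#!/bin/sh
# Sketch: POSIX-only stand-ins for the two ksh93-style expansions.
# strip_id is a hypothetical helper, not part of the original script.
strip_id() {
    id=$1
    id=${id#id}                     # stands in for ${id:2}: drop the literal "id" prefix
    while [ "${id#0}" != "$id" ]    # stands in for ${id/~(+E)^[0]+/}:
    do                              # peel off one leading zero per pass
        id=${id#0}
    done
    printf '%s\n' "$id"
}

strip_id id00020532    # prints 20532
```

Note that an id consisting only of zeros would come out empty with this approach.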

Quote:

How do you specify the column names in the output file?
You made no mention of column names, only that the ID was to be first and the filename was to be second. The code prints ID followed by filename. Per your example there is no comma; I was a bit confused with your initial post as you indicated that the file was comma separated values (csv) yet you didn't indicate that the columns should be separated that way.
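If commas and a header row are wanted, one possible approach is to post-process the "id name" pairs that the loop prints. This is a sketch under assumptions: write_csv is a hypothetical helper name, the header text file_id,file_name is taken from the sample in the first post, and file names are assumed to contain no commas or spaces.

```shell
#!/bin/sh
# Sketch: turn "id name" lines read from stdin into comma-separated
# records, preceded by a header row. write_csv is a hypothetical helper.
write_csv() {
    printf 'file_id,file_name\n'
    while read -r id name
    do
        printf '%s,%s\n' "$id" "$name"
    done
}

printf '%s\n' '20532 sasmm_fsbc_durds_id00020532_t20100313192606.dat' | write_csv
# prints:
# file_id,file_name
# 20532,sasmm_fsbc_durds_id00020532_t20100313192606.dat
```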

Quote:

Do you know how to write the same logic as a simple shell script?
The code I posted is a simple shell script.

Quote:

If I use this loop, will it loop through all folders?

FILES=`ls -1`
for FILE in $FILES
do
Yes, but it's bad form if you ask me. Something like this would be better:

Code:
ls | while read file
do
   echo $file
done
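Keep in mind that a loop over ls output only sees the entries of the directory it runs in; it does not descend into the month and day folders by itself. A sketch of one way to walk the year/month/day layout without parsing ls, assuming the three-level structure described in the first post (list_day_files is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch: walk <year>/<month>/<day>/ with a glob instead of parsing ls.
# list_day_files is a hypothetical helper name.
list_day_files() {
    for file in "$1"/*/*/*
    do
        [ -f "$file" ] || continue    # skip directories and an unmatched glob
        printf '%s\n' "$file"
    done
}

# Prints one path per regular file found three levels below 2012;
# prints nothing if there is no such directory.
list_day_files 2012
```

The glob keeps each file name intact even if it contains spaces, which a for loop over unquoted ls output would not.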

Quote:
I ran your script and got this error message:

[/work/users/po/prince]$ ./testSBI.sh
./testSBI.sh[8]: id=${id:2}: bad substitution
Were you using ksh (Korn Shell)? Bash cannot handle the last substitution which eliminates the leading zeros from the ID. If you cannot use ksh, then you'll need to change the echo and delete the zeros with sed or some other mechanism.
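As a sketch of the sed route, the whole extraction can be done in one sed pass over the find output, with no shell substitutions at all. The names here are assumptions for illustration: list_ids is a hypothetical helper, and the pattern assumes file names of the form sasmm_fsbc_durds_id<digits>_t<timestamp>.dat.trnsfr.gz as shown in the first post.

```shell
#!/bin/sh
# Sketch: read paths on stdin and print "<id without leading zeros> <name>".
# Group 1 is the file name without .trnsfr.gz, group 2 is the id digits
# that remain after "0*" has consumed the leading zeros.
# Lines that do not match the pattern are dropped silently (-n ... p).
list_ids() {
    sed -n 's|.*/\(sasmm_fsbc_durds_id0*\([0-9]*\)_t[^/]*\)\.trnsfr\.gz$|\2 \1|p'
}

printf '%s\n' '2012/09/13/sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz' | list_ids
# prints: 20513 sasmm_fsbc_durds_id00020513_t20120913003312.dat

# Possible use with the find command from the script:
#   find 20[0-1][0-9] -type f | list_ids > output-file
```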
# 6  
Old 09-14-2012
The following seems to do what you requested. You say that you want to create a CSV file, but by definition a CSV file has fields that are separated by commas. You don't show any commas in any of your sample output. This script uses a tab to separate output fields to get the headers to line up with the following data. Although it is written using ksh, it should also work with at least bash and sh:
Code:
#!/bin/ksh
printf "file_id\tfile_name\n"
find 2[0-9][0-9][0-9] -name 'sasmm_fsbc_durds_id000[0-9]*_t?*' | while read path
do
        file=$(basename "$path" .trnsfr.gz)
        id=${file#sasmm_fsbc_durds_id000}
        id=${id%%_t*}
        printf "%s\t%s\n" "$id" "$file"
done

Note that this will ignore any files found in and under the year directories that don't match your filename specifications.

To run it, save the above code in a file (e.g., extract) in the same directory where the year directories reside, make it executable by issuing the command:
Code:
chmod +x extract

and then issue the command:
Code:
./extract > output_file

If you leave off > output_file, the output will be written to your terminal. If you want to save the output in a file with a name other than output_file, replace it with any name you want.
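If files that do not match the naming scheme should be reported rather than skipped silently (the question raised in post #2), a variation along these lines might work. It is a sketch only: classify is a hypothetical helper name and the wording of the message is an assumption.

```shell
#!/bin/sh
# Sketch: matching names go to stdout as "id<TAB>name"; anything else is
# reported on stderr. classify is a hypothetical helper name.
classify() {
    while read -r path
    do
        file=${path##*/}                          # strip the directory part
        case $file in
        sasmm_fsbc_durds_id000[0-9]*_t?*)
            file=${file%.trnsfr.gz}               # drop the transfer suffix
            id=${file#sasmm_fsbc_durds_id000}     # drop the fixed prefix
            id=${id%%_t*}                         # drop from the first "_t" on
            printf '%s\t%s\n' "$id" "$file"
            ;;
        *)
            printf 'unexpected file: %s\n' "$path" >&2
            ;;
        esac
    done
}

printf '%s\n' '2012/09/13/sasmm_fsbc_durds_id00020513_t20120913003312.dat.trnsfr.gz' | classify

# Possible use: find 2[0-9][0-9][0-9] -type f | classify > output_file
```

Because the messages go to stderr, redirecting stdout to output_file still leaves them visible on the terminal.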
# 7  
Old 09-14-2012
script
Code:
#!/usr/bin/env ksh
OUTFILE=test.txt
find 20[0-1][0-9] -type f | while read path
 do
   name=${path##*/}
   name=${name%.trns*}   
   id=${name%_*}
   id=${id##*_}
   id=${id##*000}
   echo "id: $id"
   echo "file name: $name"
  done  > ${OUTFILE}
exit


My script result:
Code:
id: 20532
file name: sasmm_fsbc_durds_id00020532_t20120112192606.dat
id: 20533
file name: sasmm_fsbc_durds_id00020533_t20120212192606.dat
id: 20534
file name: sasmm_fsbc_durds_id00020534_t20120312192606.dat


Last edited by Corona688; 09-14-2012 at 01:00 PM..