Add unique identifier from file to filetype in directory


 
# 1  
Old 11-25-2016

I am trying to add a unique identifier to two file extensions .bam and .vcf in a directory located at /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome.

The identifier is in $2 of the input file. The bash below tries to strip off the last component of the path used in the for loop, which is R_2016_09_21_14_01_15_user_S5-00580-9-Medexome.

That string appears exactly once in the input file, and the 3 lines above it contain the identifiers.

The identifiers are the $2 values (MEV21, MEV22, MEV23):
Code:
IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome   ---- this line is matched from the path in the directory

There are 3 .bam files in /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome, and each one will match one $1 value in input. The corresponding $2 value is the identifier used to rename the file in /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome.

My actual data is several hundreds of lines but I have included a sample dataset:

input (file to update from) located at /home/cmccabe/s5_files/identifier
Code:
IonXpress_004 MEV49
IonXpress_005 MEV50
IonXpress_006 MEV51
R_2016_10_21_12_39_06_user_S5-00580-11-Medexome

IonXpress_001 MEC2
IonXpress_002 MEC3
IonXpress_003 MEV48
R_2016_10_21_09_52_37_user_S5-00580-10-Medexome

IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome

files in /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome
Code:
MEV21.bam
MEV21.vcf
MEV22.bam
MEV22.vcf
MEV23.bam
MEV23.vcf

desired output
Code:
IonXpress_007.bam
IonXpress_007.vcf
IonXpress_008.bam
IonXpress_008.vcf
IonXpress_009.bam
IonXpress_009.vcf

bash
Code:
for file in /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome/*.{bam,vcf}; do
   f="${file##*/}"
   path="${file%/*}"
   dt="${path##*/}"
   mv "$file" "$path/$(awk -v dt="$dt" -v f="$f" 'NF==1 {
               p=$0==dt ? 1 : 0; next} p && $1==f{print $2}' /home/cmccabe/s5_files/identifier/input)"
done

Currently the code does run, but the files are not renamed with the identifier. There will always be a match between the path and the input. Thank you.
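
A possible explanation for why nothing is renamed, sketched under the assumption that the input file is laid out exactly as in the sample: the awk flag p is only set when the directory line is read, but the identifier pairs come before that line, and $1 (for example IonXpress_007) is compared with a full filename such as MEV21.bam. The lookup then prints nothing, so mv is given the directory itself as its target and no name changes. The wrapper name lookup and the temporary file are hypothetical and exist only for this illustration.

```shell
#!/bin/bash
# Reproduce the lookup from the loop above on the sample data.
# "lookup" is a hypothetical wrapper; the awk program is the one from the post.
lookup() {   # usage: lookup DIRNAME FILENAME INPUTFILE
    awk -v dt="$1" -v f="$2" 'NF==1 {
        p=$0==dt ? 1 : 0; next} p && $1==f{print $2}' "$3"
}

# Temporary copy of one block of the sample input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome
EOF

result=$(lookup R_2016_09_21_14_01_15_user_S5-00580-9-Medexome MEV21.bam "$sample")
printf 'lookup returned: [%s]\n' "$result"   # the brackets make an empty result visible
rm -f "$sample"
```

With this sample the brackets come out empty, which is consistent with the script running without renaming anything.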

Last edited by cmccabe; 11-25-2016 at 09:37 PM.. Reason: fixed format, updated awk
# 2  
Old 11-25-2016
Are you trying to rename *.bam and *.vcf files? Or, only *.bam files? There is nothing in your current script that does anything with *.vcf files, is there?

Do you only want to rename the files in the directory /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome? Or do you want to rename the files in all of the directories in /home/cmccabe/Desktop/index/ that are named in your input file?

Do you need to look at the directory name to determine which name change needs to occur, or will the "unique identifiers" found in your input file be unique across all directories?
# 3  
Old 11-25-2016
I updated the script to rename both .bam and .vcf, change is in bold. Hopefully, that is what I need to do.

Only the files in the current directory are renamed, in this case /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome.

The path (current directory) is unique in the input file, and the 3 lines above it contain the identifiers.

So if R_2016_09_21_14_01_15_user_S5-00580-9-Medexome is the path or current directory:
Code:
in the input file
IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome  --- path ----

Thank you.
# 4  
Old 11-26-2016
I am fully aware that your current code only processes the contents of one directory. What I asked was whether or not you wanted your script to only process one directory. Since you didn't answer that question, the following script only processes one subdirectory of the directory /home/cmccabe/Desktop/index. (With changes to two lines and removal of a third line in this script, you can make it process files in all of the subdirectories instead of just processing files in one subdirectory.) The subdirectory is specified by an operand passed to your script (which defaults to R_2016_09_21_14_01_15_user_S5-00580-9-Medexome if no operand is given when you invoke this script).

Code:
#!/bin/bash
BaseDir=/home/cmccabe/Desktop/index
TranslationFile=/home/cmccabe/s5_files/identifier/input
SubDir=${1:-R_2016_09_21_14_01_15_user_S5-00580-9-Medexome}

cd "$BaseDir/$SubDir" || exit 1
printf '%s\n' *.bam *.vcf | awk '
FNR == NR  {
	if(NF == 2)
		old[$2] = $1
	next
}
{	prefix = substr($0, 1, length($0) - 4)
	suffix = substr($0, length($0) - 3)
	if(prefix in old)
		printf("mv \"%s\" \"%s%s\"\n", $0, old[prefix], suffix)
	else	printf("# No translation found for \"%s\"\n", $0)
}' "$TranslationFile" -

which, with the sample data you provided, produces the output:
Code:
mv "MEV21.bam" "IonXpress_007.bam"
mv "MEV22.bam" "IonXpress_008.bam"
mv "MEV23.bam" "IonXpress_009.bam"
mv "MEV21.vcf" "IonXpress_007.vcf"
mv "MEV22.vcf" "IonXpress_008.vcf"
mv "MEV23.vcf" "IonXpress_009.vcf"

If you like the list of commands produced by this script, run it again and pipe the output to a shell.

And, as always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
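
The "review first, then pipe to a shell" workflow can be sketched as follows. This is only an illustration: gen_cmds is a hypothetical stand-in for the script above that prints two fixed lines, and the files live in a throwaway temporary directory. Filenames containing double quotes, dollar signs, or backquotes would need extra escaping before being handed to a shell this way.

```shell
#!/bin/bash
# Dry run first, then execute: the generator only prints mv commands,
# and nothing is renamed until its output is piped to a shell.
# gen_cmds is a hypothetical stand-in for the script above.
gen_cmds() {
    printf 'mv "%s" "%s"\n' MEV21.bam IonXpress_007.bam
    printf '# No translation found for "%s"\n' other.bam
}

# Work in a scratch directory so no real data is touched.
work=$(mktemp -d)
cd "$work" || exit 1
touch MEV21.bam other.bam

gen_cmds            # step 1: review the commands; no files change
gen_cmds | sh       # step 2: run them; the comment line is ignored by sh
ls
```

After the second step the scratch directory holds IonXpress_007.bam and other.bam; the "No translation found" line is a shell comment, so it is harmless when piped.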
# 5  
Old 11-26-2016
The full code is below, since I want to process only 1 directory at a time, as you knew, I use the first portion to ensure this.

Code:
#!/bin/bash

# get oldest folder
dir=/home/cmccabe/Desktop/index
{
  read -r -d $'\t' time && read -r -d '' filename
} < <(find "$dir" -maxdepth 1 -mindepth 1 -printf '%T+\t%P\0' | sort -z )
printf "The oldest folder is $filename and was created on $time, analysis was performed using v1.3 of the medex pipeline by $USER at $(date "+%D %r")\n" >> /home/cmccabe/Desktop/index/log

# rename bam
cd /home/cmccabe/Desktop/index/$filename
   rename 's/^([^_]+_[^_]+)_.+$/$1.bam/' *.bam
   
# rename vcf files
cd /home/cmccabe/Desktop/index/$filename
   rename 's/^([^_]+_[^_]+)_.+$/$1.vcf/' *.vcf
   
# rename .bam.bai files
cd /home/cmccabe/Desktop/index/$filename
   rename 's/^([^_]+_[^_]+)_.+$/$1.bam.bai/' *.bam.bai

# add identifier to bam and vcf
BaseDir=/home/cmccabe/Desktop/index  # search dir
TranslationFile=/home/cmccabe/s5_files/identifier/input #input
SubDir=${1:-$filename} # specific subdir

cd "$BaseDir/$SubDir" # look in this folder
printf '%s\n' *.bam *.vcf *.bam.bai | awk '
FNR == NR  {  # process all rows and columns
    if(NF == 2) # 2 columns in input
        old[$2] = $1  # old identifier
    next  # next line
}
{    prefix = substr($0, 1, length($0) - 4)
    if(prefix in old)
        printf("mv \"%s\" \"%s%s\"\n", $0, old[prefix])
    else    printf("# No translation found for \"%s\"\n", $0) #not found
}' "$TranslationFile" - # update from

Since there is no suffix in $1 of input, I removed the suffix lines from the code and added a third file type to search, .bam.bai. The output I get is shown after the input format below.

input format
Code:
IonXpress_001 MEC2
IonXpress_002 MEC3
IonXpress_003 MEV48
R_2016_10_21_09_52_37_user_S5-00580-10-Medexome

IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome

output
Code:
# No translation found for "IonXpress_007.bam"
# No translation found for "IonXpress_007.bam"
# No translation found for "IonXpress_007.bam"
# No translation found for "IonXpress_008.vcf"
# No translation found for "IonXpress_008.vcf"
# No translation found for "IonXpress_008.vcf"
# No translation found for "IonXpress_009.bam.bai"
# No translation found for "IonXpress_009.bam.bai"
# No translation found for "IonXpress_009.bam.bai"

Also, I am not sure what you mean by output to a shell, as the files in the subdirectory should be updated with the name from input. I tried to follow your code and made comments that I hope are correct, but I do not quite understand the portion in bold. I think that is what updates the identifiers, but I am not quite sure.

Example
Code:
IonXpress_007.bam  >>> MEV21.bam   ---- since the IonXpress_007 in the .bam located in the subdir matches $1 of input that .bam file is updated with $2 of input
IonXpress_007.vcf >>> MEV21.vcf   ---- since the IonXpress_007 in the .vcf  located in the subdir matches $1 of input  that .vcf file is updated with $2 of  input
IonXpress_007.bam.bai >>> MEV21.bam.bai  ---- since the IonXpress_007 in the .bam.bai  located in the subdir matches $1 of input  that .bam.bai file is updated with $2 of  input
IonXpress_008.bam  >>> MEV22.bam   ---- since the IonXpress_008 in the .bam  located in the subdir matches $1 of input  that .bam file is updated with $2 of  input
IonXpress_008.vcf >>> MEV22.vcf  ---- since the IonXpress_008 in the .vcf  located in the subdir matches $1 of input  that .vcf file is updated with $2 of  input
IonXpress_008.bam.bai >>> MEV22.bam.bai   ---- since the IonXpress_008 in the .bam.bai  located in the subdir matches $1 of input  that .bam.bai file is updated with $2 of  input
IonXpress_009.bam  >>> MEV23.bam   ---  since the IonXpress_009 in the .bam  located in the subdir matches $1 of input  that .bam file is updated with $2 of  input
IonXpress_009.vcf >>> MEV23.vcf   ---  since the IonXpress_009 in the .vcf  located in the subdir matches $1 of input  that .vcf file is updated with $2 of  input
IonXpress_009.bam.bai >>> MEV23.bam.bai  ---  since the IonXpress_009 in the .bam.bai  located in the subdir matches $1 of input  that .bam.bai file is updated with $2 of  input

Thank you for your help.

Last edited by cmccabe; 11-26-2016 at 12:30 PM.. Reason: added details
# 6  
Old 11-27-2016
Yes. I was fully aware that you only wanted to process one directory at a time. I gave you a script that processed one directory at a time (and told you that changing two lines and removing one line from that script would make it process all directories in a single run).

Did you even try running my suggested script with one operand (the name of the directory under /home/cmccabe/Desktop/index that you wanted to process)? Or, did you just decide to make my code fail by changing the names of all of the files you said you wanted my script to process before you invoked my script AND by deciding that some file suffixes to be processed will now be eight characters long instead of the four characters that you originally specified (.bam and .vcf)???

What do you mean there is no suffix in $1 so you fixed my code??? My code extracted the existing suffix from the names of files being processed into an awk variable named suffix and looked for the prefix in $2. When it renamed the file it was processing, it replaced the prefix found in $2 with the prefix in $1 (as you requested) and retained whatever four character suffix was on the existing filename. You gave no indication that there were other suffixes to be processed and you never gave a description of the format of the prefixes that could be present in your input file (so I had to assume that some of the hundreds of prefixes in your input file might contain a <period> character and that I couldn't be guaranteed that the first <period> found in a filename was the start of that filename's suffix). Therefore, my code assumed that the suffix was always four characters (as it was in all of your examples until you decided to change everything in post #5).
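
To make the four-character assumption concrete, here is a small sketch that applies the same two substr() calls to three names. split_name is a hypothetical helper written for this illustration; only the substr() expressions come from the script in post #4.

```shell
#!/bin/bash
# Show how a fixed four-character suffix split behaves on each filename.
# split_name is a hypothetical helper; the substr() calls mirror post #4.
split_name() {
    printf '%s\n' "$@" | awk '{
        prefix = substr($0, 1, length($0) - 4)   # everything but the last 4 chars
        suffix = substr($0, length($0) - 3)      # the last 4 chars
        printf("%s -> prefix=%s suffix=%s\n", $0, prefix, suffix)
    }'
}

split_name MEV21.bam MEV21.vcf MEV21.bam.bai
```

For MEV21.bam and MEV21.vcf the prefix is MEV21, but for MEV21.bam.bai the prefix becomes MEV21.bam and the suffix .bai, so a lookup keyed on MEV21 cannot succeed for the eight-character suffix.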

I don't have a rename utility on the system I'm using, so I can't verify what I think your code is doing, but I'm guessing that your code is renaming files with names of the form s1_s2_s3.bam to s1_s2.bam (and the same thing for the suffixes .vcf and .bam.bai) where s1 and s2 are arbitrary strings that contain no underscore characters and do contain at least one character that is not an underscore and s3 is an arbitrary string of any zero or more characters. I have no idea what your original filenames were before this transformation, but I do know that in the data you showed us, there are no underscores in any name prefix in $2 in your input file. And, the output you got from my code stated that no prefix in $2 matched the prefix in the filename IonXpress_007.bam and the other eight files it listed. You will find those three prefixes in $1 in input, but none of them appear in $2!
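
For readers who also lack rename, the effect of that regular expression can be approximated on plain strings. This sketch uses sed -E as a stand-in, on the assumption that the pattern behaves the same way in both tools for names like these. The helper shorten and the long example name are hypothetical, since the original filenames were never shown in the thread.

```shell
#!/bin/bash
# Apply the same substitution as the rename call, but to strings rather than
# files, using sed -E as a stand-in (assumption: the regex behaves the same
# way in both tools for names like these).
shorten() {   # hypothetical helper: usage shorten NAME SUFFIX
    printf '%s\n' "$1" | sed -E "s/^([^_]+_[^_]+)_.+\$/\\1$2/"
}

shorten IonXpress_007_R_2016_09_21.bam .bam   # hypothetical long name: s1_s2_s3.bam
shorten IonXpress_007.bam .bam                # only one underscore: no match, unchanged
shorten MEV21.bam .bam                        # no underscore: no match, unchanged
```

Under that assumption, a name with at least two underscores is cut back to its first two underscore-separated parts plus the suffix, while names such as IonXpress_007.bam and MEV21.bam pass through untouched.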

I fully admit that my original code doesn't stand a chance of working with your new (still incomplete) specification, but the changes you made made it much less likely that the code I suggested will ever work. And, I have absolutely no idea what you expect my script to do with the nine files for which it reported that it found no translation in your input.

It does seem extremely inefficient to rename files using rename and then use awk to rename them again using mv commands. And with the data you showed us in post #1 and post #5, I have absolutely no idea how you got the output you showed us above. I would have expected something much more like:
Code:
# No translation found for "IonXpress_007.bam"
# No translation found for "IonXpress_008.bam"
# No translation found for "IonXpress_009.bam"
# No translation found for "IonXpress_007.vcf"
# No translation found for "IonXpress_008.vcf"
# No translation found for "IonXpress_009.vcf"
# No translation found for "IonXpress_007.bam.bai"
# No translation found for "IonXpress_008.bam.bai"
# No translation found for "IonXpress_009.bam.bai"

If you want a script that does something completely different from what you have specified in all of your posts in this thread, you obviously need code that is different from what I wrote that tried to do what you specified in posts #1 and #3. You have yet to show us anywhere where my script behaved differently than you requested with input that matched what you described. And you have yet to show us that the directory you are trying to process contains any files that you said you wanted to rename.
# 7  
Old 11-27-2016
I admit I did not fully understand your code, and I did run it as you had it. That is how I got the No translation found messages. I only removed the suffix lines because I thought they referred to $1 in input, and since that didn't match, I got No translation found. I was mistaken and hope the below helps.

Files in directory being updated in dir: R_2016_09_21_14_01_15_user_S5-00580-9-Medexome
Code:
IonXpress_007.bam
IonXpress_007.vcf
IonXpress_007.bam.bai
IonXpress_008.bam
IonXpress_008.vcf
IonXpress_008.bam.bai
IonXpress_009.bam
IonXpress_009.vcf
IonXpress_009.bam.bai

input
Code:
IonXpress_001 MEC2
IonXpress_002 MEC3
IonXpress_003 MEV48
R_2016_10_21_09_52_37_user_S5-00580-10-Medexome

IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome  --- line matches dir

The identifier is in $2 of the input file. That is what the file in dir should be updated with. The $1 value will match the file name (before it is renamed).

So, using the dir in the example, the files after renaming would be:
Code:
MEV21.bam
MEV21.vcf
MEV21.bam.bai
MEV22.bam
MEV22.vcf
MEV22.bam.bai
MEV23.bam
MEV23.vcf
MEV23.bam.bai

Each filename in dir (IonXpress_007, IonXpress_008, IonXpress_009) will match a $1 value in input. The corresponding $2 value is what the filename in dir is renamed to. The dir name will match only one line in the input file, and the 3 lines above it contain the identifiers. Thank you very much.
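
One way to express the requirement as restated here is sketched below. It is not the code from post #4, and it rests on assumptions the thread does not confirm: prefixes contain no period (so the suffix starts at the first period), and every block of "prefix identifier" lines is followed by its directory line. The helper gen_renames and the temporary input file are hypothetical. As in post #4, the sketch only prints mv commands so they can be reviewed before being piped to a shell.

```shell
#!/bin/bash
# Sketch: print mv commands that rename PREFIX.suffix to IDENTIFIER.suffix,
# using only the block of the translation file that ends with the directory name.
# Assumptions (not confirmed in the thread): prefixes contain no period, and
# each block of "prefix identifier" lines is followed by its directory line.
gen_renames() {   # hypothetical helper: usage gen_renames TRANSLATION_FILE DIRNAME
    awk -v dir="$2" '
    FNR == NR {                      # first input: the translation file
        if (NF == 2)
            pend[$1] = $2            # remember pairs until the block ends
        else if (NF == 1) {
            if ($1 == dir)           # this block belongs to our directory
                for (k in pend) new[k] = pend[k]
            split("", pend)          # start collecting the next block
        }
        next
    }
    {                                # second input: filenames on stdin
        dot = index($0, ".")
        prefix = dot ? substr($0, 1, dot - 1) : $0
        suffix = dot ? substr($0, dot) : ""
        if (prefix in new)
            printf("mv \"%s\" \"%s%s\"\n", $0, new[prefix], suffix)
        else
            printf("# No translation found for \"%s\"\n", $0)
    }' "$1" -
}

# Demo on a temporary copy of the sample input from this post.
input=$(mktemp)
cat > "$input" <<'EOF'
IonXpress_001 MEC2
IonXpress_002 MEC3
IonXpress_003 MEV48
R_2016_10_21_09_52_37_user_S5-00580-10-Medexome

IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome
EOF

printf '%s\n' IonXpress_007.bam IonXpress_007.bam.bai IonXpress_001.vcf |
    gen_renames "$input" R_2016_09_21_14_01_15_user_S5-00580-9-Medexome
```

In the demo, the two IonXpress_007 names map to MEV21.bam and MEV21.bam.bai, while IonXpress_001.vcf is reported as untranslated because that prefix belongs to the other directory's block. In a real run the filenames would come from something like printf '%s\n' *.bam *.vcf *.bam.bai inside the subdirectory.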

Last edited by cmccabe; 11-27-2016 at 10:35 PM.. Reason: added details