Add unique identifier from file to filetype in directory

11-25-2016


I am trying to add a unique identifier to the files with two extensions, .bam and .vcf, in the directory /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome.

The identifier is in $2 of the input file. The bash code below attempts to strip off the last component of the path used in the for loop, which is R_2016_09_21_14_01_15_user_S5-00580-9-Medexome.

That string appears exactly once in the input file, and the 3 lines above it contain the identifiers.

The strings in $2 (MEV21, MEV22, MEV23) are the identifiers.

Code:

IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome   ---- this line is matched from the path in the directory

There are 3 .bam files in /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome, and each one will match one $1 value in input. The corresponding $2 value is what is used as the identifier to update the file in /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome.

My actual data is several hundred lines, but I have included a sample dataset:

input (file to update from) located at /home/cmccabe/s5_files/identifier

Code:

IonXpress_004 MEV49
IonXpress_005 MEV50
IonXpress_006 MEV51
R_2016_10_21_12_39_06_user_S5-00580-11-Medexome

IonXpress_001 MEC2
IonXpress_002 MEC3
IonXpress_003 MEV48
R_2016_10_21_09_52_37_user_S5-00580-10-Medexome

IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome

files in /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome

Code:

MEV21.bam
MEV21.vcf
MEV22.bam
MEV22.vcf
MEV23.bam
MEV23.vcf

desired output

Code:

IonXpress_007.bam
IonXpress_007.vcf
IonXpress_008.bam
IonXpress_008.vcf
IonXpress_009.bam
IonXpress_009.vcf

bash

Code:

for file in /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome/*.{bam,vcf}; do
   f="${file##*/}"
   path="${file%/*}"
   dt="${path##*/}"
   mv "$file" "$path/$(awk -v dt="$dt" -v f="$f" 'NF==1 {
               p=$0==dt ? 1 : 0; next} p && $1==f{print $2}' /home/cmccabe/s5_files/identifier/input)"
done

Currently the code does run, but the files do not update with the identifier. There will always be a match between the path and input. Thank you
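For reference, the three parameter expansions in the loop can be checked on their own. The sketch below is only an illustration, using one hypothetical file path and an extra variable `base` that is not in the posted script. It shows that `$f` still carries the extension, so a comparison such as `$1==f` against an input file whose first column has no extension would not be expected to match.

```shell
#!/bin/bash
# Hypothetical full path to one file in the run directory.
file=/home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome/MEV21.bam

f="${file##*/}"      # remove everything up to the last "/": MEV21.bam
path="${file%/*}"    # remove the last "/" and what follows: the directory
dt="${path##*/}"     # last component of the directory: the run name
base="${f%.*}"       # assumption: drop the final ".suffix" to get MEV21

printf 'f=%s\ndt=%s\nbase=%s\n' "$f" "$dt" "$base"
```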

Last edited by cmccabe; 11-25-2016 at 09:37 PM.. Reason: fixed format, updated awk

cmccabe


11-25-2016


Are you trying to rename *.bam and *.vcf files? Or, only *.bam files? There is nothing in your current script that does anything with *.vcf files, is there?

Do you only want to rename the files in the directory /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome? Or do you want to rename the files in all of the directories in /home/cmccabe/Desktop/index/ that are named in your input file?

Do you need to look at the directory name to determine which name change needs to occur, or will the "unique identifiers" found in your input file be unique across all directories?

Don Cragun

11-25-2016


I updated the script to rename both .bam and .vcf files; the change is in bold. Hopefully, that is what I need to do.

Only the files in the current directory are renamed, in this case /home/cmccabe/Desktop/index/R_2016_09_21_14_01_15_user_S5-00580-9-Medexome.

The path (the current directory) is unique in the input file, and the 3 lines above it contain the identifiers.

So if R_2016_09_21_14_01_15_user_S5-00580-9-Medexome is the path (the current directory), then in the input file:

Code:

IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome  --- path ----

Thank you

cmccabe


11-26-2016


I am fully aware that your current code only processes the contents of one directory. What I asked was whether or not you wanted your script to only process one directory. Since you didn't answer that question, the following script only processes one subdirectory of the directory /home/cmccabe/Desktop/index. (With changes to two lines and removal of a third line in this script, you can make it process files in all of the subdirectories instead of just processing files in one subdirectory.) The subdirectory is specified by an operand passed to your script (which defaults to R_2016_09_21_14_01_15_user_S5-00580-9-Medexome if no operand is given when you invoke this script).

Code:

#!/bin/bash
BaseDir=/home/cmccabe/Desktop/index
TranslationFile=/home/cmccabe/s5_files/identifier/input
SubDir=${1:-R_2016_09_21_14_01_15_user_S5-00580-9-Medexome}

cd "$BaseDir/$SubDir"
printf '%s\n' *.bam *.vcf | awk '
FNR == NR  {
	if(NF == 2)
		old[$2] = $1
	next
}
{	prefix = substr($0, 1, length($0) - 4)
	suffix = substr($0, length($0) - 3)
	if(prefix in old)
		printf("mv \"%s\" \"%s%s\"\n", $0, old[prefix], suffix)
	else	printf("# No translation found for \"%s\"\n", $0)
}' "$TranslationFile" -

which, with the sample data you provided, produces the output:

Code:

mv "MEV21.bam" "IonXpress_007.bam"
mv "MEV22.bam" "IonXpress_008.bam"
mv "MEV23.bam" "IonXpress_009.bam"
mv "MEV21.vcf" "IonXpress_007.vcf"
mv "MEV22.vcf" "IonXpress_008.vcf"
mv "MEV23.vcf" "IonXpress_009.vcf"

If you like the list of commands produced by this script, run it again and pipe the output to a shell.

And, as always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
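As a rough illustration of the two-pass workflow described above (review the generated commands, then pipe them to a shell), the sketch below runs the same awk logic in a throwaway directory. The `work` sandbox, the `gen` helper function, and the three sample files are assumptions made for this example, not part of the original script.

```shell
#!/bin/bash
# Throwaway sandbox standing in for the real directories (assumption).
work=$(mktemp -d)
mkdir "$work/run"
printf '%s\n' 'IonXpress_007 MEV21' 'IonXpress_008 MEV22' > "$work/input"
touch "$work/run/MEV21.bam" "$work/run/MEV21.vcf" "$work/run/MEV22.bam"

# Same idea as the script above: print mv commands instead of running them.
gen() {
  ( cd "$work/run" && printf '%s\n' *.bam *.vcf | awk '
    FNR == NR { if (NF == 2) old[$2] = $1; next }
    { prefix = substr($0, 1, length($0) - 4)
      suffix = substr($0, length($0) - 3)
      if (prefix in old)
        printf("mv \"%s\" \"%s%s\"\n", $0, old[prefix], suffix)
      else
        printf("# No translation found for \"%s\"\n", $0)
    }' "$work/input" - )
}

gen                               # first pass: review the commands
gen | ( cd "$work/run" && sh )    # second pass: let a shell execute them
result=$(ls "$work/run")
printf '%s\n' "$result"
rm -rf "$work"
```

Nothing is renamed until the second pass, so the first pass can be repeated as often as needed while checking the translation file.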

Don Cragun

11-26-2016


The full code is below. Since I want to process only 1 directory at a time, as you knew, I use the first portion to ensure this.

Code:

#!/bin/bash

# get oldest folder
dir=/home/cmccabe/Desktop/index
{
  read -r -d $'\t' time && read -r -d '' filename
} < <(find "$dir" -maxdepth 1 -mindepth 1 -printf '%T+\t%P\0' | sort -z )
printf "The oldest folder is $filename and was created on $time, analysis was performed using v1.3 of the medex pipeline by $USER at $(date "+%D %r")\n" >> /home/cmccabe/Desktop/index/log

# rename bam
cd /home/cmccabe/Desktop/index/$filename
   rename 's/^([^_]+_[^_]+)_.+$/$1.bam/' *.bam
   
# rename vcf files
cd /home/cmccabe/Desktop/index/$filename
   rename 's/^([^_]+_[^_]+)_.+$/$1.vcf/' *.vcf
   
# rename .bam.bai files
cd /home/cmccabe/Desktop/index/$filename
   rename 's/^([^_]+_[^_]+)_.+$/$1.bam.bai/' *.bam.bai

# add identifier to bam and vcf
BaseDir=/home/cmccabe/Desktop/index  # search dir
TranslationFile=/home/cmccabe/s5_files/identifier/input #input
SubDir=${1:-$filename} # specific subdir

cd "$BaseDir/$SubDir" # look in this folder
printf '%s\n' *.bam *.vcf *.bam.bai | awk '
FNR == NR  {  # process all rows and columns
    if(NF == 2) # 2 columns in input
        old[$2] = $1  # old identifier
    next  # next line
}
{    prefix = substr($0, 1, length($0) - 4)
    if(prefix in old)
        printf("mv \"%s\" \"%s%s\"\n", $0, old[prefix])
    else    printf("# No translation found for \"%s\"\n", $0) #not found
}' "$TranslationFile" - # update from

Since there is no suffix in $1 of input, I removed the suffix lines from the code and added a third file type to search for, .bam.bai.

input format

Code:

IonXpress_001 MEC2
IonXpress_002 MEC3
IonXpress_003 MEV48
R_2016_10_21_09_52_37_user_S5-00580-10-Medexome

IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome

With that input I get:

Code:

# No translation found for "IonXpress_007.bam"
# No translation found for "IonXpress_007.bam"
# No translation found for "IonXpress_007.bam"
# No translation found for "IonXpress_008.vcf"
# No translation found for "IonXpress_008.vcf"
# No translation found for "IonXpress_008.vcf"
# No translation found for "IonXpress_009.bam.bai"
# No translation found for "IonXpress_009.bam.bai"
# No translation found for "IonXpress_009.bam.bai"

Also, I am not sure what you mean by piping the output to a shell, as the files in the subdirectory should be updated with the name from input. I tried to follow your code and made comments that I hope are correct, but I do not quite understand the portion in bold. I think that is what updates the identifiers, but I am not quite sure.

Example

Code:

IonXpress_007.bam  >>> MEV21.bam   ---- since the IonXpress_007 in the .bam located in the subdir matches $1 of input that .bam file is updated with $2 of input
IonXpress_007.vcf >>> MEV21.vcf   ---- since the IonXpress_007 in the .vcf  located in the subdir matches $1 of input  that .vcf file is updated with $2 of  input
IonXpress_007.bam.bai >>> MEV21.bam.bai  ---- since the IonXpress_007 in the .bam.bai  located in the subdir matches $1 of input  that .bam.bai file is updated with $2 of  input
IonXpress_008.bam  >>> MEV22.bam   ---- since the IonXpress_008 in the .bam  located in the subdir matches $1 of input  that .bam file is updated with $2 of  input
IonXpress_008.vcf >>> MEV22.vcf  ---- since the IonXpress_008 in the .vcf  located in the subdir matches $1 of input  that .vcf file is updated with $2 of  input
IonXpress_008.bam.bai >>> MEV22.bam.bai   ---- since the IonXpress_008 in the .bam.bai  located in the subdir matches $1 of input  that .bam.bai file is updated with $2 of  input
IonXpress_009.bam  >>> MEV23.bam   ---  since the IonXpress_009 in the .bam  located in the subdir matches $1 of input  that .bam file is updated with $2 of  input
IonXpress_009.vcf >>> MEV23.vcf   ---  since the IonXpress_009 in the .vcf  located in the subdir matches $1 of input  that .vcf file is updated with $2 of  input
IonXpress_009.bam.bai >>> MEV23.bam.bai  ---  since the IonXpress_009 in the .bam.bai  located in the subdir matches $1 of input  that .bam.bai file is updated with $2 of  input

Thank you for your help

Last edited by cmccabe; 11-26-2016 at 12:30 PM.. Reason: added details

cmccabe


11-27-2016


Quote:

Originally Posted by cmccabe

The full code is below, since I want to process only 1 directory at a time, as you knew, I use the first portion to ensure this.


Yes. I was fully aware that you only wanted to process one directory at a time. I gave you a script that processed one directory at a time (and told you that changing two lines and removing one line from that script would make it process all directories in a single run).

Did you even try running my suggested script with one operand (the name of the directory under /home/cmccabe/Desktop/index that you wanted to process)? Or, did you just decide to make my code fail by changing the names of all of the files you said you wanted my script to process before you invoked my script AND by deciding that some file suffixes to be processed will now be eight characters long instead of the four characters that you originally specified (.bam and .vcf)???

What do you mean there is no suffix in $1 so you fixed my code??? My code extracted the existing suffix from the names of files being processed into an awk variable named suffix and looked for the prefix in $2. When it renamed the file it was processing, it replaced the prefix found in $2 with the prefix in $1 (as you requested) and retained whatever four character suffix was on the existing filename. You gave no indication that there were other suffixes to be processed and you never gave a description of the format of the prefixes that could be present in your input file (so I had to assume that some of the hundreds of prefixes in your input file might contain a <period> character and that I couldn't be guaranteed that the first <period> found in a filename was the start of that filename's suffix). Therefore, my code assumed that the suffix was always four characters (as it was in all of your examples until you decided to change everything in post #5).
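To make the four-character assumption concrete, here is a small sketch comparing the fixed-width split with a split at the first <period>. The variable names `fixed`, `dot`, and `first` are made up for this illustration, and the first-period variant is only safe under the assumption that no prefix contains a <period>.

```shell
#!/bin/bash
# Compare two ways of splitting a filename into prefix and suffix.
out=$(printf '%s\n' IonXpress_007.bam IonXpress_007.bam.bai | awk '
{
  # Fixed-width split: assumes every suffix is exactly 4 characters.
  fixed = substr($0, 1, length($0) - 4)
  # First-period split: only safe if prefixes never contain a period.
  dot = index($0, ".")
  first = (dot > 0) ? substr($0, 1, dot - 1) : $0
  printf("%s fixed=%s first=%s\n", $0, fixed, first)
}')
printf '%s\n' "$out"
```

With the eight-character suffix .bam.bai, the fixed-width split leaves .bam attached to the prefix, so a lookup keyed on the bare prefix cannot succeed for that file.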

I don't have a rename utility on the system I'm using, so I can't verify what I think your code is doing, but I'm guessing that your code is renaming files with names of the form s1_s2_s3.bam to s1_s2.bam (and the same thing for the suffixes .vcf and .bam.bai) where s1 and s2 are arbitrary strings that contain no underscore characters and do contain at least one character that is not an underscore and s3 is an arbitrary string of any zero or more characters. I have no idea what your original filenames were before this transformation, but I do know that in the data you showed us, there are no underscores in any name prefix in $2 in your input file. And, the output you got from my code stated that no prefix in $2 matched the prefix in the filename IonXpress_007.bam and the other eight files it listed. You will find those three prefixes in $1 in input, but none of them appear in $2!
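For anyone else without a rename utility, the substitution can be previewed with sed -E, which accepts the same pattern here when `$1` is written as `\1`. This is only a sketch: the first filename is hypothetical, since the thread never shows the names before the transformation.

```shell
#!/bin/bash
# Hypothetical pre-rename filename, plus one name without any underscore.
names='IonXpress_007_R_2016_09_21_user.bam
MEV21.bam'

# Same pattern as the rename call; sed only prints the resulting names.
preview=$(printf '%s\n' "$names" | sed -E 's/^([^_]+_[^_]+)_.+$/\1.bam/')
printf '%s\n' "$preview"
```

A name with fewer than two underscores does not match the pattern and passes through unchanged.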

I fully admit that my original code doesn't stand a chance of working with your new (still incomplete) specification, but the changes you made made it much less likely that the code I suggested will ever work. And, I have absolutely no idea what you expect my script to do with the nine files for which it reported that it found no translation in your input.

It does seem extremely inefficient to rename files using rename and then use awk to rename them again using mv commands. And with the data you showed us in post #1 and post #5, I have absolutely no idea how you got the output you showed us above. I would have expected something much more like:

Code:

# No translation found for "IonXpress_007.bam"
# No translation found for "IonXpress_008.bam"
# No translation found for "IonXpress_009.bam"
# No translation found for "IonXpress_007.vcf"
# No translation found for "IonXpress_008.vcf"
# No translation found for "IonXpress_009.vcf"
# No translation found for "IonXpress_007.bam.bai"
# No translation found for "IonXpress_008.bam.bai"
# No translation found for "IonXpress_009.bam.bai"

If you want a script that does something completely different from what you have specified in all of your posts in this thread, you obviously need code that is different from what I wrote that tried to do what you specified in posts #1 and #3. You have yet to show us anywhere where my script behaved differently than you requested with input that matched what you described. And you have yet to show us that the directory you are trying to process contains any files that you said you wanted to rename.

Don Cragun

11-27-2016


I admit I did not fully understand your code and did not run it as you had it. That is how I got the No Translation Found. I only removed the suffix lines because I thought they were referring to $1 in input, and since it didn't match, I got No Translation Found. I was mistaken and hope the below helps.

Files in the directory being updated (dir: R_2016_09_21_14_01_15_user_S5-00580-9-Medexome)

Code:

IonXpress_007.bam
IonXpress_007.vcf
IonXpress_007.bam.bai
IonXpress_008.bam
IonXpress_008.vcf
IonXpress_008.bam.bai
IonXpress_009.bam
IonXpress_009.vcf
IonXpress_009.bam.bai

input

Code:

IonXpress_001 MEC2
IonXpress_002 MEC3
IonXpress_003 MEV48
R_2016_10_21_09_52_37_user_S5-00580-10-Medexome

IonXpress_007 MEV21
IonXpress_008 MEV22
IonXpress_009 MEV23
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome  --- line matches dir

The identifier is in $2 of the input file. That is what the file in dir should be updated with. The $1 value will match the file name (before it is renamed).

So using the dir in the example, the files would be renamed to:

Code:

MEV21.bam
MEV21.vcf
MEV21.bam.bai
MEV22.bam
MEV22.vcf
MEV22.bam.bai
MEV23.bam
MEV23.vcf
MEV23.bam.bai

Each filename in dir (IonXpress_007, IonXpress_008, IonXpress_009) will match a $1 value in input. The corresponding $2 value is what the filename in dir is renamed to. The dir name will match only one line in the input file, and the 3 lines above it contain the identifiers. Thank you very much.
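Putting the description above together, one possible approach (a sketch under stated assumptions, not a tested solution from this thread) is to buffer the `$1 $2` pairs until the run line is read, keep them only when that line equals the directory name, and split each filename at its first period so that .bam.bai is handled too. The sandbox directory, the array names `pend` and `new`, and the shortened sample input are assumptions for illustration. The sketch only prints mv commands, and it assumes prefixes never contain a period.

```shell
#!/bin/bash
# Sandbox standing in for the index directory and the input file (assumption).
work=$(mktemp -d)
run=R_2016_09_21_14_01_15_user_S5-00580-9-Medexome
mkdir "$work/$run"
cat > "$work/input" <<'EOF'
IonXpress_001 MEC2
R_2016_10_21_09_52_37_user_S5-00580-10-Medexome

IonXpress_007 MEV21
IonXpress_008 MEV22
R_2016_09_21_14_01_15_user_S5-00580-9-Medexome
EOF
touch "$work/$run"/IonXpress_007.bam "$work/$run"/IonXpress_007.bam.bai \
      "$work/$run"/IonXpress_008.vcf

cmds=$(cd "$work/$run" && printf '%s\n' *.bam *.vcf *.bam.bai | awk -v dir="$run" '
FNR == NR {                       # first file: the translation file
  if (NF == 2) pend[$1] = $2      # buffer pairs until the run line is seen
  else if (NF == 1) {             # a run line closes the block above it
    if ($1 == dir) for (k in pend) new[k] = pend[k]
    split("", pend)               # portable way to empty the buffer
  }
  next
}
{                                 # second input: filenames on stdin
  dot = index($0, ".")            # assumes prefixes contain no period
  if (dot == 0) next
  prefix = substr($0, 1, dot - 1); suffix = substr($0, dot)
  if (prefix in new) printf("mv \"%s\" \"%s%s\"\n", $0, new[prefix], suffix)
  else printf("# No translation found for \"%s\"\n", $0)
}' "$work/input" -)
printf '%s\n' "$cmds"
rm -rf "$work"
```

Restricting the lookup to the block that ends with the directory name avoids picking up a `$1` value that is reused under a different run.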

Last edited by cmccabe; 11-27-2016 at 10:35 PM.. Reason: added details

cmccabe
