Deleting files that don't contain particular text strings / more than one instance of a string

11-04-2009


Hi all,

I have a directory containing many subdirectories each named like KOG#### where # represents any digit 0-9. There are several files in each KOG#### folder but the one I care about is named like KOG####_final.fasta. I am trying to write a script to copy all of the KOG####_final.fasta files to the same directory and then apply some filters to them.

For the filters, I want to go through each of the KOG####_final.fasta files and remove any of them that don't contain at least 10 different text strings that are specified in a text file or somewhere in the script. I'd also like to have a filter that removes files that have more than one instance of any one string.

I know this is a lot but I'm really stumped as to where to start on this one. Any assistance in getting started with this would be much appreciated!

Thanks!
Kevin

kmkocot


11-04-2009


To copy the files, you can use the command below:

Code:

find KOG* -type f -name "KOG*_final.fasta" -exec cp {} /tmp \;

But I don't understand the filter. Could you paste a sample KOG*_final.fasta file and show us the output you expect?

rdcwayx


11-06-2009


Thanks for your help! That worked really well.

The KOG*_final.fasta files look like the example below. There is a one-line header that always begins with a greater-than sign, has a 3-4 letter species abbreviation, and a sequence identifier. The next line contains the corresponding amino-acid sequence which is always on one line and doesn't wrap no matter how long it is.

>ACAL_12345
XESLGRQVPSELFEKLDYHK
>ACAL_19472
XESLGRQVPSEXFEKLDYHJ
>ACAL_19473
XESLEKDVPSELFEKLDYHJ
>CGIG_Contig2554
XESLGRQVPSQLFEKLDYHK
>CVIR_Contig1338
XESLGRQVPSELEEKLDYHK
>HROB_98421
XESLGRQVPSELFEKLDYEV
>IPAR_Contig854
QESLGRQVPSELFEKLDYHK
>LGIG_182182
PESLGRQVPSELFEKLDYHD
>MCAL_Contig3433
XESLGRQVPSELFEKLDYHG
>NVEC_166966
XESLGRQVPSELFEKLDYHK

I'm trying to write a script that will go through each of these files and check them to see if they meet certain criteria. For example, I want to move all files containing fewer than 10 greater-than signs (fewer than 10 sequences) into a "trash" folder. I've played around using if and grep -c \> for this part but I haven't figured it out yet. Is there a better way to go about this?

I'd also like to trash any files that have more than 1 sequence for any one species (although I'd like to be able to vary this number if it turns out that is too strict). Would I have to use an array for this? Or another file that specifies all of the taxon names?

Thanks!
Kevin

---------- Post updated 11-06-09 at 04:29 PM ---------- Previous update was 11-05-09 at 06:38 PM ----------

I figured out the first filter:

Code:

cutoff=6
for FileName in *.fa
do
    # Count header lines, i.e. the number of sequences in the file
    sequences=$(grep -c '^>' "$FileName")
    echo "$FileName $sequences"
    if [ "$sequences" -lt "$cutoff" ] ; then
        printf 'Too few sequences in file %s\n' "$FileName"
        mv "$FileName" ./rejected_few_seq/
    fi
done

I'm having trouble figuring out the other part. Here's what I've got so far:

Code:

for FileName in *.fa
do
grep -c ACAL_ $FileName >> taxon_count.txt
grep -c HROB_ $FileName >> taxon_count.txt
(...repeated for all species abbreviations)
?
done

I am trying to figure out how to add up all the values written to the taxon_count.txt file and remove $FileName if that total is smaller than a desired value. I'd also like to set a maximum number of sequences per taxon and remove $FileName if that maximum is exceeded. Any guidance would be greatly appreciated.

Thanks,
Kevin

kmkocot


11-10-2009


Is this what you need?

Code:

awk -F_ '/^>/ { print $1 }' "$FileName" | sort | uniq -c | sort -n
      1 >CGIG
      1 >CVIR
      1 >HROB
      1 >IPAR
      1 >LGIG
      1 >MCAL
      1 >NVEC
      3 >ACAL
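
The per-taxon counts above could be turned into the second filter roughly as follows. This is only a sketch under some assumptions: the function names (max_per_taxon, taxa_count, filter_fasta) and the rejected_few_taxa and rejected_multi_copy folder names are made up for illustration, the taxon is taken to be whatever sits between ">" and the first underscore in a header, and the files are assumed to end in .fa as in the loop posted earlier.

```shell
# Print the largest number of sequences that any single taxon has in file $1.
# The taxon is the header text before the first "_" (an assumption).
max_per_taxon() {
    awk -F_ '/^>/ { n[$1]++ }
             END  { for (t in n) if (n[t] > max) max = n[t]; print max + 0 }' "$1"
}

# Print the number of distinct taxa found in the headers of file $1.
taxa_count() {
    awk -F_ '/^>/ { if (!seen[$1]++) c++ } END { print c + 0 }' "$1"
}

# Usage: filter_fasta DIR MIN_TAXA MAX_COPIES
# Moves each *.fa file in DIR that has fewer than MIN_TAXA distinct taxa into
# DIR/rejected_few_taxa, and each remaining file in which some taxon has more
# than MAX_COPIES sequences into DIR/rejected_multi_copy (hypothetical names).
filter_fasta() {
    dir=$1
    min_taxa=$2
    max_copies=$3
    mkdir -p "$dir/rejected_few_taxa" "$dir/rejected_multi_copy"
    for f in "$dir"/*.fa; do
        [ -f "$f" ] || continue    # skip the unexpanded pattern if nothing matches
        if [ "$(taxa_count "$f")" -lt "$min_taxa" ]; then
            mv "$f" "$dir/rejected_few_taxa/"
        elif [ "$(max_per_taxon "$f")" -gt "$max_copies" ]; then
            mv "$f" "$dir/rejected_multi_copy/"
        fi
    done
}
```

For example, running filter_fasta . 8 1 in the directory holding the copied files would keep only files with at least 8 distinct taxa and no more than one sequence per taxon; the sample file above would be moved to rejected_multi_copy because ACAL appears three times. Both thresholds are arguments, so the per-taxon limit can be relaxed if 1 turns out to be too strict. Counting in awk avoids having to list every species abbreviation by hand.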

rdcwayx
