The UNIX and Linux Forums  

  #1 (permalink)  
kmkocot, 4 Weeks Ago
Deleting files that don't contain particular text strings / more than one instance of a string

Hi all,

I have a directory containing many subdirectories each named like KOG#### where # represents any digit 0-9. There are several files in each KOG#### folder but the one I care about is named like KOG####_final.fasta. I am trying to write a script to copy all of the KOG####_final.fasta files to the same directory and then apply some filters to them.

For the filters, I want to go through each of the KOG####_final.fasta files and remove any of them that don't contain at least 10 different text strings that are specified in a text file or somewhere in the script. I'd also like to have a filter that removes files that have more than one instance of any one string.

I know this is a lot but I'm really stumped as to where to start on this one. Any assistance in getting started with this would be much appreciated!

Thanks!
Kevin
  #2 (permalink)  
rdcwayx, 4 Weeks Ago
To copy the files, you can use the command below:


Code:
find KOG* -type f -name "KOG*_final.fasta" -exec cp {} /tmp \;

But I don't understand the filter. Could you paste a sample KOG*_final.fasta file and show us the output you expect?
  #3 (permalink)  
kmkocot, 4 Weeks Ago
Thanks for your help! That worked really well.

The KOG*_final.fasta files look like the example below. There is a one-line header that always begins with a greater-than sign, has a 3-4 letter species abbreviation, and a sequence identifier. The next line contains the corresponding amino-acid sequence which is always on one line and doesn't wrap no matter how long it is.

>ACAL_12345
XESLGRQVPSELFEKLDYHK
>ACAL_19472
XESLGRQVPSEXFEKLDYHJ
>ACAL_19473
XESLEKDVPSELFEKLDYHJ
>CGIG_Contig2554
XESLGRQVPSQLFEKLDYHK
>CVIR_Contig1338
XESLGRQVPSELEEKLDYHK
>HROB_98421
XESLGRQVPSELFEKLDYEV
>IPAR_Contig854
QESLGRQVPSELFEKLDYHK
>LGIG_182182
PESLGRQVPSELFEKLDYHD
>MCAL_Contig3433
XESLGRQVPSELFEKLDYHG
>NVEC_166966
XESLGRQVPSELFEKLDYHK

I'm trying to write a script that will go through each of these files and check them to see if they meet certain criteria. For example, I want to move all files containing fewer than 10 greater-than signs (fewer than 10 sequences) into a "trash" folder. I've played around using if and grep -c \> for this part but I haven't figured it out yet. Is there a better way to go about this?

I'd also like to trash any files that have more than 1 sequence for any one species (although I'd like to be able to vary this number if it turns out that is too strict). Would I have to use an array for this? Or another file that specifies all of the taxon names?

Thanks!
Kevin

---------- Post updated 11-06-09 at 04:29 PM ---------- Previous update was 11-05-09 at 06:38 PM ----------

I figured out the first filter:

Code:
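cutoff=6                     # minimum number of sequences a file must contain
mkdir -p rejected_few_seq    # make sure the destination directory exists
for FileName in *.fa
do
    # count header lines, i.e. lines that start with ">"
    sequences=$(grep -c '^>' "$FileName")
    echo "$FileName $sequences"
    if [ "$sequences" -lt "$cutoff" ] ; then
        printf 'Too few sequences in file %s\n' "$FileName"
        mv "$FileName" ./rejected_few_seq/
    fi
done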

I'm having trouble figuring out the other part. Here's what I've got so far:

Code:
for FileName in *.fa
do
    grep -c ACAL_ "$FileName" >> taxon_count.txt
    grep -c HROB_ "$FileName" >> taxon_count.txt
    # (...repeated for all species abbreviations)
    # ?
done

I am trying to figure out how to add up all the values written to taxon_count.txt and remove $FileName if the total is smaller than a desired value. I'd also like to set a maximum number of sequences per taxon and, if that is exceeded, remove $FileName. Any guidance would be greatly appreciated.

Thanks,
Kevin
  #4 (permalink)  
rdcwayx, 4 Weeks Ago
Is this what you need?


Code:
awk -F_ '/^>/ {print $1}' "$FileName" | sort | uniq -c | sort -n
      1 >CGIG
      1 >CVIR
      1 >HROB
      1 >IPAR
      1 >LGIG
      1 >MCAL
      1 >NVEC
      3 >ACAL
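
The per-species counts above could feed both filters Kevin asked about. What follows is only a rough sketch, not a tested solution for the real data: the names min_taxa, max_per_taxon, taxon_stats, filter_files and the two rejected_* directories are made up for illustration, and it assumes every header looks like >ABBR_identifier and that the files match *.fa as in the loops earlier in the thread.

```shell
#!/bin/sh
# Sketch only: thresholds, function names and directory names are assumptions.

min_taxa=10        # minimum number of distinct species a file must contain
max_per_taxon=1    # maximum number of sequences allowed for any one species

# Print "<distinct taxa> <highest count for a single taxon>" for one file.
taxon_stats() {
    awk -F_ '
        /^>/ { n[$1]++ }                  # $1 is ">ACAL", ">HROB", ...
        END {
            taxa = 0; max = 0
            for (t in n) { taxa++; if (n[t] > max) max = n[t] }
            print taxa, max
        }' "$1"
}

# Move every *.fa file that fails either threshold into a reject directory.
filter_files() {
    mkdir -p rejected_few_taxa rejected_multi_copy
    for FileName in *.fa
    do
        [ -e "$FileName" ] || continue    # nothing matched: glob stayed literal
        stats=$(taxon_stats "$FileName")
        taxa=${stats% *}                  # text before the space
        max=${stats#* }                   # text after the space
        if [ "$max" -gt "$max_per_taxon" ] ; then
            printf 'Too many sequences for one taxon in %s\n' "$FileName"
            mv "$FileName" rejected_multi_copy/
        elif [ "$taxa" -lt "$min_taxa" ] ; then
            printf 'Too few taxa in %s\n' "$FileName"
            mv "$FileName" rejected_few_taxa/
        fi
    done
}
```

To try it, source the script in the directory that holds the copied files and call filter_files. Raising max_per_taxon relaxes the one-sequence-per-species rule without touching the rest. Counting inside awk means no taxon_count.txt file and no hard-coded list of species abbreviations is needed.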
