Shell Programming and Scripting: Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts and shell scripting languages here.
Hi all,
I have a directory containing many subdirectories, each named like KOG#### where # represents any digit 0-9. There are several files in each KOG#### folder, but the one I care about is named like KOG####_final.fasta. I am trying to write a script to copy all of the KOG####_final.fasta files to the same directory and then apply some filters to them.

For the filters, I want to go through each of the KOG####_final.fasta files and remove any of them that don't contain at least 10 different text strings that are specified in a text file or somewhere in the script. I'd also like to have a filter that removes files that have more than one instance of any one string.

I know this is a lot but I'm really stumped as to where to start on this one. Any assistance in getting started with this would be much appreciated!

Thanks!
Kevin
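One possible starting point is sketched below. It rests on a few assumptions: the destination directory sits outside the tree being searched, the strings to look for live in a plain text file with one string per line, and the function names `collect_fastas` and `count_distinct_hits` are made up for illustration. The first function uses `find` to copy every matching file into one directory; the second counts how many of the listed strings appear at least once in a given file.

```shell
#!/bin/sh
# Hypothetical helper: copy every KOG####_final.fasta found under SRC
# into DEST. Assumes DEST is not located inside SRC.
collect_fastas() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    # Match KOG followed by exactly four digits, then _final.fasta;
    # cp is run once per matching file.
    find "$src" -type f -name 'KOG[0-9][0-9][0-9][0-9]_final.fasta' \
        -exec cp {} "$dest"/ \;
}

# Hypothetical helper: print how many of the strings listed in LIST
# (one per line) occur at least once in FILE.
count_distinct_hits() {
    list=$1
    file=$2
    hits=0
    while IFS= read -r s; do
        # Skip blank lines in the list
        [ -n "$s" ] || continue
        # -F treats the string literally, -q suppresses output
        if grep -qF -- "$s" "$file"; then
            hits=$((hits + 1))
        fi
    done < "$list"
    echo "$hits"
}
```

A file could then be rejected with something like `[ "$(count_distinct_hits taxa.txt "$f")" -ge 10 ] || mv "$f" trash/`, where `taxa.txt` and `trash/` are placeholder names.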
Thanks for your help! That worked really well.

The KOG*_final.fasta files look like the example below. There is a one-line header that always begins with a greater-than sign, has a 3-4 letter species abbreviation, and a sequence identifier. The next line contains the corresponding amino-acid sequence, which is always on one line and doesn't wrap no matter how long it is.

>ACAL_12345
XESLGRQVPSELFEKLDYHK
>ACAL_19472
XESLGRQVPSEXFEKLDYHJ
>ACAL_19473
XESLEKDVPSELFEKLDYHJ
>CGIG_Contig2554
XESLGRQVPSQLFEKLDYHK
>CVIR_Contig1338
XESLGRQVPSELEEKLDYHK
>HROB_98421
XESLGRQVPSELFEKLDYEV
>IPAR_Contig854
QESLGRQVPSELFEKLDYHK
>LGIG_182182
PESLGRQVPSELFEKLDYHD
>MCAL_Contig3433
XESLGRQVPSELFEKLDYHG
>NVEC_166966
XESLGRQVPSELFEKLDYHK

I'm trying to write a script that will go through each of these files and check them to see if they meet certain criteria. For example, I want to move all files containing fewer than 10 greater-than signs (fewer than 10 sequences) into a "trash" folder. I've played around using if and grep -c \> for this part but I haven't figured it out yet. Is there a better way to go about this?

I'd also like to trash any files that have more than 1 sequence for any one species (although I'd like to be able to vary this number if it turns out that is too strict). Would I have to use an array for this? Or another file that specifies all of the taxon names?

Thanks!
Kevin

---------- Post updated 11-06-09 at 04:29 PM ---------- Previous update was 11-05-09 at 06:38 PM ----------

I figured out the first filter:
Code:
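cutoff=6    # minimum number of sequences a file must contain
mkdir -p ./rejected_few_seq
for FileName in *.fa
do
    # Count header lines; each sequence has exactly one '>' header
    sequences=$(grep -c '>' "$FileName")
    echo "$FileName $sequences"
    if [ "$sequences" -lt "$cutoff" ] ; then
        printf 'Too few sequences in file %s\n' "$FileName"
        mv "$FileName" ./rejected_few_seq/
    fi
done

I'm having trouble figuring out the other part. Here's what I've got so far:
Code: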
for FileName in *.fa
do
    grep -c ACAL_ "$FileName" >> taxon_count.txt
    grep -c HROB_ "$FileName" >> taxon_count.txt
    (...repeated for all species abbreviations)
    ?
done

I am trying to figure out how to add up all the values written to the taxon_count.txt file and remove $FileName if that total is smaller than a desired value. I'd also like to set a maximum number of sequences per taxon and, if that is exceeded, remove $FileName. Any guidance would be greatly appreciated.

Thanks,
Kevin
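One way to approach the per-taxon filter without an intermediate taxon_count.txt is to let awk tally the species abbreviations straight from the header lines. The sketch below assumes every header looks like `>ABBR_identifier`, so the abbreviation is whatever sits between `>` and the first underscore. The function names `taxon_counts` and `passes_taxon_filter`, the `rejected_taxa` directory, and the thresholds in the usage comment are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one "TAXON COUNT" line per species
# abbreviation found in the headers of FASTA file $1.
taxon_counts() {
    awk '/^>/ {
             sub(/^>/, "", $1)          # drop the leading ">"
             split($1, parts, "_")      # abbreviation precedes first "_"
             n[parts[1]]++
         }
         END { for (t in n) print t, n[t] }' "$1"
}

# Hypothetical helper: succeed (exit status 0) only if file $1 has at
# least $2 distinct taxa and no taxon with more than $3 sequences.
passes_taxon_filter() {
    taxon_counts "$1" | awk -v min="$2" -v max="$3" '
        { taxa++; if ($2 + 0 > max + 0) over = 1 }
        END { exit !(taxa + 0 >= min + 0 && !over) }'
}

# Possible use (directory name and thresholds are placeholders):
# mkdir -p ./rejected_taxa
# for FileName in *.fa; do
#     passes_taxon_filter "$FileName" 6 1 || mv "$FileName" ./rejected_taxa/
# done
```

Because the maximum per taxon is an argument, loosening the filter only means changing the last number in the call. No list of species abbreviations is needed, since the abbreviations are read from the headers themselves.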