Deleting files that don't contain particular text strings / more than one instance of a string

# 1  
Old 11-04-2009

Hi all,

I have a directory containing many subdirectories each named like KOG#### where # represents any digit 0-9. There are several files in each KOG#### folder but the one I care about is named like KOG####_final.fasta. I am trying to write a script to copy all of the KOG####_final.fasta files to the same directory and then apply some filters to them.

For the filters, I want to go through each of the KOG####_final.fasta files and remove any of them that don't contain at least 10 different text strings that are specified in a text file or somewhere in the script. I'd also like to have a filter that removes files that have more than one instance of any one string.

I know this is a lot but I'm really stumped as to where to start on this one. Any assistance in getting started with this would be much appreciated!

# 2  
Old 11-04-2009
To copy the files, you can use the command below:

find KOG* -type f -name "KOG*_final.fasta" -exec cp {} /tmp \;

I don't understand the filter, though. Could you paste a sample KOG*_final.fasta file and show us the output you expect?
# 3  
Old 11-06-2009
Thanks for your help! That worked really well.

The KOG*_final.fasta files look like the example below. There is a one-line header that always begins with a greater-than sign, has a 3-4 letter species abbreviation, and a sequence identifier. The next line contains the corresponding amino-acid sequence which is always on one line and doesn't wrap no matter how long it is.


I'm trying to write a script that will go through each of these files and check them to see if they meet certain criteria. For example, I want to move all files containing fewer than 10 greater-than signs (fewer than 10 sequences) into a "trash" folder. I've played around using if and grep -c \> for this part but I haven't figured it out yet. Is there a better way to go about this?

I'd also like to trash any files that have more than 1 sequence for any one species (although I'd like to be able to vary this number if it turns out that is too strict). Would I have to use an array for this? Or another file that specifies all of the taxon names?


---------- Post updated 11-06-09 at 04:29 PM ---------- Previous update was 11-05-09 at 06:38 PM ----------

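I figured out the first filter:
cutoff=10   # minimum number of sequences a file must contain
for FileName in *.fa; do
    # count header lines, i.e. lines that start with ">"
    sequences=$(grep -c '^>' "$FileName")
    echo "$FileName $sequences"
    if [ "$sequences" -lt "$cutoff" ]; then
        printf 'Too few sequences in file %s\n' "$FileName"
        mv "$FileName" ./rejected_few_seq/
    fi
done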

I'm having trouble figuring out the other part. Here's what I've got so far:
for FileName in *.fa; do
    grep -c ACAL_ "$FileName" >> taxon_count.txt
    grep -c HROB_ "$FileName" >> taxon_count.txt
    (...repeated for all species abbreviations)
done

I am trying to figure out how to add up all the values written to taxon_count.txt and remove $FileName if the total is smaller than a desired value. I'd also like to set a maximum number of sequences per taxon and remove $FileName if that is exceeded. Any guidance would be greatly appreciated.
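
One way to do the summing described above without the intermediate taxon_count.txt file is to keep a running total in a shell variable. The following is only a minimal sketch: the function name passes_taxon_filter, the taxa list, and the two limit arguments are hypothetical, and it assumes every header starts with ">" followed directly by the species abbreviation and an underscore.

```shell
#!/bin/sh
# Hypothetical list of species abbreviations; extend it with the rest.
taxa="ACAL HROB"

# passes_taxon_filter FILE MIN_TOTAL MAX_PER_TAXON
# Returns 0 if FILE has at least MIN_TOTAL headers across the listed taxa
# and no single taxon has more than MAX_PER_TAXON headers; 1 otherwise.
passes_taxon_filter() {
    file=$1
    min_total=$2
    max_per_taxon=$3
    total=0
    for taxon in $taxa; do
        # grep -c prints 0 but exits non-zero when nothing matches,
        # so "|| true" keeps the count without signalling an error.
        n=$(grep -c "^>${taxon}_" "$file" || true)
        if [ "$n" -gt "$max_per_taxon" ]; then
            return 1
        fi
        total=$((total + n))
    done
    [ "$total" -ge "$min_total" ]
}
```

Inside the existing loop, the file could then be moved whenever `passes_taxon_filter "$FileName" 10 1` returns non-zero. Both limits are ordinary arguments, so the per-taxon maximum can be relaxed without editing the function.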

# 4  
Old 11-10-2009
Is this what you need?

awk -F_ '/^>/ { print $1 }' "$FileName" | sort | uniq -c | sort -n
      1 >CGIG
      1 >CVIR
      1 >HROB
      1 >IPAR
      1 >LGIG
      1 >MCAL
      1 >NVEC
      3 >ACAL
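
The same per-species tally can be reduced to single numbers that a script can test, without listing the species names anywhere. This is a hedged sketch rather than a finished filter: the function names max_per_species and species_count are made up for illustration, and it assumes the header layout shown above (">", species abbreviation, underscore).

```shell
#!/bin/sh
# max_per_species FILE
# Prints the largest number of sequences that any one species has in FILE.
max_per_species() {
    awk -F_ '/^>/ { n[$1]++ }
             END  { m = 0; for (s in n) if (n[s] > m) m = n[s]; print m }' "$1"
}

# species_count FILE
# Prints how many different species appear in the headers of FILE.
species_count() {
    awk -F_ '/^>/ { n[$1]++ }
             END  { c = 0; for (s in n) c++; print c }' "$1"
}
```

In the loop from the earlier posts, a file could be moved to the trash folder when `max_per_species "$FileName"` is greater than the allowed maximum, or when `species_count "$FileName"` is below the required number of species.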
