Deleting files that don't contain particular text strings / more than one instance of a string
Hi all,
I have a directory containing many subdirectories each named like KOG#### where # represents any digit 0-9. There are several files in each KOG#### folder but the one I care about is named like KOG####_final.fasta. I am trying to write a script to copy all of the KOG####_final.fasta files to the same directory and then apply some filters to them.
For the filters, I want to go through each of the KOG####_final.fasta files and remove any of them that don't contain at least 10 different text strings that are specified in a text file or somewhere in the script. I'd also like to have a filter that removes files that have more than one instance of any one string.
I know this is a lot but I'm really stumped as to where to start on this one. Any assistance in getting started with this would be much appreciated!
The KOG*_final.fasta files look like the example below. There is a one-line header that always begins with a greater-than sign, has a 3-4 letter species abbreviation, and a sequence identifier. The next line contains the corresponding amino-acid sequence which is always on one line and doesn't wrap no matter how long it is.
I'm trying to write a script that will go through each of these files and check them to see if they meet certain criteria. For example, I want to move all files containing fewer than 10 greater-than signs (fewer than 10 sequences) into a "trash" folder. I've played around using if and grep -c \> for this part but I haven't figured it out yet. Is there a better way to go about this?
I'd also like to trash any files that have more than 1 sequence for any one species (although I'd like to be able to vary this number if it turns out that is too strict). Would I have to use an array for this? Or another file that specifies all of the taxon names?
Thanks!
Kevin
---------- Post updated 11-06-09 at 04:29 PM ---------- Previous update was 11-05-09 at 06:38 PM ----------
I figured out the first filter:
I'm having trouble figuring out the other part. Here's what I've got so far:
I am trying to figure out how to add all the values put into the taxon_count.txt file and remove $FileName if that value is smaller than a desired value. I'd also like to set a max value for number of sequences per taxon and if that is exceeded, remove #FileName. Any guidance would be greatly appreciated.
So I want to skim through all folders (ongoing from the curr dir) and delete all files that contain the string:
"in conflikt standing copy".
Is this possible WITH DOS ? (1 Reply)
Me and a friend are working on a project, and We have to create a script that can go into a file, and replace all occurances of a certain expression/word/letter with another using Sed. It is designed to go through multiple tests replacing all these occurances, and we don't know what they will be so... (1 Reply)
I need to be able to search for a beginning line header, then use grep or something else to get the very next instance of a particular string, which will ALWAYS be in "Line5". What I have is some data that appears like this:
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line1
Line2
...... (4 Replies)
Hi all,
I am still learning my way around unix commands and I have the following question.
I have a website and I want to search for all the html pages that don't contain a certain js file. The file I am searching for is located under /topfolder/js/rules.js . So I assume in my grep search I... (5 Replies)
Hi all
I have two files X.txt and Y.txt. The file format of X.txt is :
madras is also the fountainhead of the theosophical movement which spread worldwide .
and second file Y.txt is of the format:
madra|s|nsubj is|cop also|advmod the|det fountainhead|empty of|prep the|det... (3 Replies)
Hello!
I need to delete one line in a file which matches one very precise instance of a string only. When searching the forum I unfortunately only found a solution which would delete each line on which a particular string occurs.
Let's assume I have a file composed of thousands of lines... (4 Replies)
There are a lot of ways to extract text from between two strings, but what if those strings occur multiple times and you only want the text from the first two strings? I can't seem to find anything to work here. I'm using sed to process the text after it's extracted, so I prefer a sed answer, but... (4 Replies)
I have a directory full of text data files.
Unfortunately I need to get rid of the 7th and 8th line from them all so that I can input them into a GIS application.
I've used an awk script to do one at a time but due to the sheer number of files I need some kind of loop mechanism to automate... (3 Replies)
I have a directory with permissions set 777, and some gumby has dumped a bunch of files and directories in there.
I don't own the culprit files or directories, but do own the containing directory - Is there some way I can delete this other user's files?
The other interesting thing is that... (5 Replies)