Deleting files that don't contain particular text strings / more than one instance of a string
Post 302368846 by kmkocot on Friday 6th of November 2009 05:29:23 PM
Thanks for your help! That worked really well.

The KOG*_final.fasta files look like the example below. Each sequence has a one-line header that always begins with a greater-than sign, followed by a 3-4 letter species abbreviation, an underscore, and a sequence identifier. The next line contains the corresponding amino-acid sequence, which is always on one line and doesn't wrap no matter how long it is.

>ACAL_12345
XESLGRQVPSELFEKLDYHK
>ACAL_19472
XESLGRQVPSEXFEKLDYHJ
>ACAL_19473
XESLEKDVPSELFEKLDYHJ
>CGIG_Contig2554
XESLGRQVPSQLFEKLDYHK
>CVIR_Contig1338
XESLGRQVPSELEEKLDYHK
>HROB_98421
XESLGRQVPSELFEKLDYEV
>IPAR_Contig854
QESLGRQVPSELFEKLDYHK
>LGIG_182182
PESLGRQVPSELFEKLDYHD
>MCAL_Contig3433
XESLGRQVPSELFEKLDYHG
>NVEC_166966
XESLGRQVPSELFEKLDYHK
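
Given that layout, the species abbreviation can be read straight from the header lines. The following is a minimal sketch, not a tested solution for the real data: it assumes every header has the form ">ABBR_identifier", and the function name list_taxa is made up for illustration.

```shell
# list_taxa FILE: print "count abbreviation" for each species in FILE.
# Assumes every header looks like ">ABBR_identifier" (hypothetical helper).
list_taxa() {
    grep '^>' "$1" |             # keep header lines only
        cut -d_ -f1 |            # keep the text before the first underscore
        tr -d '>' |              # drop the leading greater-than sign
        sort | uniq -c |         # count identical abbreviations
        awk '{ print $1, $2 }'   # strip the padding uniq adds
}
```

Calling list_taxa on one of the files prints one line per species, which makes it easy to see at a glance which species occur more than once.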

I'm trying to write a script that will go through each of these files and check them to see if they meet certain criteria. For example, I want to move all files containing fewer than 10 greater-than signs (fewer than 10 sequences) into a "trash" folder. I've played around using if and grep -c \> for this part but I haven't figured it out yet. Is there a better way to go about this?

I'd also like to trash any files that have more than 1 sequence for any one species (although I'd like to be able to vary this number if it turns out that is too strict). Would I have to use an array for this? Or another file that specifies all of the taxon names?
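
One possible approach that needs neither a shell array nor a separate file of taxon names is to let awk count headers per species. This is only a sketch under the assumption that the abbreviation is everything before the first underscore in the header; the function names max_per_taxon and too_many_per_taxon are invented for this example.

```shell
# max_per_taxon FILE: print the largest number of sequences that any one
# species has in FILE (prints 0 if FILE has no header lines).
max_per_taxon() {
    awk -F_ '
        /^>/ { n = ++count[$1]; if (n > max) max = n }   # $1 is ">ABBR"
        END  { print max + 0 }
    ' "$1"
}

# too_many_per_taxon FILE LIMIT: succeed if any species exceeds LIMIT.
too_many_per_taxon() {
    [ "$(max_per_taxon "$1")" -gt "$2" ]
}
```

Inside a loop over the files, a test such as if too_many_per_taxon "$FileName" 1 would flag files with more than one sequence for any species, and the limit can be relaxed by changing the second argument.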

Thanks!
Kevin

---------- Post updated 11-06-09 at 04:29 PM ---------- Previous update was 11-05-09 at 06:38 PM ----------

I figured out the first filter:
Code:
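cutoff=6    # minimum number of sequences a file must contain
for FileName in *.fa
do
    # Count header lines, i.e. lines that start with ">"
    sequences=$(grep -c '^>' "$FileName")
    echo "$FileName $sequences"
    if [ "$sequences" -lt "$cutoff" ] ; then
        printf 'Too few sequences in file %s\n' "$FileName"
        mv "$FileName" ./rejected_few_seq/
    fi
done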

I'm having trouble figuring out the other part. Here's what I've got so far:
Code:
for FileName in *.fa
do
grep -c ACAL_ $FileName >> taxon_count.txt
grep -c HROB_ $FileName >> taxon_count.txt
(...repeated for all species abbreviations)
?
done

I am trying to figure out how to add up all the values written to taxon_count.txt and remove $FileName if that total is smaller than a desired value. I'd also like to set a maximum number of sequences per taxon and, if that is exceeded, remove $FileName. Any guidance would be greatly appreciated.
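
As a rough sketch of how the two checks might be combined without the intermediate taxon_count.txt file: the function below sums the per-taxon counts in a shell variable and stops early if any taxon exceeds the maximum. The name check_file and its argument order are assumptions made for this example, and the taxon abbreviations are passed in by the caller.

```shell
# check_file FILE MIN_TOTAL MAX_PER_TAXON TAXON...
# Succeeds (exit status 0) if FILE should be kept, fails otherwise.
# Hypothetical helper: name and argument order are illustrative only.
check_file() {
    file=$1 min_total=$2 max_per_taxon=$3
    shift 3
    total=0
    for taxon in "$@"; do
        # grep -c prints 0 but exits non-zero when nothing matches
        n=$(grep -c "^>${taxon}_" "$file") || true
        # Reject as soon as one taxon has too many sequences
        [ "$n" -le "$max_per_taxon" ] || return 1
        total=$((total + n))
    done
    # Keep the file only if enough sequences were found in total
    [ "$total" -ge "$min_total" ]
}
```

In the loop this could be used as check_file "$FileName" 6 1 ACAL HROB || mv "$FileName" ./rejected/ where the directory name is a placeholder and the list of abbreviations would be extended to cover every species.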

Thanks,
Kevin
 

Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.