New enough versions of awk (gawk, for example) can conveniently be told to treat a line of "//" as the record separator, which makes it just a matter of finding the "ACC" field and using the field after it as the name of the file to print the record into.
Code:
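$ cat hmmer.awk
# Records end with "//" on a line by itself; put the same terminator back on output.
BEGIN { RS="//\n"; ORS="//\n" }
{
        # Look for the ACC field; the field after it is the output file name.
        # Stop at NF-1 so that $(N+1) always exists.
        for(N=1; N<NF; N++)
                if($N == "ACC")
                {
                        OUT=$(N+1);
                        printf("Send this record to %s\n", OUT);
                        print > OUT;
                        close(OUT);
                        break;
                }
}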
$ awk -f hmmer.awk data
Send this record to PF10417.4
Send this record to PF12574.3
$ cat PF10417.4
HMMER3/b [3.0 | March 2010]
NAME 1-cysPrx_C
ACC PF10417.4
DESC C-terminal domain of 1-Cys peroxiredoxin
LENG 40
ALPH amino
RF no
CS yes
MAP yes
.....more data...
0.00103 6.88015 * 0.61958 0.77255 0.00000 *
//
$ cat PF12574.3
HMMER3/b [3.0 | March 2010]
NAME 120_Rick_ant
ACC PF12574.3
DESC 120 KDa Rickettsia surface antigen
LENG 255
ALPH amino
RF no
CS no
MAP yes
DATE Tue Sep 27 11:43:56 2011
NSEQ 7
//
$
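If your awk is not new enough to accept a multi-character RS (POSIX leaves the result unspecified), a line-at-a-time version is one way around it. The following is only a minimal sketch, assuming the ACC tag always starts its own line and the terminator line is exactly "//"; the function name split_hmm is made up for illustration.

```shell
# Portable sketch: split HMMER records without a multi-character RS.
# split_hmm is a hypothetical helper name, not part of any tool.
split_hmm() {
    awk '
    # Collect every line of the current record, terminator included.
    { buf = buf $0 "\n" }

    # Remember the first accession seen in this record as the file name.
    $1 == "ACC" && out == "" { out = $2 }

    # A line of exactly "//" ends the record: write it out and start over.
    # Records with no ACC line are skipped; a trailing record that lacks
    # the "//" terminator is never written.
    $0 == "//" {
        if (out != "") {
            printf "%s", buf > out
            close(out)
        }
        buf = ""
        out = ""
    }
    ' "$@"
}
```

Called as `split_hmm data`, it should leave one file per accession in the current directory, the same as hmmer.awk above but without the progress messages.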