Splitting large file and renaming based on field Post: 302637161

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Splitting a large log file

Okay, absolute newbie here... I'm on a Mac trying to split an almost 2 Gig log file on a Unix box into manageable chunks for my web-based log analysis tool. What do I need to do, what programs do I need to do it? All and any help appreciated/needed :-) Cheers

2. Shell Programming and Scripting

split large file based on field criteria

I have a file containing date/time sorted data of the form ... 2009/06/10,20:59:59.950,XAG/USD,Q,1,1115, 14.3025,100,1,1 2009/06/10,20:59:59.950,XAG/USD,Q,1,1116, 14.3026,125,1,1 2009/06/10,20:59:59.950,XAG/USD,R,0,0, , 0,0,0 2009/06/10,20:59:59.950,XAG/USD,R,1,0, 14.1910,100,1,1...

3. Shell Programming and Scripting

awk - splitting 1 large file into multiple based on same key records

Hello gurus, I am new to "awk" and am trying to break a large file of 4 million records into several output files of half a million records each, while keeping records with the same key in the same output file rather than spread across files. E.g. my data is like: Row_Num,...

4. Shell Programming and Scripting

Splitting large file into multiple files in unix based on pattern

I need to write a shell script for below scenario My input file has data in format: qwerty0101TWE 12345 01022005 01022005 datainala alanfernanded 26 qwerty0101mXZ 12349 01022005 06022008 datainalb johngalilo 28 qwerty0101TWE 12342 01022005 07022009 datainalc hitalbert 43 qwerty0101CFG 12345...

5. Shell Programming and Scripting

Problem with splitting large file based on pattern

Hi Experts, I have to split huge file based on the pattern to create smaller files. The pattern which is expected in the file is: Master..... First... second.... second... third.. third... Master... First.. second... third... Master... First... second.. second.. second.....

6. Shell Programming and Scripting

Sed: Splitting A large File into smaller files based on recursive Regular Expression match

I will simplify the explanation a bit. I need to parse through an 87m file. I have a single text file in the form of: <NAME>house........ SOMETEXT SOMETEXT SOMETEXT . . . . </script> MORETEXT MORETEXT . . .

7. UNIX for Dummies Questions & Answers

[Solved] File Splitting And Renaming Problem

OK, so I recently bought a whatbox seed-box account and I am connected to whatbox via SSH. I have downloaded a movie and renamed it to 2yify.mp4 (800MB). When I type the command to split it, which is: split -b 400m 2yify.mp4 it gets renamed into two parts with different names...

8. Shell Programming and Scripting

Help with Splitting a Large XML file based on size AND tags

Hi All, This is my first post here. Hoping to share and gain knowledge from this great forum !!!! I've scanned this forum before posting my problem here, but I'm afraid I couldn't find any thread that addresses this exact problem. I'm trying to split a large XML file (with multiple tag...

9. Shell Programming and Scripting

Splitting file into multiple files and renaming them

Hi all, newbie here. First of all, sorry if I have broken any rules in posting this question; correct me if I am wrong. I have a .dat file whose name is in the format 20170311_abc_xyz.dat. The file consists of records whose first column consists of multiple dates in...

10. UNIX for Beginners Questions & Answers

Splitting the XML file and renaming the files

Hello Gurus, I have a requirement to split the xml file into different xml files. Can you please help me with that? Here is my Source XML file <jms-system-resource> <name>PS6SOAJMSModule</name> <target>soa_server1</target> <sub-deployment> ...
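Several of the threads listed above ask for the same basic operation: write each record to an output file named after one of its fields. The following is only a minimal sketch of that idea, not a solution taken from any of those threads. It assumes comma-separated input, that the key is the first field, and that key values are safe to use in file names (no slashes). The function name `split_by_field` and the `.dat` suffix are hypothetical choices made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split a comma-separated file into one output file
# per distinct value of field 1, named <prefix><key>.dat.
# Assumptions: field 1 is the key and contains no '/' characters.
split_by_field() {
    input=$1
    prefix=$2
    awk -F',' -v prefix="$prefix" '
        {
            out = prefix $1 ".dat"
            print >> out     # append the whole record to the file for its key
            close(out)       # keep only one file open at a time (slow but safe with many keys)
        }
    ' "$input"
}
```

A call such as `split_by_field input.dat out_` would then produce files like `out_20170311.dat`. Because the sketch appends, output files left over from an earlier run should be removed before running it again.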

LEARN ABOUT DEBIAN

psi-cd-hit-2d-g1

PSI-CD-HIT-2D-G1.PL(1)						   User Commands					    PSI-CD-HIT-2D-G1.PL(1)

NAME

       psi-cd-hit-2d-g1.pl - runs an algorithm similar to CD-HIT, but uses BLAST to calculate similarities in db1 or db2 format

DESCRIPTION

       Usage psi-cd-hit-2d [Options]

       Options

       -i     in_dbname, required

       -o     out_dbname, required

       -c     clustering threshold (sequence identity), default 0.3

       -ce    clustering threshold (blast expect), default -1

	      By default (-1) the expect threshold is not used. With a positive value, the program clusters seqs if their similarity meets either
	      the identity threshold or the expect threshold.

       -L     coverage of shorter sequence ( aligned / full), default 0.0

       -M     coverage of longer sequence ( aligned / full), default 0.0

       -R     (1/0) use psi-blast profile (perform a psi-blast / pdb-blast type search)? default 0

       -G     (1/0) use global identity? default 1. Sequence identity is calculated as

	      total identical residues of local alignments / length of shorter seq

	      If you prefer to use -G 0, it is suggested that you also use -L, such as -L 0.8, to prevent very short matches.

       -d     length of description line in the .clstr file, default 30; if set to 0, it takes the fasta defline and stops at the first space

       -l     length_of_throw_away_sequences, default 10

       -p     profile search para, default

	      "-a 2 -d nr80 -j 3 -F F -e 0.001 -b 500 -v 500"

       -bfdb profile database, default nr80

       -s     blast search para, default

	      "-F F -e 0.000001 -b 100000 -v 100000"

       -be blast expect cutoff, default 0.000001

       -b     filename of a list of hosts on which to run this program in parallel with ssh calls; you need to provide the list of hosts

       -pbs   number of jobs to send each time by the PBS queuing system

	      You cannot use both ssh and pbs at the same time.

       -k (1/0) keep blast raw output file, default 1

       -rs steps of save restart file and clustering output, default 5000

	      With the default, each time 5000 sequences have been processed, the program writes a restart file and the current clustering information.

       -restart restart file; reads in a restart file

	      If the program crashes, is stopped, or is terminated, you can restart it by adding the option "-restart sth.restart".

       -rf steps of re format blast database, default 200,000

	      With the default, once the program has clustered 200,000 seqs, it removes them from the seq pool and reformats the blast db to save time.

       -local dir of local blast db

	      When run in parallel with ssh (not pbs), the program can copy the blast dbs to local drives on each node to save blast db reading time,
	      BUT THIS MAY NOT BE FASTER.

       -J     job, job_file; executes specific jobs such as parsing blast output only. DON'T use it; it is only used by this program itself

       -single file of ids that you know are singletons

	      so the program won't run them as queries

       -i2 second input database

       -blastn run blastn, default 0

       -lo how much longer a seq in db2 can be than the seqs in db1 in a cluster, default 0

	      0 means that a seq in db2 should be <= the seqs in db1 in a cluster

	      ============================== by Weizhong Li, liwz@sdsc.edu ==============================

	      If you find cd-hit useful, please kindly cite:

	      "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam
	      Godzik, Bioinformatics, (2001) 17:282-283

	      "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik,
	      Bioinformatics, (2006) 22:1658-1659

psi-cd-hit-2d-g1.pl 4.6-2012-04-25				    April 2012						    PSI-CD-HIT-2D-G1.PL(1)
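
The -G 1 identity in the man page above is a simple ratio: total identical residues of the local alignments divided by the length of the shorter sequence. As a rough illustration of that arithmetic only (this is not code from psi-cd-hit, and the function name `global_identity` is hypothetical), the ratio can be computed like this:

```shell
#!/bin/sh
# Hypothetical helper illustrating the -G 1 formula from the man page:
#   identity = identical residues in local alignments / length of shorter seq
# Prints the ratio with two decimals; fails if the length is not positive.
global_identity() {
    identical=$1
    shorter_len=$2
    awk -v n="$identical" -v len="$shorter_len" 'BEGIN {
        if (len <= 0) { exit 1 }      # guard against division by zero
        printf "%.2f\n", n / len
    }'
}
```

For example, `global_identity 45 60` prints 0.75, which would be above the default clustering threshold of 0.3 given for -c.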
