Post 302453785 by Lucky Ali on Thursday 16th of September 2010 11:40:36 AM
Removing a portion of data in a file

Hi,
I have a folder that contains many files:

1.fasta
2.fasta
3.fasta
4.fasta
5.fasta
...
(hundreds of files)

Each file has data in the following format, for example (vi 1.fasta):


Code:
>AB_1
MLKKPIIIGVTGGSGGGKTSVSRAILDSFPNARIAMIQHDSYYKDQSHMSFEERVKTNYDHPLAFDTDFM
IQQLKELLAGRPVDIPIYDYKKHTRSNTTFRQDPQDVIIVEGILVLEDERLRDLMDIKLFVDTDDDIRII
RRIKRDMMERGRSLESIIDQYTSVVKPMYHQFIEPSKRYADIVIPEGVSNVVAIDVINSKIASILGEV
>AB_2
MRARLIYNPTSGQELMRKSVPEVLDILEGFGYETSAFQTTAKKNSALNEARRAAKAGFDLLIAAGGDGTI
NEVVNGIAPLKKRPKMAIIPTGTTNDFARALKVPRGNPSQAAKLIGKNQTIQMDIGRAKKDTYFINIAAA
GSLTELTYSVPSQLKTMFGYLAYLAKGVELLPRVSNVPVKITHDKGVFEGQVSMIFAAITNSVGGFEMIA
PDAKLDDGMFTLILIKTANLFEIVHLLRLILDGGKHITDRRVEYIKTSKIVIEPQCGKRMMINLDGEYGG
DAPITLENLKNHITFFADTDLISDDALVLDQDELEIEEIVKKFAHEVEDLEQELEE
>AB_3
MTGYDDFNYALSALKLGADDYLLKPFSKADVEDMLGKLRKKLELSKKTETIQELVEQPQKEVSAIAMAIH
ERLADSDLTLKSLAQQLGFSPNYLSVLIKKELGMPFQDYLVQERLKKAKLFLLTSNLKIYEIAEQVGFED
MNYFSQRFKQLVGVTPSQYKKGGQA
>AB_4 
MRARLIYNPTSGQELMRKSVPEVLDILEGFGYETSAFQTTAKKNSALNEARRAAKAGFDLLIAAGGDGTI
NEVVNGIAPLKKRPKMAIIPTGTTNDFARALKVPRGNPSQAAKLIGKNQTIQMDIGRAKKDTYFINIAAA
GSLTELTYSVPSQLKTMFGYLAYLAKGVELLPRVSNVPVKITHDKGVFEGQVSMIFAAITNSVGGFEMIA
PDAKLDDGMFTLILIKTANLFEIVHLLRLILDGGKHITDRRVEYIKTSKIVIEPQCGKRMMINLDGEYGG
DAPITLENLKNHITFFADTDLISDDALVLDQDELEIEEIVKKFAHEVEDLEQELEE
>AB_5  
MRARLIYNPTSGQELMRKSVPEVLDILEGFGYETSAFQTTAKKNSALNEARRAAKAGFDLLIAAGGDGTI
NEVVNGIAPLKKRPKMAIIPTGTTNDFARALKVPRGNPSQAAKLIGKNQTIQMDIGRAKKDTYFINIAAA
GSLTELTYSVPSQLKTMFGYLAYLAKGVELLPRVSNVPVKITHDKGVFEGQVSMIFAAITNSVGGFEMIA
PDAKLDDGMFTLILIKTANLFEIVHLLRLILDGGKHITDRRVEYIKTSKIVIEPQCGKRMMINLDGEYGG
DAPITLENLKNHITFFADTDLISDDALVLDQDELEIEEIVKKFAHEVEDLEQELEE

I would like to edit these files in such a way that the record under
>AB_1 is removed (including the header), giving an output file like:

Code:
>AB_2
MRARLIYNPTSGQELMRKSVPEVLDILEGFGYETSAFQTTAKKNSALNEARRAAKAGFDLLIAAGGDGTI
NEVVNGIAPLKKRPKMAIIPTGTTNDFARALKVPRGNPSQAAKLIGKNQTIQMDIGRAKKDTYFINIAAA
GSLTELTYSVPSQLKTMFGYLAYLAKGVELLPRVSNVPVKITHDKGVFEGQVSMIFAAITNSVGGFEMIA
PDAKLDDGMFTLILIKTANLFEIVHLLRLILDGGKHITDRRVEYIKTSKIVIEPQCGKRMMINLDGEYGG
DAPITLENLKNHITFFADTDLISDDALVLDQDELEIEEIVKKFAHEVEDLEQELEE
>AB_3
MTGYDDFNYALSALKLGADDYLLKPFSKADVEDMLGKLRKKLELSKKTETIQELVEQPQKEVSAIAMAIH
ERLADSDLTLKSLAQQLGFSPNYLSVLIKKELGMPFQDYLVQERLKKAKLFLLTSNLKIYEIAEQVGFED
MNYFSQRFKQLVGVTPSQYKKGGQA
>AB_4 
MRARLIYNPTSGQELMRKSVPEVLDILEGFGYETSAFQTTAKKNSALNEARRAAKAGFDLLIAAGGDGTI
NEVVNGIAPLKKRPKMAIIPTGTTNDFARALKVPRGNPSQAAKLIGKNQTIQMDIGRAKKDTYFINIAAA
GSLTELTYSVPSQLKTMFGYLAYLAKGVELLPRVSNVPVKITHDKGVFEGQVSMIFAAITNSVGGFEMIA
PDAKLDDGMFTLILIKTANLFEIVHLLRLILDGGKHITDRRVEYIKTSKIVIEPQCGKRMMINLDGEYGG
DAPITLENLKNHITFFADTDLISDDALVLDQDELEIEEIVKKFAHEVEDLEQELEE
>AB_5  
MRARLIYNPTSGQELMRKSVPEVLDILEGFGYETSAFQTTAKKNSALNEARRAAKAGFDLLIAAGGDGTI
NEVVNGIAPLKKRPKMAIIPTGTTNDFARALKVPRGNPSQAAKLIGKNQTIQMDIGRAKKDTYFINIAAA
GSLTELTYSVPSQLKTMFGYLAYLAKGVELLPRVSNVPVKITHDKGVFEGQVSMIFAAITNSVGGFEMIA
PDAKLDDGMFTLILIKTANLFEIVHLLRLILDGGKHITDRRVEYIKTSKIVIEPQCGKRMMINLDGEYGG
DAPITLENLKNHITFFADTDLISDDALVLDQDELEIEEIVKKFAHEVEDLEQELEE

Likewise, I would like to do this for all the files in the folder.
Please let me know the best way to do it in awk or sed.
LA
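
One possible way to do this in awk is sketched below. It is not a tested solution for the real data. It assumes that every record starts with a line beginning with ">" and that the ID (AB_1 here) is the first whitespace-delimited word on that line. The function names drop_record and trim_all, and the output directory passed to trim_all, are made up for this illustration. The originals are left untouched; trimmed copies are written to a separate directory.

```shell
#!/bin/sh
# Print a FASTA file ($1) without the record whose header ID matches $2.
# Assumption: the ID is the first whitespace-delimited word, directly after ">".
drop_record() {
    awk -v id="$2" '
        /^>/ { skip = ($1 == (">" id)) }   # at each header, decide whether to skip this record
        !skip                              # print every line of a record that is not skipped
    ' "$1"
}

# Write a trimmed copy of every .fasta file in the current directory
# into the directory named by $1 (hypothetical name, e.g. "trimmed").
trim_all() {
    mkdir -p "$1" || return 1
    for f in *.fasta; do
        [ -e "$f" ] || continue            # the glob matched nothing
        drop_record "$f" AB_1 > "$1/$f"
    done
}
```

After defining the functions, run trim_all trimmed from inside the folder and compare a few of the files in trimmed/ with the originals before replacing anything. Because the header is matched by ID rather than by position, a file that has no >AB_1 record is copied unchanged.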
 
