extracting data


 
# 1  
Old 07-07-2011
extracting data

I have a txt file of the following format
Code:
>ab_
qwerty
>rt_
hfjkil
>Ty2
hglashglkasghkf;
>P2
aklhfklflkkgfgkfl
>ui_
vnllkdskkkffkfkkf
>we32
vksksjksj;lslsf'sk's's
....

.....

I want to split this big file based on the headers (the lines starting with >).

Suppose I want to extract the data under, say, Ty2 and we32 into a new file,
so that the resulting file is:
Code:
>Ty2
hglashglkasghkf;
>we32
vksksjksj;lslsf'sk's's

Please let me know the best way to do this in awk or sed just by specifying the headers I want.
# 2  
Old 07-07-2011
Code:
awk '/>Ty2/{p=1}/>we32/{c=3}!--c{exit}p' file

# 3  
Old 07-07-2011
It didn't work. I have multiple lines under each header, so the awk code needs to take that into account.
# 4  
Old 07-07-2011
Sorry, I misread the question:
Code:
awk '/>Ty2/ || />we32/{c=3}--c>0' file > newfile
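
Since a header may be followed by several lines, a flag that is set on a wanted header and cleared on any other header copes with records of any length. This is only a minimal sketch, assuming every header starts with `>` at the beginning of the line and is matched exactly (no trailing spaces). The function name `extract_gt` and the awk variable `want` are made up for this illustration.

```shell
# extract_gt FILE NAME... : print the records whose ">" header is one of NAME
extract_gt() {
    file=$1
    shift
    awk -v want="$*" '
        BEGIN {
            # Build a lookup table of the requested headers, e.g. ">Ty2".
            n = split(want, names, " ")
            for (i = 1; i <= n; i++) keep[">" names[i]] = 1
        }
        # On every header line, decide whether the record that follows is wanted.
        /^>/ { p = ($0 in keep) }
        # Print the header and every line below it while the flag is set.
        p
    ' "$file"
}
```

It could be called as `extract_gt file Ty2 we32 > newfile`. Because the header is looked up as a whole line, `Ty2` would not also select a header such as `>Ty23`.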

# 5  
Old 07-07-2011
Sorry, it didn't work as needed.
Let me give you some of the actual data I am dealing with:
Code:
  58    390
BA1_
GTATACATTATTGATGAAGTCCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATCATTTCGCGTTGCCAACGTTTCGAATTTCGAAAAATATCAGTAAAT
GATATTGTTGAGAGATTGTCGACGGTTGTGACTAATGAAGGTACGCAAGTAGAAGATGAG
GCGTTACAAATTGTTGCGCGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAGGCGATATCTTATAGTGATGAGAGGGTTACGACAGAAGATGTATTAGCTGTAACG
GGTCGTGATATGTTCCGTATGTTAAGTGAA
BA
GTATACATTATTGATGAAGTCCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATCATTTCGCGTTGCCAACGTTTCGAATTTCGAAAAATATCAGTAAAT
GATATTGTTGAGAGATTGTCCACGGTTGTGACTAATGAAGGTACGCAAGTAGAAGATGAG
GCTTTACAAATTGTTGCGCGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAAGCGATATCTTATAGTGATGAGAGGGTTACGACAGAAGATGTATTAGCTGTAACG
GGTCGTGATATGTTCCGTATGTTAAGTGAA
BC23_
GTATACATTATTGATGAAGTTCACATGCTTTCTATGGGTGCATTCAATGCGCTTTTAAAA
ACCTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCTCATAAG
ATCCCACCTACAATCATTTCACGTTGTCAGCGCTTTGAATTCCGAAAAATATCAGTGAAT
GATATTGTTGAGAGATTATCAACGGTCGTGACAAATGAAGGTACGCAAGTGGAAGGTGAA
GCATTACAAATTGTTGCGCGTGCTGCCGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAGGCTATATCTTATAGTGATGAGATTGTTACGACAGAAGATGTATTGGCCGTAACA
GGACGTGATATGTTCCGTAAGTTGAGTGAA
BC
GTATACATTATTGATGAAGTTCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAACCGCCAGGACATGTCATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATTATTTCGCGTTGCCAACGTTTCGAATTTCGAAAGATATCAGTAAAT
GATATTGTTGAGAGATTATCGACAGTTGTAAACAATGAAGGTACGCAAGTAGAAGATGAA
GCGTTACAAATCGTTGCACGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAGGCAATATCTTATAGTGATGAGACTGTTACGACAGAAGATGTATTAGCTGTAACA
GGGCGTGATATGTTCCGAATGTTAAGTGAA
B8_
GTATACATTATTGATGAAGTCCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATCATTTCGCGTTGCCAACGTTTCGAATTTCGAAAAATATCAGTAAAT
GATATTGTTGAGAGATTGTCGACGGTTGTGACTAATGAAGGTACGCAAGTAGAAGATGAG
GCGTTACAAATTGTTGCGCGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAAGCGATATCTTATAGTGATGAGAGGGTTACGACAGAAGATGTATTAGCTGTAACG
GGTCGTGATATGTTCCGTATGTTAAGTGAA

I want to extract the data under BA and B8_ (or any such pairwise combination) and write it to a new file (below) in the same format, keeping the first line exactly as it is in the original file (same spacing and all):
Code:
  58    390
BA
GTATACATTATTGATGAAGTCCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATCATTTCGCGTTGCCAACGTTTCGAATTTCGAAAAATATCAGTAAAT
GATATTGTTGAGAGATTGTCCACGGTTGTGACTAATGAAGGTACGCAAGTAGAAGATGAG
GCTTTACAAATTGTTGCGCGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAAGCGATATCTTATAGTGATGAGAGGGTTACGACAGAAGATGTATTAGCTGTAACG
GGTCGTGATATGTTCCGTATGTTAAGTGAA
B8_
GTATACATTATTGATGAAGTCCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATCATTTCGCGTTGCCAACGTTTCGAATTTCGAAAAATATCAGTAAAT
GATATTGTTGAGAGATTGTCGACGGTTGTGACTAATGAAGGTACGCAAGTAGAAGATGAG
GCGTTACAAATTGTTGCGCGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAAGCGATATCTTATAGTGATGAGAGGGTTACGACAGAAGATGTATTAGCTGTAACG
GGTCGTGATATGTTCCGTATGTTAAGTGAA

Is there a good way to do this in awk or sed?
# 6  
Old 07-07-2011
Hi,

I have not tested Franklin52's solution, but there is a subtle difference between your first post and your last one: in the first, each header begins with '>', but not in the last.

I will give it a try, but when parsing your file, how can I know where each header begins or ends? I am assuming each header is shorter than 20 characters while the data lines are longer than that, but I may be wrong.

Code:
$ cat infile
(data of your last post)
$ cat script.pl
use warnings;
use strict;
use constant HEADER_LINE_LENGTH => 20;

die "Usage: perl $0 <input-file> <output-file> <headers>\n" unless @ARGV > 2;

my $infile = shift;
my $outfile = shift;
my %header = map { $_ => 1 } @ARGV;

open my $fh, "<", $infile or die "Cannot open file $infile: $!\n";
open my $ofh, ">", $outfile or die "Cannot open file $outfile: $!\n";

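while ( my $line = <$fh> ) {
        chomp $line;
        # Flip-flop: true from a requested header until the next short (header-like) line.
        if ( my $flip = ( exists $header{ $line } ... length( $line ) < HEADER_LINE_LENGTH ) ) {
                if ( $flip =~ /E/ ) {
                        # The range ended on this line, which may itself be a
                        # requested header, so examine the same line again.
                        redo;
                } else {
                        printf $ofh "%s\n", $line;
                }
        }
}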

close $fh or warn "Cannot close $infile: $!\n";
close $ofh or warn "Cannot close $outfile: $!\n";
$ perl script.pl
Usage: perl script.pl <input-file> <output-file> <headers>
$ perl script.pl infile outfile BA BC BC23_
$ cat outfile
BA
GTATACATTATTGATGAAGTCCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATCATTTCGCGTTGCCAACGTTTCGAATTTCGAAAAATATCAGTAAAT
GATATTGTTGAGAGATTGTCCACGGTTGTGACTAATGAAGGTACGCAAGTAGAAGATGAG
GCTTTACAAATTGTTGCGCGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAAGCGATATCTTATAGTGATGAGAGGGTTACGACAGAAGATGTATTAGCTGTAACG
GGTCGTGATATGTTCCGTATGTTAAGTGAA
BC23_
GTATACATTATTGATGAAGTTCACATGCTTTCTATGGGTGCATTCAATGCGCTTTTAAAA
ACCTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCTCATAAG
ATCCCACCTACAATCATTTCACGTTGTCAGCGCTTTGAATTCCGAAAAATATCAGTGAAT
GATATTGTTGAGAGATTATCAACGGTCGTGACAAATGAAGGTACGCAAGTGGAAGGTGAA
GCATTACAAATTGTTGCGCGTGCTGCCGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAGGCTATATCTTATAGTGATGAGATTGTTACGACAGAAGATGTATTGGCCGTAACA
GGACGTGATATGTTCCGTAAGTTGAGTGAA
BC
GTATACATTATTGATGAAGTTCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAACCGCCAGGACATGTCATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATTATTTCGCGTTGCCAACGTTTCGAATTTCGAAAGATATCAGTAAAT
GATATTGTTGAGAGATTATCGACAGTTGTAAACAATGAAGGTACGCAAGTAGAAGATGAA
GCGTTACAAATCGTTGCACGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAGGCAATATCTTATAGTGATGAGACTGTTACGACAGAAGATGTATTAGCTGTAACA
GGGCGTGATATGTTCCGAATGTTAAGTGAA

Regards,
Birei
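
The same idea can also be written in awk, as the original question asked. This is only a sketch under the assumption stated above: any line shorter than 20 characters is a header and every data line is at least that long (the 30-character final line of each record in the sample satisfies this, but a shorter final line would be mistaken for a header). It also copies the first line of the file unchanged, which the Perl script does not do. The function name `extract_short` and the awk variables `want` and `max` are illustrative and do not come from the thread.

```shell
# extract_short FILE NAME... : keep line 1, then the records headed by NAME
extract_short() {
    file=$1
    shift
    awk -v want="$*" -v max=20 '
        BEGIN {
            # Build a lookup table of the requested header names.
            n = split(want, names, " ")
            for (i = 1; i <= n; i++) keep[names[i]] = 1
        }
        # Copy the first line (the "  58    390" line) exactly as it is.
        NR == 1 { print; next }
        # Assumption: a line shorter than max characters is a header.
        length($0) < max { p = ($0 in keep) }
        # Print the header and its data lines while the flag is set.
        p
    ' "$file"
}
```

It could be called as `extract_short infile BA B8_ > outfile`. Headers are compared as whole lines, so `BA` does not also select `BA1_`.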