Perl issue - please help!
Hello. I've been writing some Perl code to read strings from HTML files and have run into a problem. In the HTML file, each "paragraph" describes one file on the website. I need to find every file of a certain type, in this case the ones shown in green, i.e. with bgcolor=#ddffff. My match finds them, but it only returns the line that matched. I need my code to return the entire paragraph, because the string I actually want is inside each paragraph that contains #ddffff, usually about 7 lines below the match. Example:
Code:
</tr>
<tr bgcolor="#ddffff"><td><a target=_top href=http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=34305><font color="green" size=-1>Lotus japonicus</font></a></td>
<td><font size=-1> </font></td>
<td><a target=_top href=http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Overview&list_uids=15617><font size=-1>NC_002694</font></a></td>
<td><font size=-1>150519 nt </font></td>
<td><a target=_top href=http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Protein+Table&list_uids=15617><font size=-1>82</font></a></td>
<td><a target=_top href=http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Structural+RNA+Table&list_uids=15617><font size=-1>45</font></a></td>
<td><a target=_top href=http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=Search&TermToSearch=NC_002694[accn]><font size=-1>128</font></a></td>
<td><font size=-1>Mar 1 2001</font></td>
<td><font size=-1>Jan 30 2008</font></td>

This is one of the "paragraphs" I would need, because it does in fact have bgcolor="#ddffff". From this paragraph, I then need to return and print the NC_'number' that is in the middle of it. How do I do this when matching on "#ddffff" only returns the line that the text is on? Any help would be great!

Last edited by Neo; 05-20-2009 at 06:24 PM. Reason: code tags
Algorithm
Well, one possible workaround is to read the first paragraph, save it in an array, and then run the grep on that array.
If the grep returns nothing, fill the array with the next paragraph and repeat the grep. Doing this in a for loop that goes through all the lines seems feasible. Just a thought.
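The buffer-then-grep idea above could be sketched roughly as follows. This is only an illustration under some assumptions: a new "paragraph" is taken to start at each line containing `<tr`, the sample rows are made up and read from an in-memory string rather than from genomehtml2.txt, and the helper name `accession_from_row` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; the real data would come from genomehtml2.txt.
my $html = <<'END_HTML';
<tr bgcolor="#ffffff"><td>Some other species</td>
<td><font size=-1>NC_000001</font></td>
</tr>
<tr bgcolor="#ddffff"><td>Lotus japonicus</td>
<td><font size=-1> </font></td>
<td><font size=-1>NC_002694</font></td>
</tr>
END_HTML

# Return the first NC_ accession in a buffered row, but only if the row
# carries the target background colour; otherwise return an empty list.
sub accession_from_row {
    my @row = @_;
    return unless grep { /#ddffff/i } @row;
    my ($hit) = grep { /\bNC_\d+\b/ } @row;
    return unless defined $hit;
    my ($id) = $hit =~ /\b(NC_\d+)\b/;
    return $id;
}

open my $in, '<', \$html or die "can't open in-memory sample: $!";

my (@row, @found);
while (my $line = <$in>) {
    # Assumption: a line containing <tr starts the next "paragraph",
    # so inspect the finished buffer before starting a new one.
    if ($line =~ /<tr\b/i && @row) {
        push @found, accession_from_row(@row);
        @row = ();
    }
    push @row, $line;
}
push @found, accession_from_row(@row) if @row;    # last row in the input
close $in;

print "$_\n" for @found;
```

Because the whole row sits in `@row` before any test is made, the colour check and the NC_ search always apply to the same paragraph.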
Code:
#!/usr/bin/perl
my $data_file = 'genomehtml2.txt';
open DATA, "$data_file" or die "can't open $data_file $!";
my @array_of_lines = <DATA>;
foreach my $line (@array_of_lines) {
    if ($line =~ m/#ddffff/i) {
        print "This line: $line\n";
    }
}
close(DATA);

This is what I have so far, and it returns the first line of each paragraph that has "#ddffff". I just don't know where to put the code that gets the NC numbers. I also have some code I've tried using grep:

Code:
#!/usr/bin/perl
my $data_file = 'genomehtml2.txt';
open DATA, "$data_file" or die "can't open $data_file $!";
my @array_of_lines = <DATA>;
my @grepColor = grep(/#ddffff/, @array_of_lines);
my @grepFiles = grep(/NC_/, @array_of_lines);

I don't really know where to go with this one. Any coding ideas?
I believe the construct <DATA> returns only ONE line at a time when it is read in scalar context (in list context, as in your array assignment, it returns all the remaining lines)...
To work through the file line by line, you need to put it in a loop...

Code:
$state = 0;
$line = <DATA>;
$state = 1 if $line =~ /#ddffff/i;
while (<DATA>) {
    $keep_line = $_ if ($state && $_ =~ /NC_/);
    # Now do something with $keep_line to persist it...
    $state = 1 if $_ =~ /#ddffff/i;
    $state = 0 if $_ =~ /^\s*$/;    # a blank line ends the paragraph
}

Assuming blank lines between paragraphs, set a $state variable to 1 (some TRUE value) when you encounter a /#ddffff/ line. With the state set to TRUE, look for your NC_ pattern and save the line. At the next blank line, set the state back to zero so that the next paragraph will NOT be searched unless it also contains #ddffff, and so on. There's probably a more elegant way to do it, but this should get you started...
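If the paragraphs really are separated by blank lines, as the reply above assumes, Perl's paragraph mode (setting the input record separator `$/` to the empty string) is one possibly tidier option, since each read then returns a whole block. The sketch below uses a made-up in-memory sample; whether the real genomehtml2.txt has blank lines between rows is not known from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample with a blank line between records (an assumption;
# the real genomehtml2.txt may not be laid out this way).
my $text = <<'END_TEXT';
<tr bgcolor="#ffffff"><td>Other species</td>
<td>NC_000001</td>

<tr bgcolor="#ddffff"><td>Lotus japonicus</td>
<td>150519 nt</td>
<td>NC_002694</td>
END_TEXT

open my $in, '<', \$text or die "can't open in-memory sample: $!";

my @ids;
{
    local $/ = "";    # paragraph mode: one blank-line-separated block per read
    while (my $para = <$in>) {
        next unless $para =~ /#ddffff/i;             # wrong colour, skip the block
        push @ids, $1 if $para =~ /\b(NC_\d+)\b/;    # first accession in the block
    }
}
close $in;

print "$_\n" for @ids;
```

With the whole paragraph in `$para`, no state variable is needed: both patterns are tested against the same block.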
Your html sample is pretty small, but see how this works:
Code:
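my @NC = ();
my $data_file = 'genomehtml2.txt';
open(my $IN, '<', $data_file) or die "can't open $data_file $!";
OUTTER: while (<$IN>) {
    # Outer loop: skip ahead to the next green row.
    if (/<tr bgcolor="#ddffff">/) {
        # Inner loop: keep reading from the same handle until an NC_ number appears.
        INNER: while (<$IN>) {
            if (/\b(NC_\d+)\b/) {
                push @NC, $1;
                next OUTTER;    # go back to looking for the next green row
            }
        }
    }
}
print "$_\n" for @NC;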
Hey, thanks guys. KevinADC, I just tried your code and it worked great, but could you explain what exactly you were thinking when you put it together? I'd like to know as a learning experience. I understand most of it, but a full description would be great. Thanks!
Quote:
It's very similar to what ghostdog posted using a binary flag ($f), but the way I did it is "flagless".
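For comparison, a flag-based loop of the kind mentioned here might look like the sketch below. ghostdog's post is not reproduced in this thread, so the shape of the loop, the made-up in-memory sample rows, and the handle name `$in` are all assumptions; only the flag name `$f` comes from the reply above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample rows standing in for genomehtml2.txt.
my $html = <<'END_HTML';
<tr bgcolor="#ffffff"><td>Other species</td>
<td>NC_000001</td>
<tr bgcolor="#ddffff"><td>Lotus japonicus</td>
<td>150519 nt</td>
<td>NC_002694</td>
<tr bgcolor="#ffffff"><td>Another species</td>
<td>NC_000002</td>
END_HTML

open my $in, '<', \$html or die "can't open in-memory sample: $!";

my @NC;
my $f = 0;                       # 1 while we are still looking inside a green row
while (my $line = <$in>) {
    if ($line =~ /<tr bgcolor="#ddffff">/) {
        $f = 1;                  # green row seen: start looking for an accession
        next;
    }
    if ($f && $line =~ /\b(NC_\d+)\b/) {
        push @NC, $1;
        $f = 0;                  # found it: ignore lines until the next green row
    }
}
close $in;

print "$_\n" for @NC;
```

In the flagless version the inner `while` reads from the same filehandle as the outer one, so being inside the inner loop plays the role that `$f` plays here.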