Extract lines from files


 
# 1  
Old 08-08-2009
Extract sentences from files

Hi all,

I have three files.

The first file (FILE_INFO in my code) has four parameters on each line.

Code:
0.00765600    0.08450704    M3    E3 
0.00441931    0.04878049    M4    E5 
0.01904574    0.21022727    M5    E10
0.00510400    0.05633803    M6    E12
0.00905960    0.10000000    M7    E16
0.00799376    0.08823529    M8    E17
0.00424669    0.04687500    M9    E18

I want to write out the corresponding sentences from the 2nd file (M_IN in my code) and the 3rd file (E_IN in my code), based on the 3rd- and 4th-column parameters of the first file. M# and E# are the sentence numbers in the 2nd and 3rd files.

The 2nd and 3rd files have the following format (M# and E# are the sentence numbers in the 2nd and 3rd files respectively):


Code:
M4  asd
M4  dfgg
M4  rtyt
M4  rtytry
M4  etrert
M4  EOS
M5  tyuty
M5  ertert
M5  yuyu
M5  EOS
M6  iui
M6  jkjk
M6  EOS

EOS (end of sentence) stands for the full stop (.).

Please correct the script I have written. E_OUT and M_OUT are the output files to which the corresponding sentences will be written.

Code:
while(my $m_text = <$FILE_INFO> ){

    @me_text = split /\s+/, $m_text;

    while(my $m_input= <$M_IN>)
    {
        @m_no = split /\s+/, $m_input;
        if($m_no[0] eq $me_text[2])
        {
            chomp;
            s/^M\d[ ]+//g;
            s/[ ]*$//;
            $x .= " ".$_;
            $x =~ s/^ //;
            $x =~ s/ EOS[ ]*/.\n/g;
        }
    }
    print M_OUT $_;

    while(my $e_input = <$E_IN>)
    {
        @e_no = split /\s+/, $e_input;
        if($e_no[0] eq $me_text[3])
        {
            chomp;
            s/^E\d[ ]+//g;
            s/[ ]*$//;
            $x .= " ".$_;
            $x =~ s/^ //;
            $x =~ s/ EOS[ ]*/.\n/g;
        }
    }
    print E_OUT $_;

Expected output in M_OUT:

asd dfgg rtyt rtytry etrert .
tyuty ertert yuyu .
iui jkjk.

E_OUT should have the same format, with the corresponding sentences picked from the E_IN file based on the parameters in the FILE_INFO file.

Thanks in advance.

Last edited by my_Perl; 08-08-2009 at 04:57 PM.. Reason: Spelling and clarity of the problem
# 2  
Old 08-08-2009
Hi.

Many people don't wish to slog through someone else's code to find logic errors. You can use intermediate prints or the debug facility of perl to see where your code is incorrect. You could also look at provably correct code to see how it works.

Here's a solution in shell:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate creation of string from index of strings.

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) tr cut grep sed
set -o nounset
echo

FILE1=data1
FILE2=data2

echo
echo " Data file $FILE1:"
cat $FILE1

echo
echo " Data file $FILE2:"
cat $FILE2

echo
echo " Results:"
tr -s ' ' <$FILE1 |
cut -d" " -f3 >t1

for key in $( cat t1 )
do
  # echo
  # echo " File $key:"
  if [ -z "$(grep "$key" $FILE2)" ] 
  then
    echo " Ignoring $key: no match." >&2
    continue
  fi
  grep "$key" $FILE2 |
  tr -s ' ' |
  cut -d" " -f2 |
  paste -d" "  -s |
  sed 's/ EOS/./'
done

exit 0

producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
tr (GNU coreutils) 6.10
cut (GNU coreutils) 6.10
GNU grep 2.5.3
GNU sed version 4.1.5


 Data file data1:
0.00765600    0.08450704    M3    E3 
0.00441931    0.04878049    M4    E5 
0.01904574    0.21022727    M5    E10
0.00510400    0.05633803    M6    E12
0.00905960    0.10000000    M7    E16
0.00799376    0.08823529    M8    E17
0.00424669    0.04687500    M9    E18

 Data file data2:
M4  asd
M4  dfgg
M4  rtyt
M4  rtytry
M4  etrert
M4  EOS
M5  tyuty
M5  ertert
M5  yuyu
M5  EOS
M6  iui
M6  jkjk
M6  EOS

 Results:
 Ignoring M3: no match.
asd dfgg rtyt rtytry etrert.
tyuty ertert yuyu.
iui jkjk.
 Ignoring M7: no match.
 Ignoring M8: no match.
 Ignoring M9: no match.
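
One thing to watch in the script above: grep "$key" matches the key anywhere in the line, so a key such as M1 would also pick up the lines for M10, M11 and so on. The sample data has no such keys, so the results shown are not affected. A minimal sketch of an exact comparison on the first field follows; words_for_key and sentence_for_key are made-up helper names and the data is invented for the demonstration.

```shell
#!/usr/bin/env bash

# words_for_key KEY FILE (hypothetical helper): print the second field of
# every line whose first field is exactly KEY.
words_for_key() {
  awk -v k="$1" '$1 == k { print $2 }' "$2"
}

# sentence_for_key KEY FILE (hypothetical helper): join the words with single
# spaces and turn the trailing EOS marker into a full stop.
sentence_for_key() {
  words_for_key "$1" "$2" | paste -s -d ' ' - | sed 's/ EOS$/./'
}

# Invented data in the format of the 2nd file.
data=$(mktemp)
printf '%s\n' 'M1  asd' 'M1  EOS' 'M10  iui' 'M10  jkjk' 'M10  EOS' > "$data"

grep -c 'M1' "$data"          # prints 5: the pattern M1 also matches M10 lines
sentence_for_key M1 "$data"   # prints: asd.
sentence_for_key M10 "$data"  # prints: iui jkjk.

rm -f "$data"
```

This still reads the data file once per key, like the script above; it only changes how a line is matched to a key.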

So now that I think I understand the problem, I do a perl version and try to make sure that I avoid reading the data file more than once (as is done with grep in the shell script):
Code:
#!/usr/bin/perl

# @(#) p1	Demonstrate creation of string from index of strings.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

my ( $f1, $f2, $i, @index, $junk, $key, %sentence, $t1 );

open( $f1, "<", "data1" ) || die(" Cannot open data1.\n");

# Get the indices into array "index".

while (<$f1>) {
  $t1 = (split)[2];
  push @index, $t1;
}
print " index is :@index:\n" if $debug;
close $f1;

open( $f2, "<", "data2" ) || die(" Cannot open data2.\n");

# Read data file of words, check for match to anything in array
# "index", add to appropriate sentence hash.

while (<$f2>) {
  chomp;
  print " Working on line :$_:\n" if $debug;
  for ( $i = 0; $i <= $#index; $i++ ) {
    if (/^$index[$i]/) {
      print " Found match for :$_:\n" if $debug;
      $t1 = (split)[1];
      print " Adding :$t1: to sentence.\n" if $debug;
      $sentence{ $index[$i] } .= "$t1 ";
    }
  }
}

# Print the completed hash of sentences.

for $key ( sort keys %sentence ) {
  $sentence{$key} =~ s/ EOS/./;
  print "$sentence{$key}\n";
}

exit(0);

Using the same sample data files, this produces:
Code:
% ./p1
asd dfgg rtyt rtytry etrert. 
tyuty ertert yuyu. 
iui jkjk.

Note the use of print statements if $debug is true. Simply swapping the position of the two assignments turns those debugging outputs on and off. That's useful for a quick program, and the code can be left in, ready to turn on if and when the code is modified ... cheers, drl

PS I eliminated the extra trailing space before the full stop; it looked better that way.
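
The same on/off switch carries over to the shell version. A minimal sketch, assuming a DEBUG environment variable and a dbg helper (both names are made up here and are not part of the scripts above); the trace lines go to stderr so they stay out of the results.

```shell
#!/usr/bin/env bash

# DEBUG is a hypothetical environment variable; tracing is off by default.
debug=${DEBUG:-0}

# dbg MESSAGE... (hypothetical helper): write a trace line to stderr,
# but only when debugging is switched on.
dbg() {
  if [ "$debug" = 1 ]; then
    printf ' %s\n' "$*" >&2
  fi
}

for key in M4 M5 M6
do
  dbg "Working on key :$key:"
  echo "processing $key"
done
```

Running the script with DEBUG=1 in the environment turns the trace lines on; leaving it unset keeps them quiet.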
# 3  
Old 08-09-2009
Definitely, thanks a lot.

---------- Post updated 08-09-09 at 04:32 AM ---------- Previous update was 08-08-09 at 03:17 PM ----------

Hi drl

The Perl script works well for a small number of sentences, but when the number reaches 15 or more, a group of sentences merges together to form a single line. Also, some of the sentences get printed repeatedly. In fact, I want to run this program on thousands of sentences. Is this problem due to the array or something else?
# 4  
Old 08-09-2009
Hi.

Yes, the lines will be quite long.

Do you want the "EOS" to be full-stop AND end-of-line? ... cheers, drl
# 5  
Old 08-10-2009
I want EOS to act as end-of-sentence as well as end-of-line, rather than just a full stop.
# 6  
Old 08-10-2009
Hi.

OK, I changed the way that the sentence data structure is handled. The memory use might be high for a very large file, but for the sample data you have provided, this produces the same output:
Code:
#!/usr/bin/perl

# @(#) p1	Demonstrate creation of string from index of strings.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

my ( $f1, $f2, $i, @index, $junk, $key, %sentence, $t1 );

open( $f1, "<", "data1" ) || die(" Cannot open data1.\n");

# Get the indices into array "index".

while (<$f1>) {
  $t1 = (split)[2];
  push @index, $t1;
}
print " index is :@index:\n" if $debug;
close $f1;

open( $f2, "<", "data2" ) || die(" Cannot open data2.\n");

# Read data file of words, check for match to anything in array
# "index", add to appropriate sentence hash.

while (<$f2>) {
  chomp;
  print " Working on line :$_:\n" if $debug;
  for ( $i = 0; $i <= $#index; $i++ ) {
    if (/^$index[$i]/) {
      print " Found match for :$_:\n" if $debug;
      $t1 = (split)[1];
      print " Adding :$t1: to sentence.\n" if $debug;
      $sentence{ $index[$i] } .= "$t1 ";
    }
  }
}

# Print the completed hash of sentences.

for $key ( sort keys %sentence ) {
  $sentence{$key} =~ s/ *EOS */.\n/g;
  print "$sentence{$key}";
}

exit(0);

producing:
Code:
% ./p1
asd dfgg rtyt rtytry etrert.
tyuty ertert yuyu.
iui jkjk.

cheers, drl
# 7  
Old 08-10-2009
Hi

After the correction to the code, some of the sentences still get printed repeatedly in the output. What could be the problem? I want each sentence to be printed only once. What can be done about this?

Last edited by my_Perl; 08-12-2009 at 07:22 AM.. Reason: Change in addressing
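
A likely cause, judging only from the code posted in this thread: the test /^$index[$i]/ is a prefix match, so the key M1 also matches lines that start with M10 through M19. The words of those sentences are appended to M1's entry as well as to their own, which would explain both the merged lines and the repeats once the sentence numbers reach two digits. In the Perl script, comparing the first field exactly, for instance (split)[0] eq $index[$i], should avoid that. Note also that sort keys orders the keys as strings, so M10 comes before M4.

The same idea as a minimal sketch in shell with awk, not a tested replacement for the script above. extract_sentences is a made-up helper name and the data is invented; it reads the word file once, compares the key exactly, and prints each sentence once, in the order the sentences appear in the word file.

```shell
#!/usr/bin/env bash

# extract_sentences INFO_FILE WORD_FILE COLUMN (hypothetical helper)
# INFO_FILE: the parameter file; COLUMN: which of its columns holds the key
# (3 for M#, 4 for E#). Assumes INFO_FILE is not empty.
extract_sentences() {
  awk -v col="$3" '
    NR == FNR {                      # first file: remember the wanted keys
      if (NF >= col) want[$col] = 1
      next
    }
    ($1 in want) && !printed[$1] {   # second file: exact match on field 1
      if ($2 == "EOS") {             # EOS ends the sentence: print it once
        print sent[$1] "."
        printed[$1] = 1
      } else {
        sent[$1] = (sent[$1] == "" ? $2 : sent[$1] " " $2)
      }
    }
  ' "$1" "$2"
}

# Invented data with keys that share a prefix.
info=$(mktemp)
words=$(mktemp)
printf '%s\n' '0.1 0.2 M1 E3' '0.1 0.2 M10 E5' > "$info"
printf '%s\n' 'M1  asd' 'M1  dfgg' 'M1  EOS' 'M10  iui' 'M10  jkjk' 'M10  EOS' > "$words"

extract_sentences "$info" "$words" 3
# prints:
# asd dfgg.
# iui jkjk.

rm -f "$info" "$words"
```

For the E# sentences the call would be the same with the 3rd file and column 4, with the output redirected to E_OUT.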