Extraction of various lines from a huge file

04-27-2008

Registered User

15, 0

Join Date: Jan 2008

Last Activity: 1 July 2010, 1:31 AM EDT

Posts: 15

Thanks Given: 0

Thanked 0 Times in 0 Posts

Dear Members,
I have a huge file generated by running the 'whois' command for hundreds of IPs. Each section in the file starts with a line beginning "[Querying whois".

Within each section, I want to extract the lines that start with any of these words: [Querying whois, OrgName, NetRange, inetnum, descr, owner, Country.

Input:

[Querying whois.XJHIOUIIOOPIOP]

OrgName: University of C
OrgID: U1
Address: OIT
Address: NH
City: BC
StateProv: XY
PostalCode: 000000
Country: MN

NetRange: XXX.YYY.M.N - XXX.YYY.M.Q
CIDR: LMANERIE
NetName: UC

[Querying whois.ABCE.TSD]

% Rights restricted by copyright.
% See

% Note: This output has been filtered.
% To receive output for a database update, use the "-B" flag

inetnum: XXX.YYY.M.N - XXX.YYY.M.Q
netname: NET-C
descr: HB
descr: The University
country: PQ
admin-c: TYE
tech-c: SDF
status: FGRG
mnt-by: FSDGFG
source: FGDFSG

role: OPRROKROTR
address: The University
address: DJFIEJRE
address: DIJAIRJEJ
address: EIREROERE

Required output:

[Querying whois.BUHIOUJIOU]
OrgName: HHHHHHHHHH (may or may not be present)
NetRange: TTTTTTTTT (may or may not be present)
inetnum: FTYFYYYUII (may or may not be present)
descr: HIJKJKLLKL (preferably only the first occurrence)
owner: JHKJOJOIPI (may or may not be present)
Country: OIOPOPOP (first occurrence only)

Thanking you
With regards

srsahu75

04-27-2008

Registered User

3,653, 12

Join Date: Mar 2008

Last Activity: 28 March 2011, 6:41 AM EDT

Location: /there/is/only/bin/sh

Posts: 3,653

Thanks Given: 0

Thanked 12 Times in 10 Posts

Different registrars use different output formats. Unless you are querying a very restricted set of domains (for example, domains all registered by one person, or all registered with the same registrar or a small set of registrars), this may turn out to be more complex than you thought.

Perhaps it would be useful as a first step to split the entries into separate files, one per [Querying ... line? Try the csplit command for that. Then you can create a parser for each of the formats you find in there.
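
A rough sketch of that splitting step, assuming GNU csplit (the -z option and the {*} repeat count are GNU extensions, so other csplit versions may need a different invocation). The helper name split_sections is made up for illustration:

```shell
# Hypothetical helper: split a whois dump into one file per section.
# Usage: split_sections INPUT PREFIX   (writes PREFIX00, PREFIX01, ...)
split_sections() {
  # -s: do not print the size of each piece
  # -z: drop empty pieces, such as the one before the first header (GNU)
  # -f: prefix for the output file names
  # {*}: repeat the pattern as often as it matches (GNU)
  csplit -s -z -f "$2" "$1" '/^\[Querying/' '{*}'
}
```

For example, split_sections whois.txt section- would leave section-00, section-01 and so on in the current directory, each starting with its own [Querying line.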

How do you know when to stop? Often a record will include hierarchical information (especially for the ARIN information, which is what your ABCE.TSD example looks like) in which the later lines are more specific than the earlier ones. Then you often want the later lines, not the earlier ones. (But this depends on what you need this for, of course.)
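
If the later, more specific lines are the ones that matter, one possible variation is to remember the last value seen for each key and print the set when the section ends. This is only a sketch: the function name extract_last and the fixed output order are my assumptions, not part of the original request.

```shell
# Hypothetical helper: per section, keep the LAST line seen for each key.
extract_last() {
  awk '
    BEGIN {
      # Keys of interest, lowercased; this is also the output order
      n = split("orgname netrange inetnum descr owner country", order, " ")
      for (j = 1; j <= n; j++) want[order[j]] = 1
    }
    # Print the header and the remembered lines of the finished section
    function emit(   j) {
      if (header == "") return
      print header
      for (j = 1; j <= n; j++) if (order[j] in last) print last[order[j]]
      split("", last)            # portable way to empty the array
    }
    /^\[Querying/ { emit(); header = $0; next }
    {
      i = index($0, ":")
      if (i == 0) next
      key = tolower(substr($0, 1, i - 1))
      if (key in want) last[key] = $0   # later lines overwrite earlier ones
    }
    END { emit() }
  ' "$@"
}
```

Note that the lines come out in the fixed key order rather than in the order they appeared in the input.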

Anyway, here's an attempt at implementing your current spec. This simply prints the first occurrence of each wanted key after each Querying line:

Code:

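perl -ne 'if (/^\[Querying/) {
  # New section: print the header and reset the list of keys still wanted
  print; @wanted = qw(OrgName NetRange inetnum descr owner Country);
  $wanted = &wanted(@wanted);
}
sub wanted {
  # Build a regex matching any remaining key at the start of a line
  return "^(" . join ("|", map { quotemeta $_ } @_) . "):";
}
if ($wanted && $_  =~ m/$wanted/i) {
  print;
  # Compare case-insensitively so that "country:" also retires "Country"
  @wanted = grep { lc($_) ne lc($1) } @wanted;
  $wanted = @wanted ? &wanted(@wanted) : "";
}' file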

This came out a little more monstrous than I'd like it to be, but maybe you can use it as a starting point.

(In retrospect, maybe it would have been better to use a hash to keep track of which values are already captured, and not capture if the hash says we already have the one we are looking at. Push the captured ones to an array if preserving order is important.)
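
A minimal sketch of that hash-based idea, written with awk from the shell rather than Perl. The function name extract_first is hypothetical; keys are compared in lower case so that "country:" and "Country:" count as the same key.

```shell
# Hypothetical helper: print each wanted key once per section,
# using an associative array to remember what was already printed.
extract_first() {
  awk '
    BEGIN {
      n = split("orgname netrange inetnum descr owner country", keys, " ")
      for (j = 1; j <= n; j++) want[keys[j]] = 1
    }
    # Section header: print it and forget what the last section captured
    /^\[Querying/ { print; split("", seen); in_section = 1; next }
    in_section {
      i = index($0, ":")
      if (i == 0) next
      key = tolower(substr($0, 1, i - 1))
      if ((key in want) && !(key in seen)) { seen[key] = 1; print }
    }
  ' "$@"
}
```

Called as extract_first file, it prints the lines in the order they appear in the input, so no separate array is needed to preserve order.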

Last edited by era; 04-27-2008 at 08:53 AM.. Reason: Add /i flag to make matching ignore case

era

04-29-2008

Hi,
Thank you very much for the help. The script covers about 70% of what I need. I will try to work out the remaining 30% myself.

Thanking you
With regards
Satya

srsahu75

05-05-2008

Dear Era,
I would like the script to take the input file, as well as the output file, as a variable. I have two text files: (1) a list of folders in which the script should work, and (2) a list of input files on which the script should work.
Because I lack Perl knowledge, my attempts were unsuccessful. In a shell script I use:

for i in `(cat countries.txt)`
do
  for j in `(cat year.txt)`
  do
    for k in `(cat countries/$i/$j)`
    do

In the same way, I want the Perl script to take the input file as a variable.

Thanks

srsahu75

05-05-2008

As a matter of shell coding style, the parentheses are completely unnecessary, and stuff in backticks works badly if there's a file name with spaces in it.
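
To illustrate the point about backticks, here is one way the two list files could be read line by line instead, so that an entry with spaces stays in one piece. The function name list_inputs and its argument order are my own invention for the example:

```shell
# Hypothetical helper: print one DIR/COUNTRY/YEAR path per line.
# Usage: list_inputs DIR COUNTRYFILE YEARFILE
list_inputs() {
  dir=$1 countries=$2 years=$3
  # read -r with an empty IFS keeps each line intact, spaces included
  while IFS= read -r country; do
    [ -n "$country" ] || continue      # skip blank lines
    while IFS= read -r year; do
      [ -n "$year" ] || continue
      printf '%s\n' "$dir/$country/$year"
    done < "$years"
  done < "$countries"
}
```

For example, list_inputs countries countries.txt year.txt prints paths such as countries/MN/2008, one per line, and a country called "New Zealand" is not split into two words.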

I don't see why you couldn't use that shell script to wrap the Perl code; there's nothing much there which Perl does better than the shell, other than not having to read the country file over and over again (but you could optimize that in the shell script, too). But anyway, here goes. I'm afraid this is completely untested.

Code:

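#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 dir yearfile countryfile\n" unless (@ARGV == 3);
my ($dir, $yearfile, $countryfile) = @ARGV;

open (my $c, "<", $countryfile) || die "$0: Could not open $countryfile: $!\n";
chomp (my @countries = <$c>);   # strip newlines, or the paths below break
close $c;

open (my $y, "<", $yearfile) || die "$0: Could not open $yearfile: $!\n";
while (my $year = <$y>) {
  chomp $year;
  for my $country (@countries) {
    # Same layout as the shell loop: <dir>/<country>/<year>
    handle ("$dir/$country/$year");
  }
}
close $y;

sub handle {
  my ($file) = @_;
  my @wanted;         # keys not yet seen in the current section
  my $wanted = "";    # regex built from @wanted; empty means nothing to match
  open (my $f, "<", $file) || die "$0: Could not open $file: $!\n";
  while (<$f>) {
    if (/^\[Querying/) {
      print; @wanted = qw(OrgName NetRange inetnum descr owner Country);
      $wanted = wanted(@wanted);
    }
    if ($wanted && $_ =~ m/$wanted/i) {
      print;
      # Compare case-insensitively so that "country:" also retires "Country"
      @wanted = grep { lc($_) ne lc($1) } @wanted;
      $wanted = @wanted ? wanted(@wanted) : "";
    }
  }
  close $f;           # close after the loop, not inside it
}

sub wanted {
  return "^(" . join ("|", map { quotemeta $_ } @_) . "):";
}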
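
For comparison, the shell-wrapper route could look roughly like the sketch below, which also gives every input file its own output file. The name filter_each and the way output names are derived from the input path are assumptions made for illustration; the filter is whatever command you pass after the output directory.

```shell
# Hypothetical helper: run a filter command over each input file named
# on stdin, writing one output file per input.
# Usage: filter_each OUTDIR COMMAND [ARGS...]
filter_each() {
  outdir=$1; shift
  mkdir -p "$outdir" || return 1
  while IFS= read -r infile; do
    [ -f "$infile" ] || continue       # silently skip missing inputs
    # Flatten the path into a single file name: a/b/c becomes a_b_c.out
    name=$(printf '%s' "$infile" | tr '/' '_')
    # Redirect stdin so the filter reads the file, not the list of names
    "$@" < "$infile" > "$outdir/$name.out"
  done
}
```

For example, printf '%s\n' countries/MN/2008 | filter_each results grep -i '^descr:' would write results/countries_MN_2008.out.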

era

05-07-2008

Thank you very much for the code

Regards

srsahu75
