Extraction of various lines from a huge file


 
# 1  
Old 04-27-2008

Dear Members,
I have a huge file generated by running the 'whois' command on hundreds of IPs. Each section in the file starts with a line beginning with [Querying whois

From each section I want to extract only the lines that start with one of these words: [Querying whois, OrgName, NetRange, inetnum, descr, owner, Country.

Input:

[Querying whois.XJHIOUIIOOPIOP]


OrgName: University of C
OrgID: U1
Address: OIT
Address: NH
City: BC
StateProv: XY
PostalCode: 000000
Country: MN

NetRange: XXX.YYY.M.N - XXX.YYY.M.Q
CIDR: LMANERIE
NetName: UC


[Querying whois.ABCE.TSD]

% Rights restricted by copyright.
% See

% Note: This output has been filtered.
% To receive output for a database update, use the "-B" flag


inetnum: XXX.YYY.M.N - XXX.YYY.M.Q
netname: NET-C
descr: HB
descr: The University
country: PQ
admin-c: TYE
tech-c: SDF
status: FGRG
mnt-by: FSDGFG
source: FGDFSG

role: OPRROKROTR
address: The University
address: DJFIEJRE
address: DIJAIRJEJ
address: EIREROERE

Required output:

[Querying whois.BUHIOUJIOU]
OrgName: HHHHHHHHHH (may or may not be present)
NetRange: TTTTTTTTT (may or may not be present)
inetnum: FTYFYYYUII (may or may not be present)
descr: HIJKJKLLKL (preferably only the first occurrence)
owner: JHKJOJOIPI (may or may not be present)
Country: OIOPOPOP (first occurrence)

Thanking you
With regards
# 2  
Old 04-27-2008
Different registrars use different output formats. So unless you are querying a very restricted set of domains, for example domains all registered by one person, or for other reasons all registered with the same registrar or only a small set of registrars, this may turn out to be more complex than you thought.

Perhaps it would be useful as a first step to separate the entries into different files, one per [Querying ... line? Try the csplit command for that. Then you can create a parser for each of the formats you find in there.
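
A minimal sketch of that split, assuming GNU csplit (the '{*}' repeat count and the -z option are GNU extensions, so other csplit implementations may need a different invocation). The function name, the "section-" prefix and the output directory are only illustrative:

```shell
#!/bin/sh
# Sketch only: split a combined whois dump into one file per
# "[Querying ..." section. Assumes GNU csplit; split_whois, the
# "section-" prefix and the output directory are hypothetical names.
split_whois() {
  input=$1      # combined whois output
  outdir=$2     # directory that receives the pieces
  mkdir -p "$outdir" || return 1
  # -s suppresses the byte counts, -z drops empty pieces (such as the
  # one before the first match), -f sets the output file name prefix,
  # and '{*}' repeats the pattern for as long as it keeps matching.
  csplit -s -z -f "$outdir/section-" "$input" '/^\[Querying/' '{*}'
}
```

Something like split_whois whois.txt pieces should leave one file per section under pieces/, each starting with its [Querying line, which you can then sort by format.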

How do you know when to stop? Often a record will include hierarchical information (especially for the ARIN information, which is what your ABCE.TSD example looks like) in which the later lines are more specific than the earlier ones. Then you often want the later lines, not the earlier ones. (But this depends on what you need this for, of course.)

Anyway, here's an attempt at implementing your current spec. After each Querying line, it prints only the first occurrence of each of the wanted fields:

Code:
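perl -ne '
# Build a pattern that matches any of the remaining field names.
sub wanted {
  return "^(" . join("|", map { quotemeta $_ } @_) . "):";
}
# New section: print the header and reset the list of wanted fields.
if (/^\[Querying/) {
  print;
  @wanted = qw(OrgName NetRange inetnum descr owner Country);
  $wanted = wanted(@wanted);
}
# First occurrence of a wanted field: print it, then stop looking for it.
if ($wanted && /$wanted/i) {
  my $found = lc $1;   # compare in lower case, since /i also matches "country:"
  print;
  @wanted = grep { lc $_ ne $found } @wanted;
  $wanted = @wanted ? wanted(@wanted) : "";
}' file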

This came out a little more monstrous than I'd like it to be, but maybe you can use it as a starting point.

(In retrospect, maybe it would have been better to use a hash to keep track of which values are already captured, and not capture if the hash says we already have the one we are looking at. Push the captured ones to an array if preserving order is important.)
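
Here is a rough sketch of that hash idea, written in awk wrapped in a shell function rather than in Perl, with an awk associative array standing in for the hash. The function name extract_whois is made up for illustration, and the field list is simply the one from the original question:

```shell
#!/bin/sh
# Sketch only: print each "[Querying" header plus the first occurrence
# of every wanted field in that section. The awk array "seen" plays the
# role of the hash; field names are compared in lower case so that
# "Country:" and "country:" count as the same field.
extract_whois() {
  awk '
    BEGIN {
      n = split("orgname netrange inetnum descr owner country", names, " ")
      for (i = 1; i <= n; i++) wanted[names[i]] = 1
    }
    /^\[Querying/ {
      print
      split("", seen)       # new section: forget what was captured
      in_section = 1
      next
    }
    in_section {
      colon = index($0, ":")
      if (colon < 2) next                       # not a "key: value" line
      key = tolower(substr($0, 1, colon - 1))
      if ((key in wanted) && !(key in seen)) {
        seen[key] = 1
        print
      }
    }
  ' "$@"
}
```

Lines come out in the order they appear in the input, so no extra array is needed to preserve order. Call it as extract_whois file, or feed it on standard input.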

Last edited by era; 04-27-2008 at 08:53 AM.. Reason: Add /i flag to make matching ignore case
# 3  
Old 04-29-2008
Hi,
Thank you very much for the help. The script covers about 70% of what I need. I will try to work out the remaining 30% myself.

Thanking you
With regards
Satya
# 4  
Old 05-05-2008
Dear Era,
I would like the script to take the input file, and also the output file, as variables. I have two text files: (1) a list of the folders in which the script should work, and (2) a list of the input files on which it should work.
Because I do not know Perl well, my attempts were unsuccessful. In a shell script I use:

for i in `(cat countries.txt)`
do
  for j in `(cat year.txt)`
  do
    for k in `(cat countries/$i/$j)`
    do



In the same way, I want the Perl script to take the input file as a variable.

Thanks
# 5  
Old 05-05-2008
As a matter of shell coding style, the parentheses are completely unnecessary, and stuff in backticks works badly if there's a file name with spaces in it.
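
To illustrate, here is a minimal sketch that reads the lists line by line with while read instead of word-splitting the output of cat. The countries.txt, year.txt and countries/$i/$j layout are taken from your loop; the function name for_each_whois_file and the idea of passing the command to run as trailing arguments are my own assumptions:

```shell
#!/bin/sh
# Sketch only: walk every country/year combination without backticks.
# "$1" is the base directory, "$2" the country list, "$3" the year
# list; any remaining arguments are the command to run on each file.
# for_each_whois_file is a hypothetical name.
for_each_whois_file() {
  base=$1 countries=$2 years=$3
  shift 3
  while IFS= read -r country; do
    while IFS= read -r year; do
      file=$base/$country/$year
      # Skip combinations that have no file instead of aborting.
      [ -f "$file" ] || continue
      # Keep the command from swallowing the rest of the year list.
      "$@" "$file" </dev/null
    done < "$years"
  done < "$countries"
}
```

For example, for_each_whois_file countries countries.txt year.txt cat would print every file in turn, and names containing spaces survive because each line is read whole and every expansion is quoted.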

I don't see why you couldn't use that shell script to wrap the Perl code; there's nothing much there which Perl does better than the shell, other than not having to read the country file over and over again (but you could optimize that in the shell script, too). But anyway, here goes. I'm afraid this is completely untested.

Code:
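#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 dir yearfile countryfile\n" unless @ARGV == 3;
my ($dir, $yearfile, $countryfile) = @ARGV;

# Read both lists up front, stripping the trailing newlines.
open(my $y, "<", $yearfile) || die "$0: Could not open $yearfile: $!\n";
chomp(my @years = <$y>);
close $y;
open(my $c, "<", $countryfile) || die "$0: Could not open $countryfile: $!\n";
chomp(my @countries = <$c>);
close $c;

for my $year (@years) {
  for my $country (@countries) {
    # Same layout as the shell loop: dir/country/year
    handle("$dir/$country/$year");
  }
}

# Return a pattern matching any of the given field names at line start.
sub wanted {
  return "^(" . join("|", map { quotemeta $_ } @_) . "):";
}

# Print each [Querying header and the first occurrence of each wanted field.
sub handle {
  my ($file) = @_;
  my (@wanted, $wanted);
  open(my $f, "<", $file) || die "$0: Could not open $file: $!\n";
  while (<$f>) {
    if (/^\[Querying/) {
      print;
      @wanted = qw(OrgName NetRange inetnum descr owner Country);
      $wanted = wanted(@wanted);
    }
    if ($wanted && /$wanted/i) {
      my $found = lc $1;   # /i also matches e.g. "country:"
      print;
      @wanted = grep { lc $_ ne $found } @wanted;
      $wanted = @wanted ? wanted(@wanted) : "";
    }
  }
  close $f;   # close only after the whole file has been read
}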

# 6  
Old 05-07-2008
Thank you very much for the code

Regards