Want to extract certain lines from big file


 
# 29  
Old 01-24-2016
Quote:
Originally Posted by mad man
[..]

Dear Scrutinizer,

Thanks a lot, this time the sed worked.

It gave me the desired output.

Thanks.

---------- Post updated at 05:54 PM ---------- Previous update was at 05:51 PM ----------

Hi,

This message is intended for all of you who replied to help me out.
Hats off for your efforts to help me. I would also request each of you to suggest a link to good material, whatever you feel is best, for me to learn at least the basics of sed and awk.

Thanks.
[..]
You are welcome! Please note that I updated my post and added the tildes to the search string (~$transnum~), which had fallen off before; that should make it a bit more accurate, as Don also suggested earlier...
# 30  
Old 01-24-2016
Hi,
As sed but in awk:
Code:
awk  "/~$transnum~/{\$0=X\"\n\"\$0};/~$transnum~/,/EOT/;{X=\$0}" file
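
For readability, the same logic can be spread over several lines, passing the shell variable in with -v so that single quotes can be used (a sketch; it assumes $transnum is set in the calling shell, as in the sed version):
Code:
awk -v tn="$transnum" '
$0 ~ "~" tn "~" { $0 = X "\n" $0 }   # prepend the previously saved line
$0 ~ "~" tn "~", /EOT/               # print from the matching line through the EOT line
{ X = $0 }                           # remember the current line for the next cycle
' file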

Regards.
# 31  
Old 01-24-2016
Quote:
Originally Posted by mad man
Hi Don,

Sorry, since I am new to using forum websites, I am afraid of posting a bank's transaction input structure on a public website. I am afraid it might land me in trouble, and I also apologize for the faulty input. I am learning step by step towards perfection.

I am going to try out your new suggestions and will update you in 15 minutes.

Thanks.

---------- Post updated at 05:40 PM ---------- Previous update was at 05:01 PM ----------

Hi Don

I am getting the error below after making the changes you suggested.
Code:
awk: Cannot divide by zero.

 The input line number is 32042. The file is /tmp/remedixz.20160120_085021_41222370_1.
 The source line number is 18.

Line 32042 is the EOT line of that particular transaction reference number. Please find the code below:
Code:
big_file='/tmp/remedixz.20160120_085021_41222370_1'
trannum="/tmp/transnum"

/tmp> cat /tmp/transnum
ABC160120XYZ0983921 

##In the above you can see the transnum given

awk -F '~' '
    FNR == NR {
      t[$1]
      tc = FNR
      next
      } 
      {
      l[++lc] = $0
      }
    $1 == "%%YEDTRN" && $3 in t {
        remove t[transnum = $2]
        tc--
    }

    /^0000EOT/ {
        if(transnum) {
            for(i = 1; i <= lc; i++)
                print l[i] > (/tmp/remedixz.20160120_085021_41222370_1_new "_" transnum)
            close(/tmp/remedixz.20160120_085021_41222370_1_new "_" transnum)
            printf("Transaction #%s extracted to file /tmp/remedixz.20160120_085021_41222370_1_new "_" transnum:%s\n", transnum,
                transnum)
        }
        if(tc) {
            lc = 0
            transnum = ""
        } else {
            exit
        }
    }' $trannum $file

This time I just gave the output file name directly rather than using a variable.
Kindly let me know what I am missing.

Thanks.
Realize that I have been up all night trying to help you (and it is now almost 6AM where I am), so I may not be thinking clearly. But, could you please explain why you chose to change the code I suggested:
Code:
			print l[i] > (FILENAME "_" transnum)

to:
Code:
                print l[i] > (/tmp/remedixz.20160120_085021_41222370_1_new "_" transnum)

FILENAME is an awk variable holding the name of the current input file. But /tmp/remedixz.20160120_085021_41222370_1_new is an attempt to divide nothing by the contents of the variable tmp, divided by the contents of the variable remedixz, followed by a syntax error. And since neither tmp nor remedixz has been given a value in this awk script, both are zero, which produces the division by zero error.
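
To illustrate the difference with a made-up filename (/tmp/some_output is just a placeholder, not your path), the fix is simply to quote the literal part of the name:
Code:
# Unquoted: awk parses this as variable names and '/' operators, not as a filename
# (this is what produced the "Cannot divide by zero" error above).
print l[i] > (/tmp/some_output "_" transnum)

# Quoted: a string literal concatenated with "_" and the transnum variable.
print l[i] > ("/tmp/some_output" "_" transnum)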

Would you PLEASE just try the following script without changing it:
Code:
#!/bin/ksh
big_file='/tmp/remedixz.20160120_085021_41222370_1'
transnums='/tmp/transnum'

awk -F '~' '
FNR == NR {
	# Gather transaction numbers...
	t[$1]
	tc = FNR
	next
}
{	# Gather transaction lines.
	l[++lc] = $0
}
$1 == "%%YEDTRN" && $3 in t {
	# We have found a transaction number for a transaction that is to be
	# extracted.  Save the transaction number and remove this transaction
	# from the transaction list.
	delete t[transnum = $2]
	file = FILENAME "_" transnum
	tc--
}
/^0000EOT/ {
	# If we have a transaction that is to be printed, print it.
	if(transnum) {
		# Print the transaction.
		for(i = 1; i <= lc; i++)
			print l[i] > file
		close(file)
		printf("Transaction #%s extracted to file %s\n", transnum, file)
		# Was this the last remaining transaction to be extracted?
		if(tc) {# No.  Reset for next transaction.
			lc = 0
			transnum = ""
		} else {# Yes.  Exit.
			exit
		}
	}
}' "$transnums" "$big_file"

Note that this has a few changes to match your latest description of your transaction format, has a typo fixed, and has some minor performance improvements. It also now includes your filenames (which had not been provided before).

If /tmp/transnum contains the single line:
Code:
ABC160120XYZ0983921

and there is a transaction in your big transaction file with that transaction number, it should produce a file named /tmp/remedixz.20160120_085021_41222370_1_ABC160120XYZ0983921 containing that transaction. And, as stated before, if /tmp/transnum contains multiple transaction numbers on separate lines, one invocation of this script will produce an output file for each transaction given.

If this all works, you could also add an END clause to print a list of any transaction numbers that were specified in your transaction numbers file but were not found in your big transactions file.
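
A minimal sketch of such an END clause (untested against your data; it simply reports whatever is left over in the t[] array) would be:
Code:
END {
	# Any transaction numbers still in t[] were never matched in the big file.
	for (n in t)
		printf("Transaction #%s was not found\n", n)
}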
# 32  
Old 01-24-2016
Quote:
Originally Posted by mad man
Hi Aia,

Thanks for your command

Code:
export t=2; perl -ne 'if(/^##transaction\b/ .. /EOT$/){ print if $n==$ENV{t}; /EOT$/ and ++$n }; last if $n==$ENV{t}+1' mad_man.example

This is not working. I exported the value of transnum to the variable t, but the output file doesn't have the required output.

Please find one of the existing inline Perl commands we use. If you could give me your command in the same format, it would be helpful.
If I understand you correctly, the command could have been:
Code:
export t="ABC160120XYZ0983920"; perl -ne '/^##transaction\b/ and @t=(); if (/^##transaction\b/ .. /EOT$/){push @t, $_; $f = 1 if /$ENV{t}/;  if (/EOT$/ && $f){print @t; last}}' mad_man.example


Here's another script: it reads a file with one transaction number per line and, for each one it finds, writes an output file whose name ends in .<transaction number>.
Code:
#!/usr/bin/env perl

use strict;
use warnings;

my $trans = shift || die "No search parameters file given\n";
my $haystack = shift || die "Missing data file\n";

my %trans = ();
my @transaction = ();

open my $fh, '<', $trans or die "open $trans: $!\n";
while(<$fh>){
    # Store each requested transaction number as a hash key.
    chomp;
    $trans{$_} = $_;
}
close $fh;

open $fh, '<', $haystack or die "Could not open $haystack: $!\n";
while(<$fh>){
    # Collect the lines of each transaction (from its ##transaction header
    # through its EOT line); at EOT, decide whether to write it out.
    if(/^##transaction\b/ .. /EOT$/){
        push @transaction, $_;
        if(/EOT$/){
            process_tran();
            @transaction = ();
        }
    }
}
close $fh;

sub process_tran {
    for my $k (keys %trans){
        my $yes = grep /$k/, @transaction;
        if($yes){
            write_tran ("$haystack.$k", \@transaction);
            delete $trans{$k};
            last;
        }
    }
}

sub write_tran {
    my ($save_tran, $tran_ref) =  @_;
    open my $wfh, '>', $save_tran
        or die "Could not write to $save_tran: $!\n";
    print $wfh @{ $tran_ref };
    close $wfh;
}

Save it as mad_man.pl and run it as perl mad_man.pl trans_numbers data_with_trans

Or chmod +x mad_man.pl and run it as
/path/to/mad_man.pl /path/to/trans_numbers /path/to/data_with_trans
It will save each extracted transaction in /path/to/data_with_trans.<number>
# 33  
Old 01-25-2016
Quote:
Originally Posted by Scrutinizer
Try this adaptation of RudiC's suggestion, with Don's adaptation for proper shell quoting on AIX:
Code:
sed -n "
/~$transnum~/ {
H
g
}
/~$transnum~/,/EOT/p
h
" file

---


Not so much 2047 bytes: in most implementations the limit is much higher or unlimited, and for some there is a much lower limit, but it is unrelated to LINE_MAX, as I think we worked out before here: Sequence extraction
Hi, the sed code which Scrutinizer posted worked for a transaction that is actually 3455 characters long.

Thanks

---------- Post updated at 12:48 PM ---------- Previous update was at 12:36 PM ----------

Hi

I am going to try all of your new suggestions today and reply you back.

Thanks.
# 34  
Old 01-27-2016
Quote:
Originally Posted by Don Cragun

Hi Don,

I just tried your suggestion and ran the script exactly as you gave it.

I ran it three times in total; I will explain exactly what happened each time.

First Run:
Code:
I gave one transaction number, for example the 990th out of 1200 in total.
The output file has the 1st transaction through the 990th transaction.
The message printed was:
Transaction #0000004646 extracted to file /tmp/remedixz.20160120_085021_41222370_1_0000004646

Second run:
Code:
I gave two transaction numbers, for example the 410th and the 990th. This time the output file created has the transactions from 410 through 990.
The messages printed were:
Transaction #0000004646 extracted to file /tmp/remedixz.20160120_085021_41222370_1_0000004646
Transaction #0000004646 extracted to file /tmp/remedixz.20160120_085021_41222370_1_0000004646

Third Run:
Code:
I gave 3 transaction numbers, for example the 330th, 410th and 990th. This time also the output file was created with the transactions from 410 through 990; it did not take 330 into account.
The messages printed were:
Transaction #0000004646 extracted to file /tmp/remedixz.20160120_085021_41222370_1_0000004646
Transaction #0000004646 extracted to file /tmp/remedixz.20160120_085021_41222370_1_0000004646
Transaction #0000004646 extracted to file /tmp/remedixz.20160120_085021_41222370_1_0000004646

Note that for all the runs only one output file was created, with the extension 0000004646 (I do not understand where this 4646 is coming from).
Can you please suggest what is wrong?
Thanks.
# 35  
Old 01-27-2016
Quote:
Originally Posted by mad man
Note that for all the runs only one output file was created, with the extension 0000004646 (I do not understand where this 4646 is coming from).
Can you please suggest what is wrong?
Thanks.
I sincerely apologize. In each case, the output file you got had a filename derived from the 2nd field (i.e., the data between the 1st and 2nd tildes, which seems to be a constant for the transactions you selected to print) of a line that contained a transaction number you wanted to print. The contents of that file were the transactions starting with the transaction after the next-to-last transaction number you requested in the big input file, through the last transaction number you requested from the big input file.
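
For illustration only (this is a hypothetical line, since your real sample data was never posted), the script below assumes the transaction-number record is laid out roughly like this, so that with -F '~' the transaction number is field 3 rather than field 2:
Code:
%%YEDTRN~0000004646~ABC160120XYZ0983921~...
#  $1        $2              $3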

It comes from me not getting nearly enough sleep, you not providing sample data that matched the actual format of your data, and from me not getting nearly enough sleep. (There were three problems and I'm blaming two of them on not getting enough sleep.) Now that I have cleaned up my test data to match what I believe is your current data format, the following seems to work. Please try this replacement:
Code:
#!/bin/ksh
big_file='/tmp/remedixz.20160120_085021_41222370_1'
trannum='/tmp/transnum'

awk -F '~' '
FNR == NR {
	# Gather transaction numbers...
	t[$1]
	tc = FNR
	next
}
{	# Gather transaction lines.
	l[++lc] = $0
}
$1 == "%%YEDTRN" && $3 in t {
	# We have found a transaction number for a transaction that is to be
	# extracted.  Save the transaction number and remove this transaction
	# from the transaction list.
	delete t[transnum = $3]
	file = FILENAME "_" transnum
	tc--
}
/^0000EOT/ {
	# If we have a transaction that is to be printed, print it.
	if(transnum) {
		# Print the transaction.
		for(i = 1; i <= lc; i++)
			print l[i] > file
		close(file)
		printf("Transaction #%s extracted to file %s\n", transnum, file)
		# Did we just print the last transaction requested?
		if(!tc)	{
			# Yes.  We are done.
			exit
		}
		# No.  Clear found transaction number.
		transnum = ""
	}
	# Reset for next transaction.
	lc = 0
}' "$trannum" "$big_file"

Hopefully, this will do what you want.

As stated before, if someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.