Reading and writing in same file


 
# 8  
Old 09-07-2012
Quote:
Originally Posted by kmajumder
Let me give you a complete example what I am trying to achieve.

1. Below is the log file structure, where I need the 2nd, 5th and 14th columns of the logs after grepping for linkId=1ddoic.

Log file structure:-

abc.com 20120829001415 127.0.0.1 app none11111 sas 0 0 N clk Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 referrer=google.com linkId=1ddoic

abc.com 20120829001416 127.0.0.1 dyn UD3BSAp8appncXlZ UD3BSAp8app4xHbz 0 0 N page Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 segments=

2. Now, for the 1st log, you can see I have an invalid (none11111) 5th column. So I have to look for the actual 5th column value. The 'id' column will help you to find that. So you have to run another grep based on the 'id' value so that you can find the actual 5th column in the same log file.
3. If you see the second log, it has the exact matching 'id' value. So what I have to do is take the 5th column (UD3BSAp8appncXlZ) from the second log instead of the invalid one (none11111).

Output:-

20120829001415, UD3BSAp8appncXlZ, linkId=1ddoic

Note:- I have a bunch of log files where I have to perform the above procedure. But I have to come up with a single file as output after grepping through all the log files.
It has a format like abc-2012-10-01_00000,abc-2012-10-01_00001.... etc.



Hope this time it makes it clear.
Thanks for looking into it.
This is a big improvement over what you have posted before, but there are still some ambiguities.

You say that you're showing the log file structure and that you need fields 2, 5, and 14, and then show two lines from one or two log files. Note that the second record has 13 fields (and the last field appears to be incomplete); not 14. If we are to determine what is supposed to happen, we need to know whether or not field 14 in both lines has the same value (linkId=1ddoic). (I.e., do both of these lines appear in the output of your first grep:
Code:
grep "linkId=1ddoic" log_file

And, PLEASE USE CODE TAGS when presenting file contents!

Let me try restating the problem to determine if I understand what you want done:
  1. In some places you say there are two log files, in other places you say there is one log file but you grep it twice. Which is it?
    If there is a single log file, both greps and the conversion to the desired output can be done by reading the log file just once with awk if the output order doesn't matter.
  2. The first time you read the log file, you look for entries in column 14 that match a given value (linkId=xxx) and ignore anything that doesn't match.
  3. For lines that were selected in step 2, if column 5 is not "none11111", skip to step 5.
  4. Read the same log file again (or read the second log file) looking for a line in the log where field 12 (id=yyy) matches field 12 in the line matched in step 3 AND column 5 is not "none11111". Use the value found in column 5 of this line as a replacement for field 5 in the line matched by step 3.
  5. Print column 2 (from the line matched in step 3), a comma, column 5 (from the line matched in step 3 [updated by the line found in the second reading of the log file if it contained "none11111" in the line matched in step 3]), a comma, and column 12 (from the line matched in step 3 with "id=" at the start of the field removed).
Is this algorithm correct?
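If that restatement is right, the whole thing can be sketched as a two-pass awk read of the same file. This is only a sketch under those assumptions; the sample data is copied from the earlier post, and the file names are illustrative.
Code:

```shell
# Sketch only: sample data and a two-pass awk rendering of the steps above.
cat > /tmp/sample.log <<'EOF'
abc.com 20120829001415 127.0.0.1 app none11111 sas 0 0 N clk Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 referrer=google.com linkId=1ddoic
abc.com 20120829001416 127.0.0.1 dyn UD3BSAp8appncXlZ UD3BSAp8app4xHbz 0 0 N page Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 segments=
EOF
cat > /tmp/fix5.awk <<'EOF'
# Pass 1 (NR == FNR): remember a valid field 5 for each id= field.
NR == FNR { if ($5 != "none11111") id[$12] = $5; next }
# Pass 2: for matching linkId lines, print fields 2, 5 (repaired), and 14.
$14 == link {
        f5 = ($5 == "none11111" && $12 in id) ? id[$12] : $5
        print $2 "," f5 "," $14
}
EOF
awk -v link='linkId=1ddoic' -f /tmp/fix5.awk /tmp/sample.log /tmp/sample.log
```

Reading the file twice keeps the lookup simple: the first pass only records valid field-5 values per id, and the second pass does the selection and the repair.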

Is there one input log file or two? What is its (or are their) name(s)?

Is step 4 only supposed to be performed for lines that have the same contents in field 14? Or is any field 14 value OK, as long as the contents of field 12 match a field 12 in a line whose field 5 doesn't contain "none11111"?

Note that this is three comma separated values; not the four values that were specified in the first several messages in this thread. Is this correct? If not, where does the other output field come from?

Note also that the early messages specified "," as the separator between fields, but in the latest messages you specify ", " instead of ",". Is "," the correct separator?

Does the order of the output lines matter?

The Note in your message:
Quote:
Note:- I have a bunch of log files where I have to perform the above procedure. But I have to come up with a single file as output after grepping through all the log files.
It has a format like abc-2012-10-01_00000,abc-2012-10-01_00001.... etc.
doesn't make things clear at all. We have not seen anything like this list of values in any of the samples you have shown us. Are you saying you have to create a file with a single line that contains a comma separated list of an unspecified number of entries, each a string created using the printf format string "abc-%s_%05d", where the %s is used to print something that comes from the date utility format string %Y-%m-%d run on the first day of next month, and the %05d is used to print a sequence number? Please explain what the entries in this list mean, how many of them there are, and why this list is useful!
# 9  
Old 09-07-2012
Hi Don,

I am trying my best.
This is my first grep command.
Code:
grep -e linkId=1ddoic abc-2012-10-01_000* | cut -f 2,5,14 | sort| uniq

1. Yes. Though the initial fields (up to the 10th column) are constant across all types of log entries, the others will vary. The two example logs I have given are two different types of log generated for two different events. So you will not get the linkId attribute in the 2nd log entry. You do not need to bother about that, because you just need to pick the 5th column from the 2nd log and replace it in the 1st log after checking that the id field matches. But in the real scenario you have to grep through the entire log file to look for the id value found in the 1st log.

Quote:
In some places you say there are two log files, in other places you say there is one log file but you grep it twice. Which is it?
If there is a single log file, both greps and the conversion to the desired output can be done by reading the log file just once with awk if the output order doesn't matter.
i) I have multiple log files where I need to grep (like abc-2012-10-01_00000, abc-2012-10-01_00001, etc.) and output the 2nd, 5th and 14th columns.
ii) While grepping through all the log files, those invalid 2nd columns will appear, which is not intended. In the same file where an invalid 2nd column was found, the valid 2nd column can be found by looking for the matching 'id' attribute value. It is up to you if you can achieve my goal in a single grep.


My algorithm:-
i) for each file, run the above grep
for each row from the above grep, if the 2nd column is invalid (none11111),
run another grep on the same file and replace the invalid 2nd column with the new value.
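Taken literally, that algorithm is one grep per file plus one more grep per bad row. The following sketch (sample file and values copied from the earlier posts; the /tmp paths are illustrative) shows one way it might look:
Code:

```shell
# Illustrative only: one grep per file, plus one grep per invalid row.
mkdir -p /tmp/logs
cat > /tmp/logs/abc-2012-10-01_00000 <<'EOF'
abc.com 20120829001415 127.0.0.1 app none11111 sas 0 0 N clk Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 referrer=google.com linkId=1ddoic
abc.com 20120829001416 127.0.0.1 dyn UD3BSAp8appncXlZ UD3BSAp8app4xHbz 0 0 N page Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 segments=
EOF
cat > /tmp/twopass.sh <<'EOF'
for f in /tmp/logs/abc-2012-10-01_000*; do
    grep -e linkId=1ddoic "$f" | while read -r line; do
        set -- $line                 # split the record on whitespace
        ts=$2; c5=$5; idf=${12}; lnk=${14}
        if [ "$c5" = none11111 ]; then
            # second grep on the same file, keyed on the id= field
            c5=$(grep -F "$idf" "$f" |
                 awk '$5 != "none11111" { print $5; exit }')
        fi
        printf '%s,%s,%s\n' "$ts" "$c5" "$lnk"
    done
done
EOF
sh /tmp/twopass.sh
```

A single awk pass over each file would avoid the per-row greps, but this mirrors the steps as stated.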
# 10  
Old 09-08-2012
Quote:
Originally Posted by kmajumder
Hi Don,

I am trying my best.
This is my first grep command.
Code:
grep -e linkId=1ddoic abc-2012-10-01_000* | cut -f 2,5,14 | sort| uniq

1. Yes. Though the initial fields (up to the 10th column) are constant across all types of log entries, the others will vary. The two example logs I have given are two different types of log generated for two different events. So you will not get the linkId attribute in the 2nd log entry. You do not need to bother about that, because you just need to pick the 5th column from the 2nd log and replace it in the 1st log after checking that the id field matches. But in the real scenario you have to grep through the entire log file to look for the id value found in the 1st log.


i) I have multiple log files where I need to grep (like abc-2012-10-01_00000, abc-2012-10-01_00001, etc.) and output the 2nd, 5th and 14th columns.
ii) While grepping through all the log files, those invalid 2nd columns will appear, which is not intended. In the same file where an invalid 2nd column was found, the valid 2nd column can be found by looking for the matching 'id' attribute value. It is up to you if you can achieve my goal in a single grep.


My algorithm:-
i) for each file, run the above grep
for each row from the above grep, if the 2nd column is invalid (none11111),
run another grep on the same file and replace the invalid 2nd column with the new value.
If the two sample lines from your log files are as you have shown in past posts, the command line you specify above is equivalent to the command:
Code:
grep -e linkId=1ddoic abc-2012-10-01_000* | sort -u

There are no tab characters in your input files, so the cut command in your pipeline is a no-op. So this command line throws away duplicate lines found in your log files and sorts the remaining lines. It does NOT limit the output to only columns 2, 5, and 14 from your input files; it does NOT produce a CSV file with three fields (and if it did, it wouldn't contain the id=value fields that you say are to be used in a second grep to look for the invalid values found while processing the output from your first grep).
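The no-op is easy to demonstrate: cut's default field delimiter is the tab character, and a line containing no delimiter at all is passed through unchanged (unless -s is given). A space-separated record therefore needs an explicit -d ' '.
Code:

```shell
# cut defaults to tab as its delimiter; a line with no tabs passes through whole.
line='abc.com 20120829001415 127.0.0.1 app none11111'
printf '%s\n' "$line" | cut -f 2,5         # no tabs: the whole line comes back
printf '%s\n' "$line" | cut -d ' ' -f 2,5  # prints: 20120829001415 none11111
```

With -d ' ' the field numbers then line up with the awk-style column numbers used in this thread.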

In message #8 in this thread, I asked nine questions. You partially answered some of the questions although, as noted above, the answer doesn't match the other statements you've made.

I want to help you solve this problem, but if you won't answer the questions (and give answers that match your data), it is obvious that I'm wasting my time. If you would like us to try to give you a working solution, please answer ALL of these questions:
  1. What are the actual commands you execute to convert your log files into the CSV file that you want processed?
  2. Does abc-2012-10-01_000* match the names of all of the log files (and only those log files) that you want to process?
  3. When you find none11111 in your CSV file, will the id=xxx field ever match more than one line (not containing none11111) in your log files that aren't exact duplicates of other lines?
  4. Am I correct in assuming that the line matching the id=xxx field, with the value needed to replace none11111 in your CSV file, will not be on a line that was selected by a grep on the linkId field you're processing?
  5. Is the field separator you want in your output file "," or ", "?
  6. Does the order of lines in your output file matter?
  7. What is the purpose of having an additional single-line output file containing a comma separated list of all of your log files? If you need a file containing a list of the log files processed, wouldn't it be better to have the filenames on separate lines instead of separated by commas on a single line?
  8. Will the linkId=zzz field ever appear in any log file that isn't exactly of the same form as the following example line from one of your log files?
    Code:
    abc.com 20120829001415 127.0.0.1 app none11111 sas 0 0 N clk Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 referrer=google.com linkId=1ddoic

# 11  
Old 09-10-2012
Hi Don,

Thank you for your cooperation. Here I am trying to list the answers to your questions.
Quote:
What are the actual commands you execute to convert your log files into the CSV file that you want processed?
Code:
grep -e linkId=1ddoic abc-2012-10-01_000* | cut -f 2,5,14 | awk '{$1=$1;print}' OFS=, > /tmp.output.xls

Quote:
Does abc-2012-10-01_000* match the names of all of the log files (and only those log files) that you want to process?
Yes.
Quote:
When you find none11111 in your CSV file, will the id=xxx field ever match more than one line (not containing none11111) in your log files that aren't exact duplicates of other lines?
Yes, it matches more than one line.
Quote:
Am I correct in assuming that the line matching the id=xxx field, with the value needed to replace none11111 in your CSV file, will not be on a line that was selected by a grep on the linkId field you're processing?
Yes, of course. It will never be on the same line.
Quote:
Is the field separator you want in your output file "," or ", "?
Only comma. No space.
Quote:
Does the order of lines in your output file matter?
Yes, it matters. It has to be in sorted order of timestamp.
Quote:
What is the purpose of having an additional single-line output file containing a comma separated list of all of your log files? If you need a file containing a list of the log files processed, wouldn't it be better to have the filenames on separate lines instead of separated by commas on a single line?
This is not an additional file. This is the output file that I use as input to generate the final output. After I create the output file (after replacing the invalid 'none11111' field), I read that file, make some database calls on top of those values, and then create a report.
Quote:
Will the linkId=zzz field ever appear in any log file that isn't exactly of the same form as the following example line from one of your log files?
Yes, it will appear. To resolve that problem, we have to run the second grep like the one below.
Code:
grep id=82c6a15ca06b2372c3b3ec2133fc8b14 abc-2012-10-01_000* | grep 'page|clk'

The purpose of running the above grep is that the id
Code:
82c6a15ca06b2372c3b3ec2133fc8b14

can appear in two different events, either 'page' or 'clk'. We can take the 5th column from either such log. Also, this log will be found in the same file where 'none11111' was found.
Suppose for linkId=1ddoic we found one invalid 'none11111' value in the 5th column of the log file abc-2012-10-01_00002; then the corresponding id, with the proper 5th column, should be found in the abc-2012-10-01_00002 file only.
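One aside on that second pipeline (an observation about the command, not about the requirement): plain grep treats 'page|clk' as the literal string page|clk, so the second stage as written would match nothing in these logs. Alternation needs grep -E, for example:
Code:

```shell
# Demonstration data; the event names are from the sample logs above.
printf 'page\nclk\nother\n' > /tmp/events.txt
grep -c 'page|clk' /tmp/events.txt || true   # literal pattern: counts 0 lines
grep -cE 'page|clk' /tmp/events.txt          # extended regexp: counts 2 lines
```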

Thank you a lot Don for looking into it.
# 12  
Old 09-10-2012
The command lines you have shown with the two sample lines you have shown from your log files don't come close to providing the data that you say they will. I also note that your last post (message #11 in this thread) is the first time you mention anything about log file field #10 being used to determine the final report.

I have tried to interpret your requirements and come up with a script that should come close to what you have said you need. Given that you have only let us see one complete log file line and one abbreviated log file line, I have low confidence that this will actually do what you want, but I believe it meets the requirements you've been willing to share.

To try it out, save the following script in a file named match2:
Code:
#!/bin/ksh
# match2 -- Produce report from log files
#
# Usage: match2 keyId_value output_file log_file...

Usage="Usage: %s keyId_value output_file log_file...
    Output records are created for every unique input record with field 14
    that is \"linkId=keyId_value\".  The output records contain fields 2,
    5, and 14 from the selected input records.  If input field 5 is
    \"none11111\", \"none11111\" will be replaced by the contents of field 5
    in any other record in the log files that has the same value in
    field 12 as field 12 in the selected input record that does not have
    \"none11111\" in field 5.  Output records will be sorted by performing
    a numeric sort on the first output field.  The sorted output will be
    written to the file named by \"output_file\" (or to standard output
    if \"output_file\" is \"-\".\n"

base="$(basename "$0")"
if [ $# -lt 3 ]
then    printf "$Usage" "$base" >&2
        exit 1
fi
matchId_value="$1"
if [ X"$2" = "X-" ]
then    cmd="sort -n -u"
else    cmd="sort -n -u -o \"$2\""
fi
# Shift away the keyId_value and output_file operands that have already
# been saved.  This leaves just the log_file operands in "$@".
shift 2
awk -v base="$base" -v cmd="$cmd" -v link="linkId=$matchId_value" 'BEGIN {
        # Indicates exit code to use if not 0
        ec = ""
        # Set output field separator to a comma.
        OFS = ","
        # Value in field 5 indicating that the correct value is unknown.
        unknown = "none11111"
}
$14 == link {
        # Gather data for an output record for this keyId...
        # o1[x] and o2[x] are output fields 1 and 2; output field 3 is
        # a constant (link), so it does not need to be preserved from
        # each line.  nm[x] contains the "id=" fields needed to find a
        # matching record for records that did not have a valid field 5
        # when the log entry was created...
        o1[NR] = $2
        if($5 == unknown) nm[NR] = $12
        else    o2[NR] = $5
        next
}
$5 != unknown  && ( $10 == "page" || $10 == "clk" ) {
        # Save a field 5 value for the id specified by field 12...
        id[$12] = $5
}
END {   # Fill in the missing o2[x] output fields...
        for (i in nm) if((o2[i] = id[nm[i]]) == "") {
                # Set o2[x] to the unknown value if no matching field
                # was found, and set the final exit code to indicate
                # that at least one entry had no match.
                o2[i] = unknown
                printf("%s: No valid field 5 found for %s\n", base, nm[i])
                ec = 2
        }
        # Write and sort the completed output records.
        for (i in o1) {
                print o1[i],o2[i],link | cmd
        }
        exit ec
}' "$@" >&2
exit

Make it executable by running the command:
Code:
chmod +x match2

and invoke it with:
Code:
match2 1ddoic output_file abc-2012-10-01_000*

to produce a report, in the file named output_file, containing the log file entries found for linkId=1ddoic in the log files you specified, sorted by timestamp.

Although this script specifies ksh, it should also work with sh and bash. (It won't work with csh or tcsh.)
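One detail of the script worth noting: the print ... | cmd form in the END block is awk's pipe-to-command output, which is how the records reach sort without a separate shell pipeline. A minimal, self-contained illustration (the values here are made up):
Code:

```shell
awk 'BEGIN {
        cmd = "sort -n"
        print "20120829001416,b" | cmd   # records go to the pipe, not stdout
        print "20120829001415,a" | cmd
        close(cmd)                       # close flushes the pipe and runs sort
}'
```

Each distinct command string gets one pipe; closing it (or reaching the end of the program) is what finally runs the sort over all the buffered records.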

I hope this helps.