Keep only the closest match of timestamped row (including headers) from file1 to precede file2 row/s

08-15-2016

Registered User

29, 0

Join Date: Jul 2016

Last Activity: 14 July 2017, 9:57 AM EDT

Posts: 29

Thanks Given: 6

Thanked 0 Times in 0 Posts

Quote:

Originally Posted by RudiC

Would this come close to what you want (may need some polishing):

Code:

awk '
NR == 1         {getline HD1 < F1
                 HD2 = $0
                 next
                }

$1 >= T[1]      {do     {LAST = TMP
                         ST = getline TMP < F1
                         split (TMP, T, FS)
                        }
                 while (($1 >= T[1]) && (ST == 1))
                 if (ST == 0)   {LAST = TMP
                                 T[1] = "ZZZ"
                                }
                 print HD1
                 print LAST
                 print HD2
                 print
                 next
                }
                {print 
                }

' FS="," F1=file1 file2
TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:03:10,0,0,1,572,3,0,1917,20550,57790339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:12,0,0,1,544,3,0,985,20550,57788894
2014/04/07 16:03:13,0,0,1,544,3,0,985,20550,57793894
2014/04/07 16:03:14,0,0,1,544,3,0,985,20550,57794894
TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894
2014/04/07 16:03:17,0,0,1,544,3,0,985,20550,57797894

RudiC, this seems to work on my "real" files in different scenarios (i.e., different file1 and file2 sizes, header sizes, header names, etc.).

I will do some more testing since my real files are very large and I have to make sure all data is intact, but so far so good!! Maybe I'm being too optimistic at the moment.

Quote:

Originally Posted by aachave1

I will test some more and get back to this forum with results soon.

Thanks to all of you (Stomp, RudiC, and Don Cragun) for your time on this!!

Oops, I found that it didn't sort completely accurately on my "real" files (small snippet below): the times in column 1 appeared equal (18:45:22), but they were actually different in the msec column 19 (highlighted in red). So the first file2 row should have been placed with the previous file1 timestamp, since it is less than the file1 time according to the msec value.

I guess I need to find a way to sort on column 19 so that it is accurate down to milliseconds.

Code:

Output from RudiC code where 67522104 is less than 67522431 :

TIMEFORMATTED,G_CCSDS_VERSION, G_CCSDS_VERSION(RAW),G_CCSDS_TYPE, G_CCSDS_TYPE(RAW),G_CCSDS_2HDR_FLAG, G_CCSDS_2HDR_FLAG(RAW),G_CCSDS_APID, G_CCSDS_APID(RAW),G_CCSDS_GRP_FLAGS,G_CCSDS_GRP_FLAGS(RAW),G_CCSDS_SEQ_COUNT, G_CCSDS_SEQ_COUNT(RAW),G_CCSDS_PKT_LEN,G_CCSDS_PKT_LEN(RAW),G_CCSDS_DOY,G_CCSDS_DOY(RAW),G_CCSDS_MSEC
2014/04/07 18:45:22,0,0,0,0,1,1,572,572,3,3,0,0,1917,1917,20550,20550,67522431
TIMEFORMATTED,CCSDS_VERSION,CCSDS_VERSION(RAW),CCSDS_TYPE,CCSDS_TYPE(RAW),CCSDS_2HDR_FLAG,CCSDS_2HDR_FLAG(RAW),CCSDS_APID,CCSDS_APID(RAW),CCSDS_GRP_FLAGS,CCSDS_GRP_FLAGS(RAW),CCSDS_SEQ_COUNT,CCSDS_SEQ_COUNT(RAW),CCSDS_PKT_LEN,CCSDS_PKT_LEN(RAW),CCSDS_DOY,CCSDS_DOY(RAW),CCSDS_MSEC
2014/04/07 18:45:22,0,0,0,0,1,1,544,544,3,3,0,0,985,985,20550,20550,67522104
2014/04/07 18:45:23,0,0,0,0,1,1,544,544,3,3,0,0,985,985,20550,20550,67523104
2014/04/07 18:45:24,0,0,0,0,1,1,544,544,3,3,0,0,985,985,20550,20550,67524104
2014/04/07 18:45:25,0,0,0,0,1,1,544,544,3,3,0,0,985,985,20550,20550,67525104
2014/04/07 18:45:26,0,0,0,0,1,1,544,544,3,3,0,0,985,985,20550,20550,67526104


Should be like this since 67522104 is greater than 67517432:

TIMEFORMATTED,G_CCSDS_VERSION, G_CCSDS_VERSION(RAW),G_CCSDS_TYPE, G_CCSDS_TYPE(RAW),G_CCSDS_2HDR_FLAG, G_CCSDS_2HDR_FLAG(RAW),G_CCSDS_APID, G_CCSDS_APID(RAW),G_CCSDS_GRP_FLAGS,G_CCSDS_GRP_FLAGS(RAW),G_CCSDS_SEQ_COUNT, G_CCSDS_SEQ_COUNT(RAW),G_CCSDS_PKT_LEN,G_CCSDS_PKT_LEN(RAW),G_CCSDS_DOY,G_CCSDS_DOY(RAW),G_CCSDS_MSEC
2014/04/07 18:45:17,0,0,0,0,1,1,572,572,3,3,0,0,1917,1917,20550,20550,67517432
TIMEFORMATTED,CCSDS_VERSION,CCSDS_VERSION(RAW),CCSDS_TYPE,CCSDS_TYPE(RAW),CCSDS_2HDR_FLAG,CCSDS_2HDR_FLAG(RAW),CCSDS_APID,CCSDS_APID(RAW),CCSDS_GRP_FLAGS,CCSDS_GRP_FLAGS(RAW),CCSDS_SEQ_COUNT,CCSDS_SEQ_COUNT(RAW),CCSDS_PKT_LEN,CCSDS_PKT_LEN(RAW),CCSDS_DOY,CCSDS_DOY(RAW),CCSDS_MSEC
2014/04/07 18:45:21,0,0,0,0,1,1,544,544,3,3,0,0,985,985,20550,20550,67521104
2014/04/07 18:45:22,0,0,0,0,1,1,544,544,3,3,0,0,985,985,20550,20550,67522104
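
The mismatch above can be reproduced on a tiny scale. The sketch below is only an illustration, not the script from this thread: it assumes hypothetical rows trimmed to three fields (time, day of year, msec) and a made-up helper name, closest_by_msec, and it picks the latest file1 row whose msec value does not exceed the file2 row's msec value.

```shell
# Hypothetical helper: $1 is the msec value of a file2 row,
# stdin carries file1 rows trimmed to "time,doy,msec".
closest_by_msec() {
    awk -F, -v target="$1" '
        # Compare the last field numerically and remember the latest
        # row that is not later than the target.
        ($NF + 0) <= (target + 0) { best = $0 }
        END { print best }
    '
}

# Both file1 rows look "close" in field 1, but only the 18:45:17 row
# is at or before the file2 msec value 67522104.
printf '%s\n' \
    '2014/04/07 18:45:17,20550,67517432' \
    '2014/04/07 18:45:22,20550,67522431' |
closest_by_msec 67522104
```

Comparing on field 1 alone would treat the 18:45:22 rows as equal, which is how the first snippet ended up with the later file1 row.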

aachave1


08-15-2016

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts


Ok. Give us a break. Every single one of your examples showed comparisons on year, month, day, hour, minute, and second in field 1. You talked about sorting on field 21 (when showing us sample files that only had 10 fields) and now in the post quoted above you say we have to sort by field 19 (when your sample input files only have 18 fields). If you mean that we should use field 18, please note that that field clearly only contains milliseconds since midnight of the current date. You have chosen to ignore several of my questions in earlier posts. If you refuse to answer the following questions, I probably won't ever respond to any of your threads again:

1. Will you guarantee that all timestamps in both of the files that will ever be processed are on the same calendar date? Or, are there two date fields that have to be processed?
2. If there are two date fields (presumably fields 17 and 18 in the sample files in post #29), are those two date fields always adjacent in the input files?
3. And, repeating a question that has already been asked twice: Will the date field (or fields) used in file1 be the same as the field (or fields) used in file2?
4. Will the milliseconds field in your files be set to the string 3600000 corresponding to exactly 1:00:00am or to the string 03600000 (i.e., are all values leading 0 padded to 8 digits, or are the values just the decimal number of milliseconds since midnight with no leading 0 fill)? (Note that the sort you were using in your examples sorting on field 21 would not work if that field does not have leading 0 fill.)
5. Will you supply the field number(s) as parameters to your script, or are the field headings for the date field(s) in the two files constants that the script is supposed to find when reading the header lines?
6. And, since at least one of the date fields is the last field in all of your sample input files, I will ask again: Are your input files in UNIX text file format or DOS text file format? (This might not matter on your system, but it does matter on the system I'm using to test my code.)
7. If your input files are in DOS text file format, do you want output in DOS format or UNIX format? (DOS, UNIX, and don't care are valid answers to this question.)
8. And, obviously, supply us with the complete contents of your latest sample files (including some with different dates if the data in your real files won't always be for a single date) along with the expected output from those sample inputs.
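
The point about leading 0 fill can be seen with two made-up msec values. This is a small sketch, assuming the C locale and hypothetical function names: a plain (text) sort puts an unpadded 3600000 after 10000000, while a numeric sort or 8-digit padding gives the expected order.

```shell
# Hypothetical values: 1:00:00am (3600000 ms) and a later time (10000000 ms).
sort_plain()   { printf '%s\n' 3600000 10000000  | LC_ALL=C sort; }     # text order
sort_numeric() { printf '%s\n' 3600000 10000000  | LC_ALL=C sort -n; }  # numeric order
sort_padded()  { printf '%s\n' 03600000 10000000 | LC_ALL=C sort; }     # text order, padded

echo "plain:";   sort_plain     # 10000000 comes first, which is wrong chronologically
echo "numeric:"; sort_numeric   # 3600000 comes first
echo "padded:";  sort_padded    # 03600000 comes first
```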

Don Cragun


08-15-2016


Okay, I apologize Don, I am having a hard time getting this across. Hopefully I can answer some of the questions.

1. Will you guarantee that all timestamps in both of the files that will ever be processed are on the same calendar date? Or, are there two date fields that have to be processed? Yes, both files will always be processed with the same calendar date because they are run almost simultaneously. Yes, only because fields 16 through 21 (some fields are duplicated because of the RAW field) are the time (epoch) that our telemetry extractor converts to create the field 1 timestamp. My example stopped at field 18 because my real files have about 600 fields and data points. So technically there are 3 time fields: day of year, time in milliseconds, and time in microseconds. I said field 21 in an earlier post because there were 3 added fields in newer files. These particular files don't have the extra fields, but as long as I choose the correct msec field, my sort works properly.
CCSDS_DOY,CCSDS_DOY(RAW),CCSDS_MSEC
20550,20550,67522104

2. If there are two date fields (presumably fields 17 and 18 in the sample files in post #29), are those two date fields always adjacent in the input files? Yes, all the date fields 16-21 are always adjacent in both files.

3. And, repeating a question that has already been asked twice: Will the date field (or fields) used in file1 be the same as the field (or fields) used in file2? Yes, both files use the same time fields.

4. Will the milliseconds field in your files be set to the string 3600000 corresponding the exactly 1:00:00am or to the string 03600000 (i.e., are all values leading 0 padded to 8 digits, or are the values just the decimal number of milliseconds since midnight with no leading 0 fill)? (Note that the sort you were using in your examples sorting on field 21 would not work if that field does not have leading 0 fill.) It is an 8 digit decimal number.

5. Will you supply the field number(s) as parameters to your script, or are the field headings for the date field(s) in the two files constants that the script is supposed to find when reading the header lines? I only used the field numbers when I sorted on the "msec" field (i.e., sort -t, -k18,18 file1 file2), and that provided an accurate sort.

6. And, since at least one of the date fields is the last field in all of your sample input files, I will ask again: Are your input files in UNIX text file format or DOS text file format? (This might not matter on your system, but it does matter on the system I'm using to test my code.) These files are .csv files processed on a Linux platform.

7. If your input files are in DOS text file format, do you want output in DOS format or UNIX format? (DOS, UNIX, and don't care are valid answers to this question.) Unix format, but they will end up being .csv files after processing (not DOS).

8. And, obviously, supply us with the complete contents of your latest sample files (including some with different dates if the data in your real files won't always be for a single date) along with the expected output from those sample inputs. I will provide more when I get to a PC later or tomorrow. A couple of days ago, I tried sending my "real" files, but this site kept giving me errors when uploading.

aachave1


08-16-2016


Quote:

Originally Posted by aachave1

For question #5: You didn't answer the question. You have now told us that we can't use field #1 and must use a field containing milliseconds. If you won't tell us how you expect your script to work, it is extremely hard to suggest how to write a script that will do what you want. From the sample data you have provided, I know that the names of the fields in your header line vary (sometimes with leading spaces, sometimes without; and different prefixes depending on what file is being processed). And, at least in file2 in post #1 in this thread, at least one of the millisecond fields does not have a value that matches the data in field 1 in that record (not only are the least significant 3 digits ignored, the high order 5 digits do not correspond to the HH:MM:SS part of field #1). Please be sure that the sample data you will be providing does not suffer from this same malady, and please explain how your script is supposed to determine which field contains the milliseconds data in the first input file and how your script is supposed to determine which field contains the milliseconds data in the second input file. Furthermore, in question #4 I said the way you were using sort would not work IF the milliseconds field being sorted was a variable length field. Since it is an 8 digit fixed width field with leading 0 fill, the sort command you were using should work. From the data you had shown us, we had no way to determine whether or not that would be true.
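
If the milliseconds column were to be located by its heading rather than by a fixed number (the thread does not settle which is wanted), one possible sketch is shown below. The function name msec_column and the rule "first heading that ends in CCSDS_MSEC after trimming blanks" are assumptions made for illustration only.

```shell
# Hypothetical helper: reads a header line on stdin and prints the
# 1-based index of the first field whose name ends in "CCSDS_MSEC".
msec_column() {
    awk -F, '{
        for (i = 1; i <= NF; i++) {
            name = $i
            # Trim leading blanks and trailing blanks or a DOS carriage return.
            gsub(/^[ \t]+|[ \t\r]+$/, "", name)
            # "CCSDS_MSEC(RAW)" does not end in CCSDS_MSEC, so it is skipped.
            if (name ~ /CCSDS_MSEC$/) { print i; exit }
        }
    }'
}

# Works with or without the G_ prefix and with leading spaces in the names.
printf '%s\n' 'TIMEFORMATTED, G_CCSDS_DOY,G_CCSDS_DOY(RAW),G_CCSDS_MSEC' | msec_column
```

The printed index could then be passed to awk with -v instead of hard-coding a field number.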

For question #6: Where a file is processed does not determine the format of a file. The format of a file is determined by the process that creates the file and the data used by that process when it is creating that file.

For question #7: A .csv file is a type of text file. Knowing that a file is a .csv file does not determine whether the file is a DOS text file or a UNIX text file. The difference is whether the file contains DOS line separators or UNIX line terminators. That is why I asked you to show us the output from the command:

Code:

tail -n 3 filename | od -bc

for your input files (which would show us the line terminators or separators used in those files), but you ignored that request both times I asked you to provide that information.
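
As a rough illustration of what that dump reveals, the sketch below wraps the same tail and od steps in a hypothetical function, line_format, and reports DOS when a carriage return shows up in the last three lines. The sample files are made up for the demonstration.

```shell
# Hypothetical helper: print "DOS" if the last three lines of file $1
# contain a carriage return (shown by od -c as \r), otherwise "UNIX".
line_format() {
    if tail -n 3 "$1" | od -c | grep -q '\\r'; then
        echo DOS
    else
        echo UNIX
    fi
}

# Two made-up sample files, one per convention.
tmpdir=$(mktemp -d)
printf '20550,67522104\r\n20550,67523104\r\n' > "$tmpdir/dos.csv"
printf '20550,67522104\n20550,67523104\n'     > "$tmpdir/unix.csv"
line_format "$tmpdir/dos.csv"
line_format "$tmpdir/unix.csv"
rm -r "$tmpdir"
```

With DOS line endings the carriage return becomes part of the last field, which matters when that field is the one being compared.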

For question #8: This site doesn't allow .csv files to be uploaded, but it does allow .txt files to be uploaded. If you have a file named something.csv, change its name to something.txt and upload the .txt files.

Don Cragun


08-16-2016


Don,
Again, part of my confusion is me not interpreting your questions properly. Initially (#5), I was not sure what the best way to sort/merge file1 and file2 was. I knew what result I needed (as I showed in output examples throughout), but I was still trying to figure out which time field would be better to sort on, since these files are extracted from a binary file into a readable csv. Apparently the time fields "G_CCSDS_DOY,G_CCSDS_DOY(RAW),G_CCSDS_MSEC,G_CCSDS_MSEC(RAW),G_CCSDS_USEC,G_CCSDS_USEC(RAW)" contain the actual timestamp during processing (they could fall anywhere in fields/columns 16-24); however, the extraction tool converts it into a readable time and places it in field 1 (the TIMEFORMATTED field). It appears now that the field 1 time is not the best way to sort, since it stops at seconds: even though the time may appear equal between the two files, the values really are not equal when looking at milliseconds.

I uploaded my newest file1.txt and file2.txt files along with the files post "tail" command that you requested (file1_tail and file2_tail).

Take a look and see what you think; however, I fully understand if you need to abort this forum topic with me due to the lack of pertinent explanations and data I have provided. You guys have been more than helpful. Thank you!

file1.txt (2.87 MB)

file1_tail.txt (204.2 KB)

file2.txt (338.5 KB)

file2_tail.txt (187.6 KB)

aachave1


08-16-2016


After modifying RudiC's last code, it seems to sort correctly by comparing on field 18 of these older files, which have many file2 rows to one file1 row, and where one of the timestamps appears to match in field 1 yet doesn't when compared on field 18 (the msec field). So it now places the correct file2 row with the correct file1 row. It begins at the "18:45:22" row.

Since the two files will always be run almost simultaneously, the year and day should always match. I will check with the person here at my work who needs these files and see if this seems adequate.

Code:

#!/bin/bash

awk '
NR == 1         {getline HD1 < F1
                 HD2 = $0
                 next
                }

$18 >= T[18]      {do     {LAST = TMP
                         ST = getline TMP < F1
                         split (TMP, T, FS)
                        }
                 while (($18 >= T[18]) && (ST == 1))
                 if (ST == 0)   {LAST = TMP
                                 T[18] = "ZZZ"
                                }
                 print HD1
                 print LAST
                 print HD2
                 print
                 next
                }
                {print 
                }

' FS="," F1=f14_apr_07_12_38_27_gse.csv f14_apr_07_12_38_27.csv > f14_apr_07_12_38_output.csv

I attached these files as well. The first file is considered "file1" and the next is "file2". I attached the output results also, but had to cut the output in half because the file was too large to upload.

f14_apr_07_12_38_27_gse.txt (1.17 MB)

f14_apr_07_12_38_27.txt (760.0 KB)

f14_apr_07_12_38_output.txt (2.85 MB)

aachave1


08-17-2016


Quick question for RudiC or Don Cragun.

What is this actually used for in the previous code?

Code:

 T[18] = "ZZZ"

Thanks!
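
A hedged reading of that line, not confirmed by RudiC or Don Cragun in this thread: the assignment only runs once getline reports that file1 is exhausted (ST == 0). Because "ZZZ" is a string constant, awk then compares $18 against it as text, and an all-digit value sorts before "ZZZ", so the test $18 >= T[18] should stay false and the remaining file2 rows are simply printed. The sketch below uses a hypothetical one-column input and a made-up function name to show that comparison in the C locale.

```shell
# Hypothetical demo: each input line stands in for field 18 of a file2 row.
after_sentinel() {
    LC_ALL=C awk '
        BEGIN { T[18] = "ZZZ" }   # what the script stores once file1 has ended
        {
            # String comparison: digits sort before "Z", so this is false.
            print $1, ($1 >= T[18] ? "reads file1 again" : "printed as-is")
        }
    '
}

printf '%s\n' 67522104 99999999 | after_sentinel
```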

aachave1
