Keep only the closest match of timestamped row (include headers) from file1 to precede file2 row/s

 
# 1  
Old 08-11-2016

My original files are shown below. I distinguish them by the AP_ID (file1 has 572 and file2 has 544). Also, the header of file1 has “G_” prepended to the field names. NOTE: these are only snippets of very large files, and much of the data is not shown here.

Code:
Original File 1:

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:02:55,0,0,1,572,3,0,1917,20550,57775339
2014/04/07 16:03:00,0,0,1,572,3,0,1917,20550,57780339
2014/04/07 16:03:05,0,0,1,572,3,0,1917,20550,57785339
2014/04/07 16:03:10,0,0,1,572,3,0,1917,20550,57790339
2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339

Original File 2:    

TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:12,0,0,1,544,3,0,985,20550,57788894
2014/04/07 16:03:13,0,0,1,544,3,0,985,20550,57793894
2014/04/07 16:03:14,0,0,1,544,3,0,985,20550,57794894
2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894
2014/04/07 16:03:17, 0,0,1,544,3,0,985,20550,57797894

I sorted and merged them with the code below. Note: the “-k21,21” sort key was only used with my very large “real” files (the snippets above have only 10 columns).

Code:
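#!/bin/bash
# Append each file's header to every data row (tab-separated) so the header survives the sort.
function f() { awk 'NR==1{h=$0; next} {print $0 "\t" h}' "$1"; }
# Merge both files on the sort key, then re-print the header whenever it changes.
sort -t"," -k21,21 <(f file1) <(f file2) |
  awk -F'\t' '$2!=p{print $2; p=$2} {print $1}' > temp5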

PROBLEM: For each run of file2 rows I need only one file1 row: the one whose timestamp equals, or is the nearest match to, the file2 timestamps, and it must precede those file2 rows (1 file1 row to 1 or many file2 rows). As you can see in the “example” below, there are 3 file1 rows (after the first header) that are not needed. I only need the file1 row with timestamp “16:03:10”. So basically I only need the last file1 row (AP_ID=572) before each run of file2 rows. The blank line is only there for readability between matched data.

Code:
MY OUTPUT:

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:02:55,0,0,1,572,3,0,1917,20550,57775339
2014/04/07 16:03:00,0,0,1,572,3,0,1917,20550,57780339
2014/04/07 16:03:05,0,0,1,572,3,0,1917,20550,57785339
2014/04/07 16:03:10,0,0,1,572,3,0,1917,20550,57790339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:12,0,0,1,544,3,0,985,20550,57788894
2014/04/07 16:03:13,0,0,1,544,3,0,985,20550,57793894
2014/04/07 16:03:14,0,0,1,544,3,0,985,20550,57794894

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894
2014/04/07 16:03:17, 0,0,1,544,3,0,985,20550,57797894

I then ran the code below to try to resolve this, but it kept only the FIRST file1 row of each group, not the LAST one that I want.
QUESTION: How can I modify this code to keep only the last file1 (AP_ID=572) row?

Code:
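#!/bin/bash
# Same pipeline as before. Assuming the stream starts with a file1 group, the last awk prints
# every row of the file2 groups (even b) but only the first row (c) of each file1 group.
function f() { awk 'NR==1{h=$0; next} {print $0 "\t" h}' "$1"; }
sort -t"," -k21,21 <(f file1) <(f file2) |
  awk -F'\t' '$2!=p{print $2; p=$2; b++; c=1} !(b%2)||c&&c--{print $1}' > temp5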

I hope this isn't too long-winded or confusing. Thank you!!
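
One possible way to keep only the last file1 row, offered as a sketch rather than a tested fix for the real files: tag each data row with its timestamp and a file-order number (1 for file1, 2 for file2), sort on those two keys so that a file1 row with an equal timestamp comes first, and let awk hold back file1 rows until a file2 row arrives. The names tag, keep_last_file1 and the order column are made up for illustration, and the sort assumes the timestamp in column 1 is the merge key (the real files apparently sort on column 21).

```shell
#!/bin/bash
# Sketch only: tag(), keep_last_file1() and the order column are hypothetical names.

# Prefix each data row (header skipped) with "timestamp<TAB>order<TAB>".
tag() { awk -F, -v o="$2" 'NR > 1 { print $1 "\t" o "\t" $0 }' "$1"; }

keep_last_file1() {            # usage: keep_last_file1 file1 file2
  tab=$(printf '\t')
  { tag "$1" 1; tag "$2" 2; } |
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n |
    awk -F'\t' -v h1="$(head -n 1 "$1")" -v h2="$(head -n 1 "$2")" '
      $2 == 1 { held = $3; in2 = 0; next }   # file1 row: remember only the newest one
      !in2 {                                 # first file2 row of a new run
        if (n++) print ""                    # blank line between groups, for readability
        if (held != "") { print h1; print held; held = "" }
        print h2; in2 = 1
      }
      { print $3 }                           # every file2 row is kept
    '
}
```

Called as keep_last_file1 file1 file2 > temp5, this also drops file1 rows that are not followed by any file2 row, including trailing ones.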

Last edited by aachave1; 08-11-2016 at 02:43 PM..
# 2  
Old 08-11-2016
Quote:
PROBLEM: For each run of file2 rows I need only one file1 row: the one whose timestamp equals, or is the nearest match to, the file2 timestamps, and it must precede those file2 rows (1 file1 row to 1 or many file2 rows). As you can see in the “example” below, there are 3 file1 rows (after the first header) that are not needed. I only need the file1 row with timestamp “16:03:10”. So basically I only need the last file1 row (AP_ID=572) before each run of file2 rows. The blank line is only there for readability between matched data.
Sorry, but I do not understand what you are writing here at all. Can you explain in detail, sentence by sentence and with an example based on the data you showed, what you are trying to achieve?
# 3  
Old 08-11-2016
I'll try to simplify, but I'm not sure if I can.

I have a file1 and a file2 that contain timestamped rows of data and headers. I need to use the data from file1 as metadata that precedes the file2 data when the two are sorted/merged (including headers). I can only use the file1 rows whose timestamp is equal to, or the nearest match to, the file2 rows (example below).

My problem is that I need only ONE file1 row: the one whose timestamp matches, or is nearest to, the file2 timestamps. There can be many file2 rows, but only one file1 row is assigned to them (because it is the closest match timestamp-wise).

The rows in RED are the only rows I need displayed (that is, everything except the file1 rows stamped 16:02:55, 16:03:00 and 16:03:05), because the file1 metadata row timestamp is the nearest match to the file2 rows. All other non-red file1 rows must be deleted.

Code:
MY OUTPUT:

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:02:55,0,0,1,572,3,0,1917,20550,57775339
2014/04/07 16:03:00,0,0,1,572,3,0,1917,20550,57780339
2014/04/07 16:03:05,0,0,1,572,3,0,1917,20550,57785339
2014/04/07 16:03:10,0,0,1,572,3,0,1917,20550,57790339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:12,0,0,1,544,3,0,985,20550,57788894
2014/04/07 16:03:13,0,0,1,544,3,0,985,20550,57793894
2014/04/07 16:03:14,0,0,1,544,3,0,985,20550,57794894

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894
2014/04/07 16:03:17, 0,0,1,544,3,0,985,20550,57797894

These are very large files. Sometimes ONE file1 timestamp will match, or be the nearest match to, ONE file2 timestamp, and sometimes ONE file1 timestamp will match, or be the nearest match to, MANY file2 timestamps. It all depends on the rate at which each file was run; as you can see, file1 timestamps are 5 seconds apart, while file2 timestamps are 1 second apart.
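
Since the files are very large, a single streaming pass may be an alternative to sorting. The sketch below assumes both files are already in ascending timestamp order and that "nearest match" means the latest file1 row at or before each file2 row; merge_preceding and its variable names are hypothetical. It reads file2 as the main input and pulls file1 rows with getline only as far as needed, so neither file is held in memory.

```shell
#!/bin/bash
# Sketch only: merge_preceding is a hypothetical name; both inputs must already be time-ordered.
merge_preceding() {            # usage: merge_preceding file1 file2
  awk -F, -v f1="$1" '
    function advance() {       # load the next unread file1 row into peek, its timestamp into pts
      if ((getline peek < f1) > 0) pts = substr(peek, 1, index(peek, ",") - 1)
      else eof = 1
    }
    NR == 1 { h2 = $0; getline h1 < f1; advance(); next }   # remember both headers
    {
      fresh = 0
      # consume file1 rows up to and including the current file2 timestamp
      while (!eof && pts <= $1) { cur = peek; fresh = 1; advance() }
      if (fresh || NR == 2) {  # a newer file1 row applies, or this is the first file2 row
        if (NR > 2) print ""   # blank line between groups, for readability
        if (fresh) { print h1; print cur }
        print h2
      }
      print                    # every file2 row is kept
    }
  ' "$2"
}
```

With a 5-second file1 and a 1-second file2, each file1 row that gets printed is followed by all the file2 rows up to the next file1 timestamp; file1 rows with no file2 rows after them are skipped.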

Last edited by aachave1; 08-11-2016 at 05:03 PM..
# 4  
Old 08-11-2016
Code:
0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
1 2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339 
2 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC 
3 2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894 
4 2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894 
5 2014/04/07 16:03:17, 0,0,1,544,3,0,985,20550,57797894

To say what I understood:
  • The green part (lines 0 and 1) is from File1 and the blue part (lines 2 to 5) is from File2.
  • Line 1 is the line from File1 that is needed for File2.

Correct?

Questions
  • Is line 1 needed for the complete File2, or only for line 3 of File2?
  • Are there other lines from File1, closer in time, that are needed for the blue lines 4 and 5 of File2?
# 5  
Old 08-11-2016
"To say what I understood:"
"The green part is from File1 and the blue part is from File2." Yes, but the headers also differ: the file1 header (also green) has "G_" prepended to the field names, while the file2 header (also blue) does not.
"Line 1 is the needed Line from File1 for File2" Yes, since it is the closest timestamp match to lines 3, 4, and 5.

"Questions"
"Is Line 1 needed for the complete File2 or only for Line3 of File2?" It is needed for the complete file2, since line 1 matches or is the nearest timestamp to lines 3, 4, and 5.
"Is/Are there another Line(s) from File1 needed for the blue line 4 and 5 of File2 which is closest to the time of the line?" No, just the one file1 line 1 (plus its header), since it is the nearest timestamp match to lines 4 and 5 (as in the example).

Sorry for the confusion. It is difficult to explain this.
# 6  
Old 08-11-2016
Lines 3, 4 and 5 of File2 have different timestamps. What if different File1 timestamps are nearest to different File2 lines?

Example

Code:
0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
1 2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339 
2 2014/04/07 16:03:16,0,0,1,572,3,0,1917,20550,57795339 
3 2014/04/07 16:03:17,0,0,1,572,3,0,1917,20550,57795339  
4 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC 
5 2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894  
6 2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894  
7 2014/04/07 16:03:17,0,0,1,544,3,0,985,20550,57797894

Here the timestamp of line 1 (File1) exactly matches line 5 (File2).
Line 2 (File1) matches line 6 (File2).
Line 3 (File1) matches line 7 (File2).

Which one is the correct line to choose from File1 for File2?

Quote:
Sorry for the confusion. It is difficult to explain this.
No problem. It's a lot of fun puzzling out a solution :)
# 7  
Old 08-11-2016
Is this a question about a specific scenario? My examples are different.

But if this were real output from my file1 and file2, it would look like the listing below after a proper sort/merge.

Code:
0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
1 2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339
4 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC 
5 2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894

0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
2 2014/04/07 16:03:16,0,0,1,572,3,0,1917,20550,57795339 
4 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC   
6 2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894

0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
3 2014/04/07 16:03:17,0,0,1,572,3,0,1917,20550,57795339
4 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC   
7 2014/04/07 16:03:17,0,0,1,544,3,0,985,20550,57797894
