Text file parsing and comparison

09-12-2018

Registered User

22, 1

Join Date: Apr 2010

Last Activity: 17 March 2020, 12:04 AM EDT

Posts: 22

Thanks Given: 19

Thanked 1 Time in 1 Post

Text file parsing and comparison

I have two files (first.txt and second.txt):

Code:

more first.txt 

        cat mammal

        lizard reptile

        Elephant mammal

        ant Insecta

Code:

more second.txt 

        ant     termite

        ant     army_ant

        human   man

        human   woman

I want to make a third file that takes the relevant entry in the second column of the first file and paste it as a third column in the second file when the first column in first and second file matches.

Code:

more third.txt 
        ant     termite insecta
        ant     army_ant        insecta
        human   man     mammal
        human   woman   mammal

Note: The third file that I include here to illuminate my point shouldn't have human man mammal or human woman mammal . I was wrong. Check the entire thread for explanation.

Last edited by cs_novice; 09-17-2018 at 01:44 PM..

cs_novice

View Public Profile for cs_novice

Find all posts by cs_novice

09-12-2018

Moderator

8,825, 1,112

Join Date: Feb 2005

Last Activity: 23 August 2021, 11:26 AM EDT

Location: Foxborough, MA

Posts: 8,825

Thanks Given: 579

Thanked 1,112 Times in 1,003 Posts

What have you tried so far and where exactly are you stuck?
And how did you arrive at human man mammal in your desired output from the sample inputs you gave?

This User Gave Thanks to vgersh99 For This Post:

vgersh99

View Public Profile for vgersh99

Find all posts by vgersh99

09-12-2018

Registered User

22, 1

Join Date: Apr 2010

Last Activity: 17 March 2020, 12:04 AM EDT

Posts: 22

Thanks Given: 19

Thanked 1 Time in 1 Post

To clarify, the third.txt file is what I expect as output and I just put that out for illustration purposes. Reg my attempts: I tried capturing the the data in the first file data in an awk associative array, but then I am unable to think of the right conditional statement to print the corresponding entries from the first file as the third column in the second file.

cs_novice

View Public Profile for cs_novice

Find all posts by cs_novice

09-12-2018

Moderator

8,825, 1,112

Join Date: Feb 2005

Last Activity: 23 August 2021, 11:26 AM EDT

Location: Foxborough, MA

Posts: 8,825

Thanks Given: 579

Thanked 1,112 Times in 1,003 Posts

Quote:

Originally Posted by cs_novice

it might help to see the attempt so we could help to straighten it out.
It's a very doable covered many time in these fora req.

Last edited by Scott; 09-12-2018 at 05:44 PM.. Reason: Edited

vgersh99

View Public Profile for vgersh99

Find all posts by vgersh99

09-12-2018

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

First, my suggestion is to re-read the forum rules: Homework/coursework questions are to be posted in a special section of the forum where special rules apply.

Second, i suggest to type

Code:

man join

into the next available terminal and read what stands there. This thread is

Moderator's Comments:

-closed-

bakunin

View Public Profile for bakunin

Find all posts by bakunin

09-13-2018

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

Quote:

Originally Posted by bakunin

First, my suggestion is to re-read the forum rules: Homework/coursework questions are to be posted in a special section of the forum where special rules apply.

After checking back with the thread owner i learned that i was wrong and this is not homework. My apologies. This thread is

Moderator's Comments:

- reopened -

bakunin

View Public Profile for bakunin

Find all posts by bakunin

09-13-2018

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

As i suggested join i will explain how to use it. Some work will be left over for the reader and effort on thread-owners part to solve the problem will be appreciated. The following is a loose translation from the german Wikipedia article which i also wrote:

join is used to annex information of several (usually two) input data streams (files or pipelines) and output the result. The input should be in some sort of record-format: a table-like structure in which records - separated by newline characters - exist which themselves consist of fields separated by field separators.

Example:

Code:

      field separators (here tabs)
            |
    fields  +---------+------+
      |     |         |      |
      |     |         |      |
      |-----|-----+---|-+----|----+
      |     |     |   | |    |    |
      V     V     V   V V    V    V
     Peter      Smith   38      50.000      <--- record
     Paul       Miller  40      55.000      <--- record
     Mary       Myers   32      60.000      <--- record

We see a table of persons with some characteristics: surname, family name, age, income. Each person is described in its record and each record consists of several fields, each denoting one such characteristic. Note that we could have put captions as table headers but these would NOT be part of the table.

join now creates an relation between two (or more) such tables. If a record in one table relates to several records in the other table it will be copied as often as necessary. Here is an example:

Code:

A:               B:            result:
     f1 a           f1 X               f1 a X
     f1 b                              f1 b X
     f1 c                              f1 c X

Let us put all together: Suppose we have a file ("tel") with people and their telephone numbers:

Code:

>Name	Telephone
Anna	123456-123
Karl	123456-456
Sandra	123457-789

And we have another file ("fax") with people and their Fax-number:

Code:

>Name	Fax
Anna	345678-997
Leo	345679-998
Sandra	345678-999

Notice that both files are tab-separated again, so that between fields there is always a single tabulator-character. The first try

Code:

$ join tel fax

>Name Telephone Fax
Anna 123456-123 345678-997
Sandra 123457-789 345678-999

would per default join over the first fields (the names) and only output the values available in both files. Database people would call that an inner join. Notice also that we have entered captions as a pseudo-record.

But this messy output is perhaps not what we want. join per default uses any whitespace as field separator but it can specifically be told (the -t to use a certain character. This character will in turn also be used in the output.

In addition we can specify a certain order of output fields (-o) if we don't want all of them to appear. The resulting output looks a lot better now:

Code:

$ join -t'<tab>' -o 0,1.2,2.2 tel fax

>Name	Telephone	Fax
Anna	123456-123	345678-997
Sandra	123457-789	345678-999

Furthermore we can change the default inner join to an outer join (include records not available in both files, -a) and we can assign a standard filler text for the missing information (-e):

Code:

$ join -t'<tab>' -a 1 -a 2 -e '(none)' -o 0,1.2,2.2 tel fax

>Name	Telephone	Fax
Anna	123456-123	345678-997
Karl	123456-456	(none)
Leo	(none)		345679-998
Sandra	123457-789	345678-999

At last we can also invert the joining so that only records appear in the output which are NOT present in every file - a list of people having either no phone or no Fax:

Code:

$ join -t'<tab>' -v 1 -v 2 -o 0 tel fax

Karl
Leo

A tip at last: all the input files to join have to be sorted already. In this case "sorted" means: sorted for the fields which will used to join the information. Otherwise some or maybe all records will be mysteriously missing from the output. In my example this was silently done before (this is the reason why i used ">" to mark the captions - it sorts before any character so that the header comes out on top).

The implementation of this is now left to the interested reader who is, by now, surely eager to try his newfound powers on his data. Be sure to post your results so that others can learn from your achievements as well as your mistakes.

I hope this helps.

bakunin

These 2 Users Gave Thanks to bakunin For This Post:

bakunin

View Public Profile for bakunin

Find all posts by bakunin

UNIX for Beginners Questions & Answers

Text file parsing and comparison

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Parsing text file

Discussion started by: jacobs.smith

2. Shell Programming and Scripting

Parsing text file

Discussion started by: comp8765

3. Shell Programming and Scripting

text file comparison

Discussion started by: Sub.kalps

4. Programming

Parsing a Text file using C++

Discussion started by: agupta2

5. Shell Programming and Scripting

Need help parsing a text file

Discussion started by: streetfighter2

6. Shell Programming and Scripting

Log file text parsing

Discussion started by: dizydolly

7. Shell Programming and Scripting

Complicated(?) text file comparison

Discussion started by: Daniel M. Clark

8. Shell Programming and Scripting

Parsing text from file

Discussion started by: ndnkyd

9. Shell Programming and Scripting

Text File Parsing

Discussion started by: Djlethal

10. HP-UX

XML parsing performace comparison with windows using sax

Discussion started by: saurabh.sid