Reducing text file using similar lines

03-17-2013

Registered User

64, 1

Join Date: Jul 2011

Last Activity: 12 September 2013, 3:40 PM EDT

Posts: 64

Thanks Given: 12

Thanked 1 Time in 1 Post

Reducing text file using similar lines

Hello,

I am a java programmer but want to try unix for a purpose where I need to reduce a file using its first field.. Here is the sample data:

HTML Code:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;0;4;[];[shekhar, raj]
cook;2;0;[lew];[]
far;0;3;[];[venk]
far;1;5;[shekhar];[venk, raj]

I am explaining the dataset first. There are five fields separated by ";". First field is the main id, second and third are numerics, and 4th and 5th fields are list.

I need to combine all lines in the file where the first field matches. By Combination means, The 2nd and 3rd fields should be added and 4th anf 5th fields (lists) should be combined uniquely.

So, the desired output should be:

HTML Code:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;2;4;[lew];[shekhar, raj]
far;1;8;[shekhar];[venk, raj]

I could have done this java but want to use the power of unix to get it done fast, since I have tonnes of such very large files.

Thanks a lot.
Shekhar

---------- Post updated at 10:23 PM ---------- Previous update was at 07:16 PM ----------

Please help.

Last edited by shekhar2010us; 03-17-2013 at 08:19 PM.. Reason: I provided the actual email-ids.

shekhar2010us

View Public Profile for shekhar2010us

Find all posts by shekhar2010us

03-17-2013

Moderator

3,689, 1,352

Join Date: Jan 2012

Last Activity: 22 August 2020, 11:29 PM EDT

Location: Galactic Empire

Posts: 3,689

Thanks Given: 268

Thanked 1,352 Times in 1,258 Posts

Please do not bump up questions if they are not answered promptly.

Here is an awk program based on some assumptions:

Code:

awk -F';' ' {
                if ( F1[i] == $1 && i > 1 )
                {
                        F2[i] += $2
                        F3[i] += $3
                        gsub (/\[|\]/, X, $4)
                        gsub (/\[|\]/, X, $5)
                        if ( F4[i] == "" )
                                F4[i] = $4
                        else if ( $4 != "" )
                        {
                                F4[i] = F4[i] "," $4
                                n = split (F4[i], V, ",")
                                F4[i] = ""
                                for ( k = 1; k <= n; k++ )
                                {
                                        if ( k > 1 && V[k] !~ V[k-1] )
                                                F4[i] = F4[i] "," V[k]
                                        else if ( k == 1 )
                                                F4[i] = V[k]
                                }
                        }
                        if ( F5[i] == "" )
                                F5[i] = $5
                        else if ( $5 != "" )
                        {
                                F5[i] = F5[i] "," $5
                                n = split (F5[i], V, ",")
                                F5[i] = ""
                                for ( k = 1; k <= n; k++ )
                                {
                                        if ( k > 1 && V[k] !~ V[k-1] )
                                                F5[i] = F5[i] "," V[k]
                                        else if ( k == 1 )
                                                F5[i] = V[k]
                                }
                        }
                }
                else
                {
                        ++i
                        F1[i] = $1
                        F2[i] = $2
                        F3[i] = $3
                        gsub (/\[|\]/, X, $4)
                        gsub (/\[|\]/, X, $5)
                        F4[i] = $4
                        F5[i] = $5
                }
} END {
                for ( j = 1; j <= i; j++ )
                        print F1[j], F2[j], F3[j], "[" F4[j] "]", "[" F5[j] "]"
} ' OFS=';' file

This User Gave Thanks to Yoda For This Post:

Yoda

View Public Profile for Yoda

Visit Yoda's homepage!

Find all posts by Yoda

03-18-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

For the given input, Yoda's script produces the desired output. But, when combining F4 and F5 with the current line's $4 and $5, respectively, it will produce duplicate entries in the output list unless the newly added entries are adjacent to the entry with the same value in the previous line. And, the space after a comma is treated as part of a name. The specification given isn't clear if this is intended, but it seemed that the separator in the lists in the sample input was a comma followed by a space rather than just a comma.

As an example, if the last line in the input file is changed from:

Code:

far;1;5;[shekhar];[venk, raj]

to:

Code:

far;1;5;[shekhar];[raj, venk]

the last line of the output will be:

Code:

far;1;8;[shekhar];[venk,raj, venk]

instead of:

Code:

far;1;8;[shekhar];[raj, venk]

And, if the following lines are in the input file:

Code:

plus;1;2;[u1, u2];[g1,g2]
plus;1;2;[u2];[g1]

it produces:

Code:

plus;2;4;[u1, u2,u2];[g1,g2,g1]

while I would have thought the desired output was:

Code:

plus;2;4;[u1, u2];[g1, g2]

Yoda's code also assumes that all lines that need to be combined will be adjacent in the input file. That is true in the sample input, but the specification doesn't specify that this will be true.

Here is an alternative awk script that you may want to consider:

Code:

awk '
function combine(ins, LOCAL, a, i, j, n, os) {
        n = split(ins, a, /, */)
        os = a[1]
        for(i = 2; i <= n; i++) {
                for(j = 1; j < i; j++)
                        if(a[i] == a[j]) break
                if(j >= i) os = os ", " a[j]
        }
        return os
} 
BEGIN { FS = OFS = ";" }
{       if($1 in order) i = order[$1]
        else            F1[i = order[$1] = ++oc] = $1
        F2[i] += $2
        F3[i] += $3
        gsub (/[][]/, "", $4) 
        if(F4[i] == "")         F4[i] = $4
        else if($4 != "")       F4[i] = combine(F4[i] "," $4)
        gsub (/[][]/, "", $5) 
        if(F5[i] == "")         F5[i] = $5
        else if($5 != "")       F5[i] = combine(F5[i] "," $5)
}
END {   for(i = 1; i <= oc; i++)
                print F1[i], F2[i], F3[i], "[" F4[i] "]", "[" F5[i] "]"
}' file

With the input file:

Code:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;0;4;[];[shekhar, raj]
cook;2;0;[lew];[]
far;0;3;[];[venk]
far;1;5;[shekhar];[raj, venk]
plus;1;2;[u1, u2];[g1,g2]
plus1;1;1;[u1];[]
plus2;0;3;[u2];[raj, venk, g3]
plus2;1;5;[shekhar];[venk, raj]
plus1;1;1;[u2];[g1, g2, g3]
plus1;1;1;[u1];[g2, g4]
plus2;0;3;[u1];[g1, g2, g3]
plus;1;2;[u2];[g1]

this script produces:

Code:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;2;4;[lew];[shekhar, raj]
far;1;8;[shekhar];[venk, raj]
plus;2;4;[u1, u2];[g1, g2]
plus1;3;3;[u1, u2];[g1, g2, g3, g4]
plus2;1;11;[u2, shekhar, u1];[raj, venk, g3, g1, g2]

With this same input, Yoda's script produces:

Code:

admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;2;4;[lew];[shekhar, raj]
far;1;8;[shekhar];[venk,raj, venk]
plus;1;2;[u1, u2];[g1,g2]
plus1;1;1;[u1];[]
plus2;1;8;[u2,shekhar];[raj, venk, g3,venk, raj]
plus1;2;2;[u2,u1];[g1, g2, g3,g2, g4]
plus2;0;3;[u1];[g1, g2, g3]
plus;1;2;[u2];[g1]

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

03-18-2013

Registered User

64, 1

Join Date: Jul 2011

Last Activity: 12 September 2013, 3:40 PM EDT

Posts: 64

Thanks Given: 12

Thanked 1 Time in 1 Post

Many Thanks Yoda and Don.
Don, you are correct.
1) The input might not be same only in adjacent lines
2) The list is comma separated and when I checked with some tweaks in the inputs , it produces duplicates in the output list.. Which I mentioned in the first post that it should add uniquely. So, I dont need duplicates in the list.

Many thanks once again.

shekhar2010us

View Public Profile for shekhar2010us

Find all posts by shekhar2010us

03-18-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by shekhar2010us

Note that if you want the lists combined by my script to use just a comma to separate entries in fields 4 and 5 instead of a comma followed by a space, change the following line:

Code:

                if(j >= i) os = os ", " a[j]

to:

Code:

                if(j >= i) os = os "," a[j]

Note, however, that the script won't change the lists unless two non-empty lists are found for the same field for the same value in the first field. It could be made to normalize all lists found, but it would run slower.

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

03-18-2013

Registered User

3,231, 978

Join Date: Dec 2009

Last Activity: 11 June 2014, 8:40 PM EDT

Posts: 3,231

Thanks Given: 179

Thanked 978 Times in 791 Posts

Finally, a high performance contender has arrived. Step back and eat my dust.

Code:

#!/bin/sh

while IFS=\; read -r f1 f2 f3 f4 f5; do
    dir=temp/$f1
    if ! [ -d "$dir" ]; then
        mkdir "$dir"
        touch "$dir/2" "$dir/3" "$dir/4" "$dir/5"
    fi
    printf '%s\n' "$f2" >> "$dir/2"
    printf '%s\n' "$f3" >> "$dir/3"
    printf '%s\n' "$f4" | tr '[], ' '[\n*]' | sed '/./!d' >> "$dir/4"
    printf '%s\n' "$f5" | tr '[], ' '[\n*]' | sed '/./!d' >> "$dir/5"
done

ls -ct temp |
while IFS= read -r dir; do
(
    printf '%s\n' "$dir"
    dir=temp/$dir
    paste -sd+ "$dir/2" | bc
    paste -sd+ "$dir/3" | bc
    sort -u "$dir/4" | paste -sd, - | sed 's/,/, /g; s/.*/[&]/'
    sort -u "$dir/5" | paste -sd, - | sed 's/,/, /g; s/.*/[&]/'
) | paste -sd\; -
done

Obviously, that was created purely for amusement. It is not seriously recommended over an AWK (or perl, or ...) solution.

It assumes an empty directory named "temp" in the current working directory and reads data via stdin.

Regards,
Alister

alister

View Public Profile for alister

Find all posts by alister

Shell Programming and Scripting

Reducing text file using similar lines

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Match text to lines in a file, iterate backwards until text or text substring matches, print to file

Discussion started by: shogun1970

2. Shell Programming and Scripting

removing lines with similar values from file

Discussion started by: krecik28

3. Shell Programming and Scripting

extracting lines from a file with similar first name

Discussion started by: vivek d r

4. UNIX for Dummies Questions & Answers

Matching and reporting near-similar lines in a file

Discussion started by: Fahmida

5. Shell Programming and Scripting

Reducing file lines in awk

Discussion started by: vasanth.vadalur

6. UNIX for Dummies Questions & Answers

merge lines within a file that start with a similar pattern

Discussion started by: a_ba

7. Shell Programming and Scripting

remove one of each similar lines in a file

Discussion started by: tukuyomi

8. Shell Programming and Scripting

Counting similar lines from file UNIX

Discussion started by: mohsin.quazi

9. Infrastructure Monitoring

Remove Similar Lines from a File

Discussion started by: Nysif Steve

10. Shell Programming and Scripting

How to sort a file and then print similar lines once

Discussion started by: shira