Reducing text file using similar lines


 
# 1  
Old 03-17-2013

Hello,

I am a Java programmer, but I want to try Unix tools for a task where I need to reduce a file by its first field. Here is the sample data:

Code:
admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;0;4;[];[shekhar, raj]
cook;2;0;[lew];[]
far;0;3;[];[venk]
far;1;5;[shekhar];[venk, raj]

First, an explanation of the dataset. There are five fields separated by ";". The first field is the main id, the second and third are numeric, and the fourth and fifth fields are lists.

I need to combine all lines in the file whose first field matches. Combining means that the 2nd and 3rd fields should be summed, and the 4th and 5th fields (lists) should be merged so that each entry appears only once.

So, the desired output should be:
Code:
admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;2;4;[lew];[shekhar, raj]
far;1;8;[shekhar];[venk, raj]

I could have done this in Java, but I want to use the power of Unix to get it done fast, since I have tons of such very large files.

Thanks a lot.
Shekhar

---------- Post updated at 10:23 PM ---------- Previous update was at 07:16 PM ----------

Please help.

Last edited by shekhar2010us; 03-17-2013 at 08:19 PM.. Reason: I provided the actual email-ids.
# 2  
Old 03-17-2013
Please do not bump up questions if they are not answered promptly.

Here is an awk program based on some assumptions:
Code:
awk -F';' ' {
                # i > 0 (not i > 1), so that lines sharing the very first key are merged too
                if ( i > 0 && F1[i] == $1 )
                {
                        F2[i] += $2
                        F3[i] += $3
                        gsub (/\[|\]/, X, $4)
                        gsub (/\[|\]/, X, $5)
                        if ( F4[i] == "" )
                                F4[i] = $4
                        else if ( $4 != "" )
                        {
                                F4[i] = F4[i] "," $4
                                n = split (F4[i], V, ",")
                                F4[i] = ""
                                for ( k = 1; k <= n; k++ )
                                {
                                        if ( k > 1 && V[k] !~ V[k-1] )
                                                F4[i] = F4[i] "," V[k]
                                        else if ( k == 1 )
                                                F4[i] = V[k]
                                }
                        }
                        if ( F5[i] == "" )
                                F5[i] = $5
                        else if ( $5 != "" )
                        {
                                F5[i] = F5[i] "," $5
                                n = split (F5[i], V, ",")
                                F5[i] = ""
                                for ( k = 1; k <= n; k++ )
                                {
                                        if ( k > 1 && V[k] !~ V[k-1] )
                                                F5[i] = F5[i] "," V[k]
                                        else if ( k == 1 )
                                                F5[i] = V[k]
                                }
                        }
                }
                else
                {
                        ++i
                        F1[i] = $1
                        F2[i] = $2
                        F3[i] = $3
                        gsub (/\[|\]/, X, $4)
                        gsub (/\[|\]/, X, $5)
                        F4[i] = $4
                        F5[i] = $5
                }
} END {
                for ( j = 1; j <= i; j++ )
                        print F1[j], F2[j], F3[j], "[" F4[j] "]", "[" F5[j] "]"
} ' OFS=';' file

This User Gave Thanks to Yoda For This Post:
# 3  
Old 03-18-2013
For the given input, Yoda's script produces the desired output. However, when it combines F4 and F5 with the current line's $4 and $5, respectively, it tests each entry only against the entry immediately before it, so it produces duplicate entries in the output list unless a newly added entry is adjacent to an entry with the same value. Also, the space after a comma is treated as part of a name. The specification isn't clear about whether this is intended, but the separator in the lists in the sample input appears to be a comma followed by a space rather than just a comma.
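
The effect of the separator on the list entries can be seen in isolation. The following is only a small sketch, not part of either script; show_split is a hypothetical helper name. It prints each entry in angle brackets after splitting a list on the separator passed as its first argument:

```shell
# Hypothetical helper: split the list in $2 on the separator in $1 and
# print every resulting entry wrapped in angle brackets.
show_split() {
    printf '%s\n' "$2" | awk -v sep="$1" '{
        n = split($0, a, sep)
        for (k = 1; k <= n; k++) printf "<%s>", a[k]
        print ""
    }'
}

show_split ','   'venk,raj, venk'    # prints <venk><raj>< venk>
show_split ', *' 'venk,raj, venk'    # prints <venk><raj><venk>
```

With a bare comma as the separator, the third entry keeps its leading space and so never compares equal to the first one. With a comma followed by optional spaces, both come out as venk.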



As an example, if the last line in the input file is changed from:
Code:
far;1;5;[shekhar];[venk, raj]

to:
Code:
far;1;5;[shekhar];[raj, venk]

the last line of the output will be:
Code:
far;1;8;[shekhar];[venk,raj, venk]

instead of:
Code:
far;1;8;[shekhar];[raj, venk]

And, if the following lines are in the input file:
Code:
plus;1;2;[u1, u2];[g1,g2]
plus;1;2;[u2];[g1]

it produces:
Code:
plus;2;4;[u1, u2,u2];[g1,g2,g1]

while I would have thought the desired output was:
Code:
plus;2;4;[u1, u2];[g1, g2]

Yoda's code also assumes that all lines that need to be combined will be adjacent in the input file. That is true in the sample input, but the specification doesn't say that it will always be true.
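
If a script does depend on adjacency, one possible workaround (a sketch, assuming it is acceptable for the output to come out sorted by key rather than in input order) is to sort on the first field before processing. group_by_key is a hypothetical helper name:

```shell
# Hypothetical helper: sort on the first ";"-separated field only, so that
# lines with the same key end up next to each other.
# LC_ALL=C is an assumption made here to get byte-wise, predictable ordering.
group_by_key() {
    LC_ALL=C sort -t ';' -k 1,1 "$@"
}

printf '%s\n' 'plus;1;2;[u1];[]' 'far;0;3;[];[venk]' 'plus;1;2;[u2];[g1]' |
    group_by_key
```

The two plus lines, which were separated by the far line in the input, are adjacent in the sorted output. The sorted stream could then be piped into a script that only merges neighbouring lines.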

Here is an alternative awk script that you may want to consider:
Code:
awk '
function combine(ins, LOCAL, a, i, j, n, os) {
        n = split(ins, a, /, */)
        os = a[1]
        for(i = 2; i <= n; i++) {
                for(j = 1; j < i; j++)
                        if(a[i] == a[j]) break
                if(j >= i) os = os ", " a[j]
        }
        return os
} 
BEGIN { FS = OFS = ";" }
{       if($1 in order) i = order[$1]
        else            F1[i = order[$1] = ++oc] = $1
        F2[i] += $2
        F3[i] += $3
        gsub (/[][]/, "", $4) 
        if(F4[i] == "")         F4[i] = $4
        else if($4 != "")       F4[i] = combine(F4[i] "," $4)
        gsub (/[][]/, "", $5) 
        if(F5[i] == "")         F5[i] = $5
        else if($5 != "")       F5[i] = combine(F5[i] "," $5)
}
END {   for(i = 1; i <= oc; i++)
                print F1[i], F2[i], F3[i], "[" F4[i] "]", "[" F5[i] "]"
}' file

With the input file:
Code:
admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;0;4;[];[shekhar, raj]
cook;2;0;[lew];[]
far;0;3;[];[venk]
far;1;5;[shekhar];[raj, venk]
plus;1;2;[u1, u2];[g1,g2]
plus1;1;1;[u1];[]
plus2;0;3;[u2];[raj, venk, g3]
plus2;1;5;[shekhar];[venk, raj]
plus1;1;1;[u2];[g1, g2, g3]
plus1;1;1;[u1];[g2, g4]
plus2;0;3;[u1];[g1, g2, g3]
plus;1;2;[u2];[g1]

this script produces:
Code:
admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;2;4;[lew];[shekhar, raj]
far;1;8;[shekhar];[venk, raj]
plus;2;4;[u1, u2];[g1, g2]
plus1;3;3;[u1, u2];[g1, g2, g3, g4]
plus2;1;11;[u2, shekhar, u1];[raj, venk, g3, g1, g2]

With this same input, Yoda's script produces:
Code:
admin;2;0;[yrral];[]
admission;8;0;[timlu];[]
aman;1;0;[ev];[]
caroline;0;4;[];[luis, asethi]
cook;2;4;[lew];[shekhar, raj]
far;1;8;[shekhar];[venk,raj, venk]
plus;1;2;[u1, u2];[g1,g2]
plus1;1;1;[u1];[]
plus2;1;8;[u2,shekhar];[raj, venk, g3,venk, raj]
plus1;2;2;[u2,u1];[g1, g2, g3,g2, g4]
plus2;0;3;[u1];[g1, g2, g3]
plus;1;2;[u2];[g1]

This User Gave Thanks to Don Cragun For This Post:
# 4  
Old 03-18-2013
Many Thanks Yoda and Don.
Don, you are correct.
1) Lines with the same first field are not necessarily adjacent in the input.
2) The lists are comma separated, and when I tested with some tweaked inputs, the output lists contained duplicates. As I mentioned in the first post, the lists should be combined uniquely, so I don't want duplicates in the lists.

Many thanks once again.
# 5  
Old 03-18-2013
Note that if you want the lists combined by my script to use just a comma to separate entries in fields 4 and 5 instead of a comma followed by a space, change the following line:
Code:
                if(j >= i) os = os ", " a[j]

to:
Code:
                if(j >= i) os = os "," a[j]

Note, however, that the script won't change the lists unless two non-empty lists are found for the same field for the same value in the first field. It could be made to normalize all lists found, but it would run slower.
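
As a rough sketch of what such a normalizing variant could look like (an illustration under assumptions, not a tested replacement): every non-empty list is passed through combine(), even when a key occurs only once. The names merge_normalized and add are hypothetical and do not appear in the script above.

```shell
# Sketch only: same approach as the script above, but every list is
# normalized. merge_normalized and add() are hypothetical names.
merge_normalized() {
    awk '
    # Return the entries of ins without duplicates, joined by ", ".
    function combine(ins,    a, i, j, n, os) {
        n = split(ins, a, /, */)
        os = a[1]
        for (i = 2; i <= n; i++) {
            for (j = 1; j < i; j++)
                if (a[i] == a[j]) break
            if (j >= i) os = os ", " a[i]
        }
        return os
    }
    # Strip the brackets from new, append it to old, and deduplicate.
    function add(old, new) {
        gsub(/[][]/, "", new)
        if (new == "") return old
        if (old == "") return combine(new)
        return combine(old "," new)
    }
    BEGIN { FS = OFS = ";" }
    {
        if (!($1 in order)) { order[$1] = ++oc; F1[oc] = $1 }
        i = order[$1]
        F2[i] += $2
        F3[i] += $3
        F4[i] = add(F4[i], $4)
        F5[i] = add(F5[i], $5)
    }
    END {
        for (i = 1; i <= oc; i++)
            print F1[i], F2[i], F3[i], "[" F4[i] "]", "[" F5[i] "]"
    }' "$@"
}

printf '%s\n' 'plus;1;2;[u1, u2,u1];[g1,g2]' | merge_normalized
# prints plus;1;2;[u1, u2];[g1, g2]
```

Here a single input line is already cleaned up: the repeated u1 is dropped and both lists use ", " as the separator. The cost is one combine() call per non-empty list on every line.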
# 6  
Old 03-18-2013
Finally, a high performance contender has arrived. Step back and eat my dust. :)
Code:
#!/bin/sh

while IFS=\; read -r f1 f2 f3 f4 f5; do
    dir=temp/$f1
    if ! [ -d "$dir" ]; then
        mkdir "$dir"
        touch "$dir/2" "$dir/3" "$dir/4" "$dir/5"
    fi
    printf '%s\n' "$f2" >> "$dir/2"
    printf '%s\n' "$f3" >> "$dir/3"
    printf '%s\n' "$f4" | tr '[], ' '[\n*]' | sed '/./!d' >> "$dir/4"
    printf '%s\n' "$f5" | tr '[], ' '[\n*]' | sed '/./!d' >> "$dir/5"
done

ls -ct temp |
while IFS= read -r dir; do
(
    printf '%s\n' "$dir"
    dir=temp/$dir
    paste -sd+ "$dir/2" | bc
    paste -sd+ "$dir/3" | bc
    sort -u "$dir/4" | paste -sd, - | sed 's/,/, /g; s/.*/[&]/'
    sort -u "$dir/5" | paste -sd, - | sed 's/,/, /g; s/.*/[&]/'
) | paste -sd\; -
done

Obviously, that was created purely for amusement. It is not seriously recommended over an AWK (or perl, or ...) solution.

It assumes an empty directory named "temp" in the current working directory and reads data via stdin.
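
For anyone puzzling over the list handling, its two building blocks can be tried on their own. This is just a sketch; split_list and join_list are hypothetical names wrapping the same tr, sed, sort and paste calls used in the script:

```shell
# Split a bracketed list into one name per line: tr turns "[", "]", ","
# and " " into newlines, and sed deletes the empty lines that result.
split_list() {
    tr '[], ' '[\n*]' | sed '/./!d'
}

# Rebuild a list from one name per line: sort -u drops duplicates, paste
# joins the lines with commas, and sed adds the spaces and brackets.
join_list() {
    sort -u | paste -sd, - | sed 's/,/, /g; s/.*/[&]/'
}

printf '%s\n' '[luis, asethi]' | split_list     # prints luis and asethi on separate lines
printf '%s\n' raj venk raj | join_list          # prints [raj, venk]
```

Because sort -u is used, the entries in each output list come out sorted rather than in order of first appearance, unlike the awk solutions.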

Regards,
Alister