Collapsing similar strings

12-20-2015

Registered User

365, 3

Join Date: Jun 2010

Last Activity: 6 August 2019, 11:08 PM EDT

Posts: 365

Thanks Given: 149

Thanked 3 Times in 3 Posts

I have a file that looks like this:

Code:

BC00001	GA	2	2	3	3	2	5	1	5	3	3	2	4																																																																																																																																																																																	
BC00002	CA	2	2	3	3	2	5	1	5	3	3	2	4																																																																																																																																																																																	
BC00003	TX	2	2	3	3	2	5	1	5	3	3	2	4																																																																																																																																																																																	
BC00004	TX	2	2	4	3	2	6	2	2	3	4	3	2																																																																																																																																																																																	
BC00005	NC	2	2	4	3	2	6	2	2	3	4	3	2																																																																																																																																																																																	
BC00006	TX	3	3	3	3	2	5	1	5	3	2	2	2																																																																																																																																																																																	
BC00007	TX	2	2	3	3	2	5	1	5	4	3	2	4																																																																																																																																																																																	
BC00008	TX	3	3	3	3	2	5	1	5	3	2	2	4																																																																																																																																																																																	
BC00009	NY	3	2	3	3	2	5	1	3	3	3	2	3																																																																																																																																																																																	
BC00010	NY	1	2	3	3	2	5	1	6	4	3	3	3

Columns 1 & 2 are the identifiers. I need to scan columns 3-14 of each entry, find the entries that are identical, and 'collapse' them into one entry. I should also record the frequency per state in parentheses "()" and the global frequency as "Freq". Thus, my outfile should look like this:

Code:

BC00001	GA(1),CA(1),TX(1)-Freq-3	2	2	3	3	2	5	1	5	3	3	2	4																																																																																																																																																																																	
BC00004	TX(1),NC(1)-Freq-2	2	2	4	3	2	6	2	2	3	4	3	2																																																																																																																																																																																	
BC00006	TX	3	3	3	3	2	5	1	5	3	2	2	2																																																																																																																																																																																	
BC00007	TX	2	2	3	3	2	5	1	5	4	3	2	4																																																																																																																																																																																	
BC00008	TX	3	3	3	3	2	5	1	5	3	2	2	4																																																																																																																																																																																	
BC00009	NY	3	2	3	3	2	5	1	3	3	3	2	3																																																																																																																																																																																	
BC00010	NY	1	2	3	3	2	5	1	6	4	3	3	3

I put together the following awk script:

Code:

awk '{id=$1}{query=$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14}{F[query]++;if (!I[query]) I[query]=id"\t"$2" Freq"}END{for(i in I)print I[i],F[i],i}'

but I am far from getting the expected results. This is what I am getting:

Code:

BC00010 NY Freq 1 1     2       3       3       2       5       1       6       4       3       3       3
BC00006 TX Freq 1 3     3       3       3       2       5       1       5       3       2       2       2
BC00008 TX Freq 1 3     3       3       3       2       5       1       5       3       2       2       4
BC00007 TX Freq 1 2     2       3       3       2       5       1       5       4       3       2       4
BC00004 TX Freq 2 2     2       4       3       2       6       2       2       3       4       3       2
         Freq 1
BC00001 GA Freq 3 2     2       3       3       2       5       1       5       3       3       2       4
BC00009 NY Freq 1 3     2       3       3       2       5       1       3       3       3       2       3

Any help with my code will be greatly appreciated!

Xterra

12-20-2015

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Are you sure field 1 is of any relevance? In your desired output, BC00002 and BC00003 have completely gone.

---------- Post updated at 21:43 ---------- Previous update was at 21:42 ----------

And how and why do you select BC00001 for the output?

RudiC

12-20-2015

Rudi
In reality, field 1 is not that important. However, I would like to keep it as a reference. When similar entries are found, I just keep the first one. As I said, I would be OK with renaming all entries as Haplotype-1 through 10.
Thanks

Xterra

12-20-2015

I don't understand what haplotype-1 - 10 is. This is what I have so far:

Code:

awk '
        {T=$0
         gsub ($1 FS $2 FS "|" FS "*$", "", T)
         FREQ[T]++
         ST[T] = ST[T] $2 FS
         FQST[$2 FS T]++
        }
END     {for (f in FREQ)        {printf "%s ", f
                                 n = split (ST[f], TMP)
                                 for (i=1; i<n; i++) printf "%s(%s),", TMP[i], FQST[TMP[i] FS f]
                                 printf "-Freq-%s\n",  FREQ[f]
                                }
        }
' FS="\t" file
1    2    3    3    2    5    1    6    4    3    3    3 NY(1),-Freq-1
2    2    4    3    2    6    2    2    3    4    3    2 TX(1),NC(1),-Freq-2
3    3    3    3    2    5    1    5    3    2    2    4 TX(1),-Freq-1
3    3    3    3    2    5    1    5    3    2    2    2 TX(1),-Freq-1
3    2    3    3    2    5    1    3    3    3    2    3 NY(1),-Freq-1
2    2    3    3    2    5    1    5    4    3    2    4 TX(1),-Freq-1
2    2    3    3    2    5    1    5    3    3    2    4 GA(1),CA(1),TX(1),-Freq-3

---------- Post updated at 23:04 ---------- Previous update was at 22:19 ----------

This may come closer to what you need:

Code:

awk '
        {T=$0
         gsub ($1 FS $2 FS "|" FS "*$", "", T)
         FREQ[T]++
         ST[T] = ST[T] $2 FS
         FQST[$2 FS T]++
         BC[T] = $1
        }
END     {for (f in FREQ)        {printf "%s%s%s  ", BC[f], FS, f
                                 n = split (ST[f], TMP)
                                 if (n == 2)    print TMP[1]
                                 else   {for (i=1; i<n; i++) printf "%s(%s)%s", TMP[i], FQST[TMP[i] FS f], i==n-1?"-":","
                                                 printf "Freq-%s\n",  FREQ[f]
                                                }
                                }
        }
' FS="\t" file
BC00010    1    2    3    3    2    5    1    6    4    3    3    3  NY
BC00005    2    2    4    3    2    6    2    2    3    4    3    2  TX(1),NC(1)-Freq-2
BC00008    3    3    3    3    2    5    1    5    3    2    2    4  TX
BC00006    3    3    3    3    2    5    1    5    3    2    2    2  TX
BC00009    3    2    3    3    2    5    1    3    3    3    2    3  NY
BC00007    2    2    3    3    2    5    1    5    4    3    2    4  TX
BC00003    2    2    3    3    2    5    1    5    3    3    2    4  GA(1),CA(1),TX(1)-Freq-3
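
The key-building step used in both scripts above can be tried in isolation. This is only a minimal sketch: strip_ids is a hypothetical helper name and the sample line is made up, but the gsub call is the one from the scripts. It removes the first two fields and any trailing tabs, so what remains (former fields 3 onwards) can serve as the array index. Note that the pattern is built from the data and is not anchored, so it assumes the identifiers contain no regex metacharacters and do not reappear later in the line.

```shell
# strip_ids is a hypothetical name used only for this illustration.
strip_ids() {
    awk -F'\t' '{
        T = $0
        # Dynamic regex: "<field1><TAB><field2><TAB>" OR "zero or more TABs at end of line"
        gsub($1 FS $2 FS "|" FS "*$", "", T)
        print "[" T "]"      # brackets make any leftover whitespace visible
    }' "$@"
}

# Made-up sample line with three trailing tabs
printf 'BC00001\tGA\t2\t2\t3\t\t\t\n' | strip_ids
```

The line printed between the brackets should contain only the three tab-separated values, with neither the identifiers nor the trailing tabs.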

RudiC

12-20-2015

Rudi
Awesome! Would you mind explaining the code a bit?
Thanks

---------- Post updated at 09:36 PM ---------- Previous update was at 05:32 PM ----------

After testing the script, I realized that it does not do exactly what I need. Using the following infile (a slight variation of my initial file):

Code:

BC00001 GA      2       2       3       3       2       5       1       5       3       3       2       4
BC00002 CA      2       2       3       3       2       5       1       5       3       3       2       4
BC00003 TX      2       2       3       3       2       5       1       5       3       3       2       4
BC00004 TX      2       2       4       3       2       6       2       2       3       4       3       2
BC00005 NC      2       2       4       3       2       6       2       2       3       4       3       2
BC00006 TX      3       3       3       3       2       5       1       5       3       2       2       2
BC00007 TX      2       2       3       3       2       5       1       5       4       3       2       4
BC00008 TX      3       3       3       3       2       5       1       5       3       2       2       4
BC00009 NY      3       2       3       3       2       5       1       3       3       3       2       3
BC00010 NY      1       2       3       3       2       5       1       6       4       3       3       3
BC00011 CA      2       2       3       3       2       5       1       5       3       3       2       4

This is what I get with Rudi's script:

Code:

BC00010 1       2       3       3       2       5       1       6       4       3       3       3  NY
BC00006 3       3       3       3       2       5       1       5       3       2       2       2  TX
BC00008 3       3       3       3       2       5       1       5       3       2       2       4  TX
BC00007 2       2       3       3       2       5       1       5       4       3       2       4  TX
BC00005 2       2       4       3       2       6       2       2       3       4       3       2  TX(1),NC(1)-Freq-2
BC00011 2       2       3       3       2       5       1       5       3       3       2       4  GA(1),CA(2),TX(1),CA(2)-Freq-4
BC00009 3       2       3       3       2       5       1       3       3       3       2       3  NY

However, this is what I need:

Code:

BC00010 1       2       3       3       2       5       1       6       4       3       3       3  NY
BC00006 3       3       3       3       2       5       1       5       3       2       2       2  TX
BC00008 3       3       3       3       2       5       1       5       3       2       2       4  TX
BC00007 2       2       3       3       2       5       1       5       4       3       2       4  TX
BC00005 2       2       4       3       2       6       2       2       3       4       3       2  TX(1),NC(1)-Freq-2
BC00011 2       2       3       3       2       5       1       5       3       3       2       4  GA(1),CA(2),TX(1)-Freq-4
BC00009 3       2       3       3       2       5       1       3       3       3       2       3  NY

As you can see, the cumulative count for CA is correct, but the CA entry is repeated.

Xterra

12-21-2015

Rats! Yes, states were collected unconditionally. Try

Code:

awk '
        {IX=$0                                          # create index for arrays
         gsub ($1 FS $2 FS "|" FS "*$", "", IX)         # modify/adapt index
         FREQ[IX]++                                     # count occurrences
         if (ST[IX] !~ $2) ST[IX] = ST[IX] $2 FS        # keep unique states
         FQST[$2 FS IX]++                               # count state/index occurrences
         BC[IX] = $1                                    # keep arbitrary BC code
        }

END     {for (f in FREQ)                                # run across all indices
                {printf "%s%s%s  ", BC[f], FS, f        # print BC code and index (former fields 3 - 14)
                 n = split (ST[f], T)                   # get back state
                 if (n == 2)    print T[1]              # single state? print just it
                 else   {for (i=1; i<n; i++) printf "%s(%s)%s", T[i], FQST[T[i] FS f], i==n-1?"-":","    
                                                        # multiple states? print each state''s frequency
                         printf "Freq-%s\n",  FREQ[f]   # and overall frequency of occurrences
                        }
                }
        }
' FS="\t" file
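
For comparison, here is a hedged sketch of a variant that reports groups in input order and keeps the first identifier of each group, as asked for earlier in the thread. The function name collapse and the array names are illustrative, not taken from the posts above. It builds the key from fields 3-14 explicitly instead of using gsub, tests state membership with the "in" operator rather than a regex match, and prints the state column second, as in the originally requested layout. Unlike the script above, a group whose rows all share one state is still printed with its counts (for example TX(2)-Freq-2); that is a design choice, not something the thread specifies.

```shell
# collapse is a hypothetical helper; it assumes tab-separated input with at least 14 fields.
collapse() {
    awk -F'\t' '
    {
        # Build the key from fields 3-14 only, so trailing empty fields are ignored.
        key = $3
        for (i = 4; i <= 14; i++) key = key FS $i

        if (!(key in total)) {           # first time this profile is seen
            order[++nkeys] = key         # remember input order
            id[key] = $1                 # keep the FIRST identifier
        }
        total[key]++

        if (!((key, $2) in perstate))    # first time this state occurs for the profile
            states[key] = states[key] (states[key] == "" ? "" : ",") $2
        perstate[key, $2]++
    }
    END {
        for (k = 1; k <= nkeys; k++) {
            key = order[k]
            n = split(states[key], s, ",")
            if (total[key] == 1) {
                label = s[1]             # unique profile: print the bare state
            } else {
                label = ""
                for (j = 1; j <= n; j++)
                    label = label (j > 1 ? "," : "") s[j] "(" perstate[key, s[j]] ")"
                label = label "-Freq-" total[key]
            }
            print id[key] FS label FS key
        }
    }' "$@"
}
```

It would be called as collapse file, or fed from a pipe.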

RudiC

12-23-2015

Registered User

2,205, 181

Join Date: Mar 2006

Last Activity: 8 May 2020, 5:01 AM EDT

Location: Bangalore,India

Posts: 2,205

Thanks Given: 31

Thanked 181 Times in 171 Posts

Code:

$ sort -k3,14 -k 2,2 file |
> awk ' {
>       key=$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14; cnt[key]++;
>       if(arr[key]) {
>               arr[key]=arr[key](secFld!=$2 ?"("cnt2ndFld[key]"),"$2:"");
>               cnt2ndFld[key] = (secFld == $2 ? cnt2ndFld[key]+1 : 1);
>       }
>       else {arr[key]=$1"\t"$2; secFld=$2; cnt2ndFld[key]=1}
> }
> END { for(i in arr) { print arr[i](cnt[i]>1?"("cnt2ndFld[i]")-Freq-"cnt[i]:"") "\t" i } } '
BC00009 NY      3       2       3       3       2       5       1       3       3       3       2       3
BC00006 TX      3       3       3       3       2       5       1       5       3       2       2       2
BC00008 TX      3       3       3       3       2       5       1       5       3       2       2       4
BC00007 TX      2       2       3       3       2       5       1       5       4       3       2       4
BC00005 NC(1),TX(1)-Freq-2      2       2       4       3       2       6       2       2       3       4       3       2
BC00010 NY      1       2       3       3       2       5       1       6       4       3       3       3
BC00002 CA(2),GA(1),TX(1)-Freq-4        2       2       3       3       2       5       1       5       3       3       2       4
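
The global frequencies reported by any of the scripts in this thread can be cross-checked without awk arrays. A minimal sketch, assuming tab-separated input; profile_counts is a hypothetical helper name. It only counts identical column 3-14 profiles and says nothing about states, and the final awk step turns the tabs into single spaces.

```shell
# profile_counts is a hypothetical helper for sanity-checking the Freq values.
profile_counts() {
    cut -f3-14 "$@" |      # keep only columns 3-14 (cut splits on TAB by default)
    sort |                 # bring identical profiles next to each other
    uniq -c |              # prefix each distinct profile with its count
    awk '{ n = $1; $1 = ""; sub(/^ +/, ""); print n ": " $0 }'   # normalise spacing
}
```

Running profile_counts file prints one line per distinct profile in the form "count: values", which can be compared against the Freq numbers above.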

anbu23
