Collapsing similar strings


 
# 1  
Old 12-20-2015

I have a file that looks like this:
Code:
BC00001	GA	2	2	3	3	2	5	1	5	3	3	2	4																																																																																																																																																																																	
BC00002	CA	2	2	3	3	2	5	1	5	3	3	2	4																																																																																																																																																																																	
BC00003	TX	2	2	3	3	2	5	1	5	3	3	2	4																																																																																																																																																																																	
BC00004	TX	2	2	4	3	2	6	2	2	3	4	3	2																																																																																																																																																																																	
BC00005	NC	2	2	4	3	2	6	2	2	3	4	3	2																																																																																																																																																																																	
BC00006	TX	3	3	3	3	2	5	1	5	3	2	2	2																																																																																																																																																																																	
BC00007	TX	2	2	3	3	2	5	1	5	4	3	2	4																																																																																																																																																																																	
BC00008	TX	3	3	3	3	2	5	1	5	3	2	2	4																																																																																																																																																																																	
BC00009	NY	3	2	3	3	2	5	1	3	3	3	2	3																																																																																																																																																																																	
BC00010	NY	1	2	3	3	2	5	1	6	4	3	3	3

Columns 1 & 2 are the identifiers. I need to scan columns 3-14 of each entry, find the entries that are identical, and 'collapse' them into one entry. I should also record the frequency by state, in parentheses "()", and the global frequency as "Freq". Thus, my output file should look like this:
Code:
BC00001	GA(1),CA(1),TX(1)-Freq-3	2	2	3	3	2	5	1	5	3	3	2	4																																																																																																																																																																																	
BC00004	TX(1),NC(1)-Freq-2	2	2	4	3	2	6	2	2	3	4	3	2																																																																																																																																																																																	
BC00006	TX	3	3	3	3	2	5	1	5	3	2	2	2																																																																																																																																																																																	
BC00007	TX	2	2	3	3	2	5	1	5	4	3	2	4																																																																																																																																																																																	
BC00008	TX	3	3	3	3	2	5	1	5	3	2	2	4																																																																																																																																																																																	
BC00009	NY	3	2	3	3	2	5	1	3	3	3	2	3																																																																																																																																																																																	
BC00010	NY	1	2	3	3	2	5	1	6	4	3	3	3

I put together the following awk script:
Code:
awk '{id=$1}{query=$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14}{F[query]++;if (!I[query]) I[query]=id"\t"$2" Freq"}END{for(i in I)print I[i],F[i],i}'

but I am far from getting the expected results. This is what I am getting:
Code:
BC00010 NY Freq 1 1     2       3       3       2       5       1       6       4       3       3       3
BC00006 TX Freq 1 3     3       3       3       2       5       1       5       3       2       2       2
BC00008 TX Freq 1 3     3       3       3       2       5       1       5       3       2       2       4
BC00007 TX Freq 1 2     2       3       3       2       5       1       5       4       3       2       4
BC00004 TX Freq 2 2     2       4       3       2       6       2       2       3       4       3       2
         Freq 1
BC00001 GA Freq 3 2     2       3       3       2       5       1       5       3       3       2       4
BC00009 NY Freq 1 3     2       3       3       2       5       1       3       3       3       2       3

Any help with my code will be greatly appreciated!
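
A note on the stray " Freq 1" line in the output above: it is consistent with a blank line somewhere in the input, since an empty record still produces a key made only of tabs. The order of the other lines is scrambled because `for (i in I)` visits keys in no particular order. A minimal sketch of the same counting idea with both points addressed; the `NF` guard, the `order[]` array and the function name `collapse_first_try` are illustrative additions, not part of the original attempt, and the per-state counts are still missing:

```shell
# Sketch only: the sample rows are generated inline with printf.
collapse_first_try() {
  awk '
    NF < 14 { next }                                # skip blank or short lines
    {
      key = $3
      for (i = 4; i <= 14; i++) key = key "\t" $i   # columns 3-14 as one string
      if (!(key in first)) {                        # first time this profile is seen
        first[key] = $1 "\t" $2
        order[++n] = key                            # remember input order
      }
      freq[key]++
    }
    END {
      for (j = 1; j <= n; j++)
        print first[order[j]] " Freq " freq[order[j]] "\t" order[j]
    }
  '
}

# Two identical rows separated by a blank line collapse into one output line.
printf 'BC00001\tGA\t2\t2\t3\t3\t2\t5\t1\t5\t3\t3\t2\t4\n\nBC00002\tCA\t2\t2\t3\t3\t2\t5\t1\t5\t3\t3\t2\t4\n' |
  collapse_first_try
```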
# 2  
Old 12-20-2015
Are you sure field 1 is of any relevance? In your desired output, BC00002 and BC00003 are completely gone.

---------- Post updated at 21:43 ---------- Previous update was at 21:42 ----------

And how and why do you select BC00001 for the output?
# 3  
Old 12-20-2015
Rudi,
In reality, field 1 is not that important. However, I would like to keep it as a reference. When similar entries are found, I just keep the first one. I would also be OK with renaming all entries as Haplotype-1 through Haplotype-10.
Thanks
# 4  
Old 12-20-2015
I don't understand what Haplotype-1 through 10 means. This is what I have so far:
Code:
awk '
        {T=$0
         gsub ($1 FS $2 FS "|" FS "*$", "", T)
         FREQ[T]++
         ST[T] = ST[T] $2 FS
         FQST[$2 FS T]++
        }
END     {for (f in FREQ)        {printf "%s ", f
                                 n = split (ST[f], TMP)
                                 for (i=1; i<n; i++) printf "%s(%s),", TMP[i], FQST[TMP[i] FS f]
                                 printf "-Freq-%s\n",  FREQ[f]
                                }
        }
' FS="\t" file
1    2    3    3    2    5    1    6    4    3    3    3 NY(1),-Freq-1
2    2    4    3    2    6    2    2    3    4    3    2 TX(1),NC(1),-Freq-2
3    3    3    3    2    5    1    5    3    2    2    4 TX(1),-Freq-1
3    3    3    3    2    5    1    5    3    2    2    2 TX(1),-Freq-1
3    2    3    3    2    5    1    3    3    3    2    3 NY(1),-Freq-1
2    2    3    3    2    5    1    5    4    3    2    4 TX(1),-Freq-1
2    2    3    3    2    5    1    5    3    3    2    4 GA(1),CA(1),TX(1),-Freq-3

---------- Post updated at 23:04 ---------- Previous update was at 22:19 ----------

This may come closer to what you need:
Code:
awk '
        {T=$0
         gsub ($1 FS $2 FS "|" FS "*$", "", T)
         FREQ[T]++
         ST[T] = ST[T] $2 FS
         FQST[$2 FS T]++
         BC[T] = $1
        }
END     {for (f in FREQ)        {printf "%s%s%s  ", BC[f], FS, f
                                 n = split (ST[f], TMP)
                                 if (n == 2)    print TMP[1]
                                 else   {for (i=1; i<n; i++) printf "%s(%s)%s", TMP[i], FQST[TMP[i] FS f], i==n-1?"-":","
                                                 printf "Freq-%s\n",  FREQ[f]
                                                }
                                }
        }
' FS="\t" file
BC00010    1    2    3    3    2    5    1    6    4    3    3    3  NY
BC00005    2    2    4    3    2    6    2    2    3    4    3    2  TX(1),NC(1)-Freq-2
BC00008    3    3    3    3    2    5    1    5    3    2    2    4  TX
BC00006    3    3    3    3    2    5    1    5    3    2    2    2  TX
BC00009    3    2    3    3    2    5    1    3    3    3    2    3  NY
BC00007    2    2    3    3    2    5    1    5    4    3    2    4  TX
BC00003    2    2    3    3    2    5    1    5    3    3    2    4  GA(1),CA(1),TX(1)-Freq-3
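
The `gsub` call in the scripts above builds the array index by deleting the two identifier fields and any trailing tabs from a copy of the line. A minimal sketch of that one step in isolation; the function name `strip_ids` and the sample row are made up for illustration:

```shell
# Sketch only: shows what the gsub-based index looks like for one row.
strip_ids() {
  awk -F'\t' '
    {
      key = $0
      # Dynamic regex: "<field1><TAB><field2><TAB>" OR "zero or more TABs at end of line"
      gsub($1 FS $2 FS "|" FS "*$", "", key)
      print "[" key "]"                 # brackets make leftover whitespace visible
    }
  '
}

# One row with three trailing tabs; only the tab-separated values remain.
printf 'BC00001\tGA\t2\t2\t3\t\t\t\n' | strip_ids
```

Note that the identifier fields are used as a regular expression here, so an ID containing characters such as `.` or `+` would be treated as a pattern rather than as literal text. That does not matter for IDs like BC00001.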

This User Gave Thanks to RudiC For This Post:
# 5  
Old 12-20-2015
Rudi
Awesome! Would you mind explaining the code a bit?
Thanks

---------- Post updated at 09:36 PM ---------- Previous update was at 05:32 PM ----------

After testing the script, I came to realize that it does not do exactly what I need. Using the following input file (a slight variation of my initial file):
Code:
BC00001 GA      2       2       3       3       2       5       1       5       3       3       2       4
BC00002 CA      2       2       3       3       2       5       1       5       3       3       2       4
BC00003 TX      2       2       3       3       2       5       1       5       3       3       2       4
BC00004 TX      2       2       4       3       2       6       2       2       3       4       3       2
BC00005 NC      2       2       4       3       2       6       2       2       3       4       3       2
BC00006 TX      3       3       3       3       2       5       1       5       3       2       2       2
BC00007 TX      2       2       3       3       2       5       1       5       4       3       2       4
BC00008 TX      3       3       3       3       2       5       1       5       3       2       2       4
BC00009 NY      3       2       3       3       2       5       1       3       3       3       2       3
BC00010 NY      1       2       3       3       2       5       1       6       4       3       3       3
BC00011 CA      2       2       3       3       2       5       1       5       3       3       2       4

This is what I get with Rudi's script:
Code:
BC00010 1       2       3       3       2       5       1       6       4       3       3       3  NY
BC00006 3       3       3       3       2       5       1       5       3       2       2       2  TX
BC00008 3       3       3       3       2       5       1       5       3       2       2       4  TX
BC00007 2       2       3       3       2       5       1       5       4       3       2       4  TX
BC00005 2       2       4       3       2       6       2       2       3       4       3       2  TX(1),NC(1)-Freq-2
BC00011 2       2       3       3       2       5       1       5       3       3       2       4  GA(1),CA(2),TX(1),CA(2)-Freq-4
BC00009 3       2       3       3       2       5       1       3       3       3       2       3  NY

However, this is what I need:
Code:
BC00010 1       2       3       3       2       5       1       6       4       3       3       3  NY
BC00006 3       3       3       3       2       5       1       5       3       2       2       2  TX
BC00008 3       3       3       3       2       5       1       5       3       2       2       4  TX
BC00007 2       2       3       3       2       5       1       5       4       3       2       4  TX
BC00005 2       2       4       3       2       6       2       2       3       4       3       2  TX(1),NC(1)-Freq-2
BC00011 2       2       3       3       2       5       1       5       3       3       2       4  GA(1),CA(2),TX(1)-Freq-4
BC00009 3       2       3       3       2       5       1       3       3       3       2       3  NY

As you can see, the cumulative count for CA is correct, but the CA entry is repeated.
# 6  
Old 12-21-2015
Rats! Yes, states were collected unconditionally. Try
Code:
awk '
        {IX=$0                                          # create index for arrays
         gsub ($1 FS $2 FS "|" FS "*$", "", IX)         # modify/adapt index
         FREQ[IX]++                                     # count occurrences
         if (ST[IX] !~ $2) ST[IX] = ST[IX] $2 FS        # keep unique states
         FQST[$2 FS IX]++                               # count state/index occurrences
         BC[IX] = $1                                    # keep arbitrary BC code
        }

END     {for (f in FREQ)                                # run across all indices
                {printf "%s%s%s  ", BC[f], FS, f        # print BC code and index (former fields 3 - 14)
                 n = split (ST[f], T)                   # get back state
                 if (n == 2)    print T[1]              # single state? print just it
                 else   {for (i=1; i<n; i++) printf "%s(%s)%s", T[i], FQST[T[i] FS f], i==n-1?"-":","    
                                                        # multiple states? print each state''s frequency
                         printf "Freq-%s\n",  FREQ[f]   # and overall frequency of occurrences
                        }
                }
        }
' FS="\t" file
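
A possible variant, sketched under a few assumptions rather than offered as a replacement: it keeps the first ID of each group (as requested in post #3), prints groups in input order, and tracks states with an exact array lookup instead of the regex test `ST[IX] !~ $2`, which is a substring match. The names `collapse`, `order`, `perstate` and `total` are illustrative. State names are assumed to contain no commas. A group with more than one row always gets the "(n)...-Freq-n" form, even if all rows share one state, which the thread does not specify.

```shell
# Sketch only: output columns follow the layout asked for in post #1.
collapse() {
  awk '
    NF < 14 { next }                                  # ignore blank or short lines
    {
      key = $3
      for (i = 4; i <= 14; i++) key = key "\t" $i     # profile = columns 3-14
      if (!(key in id)) { id[key] = $1; order[++n] = key }   # first ID, input order
      if (!((key, $2) in perstate))                   # exact match on the state name
        states[key] = states[key] (states[key] == "" ? "" : ",") $2
      perstate[key, $2]++                             # rows per state within the group
      total[key]++                                    # rows in the group
    }
    END {
      for (j = 1; j <= n; j++) {
        key = order[j]
        if (total[key] == 1) label = states[key]      # singleton: state only
        else {
          m = split(states[key], s, ",")
          label = ""
          for (k = 1; k <= m; k++)
            label = label (k > 1 ? "," : "") s[k] "(" perstate[key, s[k]] ")"
          label = label "-Freq-" total[key]
        }
        print id[key] "\t" label "\t" key
      }
    }
  '
}

# A reduced version of the second sample: CA occurs twice in the first group.
printf '%s\n' \
  'BC00001 GA 2 2 3 3 2 5 1 5 3 3 2 4' \
  'BC00002 CA 2 2 3 3 2 5 1 5 3 3 2 4' \
  'BC00003 TX 2 2 3 3 2 5 1 5 3 3 2 4' \
  'BC00006 TX 3 3 3 3 2 5 1 5 3 2 2 2' \
  'BC00011 CA 2 2 3 3 2 5 1 5 3 3 2 4' |
  collapse
```

With this input the first group should come out labelled GA(1),CA(2),TX(1)-Freq-4 and the BC00006 row should keep the plain label TX.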

# 7  
Old 12-23-2015
Code:
$ sort -k3,14 -k 2,2 file |
> awk ' {
>       # key = fields 3-14; input is sorted by key, then by state
>       key=$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14; cnt[key]++;
>       if(arr[key]) {
>               arr[key]=arr[key](secFld!=$2 ?"("cnt2ndFld[key]"),"$2:"");
>               cnt2ndFld[key] = (secFld == $2 ? cnt2ndFld[key]+1 : 1);
>               secFld=$2;      # remember the state just seen
>       }
>       else {arr[key]=$1"\t"$2; secFld=$2; cnt2ndFld[key]=1}
> }
> END { for(i in arr) { print arr[i](cnt[i]>1?"("cnt2ndFld[i]")-Freq-"cnt[i]:"") "\t" i } } '
BC00009 NY      3       2       3       3       2       5       1       3       3       3       2       3
BC00006 TX      3       3       3       3       2       5       1       5       3       2       2       2
BC00008 TX      3       3       3       3       2       5       1       5       3       2       2       4
BC00007 TX      2       2       3       3       2       5       1       5       4       3       2       4
BC00005 NC(1),TX(1)-Freq-2      2       2       4       3       2       6       2       2       3       4       3       2
BC00010 NY      1       2       3       3       2       5       1       6       4       3       3       3
BC00002 CA(2),GA(1),TX(1)-Freq-4        2       2       3       3       2       5       1       5       3       3       2       4

 