need to remove invariant characters


 
# 1  
Old 08-21-2012

Hello,
I have a nexus alignment file that looks like this:


bar101_min2covg_binarynex 11001-100111
bar102_min2covg_binarynex 110010010011
bar103_min2covg_binarynex 11101010--11

etc.

There are 41 rows of 28014 characters each, and every character is 0, 1, or missing data (-). Probably 80% of the sites are invariant, and I would like to remove them from the alignment. So I'm looking for a way to scan through the alignment and remove every site where all rows' values match, or where only 1 row differs, ignoring missing data points when making this determination (i.e. if several rows have missing data at a site but all the others match, the site gets chopped).

A slight complication is that the data come in pairs, so I need to evaluate sites 1/2, 3/4, 5/6, 7/8, etc. as pairs and eliminate a pair only if both of its sites are invariant across all rows. I'm kind of stumped on how to approach this, and fairly new to this kind of data manipulation. Does anyone have suggestions?

The ideal output from the example would be:

bar101_min2covg_binarynex 001001
bar102_min2covg_binarynex 000100
bar103_min2covg_binarynex 1010--


Thanks for the help!
# 2  
Old 08-21-2012
I think you're going to need gawk for this, as some awk implementations can't handle lines with more than about 3000 characters.

Try:

Code:
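gawk '
{
    key[NR] = $1                            # row label
    # cut the sequence into 2-character pairs, indexed by start column and row
    for (i = 1; i < length($2); i += 2) {
        site[i,NR] = substr($2, i, 2)
        if (i > maxi) maxi = i
    }
}
END {
    c = 0
    for (i = 1; i <= maxi; i += 2) {
        v = ""                              # reference pair for this column
        for (r = 1; r <= NR && !keep[i]; r++) {
            # the first pair without "-" becomes the reference
            if (v == "" && !(site[i,r] ~ "-")) v = site[i,r]
            # a pair that differs from the reference records the column in keep[]
            if (length(v) && site[i,r] != v) keep[++c] = i
        }
    }
    # print each label followed by the pairs recorded in keep[1] .. keep[c-1]
    for (r = 1; r <= NR; r++) {
        printf "%s ", key[r]
        for (i = 1; i < c; i++) printf "%s", site[keep[i],r]
        printf "\n"
    }
}' infile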

# 3  
Old 08-21-2012
Thanks for the reply. I tried that code and it duplicated every site on each line, so the output has 56024 sites (I was off by 2 when I wrote 28014 earlier).
# 4  
Old 08-21-2012
Funny, it seems to be working fine for me here, using the test file you posted, with gawk version 4.0.1.

Can you try it with your 3 line test file and see how it goes?
# 5  
Old 08-22-2012
Alright, it works OK on the example, but with more complicated input it stops handling the missing characters consistently, and it breaks down completely with a few full-length individuals. That is when the duplication of every site occurs.
# 6  
Old 08-22-2012
Quote:
Originally Posted by ljk
... So, I'm looking for a way to scan through this alignment file and remove all sites where all rows' values match, or where only 1 row differs ...

The ideal output from the example would be:

bar101_min2covg_binarynex 001001
bar102_min2covg_binarynex 000100
bar103_min2covg_binarynex 1010--
Why are those first 2 pairs included? According to your rules, if only one row differs, a pair should be excluded.

Regards,
Alister
# 7  
Old 08-22-2012
You're right, this is just a short snippet taken from the dataset so I didn't follow that rule in the example. If just one sample of the 41 differs, it's not informative, but it's not the end of the world to leave it in the dataset either.
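
For anyone landing on this thread later, here is a minimal sketch of one way to approach the problem. It is not a tested solution for the full dataset. Two things in the script from post #2 appear to explain the reports above: it compares whole 2-character pairs, so a pair such as 1- counts as different from 10, and it tests keep[i] while storing into keep[++c], so the row loop does not reliably stop after the first mismatch and the same pair can be recorded more than once. The version below instead counts the 0s and 1s in every column, skips anything else as missing, and keeps a pair when at least one of its two columns is variable. The function name drop_invariant_pairs and the minrows argument are made up for this example. It assumes every input line is exactly a label and a sequence separated by whitespace, and it uses plain awk (substitute gawk if your awk has trouble with long lines).

```shell
# Hypothetical helper: reads the alignment on stdin, writes the reduced one to stdout.
# Optional argument: minimum number of rows that must carry the rarer state (default 1).
drop_invariant_pairs() {
    awk -v minrows="${1:-1}" '
    NF == 2 {
        n++
        name[n] = $1
        seq[n] = $2
        len = length($2)
        if (len > maxlen) maxlen = len
        # count the 0s and 1s in every column
        for (i = 1; i <= len; i++) {
            ch = substr($2, i, 1)
            if (ch == "0") zeros[i]++
            else if (ch == "1") ones[i]++
            # anything else (such as "-") is treated as missing and not counted
        }
    }
    END {
        # a column is variable when its rarer state occurs in at least minrows rows;
        # columns 2p-1 and 2p form pair p, and one variable column keeps the pair
        for (i = 1; i <= maxlen; i++) {
            rare = (zeros[i] < ones[i]) ? zeros[i] : ones[i]
            if (rare + 0 >= minrows) keep[int((i + 1) / 2)] = 1
        }
        npairs = int((maxlen + 1) / 2)
        for (r = 1; r <= n; r++) {
            printf "%s ", name[r]
            for (p = 1; p <= npairs; p++) {
                if (p in keep) printf "%s", substr(seq[r], 2 * p - 1, 2)
            }
            printf "\n"
        }
    }'
}
```

Called as `drop_invariant_pairs < infile`, this keeps pairs in which a single row differs, and on the three-line example from the first post it reproduces the ideal output shown there. Called as `drop_invariant_pairs 2 < infile`, it applies the stricter rule from the first post and also drops pairs in which only one row differs.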