Uppercase/lowercase comparison of one character per line with awk??


 
# 1  
Old 01-07-2010

Another frustrating scripting problem from a biologist trying to manipulate a file with several million lines. For each line I need to compare the uppercase A, C, G or T with the lowercase a, c, g or t. If there are more uppercase than lowercase, a + should be added in a new column; otherwise a - is added. Many of the lines are duplicated, triplicated, etc. This allows only one character to be compared per line, in the order A, C, G, T. To make it even more complicated, the comparison on the last line of each set of repeated lines should be between . and , where if there are more . than , a + should be added.

Below are some examples of my data. The numeric columns are the counts of uppercase A, C, G, T and lowercase a, c, g, t, in that order.
Code:
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0

And this is what I'd like to get:
Code:
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0  +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0  +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0  +
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0  -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0  -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0  -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +

I've tried awk with if conditions but I guess it is too simple. Any suggestions or help will be very much appreciated!

Last edited by Scott; 01-07-2010 at 02:38 AM.. Reason: Added code tags
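A minimal sketch, not from the thread, for sanity-checking the count columns against the first field. It assumes the eight counts follow the order A C G T a c g t described in the post, and it relies on awk's gsub() returning the number of substitutions it made. The function name count_bases is hypothetical.

```shell
# count_bases: hypothetical helper; prints the A C G T a c g t counts
# found in the first field of each input line.
count_bases() {
    awk '{
        s = $1
        n = split("A C G T a c g t", base, " ")
        out = ""
        for (k = 1; k <= n; k++) {
            # gsub() returns how many replacements it made, so replacing
            # a letter with itself counts its occurrences.
            c = gsub(base[k], base[k], s)
            out = out (k > 1 ? " " : "") c
        }
        print out
    }'
}

# Prints "1 1 0 0 0 3 0 0", matching the columns already on that line.
printf '%s\n' '.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0' | count_bases
```

Characters other than the eight letters (dots, commas, ^, F and so on) are simply never counted.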
# 2  
Old 01-07-2010
If your problem description is correct, shouldn't the final three lines in your sample data end with --+ instead of +-+ ? The first two minus signs because lowercase outnumbers uppercase, and the final plus because it is the last of a series of dupes, which triggers the comma-dot comparison rule, and since there are more dots a plus should end it.

Instead of:

Quote:
Originally Posted by ivpz
Code:
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +

Should it not be:

Code:
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +

Or, perhaps I misunderstood.

Regards,
Alister
# 3  
Old 01-07-2010
No. There is only one A, i.e. 1 in column 2 and 0 in column 5.

The order of comparison should be A followed by C followed by G and finally by T. If A is found in the first line, it should not be compared again in the next line and so on...

Thank you.
# 4  
Old 01-07-2010
ivpz, perhaps this will do:

Code:
$ cat dna.awk 
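{
    # Fields 2-5 hold the A, C, G, T counts; fields 6-9 hold a, c, g, t.
    for (i=2; i<=5; i++) {
        # Compare a pair only if at least one of its two counts is nonzero.
        if ($i || $(i+4)) {
            print $0, ($i>$(i+4) ? "+" : "-")
            getline    # read the next line, expected to be a duplicate
        }
    }
    # Last duplicate: split() returns the number of pieces, so more dots
    # than commas gives "+".
    print $0, (split($0, a, /\./) > split($0, a, /,/) ? "+" : "-")
}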


$ awk -f dna.awk data
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 +
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 +


Last edited by alister; 01-07-2010 at 04:02 AM..
# 5  
Old 01-07-2010
Quote:
Originally Posted by ivpz
.. There is only one A, i.e. 1 in column 2 and 0 in column 5.
Why column 5 ? If A, C, G, T are in columns 2, 3, 4, 5, then shouldn't "a" be in column 6 ?

Here's the line no. 8 (from the top) of your original post:

Code:
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0

Now, aren't the fields supposed to mean the following ?

Code:
Column 1                            Column 2   Column 3   Column 4   Column 5   Column 6   Column 7   Column 8   Column 9
The data                            "A" count  "C" count  "G" count  "T" count  "a" count  "c" count  "g" count  "t" count
=================================   =========  =========  =========  =========  =========  =========  =========  =========
.....,,..,,...,,......,...cA.c,cC.  1          1          0          0          0          3          0          0

So, count of all uppercase characters (A, C, G, T) in columns 2, 3, 4, 5 respectively = 1 + 1 + 0 + 0 = 2

Count of all lowercase characters (a, c, g, t) in columns 6, 7, 8, 9 respectively = 0 + 3 + 0 + 0 = 3

Hence, shouldn't line 8 be followed by "-" because count of uppercase characters < count of lowercase characters ?

tyler_durden
# 6  
Old 01-07-2010
Hey, durden_tyler:

From what I gathered, the columns mean what you think they mean, but comparison is only made between corresponding upper-lower case letters (A-a, C-c, G-g, T-t) wherein at least one member of the pair occurs in the line. Also, there are as many duplicates of each line as there are comparisons to be made.

Line 8 will have appended to it the result of comparing A-a (columns 2 and 6), a "+". Line 9 is C-c (columns 3 and 7), and gets "-". Line 10 is for the comma-dot comparison (in this case, a "+"). If there are no instances of either member of a pair, no comparison is made and no duplicate line appears for it.

alister
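The rule described above can be sketched as a small helper that prints, for one line, the signs its run of duplicates should receive. This is only an illustration under the assumptions stated in the thread (fields 2-5 are A C G T, fields 6-9 are a c g t, and the last duplicate gets the dot-versus-comma result); the name expected_signs is hypothetical.

```shell
# expected_signs: hypothetical helper; for each input line, prints the signs
# its run of duplicates should receive, in order.
expected_signs() {
    awk '{
        signs = ""
        # One sign per letter pair in which at least one count is nonzero.
        for (i = 2; i <= 5; i++)
            if ($i + 0 > 0 || $(i + 4) + 0 > 0)
                signs = signs ($i + 0 > $(i + 4) + 0 ? "+" : "-")
        # Final duplicate: dots versus commas, counted in the first field.
        s = $1; dots = gsub(/\./, ".", s)
        s = $1; commas = gsub(/,/, ",", s)
        print signs (dots > commas ? "+" : "-")
    }'
}

# A-a gives "+", C-c gives "-", dot-comma gives "+": prints "+-+".
printf '%s\n' '.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0' | expected_signs
```

The length of the printed string is also the number of duplicates the line is expected to have, which makes it easy to spot runs of the wrong length.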
# 7  
Old 01-07-2010
Quote:
Originally Posted by durden_tyler
Why column 5 ? If A, C, G, T are in columns 2, 3, 4, 5, then shouldn't "a" be in column 6 ?
Sorry, my mistake. Yes, it should be column 6.


Quote:
Originally Posted by durden_tyler
... Hence, shouldn't line 8 be followed by "-" because count of uppercase characters < count of lowercase characters ?
Each Uppercase A should be compared with the lowercase a only; in essence:

compare col2 and col6; if col2>col6, add + else - to a new col. If both col2 and col6 are 0 then compare col3 and col7 ...

---------- Post updated at 04:27 AM ---------- Previous update was at 04:20 AM ----------

Alister, thanks for your help. Can I ask you, what is the function of a, in the last line of the script?

Quote:
Originally Posted by alister

Code:
$ cat dna.awk 
{
    for (i=2; i<=5; i++) {
        if ($i || $(i+4)) {
            print $0, ($i>$(i+4) ? "+" : "-")
            getline
        }
    }
    print $0, (split($0, a, /\./) > split($0, a, /,/) ? "+" : "-")
}
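On the role of a: split() needs an array argument to store the pieces, but the script only uses the value split() returns, which is the number of pieces (one more than the number of separators found). A minimal sketch to see this, with the hypothetical name split_counts:

```shell
# split_counts: hypothetical helper showing what the two split() calls in
# the script return for each input line.
split_counts() {
    awk '{
        # split() fills the array a with the pieces and returns how many
        # pieces there are; a itself is overwritten and never read.
        by_dot = split($0, a, /\./)
        by_comma = split($0, a, /,/)
        print by_dot, by_comma
    }'
}

# Two dots and one comma: prints "3 2".
printf '%s\n' '..,' | split_counts
```

Since both results are off by one in the same direction, comparing them is equivalent to comparing the number of dots with the number of commas.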

Also, when I ran this script, most of the comparisons seem to be correct, but I got a few which are obviously incorrect, and one extra line was added at the end:

Code:
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 -

should be
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 -
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +

Another example:
Code:
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 -
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +

should be:
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 -
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +

Although some of the lines contain characters other than A, C, G and T, those characters should be ignored.

Last edited by ivpz; 01-07-2010 at 05:25 AM..
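One possible explanation for the shifted signs and the extra line, offered as a guess rather than a diagnosis: the script never checks the return value of getline, so if any run of duplicates is longer or shorter than the number of comparisons it needs, every later run is misaligned, and when getline hits end of input it leaves $0 unchanged, so the last line can be printed twice. A sketch that avoids getline by working out the signs once per run of identical lines might look like the following. The name annotate is hypothetical, and it assumes that consecutive identical lines belong to the same run.

```shell
# annotate: hypothetical alternative to dna.awk that never calls getline.
annotate() {
    awk '
    NR == 1 || ($0 "") != (prev "") {
        # A new run of identical lines starts: work out its signs once.
        signs = ""
        for (i = 2; i <= 5; i++)
            if ($i + 0 > 0 || $(i + 4) + 0 > 0)
                signs = signs ($i + 0 > $(i + 4) + 0 ? "+" : "-")
        s = $1; dots = gsub(/\./, ".", s)
        s = $1; commas = gsub(/,/, ",", s)
        signs = signs (dots > commas ? "+" : "-")
        pos = 0
        prev = $0
    }
    {
        # Assumption: a run longer than its sign list means two identical
        # groups are adjacent, so start again from the first sign.
        print $0, substr(signs, pos % length(signs) + 1, 1)
        pos++
    }' "$@"
}

# Three duplicates of one line get "-", "+", "+" in that order.
line=',,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0'
printf '%s\n' "$line" "$line" "$line" | annotate
```

Because each run restarts the sign list, a run of the wrong length affects only itself and cannot shift the results of the runs that follow; the dot and comma counts are also taken from the first field only.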