## Writing an algorithm to recode data points

# 1
10-23-2009

I have a file that has been partially recoded so that data points that were formerly letter combinations are now -1, 0, or 1. I need to finish recoding the GG and CC data points. The file looks like this:

ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 CC -1 CC CC
83846900 -1 -1 1 GG CC 0 CC 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 CC 0 0 0
83847085 0 CC 0 0 0 0 0 0
83847118 . -1 1 GG . GG CC 0
83847162 GG -1 1 0 0 0 0 0
83847165 -1 -1 . GG CC 0 GG 0

The problem with GG and CC is that either one can become -1 or 1, depending on what has already been recoded in the same column. If a GG is in a column that already contains 1's, then GG must be -1. If the GG is in a column that already contains -1's, then the GG must be 1. The same is true for CC. I have more than 64,000 columns in total, so I cannot go through and list which column is which. It has been suggested that I write an algorithm to do this, but I am not very familiar with programming. Can anyone help me?
Thanks!

doobedoo
# 2
10-23-2009
So, in your example, for just GG:
column 1 GG would be 1
column 4 GG would be -1
column 6 GG would be 1
but:
a) column 7 GG would be... what? There is no 1 or -1.
b) column 5 CC does not have a 1 or -1 either?
c) are you guaranteed to have at least one 1 or -1 in each column in the entire file?
d) are the GG's to be recoded first, and then the CC's recoded based on the recoded GG's?

sdanner
# 3
10-23-2009
Also e) Do you guarantee that no column contains both -1 and 1?
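Questions c) and e) can be checked mechanically before any recoding is attempted. The following is a minimal Python sketch, not something posted in the thread: it tallies the values in each column of the sample data (ID column left out) and lists the columns that contain both -1 and 1, and the columns that contain neither. The names `SAMPLE`, `tally_columns`, `columns_with_both` and `columns_with_neither` are made up for illustration.

```python
from collections import Counter

# Data cells of the sample file from post #1, without the header and ID column.
SAMPLE = """\
0 0 0 0 CC -1 CC CC
-1 -1 1 GG CC 0 CC 1
-1 . 0 0 . 0 0 0
0 -1 1 1 CC 0 0 0
0 CC 0 0 0 0 0 0
. -1 1 GG . GG CC 0
GG -1 1 0 0 0 0 0
-1 -1 . GG CC 0 GG 0
"""


def tally_columns(rows):
    """Return one Counter per column, counting each distinct cell value."""
    return [Counter(column) for column in zip(*rows)]


def columns_with_both(rows):
    """1-based numbers of columns that hold both -1 and 1 (question e)."""
    return [n for n, counts in enumerate(tally_columns(rows), start=1)
            if counts["-1"] and counts["1"]]


def columns_with_neither(rows):
    """1-based numbers of columns that hold neither -1 nor 1 (question c)."""
    return [n for n, counts in enumerate(tally_columns(rows), start=1)
            if not counts["-1"] and not counts["1"]]


rows = [line.split() for line in SAMPLE.splitlines()]
print("both -1 and 1:", columns_with_both(rows))
print("neither -1 nor 1:", columns_with_neither(rows))
```

On the eight sample columns this finds no column with both values, and reports columns 5 and 7 as having neither, which matches questions a) and b) above.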
# 4
10-23-2009
Yes, so far you are correct on the GG assignments. I think it may help more if I answer your questions out of order:

c) are you guaranteed to have at least one 1 or -1 in each column in the entire file? NO. Some columns may have a combination of -1's, 0's, and 1's, and some columns may have only two of these. I do not believe there are any columns that will be all 1's, all -1's, or all 0's.

d) are the GG's to be recoded first, and then the CC's recoded based on the recoded GG's? Perhaps it would be easiest to recode the CC's first and then recode the GG's based on how the CC's are coded. CC will be 1 only in columns that already have -1 in them. In all other cases CC will be -1. When CC and GG are in a column together (as in column 7), CC = -1 and GG = 1.

a) column 7 GG would be... what? There is no 1 or -1.
b) column 5 CC does not have a 1 or -1 either?
These are so confusing because I provided a small, bad example! I apologize!! My real file has 1079 rows and 64,000 columns. So, if there is a column with only 0's and GG's, the GG's will be 1. If there is a column with only 0's (or missing values) and CC's, the CC's will be -1.

I hope this helps! Please let me know if you have any other questions!
Thanks!!
doobedoo
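Read together with post #1, the answers above seem to boil down to one decision per column, based on which of -1 and 1 is already present. Here is a rough Python sketch of that reading. The function name `letter_codes` is hypothetical, and raising an error for a column that holds both -1 and 1 is an assumption, since the thread does not say what GG should become in that case.

```python
def letter_codes(column):
    """Return the replacement strings (cc, gg) for one column.

    `column` must be a list or tuple of cell strings such as "-1", "0",
    "1", ".", "CC", "GG" (not one long string, or `in` would match
    substrings instead of whole cells).
    """
    has_minus = "-1" in column
    has_plus = "1" in column
    if has_minus and has_plus:
        # Assumption: the thread states no rule for this case, so refuse to guess.
        raise ValueError("column holds both -1 and 1; no rule stated")
    if has_minus:
        return "1", "1"      # column already has -1: CC and GG become 1
    if has_plus:
        return "-1", "-1"    # column already has 1: CC and GG become -1
    return "-1", "1"         # only 0's or missing values: CC = -1, GG = 1


# Columns 2, 7 and 8 of the sample file, top to bottom.
column_2 = ["0", "-1", ".", "-1", "CC", "-1", "-1", "-1"]
column_7 = ["CC", "CC", "0", "0", "0", "CC", "0", "GG"]
column_8 = ["CC", "1", "0", "0", "0", "0", "0", "0"]

for name, column in [("2", column_2), ("7", column_7), ("8", column_8)]:
    cc, gg = letter_codes(column)
    print(f"column {name}: CC -> {cc}, GG -> {gg}")
```

Keeping the decision in one small function makes it easy to change if the rule for mixed columns turns out to be different.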
# 5
10-26-2009
I have tried to come up with an easier way to describe what I need to do with my file. If you recall I have a file that looks like this:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 CC -1 CC CC
83846900 -1 -1 1 GG CC 0 CC 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 CC 0 0 0
83847085 0 CC 0 0 0 0 0 0
83847118 . -1 1 GG . GG CC 0
83847162 GG -1 1 0 0 0 0 0
83847165 -1 -1 . GG CC 0 GG 0

I have to get everything recoded for a specific program that only recognizes -1, 0, and 1, but the CC and GG cause an issue because they can take on different values from column to column. Perhaps calculating the minimum and maximum of each column would be easier:

When min = -1 and max = 0, then both CC and GG = 1;
When min = 0 and max = 1, then both CC and GG = -1;
When both the min and max = 0, then CC = -1 and GG = 1.

I hope this clears things up a bit! Any help you could provide would be GREATLY appreciated!

Thanks,
Doob
doobedoo
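The min/max idea can be tried on a single column with a few lines of Python. This is only a sketch under stated assumptions: the minimum and maximum are taken over the cells that are already -1, 0 or 1 (letters and "." are ignored), and the 0-to-1 case is read as CC = GG = -1, following post #1, where a GG in a column that already contains 1's must be -1. Any other (min, max) pair, such as (-1, 1), is left untouched because no rule is given for it. The function names are invented for the example.

```python
def column_min_max(cells):
    """Min and max over the numeric cells only; None if there are none."""
    numbers = [int(cell) for cell in cells if cell in ("-1", "0", "1")]
    return (min(numbers), max(numbers)) if numbers else None


def recode_column(cells):
    """Replace CC and GG according to the three min/max cases."""
    value_range = column_min_max(cells)
    if value_range == (-1, 0):
        swap = {"CC": "1", "GG": "1"}
    elif value_range == (0, 1):
        swap = {"CC": "-1", "GG": "-1"}   # assumed reading, see the note above
    elif value_range == (0, 0):
        swap = {"CC": "-1", "GG": "1"}
    else:
        return list(cells)                # no rule stated: leave the column alone
    return [swap.get(cell, cell) for cell in cells]


# Columns 4 and 7 of the sample file, top to bottom.
column_4 = ["0", "GG", "0", "1", "0", "GG", "0", "GG"]
column_7 = ["CC", "CC", "0", "0", "0", "CC", "0", "GG"]

print(column_min_max(column_4), recode_column(column_4))
print(column_min_max(column_7), recode_column(column_7))
```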
# 6
10-26-2009
I think I get what you mean:
- min = the lowest value in the column
- max = the highest value in the column
But to prove it, please post the expected output for the table using the new algorithm.
Please use the correctly formatted table posted by sdanner and put it between code tags. Thanks

# 7
10-26-2009
Here is the original file, as formatted by sdanner:
Code:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 CC -1 CC CC
83846900 -1 -1 1 GG CC 0 CC 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 CC 0 0 0
83847085 0 CC 0 0 0 0 0 0
83847118 . -1 1 GG . GG CC 0
83847162 GG -1 1 0 0 0 0 0
83847165 -1 -1 . GG CC 0 GG 0

And here is the final output I would expect based on the minimum/maximum criteria:
Code:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 -1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 1 0 0 0 0 0 0
83847118 . -1 1 -1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . -1 -1 0 1 0
Thanks!
Doob
doobedoo
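For a file with about 64,000 columns, one possible approach is to read the file twice instead of holding it all in memory: the first pass keeps a running minimum and maximum per column, and the second pass rewrites each row. The Python sketch below illustrates that idea on the sample input. It is not a solution posted in the thread, and it assumes whitespace-separated cells, a single header line, and the reading of the 0-to-1 case as CC = GG = -1 (the reading that matches the expected output above). `RULES`, `column_ranges` and `recode_stream` are made-up names.

```python
import io

INPUT = """\
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 CC -1 CC CC
83846900 -1 -1 1 GG CC 0 CC 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 CC 0 0 0
83847085 0 CC 0 0 0 0 0 0
83847118 . -1 1 GG . GG CC 0
83847162 GG -1 1 0 0 0 0 0
83847165 -1 -1 . GG CC 0 GG 0
"""

EXPECTED = """\
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 -1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 1 0 0 0 0 0 0
83847118 . -1 1 -1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . -1 -1 0 1 0
"""

NUMERIC = {"-1", "0", "1"}

# (min, max) of the already recoded cells -> replacements for the letters.
RULES = {
    (-1, 0): {"CC": "1", "GG": "1"},
    (0, 1): {"CC": "-1", "GG": "-1"},
    (0, 0): {"CC": "-1", "GG": "1"},
}


def column_ranges(lines):
    """Pass 1: running (min, max) per data column, None where no number was seen."""
    ranges = None
    for line in lines:
        cells = line.split()[1:]              # drop the ID
        if not cells:
            continue                          # skip blank lines
        if ranges is None:
            ranges = [None] * len(cells)
        for i, cell in enumerate(cells):
            if cell in NUMERIC:
                value = int(cell)
                low, high = ranges[i] or (value, value)
                ranges[i] = (min(low, value), max(high, value))
    return ranges or []


def recode_stream(src, dst):
    """Pass 2: copy src to dst, replacing CC/GG where a rule applies."""
    header = src.readline()                   # the header is not data
    body_start = src.tell()
    ranges = column_ranges(src)
    src.seek(body_start)                      # rewind to the first data row
    dst.write(header)
    for line in src:
        cells = line.split()
        if not cells:
            continue
        out = [cells[0]]
        for cell, value_range in zip(cells[1:], ranges):
            # Columns without a matching rule keep their cells unchanged.
            out.append(RULES.get(value_range, {}).get(cell, cell))
        dst.write(" ".join(out) + "\n")


src, dst = io.StringIO(INPUT), io.StringIO()
recode_stream(src, dst)
print(dst.getvalue() == EXPECTED)
```

The final comparison prints `True`, so under these assumptions the sketch reproduces the expected output for the sample. For the real data the two `io.StringIO` objects would be replaced by files opened for reading and writing. Any column whose minimum and maximum fall outside the three listed cases is passed through with its CC and GG cells unchanged, so those columns would still need a decision.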
