10-09-2009
I apologize for the confusion! I understand where I need to go with this, but I have no clue how to tell the computer to do it, so it is hard for me to explain it to others as well. Let me try again.
My input file currently looks like this:
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT
I want to recode the missing values (00) as a single period and save an output file like this:
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA . AT GT . AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 . AA TT GG . GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA . GG CC AG GG CT
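A minimal Python sketch of this first step, assuming whitespace-separated fields with the ID in the first column. The function name `mask_missing` and its parameters are made up for illustration. It compares whole fields rather than doing a global text substitution, because an ID such as 83846900 also contains "00" and would be corrupted by a plain search-and-replace.

```python
def mask_missing(lines, missing="00", marker="."):
    """Replace the missing-genotype code with a marker in every data column."""
    out = []
    for i, line in enumerate(lines):
        fields = line.split()
        if i > 0:  # leave the header row untouched
            # fields[0] is the ID, so only the later columns are rewritten
            fields[1:] = [marker if f == missing else f for f in fields[1:]]
        out.append(" ".join(fields))
    return out


sample = [
    "ID 1 2 3 4 5 6 7 8",
    "83846900 AA AA TT GG CC AG CC TT",
    "83847041 AA 00 AT GT 00 AG CG CT",
    "83847118 00 AA TT GG 00 GG CC CT",
]
masked = mask_missing(sample)
print("\n".join(masked))
```

To run this on a real file, the lines could be read with `open(path).read().splitlines()` and the result written back out the same way.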
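Then I need to create an output file that has all of the letters recoded as -1, 0, or 1. This should be done in alphabetical order and on a per-column basis: within each column, the doubled letter that comes first alphabetically becomes -1, the mixed pair becomes 0, and the other doubled letter becomes 1 (for example, in column 1, AA is -1, AG is 0, and GG is 1), so that: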
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 -1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 1 0 0 0 0 0 0
83847118 . -1 1 -1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . -1 -1 0 1 0
Finally, I need to calculate the average of each column and replace the missing values in that column with the average:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 -1 -1 0 -1 1
83847041 -1 -0.5 0 0 -0.5 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 1 0 0 0 0 0 0
83847118 -0.25 -1 1 -1 -0.5 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 0.5 -1 -1 0 1 0
This will be the final file. Does this make more sense, or have I confused you more?
Thanks