delete the same rows


 
# 1  
Old 08-08-2012

Each individual (row) has a genotype call for each SNP (column).

file1.txt

Code:
1 1 A G A T G T A A A A A A A A A A A A A A A A A A A A A
2 2 G A A A A A A A A A A A A A A A A A A A A A A A A A A
3 3 A A A A A A A A A A A A A A A A A A A A A A A A A A A
4 4 G A G T A T A A A A A A A A A A A A A A A A A A A A A
5 5 A A A A A A A A A A A A A A A A A A A A A A A A A A A

The first and second columns are the id number; they are identical.
First, I would like to see whether any individuals (rows) have duplicate genotypes.
Then, I also want to see whether any two individuals have over 95 percent identical genotypes.

So in this case, the 3rd and 5th individuals have exactly the same genotype,
and the 2nd and 3rd individuals share 26 of the 27 genotype columns (26/27 ~ 96.3%). So I want a script that can find the IDs with these potential problems.

Final output may look something like

file2.txt (for duplicate)
Code:
3
5

file3.txt (for similarity)
Code:
2
3

or

file3.txt (for problem)
Code:
2
3
5


Hope this makes sense.
Thanks in advance!
# 2  
Old 08-08-2012
Your thinking on this problem is confusing to me.

1. You first need to move the duplicates into a separate file, keeping just one copy of each duplicated genotype in the original; otherwise you cannot do the similarity test.
This assumes your example file is correct:
Code:
awk '{ key = ""; for (i = 3; i <= NF; i++) key = key OFS $i }   # key = the genotype columns only
     !seen[key]++      { print; next }                          # first time a genotype is seen: keep it
                       { print > "duplicate.file" }             # every later copy goes to duplicate.file
    ' inputfile > deduped.file
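
Note that only the second and later copies of a genotype land in duplicate.file; the first copy of each group stays in deduped.file. If you want every member of a duplicate group listed by id (the 3-and-5 style output in your file2.txt), a small separate pass could do it; here is a rough, untested sketch:

Code:
awk '{ key = ""; for (i = 3; i <= NF; i++) key = key OFS $i     # group rows by genotype
       ids[key] = (key in ids) ? ids[key] ORS $1 : $1; n[key]++ }
     END { for (k in n) if (n[k] > 1) print ids[k] }' inputfile > file2.txt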

2. I am not at all sure about your similarity test. If I understand what you want, and assuming some randomness in the sample, then by the time you have processed roughly 80 records from deduped.file you will start hitting a stream of almost nothing but "similarity" records.

This is because each column has only four choices: A, T, C, G. With your format and a complete set of unique records (say 80 records) to start with, you will reach "similarity saturation" at:
Code:
[number of columns]*4 records.

unless I am really missing something here. With your 27 genotype columns that formula gives 27 * 4 = 108 records. With other kinds of distributions you will reach full saturation more slowly, but you will still pick up loads of similarity records along the way.


So: how many records are in the file? Is the file structure you gave us representative, i.e., 20+ columns? If there are hundreds of columns and thousands of rows, running the similarity test will take a REALLY long time, since comparing every pair of rows costs on the order of
O(rows^2 * columns) operations.
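
Just to make that cost concrete, a brute-force all-pairs comparison would look something like the sketch below (untested, and assuming deduped.file keeps your two id columns followed by the genotype columns):

Code:
awk '
{
    ids[NR] = $1                                 # remember this row's id
    ncol = NF - 2                                # number of genotype columns
    for (i = 3; i <= NF; i++) g[NR, i] = $i      # store the genotype calls
}
END {
    for (a = 1; a < NR; a++)
        for (b = a + 1; b <= NR; b++) {
            same = 0
            for (i = 3; i <= NF; i++)
                if (g[a, i] == g[b, i]) same++
            if (same / ncol >= 0.95)             # your 95 percent threshold
                print ids[a], ids[b], same "/" ncol
        }
}' deduped.file > similar.pairs

Every pair of rows gets compared, which is where the rows^2 term comes from; for a few thousand rows that is already millions of row comparisons.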

3. Assuming we can get you something for #2, there is one other issue: you will run into "cross-similarity" records, for example where 12 and 17 are similar and 17 and 24 are similar at a different point in the sequence. That is almost unavoidable once there are more than a few hundred records.
How do we deal with those?

4. Finally: what shell do you have and what OS are you running?
# 3  
Old 08-10-2012
Could you provide a larger example? My idea is to use a checksum algorithm, but it needs to be tested on a broader data set.
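
Something along these lines is what I have in mind: a rough, untested sketch that assumes the file1.txt layout from post #1 (two id columns followed by the genotype columns). It checksums only the genotype part of each row and reports every id whose checksum occurs more than once, i.e. the exact-duplicate candidates:

Code:
while read -r id1 id2 geno; do
    # hash only the genotype columns, so identical genotypes collapse to one value
    printf '%s %s\n' "$id1" "$(printf '%s' "$geno" | cksum | cut -d" " -f1)"
done < file1.txt |
awk '{ ids[$2] = ($2 in ids) ? ids[$2] ORS $1 : $1; n[$2]++ }
     END { for (k in n) if (n[k] > 1) print ids[k] }' > file2.txt

Comparing one checksum per row is much cheaper than comparing the rows themselves once the file gets large; the 95 percent similarity part would still need a pairwise pass like the one sketched in post #2 above.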
# 4  
Old 08-10-2012
Very similar to earlier post https://www.unix.com/shell-programmin...me-column.html but now the data is arranged in rows.