delete the same rows


 
# 1  
Old 08-08-2012

Each individual (row) has a genotype call for each SNP (column).

file1.txt

Code:
1 1 A G A T G T A A A A A A A A A A A A A A A A A A A A A
2 2 G A A A A A A A A A A A A A A A A A A A A A A A A A A
3 3 A A A A A A A A A A A A A A A A A A A A A A A A A A A
4 4 G A G T A T A A A A A A A A A A A A A A A A A A A A A
5 5 A A A A A A A A A A A A A A A A A A A A A A A A A A A

The first and second columns are the id number; they are identical.
First, I would like to see whether any individuals (rows) have duplicate genotypes.
Then, I also want to see whether any two individuals have over 95 percent identical genotypes.

So in this case, the 3rd and 5th individuals have exactly the same genotype,
and the 2nd and 3rd individuals share 26 of the 27 genotype columns (26/27 ~ 96.3%). So I want a script that can find the IDs with these potential problems.

Final output may look something like

file2.txt (for duplicate)
Code:
3
5

file3.txt (for similarity)
Code:
2
3

or

file3.txt (for problem)
Code:
2
3
5


Hope this makes sense.
Thanks in advance!
# 2  
Old 08-08-2012
Your thinking on this problem is confusing to me.

1. You first need to move the duplicates into a separate file, keeping just one copy of each duplicated genotype in the original; otherwise you cannot do the similarity test.
This assumes your example file is correct:
Code:
awk '{ key = ""; for (i = 3; i <= NF; i++) key = key OFS $i }   # key = the genotype columns only
     !seen[key]++      { print; next }                          # first time a genotype is seen: keep it
                       { print > "duplicate.file" }             # every later copy goes to duplicate.file
    ' inputfile > deduped.file
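
Note that only the second and later copies of a genotype land in duplicate.file; the first copy of each group stays in deduped.file. If you want every member of a duplicate group listed by id (the 3-and-5 style output in your file2.txt), a small separate pass could do it; here is a rough, untested sketch:

Code:
awk '{ key = ""; for (i = 3; i <= NF; i++) key = key OFS $i     # group rows by genotype
       ids[key] = (key in ids) ? ids[key] ORS $1 : $1; n[key]++ }
     END { for (k in n) if (n[k] > 1) print ids[k] }' inputfile > file2.txt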

2. I am not at all sure about your similarity test. If I understand what you want, and assuming some randomness in the sample, then by the time you have processed roughly 80 records from deduped.file you will start hitting a stream of almost nothing but "similarity" records.

This is because each column has only four choices: A, T, C, G. With your format and a complete set of unique records (say 80 records) to start with, you will reach "similarity saturation" at:
Code:
[number of columns]*4 records.

unless I am really missing something here. With your 27 genotype columns that formula gives 27 * 4 = 108 records. With other kinds of distributions you will reach full saturation more slowly, but you will still pick up loads of similarity records along the way.


So: how many records are in the file? Is the file structure you gave us representative, i.e., 20+ columns? If there are hundreds of columns and thousands of rows, running the similarity test will take a REALLY long time, since comparing every pair of rows costs on the order of
O(rows^2 * columns) operations.
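
Just to make that cost concrete, a brute-force all-pairs comparison would look something like the sketch below (untested, and assuming deduped.file keeps your two id columns followed by the genotype columns):

Code:
awk '
{
    ids[NR] = $1                                 # remember this row's id
    ncol = NF - 2                                # number of genotype columns
    for (i = 3; i <= NF; i++) g[NR, i] = $i      # store the genotype calls
}
END {
    for (a = 1; a < NR; a++)
        for (b = a + 1; b <= NR; b++) {
            same = 0
            for (i = 3; i <= NF; i++)
                if (g[a, i] == g[b, i]) same++
            if (same / ncol >= 0.95)             # your 95 percent threshold
                print ids[a], ids[b], same "/" ncol
        }
}' deduped.file > similar.pairs

Every pair of rows gets compared, which is where the rows^2 term comes from; for a few thousand rows that is already millions of row comparisons.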

3. Assuming we can get you something for #2, there is one other issue: you will run into "cross-similarity" records, for example where 12 and 17 are similar and 17 and 24 are similar at a different point in the sequence. That is almost unavoidable once there are more than a few hundred records.
How do we deal with those?

4. Finally: what shell do you have and what OS are you running?
# 3  
Old 08-10-2012
Could you provide a larger example? My idea is to use a checksum algorithm, but it needs to be tested on a broader data set.
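
Something along these lines is what I have in mind: a rough, untested sketch that assumes the file1.txt layout from post #1 (two id columns followed by the genotype columns). It checksums only the genotype part of each row and reports every id whose checksum occurs more than once, i.e. the exact-duplicate candidates:

Code:
while read -r id1 id2 geno; do
    # hash only the genotype columns, so identical genotypes collapse to one value
    printf '%s %s\n' "$id1" "$(printf '%s' "$geno" | cksum | cut -d" " -f1)"
done < file1.txt |
awk '{ ids[$2] = ($2 in ids) ? ids[$2] ORS $1 : $1; n[$2]++ }
     END { for (k in n) if (n[k] > 1) print ids[k] }' > file2.txt

Comparing one checksum per row is much cheaper than comparing the rows themselves once the file gets large; the 95 percent similarity part would still need a pairwise pass like the one sketched in post #2 above.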
# 4  
Old 08-10-2012
Very similar to earlier post https://www.unix.com/shell-programmin...me-column.html but now the data is arranged in rows.