Remove rows with same column values

06-15-2012

Registered User

30, 0

Join Date: Oct 2011

Last Activity: 2 July 2013, 3:18 AM EDT

Posts: 30

Thanks Given: 15

Thanked 0 Times in 0 Posts

Hi All,

I have a big file with 232 columns and 9 million rows. I want to delete every row in which col3 through col232 all hold the same value. The output should also be sorted on the first 2 columns.

Here is a reduced example with 6 columns. I want to remove the rows in which col3 through col6 all hold the same value.

Code:

chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T

Expected output:

Code:

chr1 98   A A A T
chr1 123 A T T -
chr1 234 A T G C
chr2 123 A A T T

Deleted rows:

Code:

chr1 567 T T T T
chr2 46 T T T T

Thanks, please help.

alpesh

06-15-2012

Registered User

2,524, 241

Join Date: Dec 2007

Last Activity: 17 March 2020, 2:04 PM EDT

Posts: 2,524

Thanks Given: 173

Thanked 241 Times in 206 Posts

Need more info

Do you expect to delete more rows than you keep, or to keep more? That may affect how this is approached.
Also, you do not need the deleted rows as output? Were they listed only to show which rows should be removed?

joeyg

06-15-2012

I am expecting to keep more rows. The rows with the same value in every column from col3 onward are not meaningful to my analysis.

alpesh

06-15-2012

Having a little trouble with this...

In theory, it should work.
The first command seems to work correctly, producing T T T T,
but the second is not filtering anything out.

Code:

$ cat sample17.txt | cut -d" " -f3- | sort | uniq -d >sample17a.txt

$ cat sample17.txt |  egrep -v -f sample17a.txt
chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T

joeyg

06-15-2012

The uniq -d option extracts duplicate rows, but I want to compare the column values within a single row.

For example, I want to delete every line whose col3 onward looks like one of the following:

Code:

A A A A
T T T T
- - - -

alpesh
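
The per-row comparison described in this post can be sketched with awk. This is only a minimal sketch and not a solution posted in the thread: the function name drop_uniform_rows is hypothetical, and it assumes whitespace-separated fields with the values to compare starting at column 3.

```shell
# Hypothetical helper: print only the rows whose fields 3..NF are NOT all identical.
# Reads the files given as arguments, or standard input when none are given.
drop_uniform_rows() {
    awk '{
        same = 1
        for (i = 4; i <= NF; i++) {
            # Appending "" forces a string comparison, so "1" and "1.0" stay different.
            if (($i "") != ($3 "")) { same = 0; break }
        }
        if (!same) print
    }' "$@"
}

# Small demonstration on three made-up rows.
printf '%s\n' 'chr1 1 A A A A' 'chr1 2 T T T -' 'chr1 3 - - - -' | drop_uniform_rows
```

Unlike uniq -d, which compares whole lines with their neighbours, this looks only inside one row at a time. A row with three or fewer fields is treated as uniform and dropped, which is an assumption of this sketch.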

06-15-2012

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

So do you mean to:

remove all rows that have exactly the same value in cols 3 through 232, and
remove all rows that have duplicate values in cols 3 through 6, and
sort the output on cols 1 and 2?

Scrutinizer

06-15-2012

I want to remove all rows that have exactly the same value in cols 3 through 232.
Cols 3 through 6 were just a shortened example of cols 3 through 232 in the main file.

This has nothing to do with duplicate rows; those should be kept.

Yes, the output should be sorted by col1 first and then by col2 within each col1 value,
so all rows with the value 'chr1' in col1 always appear before rows with the value 'chr2' in col1.

Valid output

Code:

chr1 232 A A G C
chr1 789 T T T - 
chr2 345 A A G C
chr3 456 A A G C

Invalid output rows

Code:

chr3 678 A A A A
chr5 765 G G G G
chr6 433 - - - -

Last edited by Scrutinizer; 06-15-2012 at 06:45 PM.. Reason: code tags

alpesh
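
Putting the clarified requirements together, the filter and the sort could be combined as below. This is a sketch under stated assumptions, not an answer from the thread: the function name filter_and_sort and the file name sample.txt are hypothetical, fields are assumed to be whitespace-separated, and col2 is assumed to be numeric.

```shell
# Hypothetical helper: drop rows whose fields 3..NF are all identical,
# then sort the survivors by col1 (as text) and col2 (as a number).
filter_and_sort() {
    awk '{
        for (i = 4; i <= NF; i++)
            # The first field that differs from col3 proves the row is not uniform.
            if (($i "") != ($3 "")) { print; next }
    }' "$@" | LC_ALL=C sort -k1,1 -k2,2n
}

# The 6-column sample from the first post, saved under an assumed file name.
cat > sample.txt <<'EOF'
chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T
EOF

filter_and_sort sample.txt
```

On the sample this prints the four rows listed as expected output in the first post. awk reads the file one line at a time, so only sort has to deal with the surviving rows. Note that -k1,1 compares col1 as plain text, so a value such as chr10 would sort before chr2; with GNU sort, the version-sort option (-V) may be worth trying for that key if natural ordering matters.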
