Remove rows with same column values


 
# 1  
Old 06-15-2012
Remove rows with same column values

Hi All,

I have a big file with 232 columns and 9 million rows. I want to delete all rows that have the same value in col3 through col232. The output should also be sorted on the first 2 columns.

Here is a reduced example with 6 columns. I want to remove rows with duplicate values in col3 through col6.

Code:
chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T

expected output
Code:
chr1 98   A A A T
chr1 123 A T T -
chr1 234 A T G C
chr2 123 A A T T

deleted rows
Code:
chr1 567 T T T T
chr2 46 T T T T


Thanks, please help.
# 2  
Old 06-15-2012
Need more info

Are you expecting to delete more rows than you keep, or to keep more? That might affect how this is approached.
Also, you are not concerned about the deleted rows themselves? Was that list just to show what you want removed?
# 3  
Old 06-15-2012
I am expecting to keep more rows, and the rows with duplicate values in columns are not meaningful to my analysis.
# 4  
Old 06-15-2012
Having a little trouble with this...

But, in theory, it should work.
The first command seems to work correctly, building T T T T,
but the second is not quite working.


Code:
$ cat sample17.txt | cut -d" " -f3- | sort | uniq -d >sample17a.txt

$ cat sample17.txt |  egrep -v -f sample17a.txt
chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T

# 5  
Old 06-15-2012
The uniq -d option extracts duplicate rows; I want to compare the column values within a single row.

For example, I want to delete all of the following lines:
Code:
A A A A
T T T T
- - - -
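
A per-row comparison of this kind could be sketched in awk: walk fields 4 through NF and keep the row as soon as one of them differs from field 3. This is only a minimal sketch, assuming whitespace-separated columns; the function name keep_varying_rows and the sample rows are made up for illustration.

```shell
# Hypothetical helper: print only rows where col3..NF are NOT all identical.
# Assumes whitespace-separated columns (awk's default field splitting).
# Appending "" forces a string comparison even if the fields look numeric.
# Rows with 3 or fewer fields are treated as uniform and dropped.
keep_varying_rows() {
  awk '{ for (i = 4; i <= NF; i++) if (($i "") != ($3 "")) { print; next } }' "$@"
}

# Only the first row has differing values in col3..col6, so only it is printed.
printf '%s\n' 'chr1 234 A T G C' 'chr1 567 T T T T' 'chr6 433 - - - -' |
  keep_varying_rows
```

Because the loop stops at the first mismatch, rows that vary early are handled cheaply even with 232 columns.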

# 6  
Old 06-15-2012
So do you mean to:
  • remove all rows that have exactly the same value in cols 3 through 232, and
  • remove all rows that have duplicate values in cols 3 through 6, and
  • sort the output on cols 1 and 2 ?
# 7  
Old 06-15-2012
I want to remove all rows that have exactly the same value in cols 3 through 232.
Col 3 through 6 was just a shortened example of cols 3 through 232 in the main file.

This has nothing to do with duplicate rows; those should stay in the output.

Yes, the output should be sorted on col1 first and then on col2 within each col1 value,
so all rows with value 'chr1' in col1 should always appear before rows with the value 'chr2' in col1.



Valid output

Code:
chr1 232 A A G C
chr1 789 T T T - 
chr2 345 A A G C
chr3 456 A A G C

Invalid output rows

Code:
chr3 678 A A A A
chr5 765 G G G G
chr6 433 - - - -
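
Putting the requirement together, one possible approach is to filter with awk and then pipe the surviving rows to sort. The sketch below assumes whitespace-separated columns, a text sort on col1 and a numeric sort on col2; the function name filter_and_sort is hypothetical, and the input is the 6-column sample from post #1.

```shell
# Hypothetical helper: drop rows whose col3..NF all hold the same value,
# then sort on col1 (text) and col2 (numeric).
filter_and_sort() {
  awk '{ for (i = 4; i <= NF; i++) if (($i "") != ($3 "")) { print; next } }' "$@" |
    LC_ALL=C sort -k1,1 -k2,2n
}

# Sample data from post #1; the two "T T T T" rows are removed.
filter_and_sort <<'EOF'
chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T
EOF
```

Note that a plain text sort on col1 places chr10 before chr2. If the real file has such names, a version sort (for example the -V option in GNU sort) may be worth trying, but that is an assumption about the data and the sort implementation available.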


Last edited by Scrutinizer; 06-15-2012 at 06:45 PM.. Reason: code tags