Help with deleting specific rows from a text file


 
# 1  
Old 06-28-2011

I know this is a complicated question but I will try to illustrate it with some data. I have a data file that looks like the following:


Code:
1341 NA06985 0 0 2 46.6432798439
1341 NA06991 NA06993 NA06985 2 48.8478948517
1341 NA06993 0 0 1 45.8022601455
1340 NA06994 0 0 1 48.780669145
1340 NA07000 0 0 2 47.7312017846
1340 NA07019 NA07022 NA07056 2 41.7389244255
1340 NA07022 0 0 1 54.1498530714
1340 NA07029 NA06994 NA07000 1 X
1341 NA07034 0 0 1 41.709838673
1341 NA07048 NA07034 NA07055 1 41.4599808018
1341 NA07055 0 0 2 43.7346131504
1340 NA07056 0 0 2 43.8415287938
1345 NA07345 0 0 2 35.6671940928
1345 NA07348 NA07357 NA07345 2 44.3923953539
1345 NA07357 0 0 1 45.179924889
1408 NA10830 NA12154 NA12236 1 33.3463998717
1408 NA10831 NA12155 NA12156 2 46.9172160682
1416 NA10835 NA12248 NA12249 1 33.2843722268
1420 NA10838 NA12003 NA12004 1 43.9668852859
1420 NA10839 NA12005 NA12006 2 44.5388697648
1334 NA10846 NA12144 NA12145 1 37.4468745153
1334 NA10847 NA12146 NA12239 2 45.605211554
1344 NA10851 NA12056 NA12057 1 37.928057554
1349 NA10854 NA11839 NA11840 2 47.1457402335
1350 NA10855 NA11831 NA11832 2 X
1350 NA10856 NA11829 NA11830 1 X
1346 NA10857 NA12043 NA12044 1 59.3261972639
1347 NA10859 NA11881 NA11882 2 60.5802420929
1362 NA10860 NA11992 NA11993 1 55.428533745
1362 NA10861 NA11994 NA11995 2 52.5134811264
1375 NA10863 NA12264 NA12234 2 44.3368601343
1350 NA11829 0 0 1 33.4327616207
1350 NA11830 0 0 2 33.0018192844
1350 NA11831 0 0 1 48.8652993625
1350 NA11832 0 0 2 51.7719464358

I want to look at the first column and delete every line whose first-column value appears in fewer than 6 rows of the file. For example, 1341 and 1350 each start 6 rows, so I would keep those rows. But only 2 rows start with 1362, so those rows would be deleted. In the end my output would look like this:

Code:
1341 NA06985 0 0 2 46.6432798439
1341 NA06991 NA06993 NA06985 2 48.8478948517
1341 NA06993 0 0 1 45.8022601455
1340 NA06994 0 0 1 48.780669145
1340 NA07000 0 0 2 47.7312017846
1340 NA07019 NA07022 NA07056 2 41.7389244255
1340 NA07022 0 0 1 54.1498530714
1340 NA07029 NA06994 NA07000 1 X
1341 NA07034 0 0 1 41.709838673
1341 NA07048 NA07034 NA07055 1 41.4599808018
1341 NA07055 0 0 2 43.7346131504
1340 NA07056 0 0 2 43.8415287938
1350 NA10855 NA11831 NA11832 2 X
1350 NA10856 NA11829 NA11830 1 X
1350 NA11829 0 0 1 33.4327616207
1350 NA11830 0 0 2 33.0018192844
1350 NA11831 0 0 1 48.8652993625
1350 NA11832 0 0 2 51.7719464358

Thanks a lot!

Last edited by joeyg; 06-28-2011 at 03:04 PM.. Reason: Please wrap scripts and data with CodeTags - makes easier to see & cut/paste
# 2  
Old 06-28-2011
In two steps

If I thought about it a little more, I could probably do it in one...

Code:
$ cut -f1 -d" " sample3.txt | sort | uniq -c | awk '{if($1>=6) print $2}' >sample3.gd

$ grep -f sample3.gd sample3.txt
1341 NA06985 0 0 2 46.6432798439
1341 NA06991 NA06993 NA06985 2 48.8478948517
1341 NA06993 0 0 1 45.8022601455
1340 NA06994 0 0 1 48.780669145
1340 NA07000 0 0 2 47.7312017846
1340 NA07019 NA07022 NA07056 2 41.7389244255
1340 NA07022 0 0 1 54.1498530714
1340 NA07029 NA06994 NA07000 1 X
1341 NA07034 0 0 1 41.709838673
1341 NA07048 NA07034 NA07055 1 41.4599808018
1341 NA07055 0 0 2 43.7346131504
1340 NA07056 0 0 2 43.8415287938
1350 NA10855 NA11831 NA11832 2 X
1350 NA10856 NA11829 NA11830 1 X
1350 NA11829 0 0 1 33.4327616207
1350 NA11830 0 0 2 33.0018192844
1350 NA11831 0 0 1 48.8652993625
1350 NA11832 0 0 2 51.7719464358

The first command writes the first-column values that occur at least 6 times to sample3.gd.
The second prints the lines of sample3.txt that match any of those values.

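Thinking about it more, this would probably be better:
Code:
$ cut -f1 -d" " sample3.txt | sort | uniq -c | awk '{if($1>=6) print $2}' | sed 's/.*/^& /' >sample3.gd1
$ grep -f sample3.gd1 sample3.txt

The sed step puts a ^ in front of each value and a space after it, so grep only matches lines whose whole first column is one of the values, not lines that merely contain the text somewhere else.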
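
For what it's worth, the one-step version hinted at above could look something like the sketch below. It is not from the original posts: the function name keep_frequent is made up for illustration, and it assumes whitespace-separated columns and a file small enough to read twice. awk is given the same file name twice, counts the first-column values on the first pass, and prints the qualifying lines on the second, so the original line order is kept and no temporary pattern file is needed.

```shell
# keep_frequent MIN FILE   (hypothetical helper name, for illustration only)
# Print the lines of FILE whose first field occurs on at least MIN lines,
# preserving the original line order.
keep_frequent() {
    min=$1
    file=$2
    # The file is named twice, so awk reads it twice:
    #   pass 1 (NR == FNR): count how often each first field occurs
    #   pass 2: print the lines whose first field reached the threshold
    awk -v min="$min" 'NR == FNR { count[$1]++; next } count[$1] >= min' "$file" "$file"
}
```

With the data above this would be called as keep_frequent 6 sample3.txt. Because awk compares the whole first field, there is no risk of a value matching elsewhere in the line.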
 