Remove sections based on duplicate first line


 
# 1  
Old 01-16-2015

Hi,

I have a file with many sections in it. The sections are separated by blank lines.
The first line of each section determines whether the section is a duplicate.
If a section is a duplicate, the entire section should be removed from the file.

Below is an example of input and output. The lines starting with *& are the first lines of the sections, and there are 2 sections with the same first line. I need to delete one of them.

Code:
Input:
*& abc def
1
2
3
4
5

*& cde efg
1
2
3

*& abc def
1
2
3
4
5

Code:
Output:
*& cde efg
1
2
3

*& abc def
1
2
3
4
5

Thanks for your help!!
# 2  
Old 01-16-2015
Hello,
If the output order of the sections is not important, with (GNU) awk:
Code:
awk 'BEGIN{RS=""} {A[$0]=1} END{for (h in A) print h "\n"}' file

Regards.
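
If the order of the sections does matter, a variant along the same lines might help. This is only a sketch under a few assumptions: sections are separated by truly empty lines, the first line of a section is its key, and the first copy is the one to keep. The function name dedup_sections is made up for illustration.

```shell
# dedup_sections is a hypothetical helper name, not part of the thread.
# It keeps the first section seen for each header line and preserves input order.
dedup_sections() {
    awk 'BEGIN { RS = ""; FS = "\n"; ORS = "\n\n" }   # paragraph mode: one record per section
         !seen[$1]++                                   # $1 is the first line of the section
        ' "$@"
}

# Small demo on inline data; the second "*& abc def" section is dropped.
printf '*& abc def\n1\n2\n\n*& cde efg\n1\n\n*& abc def\n9\n' | dedup_sections
```

Unlike the array version, this compares only the first line, so two sections with the same header but different bodies are still treated as duplicates.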
# 3  
Old 01-16-2015
That works if DOS <CR> line terminators are removed from the input file. Try also
Code:
awk '/^\*\&/ {STOP=($0 in T); T[$0]} /^ *$/ {STOP=0} !STOP' file
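
The reason the <CR> characters matter is that a "blank" line in a DOS file still contains a carriage return, so it is not empty as far as awk is concerned. One possible way to strip them first is sketched below; strip_cr is a made-up helper name, and the sketch assumes the CR only ever appears at the end of a line.

```shell
# strip_cr is a hypothetical helper name, not part of the thread.
# It removes a trailing carriage return from every line and prints the rest unchanged.
strip_cr() {
    awk '{ sub(/\r$/, "") } 1' "$@"
}

# Demo: DOS-style input comes out with plain newlines.
printf 'a\r\n\r\nb\r\n' | strip_cr
```

The cleaned output can then be piped into either awk command, for example strip_cr file | awk '...'.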


Last edited by RudiC; 01-16-2015 at 04:17 PM.. Reason: removed the "4" from file name
# 4  
Old 01-16-2015
Quote:
Originally Posted by disedorgue
Hello,
If the output order of the sections is not important, with (GNU) awk:
Code:
awk 'BEGIN{RS=""} {A[$0]=1} END{for (h in A) print h "\n"}' file

Regards.
Thanks for your help. Your code worked fine. I had already tried similar code, but the difference was that I didn't set RS, and instead of A[$0]=1 I assigned A[$0]=$0, and the array was getting jumbled up. Do you know the reason?

RudiC, I don't quite understand this code. Can you please help me understand it?
Code:
awk '/^\*\&/ {STOP=($0 in T); T[$0]} /^ *$/ {STOP=0} !STOP' file4

Thank you both for your help!!
# 5  
Old 01-16-2015
Code:
awk '/^\*\&/ {STOP=($0 in T)            # if header (identified by *&) is known, stop the printing
              T[$0]                     # remember the header line for next time
             } 
     /^ *$/  {STOP=0}                   # empty line: reenable printing
     !STOP                              # use default action: print, if NOT STOPped
    ' file
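
Note that the command above keeps the first copy of a repeated section, while the sample output in post #1 keeps the later one. If that difference matters, something like the following might do. It is a rough sketch that assumes the whole file fits in memory and that sections are separated by empty lines; keep_last_sections is a made-up name.

```shell
# keep_last_sections is a hypothetical helper name, not part of the thread.
# It stores every section, then prints only the last copy for each header line.
keep_last_sections() {
    awk 'BEGIN { RS = ""; FS = "\n" }    # one record per blank-line-separated section
         { body[NR] = $0                 # whole section
           head[NR] = $1                 # its first line
           last[$1] = NR }               # a later duplicate overwrites the index
         END {
             for (i = 1; i <= NR; i++)   # replay the sections in input order
                 if (last[head[i]] == i)
                     printf "%s\n\n", body[i]
         }' "$@"
}

# Demo: the first "*& abc def" section is dropped, the later one is kept.
printf '*& abc def\n1\n2\n\n*& cde efg\n1\n\n*& abc def\n1\n2\n' | keep_last_sections
```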

# 6  
Old 01-16-2015
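By default, the record separator (RS) is a single '\n', so each line is one record. If RS is set so that a record ends at a blank line (that is, at '\n\n'), awk treats each whole section as one record.
That way, one record is one section. Without it, every line is a separate record, so A holds individual lines rather than sections, and for (h in A) returns them in no particular order, which is why your array looked jumbled.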

Regards.
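
A quick way to see the difference is to count records (NR) on a small sample. The snippet below is just an illustration; the sample variable and its contents are made up, and RS = "" is awk's paragraph mode, where blank lines separate records.

```shell
# Made-up sample: two sections of two lines each, separated by one blank line.
sample='a
b

c
d'

# Default RS: every line, including the blank one, is a record.
printf '%s\n' "$sample" | awk 'END { print NR }'

# Paragraph mode: each blank-line-separated section is a record.
printf '%s\n' "$sample" | awk 'BEGIN { RS = "" } END { print NR }'
```

The first command prints 5 and the second prints 2.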