[uniq + awk?] How to remove duplicate blocks of lines in files?


 
# 1  
Old 09-20-2011

Hello again, I want to remove all duplicate blocks of XML code in a file. This is an example:

input:
Code:
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>

I cannot have two arrays by the name of "threeItems". This is the desired output:

Code:
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>

The arrays may end up in a different order, but each array's contents need to stay intact. I have been using awk to pull the arrays and all their elements from several files into one, like this

Code:
awk '/string-array name/,/string-array>/' "$file"

but it seems that the same array appears in more than one file.

Thanks for any tips!
# 2  
Old 09-20-2011
Is the data actually as shown? One tag per line? Or do the contents sometimes splay across multiple lines?

---------- Post updated at 10:05 AM ---------- Previous update was at 09:57 AM ----------

You could use awk's record separator to make each <string-array the beginning of a record, and split fields on newlines:

Code:
$ cat 3items.awk
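BEGIN { RS="<string-array";     FS="\n";        OFS="\n";       }

{
        # $1 is the rest of the opening tag, e.g. ' name="threeItems">'
        if($1 ~ /name=/)
        {
                # Strip the attribute syntax so only the array name is left
                gsub(/ *name=\"|\">/, "", $1);
                # Print a record only the first time its name is seen
                if(!ARR[$1])
                {
                        ARR[$1]=1;
                        # Put back the opening tag that RS consumed
                        $1="<string-array name=\"" $1 "\">";
                        print;
                }
        }
}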
$ cat data
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>

$ awk -f 3items.awk < data # Use nawk/gawk outside Linux
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>

<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
$

This User Gave Thanks to Corona688 For This Post:
# 3  
Old 09-20-2011
Code:
nawk -F'"' '/string-array/ && NF>1 {f=($(NF-1) in a)?0:1;if(f)a[$(NF-1)]} f' myFile
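For anyone unpacking the one-liner above, here is an expanded sketch of the same logic. The function name dedup_by_name and the variable names are illustrative and not from the post, and it assumes plain awk behaves like nawk for this script. Splitting on double quotes puts the value of name= in the next-to-last field of an opening tag line. A flag then decides whether the lines that follow are printed.

```shell
# Hypothetical helper name; the logic mirrors the one-liner above.
dedup_by_name() {
    awk -F'"' '
        # An opening tag line contains quotes, so it splits into
        # several fields; the next-to-last one is the array name.
        /string-array/ && NF > 1 {
            name = $(NF - 1)
            if (name in seen) {
                keep = 0        # repeated name: suppress this block
            } else {
                seen[name]      # referencing the element creates it
                keep = 1        # first occurrence: print this block
            }
        }
        # Closing tags and items do not match the rule above, so they
        # inherit the flag set by the most recent opening tag.
        keep
    ' "$@"
}
```

Called as dedup_by_name myFile, it should print each named array once. Note that lines outside any array follow whatever the flag was last set to.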

# 4  
Old 09-20-2011
Wicked solution, vgersh99!

--ahamed
This User Gave Thanks to ahamed101 For This Post:
# 5  
Old 09-20-2011
Code:
awk '{printf "%s", $0}' yourFile | sed 's#</string-array>#&\n#g' | awk '!a[$0]++'
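The pipeline above joins the whole file into one line, starts a new line after every closing tag, and then lets awk print each resulting line only the first time it is seen. Two things are worth knowing. It compares the full text of a block, not just its name, so two arrays with the same name but different items would both survive. And \n in a sed replacement is a GNU sed extension. Below is a single-awk sketch of the same idea that avoids sed. The function name dedup_by_content is made up for illustration, and it assumes each closing tag ends its line.

```shell
# Hypothetical helper name. Deduplicates on the full text of each
# block (flattened to one line), not only on the array name.
dedup_by_content() {
    awk '
        { buf = buf $0 }                    # append the line to the current block
        /<\/string-array>/ {                # a closing tag ends the block
            if (!seen[buf]++) print buf     # print only the first time it is seen
            buf = ""
        }
    ' "$@"
}
```

Like the original pipeline, this prints each surviving array flattened onto a single line.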

# 6  
Old 09-20-2011
Thanks again, Corona! However, it produces this
Code:
<string-array name="emptyarray</string-array>">

from this
Code:
<string-array name="emptyarray"></string-array>

and this
Code:
<string-array name="arrayWithPipe</string-array>">

from this
Code:
        <string-array name="arrayWithPipe">
                <item>item1|item2</item>
        </string-array>

It seems like the pipe is interfering, and the same thing happens when an array has no elements?

I think I need a book on awk.

---------- Post updated at 12:49 PM ---------- Previous update was at 12:47 PM ----------

Wow guys, I just saw all your replies. I will try them out.

This forum is great!
# 7  
Old 09-20-2011
It works with the data you posted. Please post more comprehensive input data.
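One reading that fits the empty-array report: when the closing tag sits on the same line as the opening tag, the first field is name="emptyarray"></string-array>, and stripping the attribute syntax leaves emptyarray</string-array> as the "name", which then gets wrapped in a new opening tag. The pipe case cannot be reproduced from the fragment posted, so the sketch below only addresses the same-line case. It keeps the record-separator approach but pulls the name out with match() and leaves the record text alone. The function name dedup_records is made up for illustration, and the script assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do).

```shell
# Hypothetical helper name. Needs an awk with regex RS (gawk, mawk).
dedup_records() {
    awk '
        BEGIN { RS = "<string-array" }
        # Find name="..." and take only the text between the quotes.
        match($0, /name="[^"]*"/) {
            name = substr($0, RSTART + 6, RLENGTH - 7)
            if (!(name in seen)) {
                seen[name]
                # Put back the separator awk consumed; the record
                # already carries its own newlines.
                printf "%s%s", RS, $0
            }
        }
    ' "$@"
}
```

Because the record is printed unchanged, a one-line empty array comes out exactly as it went in. Text before the first array is dropped, as in the original script.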