Hi jim & era: I appreciate both your solutions very much! The summary below is rather long, but it illustrates my working through the problem - please be patient! ;-)
I think that we are close; however, the final output is still not quite what I want, as described below. Basically, I need to find duplicates (where a pp b = b pp a, etc.) and summarize and count them (e.g. 2 a pp b), along with the unique lines (e.g. 1 a pp a). However, I'd like this sorting and counting to be based on three specific columns (here, $1, $2, $3 in the source file), independent of any of the other fields.
I managed to save jim's awk script as a shell script, “search_replace.awk.sh”, executable from the command line. For reference, I am using Cygwin on Windows XP Pro at work; I am an Ubuntu Linux user at home.
#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)} else {printf("%s %s %s", $1, $2, $3)} for(i=4; i<=NF; i++) {printf(" %s", $i)} printf("\n") }' dummy_test_duplicates_file_2.txt | awk '{ arr[$1 $2 $3]++; if(m[$1 $2 $3]=="") {m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
which is executed as follows:
$ sh search_replace.awk.sh
(Question: Is it possible to specify the input file from the command line, rather than within the script?)
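(One thing I might try - just a sketch, not yet tested: in a shell script, "$1" holds the first command-line argument, so the filename could be passed in at run time instead of being hard-coded. The awk program below is a placeholder standing in for the real one:)

```shell
#!/usr/bin/bash
# Sketch: "$1" is the first argument given to the script, so the input
# file can be supplied on the command line.
# Usage: sh search_replace.awk.sh dummy_test_duplicates_file_2.txt
awk '{ print NF, $0 }' "$1"   # placeholder awk program; substitute the real one
```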
This also works (same code, split over different lines)
#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
but not this (identical code, split differently)
#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt |
awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
$ sh search_replace.awk.sh
search_replace.awk.sh: line 5: $'\r': command not found
... Adding a backslash after the pipe ( printf("\n") }' dummy_test_duplicates_file_2.txt | \ ) should - but does not - correct the problem (and in bash a line ending in a pipe continues onto the next line anyway, so the backslash shouldn't even be needed) ...
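(A guess at the cause: the $'\r' in the error message suggests the script file was saved with Windows-style CRLF line endings - in which case a trailing backslash ends up escaping the carriage return rather than the newline. Stripping the carriage returns might be worth a try:)

```shell
# Remove carriage returns (CRLF -> LF) from the script; dos2unix, if
# installed under Cygwin, does the same thing in place.
tr -d '\r' < search_replace.awk.sh > search_replace.fixed.sh
```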
Ultimately, I modified this “search_replace.awk.sh” script by adding
BEGIN {OFS=FS="\t"}
to try to ensure tab-delimited input/output:
#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
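(As I understand it - and please correct me if I'm wrong - FS controls how input lines are split into fields, while OFS is only inserted by print, or when awk rebuilds $0 after a field assignment; output produced with printf ignores OFS entirely, which may be why adding the BEGIN block changes so little here. A quick check:)

```shell
# FS splits the input on tabs; OFS is inserted by print (and when $0 is
# rebuilt after a field assignment), but printf output is unaffected by it.
printf 'a\tb\tc\n' | awk 'BEGIN { OFS = FS = "\t" } { $1 = $1; print NF, $0 }'
```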
I also compared the output from jim's code versus that from era's awk command,
$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
Using my source file (to be searched) - “dummy_test_duplicates_file.txt” (21 lines), I get the following:
#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
$ cat dummy_test_duplicates_file.txt | sort
a gi b
a pp a
a pp b
a pp b
a pp c
a pp d
a pp e
a pp e
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t
$ sh search_replace.awk.sh
3 y gi t
1 u gi t
2 d pp a
1 z pp t
2 z gi t
2 v gi t
2 e pp a
1 a pp a
1 w gi t
1 b gi a
3 b pp a
1 x gi t
1 c pp a
Compare to:
$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
1 t gi x
1 t gi v
1 t gi y
1 t gi z
1 a pp a
2 t gi y
2 a pp b
1 t gi z
1 a pp c
1 a pp d
1 a pp b
2 a pp e
1 a gi b
1 a pp d
1 t gi u
1 t pp z
1 t gi v
1 t gi w
This also works:
$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
Interestingly, adding a second BEGIN {OFS=FS="\t"} block (this time to the second awk command),
#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
gives the same output, formatted slightly differently and in a different line order:
$ sh search_replace.awk.sh
2 z gi t
2 d pp a
3 y gi t
1 c pp a
1 x gi t
3 b pp a
1 w gi t
1 a pp a
2 v gi t
1 b gi a
1 z pp t
1 u gi t
2 e pp a
$ sh search_replace.awk.sh | sort
1 a pp a
1 b gi a
1 c pp a
1 u gi t
1 w gi t
1 x gi t
1 z pp t
2 d pp a
2 e pp a
2 v gi t
2 z gi t
3 b pp a
3 y gi t
$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt | sort
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y
====================
So far, this "appears" to be working great! :-)
However, I then added a fourth column (to see the effect - remember, I want to sort, count "duplicates," etc. based only on the three defined columns/fields), saved as "dummy_test_duplicates_file_3.txt":
$ cat dummy_test_duplicates_file_3.txt
a pp b This
a pp c column
a pp d contains
a pp e some
a pp b text
a pp e that
a gi b I
a pp a don't
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z
$ cat dummy_test_duplicates_file_3.txt | sort
a gi b I
a pp a don't
a pp b This
a pp b text
a pp c column
a pp d contains
a pp e some
a pp e that
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t
$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b I
1 a pp a don't
1 a pp b This
1 a pp b text
1 a pp b want
1 a pp c column
1 a pp d contains
1 a pp d to
1 a pp e some
1 a pp e that
1 t gi u affect
1 t gi v
1 t gi v the
1 t gi w search/replace
1 t gi x operations.
1 t pp z
2 t gi z
3 t gi y
[Here I changed the source file in the shell script from “dummy_test_duplicates_file.txt” to “dummy_test_duplicates_file_3.txt”.]
$ sh search_replace.awk.sh | sort
1 a pp a don't
1 b gi a I
1 b pp a This
1 b pp a text
1 b pp a want
1 c pp a column
1 d pp a contains
1 d pp a to
1 e pp a some
1 e pp a that
1 u gi t affect
1 v gi t
1 v gi t the
1 w gi t search/replace
1 x gi t operations.
1 z pp t
2 z gi t
3 y gi t
This (above) really isn't what I am looking for - note that “a pp b” (or “b pp a” in the second output) is listed separately, three times, in the two different solutions (outputs) above. The contents of the fourth column ($4 in the input file) are affecting the search/replace operation.
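(One idea for restricting the comparison to the first three fields - a sketch, run here on a tiny inline sample rather than the real file: canonicalize $1/$3, then build the array key from fields 1-3 only, joined with awk's SUBSEP so adjacent fields can't run together the way plain concatenation allows. Trailing columns then never enter into the count:)

```shell
# Inline sample standing in for dummy_test_duplicates_file_3.txt;
# the key is built from fields 1-3 only, so column 4 is ignored.
printf 'a\tpp\tb\tThis\nb\tpp\ta\ttext\na\tpp\ta\tx\n' |
awk 'BEGIN { FS = "\t" }
{
  lo = $1; hi = $3
  if (hi < lo) { t = lo; lo = hi; hi = t }   # so that a pp b == b pp a
  key = lo SUBSEP $2 SUBSEP hi               # fields 1-3 only
  count[key]++
  if (!(key in rep)) rep[key] = lo " " $2 " " hi
}
END { for (k in count) print count[k], rep[k] }' | sort
```

(rep[key] could instead store $0 to keep the trailing text of the first line seen for each key.)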
Additionally - what is the purpose of the “%s” (string) printf conversion specification? It looks like each one is being filled in with a field variable (e.g. printing fields $1, $2, $3 as strings)?
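(My current understanding, for what it's worth: %s means "format the corresponding argument as a string", and each %s in the format is replaced, in order, by the next argument. A tiny check:)

```shell
# Each %s is replaced in order by the next argument ($1, $2, $3 here).
echo "x y z" | awk '{ printf("%s-%s-%s\n", $1, $2, $3) }'
```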
I also decided to change the “order” of the search fields $3, $1:
#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($1 > $3) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_3.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
e.g. if $1 (b) > $3 (a), print $3 (a) $2 $1 (b)
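(To double-check the intended behavior of that if/else - that only the out-of-order case gets swapped - a minimal test on two hand-made lines:)

```shell
# b pp a should become a pp b; b pp c is already in order and stays put.
printf 'b\tpp\ta\nb\tpp\tc\n' |
awk 'BEGIN { FS = "\t" }
{ if ($1 > $3) printf("%s %s %s\n", $3, $2, $1)
  else         printf("%s %s %s\n", $1, $2, $3) }'
```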
Thank you both once again for your different, unique solutions - I'd like to get this sorted out using one (or both - this is educational!) of these methods!
Sincerely, Greg :-)