

awk updating one file with another, comparing, updating


 
# 1  
Old 06-09-2008
awk updating one file with another, comparing, updating

Hello,
I read and searched through this wonderful forum and tried different approaches, but it seems I lack some knowledge and neurons ^^

Here is what I'm trying to achieve:

file1:
test filea 3495;
test fileb 4578;
test filec 7689;
test filey 9978;
test filez 12300;

file2:
test filea 3495;
test filed 4578;
test filec 7689;
test filex 8978;

results:
test filea 3495;
test filed 4578;
test filec 7689;
test filex 8978;
test filey 9978;
test filez 12300;

The comparison is based on the last field ($3): new content from file2 (here, the record with "key" 8978) should be added to the final output, and content that differs in file2 ("test filed 4578;" here) should replace the file1 version.

Here is where I am now:

awk 'NF { key=$NF;keys[key]++ } NR == FNR { key1[key] = $NF ORS;rec1[key] = $0 ORS;next } { key2[key] = $NF ORS;rec2[key] = $0 ORS;next } END { for (k in keys) { if (key1[k] == key2[k]) { print rec2[k] } else { print rec1[k] } } }' $file1 $file2 > $file1.updated

for readability:

awk '
NF { key = $NF; keys[key]++ }
NR == FNR {
    key1[key] = $NF ORS
    rec1[key] = $0 ORS
    next
}
{
    key2[key] = $NF ORS
    rec2[key] = $0 ORS
    next
}
END {
    for (k in keys) {
        if (key1[k] == key2[k])
            print rec2[k]
        else
            print rec1[k]
    }
}' $file1 $file2 > $file1.updated

but... this doesn't work well :/
# 2  
Old 06-09-2008
If the order is not important:

(use nawk or /usr/xpg4/bin/awk on Solaris)

Code:
awk 'END{for(k in _)print _[k]}{_[$NF]=$0}' file1 file2

Otherwise, given your example:

Code:
awk 'END{for(k in _)print _[k]}{_[$NF]=$0}' file1 file2 |
  sort -k3n
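For anyone following along, here is a self-contained check of the one-liner against the sample data from the first post (the file creation is added here just for the demonstration; the awk/sort pipeline is the one above):

```shell
# Recreate the sample files from the first post
cat > file1 <<'EOF'
test filea 3495;
test fileb 4578;
test filec 7689;
test filey 9978;
test filez 12300;
EOF
cat > file2 <<'EOF'
test filea 3495;
test filed 4578;
test filec 7689;
test filex 8978;
EOF
# The last record seen for each key ($NF) wins, so file2 overrides file1;
# the sort restores numeric order on the key column
awk 'END{for(k in _)print _[k]}{_[$NF]=$0}' file1 file2 | sort -k3n
```

This prints exactly the six lines shown under "results" in the first post.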


Last edited by radoulov; 06-09-2008 at 10:39 AM..
# 3  
Old 06-09-2008
oh my.... ^^

thanks a lot !

I thought the solution was something like: store keys from file1, iterate over them on file2, then reverse the iteration to find the missing records... I was far, far away from the beauty of awk...

If I understand correctly, awk reads the two files and automagically merges the records itself? It means that there is no need to store values from file1 to compare them to file2? Beautiful...

Two things I don't get: the use of the underscore (I guess it stands for "all read records"?), and why is END not at the end?

About the sort command: wouldn't it fail on the ';'? Do you know how to specify 'last field' of a line with sort? Or is something like:
| awk '{ printf substr($NF, 1, length($NF)-1);$NF = "";printf " %s\n",$0 }' | sort -n | awk '{ printf "%s%s;\n",$0,$1 }' | awk '{$1="";sub(/^ +/, "");printf "%s\n",$0}'
preferable?

Thanks a lot again radoulov ^^
# 4  
Old 06-09-2008
Quote:
Originally Posted by mecano
[...]
If I understand correctly, awk reads the two files and automagically merges the records itself? It means that there is no need to store values from file1 to compare them to file2? Beautiful...
[...]
It uses an associative array (a hash), so it guarantees the uniqueness of the key ($NF in this case), and the value is always the last one it sees (the one in file2). It associates every key ($NF) with the entire record ($0) and updates the value whenever it sees the same key again.

Quote:
Two things I don't get: the use of the underscore (I guess it stands for "all read records"?), and why is END not at the end?
Well, that's just a style of writing;
if you want the code to be more readable,
you could use this instead (and this is compatible even with the old plain Solaris awk):

Code:
awk '{
  key_record[$NF] = $0     # associate key ($NF) with entire record ($0) 
  }
END { 
  # after the entire input has been read 
  for (key  in key_record) # for every key stored
    print key_record[key]  # print the associated value
    }' file1 file2

Quote:
About the sort command: wouldn't it fail on the ';'?
I think the sort command will handle it correctly. Do you have an example where input like this is not sorted correctly?

Quote:
Do you know how to specify 'last field' of a line with sort?
Or is something like:
| awk '{ printf substr($NF, 1, length($NF)-1);$NF = "";printf " %s\n",$0 }' | sort -n | awk '{ printf "%s%s;\n",$0,$1 }' | awk '{$1="";sub(/^ +/, "");printf "%s\n",$0}'
preferable?
Why? Isn't the last field position fixed?
In that case I would go with:

Code:
perl -lane'
  $h{$F[-1]} = $_;
  print join "\n", map $h{$_}, sort {$a <=> $b} keys %h
    if eof()'

Or (if you really want to get rid of the ';' while sorting):

Code:
perl -lane'
  chop $F[-1] and $h{$F[-1]} = $_;
  print join "\n", map $h{$_}, sort {$a <=> $b} keys %h
    if eof()'

Otherwise using sort + shell:

Code:
read<file;set -- $REPLY;sort -k$#n file
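Unpacked into separate statements, the one-liner above reads the first line, splits it into positional parameters, and uses $# (the field count) as the sort key; a sketch with an assumed three-field sample file:

```shell
# A readable expansion of the one-liner above (a named variable is used
# instead of relying on REPLY, which needs bash/ksh/zsh)
printf 'test filea 3495;\ntest fileb 12300;\ntest filec 778;\n' > file  # assumed sample
read -r first < file   # grab the first line
set -- $first          # split it into $1..$n
sort -k"$#"n file      # numeric sort on the last column ($# = field count)
```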

# 5  
Old 06-09-2008
Thanks a lot for taking the time to explain all this, radoulov ^^ that's really great!

Quote:
I think the sort command will cast it correctly. Do you have an example where the input like this is not sorted correctly?
Well, not in that particular case, but I remember having to strip the ';' to be able to use 'sort -n' correctly (without specifying a key, I just extract the last field with awk and then apply sort -n to it; a shame 'sort' doesn't allow reverse key selection), for example with values like:
27384;
7384; or 384;
but I tried so many different things, I guess this was a remnant of some mistypes/mistakes on my side, or of the Windows line endings some files seem to have (some files are created on Windows and some on Unix)?
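For what it's worth, this particular worry is easy to test: numeric sort stops parsing at the first non-numeric character, so a trailing ';' is harmless (stray Windows carriage returns before it are a different story):

```shell
# The values from the post above: sort -n reads the leading digits
# and ignores the trailing ';'
printf '27384;\n7384;\n384;\n' | sort -n
```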

No, the last field position is not fixed, because this is a bash utility script for sorting/updating SQL query files. It has to work on several different files where the number of fields is not always the same, and where the key value can be, rarely but it happens, in the middle of the line.
So in this case, taking a $key argument from the CLI:
awk 'END{for(k in _)print _[k]}{_[$'"$key"']=$0}' $file1 $file2 > $file1.updated
with an additional conditional on the argument '0' meaning the end of the line (because I didn't get $key to turn into NF, with awk taking '"$key"').
I'm making it for a small community and it has to be really simple.
If you're not afraid to read awful code, I can post it ^^
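A possible alternative to splicing '"$key"' into the program text (a hypothetical sketch; the file names and the 0-means-last-field convention here are assumptions, not from the thread) is to pass the position with awk -v and map 0 to NF inside the script:

```shell
# Throwaway sample files (assumed names)
printf 'a x 1;\nb y 2;\n' > f1
printf 'a z 1;\nc w 3;\n' > f2
key=0   # 0 means "use the last field"
awk -v key="$key" '
  { k = (key == 0 ? NF : key)   # map 0 to the last field position
    rec[$k] = $0 }              # later files overwrite earlier ones
  END { for (i in rec) print rec[i] }
' f1 f2 | sort -k3n
```

This avoids the quoting gymnastics and handles the '0 means NF' case without a separate shell conditional.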
# 6  
Old 06-10-2008
Quote:
[...]
If you're not afraid to read awful code, I can post it ^^
Yes, of course, post it.
You could get useful advice here.
# 7  
Old 06-10-2008
Here it is ^^

Code:
#!/bin/bash

NO_ARGS=0
E_OPTERROR=65

if [ $# -eq "$NO_ARGS" ]
	then
	echo -e "\n\tUsage: `basename $0` -ulkdrm filename\n\tType :'awksort -help' for help.\n"
	exit $E_OPTERROR
fi

while getopts ":u:l:k:d:r:m:h" Option
	do
	case $Option in
	u )
	filename=$2
	if [ -f $filename ]
		then
		sort -u $filename > $filename.uniq
	else
		echo -e "\ncan't find file $filename\n"
	fi
	;;
	l )
	filename=$2
	if [ -f $filename ]
		then
		awk '{ printf substr($NF, 1, length($NF)-1);$NF = "";printf " %s\n",$0 }' $filename | sort -n | awk '{ printf "%s%s;\n",$0,$1 }' | awk '{$1="";sub(/^ +/, "");printf "%s\n",$0}' > $filename.sorted
	else
		echo -e "\ncan't find file $filename\n"
	fi
	;;
	k )
	filename=$3
	if [ -f $filename ]
		then
		opt=$OPTARG
		sort -n -t "=" -k $opt $filename > $filename.sorted
	else
		echo -e "\ncan't find file $filename\n"
	fi
	;;
	d )
	filename=$2
	if [ -f $filename ]
		then
		awk '{ printf substr($NF, 1, length($NF)-1);$NF = "";printf "\n" }' $filename | sort -n | awk '{ if ($1 == prev) { printf "%d\n",$0;num++ };prev=$1 } END { printf "\n%d duplicates were found...\n",num }'
	else
		echo -e "\ncan't find file $filename\n"
	fi
	;;
	r )
	filename=$2
	if [ -f $filename ]
		then
		awk '{ printf substr($NF, 1, length($NF)-1);$NF = "";printf " %s\n",$0 }' $filename | sort -n | awk '{ if ($1 != prev) { printf "%s%s;\n",$0,$1 };prev=$1 }' | awk '{$1="";sub(/^ +/, "");printf "%s\n",$0}' > $filename.noduplicate
	else
		echo -e "\ncan't find file $filename\n"
	fi
	;;
	m )
	key=$2
	file1=$3
    file2=$4
	if [ -z $file1 ]
		then
		echo -e "\nMissing argument. Usage: `basename $0` -m file1 file2\n"
	elif [ -z $file2 ]
		then
		echo -e "\nMissing argument. Usage: `basename $0` -m file1 file2\n"
	elif [ -f $file1 -a -f $file2 ]
		then
		if [ $key -eq 0 ]
			then
			awk 'END{for(k in _)print _[k]}{_[$NF]=$0}' $file1 $file2 > $file1.updated
			else
			awk 'END{for(k in _)print _[k]}{_[$'"$key"']=$0}' $file1 $file2 > $file1.updated
		fi
	elif [ -f $file1 ]
		then
		echo -e "\nCan't find file $file2\n"
	elif [ -f $file2 ]
		then
		echo -e "\nCan't find file $file1\n"
	else
		echo -e "\nCan't find any file! Neither $file1 or $file2 were found!\n"
	fi
	;;
	h )
	echo ""
	echo -e "\tawksort 0.1\n"
	echo -e "\tUsage: `basename $0` -ulkdrm options file\n\n"
	echo -e "\tOptions :\n"
	echo -e "\t-u file\n\tremove 'identical' entries, leaving only unique entries.\n"
	echo -e "\t-l file\n\tsort with last field of line.\n"
	echo -e "\t-k key file\n\twhere key is the field number to sort.\n"
	echo -e "\t-d file\n\treport duplicate 'similar' entries by id when id is the last field.\n"
	echo -e "\t-r file\n\tremove duplicate 'similar' entries arbitrary by id when id is the last field.\n"
	echo -e "\t-m key file1 file2\n\tmerge file1 with file2 where file2 is an 'update' file.\n\tThis overrides duplicates ids from file1 by replacing them\n\twith file2 records.\n\tUse 0 for key to merge using last field of line as key.\n"
	echo -e "\n\tClassic scenario is:\n\tUse -u to remove identical entries, then sort entries using -l or -k,\n\tremove similar entries with -r and finally apply update.\n"
	;;
	\? )
	echo -e "\n\tUsage: `basename $0` -ulkdrm filename\n\tType :'awksort -help' for help.\n"
	exit 1;;
	* )
	echo -e "\n\tUsage: `basename $0` -ulkdrm filename\n\tType :'awksort -help' for help.\n"
	exit 1;;
	esac
done

shift $(($OPTIND - 1))

exit 0

