Filter file to remove duplicate values in first column
Post 302977903 by drl, 07-23-2016
Hi.

Seems like sort with the unique option works for me:
Code:
#!/usr/bin/env bash

# @(#) s1       Demonstrate remove all identical lines, sort.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }  # second definition disables debug output; remove it to enable tracing
C=$HOME/bin/context && [ -f "$C" ] && "$C" sort pass-fail

FILE=${1-data1}

pl " Input data file $FILE:"
head "$FILE"

pl " Expected output:"
head expected-output.txt

pl " Results:"
# -u with -k1,1 compares only field 1, keeping one line per distinct key.
sort -u -k1,1 "$FILE" |
tee f1

pass-fail f1 expected-output.txt

exit 0

producing
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.4 (jessie) 
bash GNU bash 4.3.30
sort (GNU coreutils) 8.23
pass-fail - ( local: RepRev 1.6, ~/bin/pass-fail, 2016-07-23 )

-----
 Input data file data1:
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
0       compound_00     -3.5054     -1.1207     -2.4372
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114

-----
 Expected output:
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016

-----
 Results:
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016

-----
 Comparison of 7 created lines with 7 lines of desired results:
 Succeeded -- files (computed) f1 and (standard) expected-output.txt have same content.
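
As an aside, sort -u -k1,1 keeps one line per distinct first-column key, but the survivors come out in key-sorted order. If the original input order must be preserved instead, a common awk idiom prints only the first occurrence of each key (a sketch, using the same data file name):
Code:
awk '!seen[$1]++' data1

For this data set the two approaches happen to produce the same result, because the keys already appear in ascending order.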

The pass-fail code is essentially a wrapper around cmp that adds some extra checking and reporting.
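
For anyone without that local utility, here is a minimal sketch of what such a wrapper might look like; the real ~/bin/pass-fail is a local script, so everything below is an assumption modeled on the output shown above:
Code:
#!/usr/bin/env bash

# pass-fail (sketch) -- compare a computed file with an expected file.
# Usage: pass-fail computed expected
computed=$1
expected=$2

printf "\n-----\n Comparison of %s created lines with %s lines of desired results:\n" \
  "$(wc -l < "$computed")" "$(wc -l < "$expected")"

if cmp -s "$computed" "$expected"; then
  printf " Succeeded -- files (computed) %s and (standard) %s have same content.\n" \
    "$computed" "$expected"
else
  printf " Failed -- files %s and %s differ:\n" "$computed" "$expected"
  diff "$computed" "$expected"
  exit 1
fi

Called as "pass-fail f1 expected-output.txt", a wrapper like this would produce output in the same shape as the run above.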

Best wishes ... cheers, drl
 
