Full Discussion: Dedup a large file (30M rows)
Post 302705809 by msabhi on 09-25-2012 at 02:21 PM
Code:
# print a line only the first time its second |-delimited field is seen
perl -F'\|' -alne '{if(!$hash{$F[1]}){$hash{$F[1]}++;print $_;}}' input_file

The same solution, cut down:

Code:
# same logic, shorter: the post-increment returns 0 (false) only on the first sighting of a key
perl -F'\|' -alne '{if(!$hash{$F[1]}++){print}}' input_file
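
How it works: -F'\|' splits each line on the pipe character into @F, and %hash remembers each value of the second field ($F[1]), so only the first line carrying a given value is printed. A hypothetical sample run:

Code:
$ cat input_file
1|a|x
2|a|y
3|b|z
$ perl -F'\|' -alne '{if(!$hash{$F[1]}++){print}}' input_file
1|a|x
3|b|z

For 30M rows this streams the file in one pass; memory grows only with the number of distinct keys, not the number of lines.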


 

9 More Discussions You Might Find Interesting

1. AIX

sort and dedup problem

I have a file with contents: 1|4|oho hosfadu| 1|3|sdfsd fds| 2|2|sdfg| 2|1|sdf a| 3|5|ouhuh hu| I would like to do three things to it: 1) sort it on the first two fields; 2) get a unique count on the first field; 3) write the first two unique rows (uniqueness based off the... (4 Replies)
Discussion started by: ChicagoBlues
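A hedged sketch of the three steps against the |-delimited sample above (uniqueness assumed to be on field 1, since the snippet is cut off; file names are hypothetical):

Code:
# 1) sort numerically on the first two fields
sort -t'|' -k1,1n -k2,2n file > sorted
# 2) unique count on the first field
cut -d'|' -f1 sorted | sort -u | wc -l
# 3) keep the first two rows for each value of field 1
awk -F'|' '++seen[$1] <= 2' sorted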

2. Shell Programming and Scripting

How to delete rows by RowNumber from a Large text file

Friends, I have a text file with 700,000 rows. Once I load this file into our database via our custom process, it logs the row number of each rejected row. How do I delete rows from a large text file based on the row number? Thanks, Prashant (8 Replies)
Discussion started by: ppat7046
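One hedged approach, assuming the rejected row numbers sit one per line in a file (rejects.txt, input.txt, and cleaned.txt are hypothetical names):

Code:
# load the rejected line numbers into a hash, then print only unlisted lines ($. is the line number)
perl -ne 'BEGIN { open my $r, "<", "rejects.txt" or die $!;
                  chomp(my @n = <$r>); @skip{@n} = () }
          print unless exists $skip{$.}' input.txt > cleaned.txt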

3. Shell Programming and Scripting

Performance issue in UNIX while generating .dat file from large text file

Hello Gurus, We are facing a performance issue in UNIX. If anyone has faced this kind of issue in the past, please share your suggestions. Problem definition: a few of the load processes of our Finance application hit the issue when they use a shell script having the below... (19 Replies)
Discussion started by: KRAMA

4. Shell Programming and Scripting

Deleting specific rows in large files having rows greater than 100000

Hi Guys, I need help modifying a large text file containing more than 1-2 lakh rows of data using Unix commands. I am quite new to Unix. The text file contains data in a pipe-delimited format: sdfsdfs sdfsdfsd START_ROW sdfsd|sdfsdfsd|sdfsdfasdf|sdfsadf|sdfasdf... (9 Replies)
Discussion started by: manish2009
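The snippet is truncated, but for marker-delimited blocks like the START_ROW line above, sed's range delete streams the file without loading it into memory; END_ROW here is a hypothetical closing marker, not one taken from the thread:

Code:
# delete every block from a START_ROW line through the (assumed) END_ROW line
sed '/^START_ROW/,/^END_ROW/d' big.txt > trimmed.txt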

5. Shell Programming and Scripting

delete rows in a file based on the rows of another file

I need to delete rows based on the number of lines in a different file. I have a piece of code that works on its own, but when I merge it with my C application it doesn't work: sed '1,'\"`wc -l < /tmp/fileyyyy`\"'d' /tmp/fileA > /tmp/filexxxx Can anyone give me an alternate solution for the above? (2 Replies)
Discussion started by: Muthuraj K
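A hedged cleaner form of that command, using $(...) so the quoting survives being embedded elsewhere (paths taken from the snippet; tr strips the padding some wc implementations emit):

Code:
# delete as many leading lines from fileA as fileyyyy contains
sed "1,$(wc -l < /tmp/fileyyyy | tr -d ' ')d" /tmp/fileA > /tmp/filexxxx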

6. Shell Programming and Scripting

Large file - columns into rows etc

I have done a couple of searches on this and have found many threads, but I don't think I've found one that is useful to me, probably because I have only a basic comprehension of Perl and beginner-level shell, so manipulating a script already posted may be beyond my capabilities... Anyway, I... (26 Replies)
Discussion started by: Myrona
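For completeness, one hedged way to turn columns into rows is an in-memory awk transpose. This is only a sketch assuming whitespace-separated data, and it holds the whole file in memory, which may rule it out for genuinely large files:

Code:
# store every cell, then print column i of the input as row i of the output
awk '{ for (i = 1; i <= NF; i++) cell[i, NR] = $i; if (NF > maxnf) maxnf = NF }
     END { for (i = 1; i <= maxnf; i++) {
             row = cell[i, 1]
             for (j = 2; j <= NR; j++) row = row OFS cell[i, j]
             print row
           } }' input.txt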

7. UNIX for Dummies Questions & Answers

merging rows into new file based on rows and first column

I have 2 files: file01 = 7 columns, row count unknown (but few); file02 = 7 columns, row count unknown (but many). Now I want to create an output keyed on the first field, which is shared by both, subtract the values from the rest of the fields, and print the result, e.g. file01: James|0|50|25|10|50|30... (1 Reply)
Discussion started by: A-V
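A hedged awk sketch, assuming file01 is the small file, the first field is the join key, and the file02 values are subtracted from the matching file01 row (the thread leaves the direction of subtraction ambiguous):

Code:
# pass 1 (file01): cache every field of the small file, keyed on field 1
# pass 2 (file02): for rows whose key was cached, print key and field-wise differences
awk -F'|' -v OFS='|' '
  NR == FNR { for (i = 2; i <= NF; i++) a[$1 SUBSEP i] = $i; keys[$1]; next }
  $1 in keys { out = $1
               for (i = 2; i <= NF; i++) out = out OFS (a[$1 SUBSEP i] - $i)
               print out }
' file01 file02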

8. Shell Programming and Scripting

Moving or copying first rows and last rows into another file

Hi, I would like to move the first 1000 rows of my file into an output file and then move the last 1000 rows into another output file. Any help would be great. Thanks. (6 Replies)
Discussion started by: kylle345
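A minimal sketch with head and tail (file names are hypothetical; note this copies the rows rather than removing them from the source):

Code:
# first 1000 lines into one file, last 1000 lines into another
head -n 1000 input.txt > first1000.txt
tail -n 1000 input.txt > last1000.txt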

9. Shell Programming and Scripting

Honey, I broke awk! (duplicate line removal in 30M line 3.7GB csv file)

I have a script that builds a database from a ~30-million-line, ~3.7 GB .csv file. After multiple optimizations it takes about 62 min to bring in and parse all the files, and it used to take 10 min to remove duplicates, until I was asked to add another column. I am using the highly optimized awk code: awk... (34 Replies)
Discussion started by: Michael Stora
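The standard streaming idiom for this, hedged because the thread's actual key columns aren't shown (file names are hypothetical):

Code:
# keep the first occurrence of each whole line; memory grows with the number of distinct lines
awk '!seen[$0]++' big.csv > deduped.csv

When the distinct lines no longer fit in memory, sort -u (or sort | uniq) trades speed for bounded RAM, since sort spills to temporary files on disk.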