Full Discussion: Dedup a large file (30M rows)
Post 302705809 by msabhi on 09-25-2012 at 02:21 PM
Code:
# print a line only the first time its second |-delimited field is seen
perl -F'\|' -alne '{if(!$hash{$F[1]}){$hash{$F[1]}++;print $_;}}' input_file

The same solution, cut down:

Code:
# same logic, shorter: the post-increment returns 0 (false) only on the first sighting of a key
perl -F'\|' -alne '{if(!$hash{$F[1]}++){print}}' input_file
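
How it works: -F'\|' splits each line on the pipe character into @F, and %hash remembers each value of the second field ($F[1]), so only the first line carrying a given value is printed. A hypothetical sample run:

Code:
$ cat input_file
1|a|x
2|a|y
3|b|z
$ perl -F'\|' -alne '{if(!$hash{$F[1]}++){print}}' input_file
1|a|x
3|b|z

For 30M rows this streams the file in one pass; memory grows only with the number of distinct keys, not the number of lines.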


 

9 More Discussions You Might Find Interesting

1. AIX

sort and dedup problem

I have a file with contents: 1|4|oho hosfadu| 1|3|sdfsd fds| 2|2|sdfg| 2|1|sdf a| 3|5|ouhuh hu| I would like to do three things to it: 1) sort it on the first two fields; 2) get a unique count on the first field; 3) write the first two unique rows (uniqueness based off the... (4 Replies)
Discussion started by: ChicagoBlues
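A hedged sketch of the three steps against the |-delimited sample above (uniqueness assumed to be on field 1, since the snippet is cut off; file names are hypothetical):

Code:
# 1) sort numerically on the first two fields
sort -t'|' -k1,1n -k2,2n file > sorted
# 2) unique count on the first field
cut -d'|' -f1 sorted | sort -u | wc -l
# 3) keep the first two rows for each value of field 1
awk -F'|' '++seen[$1] <= 2' sorted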

2. Shell Programming and Scripting

How to delete rows by RowNumber from a Large text file

Friends, I have a text file with 700,000 rows. Once I load this file into our database via our custom process, it logs the row number of each rejected row. How do I delete rows from a large text file based on the row number? Thanks, Prashant (8 Replies)
Discussion started by: ppat7046
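One hedged approach, assuming the rejected row numbers sit one per line in a file (rejects.txt, input.txt, and cleaned.txt are hypothetical names):

Code:
# load the rejected line numbers into a hash, then print only unlisted lines ($. is the line number)
perl -ne 'BEGIN { open my $r, "<", "rejects.txt" or die $!;
                  chomp(my @n = <$r>); @skip{@n} = () }
          print unless exists $skip{$.}' input.txt > cleaned.txt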

3. Shell Programming and Scripting

Performance issue in UNIX while generating .dat file from large text file

Hello Gurus, We are facing a performance issue in UNIX. If anyone has faced this kind of issue in the past, please share your suggestions. Problem definition: a few of the load processes of our Finance application hit the issue when they use a shell script having the below... (19 Replies)
Discussion started by: KRAMA

4. Shell Programming and Scripting

Deleting specific rows in large files having rows greater than 100000

Hi Guys, I need help modifying a large text file containing more than 1-2 lakh rows of data using Unix commands. I am quite new to Unix. The text file contains data in a pipe-delimited format: sdfsdfs sdfsdfsd START_ROW sdfsd|sdfsdfsd|sdfsdfasdf|sdfsadf|sdfasdf... (9 Replies)
Discussion started by: manish2009
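The snippet is truncated, but for marker-delimited blocks like the START_ROW line above, sed's range delete streams the file without loading it into memory; END_ROW here is a hypothetical closing marker, not one taken from the thread:

Code:
# delete every block from a START_ROW line through the (assumed) END_ROW line
sed '/^START_ROW/,/^END_ROW/d' big.txt > trimmed.txt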

5. Shell Programming and Scripting

delete rows in a file based on the rows of another file

I need to delete rows based on the number of lines in a different file. I have a piece of code that works on its own, but when I merge it with my C application it doesn't work: sed '1,'\"`wc -l < /tmp/fileyyyy`\"'d' /tmp/fileA > /tmp/filexxxx Can anyone give me an alternate solution for the above? (2 Replies)
Discussion started by: Muthuraj K
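A hedged cleaner form of that command, using $(...) so the quoting survives being embedded elsewhere (paths taken from the snippet; tr strips the padding some wc implementations emit):

Code:
# delete as many leading lines from fileA as fileyyyy contains
sed "1,$(wc -l < /tmp/fileyyyy | tr -d ' ')d" /tmp/fileA > /tmp/filexxxx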

6. Shell Programming and Scripting

Large file - columns into rows etc

I have done a couple of searches on this and have found many threads, but I don't think I've found one that is useful to me, probably because I have only a basic comprehension of Perl and beginner-level shell, so manipulating a script already posted may be beyond my capabilities... Anyway, I... (26 Replies)
Discussion started by: Myrona
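For completeness, one hedged way to turn columns into rows is an in-memory awk transpose. This is only a sketch assuming whitespace-separated data, and it holds the whole file in memory, which may rule it out for genuinely large files:

Code:
# store every cell, then print column i of the input as row i of the output
awk '{ for (i = 1; i <= NF; i++) cell[i, NR] = $i; if (NF > maxnf) maxnf = NF }
     END { for (i = 1; i <= maxnf; i++) {
             row = cell[i, 1]
             for (j = 2; j <= NR; j++) row = row OFS cell[i, j]
             print row
           } }' input.txt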

7. UNIX for Dummies Questions & Answers

merging rows into new file based on rows and first column

I have 2 files: file01 = 7 columns, row count unknown (but few); file02 = 7 columns, row count unknown (but many). Now I want to create an output keyed on the first field, which is shared by both, subtract the values from the rest of the fields, and print the result, e.g. file01: James|0|50|25|10|50|30... (1 Reply)
Discussion started by: A-V
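A hedged awk sketch, assuming file01 is the small file, the first field is the join key, and the file02 values are subtracted from the matching file01 row (the thread leaves the direction of subtraction ambiguous):

Code:
# pass 1 (file01): cache every field of the small file, keyed on field 1
# pass 2 (file02): for rows whose key was cached, print key and field-wise differences
awk -F'|' -v OFS='|' '
  NR == FNR { for (i = 2; i <= NF; i++) a[$1 SUBSEP i] = $i; keys[$1]; next }
  $1 in keys { out = $1
               for (i = 2; i <= NF; i++) out = out OFS (a[$1 SUBSEP i] - $i)
               print out }
' file01 file02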

8. Shell Programming and Scripting

Moving or copying first rows and last rows into another file

Hi, I would like to move the first 1000 rows of my file into an output file and then move the last 1000 rows into another output file. Any help would be great. Thanks. (6 Replies)
Discussion started by: kylle345
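A minimal sketch with head and tail (file names are hypothetical; note this copies the rows rather than removing them from the source):

Code:
# first 1000 lines into one file, last 1000 lines into another
head -n 1000 input.txt > first1000.txt
tail -n 1000 input.txt > last1000.txt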

9. Shell Programming and Scripting

Honey, I broke awk! (duplicate line removal in 30M line 3.7GB csv file)

I have a script that builds a database from a ~30-million-line, ~3.7 GB .csv file. After multiple optimizations it takes about 62 min to bring in and parse all the files, and it used to take 10 min to remove duplicates, until I was asked to add another column. I am using the highly optimized awk code: awk... (34 Replies)
Discussion started by: Michael Stora
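The standard streaming idiom for this, hedged because the thread's actual key columns aren't shown (file names are hypothetical):

Code:
# keep the first occurrence of each whole line; memory grows with the number of distinct lines
awk '!seen[$0]++' big.csv > deduped.csv

When the distinct lines no longer fit in memory, sort -u (or sort | uniq) trades speed for bounded RAM, since sort spills to temporary files on disk.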