Sponsored Content
Full Discussion: Remove dupes in a large file
Top Forums Shell Programming and Scripting Remove dupes in a large file Post 303024633 by gimley on Saturday 13th of October 2018 05:50:59 AM
Old 10-13-2018
Remove dupes in a large file

I have a large file 1.5 gb and want to sort the file.
I used the following AWK script to do the job
Code:
!x[$0]++

The script works but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Any solution to speed up the AWk script or a Perl script would be appreciated.
I work under Windows Environment and hence Unix tools don't work
Many thanks
p.S. I have checked an earlier solution available in the repository but it is just as slow if not slower.
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

remove a large number of user from oracle

Hi on solaris and oracle 10g2, I have number of users created in Oracle, I wonder if I have a list of the usernames will it be possible to remove the users quickly ? I want to keep the users access to system but oracle. some thing like shell script may be ?:confused: I am trying to... (4 Replies)
Discussion started by: upengan78
4 Replies

2. Shell Programming and Scripting

Sed or awk script to remove text / or perform calculations from large CSV files

I have a large CSV files (e.g. 2 million records) and am hoping to do one of two things. I have been trying to use awk and sed but am a newbie and can't figure out how to get it to work. Any help you could offer would be greatly appreciated - I'm stuck trying to remove the colon and wildcards in... (6 Replies)
Discussion started by: metronomadic
6 Replies

3. Shell Programming and Scripting

remove a specific line in a LARGE file

Hi guys, i have a really big file, and i want to remove a specific line. sed -i '5d' fileThis doesn't really work, it takes a lot of time... The whole script is supposed to remove every word containing less than 5 characters and currently looks like this: #!/bin/bash line="1"... (2 Replies)
Discussion started by: blubbiblubbkekz
2 Replies

4. Shell Programming and Scripting

Remove Duplicate Filenames in 2 very large directories

Hello Gurus, O/S RHEL4 I have a requirement to compare two linux based directories for duplicate filenames and remove them. These directories are close to 2 TB each. I have tried running a: Prompt>diff -r data1/ data2/ I have tried this as well: jason@jason-desktop:~$ cat script.sh ... (7 Replies)
Discussion started by: jaysunn
7 Replies

5. Shell Programming and Scripting

How to remove a subset of data from a large dataset based on values on one line

Hello. I was wondering if anyone could help. I have a file containing a large table in the format: marker1 marker2 marker3 marker4 position1 position2 position3 position4 genotype1 genotype2 genotype3 genotype4 with marker being a name, position a numeric... (2 Replies)
Discussion started by: davegen
2 Replies

6. UNIX for Dummies Questions & Answers

Filtering F-Dupes

Is there an easy way to tell FDupes what filetypes to look at or ignore? (0 Replies)
Discussion started by: furashgf
0 Replies

7. Shell Programming and Scripting

Removing Dupes from huge file- awk/perl/uniq

Hi, I have the following command in place nawk -F, '!a++' file > file.uniq It has been working perfectly as per requirements, by removing duplicates by taking into consideration only first 3 fields. Recently it has started giving below error: bash-3.2$ nawk -F, '!a++'... (17 Replies)
Discussion started by: makn
17 Replies

8. Shell Programming and Scripting

remove large portion of web page code between two tags

Hi everybody, I am trying to remove bunch of lines from web pages between two tags: one is <h1> and the other is <table it looks like <h1>Anniversary cards roses</h1> many lines here <table summary="Free anniversary greeting cards." cellspacing="8" cellpadding="8" width="70%">my goal... (5 Replies)
Discussion started by: georgi58
5 Replies

9. Shell Programming and Scripting

Removing dupes within 2 delimited areas in a large dictionary file

Hello, I have a very large dictionary file which is in text format and which contains a large number of sub-sections. Each sub-section starts with the following header : #DATA #VALID 1 and ends with a footer as shown below #END The data between the Header and the Footer consists of... (6 Replies)
Discussion started by: gimley
6 Replies

10. Shell Programming and Scripting

Modify script to remove dupes with two delimiters

Hello, I have a script which removes duplicates in a database with a single delimiter = The script is given below: # script to remove dupes from a row with structure word=word BEGIN{FS="="} {for(i=1;i<=NF;i++){a++;}for(i in a){b=b"="i}{sub("=","",b);$0=b;b="";delete a}}1 How do I modify... (6 Replies)
Discussion started by: gimley
6 Replies
nqs2pbs(1B)								PBS							       nqs2pbs(1B)

NAME
nqs2pbs - convert NQS job scripts to PBS SYNOPSIS
nqs2pbs nqs_script [pbs_script] DESCRIPTION
This utility converts a existing NQS job script to work with PBS and NQS. The existing script is copied and PBS directives, #PBS , are inserted prior to each NQS directive #QSUB or #@$ , in the original script. Certain NQS date specification and options are not supported by PBS. A warning message will be displayed indicating the problem and the line of the script on which it occurred. If any unrecognizable NQS directives are encountered, an error message is displayed. The new PBS script will be deleted if any errors occur. OPERANDS
nqs_script Specifies the file name of the NQS script to convert. This file is not changed. pbs_script If specified, it is the name of the new PBS script. If not specified, the new file name is nqs_script.new . NOTES
Converting NQS date specifications to the PBS form may result in a warning message and an incompleted converted date. PBS does not support date specifications of "today", "tomorrow", or the name of the days of the week such as "Monday". If any of these are encountered in a script, the PBS specification will contain only the time portion of the NQS specification, i.e. #PBS -a hhmm[.ss]. It is suggested that you specify the execution time on the qsub command line rather than in the script. Note that PBS will interpret a time specification without a date in the following way: - If the time specified has not yet been reached, the job will become eligible to run at that time today. - If the specified time has already passed when the job is submitted, the job will become eligible to run at that time tomorrow. PBS does not support time zone identifiers. All times are taken as local time. SEE ALSO
qsub(1B) Local nqs2pbs(1B)
All times are GMT -4. The time now is 05:40 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy