Sponsored Content
Full Discussion: Remove dupes in a large file
Top Forums Shell Programming and Scripting Remove dupes in a large file Post 303024633 by gimley on Saturday 13th of October 2018 05:50:59 AM
Old 10-13-2018
Remove dupes in a large file

I have a large file 1.5 gb and want to sort the file.
I used the following AWK script to do the job
Code:
!x[$0]++

The script works but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Any solution to speed up the AWk script or a Perl script would be appreciated.
I work under Windows Environment and hence Unix tools don't work
Many thanks
p.S. I have checked an earlier solution available in the repository but it is just as slow if not slower.
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

remove a large number of user from oracle

Hi on solaris and oracle 10g2, I have number of users created in Oracle, I wonder if I have a list of the usernames will it be possible to remove the users quickly ? I want to keep the users access to system but oracle. some thing like shell script may be ?:confused: I am trying to... (4 Replies)
Discussion started by: upengan78
4 Replies

2. Shell Programming and Scripting

Sed or awk script to remove text / or perform calculations from large CSV files

I have a large CSV files (e.g. 2 million records) and am hoping to do one of two things. I have been trying to use awk and sed but am a newbie and can't figure out how to get it to work. Any help you could offer would be greatly appreciated - I'm stuck trying to remove the colon and wildcards in... (6 Replies)
Discussion started by: metronomadic
6 Replies

3. Shell Programming and Scripting

remove a specific line in a LARGE file

Hi guys, i have a really big file, and i want to remove a specific line. sed -i '5d' fileThis doesn't really work, it takes a lot of time... The whole script is supposed to remove every word containing less than 5 characters and currently looks like this: #!/bin/bash line="1"... (2 Replies)
Discussion started by: blubbiblubbkekz
2 Replies

4. Shell Programming and Scripting

Remove Duplicate Filenames in 2 very large directories

Hello Gurus, O/S RHEL4 I have a requirement to compare two linux based directories for duplicate filenames and remove them. These directories are close to 2 TB each. I have tried running a: Prompt>diff -r data1/ data2/ I have tried this as well: jason@jason-desktop:~$ cat script.sh ... (7 Replies)
Discussion started by: jaysunn
7 Replies

5. Shell Programming and Scripting

How to remove a subset of data from a large dataset based on values on one line

Hello. I was wondering if anyone could help. I have a file containing a large table in the format: marker1 marker2 marker3 marker4 position1 position2 position3 position4 genotype1 genotype2 genotype3 genotype4 with marker being a name, position a numeric... (2 Replies)
Discussion started by: davegen
2 Replies

6. UNIX for Dummies Questions & Answers

Filtering F-Dupes

Is there an easy way to tell FDupes what filetypes to look at or ignore? (0 Replies)
Discussion started by: furashgf
0 Replies

7. Shell Programming and Scripting

Removing Dupes from huge file- awk/perl/uniq

Hi, I have the following command in place nawk -F, '!a++' file > file.uniq It has been working perfectly as per requirements, by removing duplicates by taking into consideration only first 3 fields. Recently it has started giving below error: bash-3.2$ nawk -F, '!a++'... (17 Replies)
Discussion started by: makn
17 Replies

8. Shell Programming and Scripting

remove large portion of web page code between two tags

Hi everybody, I am trying to remove bunch of lines from web pages between two tags: one is <h1> and the other is <table it looks like <h1>Anniversary cards roses</h1> many lines here <table summary="Free anniversary greeting cards." cellspacing="8" cellpadding="8" width="70%">my goal... (5 Replies)
Discussion started by: georgi58
5 Replies

9. Shell Programming and Scripting

Removing dupes within 2 delimited areas in a large dictionary file

Hello, I have a very large dictionary file which is in text format and which contains a large number of sub-sections. Each sub-section starts with the following header : #DATA #VALID 1 and ends with a footer as shown below #END The data between the Header and the Footer consists of... (6 Replies)
Discussion started by: gimley
6 Replies

10. Shell Programming and Scripting

Modify script to remove dupes with two delimiters

Hello, I have a script which removes duplicates in a database with a single delimiter = The script is given below: # script to remove dupes from a row with structure word=word BEGIN{FS="="} {for(i=1;i<=NF;i++){a++;}for(i in a){b=b"="i}{sub("=","",b);$0=b;b="";delete a}}1 How do I modify... (6 Replies)
Discussion started by: gimley
6 Replies
ns_sched(3aolserver)					    AOLserver Built-In Commands 				      ns_sched(3aolserver)

__________________________________________________________________________________________________________________________________________________

NAME
ns_after, ns_cancel, ns_pause, ns_resume, ns_schedule_daily, ns_schedule_proc, ns_schedule_weekly, ns_unschedule_proc - commands SYNOPSIS
ns_after seconds {script | procname ?args?} ns_cancel id ns_pause id ns_resume id ns_schedule_daily ?-thread? ?-once? hour minute {script | procname ?args?} ns_schedule_proc ?-thread? ?-once? interval {script | procname ?args?} ns_schedule_weekly ?-thread? ?-once? day hour minute {script | procname ?args?} ns_unschedule_proc id _________________________________________________________________ DESCRIPTION
ns_after run the specified script or procedure after the specified number of seconds ns_after returns an id which can be used with the ns_pause, ns_cancel and ns_resume apis. ns_cancel stops the scheduled running of the id returned by an ns_after returns 1 if unscheduled 0 if the script of procedure couldn't be unscheduled ns_pause pauses the scheduled running of the id returned by an ns_after returns 1 if paused, 0 if the script of procedure couldn't be paused ns_resume resumes the scheduled running of the id returned by an ns_after returns 1 if resumed, 0 if the script of procedure couldn't be resumed ns_schedule_daily ns_schedule_daily runs the specified Tcl script or procedure (procname) once a day at the time specified by hour and minute. The hour can be from 0 to 23, and the minute can be from 0 to 59. Specify -thread if you want a thread created to run the procedure. This will allow the scheduler to continue with other scheduled procedures. Specifying -thread is appropriate in situations where the script will not return immediately, such as when the script performs network activity. Specify -once if you want the script to run only one time. The default is that the script will be re-scheduled after each time it is run. ns_schedule_daily returns an id number for the scheduled procedure that is needed to stop the scheduled procedure with ns_unsched- ule_proc. ns_schedule_proc ns_schedule_proc runs the specified Tcl script or procedure (procname) at an interval specified by interval. The interval is the number of seconds between runs of the script. Specify -thread if you want a thread created to run the procedure. This will allow the scheduler to continue with other scheduled procedures. Specifying -thread is appropriate in situations where the script will not return immediately, such as when the script performs network activity. Specify -once if you want the script to run only one time. The default is that the script will be re-scheduled after each time it is run. ns_schedule_proc returns an id number for the scheduled procedure that is needed to stop the scheduled procedure with ns_unsched- ule_proc. ns_schedule_weekly ns_schedule_weekly runs the specified Tcl script or procedure (procname) once a week on the day specified by day and the time speci- fied by hour and minute. The day can be from 0 to 6, where 0 represents Sunday. The hour can be from 0 to 23, and the minute can be from 0 to 59. Specify -thread if you want a thread created to run the procedure. This will allow the scheduler to continue with other scheduled procedures. Specifying -thread is appropriate in situations where the script will not return immediately, such as when the script performs network activity. Specify -once if you want the script to run only one time. The default is that the script will be re-scheduled after each time it is run. ns_schedule_weekly returns an id number for the scheduled procedure that is needed to stop the scheduled procedure with ns_unsched- ule_proc. ns_unschedule_proc id ns_unschedule_proc stops a scheduled procedure from executing anymore. The scheduled procedure to be stopped is identified by its id, which was returned by the ns_schedule* function that was used to schedule the procedure. EXAMPLES
ns_after ns_cancel ns_pause ns_resume This example illustrates a web interface used to manage jobs. Depending on the action provided a job can be created, cancelled, paused or resumed. set action [ns_queryget action] set job [ns_queryget job] switch $action { create { set job [ns_after 10 [ns_queryget script]] ns_puts "Job created with id: $job" } cancel { if {[ns_cancel $job]} { ns_puts "Job $job cancelled" } else { ns_puts "Job $job not cancelled" } } pause { if {[ns_pause $job]} { ns_puts "Job $job paused" } else { ns_puts "Job $job not paused } } resume { if {[ns_resume $job]} { ns_puts "Job $job resumed" } else { ns_puts "Job $job couldn't be resumed" } } default { ns_puts "Invalid action $action" } } ns_schedule_daily This example defines a script called rolllog that uses ns_accesslog to roll the access log to a file with an extension containing the current date. The ns_schedule_daily function is used to execute the rolllog script on a daily basis. # Script to roll and rcp log file to host "grinder" proc rolllog {} { set suffix [ns_strftime "%y-%m-%d"] set new [ns_accesslog file].$suffix ns_accesslog roll $new exec rcp $new grinder:/logs/[file tail $new] } # Schedule "rolllog" to run at 3:30 am each morning ns_schedule_daily -thread 3 30 rolllog ns_schedule_proc proc dosomething blah { ns_log Notice "proc with arg '$blah'" } ns_schedule_proc 10 dosomething $arg1 SEE ALSO
KEYWORDS
schedule pause resume unschedule cancel after AOLserver 4.0 ns_sched(3aolserver)
All times are GMT -4. The time now is 12:38 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy