File processing


 
# 1  
Old 04-19-2010
Hi guys, I have two sets of files. File A has around 4 million lines. The format is:

fileA

Code:
test.term.n4814 test.term.n3199
test.term.n4814 test.term.n4803
test.term.n4814 test.term.n_1767
test.term.n4810 test.term.n_3708
test.term.n4811 test.term.n_3745
test.term.n4817 test.term.n_3869
test.term.n4812 test.term.n_64430
test.term.n4814 test.term.n_75678
test.term.n4814 test.term.n_75686
test.term.n4819 test.term.n_75702
test.term.n4812 test.term.n_77979
test.term.n4818 test.term.n_78077
test.term.n4813 test.term.n_78522
test.term.n4815 test.term.n_87649
test.term.n4817 test.term.n_87818

File B has fewer lines (a few thousand), e.g.:

fileB
Code:
test.term.n_75702
test.term.n4819
test.term.n4814
term.n_78077

I am trying to write a script that checks whether a line from file B exists in file A and, if so, deletes the matching line from file A. The script I have written is:

filter.sh
Code:
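#!/bin/sh
# Seed temp2 so the first "cp temp2 temp1" inside the loop has a file to copy
cp fileA temp2
while read -r line
do
    cp temp2 temp1
    # One full sed pass over the data for every line of fileB
    sed "/${line}/d" temp1 > temp2
done < fileB
cp temp2 filtered_fileA
# Remove only the two scratch files instead of everything matching temp*
\rm temp1 temp2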


This script works for small files like the ones above.

However, on the actual file, which has over 4 million lines, it has been running for 4-5 days.

Is there a faster and easier way to do this on large files?

And the expected output is

filtered_fileA
Code:
test.term.n4810 test.term.n_3708
test.term.n4811 test.term.n_3745
test.term.n4817 test.term.n_3869
test.term.n4812 test.term.n_64430 
test.term.n4812 test.term.n_77979 
test.term.n4813 test.term.n_78522 
test.term.n4815 test.term.n_87649 
test.term.n4817 test.term.n_87818


Thanks in advance,
Naveen

Last edited by naveen@; 04-19-2010 at 12:06 PM.. Reason: formatting issue
# 2  
Old 04-19-2010
Could you please format your post so we can read it?
# 3  
Old 04-19-2010
Sorry about that Jim,

I took care of it.

Thanks
Naveen
# 4  
Old 04-19-2010
Code:
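awk 'FILENAME=="fileB"  {arr[$0]++}   # remember every line of fileB as a key
     FILENAME=="fileA"  {if( ($2 in arr) || ($1 in arr) ) {deleted++;next} else {print $0} }
     # report the count on stderr so it does not end up inside newfile
     END{print "lines deleted", deleted+0 > "/dev/stderr" } ' fileB fileA > newfile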
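
One thing to watch: the awk version compares whole fields, so an entry like term.n_78077 in fileB will not remove the line test.term.n4818 test.term.n_78077, although the expected output in post #1 drops it. If the substring matching of the original sed loop is what is wanted, grep might be worth a try. This is only a minimal sketch on a cut-down copy of the sample data. The file names sampleA, sampleB and filtered_sampleA are made up for the demo, and it has not been timed on a 4-million-line file.

```shell
#!/bin/sh
# Demo data: a few lines from post #1 (sampleA/sampleB are hypothetical names).
cat > sampleA <<'EOF'
test.term.n4814 test.term.n3199
test.term.n4810 test.term.n_3708
test.term.n4819 test.term.n_75702
test.term.n4818 test.term.n_78077
test.term.n4813 test.term.n_78522
EOF

cat > sampleB <<'EOF'
test.term.n_75702
test.term.n4819
test.term.n4814
term.n_78077
EOF

# -F  treat every line of sampleB as a fixed string, not a regex
# -f  read the list of strings from sampleB
# -v  keep only the sampleA lines that contain none of them
grep -v -F -f sampleB sampleA > filtered_sampleA

cat filtered_sampleA
```

Because the match is by substring, test.term.n4814 would also remove a line containing test.term.n48140, and a blank line in the pattern file matches every line, so it is probably worth checking fileB for empty lines first.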


Last edited by jim mcnamara; 04-19-2010 at 12:26 PM..
# 5  
Old 04-19-2010
Thanks a lot Jim.

Is there a way to check just the number of lines it has deleted from fileA? Something like tkdiff fileA filtered_fileA will highlight the deleted lines in GUI mode.

But as the files have millions of lines, it would be great if I could just get the number of deleted lines.

Thanks,
Naveen
# 6  
Old 04-19-2010
See the change in the original code above
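
If the filtered file already exists, the count can also be checked after the fact by comparing line counts. A rough sketch, assuming the filter only ever removes lines; count_deleted, before.txt and after.txt are made-up names for the demo.

```shell
#!/bin/sh
# count_deleted ORIGINAL FILTERED  (hypothetical helper)
# Prints how many lines ORIGINAL has beyond FILTERED.
count_deleted() {
    echo $(( $(wc -l < "$1") - $(wc -l < "$2") ))
}

# Tiny demo files standing in for fileA and the filtered result.
printf '%s\n' 'a 1' 'b 2' 'c 3' 'd 4' 'e 5' > before.txt
printf '%s\n' 'b 2' 'd 4' > after.txt

count_deleted before.txt after.txt
```

On the real data the call would be along the lines of count_deleted fileA newfile.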