Extract common data out of multiple files | Unix Linux Forums | UNIX for Dummies Questions & Answers

Tags
awk, cut, grep, perl, python

#1  12-24-2012, macmath (Registered User, Marseille, France)
Extract common data out of multiple files

I am trying to extract the common list of organisms from several files.
As an example I have taken 3 files and shown the expected result; in reality I have more than 1000 files. I know that awk and grep are useful for this, but not in enough depth, so I need some guidance.

I would like to use awk, grep, cut, Perl or Python to get the required result.
File A:
Code:
Pseudomonas stutzeri A1501
Pseudomonas fragi A22
Pseudomonas fluorescens A506
Aeromonas caviae Ae398
Rickettsiella grylli
Aeromonas veronii AMC34

File B:
Code:
Rickettsiella grylli
Pseudomonas fulva 12-X
Pseudomonas extremaustralis 14-3 substr. 14-3b
Aeromonas caviae Ae398
Gallaecimonas xiamenensis 3-C-1
Pseudomonas stutzeri A1501

File C:
Code:
Pseudomonas extremaustralis
Pseudomonas fulva 12-X
Pseudomonas extremaustralis 14-3 substr. 14-3b
Aeromonas caviae Ae398
Rickettsiella grylli
Pseudomonas stutzeri A1501

Expected result (common organisms):
Code:
Aeromonas caviae Ae398
Pseudomonas stutzeri A1501
Rickettsiella grylli
Hoping for your suggestions and support.
Thank you in advance
#2  12-24-2012, Scrutinizer (Moderator, Amsterdam)
If there is a maximum of one entry per organism in each file:

Code:
awk '++A[$0]>=ARGC-1' file*

This would then be a bit more robust, since it normalizes the whitespace between fields before comparing lines:

Code:
awk '{$1=$1} ++A[$0]>=ARGC-1' file*

But 1000 files is probably going to exceed the maximum command-line length.

Otherwise try:

Code:
( 
  set -- file*
  for f
  do
    cat "$f"
  done | awk '{$1=$1} ++A[$0]>=c' c=$# 
)
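If the same organism can appear more than once within a single file, the counts in the one-liners above would overshoot. A sketch of a variant that counts each distinct line at most once per file, using awk's FILENAME to scope the dedup (the sample file names and contents here are made up for illustration):

```shell
#!/bin/sh
# Sketch (assumes POSIX awk and mktemp): dedupe within each file before
# counting, so repeated names in one file do not inflate the tally.
tmp=$(mktemp -d)
printf 'Rickettsiella grylli\nAeromonas caviae Ae398\nAeromonas caviae Ae398\n' > "$tmp/fileA"
printf 'Aeromonas caviae Ae398\nRickettsiella grylli\n' > "$tmp/fileB"
# Count each distinct line once per file; keep lines seen in every file
# (ARGC-1 is the number of file arguments).
common=$(awk '!seen[FILENAME,$0]++ { c[$0]++ }
              END { for (l in c) if (c[l] == ARGC-1) print l }' "$tmp"/file* | sort)
printf '%s\n' "$common"
rm -rf "$tmp"
```

The duplicated "Aeromonas caviae Ae398" in fileA is counted only once, so both organisms still come out as common to the two files.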


#3  12-24-2012, RudiC (Registered User, Aachen, Germany)
Try this:
Code:
$ cat file?|sort|uniq -c|sort -rnb|grep "^ *3"| cut -d" " -f8-30
Rickettsiella grylli
Pseudomonas stutzeri A1501
Aeromonas caviae Ae398
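The `cut -d" " -f8-30` at the end depends on exactly how many spaces `uniq -c` uses to pad the count, which can differ between implementations. A sketch of the same idea (under the same assumption: no duplicate lines within any one file) that strips the count with sed instead; the file names and contents are invented for the demonstration:

```shell
#!/bin/sh
# Sketch: lines occurring exactly 3 times across 3 files are common to all;
# sed strips uniq -c's count without guessing at field positions.
tmp=$(mktemp -d)
printf 'Rickettsiella grylli\nAeromonas caviae Ae398\n' > "$tmp/file1"
printf 'Aeromonas caviae Ae398\nRickettsiella grylli\n' > "$tmp/file2"
printf 'Aeromonas caviae Ae398\nRickettsiella grylli\nPseudomonas fulva 12-X\n' > "$tmp/file3"
common=$(sort "$tmp"/file? | uniq -c | sed -n 's/^ *3 //p' | sort)
printf '%s\n' "$common"
rm -rf "$tmp"
```

"Pseudomonas fulva 12-X" appears in only one file, so it is filtered out.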

#4  12-25-2012, sathyaonnuix (Registered User)
A simple while loop will do the work for you:


Code:
file_count=$(ls file? | wc -l)                    # number of input files
sort -u file1 > temp; cat temp > file1; rm temp   # de-duplicate file1 in place
while read -r i
do
  result_count=$(grep -lw "$i" file? | wc -l)     # files containing this line
  if [ "$result_count" -eq "$file_count" ]; then
    echo "$i"
  fi
done < file1


#5  12-26-2012, mukulverma2408 (Registered User)
Hi sathyaonnuix,

The solution you shared is very impressive, but if file1 contains the same line multiple times (a line that is also common to the other files), that line will occur multiple times in the output.
Maybe we can use sort and uniq to overcome this little problem, somewhat like:

Code:
sort file1 | uniq > tmp.tmp

and then run the while loop against this tmp.tmp file instead of file1.
#6  12-27-2012, sathyaonnuix (Registered User)
Hello Mukul,
Thanks for your feedback. When the grep -l command is used, it suppresses the repetition: it prints each matching file name only once, however many lines in that file match.


Code:
# cat file
repeat
repeat
repeat
123
456
567


Code:
# grep -lw repeat file
file

#7  12-30-2012, mukulverma2408 (Registered User)
Hi sathyaonnuix,
Consider the scenario below:

Code:
cat filea
line1
line2
repeat1
line3
repeat2
line4
repeat1


Code:
cat fileb
1234567
repeat1
repeat2
bbbbbb


Code:
cat filec
line1
repeat1
repeat2
line2
repeat1
line3

Now executing the script on these three files produces the output below:

Code:
repeat1
repeat2
repeat1

repeat1 is repeated, which I think was not intended.

P.S. I am running the while loop on filea.
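Combining the two suggestions, a sketch of the loop driven by a de-duplicated list, with grep's -w swapped for -xF so only whole-line, literal matches count (that switch is my assumption about the intent; the sample files are the filea/fileb/filec shown above):

```shell
#!/bin/sh
# Sketch: sort -u makes each candidate line get tested exactly once,
# avoiding the duplicated "repeat1" in the output above.
tmp=$(mktemp -d)
printf 'line1\nline2\nrepeat1\nline3\nrepeat2\nline4\nrepeat1\n' > "$tmp/filea"
printf '1234567\nrepeat1\nrepeat2\nbbbbbb\n' > "$tmp/fileb"
printf 'line1\nrepeat1\nrepeat2\nline2\nrepeat1\nline3\n' > "$tmp/filec"
file_count=$(ls "$tmp"/file? | wc -l)
common=$(sort -u "$tmp/filea" | while read -r i; do
  # -x: match whole lines; -F: literal string; -l: print file names once
  hits=$(grep -lxF -- "$i" "$tmp"/file? | wc -l)
  [ "$hits" -eq "$file_count" ] && printf '%s\n' "$i"
done)
printf '%s\n' "$common"
rm -rf "$tmp"
```

With the de-duplicated driver list, repeat1 and repeat2 each appear exactly once in the output.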