Extract common data out of multiple files

12-24-2012

I am trying to extract the list of organisms that is common to several files.
As an example, I show 3 files and the expected result below. In reality I have more than 1000 files. I know that awk and grep are useful for this, but I do not know them in depth, so I need some guidance.

I would like to use awk, grep, cut, perl or python to get the required result.

File A:

Pseudomonas stutzeri A1501
Pseudomonas fragi A22
Pseudomonas fluorescens A506
Aeromonas caviae Ae398
Rickettsiella grylli
Aeromonas veronii AMC34

File B:

Rickettsiella grylli
Pseudomonas fulva 12-X
Pseudomonas extremaustralis 14-3 substr. 14-3b
Aeromonas caviae Ae398
Gallaecimonas xiamenensis 3-C-1
Pseudomonas stutzeri A1501

File C:

Pseudomonas extremaustralis
Pseudomonas fulva 12-X
Pseudomonas extremaustralis 14-3 substr. 14-3b
Aeromonas caviae Ae398
Rickettsiella grylli
Pseudomonas stutzeri A1501

Expected result file (common organisms):

Aeromonas caviae Ae398
Pseudomonas stutzeri A1501
Rickettsiella grylli

Hoping for your suggestions and support.
Thank you in advance

macmath

12-24-2012

If each entry occurs at most once per file:

Code:

awk '++A[$0]>=ARGC-1' file*

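This version is a bit more robust, because $1=$1 makes awk rebuild each line, which strips leading and trailing whitespace and squeezes runs of blanks before the lines are compared: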

Code:

awk '{$1=$1} ++A[$0]>=ARGC-1' file*

But 1000 file names may be too many for the maximum command-line length.

In that case, try:

Code:

( 
  set -- file*
  for f
  do
    cat "$f"
  done | awk '{$1=$1} ++A[$0]>=c' c=$# 
)
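
If a line can occur more than once within a single file, the counts above no longer mean "number of files". A minimal sketch of one way around that is to reduce each file to its distinct lines before counting, and to print a line only at the moment its count reaches the number of files. The function name common_lines is a hypothetical helper, not something from the commands above.

```shell
# common_lines FILE...: print every line that occurs in all FILEs, once each.
# (hypothetical helper name; a sketch, not a tuned implementation)
common_lines() {
  for f in "$@"
  do
    # normalise whitespace, then keep each distinct line once per file
    awk '{$1=$1} !seen[$0]++' "$f"
  done | awk -v c="$#" '++A[$0]==c'   # c = number of files given
}
```

It would be called as common_lines file* and, since a shell function is not a separate program, the length of the expanded file list should not run into the command-line limit.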

Scrutinizer

12-24-2012

Try this (the 3 in the grep pattern is the number of files):

Code:

$ cat file?|sort|uniq -c|sort -rnb|grep "^ *3"| cut -d" " -f8-30
Rickettsiella grylli
Pseudomonas stutzeri A1501
Aeromonas caviae Ae398
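
The grep and cut stages above depend on how uniq -c pads its counts, and the pattern would also let through counts such as 30. A hedged variant compares the count field numerically instead. It still assumes that each line occurs at most once per file. The function name lines_in_all and its first argument are illustrative assumptions.

```shell
# lines_in_all N FILE...: print lines whose total count over all FILEs is N.
# (hypothetical helper name; assumes a line occurs at most once per file)
lines_in_all() {
  n=$1
  shift
  sort "$@" | uniq -c |
    # $1 is the count written by uniq -c; strip it before printing the line
    awk -v n="$n" '$1 == n { sub(/^[ \t]*[0-9]+[ \t]/, ""); print }'
}
```

For the three sample files it would be called as lines_in_all 3 file? and the result comes out in sort order rather than in reverse.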

RudiC

12-25-2012

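A simple while loop will do the job:

Code:

# count the input files without parsing ls output
set -- file?
file_count=$#
# check each distinct line of file1 once; file1 itself is left untouched
sort -u file1 > temp
while IFS= read -r i
do
  # -F fixed string, -x whole line only, -l list each matching file once
  result_count=$(grep -Fxl -- "$i" "$@" | wc -l)
  if [ $result_count -eq $file_count ]; then
    printf '%s\n' "$i"
  fi
done < temp
rm temp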

sathyaonnuix

12-26-2012

Hi Sathyaonnuix,

The solution you shared is very impressive, but if file1 contains the same line several times (and that line is also common to the other files), the line will appear several times in the output.
Maybe we can use sort and uniq to overcome this small problem, something like:

Code:

sort file1 | uniq > tmp.tmp

and then run the while loop on tmp.tmp.

mukulverma2408

12-27-2012

Hello Mukul,
Thanks for your feedback. grep -l lists each matching file only once, so repetition within a file is suppressed.

Code:

# cat file
repeat
repeat
repeat
123
456
567

Code:

# grep -lw repeat file
file

sathyaonnuix

12-30-2012

Hi sathyaonnuix,
Consider the scenario below:

Code:

cat filea
line1
line2
repeat1
line3
repeat2
line4
repeat1

Code:

cat fileb
1234567
repeat1
repeat2
bbbbbb

Code:

cat filec
line1
repeat1
repeat2
line2
repeat1
line3

Running the script on these three files produces the output below:

Code:

repeat1
repeat2
repeat1

repeat1 is printed twice, which I don't think is wanted.

P.S. I am running the while loop on filea.
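
For the scenario above, one sketch that avoids the repeated line, no matter which file drives the loop, is to reduce every file to its sorted set of distinct lines and intersect the sets one file at a time with comm -12. The function name intersect_files and the temporary files are illustrative assumptions.

```shell
# intersect_files FILE...: print lines common to all FILEs, each line once.
# (hypothetical helper name; uses three temporary files from mktemp)
intersect_files() {
  acc=$(mktemp) || return 1
  cur=$(mktemp) || return 1
  nxt=$(mktemp) || return 1
  # start with the distinct lines of the first file
  LC_ALL=C sort -u "$1" > "$acc"
  shift
  for f in "$@"
  do
    LC_ALL=C sort -u "$f" > "$cur"
    # comm -12 keeps only the lines present in both sorted inputs
    LC_ALL=C comm -12 "$acc" "$cur" > "$nxt"
    mv "$nxt" "$acc"
  done
  cat "$acc"
  rm -f "$acc" "$cur" "$nxt"
}
```

Called as intersect_files filea fileb filec, this prints repeat1 and repeat2 once each. Because only two files are compared at a time, the same approach should also scale to a large number of files.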

mukulverma2408
