Getting non-unique lines from concatenated files

03-31-2011

Registered User

164, 1

Join Date: Mar 2011

Last Activity: 6 August 2015, 12:14 AM EDT

Posts: 164

Thanks Given: 119

Thanked 1 Time in 1 Post

Hello Bartus,
Good morning.

Today's question! In your code:

Code:

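#!/bin/sh
# Usage: script LIST file...
LIST=$1
shift
for i in "$@"; do
  echo "$i:"
  # -s turns -file=... into $file; -a splits each line of LIST into @F
  perl -nase 'BEGIN{open I, "$file";@I=<I>}{print grep {/$F[0]/&&/$F[1]/} @I}' -- -file="$i" "$LIST"
done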

What if I didn't know, or didn't want to specify, which fields in file1 contain the patterns I want to grep for in the other files in the list? How do I go about that? The reason I'm asking is that in my case file1 can come in various line formats, and the patterns to be grepped are not always located in $F[0] and $F[1]. I could write different code for each file1 type, but I was wondering if there is a smarter and more efficient way to accomplish such a task. Could you please enlighten me on this?
Cheers and have a nice day

pawannoel


03-31-2011

Registered User

3,733, 1,154

Join Date: Apr 2009

Last Activity: 3 August 2016, 11:03 AM EDT

Posts: 3,733

Thanks Given: 7

Thanked 1,154 Times in 1,124 Posts

Post examples of those various line formats

bartus11


03-31-2011


OK sure
line format1:

Code:

chr01    16254

line format2:

Code:

chr01    lev5    16254

line format3:

Code:

chr01     lev5        SNP     16254

line format4:

Code:

SK1.chr01    SOLiD_diBayes    SNP    16254    16254    0.000000    .    .    genotype=G;reference=A;coverage=93;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=88;novelAlleleStarts=55;novelAlleleMeanQV=25;diColor1=02;diColor2=02;het=0;flag=h4,h10,h9,

line format5:

Code:

SK1.chr01    16254    levure5    A    G    225    .    DP=407;AF1=0.5;CI95=0.5,0.5;DP4=142,103,72,68;MQ=31;FQ=225;PV4=0.24,1,1,1    GT:PL:GQ;telomere;ID=TEL01L;Name=TEL01L

So basically the chromosome name is always in $F[0], sometimes with an extra prefix (such as "SK1.") that can be stripped with sed, but the actual position may be in a different field depending on the format.

Cheers

pawannoel


03-31-2011


So can we assume that from $F[0] you need the part after the dot, and that the second grep pattern is the first numeric field of the line? If so, try this:

Code:

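#!/bin/sh
# Usage: script LIST file...
LIST=$1
shift
for i in "$@"; do
  echo "$i:"
  # strip everything up to the last dot from $F[0]; save the first standalone number in $x
  perl -nase 'BEGIN{open I, "$file";@I=<I>}{$F[0]=~s/.*\.//;/.*?\b(\d+)\b/;$x=$1;print grep {/$F[0]/&&/$x/} @I}' -- -file="$i" "$LIST"
done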


bartus11


03-31-2011


Thank you, that works.

One thing: why did you have to do $x=$1? Why not use $1 directly in the grep part? Is that because of LIST=$1?
Cheers

pawannoel


03-31-2011


Let's analyze the behavior of the code if we use $1 directly in the grep part:

Code:

perl -nase 'BEGIN{open I, "$file";@I=<I>}{$F[0]=~s/.*\.//;/.*?\b(\d+)\b/;print grep {/$F[0]/&&/$1/} @I}' -- -file=$i $LIST

When the grep block runs, /$F[0]/ is evaluated first, so by the time /$1/ is evaluated, $1 has already been changed by that match. They are two separate regular expressions, and each successful match repopulates the regex-related variables. This is why it is so important to save the contents of those variables immediately after the regex match ($x=$1 in the original code).
PS: LIST=$1 and the $1 in the Perl code are two separate variables. The first $1 is the shell's variable. Because the Perl code is kept inside single quotes, the shell never substitutes its own $1 into it.

Last edited by bartus11; 03-31-2011 at 08:50 AM..


bartus11


03-31-2011


Thank you Master

---------- Post updated at 10:49 AM ---------- Previous update was at 07:13 AM ----------

Hi Bartus,
Another question, this time about manipulating a different file type. A sample and the expected output are below. I want to pull out the contents of $F[0], $F[1], $F[4] and $F[5], but the requirements for each field are different. The output should be:
$F[0]
$F[1] as a range (beginning - end) for as long as $F[0] stays the same
$F[4] joined together and wrapped into horizontal lines of 100 characters each ("\n" after each)
$F[5] joined together and wrapped into horizontal lines of 100 characters each ("\n" after each)
Sample file:

Code:

SK1.chr10    3006    02    02    G    G    1.000000    h4,h10,h2,h21,h22,m6    3    3    3    0    15    0    0    -1    
SK1.chr10    3007    22    22    A    A    1.000000    h4,h10,h21,h22,m6    4    4    4    0    8    0    0    -1    
SK1.chr10    3008    21    21    G    G    0.000000    h4,h10,h21,h22,     7    7    7    0    8    0    0    -1    
SK1.chr10    3009    10    10    T    T    0.000000    h4,h10,h21,h22,     11    11    11    0    15    0    0    -1    
SK1.chr10    3010    01    01    T    T    0.000000    h4,h10,h21,h22,     14    14    14    0    16    0    0    -1
SK1.chr09    455566    31    31    T    T    0.000000    h4,h10,h9,h21,h22,     11    8    8    0    10    0    0    -1    
SK1.chr09    455567    13    13    G    G    0.000000    h4,h10,h9,h15,h21,h22,     11    8    8    0    10    0    0    -1

Expected output:

Code:

SK1.chr10
3006-3010
Ref: 
GAGTT

Gen
GAGTT

SK1.chr09
455566-455567
Ref: 
TG

Gen
TG

I have no idea how to go about it ... Could you please provide insight to deal with this?

Cheers and have a nice evening

pawannoel

