Getting non unique lines from concatenated files


 
# 22  
Old 03-21-2011
Try:
Code:
x=`sort file_1`;for i in file_2 file_3; do x=`comm -12 <(echo "$x") <(sort $i)`; done; echo "$x"

# 23  
Old 03-21-2011
Thanks Bartus11

Could you comment on the code, please? Also, the new thing for me in this code is
` `
What is the significance of this, and why not use ' ' instead?

Thank you
# 24  
Old 03-21-2011
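` ` (called backticks) performs command substitution: the shell runs the enclosed command and puts its output in place, so it can be stored in a variable. Single quotes ' ' would store the literal text instead of running anything. So x=`sort file_1` loads the sorted contents of file_1 into the "x" variable. Then, iterating over the other files, "x" keeps only those lines that are also found in the current file, so at the end it holds the lines present in all the files (comm -12 only shows lines common to both compared files).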
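To make the difference between the quoting styles concrete, here is a minimal sketch. The file demo.txt and the variable names are made up for illustration; they are not from the thread.

```shell
#!/bin/sh
# Sketch: single quotes vs. backticks, using a throwaway file (demo.txt is a made-up name).
LC_ALL=C; export LC_ALL
dir=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$dir/demo.txt"

literal='sort demo.txt'            # single quotes: the text itself is stored, nothing runs
output=`sort "$dir/demo.txt"`      # backticks: the command runs and its output is stored
modern=$(sort "$dir/demo.txt")     # $(...) does the same thing and is easier to nest

echo "literal: $literal"
echo "output:"
echo "$output"

rm -r "$dir"
```

The first echo prints the command text unchanged, while the second shows the three sorted lines. Both `output` and `modern` end up with the same contents.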
# 25  
Old 03-21-2011
@pawannoel
Just for your info, uniq can also help sometimes:

The -d option displays duplicated lines (one copy of each).
The -u option displays lines that are unique (appear only once).

Note that uniq only compares adjacent lines, so the files you are scanning must be sorted beforehand.

Consider:

Code:
# cat tst2
1
5
1
2
4
3
2
4

This gives you all distinct values:
Code:
# sort tst2 | uniq
1
2
3
4
5

... which can be written more efficiently using the -u option of the sort command (do not confuse it with the -u option of the uniq command!):
Code:
# sort -u tst2
1
2
3
4
5

This way, you get all the lines that are unique (appearing only once) in the file (note the importance of sorting it first):
Code:
# sort tst2 | uniq -u
3
5

And finally, this gives you only the duplicated lines (those appearing more than once) in the file:
Code:
# sort tst2 | uniq -d
1
2
4

For educational purposes, see what uniq outputs on an unsorted file:

Code:
# cat tst2
1
1
1
1
1
5
1
2
4
3
2
4
# uniq tst2
1
5
1
2
4
3
2
4
#

# 26  
Old 03-21-2011
@Bartus11: Thanks again for the comments and all your help. I'll be back with more. Have a nice evening or day, whichever fits your local time.

@ctsgnb:
Thank you too. I already knew about the uniq command but not about the -u and -d options. I guess I could have done a
Code:
man uniq

but it's much nicer to hear it from you all. I'm loving programming, even if I can only script a few lines of code. Keep the good info coming this way. Merci (guessing you're French).

---------- Post updated at 02:59 PM ---------- Previous update was at 02:45 PM ----------

I actually have a question already, concerning the same files we are dealing with. My output file now contains a series of lines with 9 fields. In a normal file I could have done sort -n -k9.
For me the complication arises because I need to sort on a parameter in the 9th field, but that field contains a series of key=value items delimited by semicolons. An example is shown below. Say, for example, I wanted to sort the data by coverage, which is embedded with the other info in field 9: how do I do that? It would also be nice if the code could report the maximum and minimum of the "sorting parameter" for that particular file. I hope I'm clear.
Could you please help out with this?

Thanks very much

Code:
chr01   levure5 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01   levure6 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure7 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure8 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=37;refAlleleCounts=18;refAlleleStarts=17;refAlleleMeanQV=18;novelAlleleCounts=18;novelAlleleStarts=13;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure5 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=111;refAlleleCounts=82;refAlleleStarts=41;refAlleleMeanQV=18;novelAlleleCounts=28;novelAlleleStarts=9;novelAlleleMeanQV=18;diColor1=10;diColor2=32;het=1;flag=
chr01   levure6 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=96;refAlleleCounts=72;refAlleleStarts=38;refAlleleMeanQV=17;novelAlleleCounts=24;novelAlleleStarts=6;novelAlleleMeanQV=15;diColor1=10;diColor2=32;het=1;flag=
chr01   levure7 SNP     16254   16254   0.000000        .       .       genotype=R;reference=G;coverage=32;refAlleleCounts=20;refAlleleStarts=18;refAlleleMeanQV=19;novelAlleleCounts=12;novelAlleleStarts=5;novelAlleleMeanQV=17;diColor1=10;diColor2=32;het=1;flag=

---------- Post updated at 03:01 PM ---------- Previous update was at 02:59 PM ----------

Just to clarify: I would like to have control over any sorting parameter of choice within field 9, not just coverage.
Cheers

---------- Post updated at 04:21 PM ---------- Previous update was at 03:01 PM ----------

@Bartus11

Hi mate, there is one thing I don't understand in your code: how is it possible to store the results of different commands in the same variable x? I mean, it is possible, because the code works, but can you explain?

Code:
x=`sort file_1`;for i in file_2 file_3; do x=`comm -12 <(echo "$x") <(sort $i)`; done; echo "$x"

# 27  
Old 03-21-2011
The contents of the "x" variable are replaced on each run of the "for" loop: the old value is read by echo "$x", and the output of comm -12 is then assigned back to x. I'll take a look at your sorting problem.
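Here is a small trace of how x changes from one loop run to the next. It is only a sketch: the three files and their contents are made up, and a temporary file (x.tmp) stands in for <(echo "$x") so that it also runs in a plain sh.

```shell
#!/bin/sh
# Trace of the loop with three tiny made-up files (names and contents are illustrative).
LC_ALL=C; export LC_ALL
dir=$(mktemp -d); cd "$dir" || exit 1
printf 'a\nb\nc\nd\n' > file_1
printf 'b\nc\nd\ne\n' > file_2
printf 'c\nd\nf\n'    > file_3

x=`sort file_1`
echo "start:" $x                       # unquoted on purpose: prints the lines on one row
for i in file_2 file_3; do
    printf '%s\n' "$x" > x.tmp         # the current value of x, written out for comm
    x=`sort "$i" | comm -12 x.tmp -`   # the old value is read first, then overwritten
    echo "after $i:" $x
done

cd / && rm -r "$dir"
# Prints:
# start: a b c d
# after file_2: b c d
# after file_3: c d
```

Each pass narrows x down to the lines it shares with the next file, which is why a single variable is enough.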

---------- Post updated at 05:05 PM ---------- Previous update was at 04:26 PM ----------

Until someone comes up with a nicer solution...
Code:
perl -lne '$h{$.}=$_;END{@o=sort{(split "=",(split ";",(split " +",$h{$a})[8])[2])[1]<=>(split "=",(split ";",(split " +",$h{$b})[8])[2])[1]}keys %h; for $i (@o){print $h{$i}}}' file

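The index to change is the [2] that follows the split on ";" (it was marked in red in the original post and occurs twice, once for $a and once for $b). It selects which item of field 9 to sort on (in Perl, indexes start from 0, so genotype is 0, reference is 1, coverage is 2, etc.).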
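As an alternative sketch (not the Perl approach above), the item can be picked by name rather than by position, using awk to prepend the value, sort to order the lines, and cut to strip the value again. The helper names sort_by_key and key_range are hypothetical, and the sample records are abbreviated from the thread's data. It assumes whitespace-separated fields with no blanks inside field 9.

```shell
#!/bin/sh
# Sketch: sort records by a named key inside field 9 and report its min and max.
# sort_by_key and key_range are hypothetical helpers; usage: <helper> KEY FILE
LC_ALL=C; export LC_ALL

sort_by_key() {
    awk -v key="$1" '{
        val = ""
        n = split($9, kv, ";")                  # field 9 holds key=value items
        for (j = 1; j <= n; j++)
            if (index(kv[j], key "=") == 1) { val = substr(kv[j], length(key) + 2); break }
        printf "%s\t%s\n", val, $0              # decorate: value, tab, original line
    }' "$2" | sort -n -k1,1 | cut -f2-          # numeric sort, then drop the decoration
}

key_range() {
    awk -v key="$1" '{
        n = split($9, kv, ";")
        for (j = 1; j <= n; j++)
            if (index(kv[j], key "=") == 1) {
                v = substr(kv[j], length(key) + 2) + 0
                if (!seen || v < min) min = v
                if (!seen || v > max) max = v
                seen = 1
            }
    }
    END { if (seen) print "min=" min, "max=" max }' "$2"
}

# Demo on three abbreviated records (field 9 shortened for readability).
sample=$(mktemp)
cat > "$sample" <<'EOF'
chr01 levure5 SNP 12745 12745 0.000000 . . genotype=S;reference=C;coverage=91;het=1
chr01 levure6 SNP 12745 12745 0.000000 . . genotype=S;reference=C;coverage=62;het=1
chr01 levure7 SNP 12745 12745 0.000000 . . genotype=S;reference=C;coverage=24;het=1
EOF
sort_by_key coverage "$sample"
key_range coverage "$sample"
rm -f "$sample"
```

Passing a different key name, for example refAlleleCounts, sorts on that item instead. Lines that lack the key get an empty value and sort first.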
# 28  
Old 03-21-2011
Hi Bartus11,

I'm not sure this code is working! I get the following result (just showing you the first few lines):
Code:
pawan-noels-computer:/Volumes/USB/test noel$ perl -lne '$h{$.}=$_;END{@o=sort{(split "=",(split ";",(split " +",$h{$a})[8])[2])[1]<=>(split "=",(split ";",(split " +",$h{$b})[8])[2])[1]}keys %h; for $i (@o){print $h{$i}}}' file_1
chr15   levure5 SNP     30924   30924   0.000000        .       .       genotype=R;reference=A;coverage=43;refAlleleCounts=23;refAlleleStarts=16;refAlleleMeanQV=17;novelAlleleCounts=19;novelAlleleStarts=11;novelAlleleMeanQV=19;diColor1=01;diColor2=23;het=1;flag=
chr02   levure7 SNP     792879  792879  0.010875        .       .       genotype=M;reference=A;coverage=24;refAlleleCounts=19;refAlleleStarts=14;refAlleleMeanQV=15;novelAlleleCounts=4;novelAlleleStarts=3;novelAlleleMeanQV=26;diColor1=23;diColor2=32;het=1;flag=;gene;ID=YBR298C;Name=YBR298C;gene=MAL31;Alias=MAL31,MALT,MAL3T
chr02   levure7 SNP     459336  459336  0.000000        .       .       genotype=Y;reference=T;coverage=15;refAlleleCounts=9;refAlleleStarts=7;refAlleleMeanQV=25;novelAlleleCounts=6;novelAlleleStarts=6;novelAlleleMeanQV=22;diColor1=12;diColor2=30;het=1;flag=;gene;ID=YBR115C;Name=YBR115C;gene=LYS2;Alias=LYS2
chr10   levure6 SNP     39633   39633   0.000000        .       .       genotype=T;reference=A;coverage=9;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=9;novelAlleleStarts=5;novelAlleleMeanQV=18;diColor1=23;diColor2=23;het=0;flag=h4,h10,;gene;ID=YJL211C;Name=YJL211C
chr15   levure6 SNP     30924   30924   0.000000        .       .       genotype=R;reference=A;coverage=36;refAlleleCounts=26;refAlleleStarts=19;refAlleleMeanQV=14;novelAlleleCounts=10;novelAlleleStarts=8;novelAlleleMeanQV=20;diColor1=01;diColor2=23;het=1;flag=

 