Getting non unique lines from concatenated files


 
# 99  
Old 03-31-2011
Hello Bartus,
Good morning! Today's question:
So, in your code:
Code:
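#!/bin/sh
# Usage: script LIST FILE...  (LIST holds the patterns, each FILE is searched)
LIST=$1
shift
for i in "$@"; do
  echo "$i:"
  # Load FILE into @I once, then for every LIST line print the FILE lines matching both fields
  perl -nase 'BEGIN{open I, "$file";@I=<I>}{print grep {/$F[0]/&&/$F[1]/} @I}' -- -file="$i" "$LIST"
done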

What if I didn't know, or didn't want to specify, which fields in file1 contain the patterns I want to grep from the other files in the list? How do I go about that? The reason I'm asking is that in my case file1 can come in various line formats, and the patterns to be grepped are not always located in $F[0] and $F[1]. I could write different code for each file1 type, but I was wondering whether there is a smarter and more efficient way to accomplish this. Could you please enlighten me?
Cheers, and have a nice day!
# 100  
Old 03-31-2011
Post examples of those various line formats.
# 101  
Old 03-31-2011
OK sure
line format1:
Code:
chr01    16254

line format2:
Code:
chr01    lev5    16254

line format3:
Code:
chr01     lev5        SNP     16254

line format4:
Code:
SK1.chr01    SOLiD_diBayes    SNP    16254    16254    0.000000    .    .    genotype=G;reference=A;coverage=93;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=88;novelAlleleStarts=55;novelAlleleMeanQV=25;diColor1=02;diColor2=02;het=0;flag=h4,h10,h9,

line format5:
Code:
SK1.chr01    16254    levure5    A    G    225    .    DP=407;AF1=0.5;CI95=0.5,0.5;DP4=142,103,72,68;MQ=31;FQ=225;PV4=0.24,1,1,1    GT:PL:GQ;telomere;ID=TEL01L;Name=TEL01L

So basically the first pattern is always in $F[0], sometimes with an additional prefix that can be stripped out with sed, but the actual position may sit in a different field depending on the format.

Cheers
# 102  
Old 03-31-2011
So can we assume that from $F[0] you need the part after the dot, and that the second grep pattern is the first purely numeric field of the line? If so, try this:
Code:
#!/bin/sh
LIST=$1
shift
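for i in "$@"; do
  echo "$i:"
  # Strip everything up to the dot from $F[0], save the first all-digit word in $x, then grep with both
  perl -nase 'BEGIN{open I, "$file";@I=<I>}{$F[0]=~s/.*\.//;/.*?\b(\d+)\b/;$x=$1;print grep {/$F[0]/&&/$x/} @I}' -- -file="$i" "$LIST"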
done

# 103  
Old 03-31-2011
Thank you, that works. One thing: why did you have to do $x=$1? Why not use $1 in the grep part? Is that because of LIST=$1 already?
Cheers
# 104  
Old 03-31-2011
Let's analyze the behavior of the code if we use $1 directly in the grep part:
Code:
perl -nase 'BEGIN{open I, "$file";@I=<I>}{$F[0]=~s/.*\.//;/.*?\b(\d+)\b/;print grep {/$F[0]/&&/$1/} @I}' -- -file=$i $LIST

When the grep code is executed, the $1 in /$1/ has already been changed by the /$F[0]/ regex that runs just before it (they are two separate regular expressions, each populating and replacing the regex-related variables). This is why it is so important to save the contents of those variables immediately after a regex match ($x=$1 in the original code).
PS: LIST=$1 and the $1 in the Perl code are two separate variables. The first is the shell's positional parameter, which is not visible from within the Perl code; keeping the Perl code inside single quotes stops the shell from expanding $1 there.

Last edited by bartus11; 03-31-2011 at 08:50 AM..
# 105  
Old 03-31-2011
Thank you, Master!

---------- Post updated at 10:49 AM ---------- Previous update was at 07:13 AM ----------

Hi Bartus,
Another question about file manipulation, this time with a different file type. A sample file and the expected output are below. Basically I want to extract $F[0], $F[1], $F[4] and $F[5], but the requirements for each field are different. The output should be:
$F[0]
$F[1] as beginning-end, covering the run of lines where $F[0] stays the same
$F[4] joined into horizontal lines of 100 characters each, separated by "\n"
$F[5] joined into horizontal lines of 100 characters each, separated by "\n"
Sample file:
Code:
SK1.chr10    3006    02    02    G    G    1.000000    h4,h10,h2,h21,h22,m6    3    3    3    0    15    0    0    -1    
SK1.chr10    3007    22    22    A    A    1.000000    h4,h10,h21,h22,m6    4    4    4    0    8    0    0    -1    
SK1.chr10    3008    21    21    G    G    0.000000    h4,h10,h21,h22,     7    7    7    0    8    0    0    -1    
SK1.chr10    3009    10    10    T    T    0.000000    h4,h10,h21,h22,     11    11    11    0    15    0    0    -1    
SK1.chr10    3010    01    01    T    T    0.000000    h4,h10,h21,h22,     14    14    14    0    16    0    0    -1
SK1.chr09    455566    31    31    T    T    0.000000    h4,h10,h9,h21,h22,     11    8    8    0    10    0    0    -1    
SK1.chr09    455567    13    13    G    G    0.000000    h4,h10,h9,h15,h21,h22,     11    8    8    0    10    0    0    -1

Expected output:
Code:
SK1.chr10
3006-3010
Ref:
GAGTT

Gen:
GAGTT

SK1.chr09
455566-455567
Ref:
TG

Gen:
TG

I have no idea how to go about it. Could you please give me some insight into how to deal with this?

Cheers, and have a nice evening
 