Matching two files with special field separator


 
# 1  
Old 12-09-2013

Hello,

I have a file with such structure:

Code:
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTCCTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG

I want to use another file to extract the records that have a specific ID in the first field of the header line. That is, I want to use this file:

Code:
ENSGALG00000000011
ENSGALG00000000015

To get the final output like this:

Code:
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG

I know this code:

Code:
awk 'FNR == NR {_[$1]++} FNR < NR {if ( $1 in _ ) print $1, $0}' filetwo fileone

It compares the first fields of two files and prints the matching lines, but because of the special field separator (|) I don't know how to apply it to this example.

Thanks a lot in advance for your help.
Cheers,
# 2  
Old 12-09-2013
Another approach?

I had a similar situation once before.
1) I appended the '|' character to each line of the second file (the ID list).
2) I then used grep with the -f file option.

Is this a possible solution for you?
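
A minimal sketch of the approach described in this post, assuming the sample data from post #1 is saved as fileone and the ID list as filetwo. The scratch directory and the name patterns.txt are hypothetical. Besides appending '|', this sketch also prepends '>' so that each pattern can only match the first field of a header line. Note that grep on its own prints only the matching header lines, not the sequence lines that follow them.

```shell
# Recreate the sample data from post #1 in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  '>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125' \
  'AACTGTGTGTTTTTT' \
  '>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155' \
  'AAAAAAGGTCCTGTGTGC' \
  '>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155' \
  'AAAATGTGTGTGTGTGTGTGTG' > "$workdir/fileone"
printf '%s\n' ENSGALG00000000011 ENSGALG00000000015 > "$workdir/filetwo"

# Turn each ID into the fixed string ">ID|" (patterns.txt is a hypothetical name).
sed 's/^/>/; s/$/|/' "$workdir/filetwo" > "$workdir/patterns.txt"

# -F treats every pattern as a fixed string, so "|" is literal;
# -f reads one pattern per line from the given file.
headers=$(grep -F -f "$workdir/patterns.txt" "$workdir/fileone")
printf '%s\n' "$headers"
```

This selects the two wanted header lines; the sequences still have to be pulled out in a second step.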
# 3  
Old 12-09-2013
Quote:
Originally Posted by Homa

I think you can try the
Code:
awk -F

option to specify the field delimiter of your choice.
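
As a small illustration of what -F does here (a sketch using one header line from post #1; the variable names are only for illustration): with '|' as the separator, the first field of a header line still carries the leading '>', so it is not identical to a bare ID from the list file.

```shell
# One header line taken from the sample data in post #1.
header='>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125'

# Split on "|" and keep the first field only.
first_field=$(printf '%s\n' "$header" | awk -F'|' '{ print $1 }')
printf '%s\n' "$first_field"
```

The printed field is the gene ID with '>' in front, which is why a lookup key built from the ID list needs the same prefix.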
This User Gave Thanks to zozoo For This Post:
# 4  
Old 12-09-2013
OK, I added the
Code:
-F

option:
Code:
awk -F "|" 'FNR == NR {a[$1]++} FNR < NR {if ( $1 in a ) print $0}' filetwo fileone

and it works, but it prints only the headers and not the content, that is, the sequence of letters below each header. Sorry for this question, but how can I get around this problem?

Thanks!
# 5  
Old 12-09-2013
Code:
$ cat file1
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTCCTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG

Code:
$ cat file2
ENSGALG00000000011
ENSGALG00000000015

Code:
$ awk -F"|" 'FNR==NR{A[">"$1];next}($1 in A){print;getline;print}' file2 file1
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG

These 2 Users Gave Thanks to Akshay Hegde For This Post:
# 6  
Old 12-09-2013
Akshay has already provided a solution.
# 7  
Old 12-09-2013
Oh, thanks, but now there is another problem: in my actual file, the content under each header is longer than one line, for example:

Code:
>ENSGALG00000014675|ENSGALT00000023647|1|1603|1605
cttttccactttgctctcatcCTGCTATTGGATTTgagatgcatgtcTGTTAATATTGTA
GCCTTTGGAAATGAAAGAGATGGATTTTCTGAAGACAATCAGCAGTCAAGTCTGATCTGG
AGCTATCTAGGGAGAAGTGCTCTCATTTCAGAGACTGAAAGTGGTCTGTTGCTGAATTCT
GCCAATCACATTAGAAATCCTGTTTTTACTGAATATCAAGCCTGCGTGTTTGGAAATGTC
AGATTGGTGGTACATGACTGTCCTCTTTGGGATATATTTGACAGTGACTGGTATACTTCT
CGCAGTCTCATTGGAGGAGCTGATATTATTGTGATTaaatactctgtcaatGACAAGACT
TCATTTCAAGAATTAAAGGACAGTTATGTCCCAATGATAAAAAAAGCGTTAAACCACTGT
TCAGTTCCAGTAATAATTTCTGCTATTGGTGCAAGAAAAAATGTGCCTTGTACCTGCCCA
CTGTGCACTTCAGACAGAAGGAGCTGTGTTACTTCTTCTGAAGGAGttcagcttgctaaa
gaactaggagctacgtatcttgaattgcnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnggaatattttatgatccaaagTTTGAATCGGAAGTCATCTGAAAAA
ATGAAGAAAAGAAGAAAGACCCAGAAGTACCATCGAGTTAAACCCCCTCAGCTTGAACAA
CCAGAAAAAATGCCAATCTTAAGAGGTGAAGCCTCACATTATGACTCTGATTTACACAAG
TTGCTGTCCTGCTGCCAGTGTGTGGATGTGATATTTTACTCAGAAGACTTAAAGAAAGTA
GTAGAAGCTCACAAGATCATTTTGTGCTCTGTAAGCCATGTCTTCATGTTACTTTTCAAA
GTGAAGAGTCCAGCTGATATTCATGATTCTGCTATCATACGGACTGCGCAAAGTCTCTTT
GCAGTGAACAGTGAAGCTGTGTTTCCGTTTCCTAGCAGTGGCTCATCATGCGACCCACCA
GTAAGAGTCATTGTTAAAGACTCCATCTTCTGTTCTTGTTTGTCAGACATTCTACACTTC
ATTTATTCAGGTGCTTTCCAGTGGGAACGGTTAGAAGAAGATATAAAGAAGAAGCTAA

Using this script:
Code:
awk -F"|" 'FNR==NR{A[">"$1];next}($1 in A){print;getline;print}'

prints only the first line of each sequence. Is there a way to solve this? Thanks!
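
One possible way to handle multi-line sequences, offered as a sketch and not as an answer from the thread: instead of getline, keep a flag that is recomputed on every header line and print every line while the flag is set. The file names file1 and file2 follow post #5; the scratch directory, the array name ids, the flag name keep, and the way the sample sequences are wrapped over two lines are assumptions made for the demonstration.

```shell
# Sample data: sequences from post #1, wrapped over several lines for the demo.
workdir=$(mktemp -d)
printf '%s\n' \
  '>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125' \
  'AACTGTGTG' \
  'TTTTTT' \
  '>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155' \
  'AAAAAAGGTCC' \
  'TGTGTGC' \
  '>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155' \
  'AAAATGTGTGTGTGTGTGTGTG' > "$workdir/file1"
printf '%s\n' ENSGALG00000000011 ENSGALG00000000015 > "$workdir/file2"

result=$(awk -F'|' '
  FNR == NR { ids[">" $1]; next }   # ID list: store each ID with a leading ">"
  /^>/      { keep = ($1 in ids) }  # header line: is this record wanted?
  keep                              # print the header and every line after it while keep is set
' "$workdir/file2" "$workdir/file1")
printf '%s\n' "$result"
```

Because the flag only changes on lines starting with '>', every sequence line inherits the decision made for its header, however many lines the sequence spans. The sketch assumes the ID list is not empty.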