Quote:
And, of that subset that match the $4 value, we have to assume that the same subset or a smaller subset may be classified as "intron", the same subset or a smaller subset may be classified as "exon", and the same subset or a smaller subset may be classified as "splicing" when your criteria are applied. How are the subset of lines that match the $4 value supposed to be combined or selected so that only one of the possible results are returned (presumably the one possible result that is the one that you want to match from all of the ones in the subset that do match)
file2 is a very large file of genes and all of their associated coding exons. Take the
SDHB gene as an example: it is one of the ~22,000 genes in the human genome. A gene is made up of a variable number of exons, introns, and intragenic regions.
File2 only lists the coding sequence of a gene, that is, what is currently known to code for a protein product and contribute to the human "genetic makeup".
File1 is created by a script that outputs all regions in a particular gene that may need to be interrogated further. The problem is that not all of those regions, defined by the $1, $2, and $3 values, may be important to know (there is still a lot unknown about the human genome). It's complex, as @bakunin kindly described (a big thanks to your wife).
The introns and intragenic regions regulate/affect exons (both coding and non-coding) but are still largely unknown. What is known is that coding exons (file2) and splicing (defined as +/- 10) are important.
The value in $4 of file1 is looked up in file2 to return the subset of a gene to use. There may be multiple lines in each file for that gene, but the combination of $2 and $3 will define each line in file1 as intron, exon, or splicing.
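
To make the lookup concrete, here is a minimal Python sketch of one way it could work. Several things in it are assumptions and not taken from the actual files: that file2 has the same four columns as file1 (chromosome, start, end, gene), that "exon" means the region overlaps a coding exon, that "splicing" means the region comes within 10 of a coding exon without overlapping it, and that exon wins over splicing, which wins over intron, so each file1 line gets exactly one label. The coordinates are made up for illustration.

```python
from collections import defaultdict

SPLICE_PAD = 10  # the post defines splicing as +/- 10 around a coding exon


def index_exons(file2_rows):
    """Group coding exons by gene name (assumed columns: chrom, start, end, gene)."""
    exons_by_gene = defaultdict(list)
    for chrom, start, end, gene in file2_rows:
        exons_by_gene[gene].append((int(start), int(end)))
    return exons_by_gene


def classify(start, end, exons, pad=SPLICE_PAD):
    """Return exactly one label for a region, using an assumed precedence:
    exon first, then splicing, then intron."""
    # Assumed rule: any overlap with a coding exon counts as "exon".
    for ex_start, ex_end in exons:
        if start <= ex_end and end >= ex_start:
            return "exon"
    # Assumed rule: within `pad` of an exon, but not overlapping it, is "splicing".
    for ex_start, ex_end in exons:
        if start <= ex_end + pad and end >= ex_start - pad:
            return "splicing"
    # Everything else falls through to "intron".
    return "intron"


def annotate(file1_rows, exons_by_gene):
    """Look up $4 (gene) in the file2 index, then label the line from $2 and $3."""
    for chrom, start, end, gene in file1_rows:
        # A gene missing from the index has no exons here, so it is labeled "intron".
        exons = exons_by_gene.get(gene, [])
        yield chrom, start, end, gene, classify(int(start), int(end), exons)


# Made-up coordinates, for illustration only.
file2_rows = [("chr1", 100, 200, "SDHB"), ("chr1", 500, 600, "SDHB")]
file1_rows = [
    ("chr1", 150, 160, "SDHB"),  # inside the first exon
    ("chr1", 205, 209, "SDHB"),  # just past the first exon
    ("chr1", 300, 400, "SDHB"),  # far from both exons
]
labels = [row[4] for row in annotate(file1_rows, index_exons(file2_rows))]
print(labels)  # ['exon', 'splicing', 'intron']
```

The fixed precedence is what resolves the case raised in the quote: even when several file2 lines match the $4 value, each file1 line comes back with a single label.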
I hope this helps, and I apologize for the long post, but I am excited to help share knowledge and really appreciate all the help... thank you very much.