Fetch entries with specific pattern


 
# 1  
01-21-2013

Hi all,

I have the following sample input, which is part of a much larger file:

Code:
ID   AINX_HUMAN              Reviewed;         499 AA.
AC   Q16352; B1AQK0; Q9BRC5;
DT   30-MAY-2000, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2002, sequence version 2.
DT   28-NOV-2012, entry version 123.
DE   RecName: Full=Alpha-internexin;
DE            Short=Alpha-Inx;
DE   AltName: Full=66 kDa neurofilament protein;
DE            Short=NF-66;
DE            Short=Neurofilament-66;
DE   AltName: Full=Neurofilament 5;
GN   Name=INA; Synonyms=NEF5;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA], AND VARIANT SER-92.
RC   TISSUE=Fetal brain;
RX   MEDLINE=95287809; PubMed=7769995; 
RA   Chan S.-O., Chiu F.-C.;
RT   "Cloning and developmental expression of human 66 kd neurofilament
RT   protein.";
RL   Brain Res. Mol. Brain Res. 29:177-184(1995).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=15164054; DOI=10.1038/nature02462;
RA   Deloukas P., Earthrowl M.E., Grafham D.V., Rubenfield M., French L.,
RA   Steward C.A., Sims S.K., Jones M.C., Searle S., Scott C., Howe K.,
RA   Hunt S.E., Andrews T.D., Gilbert J.G.R., Swarbreck D., Ashurst J.L.,
RA   Taylor A., Battles J., Bird C.P., Ainscough R., Almeida J.P.,
RA   Ashwell R.I.S., Ambrose K.D., Babbage A.K., Bagguley C.L., Bailey J.,
RA   Banerjee R., Bates K., Beasley H., Bray-Allen S., Brown A.J.,
RA   Brown J.Y., Burford D.C., Burrill W., Burton J., Cahill P., Camire D.,
RA   Carter N.P., Chapman J.C., Clark S.Y., Clarke G., Clee C.M., Clegg S.,
RA   Corby N., Coulson A., Dhami P., Dutta I., Dunn M., Faulkner L.,
RA   Frankish A., Frankland J.A., Garner P., Garnett J., Gribble S.,
RA   Griffiths C., Grocock R., Gustafson E., Hammond S., Harley J.L.,
RA   Hart E., Heath P.D., Ho T.P., Hopkins B., Horne J., Howden P.J.,
RA   Huckle E., Hynds C., Johnson C., Johnson D., Kana A., Kay M.,
RA   Kimberley A.M., Kershaw J.K., Kokkinaki M., Laird G.K., Lawlor S.,
RA   Lee H.M., Leongamornlert D.A., Laird G., Lloyd C., Lloyd D.M.,
RA   Loveland J., Lovell J., McLaren S., McLay K.E., McMurray A.,
RA   Mashreghi-Mohammadi M., Matthews L., Milne S., Nickerson T.,
RA   Nguyen M., Overton-Larty E., Palmer S.A., Pearce A.V., Peck A.I.,
RA   Pelan S., Phillimore B., Porter K., Rice C.M., Rogosin A., Ross M.T.,
RA   Sarafidou T., Sehra H.K., Shownkeen R., Skuce C.D., Smith M.,
RA   Standring L., Sycamore N., Tester J., Thorpe A., Torcasso W.,
RA   Tracey A., Tromans A., Tsolas J., Wall M., Walsh J., Wang H.,
RA   Weinstock K., West A.P., Willey D.L., Whitehead S.L., Wilming L.,
RA   Wray P.W., Young L., Chen Y., Lovering R.C., Moschonas N.K.,
RA   Siebert R., Fechtel K., Bentley D., Durbin R.M., Hubbard T.,
RA   Doucette-Stamm L., Beck S., Smith D.R., Rogers J.;
RT   "The DNA sequence and comparative analysis of human chromosome 10.";
RL   Nature 429:375-381(2004).
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RA   Mural R.J., Istrail S., Sutton G.G., Florea L., Halpern A.L.,
RA   Mobarry C.M., Lippert R., Walenz B., Shatkay H., Dew I., Miller J.R.,
RA   Flanigan M.J., Edwards N.J., Bolanos R., Fasulo D., Halldorsson B.V.,
RA   Hannenhalli S., Turner R., Yooseph S., Lu F., Nusskern D.R.,
RA   Shue B.C., Zheng X.H., Zhong F., Delcher A.L., Huson D.H.,
RA   Kravitz S.A., Mouchard L., Reinert K., Remington K.A., Clark A.G.,
RA   Waterman M.S., Eichler E.E., Adams M.D., Hunkapiller M.W., Myers E.W.,
RA   Venter J.C.;
RL   Submitted (SEP-2005) to the EMBL/GenBank/DDBJ databases.
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Brain;
RX   PubMed=15489334; 
RG   The MGC Project Team;
RT   "The status, quality, and expansion of the NIH full-length cDNA
RT   project: the Mammalian Gene Collection (MGC).";
RL   Genome Res. 14:2121-2127(2004).
RN   [5]
RP   PROTEIN SEQUENCE OF 46-83; 105-111; 121-130; 139-145; 216-228;
RP   279-288; 323-330; 339-367; 378-397 AND 407-430, AND MASS SPECTROMETRY.
RC   TISSUE=Brain, Cajal-Retzius cell, and Fetal brain cortex;
RA   Lubec G., Afjehi-Sadat L., Chen W.-Q., Sun Y.;
RL   Submitted (DEC-2008) to UniProtKB.
RN   [6]
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-496, AND MASS
RP   SPECTROMETRY.
RC   TISSUE=Embryonic kidney;
RX   PubMed=17525332; 
RA   Matsuoka S., Ballif B.A., Smogorzewska A., McDonald E.R. III,
RA   Hurov K.E., Luo J., Bakalarski C.E., Zhao Z., Solimini N.,
RA   Lerenthal Y., Shiloh Y., Gygi S.P., Elledge S.J.;
RT   "ATM and ATR substrate analysis reveals extensive protein networks
RT   responsive to DNA damage.";
RL   Science 316:1160-1166(2007).
RN   [7]
RP   ACETYLATION [LARGE SCALE ANALYSIS] AT LYS-290, AND MASS SPECTROMETRY.
RX   PubMed=19608861; DOI=10.1126/science.1175371;
RA   Choudhary C., Kumar C., Gnad F., Nielsen M.L., Rehman M., Walther T.,
RA   Olsen J.V., Mann M.;
RT   "Lysine acetylation targets protein complexes and co-regulates major
RT   cellular functions.";
RL   Science 325:834-840(2009).
RN   [8]
RP   VARIANT [LARGE SCALE ANALYSIS] GLN-110.
RX   PubMed=16959974; DOI=10.1126/science.1133427;
RA   Sjoeblom T., Jones S., Wood L.D., Parsons D.W., Lin J., Barber T.D.,
RA   Mandelker D., Leary R.J., Ptak J., Silliman N., Szabo S.,
RA   Buckhaults P., Farrell C., Meeh P., Markowitz S.D., Willis J.,
RA   Dawson D., Willson J.K.V., Gazdar A.F., Hartigan J., Wu L., Liu C.,
RA   Parmigiani G., Park B.H., Bachman K.E., Papadopoulos N.,
RA   Vogelstein B., Kinzler K.W., Velculescu V.E.;
RT   "The consensus coding sequences of human breast and colorectal
RT   cancers.";
RL   Science 314:268-274(2006).
CC   -!- FUNCTION: Class-IV neuronal intermediate filament that is able to
CC       self-assemble. It is involved in the morphogenesis of neurons. It
CC       may form an independent structural network without the involvement
CC       of other neurofilaments or it may cooperate with NF-L to form the
CC       filamentous backbone to which NF-M and NF-H attach to form the
CC       cross-bridges.
CC   -!- TISSUE SPECIFICITY: Found predominantly in adult CNS.
CC   -!- DEVELOPMENTAL STAGE: Expressed in brain as early as the 16th week
CC       of gestation, and increased rapidly and reached a steady state
CC       level by the 18th week of gestation.
CC   -!- PTM: O-glycosylated (By similarity).
CC   -!- PTM: Phosphorylated upon DNA damage, probably by ATM or ATR.
CC   -!- SIMILARITY: Belongs to the intermediate filament family.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see 
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; S78296; AAB34482.1; -; mRNA.
DR   EMBL; AL591408; CAI16744.1; -; Genomic_DNA.
DR   EMBL; CH471066; EAW49653.1; -; Genomic_DNA.
DR   EMBL; BC006359; AAH06359.1; -; mRNA.
DR   IPI; IPI00001453; -.
DR   PIR; I52658; I52658.
DR   RefSeq; NP_116116.1; NM_032727.3.
DR   UniGene; Hs.500916; -.
DR   ProteinModelPortal; Q16352; -.
DR   SMR; Q16352; 90-241, 259-329, 333-402.
DR   IntAct; Q16352; 3.
DR   STRING; Q16352; -.
DR   PhosphoSite; Q16352; -.
DR   DMDM; 20141266; -.
DR   PaxDb; Q16352; -.
DR   PeptideAtlas; Q16352; -.
DR   PRIDE; Q16352; -.
DR   DNASU; 9118; -.
DR   Ensembl; ENST00000369849; ENSP00000358865; ENSG00000148798.
DR   GeneID; 9118; -.
DR   KEGG; hsa:9118; -.
DR   UCSC; uc001kws.3; human.
DR   CTD; 9118; -.
DR   GeneCards; GC10P105026; -.
DR   HGNC; HGNC:6057; INA.
DR   HPA; CAB002059; -.
DR   HPA; HPA008057; -.
DR   MIM; 605338; gene.
DR   neXtProt; NX_Q16352; -.
DR   PharmGKB; PA29867; -.
DR   eggNOG; NOG149366; -.
DR   HOGENOM; HOG000230977; -.
DR   HOVERGEN; HBG013015; -.
DR   InParanoid; Q16352; -.
DR   KO; K07608; -.
DR   OMA; ASSYRKV; -.
DR   OrthoDB; EOG4R5031; -.
DR   PhylomeDB; Q16352; -.
DR   GenomeRNAi; 9118; -.
DR   NextBio; 34171; -.
DR   ArrayExpress; Q16352; -.
DR   Bgee; Q16352; -.
DR   CleanEx; HS_INA; -.
DR   Genevestigator; Q16352; -.
DR   GermOnline; ENSG00000148798; Homo sapiens.
DR   GO; GO:0005883; C:neurofilament; TAS:ProtInc.
DR   GO; GO:0005200; F:structural constituent of cytoskeleton; TAS:ProtInc.
DR   GO; GO:0030154; P:cell differentiation; IEA:UniProtKB-KW.
DR   GO; GO:0007399; P:nervous system development; IEA:UniProtKB-KW.
DR   GO; GO:0060052; P:neurofilament cytoskeleton organization; IEA:Compara.
DR   GO; GO:0042246; P:tissue regeneration; IEA:Compara.
DR   InterPro; IPR016044; F.
DR   InterPro; IPR001664; IF.
DR   InterPro; IPR006821; Intermed_filament_DNA-bd.
DR   InterPro; IPR018039; Intermediate_filament_CS.
DR   PANTHER; PTHR23239; PTHR23239; 1.
DR   Pfam; PF00038; Filament; 1.
DR   Pfam; PF04732; Filament_head; 1.
DR   PROSITE; PS00226; IF; 1.
PE   1: Evidence at protein level;
KW   Acetylation; Coiled coil; Complete proteome; Developmental protein;
KW   Differentiation; Direct protein sequencing; Glycoprotein;
KW   Intermediate filament; Neurogenesis; Phosphoprotein; Polymorphism;
KW   Reference proteome.
FT   CHAIN         1    499       Alpha-internexin.
FT                                /FTId=PRO_0000063783.
FT   REGION        1     87       Head.
FT   REGION       88    408       Rod.
FT   REGION       88    129       Coil 1A.
FT   REGION      130    142       Linker 1.
FT   REGION      143    238       Coil 1B.
FT   REGION      239    262       Linker 2.
FT   REGION      263    408       Coil 2.
FT   REGION      409    499       Tail.
FT   COMPBIAS    449    454       Poly-Glu.
FT   MOD_RES      72     72       Phosphoserine (By similarity).
FT   MOD_RES     290    290       N6-acetyllysine.
FT   MOD_RES     335    335       Phosphoserine (By similarity).
FT   MOD_RES     496    496       Phosphoserine.
FT   VARIANT      92     92       T -> S (in dbSNP:rs1063455).
FT                                /FTId=VAR_049808.
FT   VARIANT     110    110       E -> Q (in a breast cancer sample;
FT                                somatic mutation).
FT                                /FTId=VAR_036369.
FT   VARIANT     149    149       D -> H (in dbSNP:rs1063456).
FT                                /FTId=VAR_033497.
FT   CONFLICT     37     41       GFRSQ -> ASVE (in Ref. 1; AAB34482).
FT   CONFLICT     67     67       R -> A (in Ref. 1; AAB34482).
FT   CONFLICT    128    132       ALRQR -> RCDT (in Ref. 1; AAB34482).
FT   CONFLICT    141    141       E -> Q (in Ref. 1; AAB34482).
FT   CONFLICT    147    152       LRDLRA -> PRHLP (in Ref. 1; AAB34482).
FT   CONFLICT    191    198       GAERALKA -> RRARLKR (in Ref. 1;
FT                                AAB34482).
FT   CONFLICT    244    244       A -> R (in Ref. 1; AAB34482).
FT   CONFLICT    263    263       S -> A (in Ref. 1; AAB34482).
FT   CONFLICT    301    301       S -> T (in Ref. 1; AAB34482).
FT   CONFLICT    310    311       EE -> DQ (in Ref. 1; AAB34482).
FT   CONFLICT    318    318       Missing (in Ref. 1; AAB34482).
SQ   SEQUENCE   499 AA;  55391 MW;  4C972764E9E68D3E CRC64;
     MSFGSEHYLC SSSSYRKVFG DGSRLSARLS GAGGAGGFRS QSLSRSNVAS SAACSSASSL
     GLGLAYRRPP ASDGLDLSQA AARTNEYKII RTNEKEQLQG LNDRFAVFIE KVHQLETQNR
     ALEAELAALR QRHAEPSRVG ELFQRELRDL RAQLEEASSA RSQALLERDG LAEEVQRLRA
     RCEEESRGRE GAERALKAQQ RDVDGATLAR LDLEKKVESL LDELAFVRQV HDEEVAELLA
     TLQASSQAAA EVDVTVAKPD LTSALREIRA QYESLAAKNL QSAEEWYKSK FANLNEQAAR
     STEAIRASRE EIHEYRRQLQ ARTIEIEGLR GANESLERQI LELEERHSAE VAGYQDSIGQ
     LENDLRNTKS EMARHLREYQ DLLNVKMALD IEIAAYRKLL EGEETRFSTS GLSISGLNPL
     PNPSYLLPPR ILSATTSKVS STGLSLKKEE EEEEASKVAS KKTSQIGESF EEILEETVIS
     TKKTEKSNIE ETTISSQKI



The expected output should contain the AC lines and the CC lines in which -!- is followed by FUNCTION. Only the FUNCTION text should be printed, stopping before the next -!- entry begins.

Code:
AC   Q16352; B1AQK0; Q9BRC5;
CC   -!- FUNCTION: Class-IV neuronal intermediate filament that is able to self-assemble. It is involved in the morphogenesis of neurons. It may form an independent structural network without the involvement of other neurofilaments or it may cooperate with NF-L to form the filamentous backbone to which NF-M and NF-H attach to form the cross-bridges.

Please suggest a suitable shell script. I tried awk and sed, but could not get them to work.
# 2  
01-21-2013
Code:
awk '/^AC/{print}/^CC/&&/-!-/&&/FUNCTION/{f=1;}/^CC/&&/-!-/&&!/FUNCTION/{f=0;}f==1{print}' filename
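
For anyone reading along, the one-liner above uses a flag: it is switched on at the FUNCTION topic, switched off at any other -!- topic, and lines are printed while it is on. Below is a minimal sketch of the same flag technique spread over several lines. The function name print_function_lines is only for illustration, and the rule that clears the flag on any line that is not an indented CC continuation is my assumption about the layout (continuation lines are taken to start with CC followed by at least four spaces). That extra rule keeps the output from running on into the copyright and DR lines if FUNCTION happens to be the last -!- topic of an entry. The lines are printed as they appear in the file, not joined.

```shell
# Hypothetical helper: print AC lines and the FUNCTION comment block unchanged.
print_function_lines() {
    awk '
        /^AC/ { print; next }                          # accession lines pass straight through
        /^CC   -!- FUNCTION:/ { f = 1; print; next }   # start of the FUNCTION topic: raise the flag
        /^CC    / && f { print; next }                 # indented continuation line while the flag is up
        { f = 0 }                                      # any other line ends the topic
    ' "$@"
}
```

It can be called as print_function_lines filename, or fed from a pipe when no file name is given.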

# 3  
01-21-2013
Code:
awk '/^AC/; /^CC.*-!-/{if(p)print p; p=x} p{sub(/^CC */,x); p=p FS $0} /^CC.*-!-.*FUNCTION/{p=$0}' file
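
The one-liner above buffers the FUNCTION text and joins its continuation lines, which matches the single-line output requested in the first post. A longer sketch of the same buffering technique is given below, under a few assumptions of mine: the names join_function_text, emit and buf are illustrative, continuation lines are taken to start with CC followed by at least four spaces, and the buffer is also flushed at end of input in case FUNCTION is the last thing in the file. It has only been reasoned through against the sample in this thread, not against a full UniProt dump.

```shell
# Hypothetical helper: print AC lines, and each FUNCTION comment joined onto one line.
join_function_text() {
    awk '
        # Print the buffered FUNCTION text, if any, and clear the buffer.
        function emit() { if (buf != "") print buf; buf = "" }

        /^CC   -!- FUNCTION:/ { emit(); buf = $0; next }   # start buffering a new FUNCTION topic
        /^CC    / && buf != "" {                           # continuation line of that topic
            sub(/^CC +/, "")                               # drop the CC prefix and its indentation
            buf = buf " " $0                               # append with a single space
            next
        }
        { emit() }                                         # any other line ends the topic
        /^AC/ { print }                                    # accession lines pass straight through
        END { emit() }                                     # flush if the input ends inside the topic
    ' "$@"
}
```

Usage is join_function_text filename. Continuation lines are always joined with a space, so a word that the database wrapped at a hyphen will come out with a space after the hyphen.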
