Extract specific contents from each line

01-01-2013

Hi all,
Happy new year!
I have a problem extracting specific information from each line of a file in Unix.
My file is a dbSNP flat file. Take two SNPs as examples:

Code:

REFSNP-DOCSUM-SET (FULL-DUMP)
CREATED ON: 2012-06-08 10:50

rs782 | human | 9606 | snp | genotype=NO | submitterlink=YES | updated 2012-05-24 15:16
ss796 | WIAF | WIAF-4053 | orient=+ | ss_pick=YES
SNP | alleles='A/G' | het=0 | se(het)=0
VAL | validated=NO | min_prob=? | max_prob=? | notwithdrawn
CTG | assembly=GRCh37.p5 | chr=22 | chr-pos=21368027 | NT_011520.12 | ctg-start=758596 | ctg-end=758596 | loctype=2 | orient=-
LOC | MGC16703 | locus_id=113691 | fxn-class=intron-variant | mrna_acc=NR_003608.1
CTG | assembly=HuRef | chr=22 | chr-pos=4636312 | NW_001838740.2 | ctg-start=101457 | ctg-end=101457 | loctype=2 | orient=+

rs783 | human | 9606 | snp | genotype=NO | submitterlink=YES | updated 2012-05-24 15:16
ss797 | WIAF | WIAF-4054 | orient=- | ss_pick=NO
ss142579 | SC | bK747E2_58940 | orient=+ | ss_pick=NO
ss11007317 | BCM_SSAHASNP | chr22.NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss13383292 | SC_SNP | NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss16932069 | CSHL-HAPMAP | CSHL-HuAA-200402.chr22.NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss19502978 | CSHL-HAPMAP | CSHL-HuDD-200402.chr22.NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss21850370 | SSAHASNP | WGSA-200403-chr22.chr22.NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss24092252 | PERLEGEN | afd0119572 | orient=+ | ss_pick=NO
ss44310901 | ABI | hCV483863 | orient=+ | ss_pick=NO
ss65824218 | KRIBB_YJKIM | KHS1 | orient=+ | ss_pick=NO
ss67995054 | ILLUMINA | HumanHap650Yv1.0_rs783 | orient=+ | ss_pick=NO
ss71559615 | ILLUMINA | HumanHap650Yv3.0_rs783 | orient=+ | ss_pick=NO
ss75334879 | ILLUMINA | ILMN_Human_1M_rs783 | orient=+ | ss_pick=NO
ss78462080 | HGSV | Cor12878_SNV_20070510.chr22_27786176 | orient=+ | ss_pick=NO
ss80220046 | HGSV | Cor18507_SNV_20070510.chr22_27786176 | orient=+ | ss_pick=NO
ss84312729 | HGSV | Cor19240_SNV_20070510.chr22_27786176 | orient=+ | ss_pick=NO
ss85606923 | HGSV | Cor19129_SNV_20070510.chr22_27786176 | orient=+ | ss_pick=NO
ss91901591 | BCMHGSC_JDW | JWB-1519492 | orient=+ | ss_pick=NO
ss96097027 | HUMANGENOME_JCVI | 1103691025572 | orient=+ | ss_pick=NO
ss103852070 | BGI | BGI_rs783 | orient=+ | ss_pick=NO
ss112600866 | 1000GENOMES | CEU.trio.12.15.2008_3797684_chr22_27791622 | orient=+ | ss_pick=NO
ss114124839 | 1000GENOMES | NA19240_2008_12_16_3433050_chr22_27791622 | orient=+ | ss_pick=NO
ss117385925 | ILLUMINA-UK | NA18507_000015702_NCBI36.1_chr22_27791622 | orient=+ | ss_pick=NO
ss119336900 | KRIBB_YJKIM | KHS1499147 | orient=+ | ss_pick=NO
ss138345568 | ENSEMBL | ENSSNP11917357 | orient=+ | ss_pick=NO
ss143589615 | ENSEMBL | ENSSNP9790017 | orient=+ | ss_pick=NO
ss157114151 | GMI | GMI_SNP_24747627 | orient=+ | ss_pick=NO
ss167823401 | COMPLETE_GENOMICS | NA07022_36_chr22_27791622 | orient=+ | ss_pick=NO
ss169076235 | COMPLETE_GENOMICS | NA19240_36_chr22_27791622 | orient=+ | ss_pick=NO
ss171908720 | COMPLETE_GENOMICS | NA20431_36_chr22_27791622 | orient=+ | ss_pick=NO
ss174572413 | ILLUMINA | Human1M-Duov3_B_rs783-127_B_R_1502386830 | orient=- | ss_pick=NO
ss204071565 | BUSHMAN | BUSHMAN-chr22-27791621 | orient=+ | ss_pick=NO
ss208816497 | BCM-HGSC-SUB | BCM_CMT_1011-3271932 | orient=+ | ss_pick=NO
ss228652888 | 1000GENOMES | pilot_1_YRI_10462571_chr22_27791622 | orient=+ | ss_pick=NO
ss238048334 | 1000GENOMES | pilot_1_CEU_7652963_chr22_27791622 | orient=+ | ss_pick=NO
ss244172327 | 1000GENOMES | pilot_1_CHB+JPT_6057404_chr22_27791622 | orient=+ | ss_pick=NO
ss283616234 | GMI | GMI_AK_SNP_7936655 | orient=+ | ss_pick=YES
ss292750126 | PJP | SNP_2256484_chr22_27791622 | orient=+ | ss_pick=NO
ss479369913 | ILLUMINA | HumanOmni2.5-4v1_D_kgp10584265-0_B_R_1817221274 | orient=- | ss_pick=NO
ss484300582 | ILLUMINA | HumanOmni2.5-4v1_B_SNP22-27791622-0_B_R_1627743334 | orient=- | ss_pick=NO
SNP | alleles='A/G' | het=0.476714 | se(het)=0.10536
VAL | validated=YES | min_prob=? | max_prob=? | notwithdrawn | byCluster,byFrequency,byOtherPop,by2Hit2Allele,byHapMap
GMAF | allele=G | count=2184 | MAF=0.391
CTG | assembly=GRCh37.p5 | chr=22 | chr-pos=29461622 | NT_011520.12 | ctg-start=8852191 | ctg-end=8852191 | loctype=2 | orient=+

I want to extract "alleles=XXX" and "allele=XX" for each SNP. Can anyone help with this?

To make things more difficult:
1) Not every SNP has both. Some SNPs have only "alleles=XXX" and no "allele=XX".
2) The "XXX" values are not just A, T, G, and C; sometimes they are "-".

thanks!

Last edited by Corona688; 01-01-2013 at 05:24 PM..

luoruicd

01-01-2013

Code:

awk -F'|' '{for(N=1; N<=NF; N++) if($N ~ /allel/) print $N }' inputfile

Corona688

01-01-2013

Thanks a lot!! I am pretty new to Unix. This is very helpful!

Also, is there a way to print the output as:
SNP alleles allele
for example:
rs782 'A/G'
rs783 A/G G
If an SNP doesn't have "allele", leave the 3rd column as " ". If an SNP has multiple "allele" entries, merge all the unique ones as "X/X/X" in the 3rd column.

thanks!

luoruicd

01-01-2013

AWK Code:

Code:

awk -F'|' ' \
  /snp/ {
     printf "\n%s", $1;
  }
  /allele/ {
     A=$2;
     gsub("allele=","",A);
     gsub("alleles=","",A);
     printf "%s", A;
  } END {
     printf "\n"
}' inputfile

OUTPUT:

Code:

rs782  'A/G'
rs783  'A/G'  G
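The doubled spaces in this output appear because `-F'|'` splits on the pipe character only, so the padding around each pipe stays inside the fields. A minimal sketch of the difference, using a made-up one-line sample in the same layout as the dbSNP lines above (the variable names `line`, `raw`, and `trimmed` are illustrative only):

```shell
# Hypothetical one-line sample in the same layout as the dbSNP lines above.
line="SNP | alleles='A/G' | het=0"

# FS is a literal pipe: the padding stays inside the field.
raw=$(printf '%s\n' "$line" | awk -F'|' '{ print "[" $2 "]" }')

# FS is a pipe plus any surrounding spaces: the field comes out trimmed.
trimmed=$(printf '%s\n' "$line" | awk -F' *[|] *' '{ print "[" $2 "]" }')

printf '%s\n%s\n' "$raw" "$trimmed"
```

The first line printed is `[ alleles='A/G' ]` and the second is `[alleles='A/G']`. The pipe is written as `[|]` because a bare `|` inside a multi-character field separator is treated as the regular-expression alternation operator.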

Yoda

01-02-2013

bipinajith's proposal works as long as all of the allele= and alleles= entries are in the 2nd field of their input lines and all of the allele= entries come after the alleles= entry. However, it doesn't put the requested "/" between allele= entries when more than one is present, and it prints an extra leading newline that wasn't requested.

The following should work as requested regardless of the order of the entries and of which fields contain them, even if multiple entries appear on the same line. It will also print multiple alleles= entries if they occur, using a comma to separate subsequent occurrences:

Code:

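awk -F ' *[|] *' '
# Print the data accumulated for one SNP and reset for the next one.
function pr() {
        if(r) printf("%s %s %s\n", r, p, s)
        p = r = s = ""
        n1 = 1          # the next non-empty line starts a new SNP
}
# An empty line ends the current SNP.
NF == 0 {
        pr()
        next
}
# First line of an SNP: field 1 is the rs ID.
n1 {
        n1 = 0
        r = $1
}
# Collect allele= values in s and alleles= values in p.
/allele/ {
        for(i = 1; i <= NF; i++) {
                if($i ~ /allele=/)
                        s = (s ? s "/" : "") substr($i, index($i, "=") + 1)
                if($i ~ /alleles=/)
                        p = (p ? p "," : "") substr($i, index($i, "=") + 1)
        }
}
END {
        pr()
}' input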

As always, if you're using a Solaris system, use /usr/xpg4/bin/awk or nawk instead of awk.
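The request also asked for only the unique allele= values to be merged, while the script above appends every allele= value it finds. One possible way to skip repeats is to remember the values already seen for the current SNP in an array. This is a sketch under assumptions, not a tested replacement: the shell function name `extract_alleles` and the array name `seen` are made up for illustration, and the rest follows the script above.

```shell
# Sketch only: extract_alleles and the array "seen" are illustrative names.
extract_alleles() {
    awk -F ' *[|] *' '
    function pr(   k) {
        if (r) printf("%s %s %s\n", r, p, s)
        p = r = s = ""
        for (k in seen) delete seen[k]   # forget values from the previous SNP
        n1 = 1
    }
    NF == 0 { pr(); next }               # an empty line ends the current SNP
    n1 { n1 = 0; r = $1 }                # first line of an SNP holds the rs ID
    /allele/ {
        for (i = 1; i <= NF; i++) {
            v = substr($i, index($i, "=") + 1)
            if ($i ~ /alleles=/)
                p = (p ? p "," : "") v
            else if ($i ~ /allele=/ && !(v in seen)) {
                seen[v] = 1              # first time this value appears
                s = (s ? s "/" : "") v
            }
        }
    }
    END { pr() }' "$@"
}
```

It would be called as `extract_alleles input`, or with the data on standard input. Like the script above, it relies on an empty line appearing before the first SNP, as in the sample file.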

Don Cragun

01-02-2013

Thanks! It works pretty well!
I am trying to understand this code, but this line puzzles me:
s = (s ? s "/" : "") substr($i, index($i, "=") + 1

You probably used a regular expression, right? Can you recommend a document or book that would help me understand this better?

luoruicd

01-02-2013

Quote:

Originally Posted by luoruicd

For a reference, I would start with the awk man page on your system and the awk man pages on this site, and then look at the Wikipedia entry for awk; it provides a good overview of the language with examples, the differences between various versions of awk, and a decent list of reference materials.

There are several regular expressions in this script, but none of them are in this statement. (And note that you missed a closing parenthesis at the end of this statement.)

Note that the variable s is intended to be a concatenation of the values of single alleles found in fields of the form allele=X, with a / separating entries if more than one is found. Also note that awk sets all uninitialized variables to an empty string (or zero, depending on context) and that the function pr() in this script resets s to an empty string whenever it is called to print the data that has been accumulated for an SNP. And, due to the if statement on the previous line, this line of code is executed only if the ith field on the current input line matches the regular expression /allele=/, which will match if and only if the field contains the string "allele=".
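As a quick check of that last point, the following sketch tests the same regular expression against two made-up fields (the shell variable name `matches` is illustrative only). A field holding alleles= is not picked up by /allele=/, because the "s" sits between "allele" and "=":

```shell
# Two hypothetical fields: only the first contains the exact string "allele=".
matches=$(printf '%s\n' "allele=G" "alleles='A/G'" |
awk '{ print $0, ($0 ~ /allele=/ ? "matches" : "does not match") }')
printf '%s\n' "$matches"
```

This prints `allele=G matches` followed by `alleles='A/G' does not match`.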

The ? : operators behave the same way in awk as they do in C, C++, and several other languages: in this case, if the expression before the ? is not an empty string, the result is the expression between the ? and the :; otherwise it is the expression after the :. The substr() function returns a substring of the string named by its first operand, starting at the offset specified by its second operand (and, since there is no third argument, continuing to the end of the string). The index() function finds the position in its first argument where the string specified by its second argument first appears. So stringing all of this together in a single statement:

Code:

s = (s ? s "/" : "") substr($i, index($i, "=") + 1)

Set the variable s to (if s is not an empty string, the concatenation of the current value of the variable s followed by a slash character, or if s is an empty string, an empty string) concatenated with the character(s) following the = sign in allele=X.

So assume you have an SNP listing in your input file that contains the following fields (all on one line or on different lines):

Code:

allele=C | allele=A | allele=G

When the first one of these is found, s will be set to "C".
When the second one is found s will be set to the concatenation of "C", "/", and "A" (i.e., "C/A").
And, when the third one is found, s will be set to "C/A/G".
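To watch this happen, the statement can be run on its own against the three example fields, printing s after each match. This is only a sketch: the field separator is the one used in the script above, and the shell variable name `trace` is illustrative.

```shell
# Feed the three example fields on one line and print s after each match.
trace=$(printf '%s\n' 'allele=C | allele=A | allele=G' |
awk -F ' *[|] *' '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /allele=/) {
            s = (s ? s "/" : "") substr($i, index($i, "=") + 1)
            print s
        }
}')
printf '%s\n' "$trace"
```

The three lines printed are `C`, `C/A`, and `C/A/G`.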

Last edited by Don Cragun; 01-02-2013 at 04:22 AM.. Reason: Add suggestion to look at awk man pages

Don Cragun
