Getting the correct identifier in the output file


 
# 1  
Old 12-08-2009

Hi All
I have a file like this:

Code:
1
1       12      26      289     3.2e-027        GCGTATGGCGGC
2       12      26      215     6.7e+006        TTCCACCTTTTG
3       9       26      175     8.9e+016        GCGGTAACT
4       20      26      232     1.7e+013        TTTTTATTTTTTTTTTTTCC
5       7       26      161     7.2e+019        ATGCAAA
6       7       26      161     4.2e+019        CTTCAAA
7       7       26      144     7.4e+025        AGAAAAA
8       7       26      155     2.6e+021        TAGGCTG
9       9       26      148     7.3e+028        AATTTATTC
10      7       26      156     1.8e+021        TTGATTT
2
1       16      37      404     2.3e-025        AAAATTGCATGCATGC
2       12      37      351     6.1e-009        AAGAAAAAAAAA
3       9       37      328     1.5e-007        TTTGCCGCC
4       20      37      369     1.2e+001        AAAAGAGGAAAAAAAAAAAA
5       9       37      295     3.1e+007        ATGCATGTA
6       9       37      280     3.3e+014        CATTTTTTT
7       16      37      313     6.1e+015        AGAGAAAAATTAAAAA
8       11      37      288     7.5e+015        AATAATTTGAG
9       7       37      247     4.5e+023        GGAAAGG
4       20      37      369     1.2e+001        AAAAGAGGAAAAAAAAAAAA
3
1       11      36      329     6.0e-012        ATTTGCATGCA
2       7       36      277     7.0e+001        GTGGGGA
3       9       36      273     3.9e+008        CTTACATGC
4       12      36      287     7.1e+010        AAAAAAAGTAAA
5       9       36      254     1.9e+017        ATTTGGCGA
6       7       36      228     6.7e+023        TCCCTTC
7       12      36      255     2.8e+024        TAATAATTTATT
8       16      36      252     5.6e+032        TTTTAAAGAATAATCA
9       16      36      228     1.3e+042        TTTTTTCTGTATTATT
10      12      36      224     5.1e+035        CCACATAAAAAT
.
.
.
.

150
1       7       11      102     7.0e-001        CCCGCCA
2       7       11      90      2.0e+005        GCACTTT
3       12      11      108     7.0e+004        CCCCCAACAATA
4       9       11      94      3.4e+007        GATTTGGAA
5       7       11      87      1.1e+007        AAGAGCT
6       9       11      91      2.1e+009        ATTAAGTTT
7       7       11      84      7.0e+007        CTGGTCA
8       12      11      100     4.4e+009        TTTATTAATCAT
9       7       11      77      3.0e+011        ATTTATG
10      12      11      90      1.7e+013        CATTTTTTTTAC

Basically there are 150 groups, and each group is preceded by a line that holds only the group's identifier.
I need to search for patterns in column six and output every line that contains a match, along with the identifier of its group.

I tried this code (suggested by a member here):

Code:
#!/bin/bash

read pattern
while read line; do
    [ ${#line} == 1 ] && identifier="$line"
    pat=$(echo $line | grep $pattern)
    [ $? == 0 ] && echo $identifier $pat
done <your_file_here

When I searched for GCATGC using this code, the output looked like this:

Code:
1 3 9 36 281 2.0e+004 ATTGCATGC
2 4 12 50 403 1.3e+005 GCATGCAAATTT
7 8 15 9 90 7.2e+008 TGCATGCAAAAATGC
9 8 7 14 103 3.4e+008 GCATGCA
9 2 7 35 293 1.4e-004 GCATGCA
9 3 11 27 225 1.5e+006 GCATGCAAAAT
9 3 9 31 273 1.8e-004 TTGCATGCA
9 7 7 9 75 4.4e+005 TGCATGC
9 1 9 21 186 4.3e-002 TGCATGCAA
9 1 19 12 165 3.9e-005 TGGCGGGAAATGCATGCAG
9 1 20 49 538 1.4e-036 TTTAAAATTGCATGCATGCA
9 6 7 17 132 1.7e+007 GCATGCA
9 4 11 14 128 2.2e+006 TGCATGCACAC

There is a problem in the output once the identifier reaches 10 or more. As you can see, the identifier stays at 9.

Is there a way I could modify the above code to correct this problem while generating the output?
Please let me know.
LA

Last edited by pludi; 12-08-2009 at 06:55 PM.. Reason: code tags, please...
# 2  
Old 12-08-2009
Code:
 awk '/GCATGC/ {print int((NR-1)/11)+1,$0}' urfile
2 1       16      37      404     2.3e-025        AAAATTGCATGCATGC
3 1       11      36      329     6.0e-012        ATTTGCATGCA

# 3  
Old 12-08-2009
Hi
Code:
[ ${#line} == 1 ]

tests whether $line is exactly one character wide, not whether it consists of a single field, so identifiers of 10 and above are never picked up.
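To see the difference between the two tests, here is a small sketch. The helper names char_count and field_count are made up for this illustration and are not part of anyone's script in this thread.

```shell
#!/bin/bash
# Hypothetical helpers, only to contrast the two tests.

# Length of the line in characters: this is what ${#line} measures.
char_count() {
    local line=$1
    echo "${#line}"
}

# Number of whitespace-separated fields in the line.
field_count() {
    local line=$1
    set -- $line      # deliberately unquoted so the shell splits the line into fields
    echo "$#"
}

char_count "9"                    # one character: the original test succeeds
char_count "10"                   # two characters: the original test fails
field_count "10"                  # still a single field
field_count "1   12   26   289"   # a data line has several fields
```

An identifier such as 10 or 150 is still one field, so testing the field count keeps working however many digits the identifier has.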

You could use e.g.
Code:
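#!/bin/bash
pattern="$1"                      # pattern is the first command-line argument
while read -r line; do
  set -- $line                    # split the line into fields; $# is the field count
  if [[ $# -eq 1 ]]; then         # a single field means a group identifier line
    identifier=$line
  fi
  if [[ $line =~ $pattern ]]; then
    # fixed format string, so % or \ in the data cannot be misread by printf
    printf '%s\t%s\n' "$identifier" "$line"
  fi
done <infile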

-or- awk:
Code:
awk 'NF==1{id=$1}/GCATGC/{print id"\t"$0}' infile


Last edited by Scrutinizer; 12-08-2009 at 09:26 PM..
# 4  
Old 12-08-2009
I know it's not relevant to what you are asking, but I can't help but notice that those four letters, C A T G, are used to represent nucleotide bases, the building blocks of DNA.

Are you at liberty to mention what you are doing? Maybe searching for a genetic defect that causes some specific disease?
# 5  
Old 12-08-2009
Quote:
Originally Posted by KenJackson
I know it's not relevant to what you are asking, but I can't help but notice that those four letters, C A T G, are used to represent nucleotide bases, the building blocks of DNA.

Are you at liberty to mention what you are doing? Maybe searching for a genetic defect that causes some specific disease?
Wooo, that's interesting.

Happy to help with this case.
# 6  
Old 12-09-2009
Quote:
Originally Posted by Scrutinizer
Hi
Code:
[ ${#line} == 1 ]

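tests whether $line is exactly one character wide, not whether it consists of a single field, so identifiers of 10 and above are never picked up.

You could use e.g.
Code:
#!/bin/bash
pattern="$1"                      # pattern is the first command-line argument
while read -r line; do
  set -- $line                    # split the line into fields; $# is the field count
  if [[ $# -eq 1 ]]; then         # a single field means a group identifier line
    identifier=$line
  fi
  if [[ $line =~ $pattern ]]; then
    # fixed format string, so % or \ in the data cannot be misread by printf
    printf '%s\t%s\n' "$identifier" "$line"
  fi
done <infile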

this gets my vote. i must've been in zombie land when i used ${#line} (not enough coffee...)

i've also recoded the other piece of code to this, you can use whichever you prefer...

Code:
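read -r pattern                   # the pattern is typed on standard input
while read -r line; do
    # a line with exactly one field is a group identifier; any other line yields nothing here
    identifier=$(echo "$line" | awk 'NF==1{print $1}')
    # not an identifier line: keep the identifier of the current group
    [ -z "$identifier" ] && identifier=$previous
    previous=$identifier
    pat=$(echo "$line" | grep -- "$pattern")
    [ $? -eq 0 ] && echo "$identifier $pat"
done <file_data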

the problem with the pure awk solutions posted so far is that the pattern is hard coded into the script... in the previous thread the OP mentioned he might just be entering pattern fragments, so he'd have to change the code every time.
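The hard coding is not inherent to awk, though: a pattern can be handed over at run time with awk's -v option. A minimal sketch follows, assuming the input layout shown in the first post. The function name search_groups and the variable name pat are made up for this example.

```shell
#!/bin/bash
# Sketch: pass the pattern to awk at run time instead of hard coding it.
# search_groups is a hypothetical helper name.
search_groups() {
    # $1 = pattern (an awk regular expression), $2 = input file
    awk -v pat="$1" '
        NF == 1  { id = $1; next }       # single-field line: remember the group identifier
        $6 ~ pat { print id "\t" $0 }    # match against column six only
    ' "$2"
}

# Example call: search_groups GCATGC infile
```

Matching on $6 follows the original request to search column six. Use $0 ~ pat instead to search the whole line. Note that awk processes backslash escape sequences in a value assigned with -v, so a pattern containing backslashes needs them doubled.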
# 7  
Old 12-09-2009
Another way, which works whether or not every group has the 1 + 10 layout (one identifier line followed by ten data lines).

Code:
awk 'NF==1 {$0=">\n"$0}1' urfile |awk 'BEGIN{RS=">";FS="\n"} {for (i=1;i<=NF;i++) if ($i~/GATTT/) print $2,$i}'
1 10      7       26      156     1.8e+021        TTGATTT
150 4       9       11      94      3.4e+007        GATTTGGAA
