Help with generating a script

09-14-2011

Not sure what that "invalid option" error is. What's your Unix/Linux system and version of awk? In short, what's the output of the following commands?

Code:

uname -a
uname --all
awk
awk --version

You may want to try this script:

Code:

echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  awk -vNAME=$GENE '$13 ~ NAME { print FILENAME": "$0 }' $file >> $OUT
done

Assuming "file1.txt" and "file2.txt" are tab-delimited files in the current directory, a run of this script looks like this:

Code:

$
$
$ cat file1.txt
chr_name        chr_start       chr_end ref_base        alt_base        hom_het snp_quality     tot_depth       alt_depth       dbSNP   dbSNP131     freqlenomes
chr01   14907   14907   A       G       het     108     52      39      snp131  rs6682375       ncRNA   WASH7P  .       .       rs6682375       .    .
chr01   14930   14930   A       G       het     148     62      44      snp131  rs6682385       ncRNA   WASH7P  .       .       rs6682385       1000g0.71nov_all
chr01   761752  761752  C       T       hom     225     69      69      snp131  rs1057213       ncRNA   NCRNA00115      .       .       rs1057213    0.5442010nov_all
chr01   761800  761800  A       T       hom     42      11      11      snp131  rs1064272       ncRNA   NCRNA00115      .       .       rs1064272    0.1142010nov_all
$
$
$ cat file2.txt
chr_name        chr_start       chr_end ref_base        alt_base        hom_het snp_quality     tot_depth       alt_depth       dbSNP   dbSNP131     freqlenomes
chr01   17556   17556   C       T       het     43      30      9       .       .       ncRNA   WASH7P  .       .       .       .       .
chr01   69511   69511   A       G       hom     225     106     106     snp131  rs2691305       exonic  OR4F5   nonsynonymous   SNV     "OR4F5:NM_0010.7892010nov_all421G:p.T141A,"
chr01   761732  761732  C       T       hom     225     103     102     snp131  rs2286139       ncRNA   NCRNA00115      .       .       rs2286139    0.5372010nov_all
$
$
$ cat search.sh
echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  awk -vNAME=$GENE '$13 ~ NAME { print FILENAME": "$0 }' $file >> $OUT
done
$
$
$ # Now run the script
$
$ . search.sh
Please input gene name you wish to look for
WASH7P
The output file is WASH7P20110914.txt
$
$ # Display the content of the output file
$
$ cat WASH7P20110914.txt
file1.txt: chr01        14907   14907   A       G       het     108     52      39      snp131  rs6682375       ncRNA   WASH7P  .       .       rs668.375
file1.txt: chr01        14930   14930   A       G       het     148     62      44      snp131  rs6682385       ncRNA   WASH7P  .       .       rs6680.71g2010nov_all
file2.txt: chr01        17556   17556   C       T       het     43      30      9       .       .       ncRNA   WASH7P  .       .       .       .    .
$
$
$

Or you could try the following script, which uses Perl:

Code:

echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  perl -F"\t" -lane "print \"\$ARGV:\$_\" if \$F[12] eq $GENE" $file >> $OUT
done

The execution of the script:

Code:

$
$
$ rm WASH7P20110914.txt
$
$ # Display the script content
$
$ cat search1.sh
echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  perl -F"\t" -lane "print \"\$ARGV:\$_\" if \$F[12] eq $GENE" $file >> $OUT
done
$
$
$ # Now run the script
$
$ . search1.sh
Please input gene name you wish to look for
WASH7P
The output file is WASH7P20110914.txt
$
$
$ # Display the content of the output file
$
$ cat WASH7P20110914.txt
file1.txt:chr01 14907   14907   A       G       het     108     52      39      snp131  rs6682375       ncRNA   WASH7P  .       .       rs6682375    .
file1.txt:chr01 14930   14930   A       G       het     148     62      44      snp131  rs6682385       ncRNA   WASH7P  .       .       rs6682385    0.71g2010nov_all
file2.txt:chr01 17556   17556   C       T       het     43      30      9       .       .       ncRNA   WASH7P  .       .       .       .       .
$
$
$

tyler_durden

durden_tyler

09-14-2011

I don't believe anyone has addressed this point:

Quote:

Originally Posted by kellywilliams

I have many files that are all currently in .xslx and I'm not sure if they need to be .csv or .txt for this to work... Each of these files has ~90,000 lines.

Kelly

Kelly, to use durden_tyler's solution you do need to export the files as tab-delimited text files. I assume you mean Excel spreadsheet files (xlsx). An xlsx file is a zipped set of XML documents rather than plain text, so tools like awk and grep cannot search it directly.

Andrew

apmcd47

09-14-2011

To durden_tyler

Hi Tyler_Durden,

Thank you for your help. Unfortunately, the script is still not working. I have tried it on the two computers in my laboratory running Linux. Here is the output of the commands you suggested from computer 1 (via Terminal on a MacBook Pro):

Code:

$ uname -a
Darwin anzac-172-16-75-136.anzac.edu.au 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
$ uname --all
uname: illegal option -- -
usage: uname [-amnprsv]
$ awk
usage: awk [-F fs] [-v var=value] [-f progfile | 'prog'] [file ...]
$ awk --version
awk version 20070501

And computer2 (running RedHat):

Code:

$ uname -a
Linux neuro.anzac.edu.au 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
$ uname --all
Linux neuro.anzac.edu.au 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
$ awk
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options:
       -f progfile             --file=progfile
       -F fs                   --field-separator=fs
       -v var=val              --assign=var=val
       -m[fr] val
       -W compat               --compat
       -W copyleft             --copyleft
       -W copyright            --copyright
       -W dump-variables[=file]        --dump-variables[=file]
       -W exec=file            --exec=file
       -W gen-po               --gen-po
       -W help                 --help
       -W lint[=fatal]         --lint[=fatal]
       -W lint-old             --lint-old
       -W non-decimal-data     --non-decimal-data
       -W profile[=file]       --profile[=file]
       -W posix                --posix
       -W re-interval          --re-interval
       -W source=program-text  --source=program-text
       -W traditional          --traditional
       -W usage                --usage
       -W version              --version

To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.

gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.

Examples:
       gawk '{ sum += $1 }; END { print sum }' file
       gawk -F: '{ print $1 }' /etc/passwd
$ awk --version
GNU Awk 3.1.5
Copyright (C) 1989, 1991-2005 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.

So when I ran your first script on computer 1:

Quote:

echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
awk -vNAME=$GENE '$13 ~ NAME { print FILENAME": "$0 }' $file >> $OUT
done

I got the following output again:

Code:

$ ./3SNPs_in_gene.sh 
Please input gene name you wish to look for
WASH7P
The output file is WASH7P20110915.txt
awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

awk: invalid -v option

$

And when I ran script 2 on computer 1:

Code:

echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  perl -F"\t" -lane "print \"\$ARGV:\$_\" if \$F[12] eq $GENE" $file >> $OUT
done

The WASH7P20110915.txt file was empty.

Similarly, when I ran both scripts on computer 2, the WASH7P20110915.txt file was empty.

If you could help that would be great - thank you so much already for your help. Also, when the script is looking in *.txt, will that include looking in the $OUT.txt file?

Kelly

kellywilliams

09-14-2011

There should be a space between "-v" and NAME=$var.

You should also be quoting it so it doesn't split on spaces.

So:

Code:

awk -v NAME="${VAR}"
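For reference, here is a minimal sketch of the loop with that change applied. It also touches on the earlier question about *.txt: the shell expands the glob once, before the loop starts, so a brand-new output file is not searched, but an output file left over from an earlier run would be. The sketch skips the output file explicitly. The function name search_gene, the file names and the sample rows are made up, and the interactive prompt is left out so the sketch can run unattended.

```shell
#!/bin/sh
# Hypothetical helper: append "file: line" to $2 for every row in ./*.txt
# whose 13th whitespace-separated field matches the regex in $1.
search_gene() {
  gene=$1
  out=$2
  for file in *.txt; do
    [ -e "$file" ] || continue          # glob matched nothing
    [ "$file" = "$out" ] && continue    # never search the output file itself
    awk -v NAME="$gene" '$13 ~ NAME { print FILENAME ": " $0 }' "$file" >> "$out"
  done
}

# Demo on made-up data in a scratch directory.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'a b c d e f g h i j k l WASH7P x\n' > file1.txt
printf 'a b c d e f g h i j k l OR4F5 x\n'  > file2.txt
: > result.txt   # pretend an older (empty) output file is already present
search_gene WASH7P result.txt
cat result.txt
```

Writing the output to a different directory, or giving it an extension other than .txt, would avoid the question altogether.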

Corona688

09-14-2011

Code:

awk -v pattern="WASH7P" '$13 ~ pattern {print FILENAME":"$0}' file1.txt file2.txt > out.txt

If you are using Solaris, use nawk.

--ahamed

---------- Post updated at 03:50 PM ---------- Previous update was at 03:41 PM ----------

or

Code:

grep WASH7P file1.txt file2.txt >> out.txt
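Note that the two commands are not equivalent: grep matches the name anywhere on the line, and "~" in awk is a regular-expression match, so it also accepts longer names that contain the pattern. If an exact match on column 13 of tab-delimited data is what is wanted, something like the sketch below may be closer. The function name exact_gene and the sample rows (including the name SOX13X) are invented for the demo.

```shell
#!/bin/sh
# Hypothetical helper: exact match on the 13th tab-separated field.
exact_gene() {
  gene=$1
  shift
  awk -F '\t' -v pattern="$gene" '$13 == pattern { print FILENAME ":" $0 }' "$@"
}

# Made-up rows: the first has SOX13 in column 13, the second has a longer
# name there, and the third mentions SOX13 only in a later column.
work=$(mktemp -d)
{
  printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\tSOX13\tnote\n'
  printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\tSOX13X\tnote\n'
  printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\tOTHER\tnear SOX13\n'
} > "$work/sample.txt"

grep_count=$(grep -c SOX13 "$work/sample.txt")
exact_count=$(exact_gene SOX13 "$work/sample.txt" | wc -l | tr -d ' ')
echo "grep: $grep_count lines, exact match: $exact_count line(s)"
```

Setting the field separator to a tab also keeps the column numbers stable when a field is empty or contains spaces.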

--ahamed

ahamed101

09-14-2011

Thank you, Corona688. That fixed the "invalid -v option" error, but the output file is still empty.

This is what I am using now:

Code:

echo Please input gene name you wish to look for
read GENE
OUT="$GENE$(date '+%Y%m%d').txt"
echo "The output file is $OUT"
for file in *.txt
do
  awk -v NAME="$GENE" '$13 ~ NAME { print FILENAME": "$0 }' $file >> $OUT
done

I have attached 2 example files. A GENE that is in both (thus will give an output) is SOX13.

Many thanks,

Kelly

file2.txt (1.56 MB)

file1.txt (1.66 MB)

kellywilliams

09-14-2011

I think there is an issue with the file encoding:

Code:

# Your original file is reported as UTF-16, and a plain grep for SOX found nothing in it
root@bt:/tmp# file file1.txt 
file1.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

# Then I opened it in gedit and saved it again with Character Encoding set to "Current Locale UTF-8"; after that the search worked.
root@bt:/tmp# gedit file1.txt 
root@bt:/tmp# file file1.txt 
file1.txt: ASCII text, with very long lines, with CRLF line terminators

file2.txt seems to contain just a single line?

--ahamed

ahamed101
