I was running around 200 jobs for a BLASTP search on a cluster. All my input files were protein FASTA files (prot.fna.1, prot.fna.2 ... prot.fna.200). The output of each individual Slurm job is written to a corresponding file ending in .test (prot.fna.1.test, prot.fna.2.test ... prot.fna.200.test) in the same directory. Unfortunately, these jobs were cancelled because they hit the time limit on the node. Now I want to extract the remaining (unprocessed) sequences from each protein FASTA file so that I can run them again and concatenate all the results. Here is what I am doing:
1. I look for the first string of one *.test file with this command:
(this script prints me the "pattern")
2. All the sequences after this matching pattern in the corresponding FASTA input (prot.fna.1) are printed with this command:
Repeating this for 200 files one by one is time-consuming. I want to run this script on all the files at once, but I can't work out how, so I am writing to ask if you can help me implement this. Here is what I am doing with these scripts:
Doing this by hand is time-consuming; I would be very grateful if you could help me do it with one script.
Thanks in advance
Cordially
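Assuming the *.test files are tabular BLAST output with the query ID in the first column (adjust the `last=` line if your output format differs), one loop can handle all 200 pairs. A minimal, self-contained sketch with tiny stand-in data (the IDs A, B, C are hypothetical):

```shell
#!/bin/sh
# Stand-in for one of the 200 real input/output pairs (hypothetical data).
cd "$(mktemp -d)"
printf '>A x\nMAAA\n>B y\nMBBB\n>C z\nMCCC\n' > prot.fna.1
printf 'A\tsubj\t90\nB\tsubj\t80\n'           > prot.fna.1.test

# For every BLAST output prot.fna.N.test, take the last query ID that
# appears in it, then copy every FASTA record *after* that query from
# the matching input prot.fna.N into prot.fna.N.rest for resubmission.
for out in prot.fna.*.test; do
    [ -e "$out" ] || continue               # skip if nothing matches
    in=${out%.test}                         # prot.fna.N
    last=$(awk 'END { print $1 }' "$out")   # last query that was processed
    awk -v id="$last" '
        /^>/ { if (seen) printing = 1            # first header after the match
               if (substr($1, 2) == id) seen = 1 }
        printing                                 # print the remaining records
    ' "$in" > "$in.rest"
done
cat prot.fna.1.rest    # only the ">C" record is left to rerun
```

Note that the last query in each .test file may itself have been interrupted mid-search; if you would rather re-run it too, set `printing = 1` on the matching header instead of on the one after it.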
Moderator's Comments:
Please add code tags!
Last edited by MadeInGermany; 06-04-2018 at 05:22 AM..
Reason: Added code tags, removed font tags
I understand you want to automate a task over 200 files. Programming/scripting exists for exactly this, and I'm pretty sure your request can be fulfilled elegantly and quickly. Unfortunately, I (at least) don't really understand what you're after. Please rephrase your request and supply representative sample data.
Thank you for your quick response and help. I am a beginner on Linux. I will try to run this for-loop script with my data to see if it works.
I am sorry for my English.
Cordially
---------- Post updated at 09:09 PM ---------- Previous update was at 06:20 PM ----------
Dear moderator
This bash script generates a file for each command line; however, all the files are empty. Is there any way I can improve it? I am sorry if my question seems trivial; I am just starting with Linux programming.
Dear folks
I have two data sets whose names are "final.map" and "1.geno"; they have the following structure:
final.map:
gi|358485511|ref|NC_006088.3| 2044
gi|358485511|ref|NC_006088.3| 2048
gi|358485511|ref|NC_006088.3| 2187
gi|358485511|ref|NC_006088.3| 17654
... (2 Replies)
How can I run one script on multiple files and print out multiple files?
For example:
I want to run script.pl on 100 files named 1.txt ... 100.txt in the same directory and print out the corresponding files 1.gff ... 100.gff. Thanks. (4 Replies)
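A plain shell loop over the numbered inputs covers this; the sketch below substitutes a `tr` command for the real script.pl (which is not shown here), so swap in `perl script.pl "$f" > "$n.gff"` for your actual case:

```shell
#!/bin/sh
cd "$(mktemp -d)"
printf 'gene a\n' > 1.txt      # stand-in inputs (hypothetical contents)
printf 'gene b\n' > 2.txt

# Run the same command once per numbered input, writing a matching .gff:
for f in [0-9]*.txt; do
    n=${f%.txt}                          # strip the extension: 1, 2, ...
    tr 'a-z' 'A-Z' < "$f" > "$n.gff"     # replace with: perl script.pl "$f" > "$n.gff"
done
ls
```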
How can I run the following command on multiple files and print out the corresponding output files?
perl script.pl genome.gff 1.txt > 1.gff
However, there are multiple input files: 1.txt through 100.txt.
Thank you so much.
No duplicate posting! Continue here. (0 Replies)
Hi everyone,
I'm new to the forums, as you can probably tell... I'm also pretty new to scripting and writing any type of code.
I need to know exactly how I can grep for multiple strings in files located in one directory, with each string's matches written to a separate output file.
So I'd... (19 Replies)
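Looping over the search strings (rather than over the files) gives one output file per string; grep itself prefixes each match with the filename when it is given several files. A sketch with made-up sample logs (the strings ERROR/WARN and the *.log names are stand-ins):

```shell
#!/bin/sh
cd "$(mktemp -d)"
printf 'ERROR disk\nok\n'      > a.log   # stand-in files
printf 'WARN net\nERROR cpu\n' > b.log

# One output file per search string, containing every matching line
# from every file, prefixed with its filename by grep:
for s in ERROR WARN; do
    grep "$s" *.log > "$s.matches"
done
cat ERROR.matches
```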
hi,
I have a directory "test" containing three files: a.txt, b.txt and c.txt.
I need to rename those files to a.pl, b.pl and c.pl respectively.
Is it possible to achieve this with a sed or awk one-liner?
I have searched, but most of what I found are full scripts.
I need to do this in a one-liner.
I... (2 Replies)
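Worth noting: sed and awk edit file *contents*, not file names, so the idiomatic one-liner here is a shell loop with parameter expansion:

```shell
#!/bin/sh
cd "$(mktemp -d)"
touch a.txt b.txt c.txt     # stand-in files

# ${f%.txt} strips the .txt suffix; mv does the rename.
for f in *.txt; do mv "$f" "${f%.txt}.pl"; done
ls
```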
Hi,
I'd like to process multiple files. For example:
file1.txt
file2.txt
file3.txt
Each file contains several lines of data. I want to extract a piece of data and output it to a new file.
file1.txt ----> newfile1.txt
file2.txt ----> newfile2.txt
file3.txt ----> newfile3.txt
Here is... (3 Replies)
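The same loop-per-file pattern applies; what changes is the extraction step. In this sketch the "piece of data" is hypothetically the second whitespace-separated field (swap in whatever awk/grep/cut expression matches your real data):

```shell
#!/bin/sh
cd "$(mktemp -d)"
printf 'id=1 val=10\n' > file1.txt     # stand-in inputs
printf 'id=2 val=20\n' > file2.txt

# Extract the same piece of data from every input, writing new<name>.txt:
for f in file*.txt; do
    awk '{ print $2 }' "$f" > "new$f"   # hypothetical extraction: field 2
done
cat newfile1.txt
```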
Hi guys,
say I have a few files in a directory (58 text files or something),
each one containing multiple strings that I wish to replace with other strings.
So in these 58 files I'm looking for, say, the following strings:
JAM (replace with BUTTER)
BREAD (replace with CRACKER)
SCOOP (replace... (19 Replies)
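One sed pass per file can apply all the substitutions at once. The SCOOP replacement is cut off in the post, so SPOON below is purely a stand-in; also note that in-place editing (`sed -i`) is a GNU extension, so the portable form writes to a temporary file and moves it back:

```shell
#!/bin/sh
cd "$(mktemp -d)"
printf 'JAM on BREAD with a SCOOP\n' > recipe1.txt   # stand-in files
printf 'more JAM\n'                  > recipe2.txt

# All three substitutions in one sed invocation, applied to every file;
# SPOON is a hypothetical replacement for the truncated third rule.
for f in *.txt; do
    sed -e 's/JAM/BUTTER/g' -e 's/BREAD/CRACKER/g' -e 's/SCOOP/SPOON/g' \
        "$f" > "$f.new" && mv "$f.new" "$f"
done
cat recipe1.txt
```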
Hi,
I need help splitting lines from a file into multiple files.
My input looks like this:
13
23 45 45 6 7
33 44 55 66 7
13
34 5 6 7 87
45 7 8 8 9
13
44 55 66 77 8
44 66 88 99 6
I want every 3 lines from this file to be written to an individual file. (3 Replies)
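`split -l 3 file` already does exactly this; if you want control over the output names, a one-line awk sketch (the `chunk.` prefix is a stand-in):

```shell
#!/bin/sh
cd "$(mktemp -d)"
# The 9-line sample input from the post:
printf '13\n23 45 45 6 7\n33 44 55 66 7\n13\n34 5 6 7 87\n45 7 8 8 9\n13\n44 55 66 77 8\n44 66 88 99 6\n' > input

# Every group of 3 lines goes to its own file: chunk.1, chunk.2, chunk.3
awk '{ n = int((NR - 1) / 3) + 1; print > ("chunk." n) }' input
cat chunk.2
```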
I'm trying something like this, but it is not working.
It worked for bash files.
Now I want something like that, but with multiple input files, redirecting their outputs as inputs to the next command, as below.
Could you guys please help me with this?
#!/usr/bin/awk -f
BEGIN {
}
script1a.awk... (2 Replies)
Hello
when I try to run rm on multiple files, I have a problem deleting files with spaces in their names.
I have this command :
find . -name "*.cmd" | xargs \rm -f
it does the job fine, but when it comes across files with spaces, like "my foo file.cmd",
it refuses to delete them.
Why? (1 Reply)
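Because xargs splits its input on whitespace by default, "my foo file.cmd" arrives at rm as three separate arguments (the `\rm` in the command only bypasses a shell alias; it does not affect the splitting). NUL-delimiting the pipeline avoids this; `-print0`/`-0` are GNU/BSD extensions, and the fully portable alternative is `find ... -exec rm -f {} +`:

```shell
#!/bin/sh
cd "$(mktemp -d)"
touch 'my foo file.cmd' plain.cmd   # stand-in files, one with a space

# NUL-terminated names survive the pipe intact, spaces and all:
find . -name '*.cmd' -print0 | xargs -0 rm -f
ls
```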