Run sed and awk on multiple files in a directory


 
# 1  
Old 06-04-2018

Dear Linux users

I was running about 200 jobs for a BLASTP search on a cluster. All my input files were protein FASTA files (prot.fna.1, prot.fna.2 ... prot.fna.200). The output of each individual Slurm job is written to a corresponding file ending in .test (prot.fna.1.test, prot.fna.2.test ... prot.fna.200.test) in the same directory. Unfortunately, these jobs were cancelled when they hit the time limit on the node. Now I want to extract the remaining sequences from each protein FASTA file so that I can run them again and then concatenate all the results. Here is what I am doing:
1. I take the first field of the last line of one .test file with this command:
Code:
(awk '{print $1}' prot.fna.1.test | tail -n1)

This command prints the “pattern”.
2. All the sequences after the line matching this pattern in the corresponding FASTA input (prot.fna.1) are printed using this command:
Code:
cat prot.fna.1 | sed -e '1,/pattern/ d' | sed -ne '/^>/,$ p'
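
To make the two steps concrete, here is a minimal sketch on made-up sample data. The file contents, the sequence IDs (seq_1 ... seq_4), the hit names and the output name remaining.fna are all hypothetical. It also assumes that the first column of the .test file holds the query ID exactly as it appears in the FASTA header.

```shell
#!/bin/bash
# Hypothetical sample data in a scratch directory; the real files are far larger.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny FASTA input with four records.
cat > prot.fna.1 <<'EOF'
>seq_1 first protein
MKV
>seq_2 second protein
GHT
LLA
>seq_3 third protein
PQR
>seq_4 fourth protein
WYV
EOF

# A truncated result file: the job stopped after reporting seq_2.
cat > prot.fna.1.test <<'EOF'
seq_1 hitA 98.0
seq_2 hitB 75.5
EOF

# Step 1: first field of the last result line.
pattern=$(awk '{print $1}' prot.fna.1.test | tail -n1)
echo "pattern: $pattern"

# Step 2: delete everything up to the line containing that ID,
# then print from the next header line to the end of the file.
sed -e "1,/$pattern/ d" prot.fna.1 | sed -ne '/^>/,$ p' > remaining.fna
cat remaining.fna
```

With this sample, remaining.fna holds only the seq_3 and seq_4 records. Note that sed treats the ID as a regular expression and matches it anywhere in a line, so an ID that is a prefix of another ID (seq_1 and seq_10, say) could match the wrong header.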

Repeating this for 200 files one by one is time-consuming. I would like to run these commands on all the files at once, but I have not managed to. Could you please help me implement this? Here is what I am doing with these commands:

Code:
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.11.test | tail -n1
  ERR598955.6981687_74_5_4
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.11 | sed -e '1,/ERR598955.6981687_74_5_4/ d' | sed -ne '/^>/,$ p' > first_out.fna1.11
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.12.test | tail -n1
  ERR598955.7664144_89_2_3
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.1 | sed -e '1,/ERR598955.7664144_89_2_3/ d' | sed -ne '/^>/,$ p' > first_out.fna1.12
  [dderilus@boqueron ERR598955_orfm_250_out]$ less first_out.fna1.12
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.12 | sed -e '1,/ERR598955.7664144_89_2_3/ d' | sed -ne '/^>/,$ p' > first_out.fna1.12
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.13.test | tail -n1
  ERR598955.8364684_101_2_4
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.13 | sed -e '1,/ERR598955.8364684_101_2_4/ d' | sed -ne '/^>/,$ p' > first_out.fna1.13
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.14.test | tail -n1
  ERR598955.9053411_57_6_5
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.14 | sed -e '1,/ERR598955.9053411_57_6_5/ d' | sed -ne '/^>/,$ p' > first_out.fna1.14
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.15.test | tail -n1
  ERR598955.9746341_78_3_2
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.15 | sed -e '1,/ERR598955.9746341_78_3_2/ d' | sed -ne '/^>/,$ p' > first_out.fna1.15
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.16.test | tail -n1
  ERR598955.10426164_9_3_3
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.16 | sed -e '1,/ERR598955.10426164_9_3_3/ d' | sed -ne '/^>/,$ p' > first_out.fna1.16
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.17.test | tail -n1
  ERR598955.11123991_2_2_2
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.17 | sed -e '1,/ERR598955.11123991_2_2_2/ d' | sed -ne '/^>/,$ p' > first_out.fna1.17
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.18.test | tail -n1
  ERR598955.11810206_3_6_1
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.18 | sed -e '1,/ERR598955.11810206_3_6_1/ d' | sed -ne '/^>/,$ p' > first_out.fna1.18
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.19.test | tail -n1
  ERR598955.12519405_1_4_4
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.19 | sed -e '1,/ERR598955.12519405_1_4_4/ d' | sed -ne '/^>/,$ p' > first_out.fna1.19

This is time-consuming; I would be very grateful if you could help me do this with one script.



Thanks in advance


Cordially

Moderator's Comments:
Mod Comment Please add code tags!

Last edited by MadeInGermany; 06-04-2018 at 05:22 AM.. Reason: Added code tags, removed font tags
# 2  
Old 06-04-2018
Welcome to the forum.

I understand you want to automate a task to run over 200 files. Programming / scripting is for exactly this, and I'm pretty sure your request can be fulfilled elegantly and fast. Unfortunately I (at least) don't really understand what you're after. Please rephrase your request, and supply representative sample data.
This User Gave Thanks to RudiC For This Post:
# 3  
Old 06-04-2018
Proposal: a bash script with a for loop
Code:
#!/bin/bash
fmask="ERR598955_orfm.fna"
for f in $fmask.*
do
  [ -f "$f" ] || continue # ensure this is an existing file
  ext=${f#$fmask} # strip off the leading fmask
  pattern=$(awk '{print $1; exit}' "$f") # pattern is the 1st word in the 1st line
  sed -n '1,/^'"${pattern}"'$/ d ; /^>/,$ p' "$f" > "first_out.$ext"
done

You might need to work on it...
This User Gave Thanks to MadeInGermany For This Post:
# 4  
Old 06-04-2018
Thank you for your quick response and help. I am a beginner on Linux. I will try to run this for-loop script with my data to see if it works.


I am sorry for my English


Cordially

---------- Post updated at 09:09 PM ---------- Previous update was at 06:20 PM ----------

Dear moderator


This bash script generates an output file for each input file. However, all the files are empty. Is there any way I can improve it? I am sorry if my question looks trivial. I am just starting with Linux programming.


Regards
# 5  
Old 06-05-2018
Ok, I made some mistakes, correction follows.
Code:
#!/bin/bash
fmask="ERR598955_orfm.fna"
for f in $fmask.*
do
  [ -f "$f" ] || continue # ensure this is an existing file
  ext=${f#$fmask} # strip off the leading fmask
  ext=${ext#.} # strip off a leading dot
  pattern=$(awk '{x=$1} END {print x}' "$f") # pattern is the 1st word in the last line
  newfile=first_out.$ext
  sed -n '1,/'"${pattern}"'/d; /^>/,$p' "$f" > "$newfile"
done

If it does not do what you want, please run it in debug mode:
Code:
/bin/bash -x scriptname

# 6  
Old 06-05-2018
Dear moderator


Thank you for the follow-up. Unfortunately, even though I ran it in debugging mode, I still get empty output. Here is what I did:


Code:
$ cat sov.sh 
#!/bin/bash
fmask="ERR598955_orfm.fna"
for f in $fmask.*
do
  [ -f "$f" ] || continue # ensure this is an existing file
  ext=${f#$fmask} # strip off the leading fmask
  ext=${ext#.} # strip off a leading dot
  pattern=$(awk '{x=$1} END {print x}' "$f") # pattern is the 1st word in the last line
  newfile=first_out.$ext
  sed -n '1,/^'"${pattern}"'/d; /^>/,$p' "$f" > "$newfile"
done
[dderilus@boqueron test]$ ls
ERR598955_orfm.fna.1  ERR598955_orfm.fna.1.test  ERR598955_orfm.fna.2  ERR598955_orfm.fna.2.test  sov.sh
[dderilus@boqueron test]$ /bin/bash -x sov.sh 
+ fmask=ERR598955_orfm.fna
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.1 ']'
+ ext=.1
+ ext=1
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.1
+ pattern=GGSSFMGCPSSVMSPASGYSKPAIILNSVVHPIKDDPPHKRSVNTVFQNYALFPHMTVSQNIG
+ newfile=first_out.1
+ sed -n '1,/^GGSSFMGCPSSVMSPASGYSKPAIILNSVVHPIKDDPPHKRSVNTVFQNYALFPHMTVSQNIG/d; /^>/,$p' ERR598955_orfm.fna.1
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.1.test ']'
+ ext=.1.test
+ ext=1.test
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.1.test
+ pattern=ERR598955.61408_2_2_1
+ newfile=first_out.1.test
+ sed -n '1,/^ERR598955.61408_2_2_1/d; /^>/,$p' ERR598955_orfm.fna.1.test
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.2 ']'
+ ext=.2
+ ext=2
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.2
+ pattern=LSEKKSSQNPLLFSICLIFFWTTFLILPEKAFWRV
+ newfile=first_out.2
+ sed -n '1,/^LSEKKSSQNPLLFSICLIFFWTTFLILPEKAFWRV/d; /^>/,$p' ERR598955_orfm.fna.2
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.2.test ']'
+ ext=.2.test
+ ext=2.test
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.2.test
+ pattern=ERR598955.712540_97_1_3
+ newfile=first_out.2.test
+ sed -n '1,/^ERR598955.712540_97_1_3/d; /^>/,$p' ERR598955_orfm.fna.2.test
[dderilus@boqueron test]$ ls -sh
total 317M
148M ERR598955_orfm.fna.1   16M ERR598955_orfm.fna.1.test  149M ERR598955_orfm.fna.2  5.3M ERR598955_orfm.fna.2.test     0 first_out.1     0 first_out.1.test     0 first_out.2     0 first_out.2.test  4.0K sov.sh

Moderator's Comments:
Mod Comment Please use code tags around code and listings!

Last edited by MadeInGermany; 06-05-2018 at 11:19 AM.. Reason: added code tags
# 7  
Old 06-05-2018
Quote:
Code:
+ sed -n '1,/^ERR598955.61408_2_2_1/d; /^>/,$p' ERR598955_orfm.fna.1.test

The ^ before the pattern requires the pattern to be at the very beginning of the line. Change
Code:
sed -n '1,/^'"${pattern}"'/d; /^>/,$p' "$f" > "$newfile"

to
Code:
sed -n '1,/'"${pattern}"'/d; /^>/,$p' "$f" > "$newfile"
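
A further observation from the trace in post #6: the pattern there is read from the FASTA file itself (its last line is a sequence line), and the loop also processes the .test files. Either of these alone would leave the outputs empty, regardless of the ^. Below is a hedged sketch of a loop that takes the pattern from the matching .test file and applies it to the FASTA input. The function name resume_inputs is hypothetical. The sketch swaps the sed range for an exact string comparison in awk, so the "." in the ID is not treated as a wildcard. It assumes that the ID is the first word of the header, directly after ">".

```shell
#!/bin/bash
# Sketch: for each FASTA input, read the last query ID from its ".test"
# result file and write the records that follow that ID to first_out.<n>.
# resume_inputs is a hypothetical helper name, not from the thread.
resume_inputs() {
  local fmask=$1 f result ext pattern
  for f in "$fmask".*
  do
    [ -f "$f" ] || continue               # the glob matched nothing
    case $f in *.test) continue ;; esac   # skip the result files themselves
    result="$f.test"
    [ -s "$result" ] || continue          # missing or empty result file
    ext=${f#"$fmask".}                    # e.g. "11"
    # The pattern comes from the result file, not from the FASTA file.
    pattern=$(awk '{x=$1} END {print x}' "$result")
    [ -n "$pattern" ] || continue
    awk -v id="$pattern" '
      /^>/ && found                       { printing = 1 }  # first header after the matched record
      /^>/ && !found && substr($1, 2) == id { found = 1 }   # header whose ID equals the pattern
      printing                                              # print once switched on
    ' "$f" > "first_out.$ext"
  done
}

resume_inputs "ERR598955_orfm.fna"
```

As in the original commands, the record for the last reported ID itself is skipped. If the job was cancelled while that query was still being processed, its results may be incomplete, so it may be worth checking that case by hand.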
