I have some files in a directory and a short list of strings. I want to loop through the files, remove every line that contains one of the strings, and then renumber the remaining lines.
There are some issues. The first is that the strings can contain troublesome characters such as single quotes and parentheses. One such list is the REMOVE_LIST array in the script below.
It is very likely that the list will contain the same string more than once. I either need to clean that up or have the script allow for instances where the string is not found.
The other complexity is that the line numbering doesn't start until the 15th line of the file.
I was thinking of something like,
Code:
#!/bin/bash
REMOVE_LIST=(
'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide' \
'1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene' \
'2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione' \
'1,4,5-triphenyl-4-imidazoline-2-thione' \
'1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole' \
'1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide' \
'1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one' \
'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
'4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
)
# collect list of files (a glob is safer than parsing ls output)
FILE_LIST=(./*out.txt)
# loop on files
for FILE in "${FILE_LIST[@]}"
do
    echo "$FILE"
    # work on a copy until this is known to be working
    cp "$FILE" "${FILE}_tmp"
    # loop on strings to remove
    for REMOVE_STRING in "${REMOVE_LIST[@]}"
    do
        echo "$REMOVE_STRING"
        # -F matches the string literally, so [ ] ( ) { } are not read as regex syntax
        grep -F -v "$REMOVE_STRING" "${FILE}_tmp" > TEMP && mv TEMP "${FILE}_tmp"
    done
done
This code works for the line removal but is rather inefficient since it has to make separate calls to grep for each item in the remove list and do that for every file. This does not have to be particularly fast, but I would prefer if it was not quite so moronic.
As for the line renumbering starting with the 15th line, I have no idea.
Suggestions would be appreciated.
---------- Post updated at 07:17 PM ---------- Previous update was at 06:22 PM ----------
This is part of one of the files. You can see that the numbering starts on the line following the second f*order header line. If it helps, the numbers start on the first line that begins with a number. The f*order field can take values from f0 to f9. The number of columns and rows varies from file to file. This example shows the first 8 columns and 10 data rows.
Code:
f0order CVorder Name f0 RI_7 E99 E199 E299
NA NA NA NA R_r2 0.796 0.831 0.848
NA NA NA NA R_MeAE 88.54 80.06 76.27
NA NA NA NA R_MdAE 72.24 63.66 61.66
NA NA NA NA R_SE 104.44 96.49 92.37
NA NA NA NA T_r2 0.794 0.821 0.827
NA NA NA NA T_MeAE 108.38 105.79 99.11
NA NA NA NA T_MdAE 88.95 91.94 86.61
NA NA NA NA T_SE 107.44 105.46 104.84
NA NA NA NA V_r2 0.83 0.847 0.857
NA NA NA NA V_MeAE 108.36 103.86 97.23
NA NA NA NA V_MdAE 96.69 90.04 79.31
NA NA NA NA V_SE 102.58 103.24 102.13
f0order CVorder Name f0 RI_7 E99 E199 E299
1 2 2-ethylpyridine R 519 683 653 638
2 3 3-ethylpyridine R 535 675 646 631
3 4 2,6-lutidine R 506 632 614 608
4 5 2,5-lutidine R 517 620 605 598
5 6 2,3-lutidine R 518 612 598 592
6 7 3,4-lutidine R 528 600 589 583
7 8 3,5-lutidine R 532 569 560 559
8 9 2,4,6-collidine R 544 585 586 590
9 10 4-(methylamino)pyridine R 511 450 429 417
10 12 4-dimethylaminopyridine R 533 500 487 481
The only thing I can think of at the moment would be to copy the first 14 lines to a temp file and then delete them. Then I would renumber the rest of the file and then cat the file back together.
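The split, renumber and cat idea above can probably be collapsed into a single awk pass. This is only a minimal sketch: the function name renumber_after is made up for illustration, and it assumes the columns are tab-separated and that the data always starts after a fixed number of lines (14 in the files described here).

```shell
#!/bin/bash
# renumber_after SKIP FILE   (hypothetical helper, not from the thread)
# Print FILE with field 1 rewritten as 1, 2, 3, ... on every line after
# the first SKIP lines. Assumes tab-separated columns; adjust FS if not.
renumber_after() {
    local skip=$1 file=$2
    awk -v skip="$skip" '
        BEGIN { FS = OFS = "\t" }
        NR > skip { $1 = ++n }   # replace the old number with a running count
        { print }                # header lines pass through unchanged
    ' "$file"
}

# Small demo on throwaway data: 2 header lines, then rows numbered 1, 3, 4.
demo=$(mktemp)
printf 'h1\th2\nNA\tNA\n1\ta\n3\tb\n4\tc\n' > "$demo"
renumber_after 2 "$demo"
rm -f "$demo"
```

For the files in this thread the call would be something like renumber_after 14 "$FILE" > "$FILE"_ && mv "$FILE"_ "$FILE", which avoids the separate header and data temp files.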
Do the lines in the remove list exist (unquoted) in a file? (If so, a single grep -Fvf string_file file would seem better for this problem than one grep invocation for each fixed string. But awk is probably better yet since it can do both the line removal and the renumbering.) Do any of these strings ever contain any whitespace characters?
Does the renumbering apply only to the 1st field in the lines to be renumbered? Or, does the 2nd field also need to be modified? (If so, how?)
You said that the number of rows and columns vary from file to file. Does the field to be matched also vary, or is it always the 3rd field? If it isn't always the 3rd field, is it always in a field with the string Name as the header in line 1 in that file? (An awk script will run faster if we know which field to match.)
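To make the grep -Fvf suggestion concrete, here is a rough sketch. The function name remove_listed and the demo strings are invented for illustration; the only assumption is that the strings to remove sit one per line, unquoted, in a file.

```shell
#!/bin/bash
# remove_listed LIST_FILE FILE   (hypothetical helper, not from the thread)
# Print FILE without any line that contains one of the fixed strings
# listed one per line in LIST_FILE.
remove_listed() {
    local list_file=$1 file=$2
    # -F: fixed strings, so [ ] ( ) { } are literal
    # -v: keep only the lines that do NOT match
    # -f: read the strings from a file
    # Caution: an empty line in LIST_FILE matches every line of FILE.
    grep -F -v -f "$list_file" "$file"
}

# Small demo on throwaway data; the repeated string in the list is harmless.
list=$(mktemp); data=$(mktemp)
printf '%s\n' '2-[(a)b]' '2-[(a)b]' > "$list"
printf '1\t2-[(a)b]\n2\tkeep-me\n' > "$data"
remove_listed "$list" "$data"
rm -f "$list" "$data"
```

Note that this matches the string anywhere on the line rather than in one particular field, so a listed name that is a substring of a longer name would remove that longer line too. That is one reason an awk test against a single field may be the safer choice.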
This User Gave Thanks to Don Cragun For This Post:
This is a script that currently works for this task. It is messy and inelegant, but I am posting it because I think working code is often a better explanation than a description in prose, even where the code leaves a lot to be desired.
Code:
#!/bin/bash
REMOVE_LIST=(
'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide' \
'1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene' \
'2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione' \
'1,4,5-triphenyl-4-imidazoline-2-thione' \
'1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole' \
'1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide' \
'1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one' \
'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
'4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
)
SET='A'
PARAM_SET='ON-0.25'
FOLD_LIST=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)
AS_LIST=(V_mae V_se S_mae S_se)
# loop on fold list
for FOLD in "${FOLD_LIST[@]}"
do
    # loop on as list
    for ANNEALING_SET in "${AS_LIST[@]}"
    do
        # assign directory name
        FILE_DIR="./$SET/$FOLD/${FOLD}_anneal/$PARAM_SET/$ANNEALING_SET"
        # collect list of files (a glob avoids parsing ls output)
        FILE_LIST=("$FILE_DIR"/*out.txt)
        # loop on files
        for FILE in "${FILE_LIST[@]}"
        do
            echo "$FILE"
            # loop on strings to remove
            for REMOVE_STRING in "${REMOVE_LIST[@]}"
            do
                echo "$REMOVE_STRING"
                # remove lines containing the fixed string, rewriting the file in place
                grep -F -v "$REMOVE_STRING" "$FILE" > TEMP && mv TEMP "$FILE"
            done
            # re number data rows
            # copy first 14 lines to temp file
            sed 14q "$FILE" > "$FILE_DIR/headers.txt"
            # copy remaining lines to temp file
            sed -n '15,$p' "$FILE" > "$FILE_DIR/data.txt"
            # add new line numbers to data block
            nl "$FILE_DIR/data.txt" > "$FILE_DIR/TEMP"
            mv "$FILE_DIR/TEMP" "$FILE_DIR/data.txt"
            # remove old numbering column
            cut -f1,3- "$FILE_DIR/data.txt" > "$FILE_DIR/TEMP"
            mv "$FILE_DIR/TEMP" "$FILE_DIR/data.txt"
            # recombine headers with data
            cat "$FILE_DIR/headers.txt" "$FILE_DIR/data.txt" > "$FILE"
            # cleanup
            rm "$FILE_DIR/headers.txt" "$FILE_DIR/data.txt"
        done
    done
done
To answer your questions, the strings do not exist in a file, but that would be easy enough to create if it would be helpful.
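Creating that file from the existing array could look something like the sketch below, which also drops the repeated strings mentioned at the start of the thread. The function name write_remove_file is made up for illustration; the two names in the demo are taken from the REMOVE_LIST above.

```shell
#!/bin/bash
# write_remove_file OUT STRING...   (hypothetical helper, not from the thread)
# Write each STRING on its own line in OUT, dropping repeats while keeping
# the order in which the strings were first seen.
write_remove_file() {
    local out=$1; shift
    # With no strings at all, write an empty file rather than one empty line.
    [ "$#" -gt 0 ] || { : > "$out"; return; }
    printf '%s\n' "$@" | awk '!seen[$0]++' > "$out"
}

# Small demo: three entries, one of them repeated.
REMOVE_LIST=(
    '1,4,5-triphenyl-4-imidazoline-2-thione'
    '1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene'
    '1,4,5-triphenyl-4-imidazoline-2-thione'
)
list_file=$(mktemp)
write_remove_file "$list_file" "${REMOVE_LIST[@]}"
cat "$list_file"    # two lines; the repeat is gone
rm -f "$list_file"
```

Quoting "${REMOVE_LIST[@]}" matters here: without the quotes the shell would treat the brackets in the names as glob patterns.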
Only the first column of the data section needs to be renumbered, the numbers in column 2 can remain as is.
The remove string will always refer to the 3rd field, and this will always be the name column.
The headers are on line 1 and then are duplicated on line 14.
There could be cases in the future where there are more than 14 rows before the data begins. In all cases, you could look for the second instance of the header row (some row that matches row 1) to know that the data starts on the next row.
You could try something like this as a replacement:
Code:
#!/bin/bash
# Initialize variables:
AS_LIST=(V_mae V_se S_mae S_se)
FOLD_LIST=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)
PARAM_SET='ON-0.25'
REMOVE_LIST=(
'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one'
'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide'
'1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene'
'2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione'
'1,4,5-triphenyl-4-imidazoline-2-thione'
'1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole'
'1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide'
'1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one'
'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one'
'4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
)
SET='A'
# loop on fold list
for FOLD in "${FOLD_LIST[@]}"
do
    # loop on as list
    for AS in "${AS_LIST[@]}"
    do
        # assign directory name; the braces stop the shell from reading
        # $FOLD_anneal as a single (unset) variable name
        FILE_DIR="./$SET/$FOLD/${FOLD}_anneal/$PARAM_SET/$AS"
        # loop on files
        for FILE in "$FILE_DIR"/*out.txt
        do
            echo "$FILE"
            printf '%s\n' "${REMOVE_LIST[@]}" | awk '
            BEGIN {
                FS = OFS = "\t"
            }
            FNR == NR {
                # Gather remove list...
                rl[$0]
                next
            }
            FNR == 1 {
                # Get header from 2nd file.
                h = $0
                hc = 2
            }
            # Copy input lines until we have copied the
            # header line twice...
            hc {
                if(h == $0) {
                    # Decrement the # of times we
                    # need to print the header...
                    hc--
                }
                print
                next
            }
            # Skip lines with Name (field 3) in remove list.
            $3 in rl {
                next
            }
            {   # Renumber remaining lines.
                $1 = ++oc
            }
            1   # Print renumbered lines.
            ' - "$FILE" > "$FILE"_ && mv "$FILE"_ "$FILE"
        done
    done
done
In case you are interested, here is a procedural Perl solution that can be expanded to do more processing if needed.
Code:
#!/usr/bin/perl
use strict;
use warnings;

# the first file will contain patterns to match; stop the process if no file is given
my $pattern_file = shift or die "A file with searching patterns must be given: $!\n";
open my $match_lines, '<', $pattern_file or die "Could not open $pattern_file $!\n";
# index every line as a pattern to search
my %search = map { chomp; $_, 1 } <$match_lines>;
close $match_lines;

# other files to work on must be given. Process them one by one
while (@ARGV) {
    my $n; # counter to re-number lines 15 and above
    my $filename_in = shift; # get next filename to work on
    my $filename_out = "new_" . $filename_in; # fabricate next output filename
    # create corresponding input and output file handles
    open my $cur_file_in, '<', "$filename_in" or die "Could not open $filename_in: $!\n";
    open my $cur_file_out, '>', "$filename_out" or die "Could not create $filename_out: $!\n";
    # feedback to stdout
    print "processing file $filename_in to $filename_out... ";
    # processing lines from current input file
    while (my $line = <$cur_file_in>) {
        # tokenizing into fields, separated by one or more spaces, the read line
        my @fields = split /\s+/, $line;
        # explicitly, saying that is OK if the variable is not defined
        no warnings 'uninitialized';
        # keep any lines where the pattern is not found in third field
        if (not $search{$fields[2]}) {
            # lines above 14 to save gets a renumber sequence
            if ($. > 14 and $fields[0] ne '\n') {
                $fields[0]= ++$n;
            }
            # results are finally written to disk
            print $cur_file_out "@fields\n";
        }
    }
    $n = 0; # clear the renumbering counter for next file
    # ready to recycle
    close $cur_file_in;
    close $cur_file_out;
    # feedback to stdout
    print "[done]\n";
}
Usage
Code:
perl prog.pl rlist foo.*
Last edited by Aia; 10-12-2014 at 04:53 PM..
Reason: grammar
Well the script I posted works, but it takes more than 38 minutes to run on the directory tree I tested. The test directory has 2000 files spread over 40 directories with 2446 lines per file and a remove list of 28 strings.
The script posted by Don Cragun finished in,
real 1m50.953s
user 3m0.410s
sys 0m49.394s
I like the fact that this script looks for the second instance of the header row to know where to start the renumbering. There are versions of my data where that would be useful.
For the Perl script posted by Aia, these files are in a directory structure of 40 different directories. I don't know perl well enough to set up the looping to troll through all of that and test the script. It seems as if I would have to run my script and then call your script and pass $FILE_LIST along with the file with the patterns to remove. Is that right?
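One possible way to drive a per-directory command over a tree like this is sketched below. Everything here is an assumption rather than something from the thread: the function name run_in_each_dir is made up, and it changes into each directory before running the command because the Perl script builds its output name by putting new_ in front of the input name, which would not work with a path that contains directories.

```shell
#!/bin/bash
# run_in_each_dir ROOT CMD [ARGS...]   (hypothetical helper, not from the thread)
# For every directory under ROOT that holds at least one *out.txt file,
# cd into it in a subshell and run CMD ARGS... followed by that
# directory's *out.txt files.
# Because of the cd, any file named in ARGS should be an absolute path.
# Directory names containing newlines are not handled.
run_in_each_dir() {
    local root=$1; shift
    local dir
    find "$root" -type f -name '*out.txt' -exec dirname {} \; | sort -u |
    while IFS= read -r dir; do
        # </dev/null keeps CMD from swallowing the directory list on stdin
        ( cd "$dir" && "$@" *out.txt ) </dev/null
    done
}
```

With the names from the Usage line above, a call might look like run_in_each_dir ./A perl "$PWD/prog.pl" "$PWD/rlist". That would leave a new_ copy of each file next to the original, so the results can be checked before anything is overwritten.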
From the post by Scrutinizer, I also do not quite see how to loop through my directory structure and create lists of files to pass to the code.
Once again, it is amazing how much better a well formed script will perform.