Remove line containing string and renumber

10-11-2014

Hello,

I have some files in a directory and a short list of strings. I want to loop through the files and remove lines containing the string and renumber.

There are some issues. The first is that the strings can contain troublesome characters like single quotes and parentheses. Here is one list of strings,

Code:

1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide
1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene
2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione
1,4,5-triphenyl-4-imidazoline-2-thione
1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole
1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide
1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide

It is very likely that the list will contain the same string more than once. I either need to clean that up or have the script allow for instances where the string is not found.
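
Cleaning up the duplicates is a one-liner if the strings are kept one per line in a file. A minimal sketch, assuming a hypothetical file named rmlist.txt (the file names and the temp directory are illustrative, not part of the original setup):

```shell
#!/bin/bash
# Hypothetical file names; the strings are a subset of the list above.
workdir=$(mktemp -d)
cat > "$workdir/rmlist.txt" <<'EOF'
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
1,4,5-triphenyl-4-imidazoline-2-thione
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
EOF

# Print a line only the first time it is seen; the original order is kept.
awk '!seen[$0]++' "$workdir/rmlist.txt" > "$workdir/rmlist_unique.txt"

unique_count=$(wc -l < "$workdir/rmlist_unique.txt")
echo "unique strings: $((unique_count))"
```

The quoted here-document delimiter ('EOF') keeps the shell from interpreting the brackets, braces and parentheses in the names.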

The other complexity is that the line numbering doesn't start until the 15th line of the file.

I was thinking of something like,

Code:

#!/bin/bash

REMOVE_LIST=(
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide' \
             '1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene' \
             '2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione' \
             '1,4,5-triphenyl-4-imidazoline-2-thione' \
             '1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole' \
             '1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide' \
             '1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one' \
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             '4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
            )

# loop on files (glob directly instead of parsing ls output)
for FILE in ./*out.txt
do
   echo "$FILE"
   # work on a copy so the original is untouched while testing
   cp "$FILE" "${FILE}_tmp"

   # loop on strings to remove (quoted so each string stays one word)
   for REMOVE_STRING in "${REMOVE_LIST[@]}"
   do
      echo "$REMOVE_STRING"
      # -F treats the string as fixed text, not as a regular expression
      grep -F -v "$REMOVE_STRING" "${FILE}_tmp" > TEMP && mv TEMP "${FILE}_tmp"
   done

done

This code works for the line removal but is rather inefficient since it has to make separate calls to grep for each item in the remove list and do that for every file. This does not have to be particularly fast, but I would prefer if it was not quite so moronic.

As for the line renumbering starting with the 15th line, I have no idea.

Suggestions would be appreciated.

---------- Post updated at 07:17 PM ---------- Previous update was at 06:22 PM ----------

This is part of one of the files. You can see that the numbering starts on the line following the second f*order header line. If it helps, the numbers start on the first line that begins with a number. The f*order field can have values from f0 to f9. The number of columns and rows in the files varies. This example shows the first 8 columns and 10 data rows.

Code:

f0order	CVorder	Name	f0	RI_7	E99	E199	E299
NA	NA	NA	NA	R_r2	0.796	0.831	0.848
NA	NA	NA	NA	R_MeAE	88.54	80.06	76.27
NA	NA	NA	NA	R_MdAE	72.24	63.66	61.66
NA	NA	NA	NA	R_SE	104.44	96.49	92.37
NA	NA	NA	NA	T_r2	0.794	0.821	0.827
NA	NA	NA	NA	T_MeAE	108.38	105.79	99.11
NA	NA	NA	NA	T_MdAE	88.95	91.94	86.61
NA	NA	NA	NA	T_SE	107.44	105.46	104.84
NA	NA	NA	NA	V_r2	0.83	0.847	0.857
NA	NA	NA	NA	V_MeAE	108.36	103.86	97.23
NA	NA	NA	NA	V_MdAE	96.69	90.04	79.31
NA	NA	NA	NA	V_SE	102.58	103.24	102.13
f0order	CVorder	Name	f0	RI_7	E99	E199	E299
1	2	2-ethylpyridine	R	519	683	653	638
2	3	3-ethylpyridine	R	535	675	646	631
3	4	2,6-lutidine	R	506	632	614	608
4	5	2,5-lutidine	R	517	620	605	598
5	6	2,3-lutidine	R	518	612	598	592
6	7	3,4-lutidine	R	528	600	589	583
7	8	3,5-lutidine	R	532	569	560	559
8	9	2,4,6-collidine	R	544	585	586	590
9	10	4-(methylamino)pyridine	R	511	450	429	417
10	12	4-dimethylaminopyridine	R	533	500	487	481

The only thing I can think of at the moment would be to copy the first 14 lines to a temp file and then delete them. Then I would renumber the rest of the file and then cat the file back together.
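
The split, renumber and rejoin idea can be sketched without named temp files. This is only an illustration under stated assumptions: the fields are tab-separated, the header block has a fixed length (14 in the real files, 3 in this shortened stand-in), and the file name sample_out.txt and the variable header_lines are hypothetical.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Shortened stand-in for a real file: 3 header lines, then data rows
# whose first column has gaps after some rows were removed.
printf 'f0order\tCVorder\tName\nNA\tNA\tNA\nf0order\tCVorder\tName\n1\t2\t2-ethylpyridine\n3\t4\t2,6-lutidine\n4\t5\t2,5-lutidine\n' > "$workdir/sample_out.txt"

header_lines=3   # would be 14 for the files described above

{
  # Pass the header block through unchanged.
  head -n "$header_lines" "$workdir/sample_out.txt"
  # Overwrite column 1 of every remaining line with its position.
  tail -n +"$((header_lines + 1))" "$workdir/sample_out.txt" |
    awk 'BEGIN { FS = OFS = "\t" } { $1 = NR } 1'
} > "$workdir/renumbered.txt"

first_column=$(cut -f1 "$workdir/renumbered.txt" | tr '\n' ' ')
echo "$first_column"
```

Inside the awk call NR counts only the lines that tail passes on, so the data rows are numbered from 1 regardless of how long the header block is.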

LMHmedchem

10-11-2014

Do the lines in the remove list exist (unquoted) in a file? (If so, a single grep -Fvf string_file file would seem better for this problem than one grep invocation for each fixed string. But awk is probably better yet since it can do both the line removal and the renumbering.) Do any of these strings ever contain any whitespace characters?

Does the renumbering apply only to the 1st field in the lines to be renumbered? Or, does the 2nd field also need to be modified? (If so, how?)

You said that the number of rows and columns vary from file to file. Does the field to be matched also vary, or is it always the 3rd field? If it isn't always the 3rd field, is it always in a field with the string Name as the header in line 1 in that file? (An awk script will run faster if we know which field to match.)
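
The single-grep idea mentioned above could look like the sketch below. It assumes the strings have been saved one per line in a file called rmlist (a hypothetical name) and uses a three-row stand-in for a real data file. Duplicate strings and strings that match nothing are both harmless here.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Remove list with a duplicate and with a string that matches nothing.
printf '%s\n' \
  '4-(methylamino)pyridine' \
  'no-such-compound' \
  '4-(methylamino)pyridine' > "$workdir/rmlist"

# Three tab-separated data rows standing in for a real file.
printf '1\t2\t2-ethylpyridine\tR\n2\t3\t4-(methylamino)pyridine\tR\n3\t4\t2,6-lutidine\tR\n' > "$workdir/data.txt"

# -F: patterns are fixed strings, -v: print non-matching lines,
# -f: read the patterns from a file, one per line.
kept=$(grep -F -v -f "$workdir/rmlist" "$workdir/data.txt" | cut -f3)
printf '%s\n' "$kept"
```

Note that grep -F matches a string anywhere on the line, so a name that is a substring of a longer name would remove both lines. Comparing against the third field in awk avoids that.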

Don Cragun

10-11-2014

This is a script that currently works for this task. It is messy and inelegant, but I am posting it since I sometimes think that working code is a better explanation than a description given in prose, even where the code leaves a lot to be desired.

Code:

#!/bin/bash

REMOVE_LIST=(
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide' \
             '1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene' \
             '2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione' \
             '1,4,5-triphenyl-4-imidazoline-2-thione' \
             '1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole' \
             '1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide' \
             '1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one' \
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             '4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
            )

SET='A'
PARAM_SET='ON-0.25'
FOLD_LIST=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)
AS_LIST=(V_mae V_se S_mae S_se)

# loop on fold list
for FOLD in ${FOLD_LIST[@]}
do
   # loop on as list
   for ANNEALING_SET in ${AS_LIST[@]}
   do

      # assign directory name
      FILE_DIR=$(ls -d './'$SET'/'$FOLD'/'$FOLD'_anneal/'$PARAM_SET'/'$ANNEALING_SET)
      # collect list of files
      FILE_LIST=($(ls $FILE_DIR'/'*'out.txt' ))

      # loop on files
      for FILE in ${FILE_LIST[@]}
      do
         echo $FILE

         # loop on strings to remove (quoted so each string stays one word)
         for REMOVE_STRING in "${REMOVE_LIST[@]}"
         do
            echo "$REMOVE_STRING"
            # remove lines containing the fixed string, then replace the file
            grep -F -v "$REMOVE_STRING" "$FILE" > TEMP && mv TEMP "$FILE"
         done

         # re number data rows
         # copy first 14 lines to temp file
         sed 14q $FILE > './'$FILE_DIR'/'headers.txt
         # copy remaining lines to temp file
         sed -n '15,$p' $FILE > './'$FILE_DIR'/'data.txt
         # add new line numbers to data block
         nl './'$FILE_DIR'/'data.txt > './'$FILE_DIR'/'TEMP
         mv './'$FILE_DIR'/'TEMP  './'$FILE_DIR'/'data.txt
         # remove old numbering column
         cut './'$FILE_DIR'/'data.txt -f1,3- > './'$FILE_DIR'/'TEMP
         mv './'$FILE_DIR'/'TEMP './'$FILE_DIR'/'data.txt
         # recombine headers with data
         cat './'$FILE_DIR'/'headers.txt  './'$FILE_DIR'/'data.txt > $FILE
         # cleanup
         rm './'$FILE_DIR'/'headers.txt;  rm './'$FILE_DIR'/'data.txt

      done
   done
done

To answer your questions, the strings do not exist in a file, but that would be easy enough to create if it would be helpful.

Only the first column of the data section needs to be renumbered, the numbers in column 2 can remain as is.

The remove string will always refer to the 3rd field, and this will always be the name column.

The headers are on line 1 and then are duplicated on line 14.

There could be cases in the future where there are more than 14 rows before the data begins. In all cases, you could look for the second instance of the header row (some row that matches row 1) to know that the data starts on the next row.
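
Finding the data start from the repeated header row can be sketched in a few lines of awk. This assumes the second header is an exact copy of line 1, as in the sample above; the file name sample_out.txt and the shortened header block are hypothetical.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Stand-in file: header on line 1, repeated on line 4, data from line 5.
printf 'f0order\tCVorder\tName\nNA\tNA\tNA\nNA\tNA\tNA\nf0order\tCVorder\tName\n1\t2\t2-ethylpyridine\n2\t3\t3-ethylpyridine\n' > "$workdir/sample_out.txt"

# Remember line 1; when the same text shows up again, the data
# begins on the following line, so report that line number and stop.
data_start=$(awk 'NR == 1 { header = $0; next }
                  $0 == header { print NR + 1; exit }' "$workdir/sample_out.txt")

echo "data starts on line $data_start"
```

If the header never repeats, data_start is empty, which a calling script would need to check before using it.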

LMHmedchem

10-12-2014

You could try something like this as a replacement:

Code:

#!/bin/bash
# Initialize variables:
AS_LIST=(V_mae V_se S_mae S_se)
FOLD_LIST=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)
PARAM_SET='ON-0.25'
REMOVE_LIST=(
	'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one'
	'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide'
	'1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene'
	'2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione'
	'1,4,5-triphenyl-4-imidazoline-2-thione'
	'1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole'
	'1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide'
	'1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one'
	'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one'
	'4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
)
SET='A'

# loop on fold list
for FOLD in ${FOLD_LIST[@]}
do
	# loop on as list
	for AS in ${AS_LIST[@]}
	do

		# assign directory name (braces stop the shell from expanding
		# a nonexistent variable named FOLD_anneal)
		FILE_DIR="./$SET/$FOLD/${FOLD}_anneal/$PARAM_SET/$AS"
		# loop on files
		for FILE in "$FILE_DIR"/*out.txt
		do	echo "$FILE"
			printf '%s\n' "${REMOVE_LIST[@]}" | awk '
				BEGIN {	FS = OFS = "\t"
				}
				FNR == NR {
					# Gather remove list...
					rl[$0]
					next
				}
				FNR == 1 {
					# Get header from 2nd file.
					h = $0
					hc = 2
				}
				# Copy input lines until we have copied the
				# header line twice...
				hc {	if(h == $0) {
						# Decrement the # of times we
						# need to print the header...
						hc--
					}
					print
					next
				}
				# Skip lines with Name (field 3) in remove list.
				$3 in rl {
					next
				}
				{	# Renumber remaining lines.
					$1 = ++oc
				}
				1	# Print renumbered lines.
			' - "$FILE" > "$FILE"_ && mv "$FILE"_ "$FILE"
		done
	done
done
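
The two-input idiom at the heart of the script above can be seen in isolation in the sketch below. It is a stripped-down illustration, not a replacement: there is no header handling, the names to drop come from a hypothetical shell variable remove_names, and data.txt is a four-row stand-in file.

```shell
#!/bin/bash
workdir=$(mktemp -d)
remove_names='3-ethylpyridine
2,6-lutidine'
printf '1\t2\t2-ethylpyridine\n2\t3\t3-ethylpyridine\n3\t4\t2,6-lutidine\n4\t5\t2,5-lutidine\n' > "$workdir/data.txt"

# "-" makes awk read standard input as its first input, so FNR == NR
# is true only while the remove list is being read.
renumbered=$(printf '%s\n' "$remove_names" | awk '
  BEGIN { FS = OFS = "\t" }
  FNR == NR { skip[$0]; next }   # first input: names to drop
  $3 in skip { next }            # exact match on the Name field
  { $1 = ++n; print }            # renumber the rows that are kept
' - "$workdir/data.txt")

printf '%s\n' "$renumbered"
```

Because the test is an exact lookup of field 3 in an array, duplicate names in the list and names that never occur in the file need no special handling.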

Don Cragun

10-12-2014

Alternatively in case it starts on the 15th line, try:

Code:

awk 'NR==FNR{A[$1]; next} FNR==1{c=1; close(f); f=FILENAME ".new"} $3 in A{next} FNR>14{sub($1,c++)} {print>f}' rmlist *out.txt

where rmlist contains:

Code:

1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide
1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene
2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione
1,4,5-triphenyl-4-imidazoline-2-thione
1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole
1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide
1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide

Afterwards, the new files will have a .new extension.

Scrutinizer

10-12-2014

In case you are interested, here is a procedural Perl solution that can be expanded to do more processing if needed.

Code:

#!/usr/bin/perl

use strict;
use warnings;

# the first file will contain patterns to match; stop the process if no file is given
my $pattern_file = shift or die "A file with searching patterns must be given: $!\n";
open my $match_lines, '<', $pattern_file or die "Could not open $pattern_file $!\n";

# index every line as a pattern to search
my %search = map { chomp; $_, 1 } <$match_lines>;
close $match_lines;

# other files to work on must be given. Process them one by one
while (@ARGV) {

	my $n; # counter to re-number lines 15 and above
	my $filename_in = shift; # get next filename to work on 
	my $filename_out = "new_" . $filename_in; # fabricate next output filename
	
	# create corresponding input and output file handles
	open my $cur_file_in, '<', "$filename_in" or die "Could not open $filename_in: $!\n";
	open my $cur_file_out, '>', "$filename_out" or die "Could not create $filename_out: $!\n";

	# feedback to stdout
	print "processing file $filename_in to $filename_out... ";

	# processing lines from current input file
	while (my $line = <$cur_file_in>) {
		
		# tokenize the line into its tab-separated fields,
		# keeping empty trailing fields
		chomp $line;
		my @fields = split /\t/, $line, -1;

		# explicitly, saying that is OK if the variable is not defined
		no warnings 'uninitialized';
		
		# keep any lines where the pattern is not found in third field
		if (not $search{$fields[2]}) {
			
			# kept lines after line 14 get a new sequence number
			if ($. > 14 and @fields) {
				$fields[0] = ++$n;
			}
			# results are finally written to disk, tab-separated like the input
			print $cur_file_out join("\t", @fields), "\n";
		}
	}
	$n = 0; # clear the renumbering counter for next file
	# ready to recycle
	close $cur_file_in;
	close $cur_file_out;
	
	# feedback to stdout
	print "[done]\n";
}

Usage

Code:

perl prog.pl rlist foo.*

Aia

10-13-2014

Well the script I posted works, but it takes more than 38 minutes to run on the directory tree I tested. The test directory has 2000 files spread over 40 directories with 2446 lines per file and a remove list of 28 strings.

The script posted by Don Cragun finished in,
real 1m50.953s
user 3m0.410s
sys 0m49.394s

I like the fact that this script looks for the second instance of the header row to know where to start the renumbering. There are versions of my data where that would be useful.

For the Perl script posted by Aia, these files are in a directory structure of 40 different directories. I don't know perl well enough to set up the looping to troll through all of that and test the script. It seems as if I would have to run my script and then call your script and pass $FILE_LIST along with the file with the patterns to remove. Is that right?

From the post by Scrutinizer, I also do not quite see how to loop through my directory structure and create lists of files to pass to the code.

Once again, it is amazing how much better a well formed script will perform.

LMHmedchem
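
One possible way to feed a whole directory tree to a single awk program is to let find collect the files. The sketch below is an adaptation of the one-liner approach, not the posted code: it assigns to the first field instead of calling sub(), assumes tab-separated fields, and replaces the fixed 14 with a hypothetical variable skip so the demonstration files can stay short. The directory layout and file names are made up for the example.

```shell
#!/bin/bash
workdir=$(mktemp -d)
mkdir -p "$workdir/A/f0/V_mae" "$workdir/A/f1/V_se"
printf '%s\n' '3-ethylpyridine' > "$workdir/rmlist"

# Two identical stand-in files: 2 header lines, then 3 data rows.
for dir in "$workdir/A/f0/V_mae" "$workdir/A/f1/V_se"
do
  printf 'f0order\tCVorder\tName\nf0order\tCVorder\tName\n1\t2\t2-ethylpyridine\n2\t3\t3-ethylpyridine\n3\t4\t2,6-lutidine\n' > "$dir/run_out.txt"
done

skip=2   # number of header lines; 14 for the real files

# find appends the matching files after the remove list, so the list
# is always the first input that awk reads.
find "$workdir/A" -type f -name '*out.txt' -exec awk -v skip="$skip" '
  BEGIN { FS = OFS = "\t" }
  NR == FNR { remove[$1]; next }                            # load the remove list
  FNR == 1 { count = 1; close(out); out = FILENAME ".new" } # start of each data file
  $3 in remove { next }                                     # drop listed names
  FNR > skip { $1 = count++ }                               # renumber data rows
  { print > out }
' "$workdir/rmlist" {} +

result=$(cut -f1,3 "$workdir/A/f1/V_se/run_out.txt.new" | tr '\t\n' ': ')
echo "$result"
```

Each input file gets a sibling with a .new suffix, which leaves the originals in place until the output has been checked.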
