Remove line containing string and renumber


 
# 1  
Old 10-11-2014

Hello,

I have some files in a directory and a short list of strings. I want to loop through the files and remove lines containing the string and renumber.

There are some issues. The first is that the strings can contain troublesome characters like single quotes and parentheses. Here is one list of strings,

Code:
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide
1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene
2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione
1,4,5-triphenyl-4-imidazoline-2-thione
1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole
1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide
1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide

It is very likely that the list will contain the same string more than once. I either need to clean that up or have the script allow for instances where the string is not found.
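If cleaning up the list turns out to be the easier route, the duplicates can be dropped before any matching is done. This is only a sketch, assuming the strings are kept one per line in a file; `dedupe_list`, `rmlist.txt` and `rmlist_unique.txt` are made-up names, not part of the original script.

```shell
#!/bin/bash
# dedupe_list FILE: print each line of FILE once, keeping first-seen order.
# (Hypothetical helper for illustration.)
dedupe_list() {
    awk '!seen[$0]++' "$1"
}

# Demo in a scratch directory; the first name is listed twice.
workdir=$(mktemp -d)
cat > "$workdir/rmlist.txt" <<'EOF'
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
EOF

dedupe_list "$workdir/rmlist.txt" > "$workdir/rmlist_unique.txt"
wc -l < "$workdir/rmlist_unique.txt"
```

Because the awk filter compares whole lines as plain text, the brackets and parentheses in the names need no escaping.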

The other complexity is that the line numbering doesn't start until the 15th line of the file.

I was thinking of something like,
Code:
#!/bin/bash

REMOVE_LIST=(
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide' \
             '1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene' \
             '2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione' \
             '1,4,5-triphenyl-4-imidazoline-2-thione' \
             '1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole' \
             '1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide' \
             '1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one' \
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             '4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
            )

# collect list of files
FILE_LIST=( ./*out.txt )

# loop on files
for FILE in "${FILE_LIST[@]}"
do
   echo "$FILE"
   # work on a copy until this is tested, then write back to "$FILE" itself
   cp "$FILE" "${FILE}_tmp"

   # loop on strings to remove
   for REMOVE_STRING in "${REMOVE_LIST[@]}"
   do
      echo "$REMOVE_STRING"
      # -F matches the string as fixed text, so [ ] ( ) { } are not read as a regex
      grep -F -v -- "$REMOVE_STRING" "${FILE}_tmp" > TEMP && mv TEMP "${FILE}_tmp"
   done

done

This code works for the line removal but is rather inefficient since it has to make separate calls to grep for each item in the remove list and do that for every file. This does not have to be particularly fast, but I would prefer if it was not quite so moronic.

As for the line renumbering starting with the 15th line, I have no idea.

Suggestions would be appreciated.

---------- Post updated at 07:17 PM ---------- Previous update was at 06:22 PM ----------

This is part of one of the files. You can see that the numbering starts on the line following the second f*order header row. If it helps, the numbers start on the first line that begins with a number. The f*order field can have values from f0 to f9. The number of columns and rows varies from file to file. This example shows the first 8 columns and 10 data rows.

Code:
f0order	CVorder	Name	f0	RI_7	E99	E199	E299
NA	NA	NA	NA	R_r2	0.796	0.831	0.848
NA	NA	NA	NA	R_MeAE	88.54	80.06	76.27
NA	NA	NA	NA	R_MdAE	72.24	63.66	61.66
NA	NA	NA	NA	R_SE	104.44	96.49	92.37
NA	NA	NA	NA	T_r2	0.794	0.821	0.827
NA	NA	NA	NA	T_MeAE	108.38	105.79	99.11
NA	NA	NA	NA	T_MdAE	88.95	91.94	86.61
NA	NA	NA	NA	T_SE	107.44	105.46	104.84
NA	NA	NA	NA	V_r2	0.83	0.847	0.857
NA	NA	NA	NA	V_MeAE	108.36	103.86	97.23
NA	NA	NA	NA	V_MdAE	96.69	90.04	79.31
NA	NA	NA	NA	V_SE	102.58	103.24	102.13
f0order	CVorder	Name	f0	RI_7	E99	E199	E299
1	2	2-ethylpyridine	R	519	683	653	638
2	3	3-ethylpyridine	R	535	675	646	631
3	4	2,6-lutidine	R	506	632	614	608
4	5	2,5-lutidine	R	517	620	605	598
5	6	2,3-lutidine	R	518	612	598	592
6	7	3,4-lutidine	R	528	600	589	583
7	8	3,5-lutidine	R	532	569	560	559
8	9	2,4,6-collidine	R	544	585	586	590
9	10	4-(methylamino)pyridine	R	511	450	429	417
10	12	4-dimethylaminopyridine	R	533	500	487	481

The only thing I can think of at the moment would be to copy the first 14 lines to a temp file and then delete them. Then I would renumber the rest of the file and then cat the file back together.
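The split-and-rejoin idea can also be done in one pass without temp files. The following is a rough sketch, assuming the files are tab-separated and the data always starts on a known line; `renumber_from` and `demo.txt` are names invented for the example.

```shell
#!/bin/bash
# renumber_from START FILE: print FILE with field 1 of every line from line
# START onward replaced by 1, 2, 3, ...; earlier lines pass through unchanged.
# Assumes tab-separated input. (Hypothetical helper for illustration.)
renumber_from() {
    awk -v start="$1" '
        BEGIN { FS = OFS = "\t" }
        NR >= start { $1 = ++n }   # overwrite the old number
        { print }
    ' "$2"
}

# Tiny demo: two header lines, then two data rows with gaps in the numbering.
workdir=$(mktemp -d)
printf 'h1\th2\nNA\tx\n5\ta\n9\tb\n' > "$workdir/demo.txt"
renumber_from 3 "$workdir/demo.txt"
```

For the files described here the call would be `renumber_from 15 somefile`, with the output redirected to a new file and then moved over the original.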

LMHmedchem
# 2  
Old 10-11-2014
Do the lines in the remove list exist (unquoted) in a file? (If so, a single grep -Fvf string_file file would seem better for this problem than one grep invocation for each fixed string. But awk is probably better yet since it can do both the line removal and the renumbering.) Do any of these strings ever contain any whitespace characters?

Does the renumbering apply only to the 1st field in the lines to be renumbered? Or, does the 2nd field also need to be modified? (If so, how?)

You said that the number of rows and columns vary from file to file. Does the field to be matched also vary, or is it always the 3rd field? If it isn't always the 3rd field, is it always in a field with the string Name as the header in line 1 in that file? (An awk script will run faster if we know which field to match.)
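For reference, the single-grep idea mentioned above could look roughly like this. It is a sketch only: `remove_listed`, `names.txt` and `sample_out.txt` are made-up names, and the list is assumed to hold one unquoted string per line.

```shell
#!/bin/bash
# remove_listed LIST FILE: print FILE without any line that contains one of
# the fixed strings in LIST. (Hypothetical helper for illustration.)
remove_listed() {
    # -F fixed strings, -v invert the match, -f read the patterns from a file
    grep -F -v -f "$1" -- "$2"
}

workdir=$(mktemp -d)
printf '%s\n' '2,6-lutidine' '4-(methylamino)pyridine' > "$workdir/names.txt"
printf '1\t2\t2-ethylpyridine\tR\n2\t4\t2,6-lutidine\tR\n3\t10\t4-(methylamino)pyridine\tR\n' \
    > "$workdir/sample_out.txt"

remove_listed "$workdir/names.txt" "$workdir/sample_out.txt"
```

Two caveats: the match is a substring match anywhere on the line, not a test of one field, and an empty line in the list file matches every line, so with -v nothing would be printed at all.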
This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 10-11-2014
This is a script that currently works for this task. It is messy and inelegant, but I am posting it since I think that working code is often a better explanation than a description given in prose, even where the code leaves a lot to be desired.

Code:
#!/bin/bash

REMOVE_LIST=(
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide' \
             '1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene' \
             '2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione' \
             '1,4,5-triphenyl-4-imidazoline-2-thione' \
             '1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole' \
             '1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide' \
             '1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one' \
             '1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one' \
             '4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
            )

SET='A'
PARAM_SET='ON-0.25'
FOLD_LIST=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)
AS_LIST=(V_mae V_se S_mae S_se)

# loop on fold list
for FOLD in ${FOLD_LIST[@]}
do
   # loop on as list
   for ANNEALING_SET in ${AS_LIST[@]}
   do

      # assign directory name
      FILE_DIR=$(ls -d './'$SET'/'$FOLD'/'$FOLD'_anneal/'$PARAM_SET'/'$ANNEALING_SET)
      # collect list of files
      FILE_LIST=($(ls $FILE_DIR'/'*'out.txt' ))

      # loop on files
      for FILE in ${FILE_LIST[@]}
      do
         echo $FILE

         # loop on strings to remove
         for REMOVE_STRING in ${REMOVE_LIST[@]}
         do
            echo $REMOVE_STRING
            # remove lines containing the string (-F: fixed text, not a regex)
            grep -F -v "$REMOVE_STRING" "$FILE" > TEMP && mv TEMP "$FILE"
         done

         # re number data rows
         # copy first 14 lines to temp file
         sed 14q $FILE > './'$FILE_DIR'/'headers.txt
         # copy remaining lines to temp file
         sed -n '15,$p' $FILE > './'$FILE_DIR'/'data.txt
         # add new line numbers to data block
         nl './'$FILE_DIR'/'data.txt > './'$FILE_DIR'/'TEMP
         mv './'$FILE_DIR'/'TEMP  './'$FILE_DIR'/'data.txt
         # remove old numbering column (options placed before the file name for portability)
         cut -f1,3- './'$FILE_DIR'/'data.txt > './'$FILE_DIR'/'TEMP
         mv './'$FILE_DIR'/'TEMP './'$FILE_DIR'/'data.txt
         # recombine headers with data
         cat './'$FILE_DIR'/'headers.txt  './'$FILE_DIR'/'data.txt > $FILE
         # cleanup
         rm './'$FILE_DIR'/'headers.txt;  rm './'$FILE_DIR'/'data.txt

      done
   done
done

To answer your questions, the strings do not exist in a file, but that would be easy enough to create if it would be helpful.

Only the first column of the data section needs to be renumbered, the numbers in column 2 can remain as is.

The remove string will always refer to the 3rd field, and this will always be the name column.

The headers are on line 1 and then are duplicated on line 14.

There could be cases in the future where there are more than 14 rows before the data begins. In all cases, you could look for the second instance of the header row (some row that matches row 1) to know that the data starts on the next row.
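Locating the data block that way is straightforward in awk. A small sketch, assuming the repeated header line is byte-for-byte identical to line 1; `data_start` and `demo.txt` are invented names.

```shell
#!/bin/bash
# data_start FILE: print the number of the line that follows the second
# occurrence of the header (line 1). Prints nothing if the header never repeats.
# (Hypothetical helper for illustration.)
data_start() {
    awk 'NR == 1 { h = $0; next }          # remember the header
         $0 == h { print NR + 1; exit }    # first repeat: data starts next line
    ' "$1"
}

workdir=$(mktemp -d)
printf 'H\nNA\nNA\nH\n1\n2\n' > "$workdir/demo.txt"
data_start "$workdir/demo.txt"
```

For the sample file shown earlier, where the header is on lines 1 and 14, this would report line 15.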

LMHmedchem
# 4  
Old 10-12-2014
You could try something like this as a replacement:
Code:
#!/bin/bash
# Initialize variables:
AS_LIST=(V_mae V_se S_mae S_se)
FOLD_LIST=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)
PARAM_SET='ON-0.25'
REMOVE_LIST=(
	'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one'
	'N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide'
	'1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene'
	'2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione'
	'1,4,5-triphenyl-4-imidazoline-2-thione'
	'1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole'
	'1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide'
	'1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one'
	'1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one'
	'4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide'
)
SET='A'

# loop on fold list
for FOLD in ${FOLD_LIST[@]}
do
	# loop on as list
	for AS in ${AS_LIST[@]}
	do

		# assign directory name
		FILE_DIR="./$SET/$FOLD/${FOLD}_anneal/$PARAM_SET/$AS"
		# loop on files
		for FILE in "$FILE_DIR"/*out.txt
		do	echo "$FILE"
			printf '%s\n' "${REMOVE_LIST[@]}" | awk '
				BEGIN {	FS = OFS = "\t"
				}
				FNR == NR {
					# Gather remove list...
					rl[$0]
					next
				}
				FNR == 1 {
					# Get header from 2nd file.
					h = $0
					hc = 2
				}
				# Copy input lines until we have copied the
				# header line twice...
				hc {	if(h == $0) {
						# Decrement the # of times we
						# need to print the header...
						hc--
					}
					print
					next
				}
				# Skip lines with Name (field 3) in remove list.
				$3 in rl {
					next
				}
				{	# Renumber remaining lines.
					$1 = ++oc
				}
				1	# Print renumbered lines.
			' - "$FILE" > "$FILE"_ && mv "$FILE"_ "$FILE"
		done
	done
done

This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 10-12-2014
Alternatively in case it starts on the 15th line, try:
Code:
awk 'NR==FNR{A[$1]; next} FNR==1{c=1; close(f); f=FILENAME ".new"} $3 in A{next} FNR>14{sub($1,c++)} {print>f}' rmlist *out.txt

where rmlist contains:
Code:
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
N-{(1E)-2-[4-(methylethyl)phenyl]-1-azaprop-1-enyl}-2-[(2-methylphenyl)amino]acetamide
1-acetyl-3-(5,6-dimethylisoindolin-2-yl)benzene
2-[(6-hydroxy-4,4-dimethyl-2-oxocyclohex-1(6)-enyl)(4-methylphenyl)methyl]-5,5-dimethylcyclohexane-1,3-dione
1,4,5-triphenyl-4-imidazoline-2-thione
1-(2-naphthylmethyl)-2-(naphthylmethyl)benzimidazole
1-(2-naphthyl)-2-({2-[(2-(2-naphthyl)-2-oxoethyl)piperidyl]ethyl}piperidyl)ethan-1-one_bromide_bromide
1-(2-hydroxyphenyl)-2,6-dimethyl-5-phenylhydropyrimidin-4-one
1-[3-(3,3-dimethyl-2-oxobutylidene)(1,4-diazaperhydroin-2-ylidene)]-3,3-dimethylbutan-2-one
4-(1,3-dioxobenzo[c]azolidin-2-yl)-N-methyl-N-(1,2,2,6,6-pentamethyl(4-piperidyl))butanamide

Afterwards, the new files will have a .new extension.
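Once the .new files have been checked, they can be moved over the originals. A minimal sketch, assuming the .new files sit next to the originals; `promote_new` is a made-up helper name.

```shell
#!/bin/bash
# promote_new DIR: rename every DIR/*out.txt.new to the same name without
# the .new suffix, replacing the original file.
# (Hypothetical helper for illustration.)
promote_new() {
    local f
    for f in "$1"/*out.txt.new; do
        [ -e "$f" ] || continue       # the glob matched nothing
        mv -- "$f" "${f%.new}"        # strip the trailing .new
    done
}

workdir=$(mktemp -d)
printf 'old\n' > "$workdir/a_out.txt"
printf 'new\n' > "$workdir/a_out.txt.new"
promote_new "$workdir"
cat "$workdir/a_out.txt"
```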

Last edited by Scrutinizer; 10-12-2014 at 03:24 PM..
This User Gave Thanks to Scrutinizer For This Post:
# 6  
Old 10-12-2014
In case you are interested, here is a procedural Perl solution that can be expanded to do more processing if needed.
Code:
#!/usr/bin/perl

use strict;
use warnings;

# the first file will contain patterns to match; stop the process if no file is given
my $pattern_file = shift or die "A file with searching patterns must be given: $!\n";
open my $match_lines, '<', $pattern_file or die "Could not open $pattern_file $!\n";

# index every line as a pattern to search
my %search = map { chomp; $_, 1 } <$match_lines>;
close $match_lines;

# other files to work on must be given. Process them one by one
while (@ARGV) {

	my $n; # counter to re-number lines 15 and above
	my $filename_in = shift; # get next filename to work on 
	my $filename_out = "new_" . $filename_in; # fabricate next output filename
	
	# create corresponding input and output file handles
	open my $cur_file_in, '<', "$filename_in" or die "Could not open $filename_in: $!\n";
	open my $cur_file_out, '>', "$filename_out" or die "Could not create $filename_out: $!\n";

	# feedback to stdout
	print "processing file $filename_in to $filename_out... ";

	# processing lines from current input file
	while (my $line = <$cur_file_in>) {
		
		# tokenizing the read line into tab-separated fields
		chomp $line;
		my @fields = split /\t/, $line, -1;

		# explicitly, saying that is OK if the variable is not defined
		no warnings 'uninitialized';
		
		# keep any lines where the pattern is not found in third field
		if (not $search{$fields[2]}) {
			
			# kept lines past line 14 get a new sequence number (empty lines are left alone)
			if ($. > 14 and @fields) {
				$fields[0] = ++$n;
			}
			# results are finally written to disk, tab-separated like the input
			print $cur_file_out join("\t", @fields), "\n";
		}
		}
	}
	$n = 0; # clear the renumbering counter for next file
	# ready to recycle
	close $cur_file_in;
	close $cur_file_out;
	
	# feedback to stdout
	print "[done]\n";
}

Usage
Code:
perl prog.pl rlist foo.*


Last edited by Aia; 10-12-2014 at 04:53 PM.. Reason: grammar
This User Gave Thanks to Aia For This Post:
# 7  
Old 10-13-2014
Well the script I posted works, but it takes more than 38 minutes to run on the directory tree I tested. The test directory has 2000 files spread over 40 directories with 2446 lines per file and a remove list of 28 strings.

The script posted by Don Cragun finished in,
real 1m50.953s
user 3m0.410s
sys 0m49.394s

I like the fact that this script looks for the second instance of the header row to know where to start the renumbering. There are versions of my data where that would be useful.

For the Perl script posted by Aia, these files are in a directory structure of 40 different directories. I don't know Perl well enough to set up the looping to troll through all of that and test the script. It seems as if I would have to run my script and then call your script and pass $FILE_LIST along with the file with the patterns to remove. Is that right?

From the post by Scrutinizer, I also do not quite see how to loop through my directory structure and create lists of files to pass to the code.
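One possible way to feed a whole directory tree to the one-liner is to let find collect the files and xargs pass them on. This is only a sketch: `process_tree`, the demo tree and its file names are invented for the example, and it assumes the remove list is in a file with one name per line.

```shell
#!/bin/bash
# process_tree RMLIST TOP: run the awk one-liner from post #5 over every
# *out.txt file below TOP. (Hypothetical helper for illustration.)
process_tree() {
    local rmlist=$1 top=$2
    # find emits NUL-terminated paths; xargs appends them after "$rmlist",
    # so the remove list is the first file awk reads in every batch.
    find "$top" -type f -name '*out.txt' -print0 |
        xargs -0 awk 'NR==FNR{A[$1]; next} FNR==1{c=1; close(f); f=FILENAME ".new"} $3 in A{next} FNR>14{sub($1,c++)} {print>f}' "$rmlist"
}

# Demo tree: two directories, each with 14 header lines and 3 data rows.
workdir=$(mktemp -d)
mkdir -p "$workdir/A/f0" "$workdir/A/f1"
printf '%s\n' '2,6-lutidine' > "$workdir/rmlist"
for d in f0 f1; do
    {
        for i in {1..14}; do printf 'NA\tNA\tNA\thdr%s\n' "$i"; done
        printf '1\t2\t2-ethylpyridine\tR\n2\t4\t2,6-lutidine\tR\n3\t5\t2,5-lutidine\tR\n'
    } > "$workdir/A/$d/run_out.txt"
done

process_tree "$workdir/rmlist" "$workdir/A"
tail -n 2 "$workdir/A/f0/run_out.txt.new"
```

The Perl script could be driven the same way, with one caveat: it builds the output name by putting new_ in front of the whole input path, so that line would need adjusting before paths containing directories are passed to it.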

Once again, it is amazing how much better a well-formed script will perform.

LMHmedchem