Find & replace text using two non-unique delimiters


 
# 1  
Old 02-28-2018

I can find and replace text when the delimiters are unique. What I cannot do is replace text using two NON-unique delimiters.

For example:

Code:
"This html code <text blah >contains <garbage blah blah >. All tags must go,<text > but some must be replaced with <garbage blah blah > without erasing other info."

Code:
delimiter1: '<garbage'
delimiter2: '>'
replace with: 'important info'

delimiter3: '<'
delimiter4: '>'
replace with: ''


I get this:

Code:
This html code contains important info


And I want this:

Code:
This html code contains important info. All tags must go, but some must be replaced with important info without erasing other info.


The issue is that the program keeps seeing the '>' which is tied in with the '<text >' tag and using it instead of the '>' which is tied in with '<garbage'.

In my real-world scenario, these tags are much more complicated: they will have a variety of text in between, be different sizes, and have different endings. Also, certain tags must be deleted first, second, and so on, so changing the order will not help this situation.

I want to write code that understands that the '>' delimiter, which I want to use as the end position for the '<garbage' tag, can only be the one which comes closest AFTER the '<garbage' tag (and if it understands that, then it cannot make a mistake), but I do not know how to do this. I have it working perfectly in an awk program, but not in C++. And I will not use boost; in that case I'd rather just stick with awk.

Here is my code:

Code:
// Compile and run with:
//
// g++ -O -Wall replace.cpp -o replace
//

#include<iostream>
#include<string>
#include<fstream>

using namespace std;

string replaceText (string text, string tStart, string tStop, string tReplace)
{

	long int begPos;
	long int endPos;
	int found=1;

	while ((text.find(tStart) != std::string::npos) && (found == 1)) {

		found = 0;

		begPos = text.find(tStart);
		endPos = text.find(tStop);

			if (tStop != "")
                        {
				text.replace(begPos, endPos - begPos + tStop.length(), tReplace);
				found = 1;
			}else{
				text.replace(begPos, tStart.length(), tReplace );
				found = 1;
			}

		// Used for testing to see positions of replaced text:
		std::cout << "Replacing from: " << tStart << " ...to... " << tStop << " at Start Pos: " << begPos << " Stop Pos: " << endPos << " with " << tReplace << " \n" << endl;

	}

	return text;
}

int main(int argc, char* argv[])
{
	
	// keyFound must be declared before use
	string keyFound="This html code <text blah >contains <garbage blah blah >. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info.";

	// Run this code twice: once with the below line of code commented, and once without:
	keyFound=replaceText(keyFound, "<garbage", ">", "important info");
	keyFound=replaceText(keyFound, "<", ">", "");

	std::cout << keyFound << endl;

	return 0;
}


I am not expecting an entire answer, but maybe if someone could lead me to a resource which has a fitting answer. I've been looking all around, and I cannot seem to find anything. Also, I am new to C++.

I understand that this is an incredibly complicated thing with no simple answer.

Thank you.
# 2  
Old 02-28-2018
When trying to match the end of a tag with its start, you need to look for the entire tag in a single search. Since there are several tags on the line, the way you are searching for the end tag may well find a > that comes before <garbage within the text string that you are searching.

To match the string starting with <garbage and ending with the closest matching > after that, try matching using the single BRE or ERE <garbage[^>]*>.
This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 02-28-2018
Quote:
Originally Posted by Don Cragun
When trying to match the end of a tag with its start, you need to look for the entire tag in a single search. Since there are several tags on the line, the way you are searching for the end tag may well find a > that comes before <garbage within the text string that you are searching.

To match the string starting with <garbage and ending with the closest matching > after that, try matching using the single BRE or ERE <garbage[^>]*>.
Woot! So easy to do having been told this! Thank you!

Code:
#include <string>
#include <iostream>
#include <regex>
using namespace std;

int main(int argc, char * argv[]) {

	string test;

	test="This html code <text blah >contains <garbage blah blah >. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info.";

	regex reg("<garbage[^>]*>");
	test = regex_replace(test, reg, "important info");

	cout << test << endl;

	return 0;
}

Result:
Code:
This html code <text blah >contains important info. All tags must go, <text > but some must be replaced with important info without easing other info.

Something tells me this regex function is going to be a lifesaver!

Okay, now to do what I do best...

... sleep.
# 4  
Old 02-28-2018
We assume that you know that exactly the same thing works in awk:
Code:
echo "This html code <text blah >contains <garbage blah blah >. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info." |
    awk '{sub("<garbage[^>]*>", "important info")}1'

to make a substitution for the first occurrence producing the output:
Code:
This html code <text blah >contains important info. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info.

or:
Code:
echo "This html code <text blah >contains <garbage blah blah >. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info." |
    awk '{gsub("<garbage[^>]*>", "important info")}1'

to make a substitution for all occurrences producing the output:
Code:
This html code <text blah >contains important info. All tags must go, <text > but some must be replaced with important info without easing other info.

This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 03-01-2018
Good that you found another improvement to your code. Applying what you learned in some of your other threads (gsub (tagIn "[^" tagOut "]*" tagOut, "")), you'd get what you request in post #1 by setting tagIn first to <garbage, then to just <.
This User Gave Thanks to RudiC For This Post:
# 6  
Old 03-02-2018
Quote:
Originally Posted by Don Cragun
We assume that you know that exactly the same thing works in awk:
Yes, I found that out. How satisfying it is to just drag and drop the regex parameters into the C++ code and have them work!

Quote:
Good you had another improvement of your code. Applying what you learned in some of your other threads (gsub (tagIn "[^" tagOut "]*" tagOut, ""), post7, post2), you'd get what you request in post#1, setting tagin first to <garbage, then to just <.
The code has been updated, but please do not feel obligated to respond, though I do very much appreciate and welcome the advice of all of you! There is no urgent desire to fix anything; I'm just 'putting it out there.'

If anyone would like to peruse and comment, they are welcome to:
Code:
// This program parses an XML dictionary file and prints a formatted result.
//
// NOTE: The required XML dictionary (16mb) will be downloaded to this
//       machine if it is not found! It will be stored in: ~/.config/latin/
//
// The goals of this project:
//
//	1. < 100 lines code
//	2. Simple & elegant coding
//	3. Fast & efficient execution.
//
//		"Do one thing,
//		 and do it well."
//
//		—Linux Credo
//
// Compile with:
// $ g++ -O -Wall lat.cpp -o lat
//
// Run with:
// $ lat amo sum totus
//
// Where 'amo', 'sum', and 'totus' are the words to be searched
//
// Gather online possibilities and pipe output into 'less'
// ('latc' script required for this functionality!!!):
//
// $ lat $(latc quam totus amor)
//
// Where 'quam', 'totus', and 'amor' are your search terms
//
// For testing. Completely clear terminal to not confuse with other text.
// $ reset; g++ -O -Wall lat.cpp -o lat; sleep 2; lat amo sum totus | less
//

#include <iostream>
#include <string>
#include <regex>
#include <fstream>
#include <cstdlib>	// system()
#include <unistd.h>
#include <sys/types.h>
#include <pwd.h>

using namespace std;

int main(int argc, char* argv[])
{
	// No search term entered. Bye!
	if (!argv[1]) return 1;

	std::string line;					// Used for file input
	std::string charToStr(argv[1]);				// Cannot use char with strings
	std::string keyStart	("key=\"" + charToStr + "\"");	// Key tags which word in XML file is surrounded
	std::string keyEnd	("</entry>");
	std::string text;
        struct passwd *pw = getpwuid(getuid());                 // Set up to get ~/
	std::string homeDir = pw->pw_dir;
	std::string XMLfile	(homeDir + "/.config/latin/Perseus_text_1999.04.0060.xml");
	std::string XMLfileDlURL="http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus:text:1999.04.0060";

	//ifstream myFileTest (XMLfile);
	ifstream myFile(XMLfile);

	// Download dictionary if not found
	if (myFile.fail())
	{

		std::cout << "\nNote: The XML dictionary file " << XMLfile << " has not been found.\n\nDownloading and preparing XML file...\n\n";

                string dlCmd=("mkdir -p " + homeDir  + "/.config/latin/ && cd " + homeDir + "/.config/latin/ && wget -O- " +  XMLfileDlURL  +  " | tr -d '\\r' > " +  XMLfile);

		// system() won't accept a string
                const char * sysCharCmd = dlCmd.c_str();

		system(sysCharCmd);

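		// Reopen the stream to see if the file was created and can be found
		// (clear() alone only resets the error flags, so fail() would always be false)
		myFile.clear();
		myFile.open(XMLfile);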

		if (myFile.fail())
		{
			std::cout << "Could not download or find file!\n\nExiting...\n\n";
			return 2;
		}else{
			std::cout << "Finished downloading!\n\nRestart program to use new dictionary.\n\n";
			return 0;
		}
	}


	// Go through all given keys from command line parameters
	for(int keyNum = 1; keyNum < argc; keyNum++ )
	{
		charToStr=argv[keyNum];				// Convert char* argument to std::string
		keyStart="key=\"" + charToStr + "\"";
		text="";					// Do not append text

		myFile.clear();					// Go to beginning of file
		myFile.seekg(0, ios::beg);

		// Find search key and save result in 'text' string
		while (getline (myFile,line) && text == "")
			if (line.find(keyStart) != std::string::npos)	// We found a key!
				do					// Grab keys text
					text += line;
				while (getline (myFile,line) && line.find(keyEnd) == std::string::npos);

		// Don't waste time—go to next iteration!
		if (text == "")
		{
			std::cout << "Search key '" << charToStr << "' not found.\n" << endl;
			continue;
		}

		/* User may want to define an entire paragraph of words
		   at one time, so do string modification right after
		   each key to allow first results to be shown instantly. */

		// Replace regex pattern in slot #1 with the text in slot #2.
		// Note: replacement strings are not regexes, so "(" and ")" need no backslash there.
		std::string tReplace[] = {"<orth>", "[", "</orth>", ",", "</gen>", ".", "<sense id.*><etym lang=\"la\" opt=\"n\">", "[", "<etym lang=\"la\" opt=\"n\">", "[", "</etym>, <trans opt=\"n\">|</etym>\\.—", "]\n\n • ", "(</etym>\\. —</sense>|</etym>\\.)", "]", "</etym>\\. </sense>", "", "(\\.|</usg>) ?— ?</sense>", ".", "<sense[^>]*>", "\n\n", "<[^>]*>", "", " — ", "\n\n • ", "\\. ?+—", ".\n\n • ", " +", " ", ". ?—", "\n\n", " ,", ",", " \\.", ".", " :", ":", "‘ ", "‘", " ’", "’", "^ ", "", "\\( ", "(", " \\)", ")" };

		// Now manipulate that text string and make it pretty.
		signed int repSize = (sizeof(tReplace) / sizeof(tReplace[0]));
		for (signed int i = 0; i < repSize; i += 2)
		{
			regex reg(tReplace[i]);
			text = regex_replace(text, reg, tReplace[i + 1]);
		}

		// Give lots of space to easily distinguish between definitions
		std::cout << text << "\n\n\n";

	}

	myFile.close();

	return 0;

}


Last edited by bedtime; 03-02-2018 at 07:13 PM.. Reason: WOOT! All bugs fixed.