Merge two strings by overlapped region


 
Thread Tools Search this Thread
Top Forums Programming Merge two strings by overlapped region
# 22  
Old 04-03-2014
Stack and heap are all just memory inside that 4-gigabyte array of bytes.

I repeat:
Quote:
Originally Posted by Corona688
I repeat: You've written your program around assumptions which don't hold water -- like the very idea of "str3" being a separate entity from "str1" and "str2" just because you threw pointers to the same memory into a different function.
Fixing it to the point it will work isn't a matter of finding the bit which is "stopping it from working". The fundamental design of your program is flawed, and will continue to be flawed, even if we correct the obvious things.

I will rewrite some of it instead. Give me some time.
This User Gave Thanks to Corona688 For This Post:
# 23  
Old 04-03-2014
Since VM can be mapped anywhere, they started by putting the environment, code, constants, initialized vaiables and unitialized variables in the bottom of memory, the heap, and the stack is started at top top of memory. growing toward each other. I think signal flags and open files are really in the kernel, and the fd is just an offset into a kernel per-process array of pointers to open files, so many processes can have the same open file on one or more fd numbers. The loader puts the arguments' pointers in an array with a terminating null pointer and starts up main(). If you wrote in C++, static objects' constructors would run before main(). Suppose you call strcmp(x,y). The code will push the adresses of x and y into the stack, and maybe allow a return value space, and call the strcpy code. Often values pushed are promoted to 4 byte integer even if short or char. If strcpy has automatic variables, they are located below the stack pointer setting after the call arguments. For calls within calls, each set of arguments and auto variables is called a stack frame. So, even though nobody else knows the address of a subroutine's static variables, the compiler/linker knows. The subroutine could pass a pointer to it, or load a global pointer with its address, to make it visible to others. Some even return it, which is not usual since it might not be MT-Safe! For instance, if you return a char* of a static char[50] with a null terminated string in it, unless it is const, someone might rewrite it, and if it is a product of the call input arguments, the next call will change the value, usually to something the first recipient of the pointer did not ask for. Some of the time libraries are like this, and have newer, safe variations.
This User Gave Thanks to DGPickett For This Post:
# 24  
Old 04-03-2014
Code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void strmerg(char * const out, const char *str1, const char *str2) {
        // Position in 'str1'.  Counts backwards from end.
        // Each loop, str1+a will be 1 longer, i.e.  "t", "at", "tat", "atat", "Gatat"...
        int a=strlen(str1)-1;
        // How many bytes of overlap has been found.  0 if none.
        int found=0;
        // How many characters of overlap to check for.  Counts up from 1.
        int b=1;

        while(a >= 0) // Loop until a is negative
        {
                // Compare the last 'b' bytes of str1, to the first 'b' bytes of str2.
                // Using strncmp instead of strcmp prevents it from checking ALL of str2,
                // because strncmp takes maximum length as an argument.
                // It will return 0 if they are equal.
                if(strncmp(str1+a, str2, b) == 0) found=b;
                a--;
                b++;
        }

        strcpy(out, str1);
        // Strip off the first 'found' characters of str2.
        strcat(out, str2+found);
}

int main(int argc, char *argv[])
{
        char out[4096];
        if (argc != 3) {
                printf("Error! \nUsage: ./%s string1 string2\n", argv[0]);
                exit(EXIT_FAILURE);
        }

        printf("string 1 is %s\n", argv[1]);
        printf("string 2 is %s\n", argv[2]);

        strmerg(out, argv[1], argv[2]);
        printf("Output is %s\n", out);

        return(0);
}

There's probably ways to make strmerg faster.
This User Gave Thanks to Corona688 For This Post:
# 25  
Old 04-04-2014
This is very succinct code, and worked like charm. Thank you very much!
Questions: 1) As no malloc() was used to allocate memory for out, str1 and str2 in your strmerg() function. Is it because they are all declared const char* ?
2) In main() char out[4096] was used to hold the merged string, in practice, string 1 and string 2 can be as large as mega-bases. I thought dynamic allocation of the memory for out using a pointer would be better. It seems there is no such thing in C to dynamically allocate space for the merged string based on the two inputs (???). Can I ask if if there is anything inappropriate with my modification on main() function?
Code:
int main(int argc, char *argv[])
{
        char *out;
        if (argc != 3) {
                printf("Error! \nUsage: ./%s string1 string2\n", argv[0]);
                exit(EXIT_FAILURE);
        }

        out = malloc(sizeof(char)*(strlen(argv[1]) + strlen(argv[2]))+1);

        printf("string 1 is %s\n", argv[1]);
        printf("string 2 is %s\n", argv[2]);

        strmerg(out, argv[1], argv[2]);
        printf("Output is %s\n", out);

        free(out);
        return(0);
}

Thanks a lot!
# 26  
Old 04-04-2014
Quote:
Originally Posted by yifangt
Questions: 1) As no malloc() was used to allocate memory for out, str1 and str2 in your strmerg() function. Is it because they are all declared const char* ?
Look closer, they're not all const char *.

It's nothing to do with malloc.

The two which are 'const char *' are set that way because I don't need to write to their contents.

The one which isn't 'const char *' is because I need to write to its contents.

I could have made them all plain 'char *' and it would have still worked. The 'const' is a reminder to me, the programmer, of which arguments I should be writing to and which I shouldn't. It's also a safety mechanism, so if I try to cram unwritable things into it, the compiler will complain.

Quote:
2) In main() char out[4096] was used to hold the merged string, in practice, string 1 and string 2 can be as large as mega-bases. I thought dynamic allocation of the memory for out using a pointer would be better. It seems there is no such thing in C to dynamically allocate space for the merged string based on the two inputs (???).
...and having said so, you go ahead and do the "impossible" in the same breath Smilie

I avoided malloc because pointers still confuse you. I just thought it was more straightforward to do it that way.

Quote:
Can I ask if if there is anything inappropriate with my modification on main() function?[/COLOR]
Actually that looks fine. It allocates more memory than it needs to so isn't ideal, but will do what you want it to do, and won't crash.

Here's what I would do:

Code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

// Returns newly-allocated memory which you must 'free' later.
char *strmerg(const char *str1, const char *str2) {
        // Note:  size_t and ssize_t are technically more correct for numbers holding string length,
        // particularly for LARGE strings.  I had to use ssize_t for 'a' since I need a negative number
        // for my while loop.

        // Position in 'str1'.  Counts backwards from end.
        // Each loop, str1+a will be 1 longer, i.e.  "t", "at", "tat", "atat", "Gatat"...
        ssize_t a=strlen(str1)-1;
        // How many bytes of overlap has been found.  0 if none.
        size_t found=0;
        // How many characters of overlap to check for.  Counts up from 1.
        size_t b=1;

        while(a >= 0) // Loop until a is negative
        {
                // Compare the last 'b' bytes of str1, to the first 'b' bytes of str2.
                // Using strncmp instead of strcmp prevents it from checking ALL of str2,
                // because strncmp takes maximum length as an argument.
                // It will return 0 if they are equal.
                if(str1[a] == str2[0]) // Optimization
                if(strncmp(str1+a, str2, b) == 0) found=b;
                a--;
                b++;
        }

        b=strlen(str1); // don't count 5 megabases more times than necessary

        // Code block only because some compilers require you to declare
        // all variables at the top of a code block
        {
                char * const out=malloc(sizeof(char)*(b+strlen(str2+found)+1));
                strcpy(out, str1);
                strcpy(out+b, str2+found); // Faster than strcat
                return(out);
        }
}

int main(int argc, char *argv[])
{
        if (argc != 3) {
                printf("Error! \nUsage: ./%s string1 string2\n", argv[0]);
                exit(EXIT_FAILURE);
        }
        else
        {
                char * const out=strmerg(argv[1], argv[2]);
                printf("string 1 is %s\n", argv[1]);
                printf("string 2 is %s\n", argv[2]);
                printf("Output is %s\n", out);
                free(out);
        }
        return(0);
}


Last edited by Corona688; 04-04-2014 at 02:13 PM..
This User Gave Thanks to Corona688 For This Post:
# 27  
Old 04-05-2014
Here is a strmerg() similar to Corona688's. But, where his version checks for a match for every trailing substring of string 1; this version starts with the longest possible match based on the length of the shorter string and stops as soon as it finds a match. It also verifies that malloc() succeeded before copying data into the buffer allocated by malloc(). With multi-megabyte input strings of similar length and short matches, the speed difference is likely to be unnoticeable. But if there are frequent relatively long matching strings or if string 1 is longer than string 2, this version might be significantly faster.

This won't work on some older systems, because it uses <inttypes.h> to produce relatively portable printf() conversion specifications for objects of type size_t. And, Corona688's code makes much better use of the const qualifier, than this code does.

Code:
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *
strmerg(const char *str1, const char *str2) {
	size_t	str1_l = strlen(str1),	// length of string1
		str2_l = strlen(str2),	// length of string2
		// match_l is longest possible match (length of shortest input)
		match_l = str1_l < str2_l ? str1_l : str2_l;
		// base is ptr to start of longest possible match in string1
	char	*base = (char *)str1 + str1_l - match_l,
		c1,		// 1st char in string2 possible match
		c2 = *str2,	// 1st char in string2
		*out;
	// At this point, match_l is the longest possible match (based on the
	// lengths of string1 and string2) and base points to the spot in
	// string1 where the longest possible match could start.

	// While there is a mismatch...
	while((c1 = *base) && ((c1 != c2) || strncmp(base, str2, match_l))) {
		// Decrement the longest possible match, and increment the
		// spot in string1 where the longest possible match could
		// still start.  Note that even though we aren't checking
		// the value of match_l in this loop, we are guaranteed that
		// the loop will end with match_l greater than or equal to
		// zero.  (If it is zero, the only match is the terminating
		// null byte in string1.)
		match_l--;
		base++;
	}
	// When we get to this point, match_l is the length of the longest
	// substring of the tail of string1 that matches the head of string2.

	// Allocate space to hold the resulting string:	length of string1 +
	// length of string2 - length of matched substring + 1 for the
	// terminating null byte.  NOTE:  The caller is responsible for
	// freeing the allocated space when it is no longer needed.
	out = malloc(str1_l + str2_l - match_l + 1);
	if(out) {
		// Copy string1 and unmatched portion of string2 to out.
		strcpy(out, str1);
		strcpy(out + str1_l, str2 + match_l);
	}
	return(out);
}

int
main(int argc, char *argv[]) {
	char *result;	// Space for the pointer that strmerg() will return

	if (argc != 3) {
		fprintf(stderr, "Error! Bad arg count\nUsage: %s %s\n",
			argv[0], "string1 string2");
		exit(EXIT_FAILURE);
	}

	printf("string1: \"%s\" (%"PRIuMAX" bytes)\n",
		argv[1], (intmax_t)strlen(argv[1]));
	printf("string2: \"%s\" (%"PRIuMAX" bytes)\n",
		argv[2], (intmax_t)strlen(argv[2]));

	result = strmerg(argv[1], argv[2]);
	if(result == NULL) {
		perror("strmerg() failed:");
		exit(EXIT_FAILURE);
	}
	printf("strmerg(string1, string2) output: \"%s\" (%"PRIuMAX" bytes)\n",
		result, (intmax_t)strlen(result));
	free(result);
	result = strmerg(argv[2], argv[1]);
	if(result == NULL) {
		perror("strmerg() failed:");
		exit(EXIT_FAILURE);
	}
	printf("strmerg(string2, string1) output: \"%s\" (%"PRIuMAX" bytes)\n",
		result, (intmax_t)strlen(result));
	free(result);
	return(0);
}

When compiled and linked into a.out, the command:
Code:
./a.out ACGTGCCC CCCCCGTGTGTGT

produces:
Code:
string1: "ACGTGCCC" (8 bytes)
string2: "CCCCCGTGTGTGT" (13 bytes)
strmerg(string1, string2) output: "ACGTGCCCCCGTGTGTGT" (18 bytes)
strmerg(string2, string1) output: "CCCCCGTGTGTGTACGTGCCC" (21 bytes)

and the command:
Code:
./a.out aaaaaaaaaa1 1aaaaaaaaaaaa

produces the output:
Code:
string1: "aaaaaaaaaa1" (11 bytes)
string2: "1aaaaaaaaaaaa" (13 bytes)
strmerg(string1, string2) output: "aaaaaaaaaa1aaaaaaaaaaaa" (23 bytes)
strmerg(string2, string1) output: "1aaaaaaaaaaaa1" (14 bytes)

These 2 Users Gave Thanks to Don Cragun For This Post:
# 28  
Old 04-07-2014
Corona & Don!
These two should be added to the <string.h> as strmerg() function like the others, e.g. strcat(), strcpy() ... etc.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Merge strings with ignore case

I have a bi-lingual database of a large number of dictionaries. It so happens that in some a given string is in upper case and in others it is in lower case. An example will illustrate the issue. toll Tax=पथ-कर Toll tax=राहदारी कर toll tax=टोल I want to treat all three instances of toll tax... (3 Replies)
Discussion started by: gimley
3 Replies

2. Shell Programming and Scripting

Merge strings from a file into a template

I am preparing a morphological grammar of Marathi to be placed in open-source. I have two files. The first file called Adverbs contains a whole list of words, one word per line A sample is given below: आधी इतक इतपत उलट एवढ ऐवजी कड कडनं कडल कडील कडून कडे करता करिता खाल (2 Replies)
Discussion started by: gimley
2 Replies

3. Programming

Perl script to merge cells in column1 which has same strings, for all sheets in a excel workbook

Perl script to merge cells ---------- Post updated at 12:59 AM ---------- Previous update was at 12:54 AM ---------- I am using below code to read files from a dir and print to excel. open(my $in, '<', $file) or die "Could not open file: $!"; my $rowCount = 0; my $colCount = 0;... (11 Replies)
Discussion started by: Jack_Bruce
11 Replies

4. Shell Programming and Scripting

Merge left hand strings mapping to different right hand strings

Hello, I am working on an Urdu to Hindi dictionary which has the following structure: a=b a=c n=d n=q and so on. i.e. Headword separated from gloss by a = I am giving below a live sample بتا=बता بتا=बित्ता بتا=बुत्ता بتان=बतान بتان=बितान بتانا=बिताना I need the following... (3 Replies)
Discussion started by: gimley
3 Replies

5. AIX

Change lv REGION in HDISK1

Dears my rootvg is missed up i can not extend the /opt as soon as i try to extend the Filesystem its give me that there is not enough space . as there any way to change the REGION of the LVs in HDISK1 ? lspv -p hdisk0 hdisk0: PP RANGE STATE REGION LV NAME TYPE ... (8 Replies)
Discussion started by: thecobra151
8 Replies

6. UNIX for Dummies Questions & Answers

overlapped genomic coordinates

Hi, I would like to know how can I get the ID of a feature if its genomic coordinates overlap the coordinates of another file. Example: Get the 4th column (ID) of this file1: chr1 10 100 gene1 chr2 3000 5000 gene2 chr3 200 1500 gene3 if it overlaps with a feature in this file2: chr2... (1 Reply)
Discussion started by: fadista
1 Replies

7. Shell Programming and Scripting

Region between lines

How can I find the regions between specific lines? I have a file which contains lines like this: chr1 0 17388 0 chr1 17388 17444 1 chr1 17444 17599 2 chr1 17599 17601 1 chr1 17601 569791 0 chr1 569791 569795 1 chr1 569795 569808 2 chr1 569808 569890 3 chr1 569890 570047 4 ... (9 Replies)
Discussion started by: linseyr
9 Replies

8. UNIX for Advanced & Expert Users

Best practice - determining what region you are on

Hello all, I have a question about what you think the best practice is to determine what region you are running on when you have a system setup with a DEV/TEST, QA, and PROD regions running the same scripts in all. So, when you run in DEV, you have a different directory structure, and you... (4 Replies)
Discussion started by: Rediranch
4 Replies

9. UNIX for Dummies Questions & Answers

Merge two strings not from files

str1="this oracle data base record" str2="one two three four five" Output: this one oracle two data three base four record five str1 and str2 have the same column but they are not fixed columns. I can do it with "paste" but I do not want to create file everytime the script runs from... (2 Replies)
Discussion started by: buddyme
2 Replies

10. UNIX for Advanced & Expert Users

stack region

how can i determine that what percentage of stack region is currently is used? (i am using tru64 unix) (2 Replies)
Discussion started by: yakari
2 Replies
Login or Register to Ask a Question