Merge two strings by overlapped region


 
# 1  
Old 04-01-2014

Hello, I am trying to concatenate two strings by merging the overlapped region. E.g.
Code:
Seq1=ACGTGCCC
Seq2=CCCCCGTGTGTGT
Seq_merged=ACGTGCCCCCGTGTGTGT

The function strcat(char *dest, const char *src) appends the src string to the dest string, but I need a version that skips the overlapped part (where a prefix of src equals a suffix of dest). After googling for a while, this seems to be related to the longest common substring problem, which is too big a question for me.
I have tried the following code, but it always gives the wrong result: Seq_merged=ACGTGCCCCCCGTGTGTGT, which has an extra "C". What did I miss?
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLEN 4096

//strmerg was from: http://effprog.wordpress.com/2010/11/18/concatenation-of-two-strings-omitting-overlapping-string/
char *strmerg(char *dst, const char *src)
{
    size_t dstLen = strlen(dst);
    size_t srcLen = strlen(src);

    char *p = dst + dstLen + srcLen;            /* Pointer to the end of the concatenated string */
    const char *q = src + srcLen - 1;            /* Pointer to the last character of the src */
    char *r = dst + dstLen - 1;                    /* Temp Pointer to the last character of the dst */
    char *end = r;                                /* Permanent Pointer to the last character of the dst */
    *p = '\0';                                    /*terminating the concatened string with NULL character */

    while (q >= src) {                            /* Copy src in reverse */
        if (*r == *q) {                           /* Till it matches with the src, decrement r */
            r--;
        } else {
            r = end;
            if (*r == *q) {
                r--;
            }
        }

        *p-- = *q--;
    }

    while (r >= dst)                              /* Copy dst, ending with r */
        *p-- = *r--;

    return p + 1;
}

int main(int argc, char **argv)
{
    char *str1, *str2;        //Original two strings
    char *str3;                //resulting string

    str1 = malloc(sizeof(char) * MAXLEN);    //allocate memory
    str2 = malloc(sizeof(char) * MAXLEN);    //allocate memory

    str3 = malloc(sizeof(char) * MAXLEN * 2);    //allocate memory, maximum space needed is the sum of the two original string lengths

    if (argc != 3) {
        printf("Error! \nUsage: ./arg[0]=program argv[1]=string1 argv[2]=string2\n");
        exit(EXIT_FAILURE);
    }

    strcpy(str1, argv[1]);
    strcpy(str2, argv[2]);

    printf("Input strings are: \nSeq1=%s\nSeq2=%s\n", str1, str2);

    str3=strmerg(str1, str2);
    printf("\nConcatenated string is: Seq_merged=%s\n", str3);
/*Some problem with these free(), do not know why?
free(str1);
free(str2);
free(str3);
*/
    return 0;
}

I tried more cases; the problem seems to occur when the overlapping region is repetitive.
Code:
./prog ACGTGCCC CCCCCGTGTGTGT 
Seq1=ACGTGCCC
Seq2=CCCCCGTGTGTGT 
Seq_merged=ACGTGCCCCCCGTGTGTGT 
./prog ACGTGatcg atcgCCGTGTGTGT
Seq1= ACGTGatcg
Seq2= atcgCCGTGTGTGT
Seq_merged=ACGTGatcgCCGTGTGTGT
./prog ACGTGatatat atatCCGTGTGTGT
Seq1=ACGTGatatat
Seq2=atatCCGTGTGTGT
Seq_merged=ACGTGatatatatatCCGTGTGTGT

Can anyone have a look at it for me? Thanks a lot!

Last edited by yifangt; 04-01-2014 at 04:29 PM..
# 2  
Old 04-01-2014
How about trimming the end of string1 for as long as the trimmed string1 concatenated with string2 still contains the original string1?
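
One way to read this suggestion is: find the longest suffix of string1 that is also a prefix of string2, then append only the rest of string2. The sketch below is a minimal illustration under that reading, not the code from the original post. The names overlap_len and merge_overlap are made up for this example. It tries every candidate overlap length, longest first, so a mismatch at one length does not end the search.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the longest suffix of a
   that is also a prefix of b. */
static size_t overlap_len(const char *a, const char *b)
{
    size_t alen, blen, k;

    assert(a != NULL && b != NULL);
    alen = strlen(a);
    blen = strlen(b);
    k = alen < blen ? alen : blen;   /* the overlap cannot exceed either string */

    /* Longest candidate first; every length down to 1 is checked. */
    for (; k > 0; k--) {
        if (memcmp(a + alen - k, b, k) == 0)
            return k;
    }
    return 0;
}

/* Hypothetical helper: returns a newly malloc'd merged string,
   or NULL if allocation fails. The caller must free() the result. */
static char *merge_overlap(const char *a, const char *b)
{
    size_t alen = strlen(a);
    size_t blen = strlen(b);
    size_t k = overlap_len(a, b);
    char *out = malloc(alen + blen - k + 1);

    if (out == NULL)
        return NULL;
    memcpy(out, a, alen);                     /* all of a */
    memcpy(out + alen, b + k, blen - k + 1);  /* rest of b, including '\0' */
    return out;
}
```

With the inputs from post #1, merge_overlap("ACGTGCCC", "CCCCCGTGTGTGT") finds an overlap of 3 ("CCC"). The comparison is quadratic in the worst case, which should be fine for strings of a few thousand characters.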
This User Gave Thanks to DGPickett For This Post:
# 3  
Old 04-01-2014
In case it matters to you, be aware that your initial strcpy() calls from argv are unsafe: they overflow the buffer if a command-line argument is MAXLEN characters or longer.
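
If the copies are kept, one possible guard is to check the length before copying. This is a minimal sketch; copy_checked is a hypothetical name, not a standard library function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst (capacity dstsz bytes) only if
   it fits, including the terminating '\0'.
   Returns 0 on success, -1 if src is too long. */
static int copy_checked(char *dst, size_t dstsz, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL);
    n = strlen(src);
    if (n >= dstsz)
        return -1;             /* would overflow: refuse rather than truncate */
    memcpy(dst, src, n + 1);   /* n + 1 includes the '\0' */
    return 0;
}
```

In the program from post #1 this would replace strcpy(str1, argv[1]) with a call such as copy_checked(str1, MAXLEN, argv[1]), exiting with an error message when it returns -1.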

Regards,
Alister
This User Gave Thanks to alister For This Post:
# 4  
Old 04-01-2014
Use the argv strings where they lie (they are already in memory for the life of the program): just assign each char* to an identifying variable, and do not make a copy. If you must copy, malloc strlen+1 bytes, or go to C++ (RWCString) or Java. You just need a char* each for str1 and str2, plus dynamically sized char[] buffers of strlen(str1)+strlen(str2)+1 bytes for the last good result and the current trial. The test is memcmp(str1, trial, strlen(str1)). When you have trimmed str1 to nothing, or the trial mismatches, the last good result is the answer.

You do not need to free() before you exit(); exit() does it all: fflush(), fclose(), close() (socket disconnect, TCP DB session rollback) and, in effect, free(). All memory for a process is released on exit(). Memory leaks are a problem for daemons, which almost never exit(), and for internal processing loops.

You cannot copy a char[] with = here: str3=strmerg(str1, str2); You destroyed the value of str3 placed there by malloc. The assignment copies only a pointer, and the old value was the only clue free() could use, so it is a memory leak since you did not save that value for free(). Subroutines that return a char[] can either use a static buffer (but its value is destroyed at the next call, and it is not MT-Safe), or malloc a new buffer to return, whose free() falls on the caller. More usually, the caller should pass the buffer in as an additional argument, together with its size if the size is not otherwise explicit, as with the improvement of sprintf() to snprintf() and localtime() to localtime_r(): https://www.unix.com/man-page/opensolaris/0/snprintf/ https://www.unix.com/man-page/opensolaris/0/localtime_r/
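
As an illustration of the caller-supplied buffer style described above, here is a minimal sketch in the spirit of snprintf(). The name merge_into and its return convention are assumptions made for this example, and out is assumed not to overlap a or b.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical caller-supplied-buffer merge. Writes a followed by the
   non-overlapping tail of b into out (capacity outsz bytes).
   Returns 0 on success, -1 if the result does not fit.
   out must not overlap a or b, because memcpy() is used. */
static int merge_into(char *out, size_t outsz, const char *a, const char *b)
{
    size_t alen, blen, k;

    assert(out != NULL && a != NULL && b != NULL);
    alen = strlen(a);
    blen = strlen(b);

    /* Longest suffix of a that is also a prefix of b. */
    k = alen < blen ? alen : blen;
    while (k > 0 && memcmp(a + alen - k, b, k) != 0)
        k--;

    if (alen + (blen - k) + 1 > outsz)
        return -1;                            /* result would not fit */

    memcpy(out, a, alen);
    memcpy(out + alen, b + k, blen - k + 1);  /* includes the '\0' */
    return 0;
}
```

Because the caller owns out, there is nothing for the function to leak: a buffer obtained from malloc() keeps its original pointer and can be passed to free() unchanged.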

Last edited by DGPickett; 04-01-2014 at 04:17 PM..
This User Gave Thanks to DGPickett For This Post:
# 5  
Old 04-01-2014
Thanks for your replies!
DGPickett, could you be more specific about these two points and comment on my code?
1. just assign the char* to a identifying variable, and do not make a copy. What is the correct way?
2. You cannot copy a char[] with = here: str3=strmerg(str1, str2); You destroyed the value of str3 placed there by malloc.
Do you mean str3 = malloc(sizeof(char) * MAXLEN * 2); this line is not needed?
Thank you!
# 6  
Old 04-01-2014
1) Simplest thing in the world:

Code:
const char *str1 = argv[1];
const char *str2 = argv[2];

The limitation of this, of course, is that it's just a pointer pointing to the same memory as argv[1]. That doesn't matter, since argv[1] isn't going to change and str1 won't be edited.

They are 'const' so that you can't edit them by accident, it'd be a compiler error (or a lot of intentional typecasting) to try. Any string parameters your functions take which don't get edited should be 'const' too, to signify this. For example:

Code:
STRCPY(3)                  Linux Programmer's Manual                 STRCPY(3)



NAME
       strcpy, strncpy - copy a string

SYNOPSIS
       #include <string.h>

       char *strcpy(char *dest, const char *src);

Because of this, 'src' for strcpy will accept both constant and non-constant strings, but if you try to pass a constant string as 'dest', the compiler will complain (a warning by default in C, an error in C++). This is better than a crash later.
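
The same convention applies to your own functions. A minimal sketch, where count_char is a hypothetical example function that only reads its string argument:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: counts how often c occurs in s.
   s is only read, so it is declared const char *. */
static size_t count_char(const char *s, char c)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == c)
            n++;
    }
    /* Writing through s here, e.g. s[0] = 'x';, would be rejected by the
       compiler because s points to const char. */
    return n;
}
```

Such a function can be called with argv[1], a string literal, or a malloc'd buffer alike.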
This User Gave Thanks to Corona688 For This Post:
# 7  
Old 04-02-2014
Thanks Corona688!

Got the idea to use const char * for my case. However, after I changed the two lines,
Code:
char *str1 = argv[1]; 
const char *str2 = argv[2];

my code compiled without errors or warnings, but did not give any result.
Code:
Input strings are: 
Seq1=
Seq2=
Concatenated string is: Seq_merged=

I understand that the direct assignments str1 = argv[1] and str2 = argv[2] just pass two pointers. There must be something subtle here that I have missed.