Help with sort and keep data record to calculate N50 in c


 
# 1  
Old 07-18-2011

Input_file_1
Code:
#content_1
A
#content_2
AF
#content_3
AAR
#content_4
ASEI
#content_5
AS
#content_6
ADFSFGS

Rules:
1. Use a C program to calculate the length of the content under each "#" header. For Input_file_1 above, the lengths are 1, 2, 3, 4, 2, 7;
2. Sort the lengths in descending order: 7, 4, 3, 2, 2, 1;
3. The program should keep the sorted record (7, 4, 3, 2, 2, 1) in memory temporarily for downstream analysis;
4. Sum all the lengths in Input_file_1: 7 + 4 + 3 + 2 + 2 + 1 = 19;
5. Take 50% of the total sum of Input_file_1 as a threshold value: 19 / 2 = 9.5;
6. N50 is the length at which the running sum of the sorted lengths first becomes equal to or greater than 50% of the total sum in Input_file_1 (9.5);
7. 7 + 4 = 11 (greater than 9.5), so N50 is 4;
Desired output after running the C program:
Code:
4

Many thanks for any advice.
# 2  
Old 07-18-2011
I don't understand step 7. Why is the output 4? I googled N50, and it seemed to mean how many of the largest integers need to be added together to reach 50% of the total, so 7+4=11 requires 2 integers and the output would be 2? I guess you want the smallest member of that group instead.

I think this is the proper solution, finally. Using your example file:

Code:
[mute@geek ~/test]$ ./n50 n50.txt
4

Code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MAXLINES        32768

int lines[MAXLINES];
int line_count = 0;

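/* qsort comparator: descending order, without subtraction so it cannot overflow */
int qsort_cmp(const void *p1, const void *p2)
{
        int a = *(const int *)p1, b = *(const int *)p2;

        return (b > a) - (b < a);
}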

int main(int argc, char**argv)
{
        FILE *fh;
        char buf[512];
        int i, sum = 0, threshold;

        if (argc != 2)
        {
                printf("Usage: %s <input file>\n", argv[0]);
                return -1;
        }

        if ((fh = fopen(argv[1], "r")) == NULL)
        {
                perror("fopen");
                return -1;
        }

        while (fgets(buf, sizeof(buf), fh))
        {
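                int len = strlen(buf);

                /* skip comments */
                if (buf[0] == '#') continue;

                /* strip newline at end, if present (the last line may lack one) */
                if (len > 0 && buf[len - 1] == '\n')
                        buf[--len] = 0;

                /* keep record, guarding against overflowing the array */
                if (line_count >= MAXLINES)
                {
                        fprintf(stderr, "too many lines (max %d)\n", MAXLINES);
                        return -1;
                }
                lines[line_count++] = len;

                /* add to sum. */
                sum += len;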
        }

        qsort(lines, line_count, sizeof(int), qsort_cmp);

        /* half of the total, rounded up so an odd sum is not cut short */
        threshold = (sum + 1) / 2;

        for (i = sum = 0; (sum += lines[i]) < threshold; i++)
                ;

        printf("%d\n", lines[i]);

        return 0;
}


Last edited by neutronscott; 07-18-2011 at 02:28 AM.. Reason: had it totally wrong at first
# 3  
Old 07-18-2011
Hi, friend.
This is one of the threads that explains the N50 calculation well: Calculating an N50 from Velvet output | (R news & tutorials)
The N50 of my example should be 4 instead of 2.
I'm trying your approach now with a test file.
Hopefully we end up with the same approach.
# 4  
Old 07-18-2011
As per your PM, the program should handle content that spans several lines. Also, I added DEBUG statements so you can see what the program is doing...

Code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define DEBUG
#define MAXLINES        32768


int lines[MAXLINES];
int line_count = 0;

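/* qsort comparator: descending order, without subtraction so it cannot overflow */
int qsort_cmp(const void *p1, const void *p2)
{
        int a = *(const int *)p1, b = *(const int *)p2;

        return (b > a) - (b < a);
}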

int main(int argc, char**argv)
{
        FILE *fh;
        char buf[512];
#ifdef  DEBUG
        char contentblock[512];
#endif
        int i, sum = 0, threshold;

        if (argc != 2)
        {
                printf("Usage: %s <input file>\n", argv[0]);
                return -1;
        }

        if ((fh = fopen(argv[1], "r")) == NULL)
        {
                perror("fopen");
                return -1;
        }

#ifdef  DEBUG
        /* if a block is given before first '#' ... */
        strcpy(contentblock, "null");
#endif
        lines[0] = 0;
        while (fgets(buf, sizeof(buf), fh))
        {
                int len = strlen(buf);
                /* strip newline at end, if present (the last line may lack one) */
                if (len > 0 && buf[len - 1] == '\n')
                        buf[--len] = 0;
                /* new content block */
                if (buf[0] == '#') {
#ifdef  DEBUG
                        printf("%s length %d\n", contentblock, lines[line_count]);
                        strcpy(contentblock, buf + 1);
#endif
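                        /* guard against overflowing the array */
                        if (line_count + 1 >= MAXLINES)
                        {
                                fprintf(stderr, "too many content blocks (max %d)\n", MAXLINES);
                                return -1;
                        }
                        lines[++line_count] = 0;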
                        continue;
                }
                /* keep record */
                lines[line_count] += len;
                /* add to sum. */
                sum += len;
        }
        /* half of the total, rounded up so an odd sum is not cut short */
        threshold = (sum + 1) / 2;
#ifdef  DEBUG
        printf("%s length %d\n", contentblock, lines[line_count]);
        printf("sum = %d, threshold = %d\n", sum, threshold);
        printf("Before sort: ");
        for (i = 0; i <= line_count; i++)
                printf("%d%s", lines[i], (i == line_count) ? "\n" : " + ");
#endif

        qsort(lines, line_count + 1, sizeof(int), qsort_cmp);
#ifdef  DEBUG
        printf("After sort: ");
        for (i = 0; i <= line_count; i++)
                printf("%d%s", lines[i], (i == line_count) ? "\n" : " + ");
#endif
        for (i = sum = 0; (sum += lines[i]) < threshold; i++)
        {
#ifdef  DEBUG
                printf("%d%s", lines[i], (sum >= threshold) ? "\n" : " + ");
#endif
        }

        printf("%d\n", lines[i]);

        return 0;
}

# 5  
Old 07-19-2011
Many thanks, neutronscott.
Your program works very fast on huge data.
It is amazing.
Do you have any idea how to edit the program so that it prints only the N50 number instead of the whole analysis detail?
I tried to edit it, but it didn't work.
Thanks for your help.
# 6  
Old 07-19-2011
#undef DEBUG (put it right after the #define DEBUG line, or simply delete that #define, then recompile)