Extract certain columns from big data


 
# 8  
Old 06-14-2013
Happypoker,
check it out:

Code:
# cat file
000 A B C D 2.22 3.4 1.03
001 B F S D 3.2 2.1 3.2 2.3
132 F S F X 2.3 3.4 5.3 2.1


-- Put into array & print:
Code:
$ awk '{for (i=1;i<=NF;i++) A[NR,i]=$i;for(j=1;j<i;j++)printf "%s ",A[NR,j];printf "\n"}' file
000 A B C D 2.22 3.4 1.03
001 B F S D 3.2 2.1 3.2 2.3
132 F S F X 2.3 3.4 5.3 2.1




- Now you can control how many columns are printed with the limit of the first loop,
- here i<=5 is taken.

Code:
$ awk '{for (i=1;i<=5;i++) A[NR,i]=$i;for(j=1;j<i;j++)printf "%s ",A[NR,j];printf "\n"}' file
000 A B C D
001 B F S D
132 F S F X


cheers,

---------- Post updated at 06:27 PM ---------- Previous update was at 06:04 PM ----------

Hi Happypoker,

Adding to that, if you want to print a range of columns, say from
Code:
start=100 to end=6700

you can do it as below:


Code:
$ s=100;e=6700
$ awk -v s1=$s -v e1=$e '{for (i=1;i<=e1;i++) A[NR,i]=$i;for(j=s1;j<i;j++)printf "%s ",A[NR,j];printf "\n"}' file
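
For example, on the three-line sample above, s=2 and e=4 would print fields 2 through 4 (a hypothetical run, for illustration):
Code:
$ s=2;e=4
$ awk -v s1=$s -v e1=$e '{for (i=1;i<=e1;i++) A[NR,i]=$i;for(j=s1;j<i;j++)printf "%s ",A[NR,j];printf "\n"}' file
A B C
B F S
F S F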

Enjoy, have fun!
# 9  
Old 06-14-2013
Why store in an array?
Print directly:
Code:
awk '{for (i=start;i<=end;i++) printf "%s ",$i; printf "\n"}' start=100 end=6700 file

With cut
Code:
cut -f 100-6700 file
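
Note that cut -f splits on tabs by default; if the data is space-separated, as the samples earlier in the thread are, the delimiter must be given explicitly:
Code:
cut -d ' ' -f 100-6700 file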


# 10  
Old 06-15-2013
The standards only define the behavior of awk when its input files are text files. A file with 6,600 fields isn't likely to be a text file on any UNIX or Linux system I've seen. The maximum length of a line in a text file is LINE_MAX bytes, including the terminating newline character. (You can get the value of LINE_MAX on your system using the command:
Code:
getconf LINE_MAX

The standards allow LINE_MAX to be as low as 2,048 bytes.) Some implementations of awk may accept longer lines and behave as you would like them to. Others will print a diagnostic if an input or output line exceeds LINE_MAX. Others will silently truncate long lines (in this case probably producing truncated output lines). And others may read LINE_MAX bytes, treat that as a line, and then read the next LINE_MAX bytes as the next input line (guaranteeing garbage output for your application).

Note that even if you have an awk that handles long lines as you want it to, creating arrays of 2,000 fields from 30,000,000 input records and then printing the results at the end would require awk to have access to a minimum of 600,000,000,000 bytes to store that data, even if each field is only one byte long (one byte of data + a terminating null byte + an 8-byte pointer to the string for each field). With data of this magnitude, you will have to process it on the fly, not accumulate it and process it when you reach the end of your input file.

The standards do require that conforming implementations of the cut utility be able to handle arbitrary line lengths (assuming they have access to the memory they need to hold the lines being processed). The standards also require that fold (with the -b option) and paste be able to break apart and recreate files that would be text files except for unlimited line lengths, and that cat and wc work on files of any size and type. All other standard text processing utilities (e.g., awk, ed/ex, grep, sed, vi, etc.) have unspecified or undefined behavior if their input files are not text files with lines no longer than LINE_MAX.
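
As a quick sanity check before feeding such a file to a line-oriented tool, you can compare its longest line against LINE_MAX using only utilities that the standards require to handle long lines. This is a sketch; it is approximate in that it ignores the terminating newline byte, and file is a placeholder name:
Code:
# fold must handle arbitrary line lengths, so if folding the file at
# LINE_MAX bytes yields more lines than the file already has, at least
# one line is longer than LINE_MAX
[ "$(fold -b -w "$(getconf LINE_MAX)" file | wc -l)" -gt "$(wc -l < file)" ] &&
    echo "file has lines longer than LINE_MAX"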
# 11  
Old 06-15-2013
For really big data, here is a C program:
Code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int c;                     /* int, not char, so EOF is detected reliably */
  FILE *fin;
  char IFS = '\t';           /* field delimiter */
  int cnt = 0;               /* current field, counted from 0 */
  int start = atoi(argv[1]);
  int end = atoi(argv[2]);
  fin = fopen(argv[3], "rb");

  while ((c = getc(fin)) != EOF)
  {
     if (c == IFS) ++cnt;
     else if (c == '\n') { cnt=0; putc(c, stdout); }
     if (cnt >= start && cnt <= end) putc(c, stdout);
  }
  fclose(fin);
  return 0;
}

Code:
gcc -o thisprogram thisprogram.c
./thisprogram 100 6700 file


# 12  
Old 06-15-2013
Quote:
Originally Posted by MadeInGermany
Code:
awk '{for (i=start;i<=end;i++) printf "%s ",$i; printf "\n"}' start=100 end=6700 file

With cut
Code:
cut -f 100-6700 file

Quote:
Originally Posted by MadeInGermany
Code:
gcc -o thisprogram thisprogram.c
./thisprogram 100 6700 file

Compared to the way that awk and cut index fields, the C code is off by one; it is 0-indexed instead of 1-indexed. So the 100-6700 specified is actually 101-6701 in awk/cut.

If the first field (start=0) is part of the range, every newline will be duplicated in the output.

If the first field is not part of the range (start > 0), there will always be a leading IFS character.
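
A quick illustration on a tiny tab-separated sample (hypothetical run):
Code:
$ printf 'f1\tf2\tf3\tf4\n' > sample
$ ./thisprogram 1 2 sample
	f2	f3

Instead of fields 1 and 2, a leading tab and fields 2 and 3 come out.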

Regards,
Alister
# 13  
Old 06-16-2013
The duplicate newline printing was a bug.
With field numbers starting at 1, and no leading IFS character:
Code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int c;                     /* int, not char, so EOF is detected reliably */
  FILE *fin;
  char IFS = '\t';           /* field delimiter */
  int start = atoi(argv[1]); /* first field to print, counted from 1 */
  int end = atoi(argv[2]);   /* last field to print */
  int field = 1;             /* current field number */
  fin = fopen(argv[3], "rb");

  while ((c = getc(fin)) != EOF)
  {
    if (c == '\n') { field=1; putc(c, stdout); }
    else {
      /* swallow the delimiter that precedes the first selected field,
         so no leading IFS character is printed */
      if (c == IFS && ++field == start) continue;
      if (field >= start && field <= end) putc(c, stdout);
    }
  }
  fclose(fin);
  return 0;
}
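
To compile and give it a quick test (hypothetical sample; note that the fields must be tab-separated, since IFS is hard-coded to '\t'):
Code:
$ gcc -o thisprogram thisprogram.c
$ printf 'f1\tf2\tf3\tf4\n' > sample
$ ./thisprogram 2 3 sample
f2	f3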

# 14  
Old 06-17-2013
MadeInGermany,
I tried the C program after compiling it, but somehow it doesn't work for me; I tried it on Ubuntu.


Code:
# ./thisprogram 100 200 col8k



ubuntu:#

It prints no data, just four blank lines.