Extract certain columns from big data


 
# 8  
Old 06-14-2013
Happypoker,
check it out:

Code:
# cat file
000 A B C D 2.22 3.4 1.03
001 B F S D 3.2 2.1 3.2 2.3
132 F S F X 2.3 3.4 5.3 2.1


-- Put into array & print:
Code:
$ awk '{for (i=1;i<=NF;i++) A[NR,i]=$i;for(j=1;j<i;j++)printf "%s ",A[NR,j];printf "\n"}' file
000 A B C D 2.22 3.4 1.03
001 B F S D 3.2 2.1 3.2 2.3
132 F S F X 2.3 3.4 5.3 2.1




- Now you can control how many columns are printed with the limit of the first loop,
- here i<=5 is taken.

Code:
$ awk '{for (i=1;i<=5;i++) A[NR,i]=$i;for(j=1;j<i;j++)printf "%s ",A[NR,j];printf "\n"}' file
000 A B C D
001 B F S D
132 F S F X


cheers,

---------- Post updated at 06:27 PM ---------- Previous update was at 06:04 PM ----------

Hi Happypoker,

Adding to that, if you want to print a range of columns, say from
Code:
start=100 to end=6700

you can do it as below:


Code:
$ s=100;e=6700
$ awk -v s1=$s -v e1=$e '{for (i=1;i<=e1;i++) A[NR,i]=$i;for(j=s1;j<i;j++)printf "%s ",A[NR,j];printf "\n"}' file
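
For example, on the three-line sample above, s=2 and e=4 would print fields 2 through 4 (a hypothetical run, for illustration):
Code:
$ s=2;e=4
$ awk -v s1=$s -v e1=$e '{for (i=1;i<=e1;i++) A[NR,i]=$i;for(j=s1;j<i;j++)printf "%s ",A[NR,j];printf "\n"}' file
A B C
B F S
F S F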

Enjoy, have fun!
# 9  
Old 06-14-2013
Why store in an array?
Print directly:
Code:
awk '{for (i=start;i<=end;i++) printf "%s ",$i; printf "\n"}' start=100 end=6700 file

With cut
Code:
cut -f 100-6700 file
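
Note that cut -f splits on tabs by default; if the data is space-separated, as the samples earlier in the thread are, the delimiter must be given explicitly:
Code:
cut -d ' ' -f 100-6700 file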


# 10  
Old 06-15-2013
The standards only define the behavior of awk when its input files are text files. A file with 6,600 fields isn't likely to be a text file on any UNIX or Linux system I've seen. The maximum length of a line in a text file is LINE_MAX bytes, including the terminating newline character. (You can get the value of LINE_MAX on your system using the command:
Code:
getconf LINE_MAX

The standards allow LINE_MAX to be as low as 2,048 bytes.) Some implementations of awk may accept longer lines and behave as you would like them to. Others will print a diagnostic if an input or output line exceeds LINE_MAX. Others will silently truncate long lines (in this case probably producing truncated output lines). And others may read LINE_MAX bytes, treat that as a line, and then read the next LINE_MAX bytes as the next input line (guaranteeing garbage output for your application).

Note that even if you have an awk that handles long lines as you want it to, creating arrays of 2,000 fields from 30,000,000 input records and then printing the results at the end would require awk to have access to a minimum of 600,000,000,000 bytes to store that data, even if each field is only one byte long (one byte of data + a terminating null byte + an 8-byte pointer to the string for each field). With data of this magnitude, you will have to process it on the fly, not accumulate it and process it when you reach the end of your input file.

The standards do require that conforming implementations of the cut utility be able to handle arbitrary line lengths (assuming they have access to the memory they need to hold the lines being processed). The standards also require that fold (with the -b option) and paste be able to break apart and recreate files that would be text files except for unlimited line lengths, and that cat and wc work on files of any size and type. All other standard text processing utilities (e.g., awk, ed/ex, grep, sed, vi, etc.) have unspecified or undefined behavior if their input files are not text files with lines no longer than LINE_MAX.
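
As a quick sanity check before feeding such a file to a line-oriented tool, you can compare its longest line against LINE_MAX using only utilities that the standards require to handle long lines. This is a sketch; it is approximate in that it ignores the terminating newline byte, and file is a placeholder name:
Code:
# fold must handle arbitrary line lengths, so if folding the file at
# LINE_MAX bytes yields more lines than the file already has, at least
# one line is longer than LINE_MAX
[ "$(fold -b -w "$(getconf LINE_MAX)" file | wc -l)" -gt "$(wc -l < file)" ] &&
    echo "file has lines longer than LINE_MAX"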
# 11  
Old 06-15-2013
For really big data, here is a C program:
Code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int c;                     /* int, not char, so EOF is detected reliably */
  FILE *fin;
  char IFS = '\t';           /* field delimiter */
  int cnt = 0;               /* current field, counted from 0 */
  int start = atoi(argv[1]);
  int end = atoi(argv[2]);
  fin = fopen(argv[3], "rb");

  while ((c = getc(fin)) != EOF)
  {
     if (c == IFS) ++cnt;
     else if (c == '\n') { cnt=0; putc(c, stdout); }
     if (cnt >= start && cnt <= end) putc(c, stdout);
  }
  fclose(fin);
  return 0;
}

Code:
gcc -o thisprogram thisprogram.c
./thisprogram 100 6700 file


# 12  
Old 06-15-2013
Quote:
Originally Posted by MadeInGermany
Code:
awk '{for (i=start;i<=end;i++) printf "%s ",$i; printf "\n"}' start=100 end=6700 file

With cut
Code:
cut -f 100-6700 file

Quote:
Originally Posted by MadeInGermany
Code:
gcc -o thisprogram thisprogram.c
./thisprogram 100 6700 file

Compared to the way that awk and cut index fields, the C code is off by one; it is 0-indexed instead of 1-indexed. So the 100-6700 specified is actually 101-6701 in awk/cut.

If the first field (start=0) is part of the range, every newline will be duplicated in the output.

If the first field is not part of the range (start > 0), there will always be a leading IFS character.
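
A quick illustration on a tiny tab-separated sample (hypothetical run):
Code:
$ printf 'f1\tf2\tf3\tf4\n' > sample
$ ./thisprogram 1 2 sample
	f2	f3

Instead of fields 1 and 2, a leading tab and fields 2 and 3 come out.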

Regards,
Alister
# 13  
Old 06-16-2013
The duplicate newline printing was a bug.
With field numbers starting at 1, and no leading IFS character:
Code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int c;                     /* int, not char, so EOF is detected reliably */
  FILE *fin;
  char IFS = '\t';           /* field delimiter */
  int start = atoi(argv[1]); /* first field to print, counted from 1 */
  int end = atoi(argv[2]);   /* last field to print */
  int field = 1;             /* current field number */
  fin = fopen(argv[3], "rb");

  while ((c = getc(fin)) != EOF)
  {
    if (c == '\n') { field=1; putc(c, stdout); }
    else {
      /* swallow the delimiter that precedes the first selected field,
         so no leading IFS character is printed */
      if (c == IFS && ++field == start) continue;
      if (field >= start && field <= end) putc(c, stdout);
    }
  }
  fclose(fin);
  return 0;
}
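
To compile and give it a quick test (hypothetical sample; note that the fields must be tab-separated, since IFS is hard-coded to '\t'):
Code:
$ gcc -o thisprogram thisprogram.c
$ printf 'f1\tf2\tf3\tf4\n' > sample
$ ./thisprogram 2 3 sample
f2	f3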

# 14  
Old 06-17-2013
MadeInGermany,
I tried the C program after compiling it, but somehow it doesn't work for me; I tried it on Ubuntu.


Code:
# ./thisprogram 100 200 col8k



ubuntu:#

It prints no data, just four blank lines.