06-14-2013
Extract certain columns from big data
The dataset I'm working on is about 450 GB, with about 7,000 columns and 30,000,000 rows.
I want to extract about 2,000 columns from the original file to form a new file.
I have a list of the column numbers I need, but I don't know how to extract them.
Thanks!
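One possible approach, sketched under assumptions the post does not confirm: the file is tab-delimited, and the wanted column numbers are stored one per line in a separate file. The names extract_columns, big.tsv, cols.txt, and subset.tsv are placeholders. Because cut streams its input line by line, the 450 GB file never has to fit in memory.

```shell
#!/bin/sh
# Sketch: extract the columns listed in a file, one column number per line.
# Assumptions (not from the original post): the data is tab-delimited, and
# the names big.tsv, cols.txt, and subset.tsv are placeholders.

extract_columns() {
    # $1 = input file, $2 = file of column numbers, $3 = output file
    # paste -sd, joins the numbers into a comma-separated list such as "2,5,9"
    list=$(paste -sd, "$2")
    # cut reads the input as a stream, so memory use stays small;
    # LC_ALL=C avoids multibyte character handling
    LC_ALL=C cut -f "$list" "$1" > "$3"
}

# Example call:
# extract_columns big.tsv cols.txt subset.tsv
```

Note that cut always writes the selected columns in the order they appear in the input, whatever the order of the list. If the columns have to be reordered, awk or a similar tool would be needed instead.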
h5fromtxt
H5FROMTXT(1) h5utils H5FROMTXT(1)
NAME
h5fromtxt - convert text input to an HDF5 file
SYNOPSIS
h5fromtxt [OPTION]... [HDF5FILE]
DESCRIPTION
h5fromtxt takes a series of numbers from standard input and outputs a multi-dimensional numeric dataset in an HDF5 file.
HDF5 is a free, portable binary format and supporting library developed by the National Center for Supercomputing Applications at the
University of Illinois in Urbana-Champaign. A single h5 file can contain multiple data sets; by default, h5fromtxt creates a dataset called
"data", but this can be changed via the -d option, or by using the syntax HDF5FILE:DATASET. The -a option can be used to append new
datasets to an existing HDF5 file.
All characters besides the numbers (and associated decimal points, etcetera) in the input are ignored. By default, the data is assumed to
be a two-dimensional MxN dataset where M is the number of rows (delimited by newlines) and N is the number of columns. In this case, it is
an error for the number of columns to vary between rows. If M or N is 1 then the data is written as a one-dimensional dataset.
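The row and column rule above can be checked on a text file before it is converted. The following is only a sketch: check_shape is a hypothetical helper, not part of h5utils, and it assumes whitespace-delimited numeric input and skips blank lines, which may differ from what h5fromtxt itself does.

```shell
#!/bin/sh
# Sketch: report the MxN shape of whitespace-delimited numeric text and
# fail if the column count varies between rows.
# check_shape is a hypothetical helper, not part of h5utils.

check_shape() {
    # Reads stdin; prints "MxN" on success, exits non-zero on ragged input.
    awk '
        NF == 0 { next }                 # assumption: blank lines are skipped
        ncols == 0 { ncols = NF }        # the first data row fixes N
        NF != ncols {
            printf "row %d has %d columns, expected %d\n", NR, NF, ncols > "/dev/stderr"
            bad = 1
            exit 1
        }
        { nrows++ }
        END { if (!bad) printf "%dx%d\n", nrows, ncols }
    '
}

# Example call:
# check_shape < input.txt
```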
Alternatively, you can specify the dimensions of the data explicitly via the -n size option, where size is e.g. "2x2x2". In this case,
newlines are ignored and the data is taken as an array of the given size stored in row-major ("C") order (where the last index varies most
quickly as you step through the data). For example, a 2x2x2 array would have its elements listed in the order: (0,0,0), (0,0,1), (0,1,0),
(0,1,1), (1,0,0), (1,0,1), (1,1,0), (1,1,1).
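The ordering described above can be reproduced with three nested loops. This is a small illustration only; row_major_2x2x2 is a made-up function name.

```shell
#!/bin/sh
# Sketch: list the indices of a 2x2x2 array in row-major ("C") order.
# The last index (k) is the innermost loop, so it varies most quickly.

row_major_2x2x2() {
    for i in 0 1; do
        for j in 0 1; do
            for k in 0 1; do
                echo "($i,$j,$k)"
            done
        done
    done
}

row_major_2x2x2
```

The function prints one index triple per line, in the same order as the list in the paragraph above.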
A simple example is:
h5fromtxt foo.h5 <<EOF
1 2 3 4
5 6 7 8
EOF
which reads in a 2x4 space-delimited array from standard input.
OPTIONS
-h Display help on the command-line options and usage.
-V Print the version number and copyright info for h5fromtxt.
-v Verbose output.
-a If the HDF5 output file already exists, append the data as a new dataset rather than overwriting the file (the default behavior).
An existing dataset of the same name within the file is overwritten, however.
-n size
Instead of trying to infer the dimensions of the array from the rows and columns of the input, treat the data as a sequence of
numbers in row-major order forming an array of dimensions size. size is of the form MxNxLx... (with M, N, L being numbers) and may be
of any dimensionality.
-T Transpose the input when it is written, reversing the dimensions.
-d name
Write to dataset name in the output; otherwise, the output dataset is called "data" by default. Alternatively, use the syntax
HDF5FILE:DATASET.
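To show what "reversing the dimensions" in the -T option means for a two-dimensional input, here is a sketch that transposes whitespace-delimited text with awk. It is not how h5fromtxt implements -T, and transpose_2d is a hypothetical helper name. It holds the whole input in memory, so it suits small inputs only.

```shell
#!/bin/sh
# Sketch: transpose whitespace-delimited 2-D text, turning MxN into NxM.
# This only illustrates what "reversing the dimensions" means for a
# two-dimensional input; it is not how h5fromtxt implements -T.
# transpose_2d is a hypothetical helper name.

transpose_2d() {
    awk '
        {
            # store every field under its (row, column) position
            for (c = 1; c <= NF; c++) cell[NR, c] = $c
            if (NF > ncols) ncols = NF
        }
        END {
            # each input column becomes one output row
            for (c = 1; c <= ncols; c++) {
                line = cell[1, c]
                for (r = 2; r <= NR; r++) line = line " " cell[r, c]
                print line
            }
        }
    '
}

# Example call:
# transpose_2d < input.txt
```

Given the 2x4 input from the simple example above, this prints a 4x2 array.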
BUGS
Send bug reports to S. G. Johnson, stevenj@alum.mit.edu.
AUTHORS
Written by Steven G. Johnson. Copyright (c) 2005 by the Massachusetts Institute of Technology.
h5utils March 9, 2002 H5FROMTXT(1)