Large file - columns into rows etc


 
# 15  
Old 06-13-2010
Actually, that didn't work either... still getting the same error. Though, considering my lack of knowledge in the area, I'm sure I'm doing something wrong.
# 16  
Old 06-23-2010
Just curious to know whether anyone else has a solution to this. I think I've tried everything, but nothing seems to work, and as I don't know much about writing scripts, an explanation in layman's terms would be greatly appreciated.

Another problem I'm having is that I want to use this data file in a Linux-based program, but it reports that it can only find 10 columns, even though a count of the columns in the file gives the correct number.

Any ideas? I don't know what else to try...
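
A possible first check for the column-count problem described above, as a minimal sketch. The helper names `field_counts` and `cr_lines` are hypothetical, and the idea that inconsistent field counts or Windows line endings are behind the "only 10 columns" message is an assumption; the thread never confirms the cause.

```shell
# field_counts FILE: print how many lines have each number of
# whitespace-separated fields. A well-formed genotype file should show
# one count for the header line and one for all the data lines.
field_counts() {
  awk '{ n[NF]++ } END { for (k in n) print k " fields: " n[k] " lines" }' "$1"
}

# cr_lines FILE: count lines that contain a carriage return, which
# usually means the file was saved with Windows (CR LF) line endings.
cr_lines() {
  awk '/\r/ { c++ } END { print c + 0 }' "$1"
}
```

Running `field_counts genotypes.txt` (file name assumed) on the real file and seeing more than two distinct field counts would point at ragged rows; a non-zero result from `cr_lines` would point at line endings.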

---------- Post updated at 06:33 PM ---------- Previous update was at 03:19 PM ----------

I've had a few PMs asking for a better description of the data and what exactly I need, so here is an example with 6 columns x 6 rows:

This is a genetic data file....
Code:
       ind1   ind2  ind3  ind4  ind5  ind6
rs1    AA    AG    GG    GA   AA    GG
rs2    CT    TT    TT    --    CC    TC
rs3    AG    AA    --    GG    GA    GA
rs4    TT    CT    --    TT    TC    --
rs5    GG    --    GA    AA    GG    AG
rs6    CG    CG    CC    GG    --    GC

I would like the output to be like this:
Code:
ind1 A A C T A G T T G G C G
ind2 A G T T A A C T 0 0 C G
ind3 G G T T 0 0 0 0 G A C C
ind4 G A 0 0 G G T T A A G G
ind5 A A C C G A T C G G 0 0
ind6 G G T C G A 0 0 A G G C

Hope that helps a bit. I thought that transposing the original file and then doing some further data manipulation with shell/awk would get me to the end product above, but obviously that's not working.
# 17  
Old 06-23-2010
Hi, try this:
Code:
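nawk 'NR==1{for(i=1;i<=NF;i++) A[i]=$i; next}                      # header line: individual IDs
      {for(i=2;i<=NF;i++) B[i-1]=B[i-1]$i; n=NF-1}                 # data line: append each genotype to its individual
      END{for(i=1;i<=n;i++) {gsub(/-/,"0",B[i]); gsub(/./," &",B[i]); print A[i] B[i]}}' infile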

# 18  
Old 06-25-2010
So, I set this running, but it's been 24 hours and it's still running. Is that normal for such a large file?

Edit: to check whether this was working, I ran it on a smaller file, but my output file is 0 KB, which suggests nothing has worked. What could I be doing wrong?
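
One way to narrow down an empty output file, sketched under assumptions. An awk program that prints only in its END block writes nothing until all input has been read, so a 0 KB file while the big job is still running is expected. For the small file, a likely suspect (not confirmed anywhere in the thread) is that the command failed and the error message was missed. The helper names `pick_awk` and `run_to_file` are hypothetical.

```shell
# pick_awk: print the name of the first awk found on PATH. nawk is common
# on Solaris but is often absent on Linux, where awk or gawk is usual.
pick_awk() {
  for a in nawk awk gawk mawk; do
    if command -v "$a" >/dev/null 2>&1; then
      printf '%s\n' "$a"
      return 0
    fi
  done
  echo "no awk found on PATH" >&2
  return 1
}

# run_to_file OUT CMD...: run CMD with its output going to OUT and its
# error messages going to OUT.err, then report what happened.
run_to_file() {
  out=$1
  shift
  "$@" > "$out" 2> "$out.err"
  status=$?
  echo "exit status: $status, output lines: $(wc -l < "$out" | tr -d ' ')"
  # Show any error messages so an empty output file can be explained.
  [ -s "$out.err" ] && cat "$out.err" >&2
  return "$status"
}
```

A run such as `run_to_file out.txt "$(pick_awk)" -f transpose.awk small.txt` (the file names are placeholders) would then show the exit status, the number of output lines, and any error text in one place.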
# 19  
Old 06-25-2010
Try this:
Code:
#!/bin/ksh

#set -x

typeset TEMP=.tmp.$$.dat

function transpose_file
{
#set -x
  typeset file=$1
  typeset -Z3 i=0  ## -Z3 zero-pads i to 3 digits so the temp files sort in column order; 3 digits cover at most 1000 columns (000-999), so raise the width for wider files

  while read -r -A fields
  do
    fld_cnt=${#fields[@]}	## Number of fields in the current record
    for ((i=0 ; i< ${fld_cnt} ; i++))
    do
      ## Print the value of every field in a separate file
      ## You can tweak the value here, before printing it out to the file
      print -n -R "${fields[i]} " >> "${TEMP}.$i"
    done
  done < "$file"

  ## Append a newline to each of the temp files (here is an assumption that number of fields is same for each record)
  for ((i=0 ; i< ${fld_cnt} ; i++))
  do
    print >> ${TEMP}.$i
  done
  
  ## cat all the temp files together
  cat ${TEMP}.*

  rm ${TEMP}.*
}


file=${1:-input.dat}
output=${2:-output.dat}

transpose_file "$file" > "$output"

Note: If the script does not run with ksh, try ksh93 (on some systems the ksh executable is the older ksh88 version).


It took 50 seconds to transpose a file with 247x500 records, so the extrapolated estimate for 500K records (1000 times as many) would be around 13 to 14 hours.
If you need better performance, try implementing the same logic in C.

I would not, however, recommend feeding 500K columns to any process. I also believe most standard shell commands will not be able to handle a line that long.

Perhaps you should address the problem in a different way. Why do you really need that kind of file format?
Can't you feed in the data as id-value pairs?
For example,

Code:
ind1 A 
ind1 A 
ind1 C 
ind1 T 
ind1 A 
ind1 G 
...
ind2 A 
ind2 G 
ind2 T
ind2 T 
...
indN G
indN G 
indN T 
indN T
...

Or:
Code:
KEY:ind1 
A 
A 
C 
T 
A 
G 
...
KEY:indN
G
G 
T 
T
...

So you would get roughly 247 x 500K rows of data, but each line would be of a manageable size.
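
The first id-value layout above could be produced in a single streaming pass, as in this minimal sketch. It assumes the input looks like the 6x6 sample from post #16 (a header of individual IDs with no label over the SNP column, two-character genotypes, "--" for missing) and it carries over that post's 0-for-missing convention. `to_long` is a hypothetical helper name.

```shell
# to_long FILE: print one "individual allele" pair per line.
# Only the header and the current line are held in memory.
to_long() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) id[i] = $i; next }   # individual IDs
    {
      for (i = 2; i <= NF; i++) {      # field 1 is the SNP name
        g = $i
        gsub(/-/, "0", g)              # missing call "--" becomes "00"
        print id[i-1], substr(g, 1, 1)
        print id[i-1], substr(g, 2, 1)
      }
    }' "$1"
}
```

The rows come out in SNP order (all individuals for rs1, then rs2, and so on), not grouped by individual as in the listing above; grouping them would need a separate sort step.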
# 20  
Old 06-25-2010
Hi,

Thanks for the reply. I'm not sure if anyone is familiar with it, but I need this particular format for the program PLINK. The file holds genome-wide SNP data with one row per individual, i.e.:

FAMID INID FID MID SEX AFF rs1a rs1b rs2a rs2b rs3a rs3b..... rs500Ka rs500Kb
n1
n2
n3
--
n247

PLINK needs all the information in one file; it will not work if the data are split up the way you have suggested. I can't figure out how to transpose the data I got from the genotyping people into the form this program needs.

Does that help?
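
Once the data is in the "ind1 A A C T ..." shape from post #16, the six leading columns listed above could be added with a sketch like the following. The values are placeholders and are assumptions, not data from the thread: the family ID is set equal to the individual ID, parents and sex are 0 (unknown), and the phenotype is -9 (missing). `add_ped_columns` is a hypothetical helper name.

```shell
# add_ped_columns FILE: turn "ID allele allele ..." rows into rows with
# six leading columns: FAMID INID FID MID SEX AFF, then the alleles.
add_ped_columns() {
  awk '{
    id = $1
    $1 = ""            # drop the ID; what remains are the alleles
    sub(/^ +/, "")     # trim the space left behind by clearing field 1
    print id, id, 0, 0, 0, -9, $0
  }' "$1"
}
```

Real family, parent, sex, and phenotype values would have to come from the study's own records and replace the placeholders.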
# 21  
Old 06-25-2010
Hi, I wasn't aware that you needed to transpose such a large file. Transposition takes up a lot of memory, so that is likely to become a problem. How many rows and columns are we talking about for the small file and for the big file?
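
If memory does turn out to be the limit, one alternative is to trade memory for extra passes over the input: read the file once per individual and emit one finished row each time. This is a sketch under assumptions, not a tested solution for the real data. It assumes the layout of the 6x6 sample from post #16 (header without a label over the SNP column, two-character genotypes, "--" for missing) and roughly 247 individuals as mentioned in posts #19 and #20, which would mean about 247 passes. `transpose_by_column` is a hypothetical helper name.

```shell
# transpose_by_column SRC DST: write one row per individual to DST by
# re-reading SRC once per column, so awk only ever holds a single line.
transpose_by_column() {
  src=$1
  dst=$2
  # Number of individuals = number of fields on the header line.
  ncol=$(awk 'NR == 1 { print NF; exit }' "$src")
  : > "$dst"                     # start with an empty output file
  c=1
  while [ "$c" -le "$ncol" ]; do
    awk -v c="$c" '
      NR == 1 { printf "%s", $c; next }   # individual ID from the header
      {
        g = $(c + 1)                      # data rows start with the SNP name
        gsub(/-/, "0", g)                 # missing call "--" becomes "00"
        printf " %s %s", substr(g, 1, 1), substr(g, 2, 1)
      }
      END { print "" }' "$src" >> "$dst"
    c=$((c + 1))
  done
}
```

Each pass appends one complete line, so the output file grows as the job runs and progress can be checked with `wc -l` on it.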