Large file - columns into rows etc


 
# 22  
Old 06-25-2010
Quote:
Originally Posted by Scrutinizer
Hi, I wasn't aware that you needed to transpose such a large file. Transpositions take up a lot of memory, so that is likely to become a problem... What numbers of rows and columns are we talking about for the small file and the big file?
To be precise: 248 columns (including rsIDs) and 598,678 rows (including individual IDs) in the original file.
# 23  
Old 06-25-2010
That would mean lines on the order of 1 MiB long after transposing (598,677 genotypes at two characters each, plus separators)...

Can't plink work with transposed filesets? See PLINK: Transposed Filesets.
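If the data can be massaged into one-SNP-per-line .tped/.tfam form, something like this should read it directly (a sketch only; the file names are hypothetical):
Code:
# mydata.tped: one SNP per line; mydata.tfam: one individual per line
plink --tfile mydata --recode --out mydata_recoded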
# 24  
Old 06-25-2010
Yeah it does, but my files are all over the place, so it would take me just as long to try to organise them the alternative way. I just thought that transposing the file might be easier and less time-consuming.
# 25  
Old 06-25-2010
Quote:
Originally Posted by Myrona
J....

This is a genetic data file....
Code:
       ind1  ind2  ind3  ind4  ind5  ind6
rs1    AA    AG    GG    GA    AA    GG
rs2    CT    TT    TT    --    CC    TC
rs3    AG    AA    --    GG    GA    GA
rs4    TT    CT    --    TT    TC    --
rs5    GG    --    GA    AA    GG    AG
rs6    CG    CG    CC    GG    --    GC

I would like the output to be like this:
Code:
ind1 A A C T A G T T G G C G
ind2 A G T T A A C T 0 0 C G
ind3 G G T T 0 0 0 0 G A C C
ind4 G A 0 0 G G T T A A G G
ind5 A A C C G A T C G G 0 0
ind6 G G T C G A 0 0 A G G C

Hope that helps a bit.... I thought that transposing the original file and then doing some other data manipulation with shell/awk to end up with the end product above would suffice, but obviously that's not working.
Storing each field of such a big file in awk is going to be slow and memory-intensive. If your input really is two letters separated by blanks, like the sample posted, this awk can do it in 10 minutes on my computer:
Code:
# Header line: remember the individual IDs.
NR == 1 {
  c = split($0, h)              # c = number of individuals
  next
}
# Data lines: drop the rsID, turn missing "-" into "0",
# and store each row's genotypes as one compact string.
{
  sub(/^[^ \t]*[ \t]+/, "")     # strip the leading rsID field
  gsub(/-/, "0")                # "--" (missing) becomes "00"
  gsub(/[ \t]/, "")             # squeeze out all whitespace
  a[NR] = $0                    # e.g. the rs1 row -> "AAAGGGGAAAGG"
}
END {
  r = NR
  ORS = ""
  for (i = 1; i <= c; ++i) {    # one output line per individual
    printf("%s", h[i])
    for (j = 2; j <= r; ++j)    # alleles of individual i sit at offsets 2i-1 and 2i
      print "", substr(a[j], i+i-1, 1), substr(a[j], i+i, 1)
    printf("\n")
  }
}

The idea here is to store the whole input line and use substr() to pick out the columns. If your format is very strict, i.e. always two letters separated by a single tab, then each genotype occupies a fixed three characters, so you can store the input line as-is and use substr(a[j], 3*i + offset, 1).
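In case it helps, the script can be run like this (a minimal sketch; transpose.awk and genotypes.txt are hypothetical names):
Code:
# save the program above as transpose.awk, then:
awk -f transpose.awk genotypes.txt > transposed.txt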
# 26  
Old 06-25-2010
Quote:
Originally Posted by Myrona
Hi,

Thanks for the reply. I'm not sure if anyone is familiar with it, but I need this particular format for the program PLINK. The file format holds genome-wide SNP data for each individual (one row = one individual), i.e.:

FAMID INID FID MID SEX AFF rs1a rs1b rs2a rs2b rs3a rs3b..... rs500Ka rs500Kb
n1
n2
n3
--
n247

Plink needs all the information in one file; it will not work if it is separated the way you have suggested. I am unable to figure out a way to transpose the data I got from the genotyping people into the form this program needs.

Does that help?
Sorry, I'm not really familiar with that program, but if it is normal for it to handle 100K+ data points, I would think it supports some alternate file layouts.

As for the transposition, did you try my script? It should work without any memory issues, as it dumps everything directly to files. It's just that it will take a long time to finish. If you need better performance you can rewrite the logic in C.
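For reference, the dump-to-files idea looks roughly like this (a minimal sketch of the same approach, not the original script; file names and the column count are placeholders):
Code:
# pass 1: append field i of every row to its own file col.i
# (keeps ~248 files open at once; gawk copes, some older awks may not)
awk '{ for (i = 1; i <= NF; i++) printf "%s ", $i >> ("col." i) }' genotypes.txt
# pass 2: each col.i is one transposed line; stitch them together in order
for i in $(seq 1 248); do cat "col.$i"; echo; done > transposed.txt
rm -f col.*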
# 27  
Old 07-01-2010
From what I can gather by grepping the output in UNIX, this works, but again there is the memory problem when transferring the file over to Windows and trying to open it. I'm using the PFE editor to open it; if anyone knows a better, more stable program, let me know.

As for the PLINK program... it is specifically for analysis of genome-wide data, which usually means over 500K data points.