Creating matrix from folders and subfolders


 
# 1  
Old 12-17-2012

Hello,

Please help me with the following problem. I need to produce one big matrix file from several files at different directory levels.
If it helps, the index folder provides the chromosome index and
the data folder provides the values for the chromosomes.

There are two folders at the same level, index and data.
The index folder has multiple files named chr1, chr2, etc.
The data folder has many subfolders, and each subfolder has files with the same names
as the files in the index folder (chr1, chr2, etc.). A file and its namesakes have the same number of rows,
so if chr1 in index has 5 rows, chr1 in every subfolder of data also has 5 rows.

The output should be one big matrix in a nested format. The row names (first column, starting at row 2) should be the file names,
and the column names (first row, starting at column 3) should be the names of the corresponding subfolders in the data folder.
The second column holds the value from the index file, as in the sample output below.

All files have one column and multiple rows, containing only integers.

Code:
Index folder 

chr1

1
2
3
5
6

chr2

1
2
3
4
5
7

chr3

1
5
7


Data Folder 

Subfolder1

chr1

1
0
1
0
0

chr2

0
1
0
1
0
1

chr3

0
0
2

Subfolder2

chr1

1
1
2
2
3

chr2

1
3
4
6
0
0


chr3

1
0
0

Output

		Subfolder1	Subfolder2
chr1	1	1	1
chr1	2	0	1
chr1	3	1	2
chr1	5	0	2
chr1	6	0	3
chr2	1	0	1
chr2	2	1	3
chr2	3	0	4
chr2	4	1	6
chr2	5	0	0
chr2	7	1	0
chr3	1	0	1
chr3	5	0	0
chr3	7	2	0


Last edited by newbie83; 12-18-2012 at 12:01 AM..
# 2  
Old 12-18-2012
How about using awk:

Code:
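find index data -type f -print | awk '
# Index file: remember its name and load every row into I[name,row].
/^index/ {
   FL=$0
   n=split(FL,p,"/")
   name=p[n]
   F[++files]=name
   n=0
   # Read into "line" so the path held in $0 is not overwritten.
   while ((getline line < FL) > 0)
       I[name,++n]=line
   C[name]=n          # number of rows in this index file
   close(FL)
   next
}
# Data file: load every row into D[name,subfolder,row].
/^data/ {
   FL=$0
   n=split(FL,p,"/")
   subdir=p[n-1]
   S[subdir]=1
   file=p[n]
   n=0
   while ((getline line < FL) > 0)
      D[file,subdir,++n]=line
   close(FL)
}
END {
    # Header row: two empty cells, then one column per subfolder.
    printf "\t"
    for(subdir in S) printf "\t%s",subdir
    printf "\n"
    for(i=1;i<=files;i++) {
        for(c=1;c<=C[F[i]];c++) {
            printf "%s\t%s",F[i],I[F[i],c]
            for(subdir in S) printf "\t%s",D[F[i],subdir,c]
            printf "\n"
        }
    }
}'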

This User Gave Thanks to Chubler_XL For This Post:
# 3  
Old 12-18-2012
This works like a charm with the sample data, but with the actual data it is taking forever: 40 minutes in and it hasn't written a single output row. I guess I will have to wait it out. Thanks again.

If it isn't too much trouble, is there a more efficient way?

---------- Post updated at 09:29 AM ---------- Previous update was at 02:38 AM ----------

Update: 6 hours in, still no output lines. The data and the code are fine;
it's just that the data is too big, 22 GB to be precise.

Any suggestions on how to speed things up?
# 4  
Old 12-18-2012
Wow, 22 GB is a lot of data. I'm assuming you're using GNU awk, or it would probably have fallen over by now.

The solution does need to read in all the data files before any output starts. I would assume the final output phase will be quite quick, so don't get too worried that no output has appeared yet.

What is the total number of index files and the total number of subdirectories?

I can think of another method to solve this problem if those file/subdir counts aren't too massive, but I have some other stuff to do for the next 3 hours or so. I'll start working on it for you then.
# 5  
Old 12-18-2012
There are 13 files in index and 83 subfolders in data, with 13 files each.
The size of the files is what is creating this hang time.
Please take your time; it's not a matter of life and death to get this done in the next few hours.
I can't thank you enough for your help.
# 6  
Old 12-18-2012
As you only have 83 subfolders, gawk should be able to keep all the files open as it works (ulimit -n controls how many files gawk can have open). This will reduce the memory requirements from tens of GB to a few KB, and you should get output pretty much straight away.

Many traditional awks have a limit of 15 open files, and if that is the only awk you have, you're out of luck with this solution.
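
Before running it, the open-file limit can be checked against the number of files that will be open at once. This is only a rough sketch: the helper names needed_open_files and fits_open_file_limit are made up for illustration, and it assumes every entry directly under the data folder that is a directory counts as one subfolder.

```shell
# Hypothetical helper: print how many files would be open at the same time.
# Argument: the data folder.
needed_open_files() {
    n=0
    for d in "$1"/*/; do
        [ -d "$d" ] && n=$((n + 1))
    done
    # One file per subfolder, plus 1 index file, plus stdin, stdout and stderr.
    echo $((n + 4))
}

# Hypothetical helper: succeed if the soft limit reported by ulimit -n is large enough.
fits_open_file_limit() {
    limit=$(ulimit -n)
    [ "$limit" = "unlimited" ] && return 0
    [ "$(needed_open_files "$1")" -le "$limit" ]
}
```

It could be used as `fits_open_file_limit data || echo "raise ulimit -n first"`. With 83 subfolders the count comes to 87.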

Code:
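find index data -type f -print | awk -F/ '
/^index/ { F[++files]=$NF }    # index file names, in the order find reports them
/^data/  { S[$(NF-1)] }        # referencing the element records the subfolder name
END {
    printf "\t"
    for(subdir in S) printf "\t%s",subdir
    printf "\n"
    for(i=1;i<=files;i++) {
        idx="index/" F[i]
        # Read one row at a time from the index file and from each subfolder.
        while((getline row < idx) > 0) {
            printf "%s\t%s",F[i],row
            for(subdir in S) {
                val=""
                getline val < ("data/" subdir "/" F[i])
                printf "\t%s",val
            }
            printf "\n"
        }
        # Close the files under the same names that were used to open them.
        close(idx)
        for(subdir in S) close("data/" subdir "/" F[i])
    }
}'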
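
The same row-at-a-time idea can also be sketched with paste, which joins files line by line. This is an alternative technique, not the code posted above. The function name build_matrix is hypothetical, and the sketch assumes that file and subfolder names contain no whitespace and that the data folder contains only subfolders.

```shell
# Hypothetical helper: print the matrix for an index folder and a data folder.
# Usage: build_matrix index data > matrix.txt
build_matrix() {
    index_dir=$1
    data_dir=$2

    # Subfolder names in ls order (assumes no whitespace in the names).
    subdirs=$(ls "$data_dir")

    # Header row: two empty cells, then one column per subfolder.
    printf '\t'
    for s in $subdirs; do printf '\t%s' "$s"; done
    printf '\n'

    for f in $(ls "$index_dir"); do
        # Build the paste argument list: the index file, then its namesake in each subfolder.
        set -- "$index_dir/$f"
        for s in $subdirs; do set -- "$@" "$data_dir/$s/$f"; done
        # paste joins the rows with tabs; awk prefixes each row with the file name.
        paste "$@" | awk -v name="$f" '{ print name "\t" $0 }'
    done
}
```

Like the awk version, this keeps one file per subfolder open at a time, so the same open-file limit applies. Names come out in ls order, so chr10 sorts before chr2.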


Last edited by Chubler_XL; 12-18-2012 at 05:09 PM.. Reason: Fix indents + remove debug line
# 7  
Old 12-18-2012
The code runs fine with the sample, but with the actual data it just prints the first row, the subfolder names. Let me play around with the code a little and try to find out what's happening.