Merging multiple files using lines from one file


 
# 1  
Old 09-23-2012
Merging multiple files using lines from one file

I have been working on this script for a very long time and I have searched the internet for direction, but I am stuck here.
I have about 3000 files with two columns each. Each file is 50000 lines long. The files are named b.4, b.5, b.6, b.7, b.8, b.9, b.10, b.11, b.12, ..., b.3000

For example, b.4 and b.10 look like this:
b.4:
Code:
1     1
2     1
3     1
4     0
5     1
6     0.75
7     1
8      1
8      0
9    0.34
10    0.45
.....
50000  1

b.10:
Code:
1     1
2     0.87
3     1
4     1
5     0.89
6      1
7      0
8   
9
10     1
.....
50000  1


All the other b. files have the same first column.

I now have another file named 'list', which has just one column with 3000 lines.
cat list:
Code:
4
7
10

...
3000

I want to use 'list' to select lines from b.4, b.5, b.6, b.7, b.8, b.9, b.10, b.11, ..., b.3000: for each entry in 'list', print that line number followed by the second-column value from each of the files. The output (here showing just b.4 and b.10) should look like this:
Code:
4      0         1
7      1         1     
10    0.45      1
..
3000   1         1

Code:
for i in 'home/John/list'
do
      echo $i > fn1
for files in 'home/John/b*'
do
       echo $file > ff.
       paste fn1 ffn fnuse.temp
       cp fnuse.temp fnuse
     done
done

Please guys, I need your help. Thanks

Last edited by Franklin52; 09-24-2012 at 04:54 AM.. Reason: Please use code tags for data and code samples
# 2  
Old 09-23-2012
Quote:
Originally Posted by iconig
Code:
for i in 'home/John/list'
do
      echo $i > fn1
for files in 'home/John/b*'
do
       echo $file > ff.
       paste fn1 ffn fnuse.temp
       cp fnuse.temp fnuse
     done
done

Please guys, I need your help. Thanks
I'm confused. Even after replacing your MAN tags with CODE tags in your original posting (which makes it easier to see what your code may be trying to do), you have for files, but never use $files. In echo $file > ff., $file is undefined and ff. isn't used. In paste fn1 ffn fnuse.temp, none of the operands are defined by this script.

Am I correct in assuming that you're trying to create a file that contains 3000 lines each of which contains 3001 columns where the 1st column is the line numbers chosen from the 50000 line b.* files as specified by home/john/list and the remaining columns are the 2nd column of each of the b.* files?
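If that is the goal, the matching step for a single b file could be sketched with awk like this (filenames here are illustrative):

```shell
# Read "list" into an array (FNR==NR is true only while reading the first
# file), then print the lines of b.4 whose first column appears in the array.
awk 'FNR==NR { want[$1]; next } $1 in want { print $1, $2 }' list b.4 > matched.4
```

Doing this across all 3000 files at once is the harder part, for the reasons below.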

Do you realize that if this is what you want, the result won't be a text file on most systems? By definition, lines in text files are limited to LINE_MAX bytes and LINE_MAX can be as small as 2048 on standards-conforming systems. The common standard utilities like the editors, awk, grep, and sed are only specified to work on text files. About the only utilities guaranteed to work on files with arbitrarily long line lengths are cat, cut, and paste.
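You can check the limit on your system with getconf; the value varies by platform:

```shell
# Maximum length (in bytes) of a line that the standard text
# utilities are guaranteed to handle on this system.
getconf LINE_MAX
```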

If you specify b.* as the pattern for your b.* files, do you realize that the order of files processed would be b.1, b.10, b.100, b.1000, b.11, ... rather than b.1, b.2, b.3, ...? Does the order in which these files are processed matter?
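You can see the shell's expansion order for yourself, and a numeric alternative, with something like this:

```shell
# Default glob expansion sorts lexically, so b.10 sorts before b.4:
printf '%s\n' b.*

# Sorting numerically on the field after the dot restores b.4, b.5, ..., b.3000:
printf '%s\n' b.* | sort -t. -k2,2n
```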
# 3  
Old 09-23-2012
Yes, you are right: the output file should have 3001 columns and 3000 rows. I am interested in correctly matching column 1 of the list against the first column of each of the other files. The output file should have its rows in the order of the list, i.e., picking the lines that correspond to the entries in the list from all the b. files.

---------- Post updated at 07:26 PM ---------- Previous update was at 07:25 PM ----------

I am not too experienced with nested loops, and I don't know whether there is a better way to do this.

---------- Post updated at 07:52 PM ---------- Previous update was at 07:26 PM ----------
Code:
for i in 'home/John/list'
do
      echo $i > fn.
for files in 'home/John/b*'
do
       echo $file > ffn
       paste fn1 ffn fnuse.temp
       cp fnuse.temp fnuse
     done
done

The $file values are the b.4, b.5, ..., b.3000 files

---------- Post updated at 09:44 PM ---------- Previous update was at 07:52 PM ----------

Please could anybody be of help? Thanks

Last edited by Franklin52; 09-24-2012 at 04:54 AM.. Reason: Please use code tags for data and code samples
# 4  
Old 09-23-2012
Assuming that your version of awk has lots of memory and no limits on output line lengths, that your system has a LARGE value for ARG_MAX, and that your shell doesn't limit the number of arguments you can pass to an application, the following will create a file with the contents you've requested:
Code:
#!/bin/ksh
awk -v printkey=1 '
FNR==NR{key[$1] = ++kc
        next
}
$1 in key{
        if(out[key[$1]] != "")
                out[key[$1]] = out[key[$1]] FS $2
        else    out[key[$1]] = printkey > 0 ? $1 FS $2 : $2
}
END {   for(i = 1; i <= kc; i++) print out[i]
}' list b.? b.?? b.??? b.???? > out

With 3000 input files of 50000 lines each, this awk program is going to take quite a while to complete. I would expect that it will run into some line length or memory limits, which will necessitate running this awk program multiple times on smaller sets of the b.* files, with the output from each run saved in a temp file. The paste utility can then be used to join the temp files into a single output file. (Note that in this case the 1st invocation of awk needs to have printkey=1 and all remaining invocations of awk need to have printkey=0 (or unset) so the key will only appear in the output lines once.)
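That split-and-paste approach could be outlined like this (a rough sketch only; the batch groupings and temp-file names are placeholders, and merge.awk is assumed to hold the awk program above in a file):

```shell
#!/bin/sh
# Run the awk program over the b.* files in batches, then paste the
# per-batch outputs together column-wise.  merge.awk is assumed to
# contain the awk program shown above.
batch=0
for group in 'b.? b.??' 'b.???' 'b.????'; do
    batch=$((batch + 1))
    # Only the first batch prints the key column.
    if [ "$batch" -eq 1 ]; then pk=1; else pk=0; fi
    # $group is deliberately unquoted so it splits and the globs expand.
    awk -v printkey="$pk" -f merge.awk list $group > "tmp.$batch"
done
paste tmp.1 tmp.2 tmp.3 > out   # join the batches into one wide file
rm -f tmp.1 tmp.2 tmp.3
```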

Note also that there will be more than 6000 bytes on each line of output, so with 3000 lines this will be more than 18MB (assuming 1 byte of output per field and not counting the line number at the start of the line); your file size may be MUCH larger depending on the contents of your input files. On many systems you won't be able to do much of anything with this output file except cut fields out of it for further processing.
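A rough way to estimate the runtime is to time the awk program over a small subset of the files and scale up (merge.awk is a hypothetical file holding the program above):

```shell
# Time one pass over roughly 10 of the 3000 files, then extrapolate.
time awk -v printkey=1 -f merge.awk list b.1? > /dev/null
```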

Good luck!
This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 09-23-2012
I am using bash, not kWh; would this be able to work for bash? Also, could you explain the kc?
Thanks
# 6  
Old 09-23-2012
Quote:
Originally Posted by iconig
I am using bash, not kWh, would this be able to work for bash? Also could you explain the kc?
Thanks
It looks like you have the same auto-correcting spell checker that I have. I use ksh, but there is nothing in this script that will keep it from working in at least ksh, sh, and bash (although they may each have different limits on the number of arguments they'll pass to invoked utilities).
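For reference, the argument-space limit is also visible via getconf, and it varies widely between systems:

```shell
# Total bytes available for command-line arguments plus the environment
# when exec'ing a command.
getconf ARG_MAX
```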

In the awk script, kc is the number of keys present in the 1st file argument given to awk (in this case, I specified that file to be list; you specified it to be home/John/list, although I imagine you're missing a "/" at the start of that pathname).
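A minimal illustration of kc as a running key counter (the sample keys here are made up):

```shell
# ++kc increments once per input line, so after the first file has been
# read, kc holds the number of keys.
printf '4\n7\n10\n' | awk '{ key[$1] = ++kc } END { print "keys:", kc }'
# prints "keys: 3"
```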
# 7  
Old 09-24-2012
Yes you are right, it was an oversight.

---------- Post updated at 11:54 PM ---------- Previous update was at 11:13 PM ----------

Once I create the columns and rows, I will just move the output to external storage for further processing. How long do you imagine this will take?