Merging multiple files using lines from one file


 
# 1  
Old 09-23-2012
Merging multiple files using lines from one file

I have been working on this script for a very long time and I have searched the internet for direction, but I am stuck here.
I have about 3000 files with two columns each. Each file is 50000 lines long. The files are named b.4, b.5, b.6, b.7, b.8, b.9, b.10, b.11, b.12, ..., b.3000

For example, b.4 and b.10 look like this:
b.4:
Code:
1     1
2     1
3     1
4     0
5     1
6     0.75
7     1
8      1
8      0
9    0.34
10    0.45
.....
50000  1

b.10:
Code:
1     1
2     0.87
3     1
4     1
5     0.89
6      1
7      0
8   
9
10     1
.....
50000  1


All the other b. files have the same first column.

I now have another file named 'list', which has just one column with 3000 lines.
cat list:
Code:
4
7
10

...
3000

I want to use 'list' to select lines from b.4, b.5, b.6, b.7, b.8, b.9, b.10, b.11, ..., b.3000: for each entry in 'list', print that line number followed by the second-column value from each of the files. The output (here showing just b.4 and b.10) should look like this:
Code:
4      0         1
7      1         1     
10    0.45      1
..
3000   1         1

Code:
for i in 'home/John/list'
do
      echo $i > fn1
for files in 'home/John/b*'
do
       echo $file > ff.
       paste fn1 ffn fnuse.temp
       cp fnuse.temp fnuse
     done
done

Please guys, I need your help. Thanks

Last edited by Franklin52; 09-24-2012 at 04:54 AM.. Reason: Please use code tags for data and code samples
# 2  
Old 09-23-2012
Quote:
Originally Posted by iconig
Code:
for i in 'home/John/list'
do
      echo $i > fn1
for files in 'home/John/b*'
do
       echo $file > ff.
       paste fn1 ffn fnuse.temp
       cp fnuse.temp fnuse
     done
done

Please guys, I need your help. Thanks
I'm confused. Even after replacing your MAN tags with CODE tags in your original posting (which makes it easier to see what your code may be trying to do), you have for files, but never use $files. In echo $file > ff., $file is undefined and ff. isn't used. In paste fn1 ffn fnuse.temp, none of the operands are defined by this script.

Am I correct in assuming that you're trying to create a file that contains 3000 lines each of which contains 3001 columns where the 1st column is the line numbers chosen from the 50000 line b.* files as specified by home/john/list and the remaining columns are the 2nd column of each of the b.* files?
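If that is the goal, the matching step for a single b file could be sketched with awk like this (filenames here are illustrative):

```shell
# Read "list" into an array (FNR==NR is true only while reading the first
# file), then print the lines of b.4 whose first column appears in the array.
awk 'FNR==NR { want[$1]; next } $1 in want { print $1, $2 }' list b.4 > matched.4
```

Doing this across all 3000 files at once is the harder part, for the reasons below.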

Do you realize that if this is what you want, the result won't be a text file on most systems? By definition, lines in text files are limited to LINE_MAX bytes and LINE_MAX can be as small as 2048 on standards-conforming systems. The common standard utilities like the editors, awk, grep, and sed are only specified to work on text files. About the only utilities guaranteed to work on files with arbitrarily long line lengths are cat, cut, and paste.
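You can check the limit on your system with getconf; the value varies by platform:

```shell
# Maximum length (in bytes) of a line that the standard text
# utilities are guaranteed to handle on this system.
getconf LINE_MAX
```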

If you specify b.* as the pattern for your b.* files, do you realize that the order of files processed would be b.1, b.10, b.100, b.1000, b.11, ... rather than b.1, b.2, b.3, ...? Does the order in which these files are processed matter?
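You can see the shell's expansion order for yourself, and a numeric alternative, with something like this:

```shell
# Default glob expansion sorts lexically, so b.10 sorts before b.4:
printf '%s\n' b.*

# Sorting numerically on the field after the dot restores b.4, b.5, ..., b.3000:
printf '%s\n' b.* | sort -t. -k2,2n
```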
# 3  
Old 09-23-2012
Yes, you are right: the output file should have 3001 columns and 3000 rows. I am interested in correctly matching column 1 of the list against the first column of each of the other files. The output file should have its rows in the order of the list, i.e., picking the lines that correspond to the entries in the list from all the b. files.

---------- Post updated at 07:26 PM ---------- Previous update was at 07:25 PM ----------

I am not too experienced with nested loops, and I don't know whether there is a better way to do this.

---------- Post updated at 07:52 PM ---------- Previous update was at 07:26 PM ----------
Code:
for i in 'home/John/list'
do
      echo $i > fn.
for files in 'home/John/b*'
do
       echo $file > ffn
       paste fn1 ffn fnuse.temp
       cp fnuse.temp fnuse
     done
done

The $file values are the b.4, b.5, ..., b.3000 files

---------- Post updated at 09:44 PM ---------- Previous update was at 07:52 PM ----------

Please could anybody be of help? Thanks

Last edited by Franklin52; 09-24-2012 at 04:54 AM.. Reason: Please use code tags for data and code samples
# 4  
Old 09-23-2012
Assuming that your version of awk has lots of memory and no limits on output line lengths, that your system has a LARGE value for ARG_MAX, and that your shell doesn't limit the number of arguments you can pass to an application, the following will create a file with the contents you've requested:
Code:
#!/bin/ksh
awk -v printkey=1 '
FNR==NR{key[$1] = ++kc
        next
}
$1 in key{
        if(out[key[$1]] != "")
                out[key[$1]] = out[key[$1]] FS $2
        else    out[key[$1]] = printkey > 0 ? $1 FS $2 : $2
}
END {   for(i = 1; i <= kc; i++) print out[i]
}' list b.? b.?? b.??? b.???? > out

With 3000 input files of 50000 lines each, this awk program is going to take quite a while to complete. I would expect that it will run into some line length or memory limits, which will necessitate running this awk program multiple times on smaller sets of the b.* files, with the output from each run saved in a temp file. The paste utility can then be used to join the temp files into a single output file. (Note that in this case the 1st invocation of awk needs to have printkey=1 and all remaining invocations of awk need to have printkey=0 (or unset) so the key will only appear in the output lines once.)
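That split-and-paste approach could be outlined like this (a rough sketch only; the batch groupings and temp-file names are placeholders, and merge.awk is assumed to hold the awk program above in a file):

```shell
#!/bin/sh
# Run the awk program over the b.* files in batches, then paste the
# per-batch outputs together column-wise.  merge.awk is assumed to
# contain the awk program shown above.
batch=0
for group in 'b.? b.??' 'b.???' 'b.????'; do
    batch=$((batch + 1))
    # Only the first batch prints the key column.
    if [ "$batch" -eq 1 ]; then pk=1; else pk=0; fi
    # $group is deliberately unquoted so it splits and the globs expand.
    awk -v printkey="$pk" -f merge.awk list $group > "tmp.$batch"
done
paste tmp.1 tmp.2 tmp.3 > out   # join the batches into one wide file
rm -f tmp.1 tmp.2 tmp.3
```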

Note also that there will be more than 6000 bytes on each line of output, so with 3000 lines this will be more than 18MB (assuming 1 byte of output per field and not counting the line number at the start of the line); your file size may be MUCH larger depending on the contents of your input files. On many systems you won't be able to do much of anything with this output file except cut fields out of it for further processing.
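A rough way to estimate the runtime is to time the awk program over a small subset of the files and scale up (merge.awk is a hypothetical file holding the program above):

```shell
# Time one pass over roughly 10 of the 3000 files, then extrapolate.
time awk -v printkey=1 -f merge.awk list b.1? > /dev/null
```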

Good luck!
This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 09-23-2012
I am using bash, not kWh; would this be able to work for bash? Also, could you explain the kc?
Thanks
# 6  
Old 09-23-2012
Quote:
Originally Posted by iconig
I am using bash, not kWh, would this be able to work for bash? Also could you explain the kc?
Thanks
It looks like you have the same auto-correcting spell checker that I have. I use ksh, but there is nothing in this script that will keep it from working in at least ksh, sh, and bash (although they may each have different limits on the number of arguments they'll pass to invoked utilities).
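For reference, the argument-space limit is also visible via getconf, and it varies widely between systems:

```shell
# Total bytes available for command-line arguments plus the environment
# when exec'ing a command.
getconf ARG_MAX
```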

In the awk script, kc is the number of keys present in the 1st file argument given to awk (in this case, I specified that file to be list; you specified it to be home/John/list, although I imagine you're missing a "/" at the start of that pathname).
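A minimal illustration of kc as a running key counter (the sample keys here are made up):

```shell
# ++kc increments once per input line, so after the first file has been
# read, kc holds the number of keys.
printf '4\n7\n10\n' | awk '{ key[$1] = ++kc } END { print "keys:", kc }'
# prints "keys: 3"
```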
# 7  
Old 09-24-2012
Yes you are right, it was an oversight.

---------- Post updated at 11:54 PM ---------- Previous update was at 11:13 PM ----------

Once I create the columns and rows, I will just move the output to external storage for further processing. How long do you imagine this will take?