Paste columns based on common column: multiple files

12-15-2017

Registered User

58, 1

Join Date: Nov 2017

Last Activity: 1 May 2020, 4:16 PM EDT

Posts: 58

Thanks Given: 1

Thanked 1 Time in 1 Post

@RudiC:
Yup, I saw that just now, but I can't get my head around it. I'm absolutely stumped by TF and FC. Please explain.

genome

12-15-2017

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Code:

awk '
BEGIN           {TF = ARGC - 1                                  # save initial file count (ARGC also counts ARGV[0], the command name)
                }

FNR == 1        {FC++                                           # count files opened
                 if (FC <= TF) ARGV[ARGC++] = FILENAME          # if still in the original file list,
                }                                               # append file name to argument list
                                                                # so every file will be read twice
FC <= TF        {CNT[$2]++                                      # first pass (original files): count $2 occurrences
                 next                                           # and do not proceed in script
                }

CNT[$2] == TF   {if (!LINE[$2]) SEQ[++SN] = $2                  # second pass, only for $2 that occurred file-count times:
                                                                # on the first occurrence of $2, increment the sequence counter
                 LINE[$2] = LINE[$2] $0 " "                     # collect lines of every file into a string array
                }

END             {for (s=1; s<=SN; s++) print LINE[SEQ[s]]       # print all collected (5 fold) lines in correct order
                }
'  HGWAS?/merged_info_CHR1.info

I modified this a bit to account for a varying number of file arguments. Please be aware that one of your input files has DOS <CR> (\r, ^M, 0x0D) line terminators, which distort the screen output / result.
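
A minimal sketch of one way to deal with such terminators, assuming the carriage return sits at the very end of each line: strip it with sub() before the fields are used. The sample file and its contents are made up for illustration and are not the poster's data.

```shell
# Hypothetical sample: one line with a DOS (CR LF) line ending.
tmp=$(mktemp -d)
printf 'rs1 10 A\r\n' > "$tmp/dos_sample.txt"

# sub() edits $0 in place, after which awk re-splits the fields,
# so $3 no longer carries the trailing carriage return.
result=$(awk '{ sub(/\r$/, "") } { print $2 "|" $3 "|" }' "$tmp/dos_sample.txt")
echo "$result"

rm -rf "$tmp"
```

Without the sub() call, the carriage return stays attached to the last field of every line, which is what garbles lines that are pasted together from several files.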

Last edited by RudiC; 12-15-2017 at 01:54 PM..

RudiC

12-15-2017

Sorry, RudiC, I still can't understand it.

Code:

FNR == 1

How does this help?

If the number of occurrences of $2 equals the file count:

Code:

CNT[$2] == TF   {if (!LINE[$2]) SEQ[++SN] = $2                  # second pass, only for $2 that occurred file-count times:
                                                                # on the first occurrence of $2, increment the sequence counter
                 LINE[$2] = LINE[$2] $0 " "                     # collect lines of every file into a string array
                }

if (!LINE[$2]) SEQ[++SN] = $2

What is this line doing? What are LINE, SEQ, and SN?

Edited: added occurrence

Last edited by genome; 12-15-2017 at 02:07 PM..

genome

12-15-2017

In awk, input lines are counted in the (internal) NR variable across ALL files, while FNR does the same but is reset for every new file. So FNR==1 indicates the beginning of a new file, at which point the (user-defined) file count FC is incremented.
I highly recommend doing some reading on awk, e.g. man awk. There, all system variables are listed, so you can tell them apart from user variables like TF and SEQ, which are created when first used.
The square brackets [...] enclose array indices, so LINE and SEQ are arrays, holding the output lines and the order in which to print them. SN is a scalar serial number.
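
As a small illustration of NR versus FNR, here is a sketch on two made-up two-line files (f1 and f2 are hypothetical names). FC is counted the same way as in the script, and each output line shows FC, NR, FNR and the input line.

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/f1"
printf 'c\nd\n' > "$tmp/f2"

# FNR restarts at 1 for every file, so FNR == 1 fires once per file;
# NR keeps counting across both files.
result=$(awk 'FNR == 1 { FC++ } { print FC, NR, FNR, $0 }' "$tmp/f1" "$tmp/f2")
echo "$result"
# 1 1 1 a
# 1 2 2 b
# 2 3 1 c
# 2 4 2 d

rm -rf "$tmp"
```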

RudiC

12-16-2017

Well, not one of my magic moments. Try this instead:

Code:

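awk '
BEGIN           {TF = ARGC - 1                                  # number of file arguments
                }

                {if (!LINE[$2]) SEQ[++SN] = $2                  # remember the order in which each $2 first appears
                 LINE[$2] = LINE[$2] $0 " "                     # append this line to the string collected for $2
                 CNT[$2]++                                      # count occurrences of $2 across all files
                }
END             {for (s=1; s<=SN; s++) if (CNT[SEQ[s]] == TF) print LINE[SEQ[s]]   # print only $2 values counted TF times
                }
'  HGWAS?/merged_info_CHR1.info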

Only if you run out of memory because the files are too large or too numerous might you want to fall back to the post #9 version, which reads the files twice but saves some memory.
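
To see what the single-pass version does, here is a small worked run on three made-up files (hypothetical names f1, f2, f3, with the key in column 2). It is only a sketch; like the script above, it assumes each key appears at most once per file.

```shell
tmp=$(mktemp -d)
printf 'x k1 1\nx k2 2\n'         > "$tmp/f1"
printf 'y k2 3\ny k1 4\ny k3 5\n' > "$tmp/f2"
printf 'z k1 6\nz k2 7\n'         > "$tmp/f3"

result=$(awk '
BEGIN { TF = ARGC - 1 }                      # three file arguments
      { if (!LINE[$2]) SEQ[++SN] = $2        # first sighting of a key fixes its output position
        LINE[$2] = LINE[$2] $0 " "           # append the line to what was collected for the key
        CNT[$2]++ }                          # count how often the key was seen
END   { for (s = 1; s <= SN; s++)
            if (CNT[SEQ[s]] == TF) print LINE[SEQ[s]] }
' "$tmp/f1" "$tmp/f2" "$tmp/f3")
echo "$result"

rm -rf "$tmp"
```

Keys k1 and k2 occur in all three files, so their lines are pasted together and printed in the order in which the keys first appeared. Key k3 occurs only in f2, so its count never reaches TF and it is dropped.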

RudiC

12-29-2017

Hi RudiC,

Sorry for my delay in writing back.
I am still going through your code and am stuck on it.
I'm copying a few lines from post #9:

Code:

awk '
BEGIN           {TF = ARGC - 1                                  # save initial total file count (minus 1 for $0) 
                }
#------

FNR == 1 
#-------
FC <= TF
        
#---------
CNT[$2] == TF

I'm surprised you haven't used an if condition for these lines. What I mean is:

Code:


awk '
BEGIN           {TF = ARGC - 1                                  # save initial total file count (minus 1 for $0) 
                }

#-----
if (FNR == 1 )

#----
if (FC <= TF        )

#-----
if (CNT[$2] == TF )

The code doesn't work either when I wrap these checks in if. My usual mindset would be to use if to test these conditions and then proceed. I'm slow at catching up on awk. Sorry.

---------- Post updated at 04:51 PM ---------- Previous update was at 04:21 PM ----------

The parts I find difficult to grasp are:

1- The files are read twice. I cannot see where in the code that is taken care of.
2- The variables SN and TF are user variables, yet they are used throughout the code without being initialized first, as they would have to be in other languages (C, etc.).

genome

12-30-2017

A review of awk basics might help. From man awk:

Quote:

THE AWK LANGUAGE
1. Program structure
An AWK program is a sequence of pattern {action} pairs and user function definitions.
A pattern can be:
BEGIN
END
expression
expression , expression

FNR == 1 is such an expression; if it evaluates to true, the action that follows is executed. So NO if is needed in this awk construct (although the if statement is available for flow control inside the action parts).
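
A short sketch of the two spellings side by side, run on made-up input files (hypothetical names f1 and f2): the pattern form, and the same test written as an if statement inside an action. A bare if (FNR == 1) outside of braces is a syntax error, because if is a statement and statements are only valid inside an action.

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/f1"
printf 'c\n'    > "$tmp/f2"

# Pattern form: the expression in front of the braces selects the records.
as_pattern=$(awk 'FNR == 1 { print "first line:", $0 }' "$tmp/f1" "$tmp/f2")

# Statement form: an action without a pattern runs for every record,
# and the if inside it does the selecting.
as_if=$(awk '{ if (FNR == 1) print "first line:", $0 }' "$tmp/f1" "$tmp/f2")

echo "$as_pattern"
echo "$as_if"
rm -rf "$tmp"
```

Both commands print "first line: a" followed by "first line: c".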

1- Reading the files twice has been eliminated in post #12. Nevertheless, the trick was to append the current file name to awk's argument list (ARGV), as described in the comments in post #9.
2- Yes, awk variables are created and initialized (to the empty string, which also counts as 0) the first time they're referenced.
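
Both points can be seen in a small sketch (the one-file input is made up; TF and FC follow post #9, while the variable pass is introduced here only for illustration). Nothing is declared up front: FC starts out empty, which counts as 0 in FC++. Appending FILENAME to ARGV while the first pass is running makes awk read the file a second time.

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/f1"

result=$(awk '
BEGIN    { TF = ARGC - 1 }                          # number of file arguments (here: 1)
FNR == 1 { FC++                                     # a new file, or a new pass over it, starts
           if (FC <= TF) ARGV[ARGC++] = FILENAME }  # queue the file once more at the end of ARGV
         { pass = (FC <= TF) ? "pass1" : "pass2"    # original list versus appended copy
           print pass, $0 }
' "$tmp/f1")
echo "$result"
# pass1 a
# pass1 b
# pass2 a
# pass2 b

rm -rf "$tmp"
```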

RudiC
