Merge multiple tab delimited files with index checking Post: 302986837


Post 302986837 by LMHmedchem on Wednesday 30th of November 2016 12:46:42 PM

Below is a script I put together last night.

The runtime for this was ~40 seconds for 40 input files, each with 2500 rows. That's not too awful but I think this code is a bit ghastly. It would be faster if I collected all of the data in memory instead of writing it to a file and then reading it back in.

This solution also uses sed in the pipe to replace the E0 values with a value read from the file name as the data is passed to the new file. That is almost the only thing about this script that I like. The code is not generalized, but it could be made more so in a few places.
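
As a rough illustration of that sed step, here is a minimal sketch. The file name and the sample rows are made up for the example, and it assumes E0 is the header of column 4. It also restricts the substitution to line 1 with the `1s` address, which is a small variation on the script below (there the substitution runs on every line), so that a data value that happens to contain "E0" is left alone.

```shell
#!/bin/bash

# hypothetical file name, following the set_fold_rnd_pred.txt pattern
pred_file="A_f0_r179_pred.txt"

# build the replacement header from the first three '_' separated parts
IFS='_' read -r -a FIELD <<< "$pred_file"
set_fold_rnd="${FIELD[0]}_${FIELD[1]}_${FIELD[2]}"

# made-up sample input: four tab separated columns with a header row
# keep columns 3 and 4, then rename E0 on the header line only
result=$(printf 'id\tx\tName\tE0\n1\ty\tmol1\t0.51\n' \
   | cut -f3,4 \
   | sed "1s/E0/$set_fold_rnd/")

printf '%s\n' "$result"
```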

RudiC, I will check out your latest post in a few minutes.

LMHmedchem

Code:

#!/bin/bash

# name of output file
output_file=$1

# collect names of all pred output files in array, files are in pwd with script
# a glob is used instead of parsing ls output, which breaks on names with whitespace
pred_file_list=( *_pred.txt )

# the first file forms the base of the output, so capture the name here
first_file=${pred_file_list[0]}

# get set, fold, rnd from file name
unset FIELD; IFS='_' read -a FIELD <<< "$first_file"
set_fold_rnd=${FIELD[0]}'_'${FIELD[1]}'_'${FIELD[2]}

# use the first output file as the base file for the rest
# collect columns 1,3,and 4 and pipe to aggregate file
# change E0 to set fold and rnd ini from file name
cut -f1,3,4 "${pred_file_list[0]}" | sed "s/E0/$set_fold_rnd/1" > tmp_output1.txt

# loop through file list 
for pred_file in "${pred_file_list[@]}"
do
   # don't enter the first file twice
   if [ "$pred_file" != "$first_file" ]; then
      # get set, fold, rnd ini from filename
      unset FIELD; set_fold_rnd='';
      # create substitute column header value from filename
      IFS='_' read -a FIELD <<< "$pred_file"
      set_fold_rnd=${FIELD[0]}'_'${FIELD[1]}'_'${FIELD[2]}
      # collect columns 3and 4 and pipe to temp file
      # change E0 to set fold and rnd ini from file name
      cut -f3,4 "./$pred_file" | sed "s/E0/$set_fold_rnd/1" > tmp_output2.txt
      # merge temp file with aggregate file to create second temp
      paste tmp_output1.txt  tmp_output2.txt > tmp_output3.txt
      # rename second temp back to aggregate file name
      mv tmp_output3.txt  tmp_output1.txt
      # cleanup
      rm -f tmp_output2.txt tmp_output3.txt
   fi
done

# tmp_output1.txt now contains all of the renamed data columns and all of the name columns

# name columns to check
# this could be dynamic by reading header line and recording the positions where "name" is found
declare -a field_check_array=(3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79)

# data columns to output
# this could be dynamic by reading header line and recording the positions where "E0" is found
declare -a output_cols_array=(0 1 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80)

# process the resulting aggregate file
while IFS= read -r line; do

   # reinitialize array and output line string
   unset FIELD; output_line='';
   # read tab separated line into array
   IFS=$'\t' read -a FIELD <<< "$line"

   # for each line check the value of each field in field_check_array against the first field
   # check name fields to make sure they are all the same, exit if they are not
   for field_check in "${field_check_array[@]}"
   do
      if [ "${FIELD[1]}" != "${FIELD[$field_check]}" ]; then
         echo "names do not match"
         echo "FIELD[1]: ${FIELD[1]}"
         echo "FIELD[$field_check]: ${FIELD[$field_check]}"
         # exit codes are 0-255, so use 1 rather than -1
         exit 1
      fi
   done

   # if all name fields check for this row
   # add fields in output_cols_array to output_line string
   for output_col in "${output_cols_array[@]}"
   do
      # get value for next field
      cell="${FIELD[$output_col]}"

      # if this is the first column, the size of the output string will be 0, no tab
      if [ -z "$output_line" ]; then
         output_line="$cell"
      else
         # concatenate with row string
         output_line="$output_line"$'\t'"$cell"
      fi
   done

   # append the row to the output file
   # >> creates the file if it does not exist, so no touch or existence check is needed
   printf '%s\n' "$output_line" >> "$output_file"

done < tmp_output1.txt

# cleanup
rm -f tmp_output1.txt
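
The two hard-coded index arrays above could probably be built from the header line, as the comments in the script suggest. A minimal sketch follows. It assumes the name columns are all headed exactly "Name", and it uses a made-up header string in place of the first line of tmp_output1.txt. The first "Name" column is kept for output and every later one goes into the check list.

```shell
#!/bin/bash

# hypothetical header line, in the script this would be the first line of tmp_output1.txt
header=$'ID\tName\tA_f0_r179\tName\tA_f1_r179\tName\tB_f0_r12'

# split the header on tabs into an array
IFS=$'\t' read -r -a HEADER <<< "$header"

field_check_array=()
output_cols_array=()
seen_name=0

for i in "${!HEADER[@]}"; do
   if [ "${HEADER[$i]}" = "Name" ] && [ "$seen_name" -eq 1 ]; then
      # a repeated Name column, only used for checking
      field_check_array+=("$i")
   else
      # id column, first Name column, or a data column
      if [ "${HEADER[$i]}" = "Name" ]; then seen_name=1; fi
      output_cols_array+=("$i")
   fi
done

echo "check:  ${field_check_array[*]}"
echo "output: ${output_cols_array[*]}"
```

For this three-file header the check list comes out as 3 5 and the output list as 0 1 2 4 6, which is the same pattern as the hard-coded arrays.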

---------- Post updated at 12:46 PM ---------- Previous update was at 12:13 PM ----------

I made a few modifications to the script posted by RudiC.

This just changes the code that creates the substitute header from,
HD = HD OFS $3 OFS $4 "_" T[2]
to
HD = HD OFS $3 OFS T[1] "_" T[2] "_" T[3]

For the filename "A_f0_r179_pred.txt", this results in the header "A_f0_r179" instead of "E0_f0".
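
To see the difference in isolation, here is a small sketch that runs only the split on that file name. The literal "E0" stands in for $4 of the header line, which is an assumption about the input files.

```shell
#!/bin/bash

filename="A_f0_r179_pred.txt"

# original header: column 4 of the header line (assumed to be E0) plus the fold
old_header=$(awk -v f="$filename" 'BEGIN { split(f, T, "_"); print "E0" "_" T[2] }')

# modified header: set, fold and rnd from the file name
new_header=$(awk -v f="$filename" 'BEGIN { split(f, T, "_"); print T[1] "_" T[2] "_" T[3] }')

echo "old: $old_header"
echo "new: $new_header"
```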

It also changes the input file pattern (a shell glob, not a regular expression) from,
A_*_pred.txt
to
*_*_pred.txt
because there are file names that start with letters other than A.

Code:

#!/bin/bash

# name of output file
output_file=$1

awk '
NR == 1         {HD = $1
                }
FNR == 1        {split (FILENAME, T, "_")
                 HD = HD OFS $3 OFS T[1] "_" T[2] "_" T[3]
                }

                {IX  = FNR - 1
                 MAX = IX>MAX?IX:MAX 
                }

FNR == NR       {ID[IX]   = $1
                 NAME[IX] = $3
                }
$1 == ID[IX] &&
$3 == NAME[IX]  {OUT[IX] = OUT[IX] $3 OFS $4 OFS
                 next
                }

                {OUT[IX]  = OUT[IX] OFS OFS
                }

END             {print HD
                 for (i=1; i<=MAX; i++) print ID[i], OUT[i]
                }
' OFS="\t" *_*_pred.txt > "$output_file"

This runs in 0.2 seconds (compared to 40 seconds for my script). The only issue is that the Name columns are still appearing in the final output and I only need the Name once.

I could add more code to process the output and remove all of the "Name" columns except the first one.
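
Something like the following might do for that extra step. It is only a sketch, run here on made-up rows rather than the real output. It assumes the layout the awk script above produces: column 1 is the ID, then Name and data columns alternate, and the data rows end with a trailing tab. It keeps columns 1 and 2 and then every odd column from 3 on.

```shell
#!/bin/bash

# made-up sample of the merged output for two input files
# (data rows end in a tab, as they do in the output of the awk script)
result=$(printf 'ID\tName\tA_f0_r179\tName\tA_f1_r179\n1\tmol1\t0.51\tmol1\t0.48\t\n2\tmol2\t0.33\tmol2\t0.37\t\n' \
   | awk 'BEGIN { FS = OFS = "\t" }
          {
             # keep the ID and the first Name column
             line = $1 OFS $2
             # data columns are 3, 5, 7, ... and the even columns after 2 are repeated Names
             for (i = 3; i <= NF; i += 2) line = line OFS $i
             print line
          }')

printf '%s\n' "$result"
```

On the real data the printf would be replaced by the output file, or the awk command could be added to the end of the pipe in the script.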

LMHmedchem
