Make copy of text file with columns removed (based on header) Post: 302931989

Sponsored Content

Top Forums Shell Programming and Scripting Make copy of text file with columns removed (based on header) Post 302931989 by LMHmedchem on Saturday 17th of January 2015 09:15:01 PM

01-17-2015

Registered User

This is what I have at the moment,

Code:

#!/bin/sh

# file with list of colums to remove (by header name)
REMOVE_LIST_FILE='./remove_list.txt'
SET='A'

## 1) read list of columns to remove into array, skip comment lines
# clear array
unset LIST_TO_REMOVE
echo ""

# read from file
while IFS=$'\n' read line
do
    if [[ "$line" =~ \#.* ]];then
        echo "skipping comment line:" $line
    else
        echo "adding column header:" $line
        LIST_TO_REMOVE=("${LIST_TO_REMOVE[@]}" "$line")
    fi
done < $REMOVE_LIST_FILE

## 2) translate list of headers to list of column numbers
echo ""
echo "creating files without columns" ${LIST_TO_REMOVE[@]}
echo ""

# find position of column to remove based on the first input file
COLUMN_CHECK_FILE=$(ls './'$SET'/input_data/base/f0_'*'_'$SET'_'*'.txt')

# clear columns to be removed
unset COLS_TO_REMOVE

# find column number for column to remove
for CURRENT_REMOVE in "${LIST_TO_REMOVE[@]}"
do
   # grab header row (row 3) from column check file
   HEADER_ROW_LIST=($(cat $COLUMN_CHECK_FILE | sed -n '3p'))
   ELEMENT_COUNTER='0';  HEADER_POSITION='0';

# loop through headers
for HEADER_ROW in "${HEADER_ROW_LIST[@]}"
do
      # incrementer counter
      (( ELEMENT_COUNTER++ ))
      if [ "$HEADER_ROW" == "$CURRENT_REMOVE" ]; then
         HEADER_POSITION=$ELEMENT_COUNTER
         COLS_TO_REMOVE=("${COLS_TO_REMOVE[@]}" "$HEADER_POSITION")
      fi
   done
done

# sort list of columns to remove
SORT_COLS_TO_REMOVE=($(IFS=$'\n'  sort -n <<< "${COLS_TO_REMOVE[*]}"))
echo "SORT_COLS_TO_REMOVE" ${SORT_COLS_TO_REMOVE[@]}
echo ""

## 3) create -f field string to pass to cut

# the cut string always starts with "-f1-" because the first column will never be cut
# the code works by taking the column number to remove and creating the integer +1 and -1
# the +/-1 values are seperated by a comma and added to the cut -f string
# if the column to remove is 4 3,5 is added to the cut string to make -f1-3,5

# additionally, the +1 value is checked to make sure it is not also on the list of
# columns to remove

# initalize
CUT_STRING='-f1-'; CUT_COUNTER='0'; SERIES='0'

# loop on list of columns to remove
for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}"
do
   # set the position of the next item in the remove list
   let "POS_CHECK=$CUT_COUNTER+1"
   # set the column number of the next column
   let "NEXT_VALUE=$CUT_COLUMN+1"
   # find the column number of the next colum in the remove list
   CHECK_NEXT=${SORT_COLS_TO_REMOVE[$POS_CHECK]}

   # if the next col in sequence is not the next col on the remove list
   if [ "$NEXT_VALUE" != "$CHECK_NEXT" ]; then

      # if we are not in a series, create col number before and after
      if [ "$SERIES" == "0" ]; then
         let "REMOVE_m1=$CUT_COLUMN-1"
         let "REMOVE_p1=$CUT_COLUMN+1"
         # add col before and after to cut string with comma seperation
         # if col 4 is to be cut, creates 3,5
         CUT_STRING=$CUT_STRING$REMOVE_m1','$REMOVE_p1'-'

      # if we are in a series, we arrive here when the end of the series
      # has been identified, create series end bracket by adding 1 to current col position
      else
         let "SERIES_END_BRACKET=$CUT_COLUMN+1"
         # add col before and after to cut string with comma seperation
         # if series to cut is 13 14 15, creates 12,16
         CUT_STRING=$CUT_STRING$SERIES_START_BRACKET','$SERIES_END_BRACKET
         # series is complete, so deactivate series
         SERIES='0'
      fi

   # if the next col in sequence is the next col on the remove list, in a series
   else
      # if series is not active, start series by storing current position -1
      if [ "$SERIES" == "0" ]; then
         let "REMOVE_m1=$CUT_COLUMN-1"
         SERIES_START_BRACKET=$REMOVE_m1
         # set marker that series is active
         SERIES='1'
      fi
   fi

   # increment position
   ((CUT_COUNTER++))

# for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}" done
done

# if the last character is not -, add
if [ "${CUT_STRING:$i:-1}" != "-" ]; then
   # add trailing - to cut string, check if this works in all cases
   CUT_STRING=$CUT_STRING'-'
fi
echo "CUT_STRING" $CUT_STRING
echo ""

# at this point, a -f cut argument string has been created, if the cut list was 4 13 14 16,
# the cut string is -f1-3,5-12,16- which gives the desired outcome


## 4) create modified files with the cut columns removed
# remove columns from base input files based on LIST_TO_REMOVE

# list of input files to modify
BASE_INPUT_FILE_LIST=($(ls './'$SET'/input_data/base/'*'_'$SET'_'*'.txt'))
# directory to write modified files
MOD_INPUT_FILE_DIR=$(ls -d './'$SET'/input_data/')

echo "creating modified input files"
# create coppies of base input with identified columns removed
for BASE_INPUT_FILE in "${BASE_INPUT_FILE_LIST[@]}"
do
   # create path for revised file
   REVISED_FILE=$(echo $BASE_INPUT_FILE | awk 'BEGIN {FS="/"} {print $5}')
   REVISED_FILE='./'$SET'/input_data/'$REVISED_FILE
    # remove columns from file as specified in CUT_STRING
   cut --output-delimiter=$'\t' $CUT_STRING  $BASE_INPUT_FILE > $REVISED_FILE
done

Sorry for the very long script. This seems to work and is reasonably fast.

Code:

real    0m0.437s
user    0m0.253s
sys     0m0.011s

It's hard to know that step 3 has been coded in a manner that will work for all possibilities. That is the kind of algorithm that is nothing but a box of bear traps.

I have attached a set of test files in case anyone would rather test than read the above. Just extract the tar.gz and run cut_columns.sh. You may have to make the script executable. You can play around with the files that are removed by modifying remove_list.txt.

LMHmedchem

cut_script_test.tar.gz (2.17 MB)

Last edited by LMHmedchem; 01-17-2015 at 11:06 PM..

LMHmedchem

View Public Profile for LMHmedchem

Find all posts by LMHmedchem

9 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Merging two files based on two columns to make a third file

Hi there, I'm trying to merge two files and make a third file. However, two of the columns need to match exactly in both files AND I want everything from both files in the output if the two columns match in that row. First file looks like this: chr1 10001980 T A Second...

2. Shell Programming and Scripting

Copy and Paste Columns in a Tab-Limited Text file

I have this text file with a very large number of columns (10,000+) and I want to move the first column to the position of the six column so that the text file looks like this: Before cutting and pasting ID Family Mother Father Trait Phenotype aaa bbb ...

3. UNIX for Dummies Questions & Answers

Extracting rows from a text file based on the values of two columns (given ranges)

Hi, I have a tab delimited text file with multiple columns. The second and third columns include numbers that have not been sorted. I want to extract rows where the second column includes a value between -0.01 and 0.01 (including both numbers) and the first third column includes a value between...

4. Shell Programming and Scripting

Reading columns from a text file and to make an array for each column

Hi, I am not so familiar with bash scripting and would appreciate your help here. I have a text file 'input.txt' like this: 2 3 4 5 6 7 8 9 10 I want to store each column in an array like this a ={2 5 8}, b={3 6 9}, c={4 7 10} so that i can access any element, e.g b=6 for the later use.

5. Shell Programming and Scripting

Extract columns based on header

Hi to all, I have two files. File1 has no header, two columns: sample1 A sample2 B sample3 B sample4 C sample5 A sample6 D sample7 D File2 has a header, except for the first 3 columns (chr,start,end). "sample1" is the header for the 4th ,5th ,6th columns, "sample2" is the header...

6. Emergency UNIX and Linux Support

Average columns based on header name

Hi Friends, I have files with columns like this. This sample input below is partial. Please check below for main file link. Each file will have only two rows. ...

7. UNIX for Beginners Questions & Answers

Keep only columns in first two rows based on partial header pattern.

I have this code below that only prints out certain columns from the first two rows (doesn't affect rows 3 and beyond). How can I do the same on a partial header pattern �G_TP� instead of having to know specific column numbers (e.g. 374-479)? I've tried many other commands within this pipe with no...

8. Shell Programming and Scripting

Find columns in a file based on header and print to new file

Hello, I have to fish out some specific columns from a file based on the header value. I have the list of columns I need in a different file. I thought I could read in the list of headers I need, # file with header names of required columns in required order headers_file=$2 # read contents...

9. Shell Programming and Scripting

Find header in a text file and prepend it to all lines until another header is found

I've been struggling with this one for quite a while and cannot seem to find a solution for this find/replace scenario. Perhaps I'm getting rusty. I have a file that contains a number of metrics (exactly 3 fields per line) from a few appliances that are collected in parallel. To identify the...