Make copy of text file with columns removed (based on header)

01-14-2015

Hello,

I have some tab-delimited text files with three header rows. The headers look like this (sorry the tabs look so messy).

Code:

index	group	Name	input	input	input	input	input	input	input	input	input	input	input
int	char	string	double	double	double	double	double	double	double	double	double	double	double
id	group	Name	AtR_Ptb_L	flatness	inv_dx2	rvalHyd	sumLip	xv0	dxv1	Gmax	k2	Spyridin_N	Salph_N

The files could have any number of columns. What I need to do is simple. I just need to copy the file with one or more columns removed. The columns to be removed would be specified by the value in the third row. For example, I could want the files with columns "dxv1" and "k2" removed.

Code:

index	group	Name	input	input	input	input	input	input	input	input	input
int	char	string	double	double	double	double	double	double	double	double	double
id	group	Name	AtR_Ptb_L	flatness	inv_dx2	rvalHyd	sumLip	xv0	Gmax	Spyridin_N	Salph_N

The order of the remaining columns should be the same. It doesn't matter how the list of columns to be removed is formatted. It can be any kind of list.

I have read posts about how to copy specific columns with cut or awk, but not how to skip specific cols and copy everything else. One thing to do would be to find the position of the cols to be removed and use cut, but how to set that up to work in a general implementation is a bit unclear to me. I also suspect that awk would be more efficient.

Any suggestions?

LMHmedchem

01-14-2015

Hello LMHmedchem,

The following may help.
1st: For any line:

Code:

awk -vs1="dxv1" -vs2="k2" '{for(i=1;i<=NF;i++){if($i==s1 || $i==s2){$i=""}}} 1' Input_file

Output will be as follows.

Code:

index   group   Name    input   input   input   input   input   input   input   input   input   input   input
int     char    string  double  double  double  double  double  double  double  double  double  double  double
id group Name AtR_Ptb_L flatness inv_dx2 rvalHyd sumLip xv0  Gmax  Spyridin_N Salph_N

2nd: For the 3rd line only:

Code:

awk -vs1="dxv1" -vs2="k2" 'NR==3{for(i=1;i<=NF;i++){if($i==s1 || $i==s2){$i=""}}} 1'  Input_file

Output will be as follows.

Code:

index   group   Name    input   input   input   input   input   input   input   input   input   input   input
int     char    string  double  double  double  double  double  double  double  double  double  double  double
id group Name AtR_Ptb_L flatness inv_dx2 rvalHyd sumLip xv0  Gmax  Spyridin_N Salph_N

Thanks,
R. Singh

RavinderSingh13

01-14-2015

If you want those fields removed in every record, not just the header, try:

Code:

awk     'NR==3          {MX=split (RM, T, " ")
                         for (i=1; i<=NF; i++)
                             for (n=1; n<=MX; n++)
                                 if ($i==T[n]) T[n]=i
                        }
         !(NR%3)        {for (n=1; n<=MX; n++) $(T[n])=""
                         $0=$0; $1=$1
                        }
         1
        ' FS="\t+" OFS="\t" RM="dxv1 k2" file

RudiC

01-15-2015

Quote:

Originally Posted by RudiC

If you want those fields removed in every record, not just the header, try:

Code:

awk     'NR==3          {MX=split (RM, T, " ")
                         for (i=1; i<=NF; i++)
                             for (n=1; n<=MX; n++)
                                 if ($i==T[n]) T[n]=i
                        }
         !(NR%3)        {for (n=1; n<=MX; n++) $(T[n])=""
                         $0=$0; $1=$1
                        }
         1
        ' FS="\t+" OFS="\t" RM="dxv1 k2" file

This approach does not seem to work. The input and output files still have the same number of columns. The values dxv1 and k2 have been removed from the third row, but it looks like for the rest of the file, one column has been removed from every third row instead of the entire column being removed.

I have attached the original file,
original_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

the file as modified by the code above,
modified_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

and the output I was trying to create,
intended_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

The method posted by RavinderSingh13 modifies the third row, but not the rest of the file.

This code does what I want,

Code:

# assign value of header for column to be removed
REMOVE='dxv1'
# assign data input file for $FOLD
BASE_INPUT_FILE_LIST=($(ls './'$SET'/input_data/base/'$FOLD'_'*'_'$SET'_'*'.txt'))
# assign modified input file directory
MOD_INPUT_FILE_DIR=$(ls -d './'$SET'/input_data/')
echo $MOD_INPUT_FILE_DIR

for BASE_INPUT_FILE in "${BASE_INPUT_FILE_LIST[@]}"
do
   echo $BASE_INPUT_FILE
   # change path to filename
   REVISED_FILE=$(echo $BASE_INPUT_FILE | awk 'BEGIN {FS="/"} {print $5}')
   REVISED_FILE='./'$SET'/input_data/'$REVISED_FILE
   echo $REVISED_FILE

   # find the location of the column to be removed
   HEADER_ROW_LIST=($(sed -n '3p' "$BASE_INPUT_FILE"))
   ELEMENT_COUNTER='0';  HEADER_POSITION='0'

   # loop through headers
   for HEADER_ROW in "${HEADER_ROW_LIST[@]}"
   do
      # increment counter
      (( ELEMENT_COUNTER++ ))
      echo $HEADER_ROW
      if [ "$HEADER_ROW" == "$REMOVE" ]; then
         echo "found remove at position" $ELEMENT_COUNTER
         HEADER_POSITION=$ELEMENT_COUNTER
      fi
   done
   echo $REMOVE "was found at position" $HEADER_POSITION

   # create values before and after position to be removed
   let "REMOVE_m1=$HEADER_POSITION-1";  let "REMOVE_p1=$HEADER_POSITION+1";

   echo "REMOVE_m1" $REMOVE_m1
   echo "REMOVE_p1" $REMOVE_p1

   # remove column from file
   cut --output-delimiter=$'\t' -f1-$REMOVE_m1,$REMOVE_p1-  "$BASE_INPUT_FILE" > "$REVISED_FILE"
done

This does not currently allow for more than one column to be removed, though the code could be called separately for each column.
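The header lookup in the loop above could also be done in a single awk call, which makes it easier to repeat for several column names. This is only a minimal sketch, assuming tab-delimited input with the column names on row 3; `col_pos` is a hypothetical helper name and not part of the script above.

```shell
# col_pos NAME FILE: print the 1-based position of the column whose
# row-3 header equals NAME (prints nothing if there is no match)
col_pos() {
    awk -F'\t' -v name="$1" '
        NR == 3 {
            for (i = 1; i <= NF; i++)
                if ($i == name) { print i; exit }
            exit    # no match on the header row, stop reading
        }' "$2"
}
```

Calling `col_pos dxv1 "$BASE_INPUT_FILE"` and then `col_pos k2 "$BASE_INPUT_FILE"` would give one position per header, which could then be collected into a list for cut.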

Thanks,

LMHmedchem

files.zip (662.9 KB)

Last edited by LMHmedchem; 01-15-2015 at 08:38 PM..

01-16-2015

Edit:
Code removed: not working right.

Last edited by ongoto; 01-16-2015 at 06:45 PM..

ongoto

01-17-2015

This is what I have at the moment,

Code:

#!/bin/bash

# file with list of columns to remove (by header name)
REMOVE_LIST_FILE='./remove_list.txt'
SET='A'

## 1) read list of columns to remove into array, skip comment lines
# clear array
unset LIST_TO_REMOVE
echo ""

# read from file
while IFS= read -r line
do
    # skip lines that start with '#'
    if [[ "$line" =~ ^# ]]; then
        echo "skipping comment line:" "$line"
    else
        echo "adding column header:" "$line"
        LIST_TO_REMOVE=("${LIST_TO_REMOVE[@]}" "$line")
    fi
done < "$REMOVE_LIST_FILE"

## 2) translate list of headers to list of column numbers
echo ""
echo "creating files without columns" ${LIST_TO_REMOVE[@]}
echo ""

# find position of column to remove based on the first input file
COLUMN_CHECK_FILE=$(ls './'$SET'/input_data/base/f0_'*'_'$SET'_'*'.txt')

# clear columns to be removed
unset COLS_TO_REMOVE

# find column number for column to remove
for CURRENT_REMOVE in "${LIST_TO_REMOVE[@]}"
do
   # grab header row (row 3) from column check file
   HEADER_ROW_LIST=($(sed -n '3p' "$COLUMN_CHECK_FILE"))
   ELEMENT_COUNTER='0';  HEADER_POSITION='0';

   # loop through headers
   for HEADER_ROW in "${HEADER_ROW_LIST[@]}"
   do
      # increment counter
      (( ELEMENT_COUNTER++ ))
      if [ "$HEADER_ROW" == "$CURRENT_REMOVE" ]; then
         HEADER_POSITION=$ELEMENT_COUNTER
         COLS_TO_REMOVE=("${COLS_TO_REMOVE[@]}" "$HEADER_POSITION")
      fi
   done
done

# sort list of columns to remove
SORT_COLS_TO_REMOVE=($(IFS=$'\n'  sort -n <<< "${COLS_TO_REMOVE[*]}"))
echo "SORT_COLS_TO_REMOVE" ${SORT_COLS_TO_REMOVE[@]}
echo ""

## 3) create -f field string to pass to cut

# the cut string always starts with "-f1-" because the first column will never be cut
# the code works by taking the column number to remove and creating the integers +1 and -1
# the +/-1 values are separated by a comma and added to the cut -f string
# if the column to remove is 4, "3,5-" is added to the cut string to make -f1-3,5-

# additionally, the +1 value is checked to make sure it is not also on the list of
# columns to remove

# initialize
CUT_STRING='-f1-'; CUT_COUNTER='0'; SERIES='0'

# loop on list of columns to remove
for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}"
do
   # set the position of the next item in the remove list
   let "POS_CHECK=$CUT_COUNTER+1"
   # set the column number of the next column
   let "NEXT_VALUE=$CUT_COLUMN+1"
   # find the column number of the next column in the remove list
   CHECK_NEXT=${SORT_COLS_TO_REMOVE[$POS_CHECK]}

   # if the next col in sequence is not the next col on the remove list
   if [ "$NEXT_VALUE" != "$CHECK_NEXT" ]; then

      # if we are not in a series, create col number before and after
      if [ "$SERIES" == "0" ]; then
         let "REMOVE_m1=$CUT_COLUMN-1"
         let "REMOVE_p1=$CUT_COLUMN+1"
         # add col before and after to cut string with comma separation
         # if col 4 is to be cut, creates 3,5-
         CUT_STRING=$CUT_STRING$REMOVE_m1','$REMOVE_p1'-'

      # if we are in a series, we arrive here when the end of the series
      # has been identified, create series end bracket by adding 1 to current col position
      else
         let "SERIES_END_BRACKET=$CUT_COLUMN+1"
         # add col before and after to cut string with comma separation
         # if series to cut is 13 14 15, creates 12,16-
         # the trailing - keeps the string valid if more columns follow the series
         CUT_STRING=$CUT_STRING$SERIES_START_BRACKET','$SERIES_END_BRACKET'-'
         # series is complete, so deactivate series
         SERIES='0'
      fi

   # if the next col in sequence is the next col on the remove list, in a series
   else
      # if series is not active, start series by storing current position -1
      if [ "$SERIES" == "0" ]; then
         let "REMOVE_m1=$CUT_COLUMN-1"
         SERIES_START_BRACKET=$REMOVE_m1
         # set marker that series is active
         SERIES='1'
      fi
   fi

   # increment position
   ((CUT_COUNTER++))

# for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}" done
done

# if the last character is not -, add
if [ "${CUT_STRING: -1}" != "-" ]; then
   # add trailing - to cut string, check if this works in all cases
   CUT_STRING=$CUT_STRING'-'
fi
echo "CUT_STRING" $CUT_STRING
echo ""

# at this point, a -f cut argument string has been created, if the cut list was 4 13 14 15,
# the cut string is -f1-3,5-12,16- which gives the desired outcome


## 4) create modified files with the cut columns removed
# remove columns from base input files based on LIST_TO_REMOVE

# list of input files to modify
BASE_INPUT_FILE_LIST=($(ls './'$SET'/input_data/base/'*'_'$SET'_'*'.txt'))
# directory to write modified files
MOD_INPUT_FILE_DIR=$(ls -d './'$SET'/input_data/')

echo "creating modified input files"
# create copies of base input with identified columns removed
for BASE_INPUT_FILE in "${BASE_INPUT_FILE_LIST[@]}"
do
   # create path for revised file
   REVISED_FILE=$(echo $BASE_INPUT_FILE | awk 'BEGIN {FS="/"} {print $5}')
   REVISED_FILE='./'$SET'/input_data/'$REVISED_FILE
   # remove columns from file as specified in CUT_STRING
   cut --output-delimiter=$'\t' "$CUT_STRING" "$BASE_INPUT_FILE" > "$REVISED_FILE"
done

Sorry for the very long script. This seems to work and is reasonably fast.

Code:

real    0m0.437s
user    0m0.253s
sys     0m0.011s

It's hard to know that step 3 has been coded in a manner that will work for all possibilities. That is the kind of algorithm that is nothing but a box of bear traps.
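One way to sidestep the range bookkeeping in step 3 might be to list every column to keep individually instead of merging them into ranges, since cut accepts a plain comma-separated field list. The sketch below assumes the total column count is known (for example from the row-3 header); `keep_list` is a hypothetical helper name, not something from the script above.

```shell
# keep_list TOTAL "N1 N2 ...": print the numbers 1..TOTAL, comma-separated,
# leaving out every column number given in the second argument
keep_list() {
    awk -v total="$1" -v rm="$2" 'BEGIN {
        n = split(rm, r, " ")
        for (i = 1; i <= n; i++) drop[r[i]] = 1   # mark columns to remove
        sep = ""
        for (c = 1; c <= total; c++)
            if (!(c in drop)) { printf "%s%d", sep, c; sep = "," }
        print ""
    }'
}
```

For the 14-column sample, something like `cut -f"$(keep_list 14 "10 12")" "$BASE_INPUT_FILE"` should then drop dxv1 and k2. The order of the removal list does not matter here, so the sort step and the series detection would no longer be needed.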

I have attached a set of test files in case anyone would rather test than read the above. Just extract the tar.gz and run cut_columns.sh. You may have to make the script executable. You can play around with the files that are removed by modifying remove_list.txt.

LMHmedchem

cut_script_test.tar.gz (2.17 MB)

Last edited by LMHmedchem; 01-17-2015 at 11:06 PM..

01-18-2015

Quote:

Originally Posted by LMHmedchem

This approach does not seem to work. The input and output files still have the same number of columns. The values dxv1 and k2 have been removed from the third row, but it looks like for the rest of the file, one column has been removed from every third row instead of the entire column being removed.

Because no sample data had been posted, I assumed that every record was spread over three lines, so the relevant values had to be removed from every third line...

Just remove the !(NR%3) to remove the columns in every line.
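With that condition dropped, the column positions are still only known once row 3 has been read, so as far as I can tell the first two header rows would pass through unchanged in a single pass. A hedged sketch of one way around that is to read the file twice: once to find the positions, once to print. `remove_cols` is a hypothetical name, and FS is set to a single tab here rather than "\t+" (an assumption, so that empty fields keep their place).

```shell
# remove_cols "NAME1 NAME2 ..." FILE: print FILE without the columns whose
# row-3 header matches one of the given names (tab-delimited input assumed)
remove_cols() {
    awk -v RM="$1" '
    BEGIN {
        FS = OFS = "\t"
        n = split(RM, names, " ")
        for (i = 1; i <= n; i++) want[names[i]] = 1
    }
    # first pass: only look at row 3 and remember which positions to drop
    NR == FNR {
        if (FNR == 3)
            for (i = 1; i <= NF; i++)
                if ($i in want) drop[i] = 1
        next
    }
    # second pass: rebuild every line from the columns that are kept
    {
        out = ""; sep = ""
        for (i = 1; i <= NF; i++)
            if (!(i in drop)) { out = out sep $i; sep = OFS }
        print out
    }' "$2" "$2"
}
```

It would be called as `remove_cols "dxv1 k2" file > newfile`, with the file name passed once and read twice by awk.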

RudiC
