Make copy of text file with columns removed (based on header)


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Make copy of text file with columns removed (based on header)
# 1  
Old 01-14-2015
Make copy of text file with columns removed (based on header)

Hello,

I have some tab delimited text files with a three header rows. The headers look like, (sorry the tabs look so messy).

Code:
index	group	Name	input	input	input	input	input	input	input	input	input	input	input
int	char	string	double	double	double	double	double	double	double	double	double	double	double
id	group	Name	AtR_Ptb_L	flatness	inv_dx2	rvalHyd	sumLip	xv0	dxv1	Gmax	k2	Spyridin_N	Salph_N

The files could have any number of columns. What I need to do is simple. I just need to copy the file with one or more columns removed. The columns to be removed would be specified by the value in the third row. For example, I could want the files with columns "dxv1" and "k2" removed.
Code:
index	group	Name	input	input	input	input	input	input	input	input	input
int	char	string	double	double	double	double	double	double	double	double	double
id	group	Name	AtR_Ptb_L	flatness	inv_dx2	rvalHyd	sumLip	xv0	Gmax	Spyridin_N	Salph_N

The order of the remaining columns should be the same. It doesn't matter how the list of columns to be removed is formatted. It can be any kind of list.

I have read posts about how to copy specific columns with cut or awk, but not how to skip specific cols and copy everything else. One thing to do would be to find the position of the cols to be removed and use cut, but how to set that up to work in a general implementation is a bit unclear to me. I also suspect that awk would be more efficient.

Any suggestions?

LMHmedchem
# 2  
Old 01-14-2015
Hello LMHmedchem,

Following may help you in same.
1st: For any line:
Code:
awk -vs1="dxv1" -vs2="k2" '{for(i=1;i<=NF;i++){if($i==s1 || $i==s2){$i=""}}} 1' Input_file

Output will be as follows.
Code:
index   group   Name    input   input   input   input   input   input   input   input   input   input   input
int     char    string  double  double  double  double  double  double  double  double  double  double  double
id group Name AtR_Ptb_L flatness inv_dx2 rvalHyd sumLip xv0  Gmax  Spyridin_N Salph_N

2nd: If for only 3rd line:
Code:
awk -vs1="dxv1" -vs2="k2" 'NR==3{for(i=1;i<=NF;i++){if($i==s1 || $i==s2){$i=""}}} 1'  Input_file

Output will be as follows.
Code:
index   group   Name    input   input   input   input   input   input   input   input   input   input   input
int     char    string  double  double  double  double  double  double  double  double  double  double  double
id group Name AtR_Ptb_L flatness inv_dx2 rvalHyd sumLip xv0  Gmax  Spyridin_N Salph_N

Thanks,
R. Singh
This User Gave Thanks to RavinderSingh13 For This Post:
# 3  
Old 01-14-2015
If you want those fields removed in every record, not just the header, try:
Code:
awk     'NR==3          {MX=split (RM, T, " ")
                         for (i=1; i<=NF; i++)
                             for (n=1; n<=MX; n++)
                                 if ($i==T[n]) T[n]=i
                        }
         !(NR%3)        {for (n=1; n<=MX; n++) $(T[n])=""
                         $0=$0; $1=$1
                        }
         1
        ' FS="\t+" OFS="\t" RM="dxv1 k2" file

This User Gave Thanks to RudiC For This Post:
# 4  
Old 01-15-2015
Quote:
Originally Posted by RudiC
If you want those fields removed in every record, not just the header, try:
Code:
awk     'NR==3          {MX=split (RM, T, " ")
                         for (i=1; i<=NF; i++)
                             for (n=1; n<=MX; n++)
                                 if ($i==T[n]) T[n]=i
                        }
         !(NR%3)        {for (n=1; n<=MX; n++) $(T[n])=""
                         $0=$0; $1=$1
                        }
         1
        ' FS="\t+" OFS="\t" RM="dxv1 k2" file

This approach does not seem to work. The input and output files still have the same number of columns. The values dxv1 and k2 have been removed from the third row, but it looks like for the rest of the file, one column has been removed from every third row instead of the entire column being removed.

I have attached the original file,
original_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

the file as modified by the code above,
modified_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

and the output I was trying to create,
intended_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt

The method posted by RavinderSingh13 modifies the third row, but not the rest of the file.

This code does what I want,
Code:
# assign value of header for column to be removed
REMOVE='dxv1'
# assign data input file for $FOLD
BASE_INPUT_FILE_LIST=($(ls './'$SET'/input_data/base/'$FOLD'_'*'_'$SET'_'*'.txt'))
# assign modified input file directory
MOD_INPUT_FILE_DIR=$(ls -d './'$SET'/input_data/')
echo $MOD_INPUT_FILE_DIR

for BASE_INPUT_FILE in "${BASE_INPUT_FILE_LIST[@]}"
do
   echo $BASE_INPUT_FILE
   # change path to filename
   REVISED_FILE=$(echo $BASE_INPUT_FILE | awk 'BEGIN {FS="/"} {print $5}')
   REVISED_FILE='./'$SET'/input_data/'$REVISED_FILE
   echo $REVISED_FILE

   # find the location of the column to be removed
   HEADER_ROW_LIST=($(cat $BASE_INPUT_FILE | sed -n '3p'))
   ELEMENT_COUNTER='0';  HEADER_POSITION='0'

   # loop through headers
   for HEADER_ROW in "${HEADER_ROW_LIST[@]}"
   do
      # incrementer counter
      (( ELEMENT_COUNTER++ ))
      echo $HEADER_ROW
      if [ "$HEADER_ROW" == "$REMOVE" ]; then
         echo "found remove at position" $ELEMENT_COUNTER
         HEADER_POSITION=$ELEMENT_COUNTER
      fi
   done
   echo $REMOVE "was found at position" $HEADER_POSITION

   # create values before and after position to be removed
   let "REMOVE_m1=$HEADER_POSITION-1";  let "REMOVE_p1=$HEADER_POSITION+1";

   echo "REMOVE_m1" $REMOVE_m1
   echo "REMOVE_p1" $REMOVE_p1

   # remove column from file
   cut --output-delimiter=$'\t' -f1-$REMOVE_m1,$REMOVE_p1-  $BASE_INPUT_FILE > $REVISED_FILE

This does not currently allow for more than one column to be removed, though the code could be called separately for each column.

Thanks,

LMHmedchem

Last edited by LMHmedchem; 01-15-2015 at 08:38 PM..
# 5  
Old 01-16-2015
Edit:
Code removed: not working right.

Last edited by ongoto; 01-16-2015 at 06:45 PM..
This User Gave Thanks to ongoto For This Post:
# 6  
Old 01-17-2015
This is what I have at the moment,

Code:
#!/bin/sh

# file with list of colums to remove (by header name)
REMOVE_LIST_FILE='./remove_list.txt'
SET='A'

## 1) read list of columns to remove into array, skip comment lines
# clear array
unset LIST_TO_REMOVE
echo ""

# read from file
while IFS=$'\n' read line
do
    if [[ "$line" =~ \#.* ]];then
        echo "skipping comment line:" $line
    else
        echo "adding column header:" $line
        LIST_TO_REMOVE=("${LIST_TO_REMOVE[@]}" "$line")
    fi
done < $REMOVE_LIST_FILE

## 2) translate list of headers to list of column numbers
echo ""
echo "creating files without columns" ${LIST_TO_REMOVE[@]}
echo ""

# find position of column to remove based on the first input file
COLUMN_CHECK_FILE=$(ls './'$SET'/input_data/base/f0_'*'_'$SET'_'*'.txt')

# clear columns to be removed
unset COLS_TO_REMOVE

# find column number for column to remove
for CURRENT_REMOVE in "${LIST_TO_REMOVE[@]}"
do
   # grab header row (row 3) from column check file
   HEADER_ROW_LIST=($(cat $COLUMN_CHECK_FILE | sed -n '3p'))
   ELEMENT_COUNTER='0';  HEADER_POSITION='0';

# loop through headers
for HEADER_ROW in "${HEADER_ROW_LIST[@]}"
do
      # incrementer counter
      (( ELEMENT_COUNTER++ ))
      if [ "$HEADER_ROW" == "$CURRENT_REMOVE" ]; then
         HEADER_POSITION=$ELEMENT_COUNTER
         COLS_TO_REMOVE=("${COLS_TO_REMOVE[@]}" "$HEADER_POSITION")
      fi
   done
done

# sort list of columns to remove
SORT_COLS_TO_REMOVE=($(IFS=$'\n'  sort -n <<< "${COLS_TO_REMOVE[*]}"))
echo "SORT_COLS_TO_REMOVE" ${SORT_COLS_TO_REMOVE[@]}
echo ""

## 3) create -f field string to pass to cut

# the cut string always starts with "-f1-" because the first column will never be cut
# the code works by taking the column number to remove and creating the integer +1 and -1
# the +/-1 values are seperated by a comma and added to the cut -f string
# if the column to remove is 4 3,5 is added to the cut string to make -f1-3,5

# additionally, the +1 value is checked to make sure it is not also on the list of
# columns to remove

# initalize
CUT_STRING='-f1-'; CUT_COUNTER='0'; SERIES='0'

# loop on list of columns to remove
for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}"
do
   # set the position of the next item in the remove list
   let "POS_CHECK=$CUT_COUNTER+1"
   # set the column number of the next column
   let "NEXT_VALUE=$CUT_COLUMN+1"
   # find the column number of the next colum in the remove list
   CHECK_NEXT=${SORT_COLS_TO_REMOVE[$POS_CHECK]}

   # if the next col in sequence is not the next col on the remove list
   if [ "$NEXT_VALUE" != "$CHECK_NEXT" ]; then

      # if we are not in a series, create col number before and after
      if [ "$SERIES" == "0" ]; then
         let "REMOVE_m1=$CUT_COLUMN-1"
         let "REMOVE_p1=$CUT_COLUMN+1"
         # add col before and after to cut string with comma seperation
         # if col 4 is to be cut, creates 3,5
         CUT_STRING=$CUT_STRING$REMOVE_m1','$REMOVE_p1'-'

      # if we are in a series, we arrive here when the end of the series
      # has been identified, create series end bracket by adding 1 to current col position
      else
         let "SERIES_END_BRACKET=$CUT_COLUMN+1"
         # add col before and after to cut string with comma seperation
         # if series to cut is 13 14 15, creates 12,16
         CUT_STRING=$CUT_STRING$SERIES_START_BRACKET','$SERIES_END_BRACKET
         # series is complete, so deactivate series
         SERIES='0'
      fi

   # if the next col in sequence is the next col on the remove list, in a series
   else
      # if series is not active, start series by storing current position -1
      if [ "$SERIES" == "0" ]; then
         let "REMOVE_m1=$CUT_COLUMN-1"
         SERIES_START_BRACKET=$REMOVE_m1
         # set marker that series is active
         SERIES='1'
      fi
   fi

   # increment position
   ((CUT_COUNTER++))

# for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}" done
done

# if the last character is not -, add
if [ "${CUT_STRING:$i:-1}" != "-" ]; then
   # add trailing - to cut string, check if this works in all cases
   CUT_STRING=$CUT_STRING'-'
fi
echo "CUT_STRING" $CUT_STRING
echo ""

# at this point, a -f cut argument string has been created, if the cut list was 4 13 14 16,
# the cut string is -f1-3,5-12,16- which gives the desired outcome


## 4) create modified files with the cut columns removed
# remove columns from base input files based on LIST_TO_REMOVE

# list of input files to modify
BASE_INPUT_FILE_LIST=($(ls './'$SET'/input_data/base/'*'_'$SET'_'*'.txt'))
# directory to write modified files
MOD_INPUT_FILE_DIR=$(ls -d './'$SET'/input_data/')

echo "creating modified input files"
# create coppies of base input with identified columns removed
for BASE_INPUT_FILE in "${BASE_INPUT_FILE_LIST[@]}"
do
   # create path for revised file
   REVISED_FILE=$(echo $BASE_INPUT_FILE | awk 'BEGIN {FS="/"} {print $5}')
   REVISED_FILE='./'$SET'/input_data/'$REVISED_FILE
    # remove columns from file as specified in CUT_STRING
   cut --output-delimiter=$'\t' $CUT_STRING  $BASE_INPUT_FILE > $REVISED_FILE
done

Sorry for the very long script. This seems to work and is reasonably fast.
Code:
real    0m0.437s
user    0m0.253s
sys     0m0.011s

It's hard to know that step 3 has been coded in a manner that will work for all possibilities. That is the kind of algorithm that is nothing but a box of bear traps.

I have attached a set of test files in case anyone would rather test than read the above. Just extract the tar.gz and run cut_columns.sh. You may have to make the script executable. You can play around with the files that are removed by modifying remove_list.txt.

LMHmedchem

Last edited by LMHmedchem; 01-17-2015 at 11:06 PM..
# 7  
Old 01-18-2015
Quote:
Originally Posted by LMHmedchem
This approach does not seem to work. The input and output files still have the same number of columns. The values dxv1 and k2 have been removed from the third row, but it looks like for the rest of the file, one column has been removed from every third row instead of the entire column being removed.
Due to missing samples, the assumption was every record is spread over three lines, so the relevant values had to be removed in the third lines...

Just remove the !(NR%3) to remove the columns in every line.
This User Gave Thanks to RudiC For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Find header in a text file and prepend it to all lines until another header is found

I've been struggling with this one for quite a while and cannot seem to find a solution for this find/replace scenario. Perhaps I'm getting rusty. I have a file that contains a number of metrics (exactly 3 fields per line) from a few appliances that are collected in parallel. To identify the... (3 Replies)
Discussion started by: verdepollo
3 Replies

2. Shell Programming and Scripting

Find columns in a file based on header and print to new file

Hello, I have to fish out some specific columns from a file based on the header value. I have the list of columns I need in a different file. I thought I could read in the list of headers I need, # file with header names of required columns in required order headers_file=$2 # read contents... (11 Replies)
Discussion started by: LMHmedchem
11 Replies

3. UNIX for Beginners Questions & Answers

Keep only columns in first two rows based on partial header pattern.

I have this code below that only prints out certain columns from the first two rows (doesn't affect rows 3 and beyond). How can I do the same on a partial header pattern “G_TP” instead of having to know specific column numbers (e.g. 374-479)? I've tried many other commands within this pipe with no... (4 Replies)
Discussion started by: aachave1
4 Replies

4. Emergency UNIX and Linux Support

Average columns based on header name

Hi Friends, I have files with columns like this. This sample input below is partial. Please check below for main file link. Each file will have only two rows. ... (8 Replies)
Discussion started by: jacobs.smith
8 Replies

5. Shell Programming and Scripting

Extract columns based on header

Hi to all, I have two files. File1 has no header, two columns: sample1 A sample2 B sample3 B sample4 C sample5 A sample6 D sample7 D File2 has a header, except for the first 3 columns (chr,start,end). "sample1" is the header for the 4th ,5th ,6th columns, "sample2" is the header... (4 Replies)
Discussion started by: aec
4 Replies

6. Shell Programming and Scripting

Reading columns from a text file and to make an array for each column

Hi, I am not so familiar with bash scripting and would appreciate your help here. I have a text file 'input.txt' like this: 2 3 4 5 6 7 8 9 10 I want to store each column in an array like this a ={2 5 8}, b={3 6 9}, c={4 7 10} so that i can access any element, e.g b=6 for the later use. (1 Reply)
Discussion started by: Asif Siddique
1 Replies

7. UNIX for Dummies Questions & Answers

Extracting rows from a text file based on the values of two columns (given ranges)

Hi, I have a tab delimited text file with multiple columns. The second and third columns include numbers that have not been sorted. I want to extract rows where the second column includes a value between -0.01 and 0.01 (including both numbers) and the first third column includes a value between... (1 Reply)
Discussion started by: evelibertine
1 Replies

8. Shell Programming and Scripting

Copy and Paste Columns in a Tab-Limited Text file

I have this text file with a very large number of columns (10,000+) and I want to move the first column to the position of the six column so that the text file looks like this: Before cutting and pasting ID Family Mother Father Trait Phenotype aaa bbb ... (5 Replies)
Discussion started by: evelibertine
5 Replies

9. UNIX for Dummies Questions & Answers

Merging two files based on two columns to make a third file

Hi there, I'm trying to merge two files and make a third file. However, two of the columns need to match exactly in both files AND I want everything from both files in the output if the two columns match in that row. First file looks like this: chr1 10001980 T A Second... (12 Replies)
Discussion started by: infiniteabyss
12 Replies
Login or Register to Ask a Question