Make copy of text file with columns removed (based on header)
Hello,
I have some tab delimited text files with a three header rows. The headers look like, (sorry the tabs look so messy).
The files could have any number of columns. What I need to do is simple. I just need to copy the file with one or more columns removed. The columns to be removed would be specified by the value in the third row. For example, I could want the files with columns "dxv1" and "k2" removed.
The order of the remaining columns should be the same. It doesn't matter how the list of columns to be removed is formatted. It can be any kind of list.
I have read posts about how to copy specific columns with cut or awk, but not how to skip specific cols and copy everything else. One thing to do would be to find the position of the cols to be removed and use cut, but how to set that up to work in a general implementation is a bit unclear to me. I also suspect that awk would be more efficient.
If you want those fields removed in every record, not just the header, try:
This approach does not seem to work. The input and output files still have the same number of columns. The values dxv1 and k2 have been removed from the third row, but it looks like for the rest of the file, one column has been removed from every third row instead of the entire column being removed.
I have attached the original file, original_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt
the file as modified by the code above, modified_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt
and the output I was trying to create, intended_f0_RSV_1912_A_S1v6_RI7_1916_15-01-10.txt
The method posted by RavinderSingh13 modifies the third row, but not the rest of the file.
This code does what I want,
This does not currently allow for more than one column to be removed, though the code could be called separately for each column.
#!/bin/sh
# file with list of colums to remove (by header name)
REMOVE_LIST_FILE='./remove_list.txt'
SET='A'
## 1) read list of columns to remove into array, skip comment lines
# clear array
unset LIST_TO_REMOVE
echo ""
# read from file
while IFS=$'\n' read line
do
if [[ "$line" =~ \#.* ]];then
echo "skipping comment line:" $line
else
echo "adding column header:" $line
LIST_TO_REMOVE=("${LIST_TO_REMOVE[@]}" "$line")
fi
done < $REMOVE_LIST_FILE
## 2) translate list of headers to list of column numbers
echo ""
echo "creating files without columns" ${LIST_TO_REMOVE[@]}
echo ""
# find position of column to remove based on the first input file
COLUMN_CHECK_FILE=$(ls './'$SET'/input_data/base/f0_'*'_'$SET'_'*'.txt')
# clear columns to be removed
unset COLS_TO_REMOVE
# find column number for column to remove
for CURRENT_REMOVE in "${LIST_TO_REMOVE[@]}"
do
# grab header row (row 3) from column check file
HEADER_ROW_LIST=($(cat $COLUMN_CHECK_FILE | sed -n '3p'))
ELEMENT_COUNTER='0'; HEADER_POSITION='0';
# loop through headers
for HEADER_ROW in "${HEADER_ROW_LIST[@]}"
do
# incrementer counter
(( ELEMENT_COUNTER++ ))
if [ "$HEADER_ROW" == "$CURRENT_REMOVE" ]; then
HEADER_POSITION=$ELEMENT_COUNTER
COLS_TO_REMOVE=("${COLS_TO_REMOVE[@]}" "$HEADER_POSITION")
fi
done
done
# sort list of columns to remove
SORT_COLS_TO_REMOVE=($(IFS=$'\n' sort -n <<< "${COLS_TO_REMOVE[*]}"))
echo "SORT_COLS_TO_REMOVE" ${SORT_COLS_TO_REMOVE[@]}
echo ""
## 3) create -f field string to pass to cut
# the cut string always starts with "-f1-" because the first column will never be cut
# the code works by taking the column number to remove and creating the integer +1 and -1
# the +/-1 values are seperated by a comma and added to the cut -f string
# if the column to remove is 4 3,5 is added to the cut string to make -f1-3,5
# additionally, the +1 value is checked to make sure it is not also on the list of
# columns to remove# initalize
CUT_STRING='-f1-'; CUT_COUNTER='0'; SERIES='0'
# loop on list of columns to remove
for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}"
do
# set the position of the next item in the remove list
let "POS_CHECK=$CUT_COUNTER+1"
# set the column number of the next column
let "NEXT_VALUE=$CUT_COLUMN+1"
# find the column number of the next colum in the remove list
CHECK_NEXT=${SORT_COLS_TO_REMOVE[$POS_CHECK]}
# if the next col in sequence is not the next col on the remove list
if [ "$NEXT_VALUE" != "$CHECK_NEXT" ]; then
# if we are not in a series, create col number before and after
if [ "$SERIES" == "0" ]; then
let "REMOVE_m1=$CUT_COLUMN-1"
let "REMOVE_p1=$CUT_COLUMN+1"
# add col before and after to cut string with comma seperation
# if col 4 is to be cut, creates 3,5
CUT_STRING=$CUT_STRING$REMOVE_m1','$REMOVE_p1'-'
# if we are in a series, we arrive here when the end of the series
# has been identified, create series end bracket by adding 1 to current col position
else
let "SERIES_END_BRACKET=$CUT_COLUMN+1"
# add col before and after to cut string with comma seperation
# if series to cut is 13 14 15, creates 12,16
CUT_STRING=$CUT_STRING$SERIES_START_BRACKET','$SERIES_END_BRACKET
# series is complete, so deactivate series
SERIES='0'
fi
# if the next col in sequence is the next col on the remove list, in a series
else
# if series is not active, start series by storing current position -1
if [ "$SERIES" == "0" ]; then
let "REMOVE_m1=$CUT_COLUMN-1"
SERIES_START_BRACKET=$REMOVE_m1
# set marker that series is active
SERIES='1'
fi
fi
# increment position
((CUT_COUNTER++))
# for CUT_COLUMN in "${SORT_COLS_TO_REMOVE[@]}" done
done
# if the last character is not -, add
if [ "${CUT_STRING:$i:-1}" != "-" ]; then
# add trailing - to cut string, check if this works in all cases
CUT_STRING=$CUT_STRING'-'
fi
echo "CUT_STRING" $CUT_STRING
echo ""
# at this point, a -f cut argument string has been created, if the cut list was 4 13 14 16,
# the cut string is -f1-3,5-12,16- which gives the desired outcome## 4) create modified files with the cut columns removed
# remove columns from base input files based on LIST_TO_REMOVE# list of input files to modify
BASE_INPUT_FILE_LIST=($(ls './'$SET'/input_data/base/'*'_'$SET'_'*'.txt'))
# directory to write modified files
MOD_INPUT_FILE_DIR=$(ls -d './'$SET'/input_data/')
echo "creating modified input files"
# create coppies of base input with identified columns removed
for BASE_INPUT_FILE in "${BASE_INPUT_FILE_LIST[@]}"
do
# create path for revised file
REVISED_FILE=$(echo $BASE_INPUT_FILE | awk 'BEGIN {FS="/"} {print $5}')
REVISED_FILE='./'$SET'/input_data/'$REVISED_FILE
# remove columns from file as specified in CUT_STRING
cut --output-delimiter=$'\t' $CUT_STRING $BASE_INPUT_FILE > $REVISED_FILE
done
Sorry for the very long script. This seems to work and is reasonably fast.
Code:
real 0m0.437s
user 0m0.253s
sys 0m0.011s
It's hard to know that step 3 has been coded in a manner that will work for all possibilities. That is the kind of algorithm that is nothing but a box of bear traps.
I have attached a set of test files in case anyone would rather test than read the above. Just extract the tar.gz and run cut_columns.sh. You may have to make the script executable. You can play around with the files that are removed by modifying remove_list.txt.
This approach does not seem to work. The input and output files still have the same number of columns. The values dxv1 and k2 have been removed from the third row, but it looks like for the rest of the file, one column has been removed from every third row instead of the entire column being removed.
Due to missing samples, the assumption was every record is spread over three lines, so the relevant values had to be removed in the third lines...
Just remove the !(NR%3) to remove the columns in every line.
I've been struggling with this one for quite a while and cannot seem to find a solution for this find/replace scenario. Perhaps I'm getting rusty.
I have a file that contains a number of metrics (exactly 3 fields per line) from a few appliances that are collected in parallel. To identify the... (3 Replies)
Hello,
I have to fish out some specific columns from a file based on the header value. I have the list of columns I need in a different file. I thought I could read in the list of headers I need,
# file with header names of required columns in required order
headers_file=$2
# read contents... (11 Replies)
I have this code below that only prints out certain columns from the first two rows (doesn't affect rows 3 and beyond). How can I do the same on a partial header pattern “G_TP” instead of having to know specific column numbers (e.g. 374-479)? I've tried many other commands within this pipe with no... (4 Replies)
Hi Friends,
I have files with columns like this. This sample input below is partial.
Please check below for main file link. Each file will have only two rows.
... (8 Replies)
Hi to all,
I have two files. File1 has no header, two columns:
sample1 A
sample2 B
sample3 B
sample4 C
sample5 A
sample6 D
sample7 D
File2 has a header, except for the first 3 columns (chr,start,end). "sample1" is the header for the 4th ,5th ,6th columns, "sample2" is the header... (4 Replies)
Hi,
I am not so familiar with bash scripting and would appreciate your help here.
I have a text file 'input.txt' like this:
2 3 4
5 6 7
8 9 10
I want to store each column in an array like this
a ={2 5 8}, b={3 6 9}, c={4 7 10}
so that i can access any element, e.g b=6 for the later use. (1 Reply)
Hi,
I have a tab delimited text file with multiple columns. The second and third columns include numbers that have not been sorted. I want to extract rows where the second column includes a value between -0.01 and 0.01 (including both numbers) and the first third column includes a value between... (1 Reply)
I have this text file with a very large number of columns (10,000+) and I want to move the first column to the position of the six column so that the text file looks like this:
Before cutting and pasting
ID Family Mother Father Trait Phenotype
aaa bbb ... (5 Replies)
Hi there,
I'm trying to merge two files and make a third file.
However, two of the columns need to match exactly in both files AND I want everything from both files in the output if the two columns match in that row.
First file looks like this:
chr1 10001980 T A
Second... (12 Replies)