How to remove mth and nth column from a file?


 
# 15  
Old 08-06-2013
Quote:
Originally Posted by MadeInGermany
Code:
      printf sep"%s", $i

That would be a problem if OFS were set to something that's special in a printf format string. For example, if OFS were %, then the output would always be %s.

I suspect it was just an oversight on your part, but for those that don't get it, here's the correct way:
Code:
    printf "%s%s", sep, $i
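
To see the failure concretely, here is a minimal sketch. The loop is a hypothetical reconstruction of the usual separator idiom (not necessarily the exact code from the earlier post), and the sample line, m=2, and the % delimiter are assumptions chosen only to trigger the bug:

```shell
# Hypothetical reconstruction: drop column m, joining the rest with OFS.
# With OFS set to "%", the buggy format string becomes "%%s", which printf
# emits as a literal "%s" without ever consuming $i.
buggy() {
    echo 'a%b%c' | awk -F% -v OFS=% -v m=2 '{
        sep = ""
        for (i = 1; i <= NF; i++)
            if (i != m) { printf sep"%s", $i; sep = OFS }
        print ""
    }'
}

# Same loop, with the separator passed as data instead of as part of the format.
fixed() {
    echo 'a%b%c' | awk -F% -v OFS=% -v m=2 '{
        sep = ""
        for (i = 1; i <= NF; i++)
            if (i != m) { printf "%s%s", sep, $i; sep = OFS }
        print ""
    }'
}

buggy   # prints: a%s
fixed   # prints: a%c
```

The first field survives in the buggy version only because sep is still empty at that point; every later field is replaced by a literal %s.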

Regards,
Alister

Last edited by alister; 08-06-2013 at 09:54 PM..
# 16  
Old 08-07-2013
Quote:
Originally Posted by alister
I did not test your code, but looking at it there appears to be an off-by-one bug at i < last. last corresponds to the final field and it is never printed. It should be i <= last.
I remember finding this bug, but my modification probably didn't go through.

Quote:
Originally Posted by alister
Aside from that, your implementation is also a bit overcomplicated. There is no need to explicitly split the record into an array when AWK has already split it into field variables for your convenience.

For portability, simplicity, and flexibility, I recommend:
Code:
{
    for (i=1; i<=NF; i++)
        if (i != m  &&  i != n)
            s = s OFS $i
    print substr(s, length(OFS)+1)
    s=""
}

Obviously, FS and OFS must be set to the appropriate values.
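
For readers who want to try the quoted snippet, here is one possible way to wire it up. The drop_cols wrapper name, the comma delimiter, and the sample line are assumptions for illustration only:

```shell
# drop_cols M N: read comma-separated lines on stdin and print them without
# columns M and N (1-based). Hypothetical wrapper around the quoted snippet.
drop_cols() {
    awk -F, -v OFS=, -v m="$1" -v n="$2" '{
        for (i = 1; i <= NF; i++)
            if (i != m && i != n)
                s = s OFS $i
        # s starts with one extra OFS, so skip past it
        print substr(s, length(OFS) + 1)
        s = ""
    }'
}

printf 'a,b,c,d,e\n' | drop_cols 2 4    # prints: a,c,e
```
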
In the world of awk, yes, but there's no way I would do that in C, and somehow that makes me not want to do it in awk either. If you care about speed, you'll naturally avoid that method even though it appears simpler. One could argue that it's better, but I wouldn't.

Sometimes we think we could simplify things by minimizing our code but sometimes it just gets more bloated. Mine may not have been in its most optimized form but at least there's the balance. Yeah I know speed isn't crucial really but we have our opinions.

Edit: You could actually be correct about appending strings instead of calling multiple printfs since that could possibly cause multiple ioctl calls depending on awk's implementation, but I wouldn't consider print substr(s, length(OFS)+1).

Last edited by konsolebox; 08-07-2013 at 04:43 AM..
# 17  
Old 08-07-2013
Quote:
Code:
    printf "%s%s", sep, $i

Regards,
Alister
Or
Code:
printf "%s", sep $i

---------- Post updated at 03:15 AM ---------- Previous update was at 02:40 AM ----------

Quote:
Originally Posted by konsolebox
I remember finding this bug, but my modification probably didn't go through.

In the world of awk, yes, but there's no way I would do that in C, and somehow that makes me not want to do it in awk either. If you care about speed, you'll naturally avoid that method even though it appears simpler. One could argue that it's better, but I wouldn't.

Sometimes we think we could simplify things by minimizing our code but sometimes it just gets more bloated. Mine may not have been in its most optimized form but at least there's the balance. Yeah I know speed isn't crucial really but we have our opinions.

Edit: You could actually be correct about appending strings instead of calling multiple printfs since that could possibly cause multiple ioctl calls depending on awk's implementation, but I wouldn't consider print substr(s, length(OFS)+1).
Alister was addressing your explicit split(), which ignores the field splitting awk has already done for you. That's unnecessary overhead.
The formatting method does not really matter, but why not present an alternative? I was even inspired to present a 3rd method.
# 18  
Old 08-07-2013
Considering the use of FS and OFS I now have this version:
Code:
awk -v m=3 -v n=9 -v FS=, -v OFS=, -- '{
    j = 0
    for (i = 1; i <= NF; ++i) {
        if (i == m || i == n) {
            ++j
            continue
        }
        $(i - j) = $i
    }
    NF -= j
    print
}'
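
To make the in-place shifting easier to follow, here is a small trace of the same loop. The sample line and m=2, n=4 are assumptions, and printing $0 on every step is for illustration only (it forces $0 to be rebuilt each time, which a real script would avoid):

```shell
# trace_shift: run the field-shifting loop on one sample line and show $0
# after each assignment. Hypothetical helper, not part of the posted script.
trace_shift() {
    printf 'a,b,c,d,e\n' | awk -F, -v OFS=, -v m=2 -v n=4 '{
        j = 0
        for (i = 1; i <= NF; ++i) {
            if (i == m || i == n) {
                ++j              # one more column to skip over
                continue
            }
            $(i - j) = $i        # slide the kept field left by j positions
            print "i=" i ": " $0
        }
        NF -= j                  # chop off the now-stale trailing fields
        print "final: " $0
    }'
}

trace_shift
# i=1: a,b,c,d,e
# i=3: a,c,c,d,e
# i=5: a,c,e,d,e
# final: a,c,e
```
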

And with Alister's suggestion added, this is the best revision of my original code I could come up with:
Code:
awk -v m=3 -v n=9 -v FS=, -v OFS=, -- '{
    append = 0
    for (i = 1; i <= NF; ++i) {
        if (i != m && i != n) {
            if (append) {
                s = s OFS $i
            } else {
                s = $i
                append = 1
            }
        }
    }
    print s
    s = ""
}' file


Last edited by konsolebox; 08-07-2013 at 06:29 AM..
# 19  
Old 08-07-2013
Quote:
Originally Posted by konsolebox
If you're careful about speed you'll naturally not use that method despite appearing to be simpler. One could argue that that could be better but one would not.
Given the dearth of detail, any optimization efforts would be aimless.

For an average implementation, on average hardware, processing an average text file, under average user expectations, the performance discrepancy between the AWK scripts will be insignificant, and there has been no indication by the OP that this situation is anything but average.

For an extraordinary situation, the details which we do not have (awk implementation? data set characteristics?) are crucial.

Testing with gawk, mawk, and busybox and two types of data, one with modest lines (100 columns, 292 bytes each) and the other with much wider lines (32,765 columns, 185,484 bytes each), yielded highly inconsistent results.
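
For anyone who wants to repeat this kind of test, here is a rough sketch of a data generator. Whether it matches the data actually used is an assumption; it simply emits comma-separated runs of the integers 1..COLS, which happen to come out at 292 bytes per line for 100 columns and 185,484 bytes per line for 32,765 columns (newline included). The gen_rows name is hypothetical:

```shell
# gen_rows COLS ROWS: print ROWS identical lines, each holding the integers
# 1..COLS separated by commas. Hypothetical helper, not from the thread.
gen_rows() {
    awk -v cols="$1" -v rows="$2" 'BEGIN {
        line = 1
        for (c = 2; c <= cols; c++)
            line = line "," c
        for (r = 1; r <= rows; r++)
            print line
    }'
}

gen_rows 100 1 | wc -c      # 292 bytes, newline included
```

A timing run could then redirect the generator's output to a file and run each interpreter over it under the shell's time facility, with the output sent to /dev/null so that terminal speed does not skew the numbers.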

My original suggestion was sometimes the fastest, but only when lines were modestly-sized. As you correctly pointed out, my code does not scale; performance degrades drastically with increasing line length.

Casual testing suggests that you're using gawk, because otherwise the performance of your more recent suggestions regresses greatly compared to your original contribution.

Gawk running the following script was the fastest of all possible implementation/script combinations (that I tested):
Quote:
Originally Posted by konsolebox
Code:
awk -v m=3 -v n=9 -v FS=, -v OFS=, -- '{
    j = 0
    for (i = 1; i <= NF; ++i) {
        if (i == m || i == n) {
            ++j
            continue
        }
        $(i - j) = $i
    }
    NF -= j
    print
}'

However, that very same script under Busybox was also the slowest of all interpreter/script combinations (slower even than any run of my original sloth). This script was also the slowest of all under mawk.

The field assignment $(i - j) = $i and the NF -= j statement (highlighted in the original post) trigger recomputation of $0 in all three implementations, but only gawk implements an optimization to lazily avoid that overhead until $0 itself (not its fields) is referenced. For the details, follow field0_valid in gawk's field.c.
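
The rebuilding itself is easy to observe. A minimal sketch, with a made-up input line and OFS chosen so the change is visible:

```shell
# rebuild_demo: show that assigning to a field, or to NF, causes $0 to be
# reassembled from the fields using OFS. Hypothetical helper name.
rebuild_demo() {
    echo 'a b c' | awk -v OFS=- '{
        print          # $0 still as read: a b c
        $1 = $1        # any field assignment invalidates $0
        print          # rebuilt with OFS: a-b-c
        NF = 2         # changing NF invalidates $0 as well
        print          # a-b
    }'
}

rebuild_demo
```

This only shows that the record is rebuilt, not when each implementation chooses to do the work.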

There are a lot of systems out there that do not use gawk by default. Even among Linux installations, most embedded systems and most Debian derivatives (including most Ubuntu and Ubuntu-derivative versions) do not use it. For all of them, this revision is a setback.

In the absence of any specifics, in my judgement, your original solution exhibits the best balance of scalability and predictable performance across implementations. Here it is, minus the redundant split, the off-by-one in the loop condition, and the printf format string bug:
Code:
{
    append = 0
    for (i = 1; i <= NF; ++i) {
        if (i != m && i != n) {
            if (append) {
                printf "%s%s", OFS, $i
            } else {
                printf "%s", $i
                append = 1
            }
        }
    }
    print ""
}

In this specific case, though, since there is no constraint requiring AWK and since any cut implementation would outperform any AWK implementation running any of these scripts ... by a significant margin, the performance debate is academic.
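
As a sketch of the cut approach for the m=3, n=9 case used earlier in the thread (the comma delimiter, the sample line, and the function name are assumptions):

```shell
# POSIX cut: keep fields 1-2, 4-8, and 10 onward, i.e. drop fields 3 and 9.
drop_3_and_9() {
    cut -d, -f1-2,4-8,10-
}

printf '1,2,3,4,5,6,7,8,9,10\n' | drop_3_and_9    # prints: 1,2,4,5,6,7,8,10
```

The field list has to be worked out by hand from m and n. GNU cut also accepts --complement, so -f3,9 --complement says the same thing more directly, but that option is a GNU extension and not portable.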

Regards,
Alister

Last edited by alister; 08-07-2013 at 11:38 PM..