How to remove mth and nth column from a file?

08-06-2013

Quote:

Originally Posted by MadeInGermany

Code:

      printf sep"%s", $i

That would be a problem if OFS were set to something that's special in a printf format string. For example, if OFS were %, the format string would become %%s, and the output would always be a literal %s.

I suspect it was just an oversight on your part, but for those that don't get it, here's the correct way:

Code:

    printf "%s%s", sep, $i
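To see the difference, here is a minimal sketch. The function names join_unsafe and join_safe and the sample input are made up for illustration, and sep is assumed to start empty and become OFS after the first field, which may not match the original script exactly:

```shell
# Hypothetical demo: OFS is "%", which is special in a printf format.

# Separator pasted into the format string: after the first field the
# format becomes "%%s", which prints a literal "%s".
join_unsafe() {
    awk -F% -v OFS=% '{
        sep = ""
        for (i = 1; i <= NF; i++) {
            printf sep"%s", $i
            sep = OFS
        }
        print ""
    }'
}

# Separator passed as data: the format string stays constant.
join_safe() {
    awk -F% -v OFS=% '{
        sep = ""
        for (i = 1; i <= NF; i++) {
            printf "%s%s", sep, $i
            sep = OFS
        }
        print ""
    }'
}

printf 'a%%b%%c\n' | join_unsafe   # prints a%s%s
printf 'a%%b%%c\n' | join_safe     # prints a%b%c
```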

Regards,
Alister

Last edited by alister; 08-06-2013 at 09:54 PM..

alister

08-07-2013

Quote:

Originally Posted by alister

I did not test your code, but looking at it there appears to be an off-by-one bug at i < last. last corresponds to the final field and it is never printed. It should be i <= last.

I remember finding this bug, but my fix probably didn't go through.

Quote:

Originally Posted by alister

Aside from that, your implementation is also a bit overcomplicated. There is no need to explicitly split the record into an array when AWK has already split it into field variables for your convenience.

For portability, simplicity, and flexibility, I recommend:

Code:

{
    for (i=1; i<=NF; i++)
        if (i != m  &&  i != n)
            s = s OFS $i
    print substr(s, length(OFS)+1)
    s=""
}

Obviously, FS and OFS must be set to the appropriate values.
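A minimal sketch of how that body might be run end to end. The wrapper name drop_cols, the comma delimiter, and the sample line are assumptions for illustration, not part of the original suggestion:

```shell
# drop_cols M N: remove fields M and N from comma-separated stdin.
drop_cols() {
    awk -F, -v OFS=, -v m="$1" -v n="$2" '{
        for (i = 1; i <= NF; i++)
            if (i != m && i != n)
                s = s OFS $i
        # s starts with one extra OFS; strip it before printing.
        print substr(s, length(OFS) + 1)
        s = ""
    }'
}

printf 'a,b,c,d,e\n' | drop_cols 2 4   # prints a,c,e
```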

In the world of awk, yes, but there's no way I would do that in C, and somehow that makes me not want to do it in awk either. If you care about speed, you naturally won't use that method even though it appears simpler. One could argue that it is better, but I would not.

Sometimes we think we can simplify things by minimizing our code, but sometimes it just gets more bloated. Mine may not have been in its most optimized form, but at least there's a balance. Yeah, I know speed isn't really crucial here, but we have our opinions.

Edit: You could actually be correct about appending strings instead of calling printf multiple times, since that could possibly cause multiple I/O system calls depending on awk's implementation, but I still wouldn't use print substr(s, length(OFS)+1).

Last edited by konsolebox; 08-07-2013 at 04:43 AM..

konsolebox

08-07-2013

Quote:

Code:

    printf "%s%s", sep, $i

Regards,
Alister

Code:

printf "%s", sep $i

---------- Post updated at 03:15 AM ---------- Previous update was at 02:40 AM ----------

Quote:

Originally Posted by konsolebox

I remember finding this bug, but my fix probably didn't go through.

In the world of awk, yes, but there's no way I would do that in C, and somehow that makes me not want to do it in awk either. If you care about speed, you naturally won't use that method even though it appears simpler. One could argue that it is better, but I would not.

Sometimes we think we can simplify things by minimizing our code, but sometimes it just gets more bloated. Mine may not have been in its most optimized form, but at least there's a balance. Yeah, I know speed isn't really crucial here, but we have our opinions.

Edit: You could actually be correct about appending strings instead of calling printf multiple times, since that could possibly cause multiple I/O system calls depending on awk's implementation, but I still wouldn't use print substr(s, length(OFS)+1).

Alister was addressing your explicit split(), which ignores awk's built-in field splitting. That's unnecessary overhead.
The formatting method does not really matter, but why not present an alternative? I was even inspired to present a third method.

MadeInGermany

08-07-2013

Considering the use of FS and OFS, I now have this version:

Code:

awk -v m=3 -v n=9 -v FS=, -v OFS=, -- '{
    j = 0
    for (i = 1; i <= NF; ++i) {
        if (i == m || i == n) {
            ++j
            continue
        }
        $(i - j) = $i
    }
    NF -= j
    print
}'

And incorporating Alister's suggestion, this is the best revision I could make to my original code:

Code:

awk -v m=3 -v n=9 -v FS=, -v OFS=, -- '{
    append = 0
    for (i = 1; i <= NF; ++i) {
        if (i != m && i != n) {
            if (append) {
                s = s OFS $i
            } else {
                s = $i
                append = 1
            }
        }
    }
    print s
    s = ""
}' file

Last edited by konsolebox; 08-07-2013 at 06:29 AM..

konsolebox

08-07-2013

Quote:

Originally Posted by konsolebox

If you're careful about speed you'll naturally not use that method despite appearing to be simpler. One could argue that that could be better but one would not.

Given the dearth of detail, any optimization efforts would be aimless.

For an average implementation, on average hardware, processing an average text file, under average user expectations, the performance discrepancy between the AWK scripts will be insignificant, and there has been no indication by the OP that this situation is anything but average.

For an extraordinary situation, the details which we do not have (awk implementation? data set characteristics?) are crucial.

Testing with gawk, mawk, and busybox and two types of data, one with modest lines (100 columns, 292 bytes each) and the other with much wider lines (32,765 columns, 185,484 bytes each), yielded highly inconsistent results.
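The test files themselves were not posted. As a rough sketch, comma-separated lines of sequential integers happen to reproduce the quoted line sizes (1 to 100 gives 292 bytes and 1 to 32765 gives 185,484 bytes, newline included), but that is a guess at the data, not a statement of what was used. The names make_data, bench, and script.awk are hypothetical:

```shell
# make_data COLS ROWS: print ROWS comma-separated lines holding 1..COLS.
make_data() {
    awk -v cols="$1" -v rows="$2" 'BEGIN {
        for (r = 1; r <= rows; r++) {
            line = 1
            for (c = 2; c <= cols; c++)
                line = line "," c
            print line
        }
    }'
}

# bench FILE: time script.awk (assumed to hold one of the awk bodies from
# this thread) under each awk implementation that is installed.
bench() {
    for impl in gawk mawk "busybox awk"; do
        command -v "${impl%% *}" >/dev/null 2>&1 || continue
        echo "== $impl"
        # $impl is deliberately unquoted so "busybox awk" splits into two words.
        time $impl -F, -v OFS=, -v m=3 -v n=9 -f script.awk "$1" >/dev/null
    done
}

# Possible usage (row counts are arbitrary):
#   make_data 100 100000 > modest.csv && bench modest.csv
#   make_data 32765 500 > wide.csv    && bench wide.csv
```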

My original suggestion was sometimes the fastest, but only when lines were modestly-sized. As you correctly pointed out, my code does not scale; performance degrades drastically with increasing line length.

Casual testing suggests that you're using gawk, because otherwise the performance of your more recent suggestions regresses greatly compared to your original contribution.

Gawk running the following script was the fastest of all possible implementation/script combinations (that I tested):

Quote:

Originally Posted by konsolebox

Code:

awk -v m=3 -v n=9 -v FS=, -v OFS=, -- '{
    j = 0
    for (i = 1; i <= NF; ++i) {
        if (i == m || i == n) {
            ++j
            continue
        }
        $(i - j) = $i
    }
    NF -= j
    print
}'

However, that very same script under Busybox was also the slowest of all interpreter/script combinations (slower even than any run of my original sloth). This script was also the slowest of all under mawk.

The statements that assign to a field and to NF, $(i - j) = $i and NF -= j (highlighted in the original post), trigger recomputation of $0 in all three implementations, but only gawk implements an optimization that lazily defers that work until $0 itself (not its fields) is referenced. For the details, follow field0_valid in gawk's field.c.
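The recomputation itself is easy to observe: after any field assignment, $0 is rebuilt from the fields joined with OFS. A small sketch with made-up input (it shows the rebuild only; whether an implementation defers it is not visible at script level):

```shell
# Untouched record: $0 is printed exactly as read.
printf 'a,b,c\n' | awk -F, -v OFS=- '{ print }'            # prints a,b,c

# Assigning any field (even to itself) makes awk rebuild $0 using OFS.
printf 'a,b,c\n' | awk -F, -v OFS=- '{ $1 = $1; print }'   # prints a-b-c
```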

There are a lot of systems out there that do not use gawk by default. Even among Linux installations, most embedded systems and most Debian derivatives (including most Ubuntu and Ubuntu-derivative versions) do not use it. For all of them, this revision is a setback.

In the absence of any specifics, in my judgement, your original solution exhibits the best balance of scalability and predictable performance across implementations. Here it is, minus the redundant split, the off-by-one in the loop condition, and the printf format string bugs:

Code:

{
    append = 0
    for (i = 1; i <= NF; ++i) {
        if (i != m && i != n) {
            if (append) {
                printf "%s%s", OFS, $i
            } else {
                printf "%s", $i
                append = 1
            }
        }
    }
    print ""
}

In this specific case, though, since there is no constraint requiring AWK and since any cut implementation would outperform any AWK implementation running any of these scripts ... by a significant margin, the performance debate is academic.
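For reference, a sketch of the cut equivalent for m=3 and n=9 with comma-delimited input. The function name is hypothetical, and the field list simply names every column except 3 and 9:

```shell
# Keep fields 1-2, 4-8, and 10 onward; fields 3 and 9 are dropped.
drop_3_and_9() {
    cut -d, -f1-2,4-8,10-
}

printf '1,2,3,4,5,6,7,8,9,10,11\n' | drop_3_and_9   # prints 1,2,4,5,6,7,8,10,11
```

GNU cut also accepts --complement (cut -d, --complement -f3,9), but that option is not part of POSIX.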

Regards,
Alister

Last edited by alister; 08-07-2013 at 11:38 PM..

alister
