Want to remove all lines but not latest 50 lines from a file


 
# 22  
Old 10-26-2013
Quote:
Originally Posted by alister
Hi, Scrutinzer.
[..]
Perhaps you could confirm by using tr to delete any spaces?
Code:
dd if=/dev/null of=fname bs=1 seek=$(tail -n5 fname | tee >(wc -c | tr -d ' ') 1<>fname)

Alternatively, depending on how dd converts the text to an int, leading blanks might not be a problem if protected from shell parsing. Perhaps simply double quoting the command substitution will do (although this feels fragile):
Code:
dd if=/dev/null of=fname bs=1 seek="$(tail -n5 fname | tee >(wc -c) 1<>fname)"

I retested and can confirm that without the leading spaces in the output of wc it now works on all platforms, except HP-UX, where the file still became 0 bytes as before. The change made no difference there, since wc does not produce leading spaces on that platform anyway. I did notice this:
Code:
hpux64$ echo hello | tee >(wc -c) >/dev/null
0

which on the other platforms produced 6.
Quote:
The read/write nature of tee's stdout is not relevant. The utility of <> in this case is that it leaves the file descriptor's offset at 0 and allows tail's output (via tee) to write to the beginning of the file without truncation (which dd will perform afterwards). >> and > are both unsuitable since the former appends all writes and the latter truncates before the first write.
Nice!




_________
Quote:
Originally Posted by drl
Hi, alister.

Thanks for the reminder. I usually use my function:
Code:
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }

It prints like echo, the name is shorter, and it uses printf -- but sometimes I forget ... cheers, drl
There seems to be a difference:
Code:
$ echo 1 "2 3"
1 2 3
$ pe 1 "2 3"
12 3

Wouldn't
Code:
pe() { printf "%s" "$@"; printf "\n"; }

produce the same result?

Perhaps:
Code:
pe() { printf "%s" "$1"; [ $# -gt 1 ] && shift && printf " %s" "$@"; printf "\n"; }

-edit- maybe even this:
Code:
pe () { printf "%s\n" "$*" ; }


# 23  
Old 10-26-2013
Hi, Scrutinizer.

I don't recall if I had the intention of omitting spaces, but, yes, that's how it works. I may need to add an additional function to handle both. I'll think on that.

The templates I have for setting up the environment, displaying input, posting output, and sometimes doing the comparison now run into the 50s. Although they are under version control, it can be a pain updating the ones that are related and share common code. My goal is to encourage the user to actually copy and paste the scripts, possibly stage the data files, run the code, and see if the results match mine. That also keeps me honest.

Thanks for the feedback ... cheers, drl
# 24  
Old 10-26-2013
Note that the last paragraph of the description section of the current POSIX Standard's tail utility man page is:
Quote:
Tails relative to the end of the file may be saved in an internal buffer, and thus may be limited in length. Such a buffer, if any, shall be no smaller than {LINE_MAX}*10 bytes.
so tail -n 50000 is likely to fail without warning on some systems; that is, it may give some number of lines between 10 and 50,000 from the end of the file, with a good chance that the first of those lines is missing one or more bytes from its start. I'm almost positive that is how the tail utility in UNIX System V Release 4 behaved, and I have no idea how many versions of tail originally derived from that source have since been modified to handle unlimited numbers of bytes or lines when grabbing data from the end of a file.
# 25  
Old 10-27-2013
Hi.

For the version of tail that I used, as noted above:
Code:
       If the first character of N (the number of bytes or lines) is a `+',
       print beginning with the Nth item from the start of each file,
       otherwise, print the last N items in the file.  N may have a multiplier
       suffix: b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024, GB
       1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.

so it looks like the GNU folks have made provisions for some large quantities of data. I have not looked at the source, however.

For the tests that I ran, I numbered all the lines, counted the final tail (50,000 of 14,000,000), and printed the first and last line of the output file. All seemed correct and intact. The system tail and the version in perl at http://cpansearch.perl.org/src/CWEST/ppt-0.14/bin/tail both return essentially immediately when asked for the last 5 lines of the 14M line file. I noticed some seeks in the perl version, so I'd guess that both those versions are using such performance techniques.

I don't disagree that some utilities will produce wrong results without warning, but that is not the right thing to do, and such systems should be avoided if possible, especially if one routinely tries to solve problems like the one we are addressing here. Another possibility is to use the Perl version, although I didn't test it rigorously (and its calling sequence is slightly different, but easily changed).

I might worry more about the shell's capacity, although I didn't have any trouble; I recall a test I did on an early version of SunOS and was pleasantly surprised at its capacity for variable storage. Of course, mileage may vary for specific cases.

Best wishes ... cheers, drl

# 26  
Old 10-28-2013
For discussion:
Code:
tac < file | head -5 | tac | tee file >/dev/null

will keep the data stream within the pipes: because of the input redirection, file is opened for reading early by the shell, before tee writes to it. The whole thing is pretty fast, although it reads the entire file, and it keeps the inode of file. I'm not sure if and where the caveats are. I tested it with 10 million lines, bringing them down to 5000, on my Linux system.
# 27  
Old 10-28-2013
Hi, RudiC.
Quote:
Originally Posted by RudiC
For discussion:
Code:
tac < file | head -5 | tac | tee file >/dev/null

...
Using the data and framework in the previous posts (14M lines, about 1 GB), I get:
Code:
tac: standard input: read error: Inappropriate ioctl for device
  30442 2334740 data1

not 50000 as desired.

The timing was:
Code:
real	0m35.072s
user	0m2.732s
sys	0m10.137s

The number of lines in the result varied: out of 4 runs it reached 50000 once; the other 3 gave 27K, 29K, and 30K lines, all of the latter with the ioctl message.

This is just for this 3GB workstation, a single datapoint.

Best wishes ... cheers, drl

# 28  
Old 10-29-2013
Quote:
Originally Posted by RudiC
For discussion:
Code:
tac < file | head -5 | tac | tee file >/dev/null

will keep the data stream within the pipes: because of the input redirection, file is opened for reading early by the shell, before tee writes to it. The whole thing is pretty fast, although it reads the entire file, and it keeps the inode of file. I'm not sure if and where the caveats are. I tested it with 10 million lines, bringing them down to 5000, on my Linux system.
As drl demonstrated, this solution is not reliable.

The problem with this approach is that we cannot make any assumptions about when tee will truncate the file. The elements of a pipeline need not be created and scheduled sequentially. Even if they are, the number and size of the pipeline's buffers (both in the kernel and userspace) impose an upper limit on the amount of data that can be moved before truncation.
Code:
$ seq 100000 > data
$ wc -l < data
100000

$ cp data data.bkp
$ tac data | tee data >/dev/null
tac: data: read error
$ wc -l < data
24420

$ cp data.bkp data
$ tac data | head -n 100000 | tee data >/dev/null
tac: data: read error
$ wc -l < data
46266

$ cp data.bkp data
$ tac data | head -n 100000 | head -n 100000 | tee data >/dev/null
tac: data: read error
$ wc -l < data
68111

$ cp data.bkp data
$ tac data | head -n 100000 | head -n 100000 | head -n 100000 | tee data >/dev/null
tac: data: read error
$ wc -l < data
98148

With enough buffering, you might get lucky ...
Code:
$ cp data.bkp data
$ tac data | head -n 100000 | head -n 100000 | head -n 100000 | head -n 100000 | tee data >/dev/null
$ wc -l < data
100000

... or not.
Code:
$ cp data.bkp data
$ tac data | head -n 100000 | head -n 100000 | head -n 100000 | head -n 100000 | tee data >/dev/null
tac: data: read error
$ wc -l < data
80399

Regards,
Alister