Adding filename and line number from multiple files to final file

04-17-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by bioinfo

Thanks

This script is awesome but somewhat tough for me as I am a beginner. Is there any possibility of adding something easy to my previous code (below) to do the same thing?

Code:

cat *.txt | awk '{print $2, $4}' | sed "/#ainst\|#Time/d" > out.txt
or
cat *.txt | awk 'NR >= 3 && NR <= 1002 {print $2, $4}' > out.txt

Thanks

Besides being unneeded, the cat gets in the way: if you cat the files instead of letting awk open them, awk can't recover the filenames. The awk command can also easily perform arithmetic calculations (such as subtracting the number of comments found); sed can't.

You said you wanted to print a portion of the file's name on each output line. You can't do that with either of the scripts above. In your scripts, neither awk nor sed has access to the filenames of the input files it is processing.

You said you wanted to delete the 1st two lines (which start with a #) from each file. Your 2nd script can't be made to do that if you keep the cat. Because awk sees the output of cat as a single input stream, NR keeps counting across file boundaries: the test NR >= 3 && NR <= 1002 throws away the 1st two lines of the 1st file only, and it also stops printing after line 1002 of the combined stream. The script can easily throw away all lines starting with # (like my awk script did), but it can't use line numbers to throw away the 1st two lines of any file except the first.
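
To make the difference concrete, here is a minimal sketch with two tiny made-up files (the file contents are hypothetical; only the file names follow the thread). It compares a line-number test on the output of cat with a per-file test using FNR:

```shell
#!/bin/sh
# Hypothetical sample data: two files, each starting with two '#' lines.
tmp=$(mktemp -d)
printf '#h1\n#h2\na1\na2\n' > "$tmp/file001.txt"
printf '#h1\n#h2\nb1\nb2\n' > "$tmp/file002.txt"

# NR counts lines across the whole concatenated stream, so only the
# first file's two header lines are skipped.
piped=$(cat "$tmp"/*.txt | awk 'NR > 2')

# FNR restarts at 1 for every file that awk opens itself, so the two
# header lines of each file are skipped.
direct=$(awk 'FNR > 2' "$tmp"/*.txt)

printf 'cat piped into awk, NR test:\n%s\n\n' "$piped"
printf 'awk opening the files, FNR test:\n%s\n' "$direct"
rm -rf "$tmp"
```

In the first result the header lines of file002.txt should still be present; in the second they should be gone.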

You said you wanted to print the line number (not counting the comment lines) for each line in your input files. You can easily do that by using the awk script I suggested; you can't do it with either of your pipelines without making the awk portion of your pipeline look a lot more like what I suggested before.

What is it about the awk script I provided that is too tough to understand?

Don Cragun

04-17-2013

Registered User

50, 0

Join Date: Dec 2012

Last Activity: 12 August 2013, 3:07 AM EDT

Posts: 50

Thanks Given: 52

Thanked 0 Times in 0 Posts

Thanks a lot for letting me know the concepts in detail.

I did not understand the following things:

Code:

FNR==1{ # This is the first line in a new file...

fn = substr(FILENAME, 5, 3) # Save 3 characters from this filename

printf("%s\t%d\t%s\t%s\n", fn, FNR - n, $2, $4)

I have a general question. In which cases should I use the cut/paste/cat commands, or grep, or sed, or awk? I am really confused.

I am getting different answers while googling.

Thanks again.

bioinfo

04-17-2013

Quote:

Originally Posted by bioinfo

Thanks a lot for letting me know the concepts in detail.

I did not understand the following things:

Code:

FNR==1{ # This is the first line in a new file...

The awk utility maintains several variables as it processes a line of text from a file. As you already know, NR is the number of records that have been read from all of the input files. FNR is the number of records that have been read from the current file. When FNR is equal to 1, the condition portion of this awk statement is true and the action portion of the statement will be executed.
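
A small sketch may help here. It prints FILENAME, NR, and FNR for every line of two throwaway files (the file contents are made up for illustration), so you can watch FNR return to 1 when the second file starts while NR keeps climbing:

```shell
#!/bin/sh
# Hypothetical sample data: a two-line file and a one-line file.
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/file001.txt"
printf 'z\n'    > "$tmp/file002.txt"

# Run awk from inside the directory so FILENAME is just the file name.
out=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR }' file001.txt file002.txt)

printf '%s\n' "$out"
rm -rf "$tmp"
```

The last line printed should show NR at 3 and FNR back at 1, which is exactly the moment the FNR==1 condition fires for the second file.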

Quote:

Originally Posted by bioinfo

Code:

fn = substr(FILENAME, 5, 3) # Save 3 characters from this filename

FILENAME is another variable maintained by the awk utility. It contains the name of the file that is being processed. You said your filenames were:

Code:

file001.txt
00000000011 character
12345678901 number within filename
file002.txt
file003.txt
...
file020.txt

The substr(string, start, count) function in awk returns count characters from string, starting at character number start. For example, when FILENAME is file001.txt, substr(FILENAME, 5, 3) returns 001 (characters 5 through 7), and the assignment stores those characters in the variable fn; i.e., fn will contain the portion of the filename you want to print at the start of each line printed from this input file.
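
You can try substr() on its own, without any input files, by calling it from a BEGIN block. This is only a quick check using a literal string in place of FILENAME:

```shell
#!/bin/sh
# Take 3 characters starting at character 5 of a literal file name.
fn=$(awk 'BEGIN { print substr("file001.txt", 5, 3) }')
printf '%s\n' "$fn"
```

Note that awk counts characters starting from 1, not 0.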

Quote:

Originally Posted by bioinfo

Code:

printf("%s\t%d\t%s\t%s\n", fn, FNR - n, $2, $4)

The awk printf(format, argument...) function is VERY similar to the C Language printf() function and the printf utility. In this case the function call:

Code:

printf("%s\t%d\t%s\t%s\n", fn, FNR - n, $2, $4)

prints the saved portion of the filename as a character string, a tab character, the current line number in the current file minus the number of lines in the current file starting with # as a decimal numeric string, a tab character, the 2nd field from the current line as a character string, a tab character, and the 4th field from the current line as a character string followed by a newline character.
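
Putting the three pieces together, the following is a sketch of how such a script could look. It is not necessarily the script posted earlier in the thread: the way n is reset and incremented is an assumption, and the sample data is made up (only the #ainst and #Time header names and the file names come from this thread).

```shell
#!/bin/sh
# Hypothetical sample data: two header lines, then four fields per line.
tmp=$(mktemp -d)
printf '#ainst a b c\n#Time a b c\nr1 10 r3 0.5\nr1 20 r3 0.7\n' > "$tmp/file001.txt"
printf '#ainst a b c\n#Time a b c\nr1 30 r3 0.9\n' > "$tmp/file002.txt"

out=$(cd "$tmp" && awk '
FNR == 1 {                       # This is the first line in a new file...
    fn = substr(FILENAME, 5, 3)  # Save 3 characters from this filename
    n = 0                        # Assumption: restart the comment count
}
/^#/ {                           # Assumption: count comment lines, print nothing
    n++
    next
}
{                                # Every other line: name, line number, fields 2 and 4
    printf("%s\t%d\t%s\t%s\n", fn, FNR - n, $2, $4)
}' file*.txt)

printf '%s\n' "$out"
rm -rf "$tmp"
```

Each output line should start with 001 or 002, followed by a line number that restarts at 1 for each file and does not count the # lines.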

Quote:

Originally Posted by bioinfo

I have a general question. In which cases should I use the cut/paste/cat commands, or grep, or sed, or awk? I am really confused.

I am getting different answers while googling.

Thanks again.

You should use cut and paste when they do what you need, they do it more simply or more efficiently than your shell's built-in features could, AND you don't need more complex processing (such as that provided by awk or sed) to get the job done.

You should use cat when you need to concatenate two or more files into a single output file, when you need to feed the contents of one or more files into a utility that doesn't accept pathname operands, or when your version of cat provides a non-standard extension that performs some text manipulation you need as it copies files.

You should NEVER use:

Code:

cat *.txt|awk 'awk program'

instead of:

Code:

awk 'awk program' *.txt

Creating an additional process like this takes more system resources to run your command, makes it run slower, and keeps awk from knowing how many files are being processed and what the names of the files are.
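
Here is a small sketch of what awk loses when cat sits in front of it. The two files are throwaway examples; the awk program counts how many times FNR is 1 and collects the values of FILENAME it sees:

```shell
#!/bin/sh
# Hypothetical sample data: two one-line files.
tmp=$(mktemp -d)
printf 'a\n' > "$tmp/file001.txt"
printf 'b\n' > "$tmp/file002.txt"

prog='FNR == 1 { files++; names = names " " FILENAME } END { print files ":" names }'

# awk reads one anonymous stream from the pipe.
piped=$(cd "$tmp" && cat file*.txt | awk "$prog")

# awk opens each file itself and knows its name.
direct=$(cd "$tmp" && awk "$prog" file*.txt)

printf 'with cat:    %s\n' "$piped"
printf 'without cat: %s\n' "$direct"
rm -rf "$tmp"
```

Without cat, awk should report 2 files and both names. With cat, it should report a single input with no usable name (what FILENAME holds for standard input can differ between awk implementations).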

Many of the original UNIX utilities were designed to perform a transformation on data read from standard input and write the transformed data to standard output. (These utilities can be called filters.) The idea was that filters could be combined in a pipeline to perform much more complex tasks without making each utility more complex than needed. (This is an example of your basic KISS [Keep It Simple, Stupid] principle.) Unfortunately, many of today's utilities on many systems have forgotten the KISS principle.
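
As an illustration of filters working together (the word list is made up), this pipeline finds the most frequent word in its input. Each stage does one small job and passes its output to the next:

```shell
#!/bin/sh
# sort groups equal lines, uniq -c counts each group, sort -rn puts the
# largest count first, head keeps that line, awk tidies the layout.
out=$(printf 'pear\napple\npear\nfig\napple\npear\n' |
    sort | uniq -c | sort -rn | head -n 1 |
    awk '{ print $2, $1 }')
printf '%s\n' "$out"
```

None of these utilities knows anything about the others; the shell's pipes are the only glue.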

Even with the original UNIX utilities, there were frequently many different ways to get a job done. Choosing which utilities to use depends on what you are trying to do, your ability to recognize the alternatives available, your ability to use the alternative tools available, and your knowledge of how utilities have evolved on various systems over the years so you know what will work portably on all of the systems you want to use and which code might have to be tweaked if you want to move your script to a different system.

Despite the fact that many of us have degrees in computer science or computer engineering (or both), there is a lot of art (as well as science and engineering) in programming.

Don Cragun

04-17-2013

Registered User

858, 184

Join Date: Mar 2013

Last Activity: 12 May 2013, 11:33 PM EDT

Posts: 858

Thanks Given: 18

Thanked 184 Times in 179 Posts

Quote:

In which cases should I use the cut/paste/cat commands, or grep, or sed, or awk? I am really confused.

There is no hard and fast rule. cut and paste go together, and are very useful, along with join. cat is not needed that much. grep finds lines. sed makes quick arbitrary changes. awk works well with fields and can make programs. bash ties it all together, and can make programs. The main thing I would recommend is keep the code easy to read and maintain, even for other viewers. Emphasize readability over performance. Avoid trying to always do everything with awk, or always do everything with perl. It sounds like you already know it's better to learn a variety of commands, such as the key ones you mentioned, and a few others such as uniq, head, tail, and sort, and use the commands within the context of shell scripts.
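
For a feel of how these commands divide the work, here is a rough sketch on a made-up tab-separated file (the file name data.tsv and its contents are hypothetical): cut pulls out columns, paste glues them back together side by side, grep finds a line, and sed makes a quick change to it.

```shell
#!/bin/sh
# Hypothetical sample data: three tab-separated columns.
tmp=$(mktemp -d)
printf 'id1\t10\tx\nid2\t20\ty\n' > "$tmp/data.tsv"

# cut and paste: keep columns 1 and 3, then join them side by side.
cut -f1 "$tmp/data.tsv" > "$tmp/ids"
cut -f3 "$tmp/data.tsv" > "$tmp/tags"
out=$(paste "$tmp/ids" "$tmp/tags")

# grep finds the line; sed makes a quick change to it.
picked=$(grep '^id2' "$tmp/data.tsv" | sed 's/^id/ID/')

printf '%s\n%s\n' "$out" "$picked"
rm -rf "$tmp"
```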

hanson44

04-18-2013

Thanks a lot Don Cragun for such an extensive explanation and hanson44.

Does the amount of space between the lines matter, or can we write an awk program on one line too? Is it only for readability?

Thanks.

---------- Post updated at 10:54 AM ---------- Previous update was at 10:36 AM ----------

Hurray!
I got my output.

Thanks

bioinfo

04-18-2013

Quote:

Originally Posted by bioinfo

Thanks a lot Don Cragun for such an extensive explanation and hanson44.

Thanks

It is logically possible to write any awk script as a single line, if you're willing to type it into your shell. If the awk program is in a shell file to be executed, you'll have to restrict the length of each line in your script to the limits supported by your editor. You can also throw away all of the comments and change all of the variable names to single characters to make the script shorter.
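
As a quick illustration (the input and the program are made up for this example), the same awk program is written below once as a compact 1-liner and once spread over several commented lines. Both should behave identically; the blank space and comments only matter to the human reader:

```shell
#!/bin/sh
# Hypothetical input: one comment line and two data lines.
input='#c
a 1 b 2
a 3 b 4'

# Compact form.
one=$(printf '%s\n' "$input" | awk '/^#/{next}{print $2,$4}')

# Same program, laid out for reading.
multi=$(printf '%s\n' "$input" | awk '
/^#/ { next }        # skip comment lines
{ print $2, $4 }     # print fields 2 and 4
')

printf '%s\n' "$one"
[ "$one" = "$multi" ] && echo "both forms give the same output"
```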

I choose to write programs in a way that is easy for me to read and understand rather than to try to artificially produce 1-liners. If you ask me about an awk script I submitted here a month ago, I don't want to deal with the obfuscation caused by collapsing an easily read script into a single line.

If you take a script I supplied, modify it slightly to add a new feature, collapse it to a single line, and then ask me to help you debug your new feature, I will definitely be slower to respond and it will be much more likely that I won't respond at all.

Don Cragun

04-18-2013

Ok thanks.

bioinfo
