Extracting specific lines from data file
UNIX for Advanced & Expert Users, post 302681687 by Don Cragun on Friday 3rd of August 2012 09:28:21 PM
Quote:
Originally Posted by alister
A good awk solution is a much better approach.

AWK can handle this without having to read file2 more than once.

Your grep approach is treating the contents of file1 as a list of regular expressions when it should be treated as a list of literal text. While it doesn't seem to be a problem with the sample data, if the real data contains regular expression metacharacters, there will be problems. This can be avoided if fixed-string matching is used (-F).

The grep approach will match text at any location in the line, not just in the first field. It also doesn't require that the match consist of an entire field; a substring match will trigger a false positive. Attempting to work around this by wrapping "$code" with anchors and delimiters won't work if -F is used.


That's a good approach, but the implementation isn't as elegant and idiomatic as it could be. I would suggest ...
Code:
awk 'NR==FNR {a[$1]; next} $1 in a' file1 file2

Regards,
Alister
I agree that using awk is much better than using the shell while loop, as long as file2 isn't huge. The shell solution also won't work if anything in file1's 1st field contains regular expression metacharacters. A common problem with the questions we get on this forum is that they give trivial examples of input and expected output without saying anything about the actual sizes of the datasets to be processed or the actual specifications for the contents of the fields. (I started using UNIX in the early 70's on a PDP-11 and a 3B20. There wasn't enough room in the user's address space to build an array in awk for a file of the size you might see processing customer records for a telco.)
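
To make the metacharacter point concrete, here is a minimal sketch. The sample data is invented for illustration, and since the original shell loop isn't reproduced in this post, grep -f stands in for it as an assumption. Note that -F only fixes the regular expression problem; it still matches anywhere in the line, not just in the first field.

```shell
#!/bin/sh
# Hypothetical sample data, invented for this illustration.
# mktemp -d is widely available but not required by POSIX.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' 'a.c' > "$tmp/file1"
printf '%s\n' 'a.c first' 'abc second' > "$tmp/file2"

# Keys treated as regular expressions: "." matches any character,
# so the "abc" line is selected as well.
regex_out=$(grep -f "$tmp/file1" "$tmp/file2")

# Keys treated as literal strings (-F): only the "a.c" line is selected.
fixed_out=$(grep -F -f "$tmp/file1" "$tmp/file2")

printf 'regex match:\n%s\n' "$regex_out"
printf 'fixed-string match:\n%s\n' "$fixed_out"
```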

---------- Post updated at 06:28 PM ---------- Previous update was at 06:02 PM ----------

Note also that the awk script provided by migurus will only give you the last entry in file2 if more than one line in file2 has a first field that matches the first field of any line in file1.

The awk script provided by Alister doesn't have this problem.
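
A small sketch of that behavior, using made-up sample files (the file names and contents are assumptions for illustration only). Alister's one-liner is used unchanged; every line of file2 whose first field appears in file1 is printed, including repeated keys.

```shell
#!/bin/sh
# Hypothetical sample data, invented for this illustration.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' 'k1' 'k3' > "$tmp/file1"
printf '%s\n' 'k1 first' 'k2 other' 'k1 second' 'k3 third' > "$tmp/file2"

# While reading file1 (NR==FNR holds only for the first file), remember
# each first field as an array key and skip to the next line.
# While reading file2, print every line whose first field is a known key.
awk_out=$(awk 'NR==FNR {a[$1]; next} $1 in a' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$awk_out"
```

With this input, both k1 lines are printed, in the order they appear in file2.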
 

JOIN(1)                   BSD General Commands Manual                  JOIN(1)

NAME
     join -- relational database operator

SYNOPSIS
     join [-a file_number | -v file_number] [-e string] [-o list] [-t char]
          [-1 field] [-2 field] file1 file2

DESCRIPTION
     The join utility performs an ``equality join'' on the specified files
     and writes the result to the standard output. The ``join field'' is the
     field in each file by which the files are compared. The first field in
     each line is used by default. There is one line in the output for each
     pair of lines in file1 and file2 which have identical join fields. Each
     output line consists of the join field, the remaining fields from file1
     and then the remaining fields from file2.

     The default field separators are tab and space characters. In this case,
     multiple tabs and spaces count as a single field separator, and leading
     tabs and spaces are ignored. The default output field separator is a
     single space character.

     Many of the options use file and field numbers. Both file numbers and
     field numbers are 1 based, i.e., the first file on the command line is
     file number 1 and the first field is field number 1.

     The following options are available:

     -a file_number
             In addition to the default output, produce a line for each
             unpairable line in file file_number.

     -e string
             Replace empty output fields with string.

     -o list
             The -o option specifies the fields that will be output from each
             file for each line with matching join fields. Each element of
             list has either the form 'file_number.field', where file_number
             is a file number and field is a field number, or the form '0'
             (zero), representing the join field. The elements of list must
             be either comma (',') or whitespace separated. (The latter
             requires quoting to protect it from the shell, or, a simpler
             approach is to use multiple -o options.)

     -t char
             Use character char as a field delimiter for both input and
             output. Every occurrence of char in a line is significant.

     -v file_number
             Do not display the default output, but display a line for each
             unpairable line in file file_number. The options -v 1 and -v 2
             may be specified at the same time.

     -1 field
             Join on the field'th field of file 1.

     -2 field
             Join on the field'th field of file 2.

     When the default field delimiter characters are used, the files to be
     joined should be ordered in the collating sequence of sort(1), using the
     -b option, on the fields on which they are to be joined, otherwise join
     may not report all field matches. When the field delimiter characters
     are specified by the -t option, the collating sequence should be the
     same as sort(1) without the -b option.

     If one of the arguments file1 or file2 is ``-'', the standard input is
     used.

EXIT STATUS
     The join utility exits 0 on success, and >0 if an error occurs.

COMPATIBILITY
     For compatibility with historic versions of join, the following options
     are available:

     -a      In addition to the default output, produce a line for each
             unpairable line in both file 1 and file 2.

     -j1 field
             Join on the field'th field of file 1.

     -j2 field
             Join on the field'th field of file 2.

     -j field
             Join on the field'th field of both file 1 and file 2.

     -o list ...
             Historical implementations of join permitted multiple arguments
             to the -o option. These arguments were of the form
             'file_number.field_number' as described for the current -o
             option. This has obvious difficulties in the presence of files
             named '1.2'.

     These options are available only so historic shell scripts do not
     require modification. They should not be used in new code.

LEGACY DESCRIPTION
     The -e option causes a specified string to be substituted into empty
     fields, even if they are in the middle of a line. In legacy mode, the
     substitution only takes place at the end of a line.

     Only documented options are allowed. In legacy mode, some obsolete
     options are re-written into current options.

     For more information about legacy mode, see compat(5).

SEE ALSO
     awk(1), comm(1), paste(1), sort(1), uniq(1), compat(5)

STANDARDS
     The join command conforms to IEEE Std 1003.1-2001 (``POSIX.1'').

BSD                              July 5, 2004                              BSD
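
As a rough illustration of how join could be applied to the file1/file2 problem discussed in the thread, here is a minimal sketch. The sample data, the temporary file names, and the choice of the C locale are assumptions made for this example. Both inputs are sorted on the first field before joining, as the man page requires, so the output comes out in sorted order rather than in file2's original order.

```shell
#!/bin/sh
# Hypothetical sample data, invented for this illustration.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT

# Use one fixed collating sequence for both sort and join.
LC_ALL=C
export LC_ALL

printf '%s\n' 'k3' 'k1' > "$tmp/file1"
printf '%s\n' 'k1 first' 'k2 other' 'k1 second' 'k3 third' > "$tmp/file2"

# join expects both inputs ordered on the join field (field 1 by default).
sort -b -k 1,1 "$tmp/file1" > "$tmp/keys.sorted"
sort -b -k 1,1 "$tmp/file2" > "$tmp/data.sorted"

# One output line per matching pair: the join field, then the remaining
# fields of file1 (none here), then the remaining fields of file2.
join_out=$(join "$tmp/keys.sorted" "$tmp/data.sorted")

printf '%s\n' "$join_out"
```

Because file1 holds only keys in this sketch, each output line is simply the matching file2 line with its fields separated by single spaces.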