Extracting specific lines from data file Post: 302681687


Post 302681687 by Don Cragun on Friday 3rd of August 2012 09:28:21 PM


Quote:

Originally Posted by alister

A good awk solution is a much better approach.

AWK can handle this without having to read file2 more than once.

Your grep approach is treating the contents of file1 as a list of regular expressions when it should be treated as a list of literal text. While it doesn't seem to be a problem with the sample data, if the real data contains regular expression metacharacters, there will be problems. This can be avoided if fixed-string matching is used (-F).

The grep approach will match text at any location in the line, not just in the first field. Also, it doesn't require that the match consist of an entire field; a substring match will trigger a false positive. Attempting to work around this by wrapping "$code" with anchors and delimiters won't work if -F is used, because the anchors would then be treated as literal text too.

That's a good approach, but the implementation isn't as elegant and idiomatic as it could be. I would suggest ...

Code:

awk 'NR==FNR {a[$1]; next} $1 in a' file1 file2

Regards,
Alister
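To make the quoted one-liner concrete, here is a small sketch run against invented sample data. The file names match the thread, but the contents (and the temporary directory) are assumptions made up for illustration, not the original poster's data.

```shell
# Hypothetical sample data; only the file names come from the thread.
tmp=$(mktemp -d) || exit 1
printf '%s\n' 'A100' 'B200' > "$tmp/file1"
printf '%s\n' 'A100 first' 'C300 other' 'B200 second' 'A100 third' > "$tmp/file2"

# While awk reads file1 (NR==FNR is true only for the first file),
# each first field becomes a key in array a and the line is skipped.
# While it reads file2, a line is printed if its first field is a key in a.
result=$(awk 'NR==FNR {a[$1]; next} $1 in a' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Each file is read exactly once, and the comparison is an exact match on the whole first field rather than a pattern match anywhere in the line.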

I agree that using awk is much better than using the shell while loop, as long as file2 isn't huge. And the shell solution won't work if anything in file1's first field contains any regular expression metacharacters. A common problem with the questions we get on this forum is that they give trivial examples of input and expected output without stating anything about the actual sizes of the datasets that will be processed or the actual specifications for the contents of the fields being processed. (I started using UNIX in the early '70s on a PDP-11 and a 3B20. There wasn't enough room in the user's address space to build an array in awk for a file of the size you might see when processing customer records for a telco.)
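The metacharacter problem can be seen with a small sketch. The original while loop is not reproduced in this post, so the loop below is only an assumed shape of it (one grep per key, with the key in a variable named code), and the sample data is invented.

```shell
# Hypothetical data: the single key contains a dot, a regex metacharacter.
tmp=$(mktemp -d) || exit 1
printf '%s\n' 'A.1' > "$tmp/file1"
printf '%s\n' 'A.1 wanted' 'AX1 unwanted' 'A.10 unwanted' 'zz A.1 unwanted' > "$tmp/file2"

# Assumed shape of the loop: the key is used as a regular expression,
# so the dot matches any character and the match may occur anywhere.
regex_hits=$(while IFS= read -r code; do grep -- "$code" "$tmp/file2"; done < "$tmp/file1")

# Same loop with -F: the key is literal text, but substring matches
# and matches outside the first field still get through.
fixed_hits=$(while IFS= read -r code; do grep -F -- "$code" "$tmp/file2"; done < "$tmp/file1")

# awk compares the whole first field, and reads file2 only once.
field_hits=$(awk 'NR==FNR {a[$1]; next} $1 in a' "$tmp/file1" "$tmp/file2")

printf 'regex:\n%s\nfixed:\n%s\nfield:\n%s\n' "$regex_hits" "$fixed_hits" "$field_hits"
rm -rf "$tmp"
```

With this data the regex loop returns all four lines, the -F loop returns three, and only the awk version returns the single intended line.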

---------- Post updated at 06:28 PM ---------- Previous update was at 06:02 PM ----------

Note also that the awk script provided by migurus will only give you the last entry in file2 if more than one line in file2 has a first field that matches the first field of any line in file1.

The awk script provided by Alister doesn't have this problem.
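A sketch of the difference, with invented data. migurus's script is not reproduced in this post, so the second command is a hypothetical stand-in that merely shows one way "last entry only" behavior can arise (storing each file2 line under its first field, so later lines overwrite earlier ones). It should not be read as the actual script.

```shell
# Hypothetical data: key A100 appears twice in file2.
tmp=$(mktemp -d) || exit 1
printf '%s\n' 'A100' > "$tmp/file1"
printf '%s\n' 'A100 first' 'B200 other' 'A100 second' > "$tmp/file2"

# Alister's script: each matching file2 line is printed as it is read,
# so duplicates in file2 are all kept.
all_matches=$(awk 'NR==FNR {a[$1]; next} $1 in a' "$tmp/file1" "$tmp/file2")

# Hypothetical variant (NOT migurus's actual script): file2 is read first
# and stored by key, so only the last line for each key survives.
last_only=$(awk 'NR==FNR {line[$1] = $0; next} $1 in line {print line[$1]}' "$tmp/file2" "$tmp/file1")

printf 'all matches:\n%s\nlast only:\n%s\n' "$all_matches" "$last_only"
rm -rf "$tmp"
```

Note that the variant also holds all of file2 in memory, which is the case where the size concern mentioned above matters most.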

Don Cragun



