Turning to SED to select specific records

06-18-2012

Hi All,

I am looking for a simple, concise solution, most likely using sed, that processes the following four rows of data belonging to one record and keeps the record only if its second row satisfies certain criteria, such as the surname matching Smith or Jackson:

Code:

 
John (firstname)
Smith (surname) 
20/05/1984 (dob)
Male (gender)

It would have been possible to use AWK if the data were on the same line with a fixed delimiter.

I have no problem writing many lines of shell script, but I am hoping to find a short, easy solution in SED; I am just not familiar with how it could be done.

I am running on Solaris 10 x86 platform.

Your assistance would be much appreciated,
George

Last edited by gjackson123; 06-18-2012 at 10:36 AM.. Reason: Tidy up code & provide platform detail

gjackson123

06-18-2012

Can you please provide a sample input file and the intended output file? I am a bit confused about what you consider a record in your file.

Based on my understanding (that a record is always the combination of the above 4 rows, and that the second row in the set should begin with 'Smith' to be selected), here is my solution:

Code:

 sed 's/(gender)/&*/g' file1 | awk -F'\n' '$2 ~ /^Smith.*/ {print}' RS='*'

Note: This solution assumes that '*' does not appear anywhere in your data. Replace it with another character (which does not occur in your data) if this is not the case.

Last edited by jawsnnn; 06-18-2012 at 10:50 AM..

jawsnnn

06-20-2012

Hi jawsnnn,

Thanks for your valuable input.

There is no need to provide a sample input data file, since your understanding of its composition is correct, as shown in my initial post. Nevertheless, I am wondering whether you could provide a brief one-line explanation of how your code works, since my SED knowledge is limited. Also, which of the following minor updates would accommodate more than one surname?

Code:

sed 's/(gender)/&*/g' file1 | awk -F'\n' '$2 ~ /^Smith.*|^Jone.*|^Green.*/ {print}' RS='*'

or

sed 's/(gender)/&*/g' file1 | awk -F'\n' '$2 ~ /^(Smith|Jone|Green).*/ {print}' RS='*'

or

sed 's/(gender)/&*/g' file1 | awk -F'\n' '$2 ~ /^(?:Smith|Jone|Green).*/ {print}' RS='*'

I will test each of these statements to see which one works and let you know.
Thanks again,
George
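
For reference, a minimal sketch comparing the first two variations. awk patterns are POSIX extended regular expressions, in which both forms are valid; the (?:...) group in the third variation is Perl syntax that POSIX does not define, so its behaviour in awk should not be relied on. The surname list below is a hypothetical test set, and the patterns are applied to whole lines rather than to $2 to keep the example short.

```shell
# Hypothetical surnames, one per line, used only to exercise the patterns.
surnames='Smith
Jones
Green
Brown
Greensmith'

# Variation 1: three separately anchored alternatives.
v1=$(printf '%s\n' "$surnames" | awk '/^Smith.*|^Jone.*|^Green.*/')

# Variation 2: a single anchor applied to a grouped alternation.
v2=$(printf '%s\n' "$surnames" | awk '/^(Smith|Jone|Green).*/')

# Both variables should hold the same lines: every name except Brown.
printf '%s\n' "$v1"
```

Note that ^Green also accepts Greensmith; anchoring the end as well, e.g. /^(Smith|Jones|Green)$/, would restrict the match to exact surnames when the field holds nothing but the name.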

gjackson123

06-20-2012

I think the first variation should work fine for multiple surnames. Let me explain the solution:

Code:

sed 's/(gender)/&*/g' file1 | awk -F'\n' '$2 ~ /^Smith.*/ {print}' RS='*'

1. Using sed, I appended an asterisk '*' to the string (gender), i.e. the end of each record:

Code:

sed 's/(gender)/&*/g'

Here & is replaced by the matched string.

2. Then I divide the output of this command into records separated by '*', with fields separated by '\n', the newline character. This lets me treat the four lines in each set as four different fields in the awk command. I achieve this by setting the record separator and the field separator:

Code:

RS='*'
and
-F'\n'

3. Then, I simply match the second field (i.e. the second row of all sets) to the pattern

Code:

^Smith.*

which matches fields starting with the string Smith followed by any sequence of characters. In retrospect, the .* in the pattern is probably not needed.
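
One detail worth checking on real data: sed appends the '*' before the newline that ends the (gender) line, so every record after the first starts with that leftover newline, and awk then sees an empty field 1 with the surname shifted to field 3. The sketch below uses a hypothetical two-record sample in the labelled layout from the first post. It prints what awk sees as field 2 for each record, then shows one possible workaround (an assumption, not part of the solution above): stripping the leading newline with sub() before testing $2.

```shell
# Hypothetical labelled sample: two records, both with surname Smith.
labelled='John (firstname)
Smith (surname)
20/05/1984 (dob)
Male (gender)
Jane (firstname)
Smith (surname)
01/01/1990 (dob)
Female (gender)'

# Show the record number and field 2 for every record that has a field 2.
field2=$(printf '%s\n' "$labelled" |
  sed 's/(gender)/&*/g' |
  awk -F'\n' '$2 != "" { print NR ": " $2 }' RS='*')

# Same pipeline, but remove the newline left over after each "*" first;
# modifying $0 makes awk split the fields again.
fixed=$(printf '%s\n' "$labelled" |
  sed 's/(gender)/&*/g' |
  awk -F'\n' '{ sub(/^\n/, "") } $2 ~ /^Smith/ { print $1 }' RS='*')

printf '%s\n' "$field2" "$fixed"
```

With this sample, field 2 of the second record is the first-name line rather than the surname, which is why the unmodified command would select only a matching first record.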

jawsnnn

06-20-2012

Hi, gjackson123.

Quote:

Originally Posted by gjackson123

...There is no need to provide sample input data file ...

Meta-advice.

If one were to want more than one suggested solution, one would supply sample data. That allows consistency among results. Otherwise, you are putting an additional burden on the responders to come up with sample data, which, in addition to being likely different from one another, may not be representative of the real set. In general, if faced with the task of creating sample data in addition to a solution, then I probably will move on to other questions without attempting to solve the problem.

Best wishes ... cheers, drl

Last edited by drl; 06-22-2012 at 11:51 AM..

drl

06-21-2012

Hi jawsnnn & drl,

Below is the employee.txt as requested:

Code:

$ more employee.txt

John
Barry
21/04/1988
Male
Jessica
Smith
16/09/2000
Female
Joyce
Brown
05/12/1985
Female
Kyle
Jones
02/10/1945
Male

Code:

$ sed 's/(gender)/&*/g' employee.txt | more 
John 
Barry 
21/04/1988 
Male 
Jessica 
Smith 
16/09/2000 
Female 
Joyce 
Brown 
05/12/1985 
Female 
Kyle 
Jones 
02/10/1945 
Male

It doesn’t look like the sed statement is doing anything with it. Should the (gender) be replaced with something else? What should I expect the data to look like coming out of sed and going into awk, which I am more comfortable with?

I am interested in getting a solution with everyone’s help.

Thanks again,

George

---------- Post updated 06-22-12 at 12:19 AM ---------- Previous update was 06-21-12 at 06:22 PM ----------

Hi,

Below are some more attempts to figure out how your SED & AWK statements work:

Code:

$ uname -a
SunOS startrek 5.10 Generic_141444-09 sun4v sparc SUNW,SPARC-Enterprise-T5220

Code:

$ more employee.txt
John
Barry
21/04/1988
Male
Jessica
Smith
16/09/2000
Female
Joyce
Brown
05/12/1985
Female
Kyle
Jones
02/10/1945
Male

## Returned the same list & order

Code:

$ sed 's/(gender)/&*/g' employee.txt
John
Barry
21/04/1988
Male
Jessica
Smith
16/09/2000
Female
Joyce
Brown
05/12/1985
Female
Kyle
Jones
02/10/1945
Male

## Returned the same list & order

Code:

$ sed 's/(Male)/&*/g' employee.txt  
John
Barry
21/04/1988
Male
Jessica
Smith
16/09/2000
Female
Joyce
Brown
05/12/1985
Female
Kyle
Jones
02/10/1945
Male

## Returned the same list & order

Code:

$ sed 's/(Female)/&*/g' employee.txt
John
Barry
21/04/1988
Male
Jessica
Smith
16/09/2000
Female
Joyce
Brown
05/12/1985
Female
Kyle
Jones
02/10/1945
Male

## Awk is not getting the right output from SED

Code:

$ sed 's/(Male)/&*/g' employee.txt | awk -F'\n' '$2 ~ /^Smith.*/ { print }' RS='*'
$

## Same input to AWK as from SED

Code:

$ awk -F'\n' '$2 ~ /^Smith.*/ { print }' RS='*' employee.txt                  
$

I suspect the problem is from

Code:

sed 's/(gender)/&*/g'

but I am still trying to wrap my head around it.

Also, what is the purpose of the round brackets () around gender, and of & and *? The sed statement appears to be doing a global replacement of (gender) with &*, but I am not clear whether gender should be replaced with something else.

Thanks a lot,

George
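
A short sketch of why the sed command leaves employee.txt unchanged, assuming the file holds the bare values shown above. In sed's basic regular expressions, unescaped ( and ) are ordinary characters, so the pattern matches only the literal text (gender), and & in the replacement stands for whatever was matched. The two one-line inputs below are illustrative, not taken from a real file.

```shell
# A bare value, as in employee.txt, and a labelled value, as in the first post.
plain='Male'
labelled='Male (gender)'

# No literal "(gender)" on the line, so nothing is replaced.
out_plain=$(printf '%s\n' "$plain" | sed 's/(gender)/&*/g')

# The literal "(gender)" is matched and re-emitted via &, followed by "*".
out_labelled=$(printf '%s\n' "$labelled" | sed 's/(gender)/&*/g')

printf '%s\n' "$out_plain" "$out_labelled"
```

Because the '*' marker is never added to the unlabelled file, awk with RS='*' reads the whole file as a single record, so only the very first surname in the file is ever tested.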

Last edited by gjackson123; 06-22-2012 at 02:27 AM.. Reason: Cleaned out spurious formatting

gjackson123

06-21-2012

Perhaps this is what you need:

Code:

$ cat input
John
Barry
21/04/1988
Male
Jessica
Smith
16/09/2000
Female
Joyce
Brown
05/12/1985
Female
Kyle
Jones
02/10/1945
Male


$ awk 'BEGIN { RS="Male|Female" } { print $1,$2,$3 } ' input
John Barry 21/04/1988
Jessica Smith 16/09/2000
Joyce Brown 05/12/1985
Kyle Jones 02/10/1945
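
Since employee.txt carries no labels, another option is to rely only on the fixed four-line layout and filter on the surname. The sketch below is not from any post in this thread; it assumes every record is exactly four lines and that no value contains a tab. It also avoids a multi-character RS, which gawk and mawk treat as a regular expression but which POSIX leaves unspecified, so it may not behave the same in every awk. The variable names are illustrative.

```shell
# Sample taken from the employee.txt listing in this thread.
employees='John
Barry
21/04/1988
Male
Jessica
Smith
16/09/2000
Female
Joyce
Brown
05/12/1985
Female
Kyle
Jones
02/10/1945
Male'

# paste - - - - joins every four input lines into one tab-separated line,
# so the surname becomes field 2 of an ordinary one-line record.
selected=$(printf '%s\n' "$employees" |
  paste - - - - |
  awk -F'\t' '$2 ~ /^(Smith|Jones)$/ { print $1, $2, $3, $4 }')

# A sed-only variant: N;N;N appends the next three lines to the pattern
# space, so each cycle holds one whole record joined by newlines.
sed_selected=$(printf '%s\n' "$employees" |
  sed -n -e 'N;N;N' -e '/\nSmith\n/p' -e '/\nJones\n/p')

printf '%s\n' "$selected" "$sed_selected"
```

The awk variant prints each selected record on one line; the sed variant keeps the original four-line layout. In the sed patterns, the surrounding \n pair ties the match to a complete line inside the record, which here can only be the surname or the date of birth.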

Peasant
