Split content based on keywords


# 1  
Old 1 Week Ago

I need to split each line of a file into multiple lines, starting a new line wherever one of several patterns (keywords) occurs.

Sample:
Input:
Code:
ABC101testXYZ102UKMNO1092testing
ABC999testKMNValid

Output:
Code:
ABC101test
XYZ102U
KMNO1092testing
ABC999test
KMNValid

Here ABC, XYZ and KMN are the patterns.

Last edited by Jairaj; 1 Week Ago at 04:41 AM..
# 2  
Old 1 Week Ago
My last example was not entirely correct:
Code:
sed 's/ABC\|XYZ\|KMN/\n&/g;s/^\n//' file

--- Post updated at 13:45 ---

And the first one is better corrected like this:
Code:
sed -r 's/\B(ABC|XYZ|KMN)/\n&/g' file

# 3  
Old 6 Days Ago
It’s working. Thanks!

Can you tell me how this statement (command) works, step by step?
# 4  
Old 6 Days Ago
Hello Jairaj,

In awk, could you please try the following:
Code:
awk '{gsub("ABC|XYZ|MNO|KMN",ORS"&");sub(/^\n/,"")} 1'  Input_file

Thanks,
R. Singh
# 5  
Old 6 Days Ago
It’s working. Thanks!

Can you tell me how this statement (command) works, step by step?
# 6  
Old 6 Days Ago
Hi Jairaj,
I'm sorry, I have problems with English, I can not.
Enter this command in the terminal to open the sed manual at the description of the substitute command:
Code:
LESS=+/" *s/regexp/replacement/" man sed

# 7  
Old 5 Days Ago
Quote:
Originally Posted by nezabudka
I'm sorry, I have problems with English, I can not.
If I may try?

Code:
sed 's/ABC\|XYZ\|KMN/\n&/g;s/^\n//' file

This sed-program consists of two statements which are applied one after the other to every line:

Code:
s/ABC\|XYZ\|KMN/\n&/g
s/^\n//

Let us start with the second one as it is easier: it is a "replacement" command and replaces one expression with another. Actually the "s" stands for "substitute":

Code:
s/<something to match>/<something that replaces what was matched>/

What does it replace? It replaces a start-of-line (^) followed by a newline character (\n) with nothing. The start-of-line is not really a character but a position, so effectively it deletes a newline character if it is the very first character of the line. Newline characters anywhere else are left alone.

The first statement is a bit more complicated: basically it is a replacement command too and works the same way as the second one. Now, what does it replace?

Code:
/ABC\|XYZ\|KMN/

This matches one of the strings separated by the escaped pipe-characters, so effectively it matches either "ABC" or "XYZ" or "KMN". Now, what will these strings be replaced with?

Code:
/\n&/

The first is a \n, which means a newline character. The second character, &, means "whatever has been matched before". As I said, the first expression will match one of three different strings. The string which was matched by the first expression is put here, so effectively it replaces the string with itself plus a newline character up front.
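
To see the effect of & in isolation, here is a small sketch. The pattern (three capital letters) and the square brackets in the replacement are only chosen for illustration and are not part of the original command:

```shell
# "&" in the replacement stands for whatever the pattern matched.
# Here the match (the first three capital letters) is put into brackets.
printf '%s\n' 'ABC101test' 'XYZ102U' | sed 's/[A-Z][A-Z][A-Z]/[&]/'
# prints:
# [ABC]101test
# [XYZ]102U
```

The same command produces different replacements on the two lines because & always carries the text that was actually matched.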

The final g is just an option and says that the operation should occur as often as possible and not only for the first opportunity. If you have a substitution command like:

Code:
s/a/b/

It will replace "a" with "b" but only the first occurrence of "a". An input string of "aaa" will become "baa", but with the "g" in place it will become "bbb" because all the "a"s will be replaced, not only the first one. So, to put it all together, this is what will happen to an input string:

Code:
# input string:
ABC101testXYZ102UKMNO1092testing

# after first command (newlines are encoded as "\n" for better understanding):
\nABC101test\nXYZ102U\nKMNO1092testing

# after the second command:
ABC101test\nXYZ102U\nKMNO1092testing

# what will really be written (newlines not encoded any more):
ABC101test
XYZ102U
KMNO1092testing
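
The steps above can be reproduced in a terminal. This is only a sketch and assumes GNU sed, because "\|" in the pattern and "\n" in the replacement are GNU extensions:

```shell
# Assumption: GNU sed ("\|" and "\n" in the replacement are GNU extensions).
input='ABC101testXYZ102UKMNO1092testing'

# First statement only: every keyword gets a newline in front,
# so the output starts with an empty line.
printf '%s\n' "$input" | sed 's/ABC\|XYZ\|KMN/\n&/g'

# Both statements: the newline at the very start is removed again.
printf '%s\n' "$input" | sed 's/ABC\|XYZ\|KMN/\n&/g;s/^\n//'
```

The first call prints four lines (the first one empty), the second call prints the three lines shown above.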


Quote:
Originally Posted by nezabudka
Code:
sed 's/ABC\|XYZ\|KMN/\n&/g;s/^\n//' file

Code:
sed -r 's/\B(ABC|XYZ|KMN)/\n&/g' file

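Notice that neither the "-r" switch for Extended Regular Expressions nor the usage of "\n" as a newline character in the replacement is covered by a standard-conforming sed. The same goes for the "\|" alternation and the "\B" used in the patterns: these are GNU extensions.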
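
For a sed without these extensions, one possible workaround is sketched below. It uses one substitute command per keyword and a backslash followed by a real line break as the newline in the replacement. It is only meant as an illustration and has not been tested on every sed out there:

```shell
# Sketch using only POSIX BRE syntax:
# - one s command per keyword instead of the "\|" alternation
# - backslash + real line break instead of "\n" in the replacement
# ("\n" inside the pattern of the last command is standard.)
printf '%s\n' 'ABC101testXYZ102UKMNO1092testing' 'ABC999testKMNValid' |
sed -e 's/ABC/\
&/g' -e 's/XYZ/\
&/g' -e 's/KMN/\
&/g' -e 's/^\n//'
# prints:
# ABC101test
# XYZ102U
# KMNO1092testing
# ABC999test
# KMNValid
```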

There are several (similar but not identical) regular expression engines used in UNIX/Linux:

The most basic "regular expressions", although they are usually called "file globs" and are not regular expressions in the strict sense, are used by the shell: for example, in the expression filename* the "*" matches any string of any length.
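
A quick way to try such a glob without touching any files is the shell's case statement, which uses the same pattern syntax. The function name matches_glob and the sample names are made up for this sketch:

```shell
# case patterns follow the same rules as file globs:
# "*" stands for any string of any length, including the empty one.
matches_glob() {
  case $1 in
    filename*) echo "$1: matches" ;;
    *)         echo "$1: no match" ;;
  esac
}

for name in filename filename.txt file; do
  matches_glob "$name"
done
# prints:
# filename: matches
# filename.txt: matches
# file: no match
```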

Then there are Basic Regular Expressions or "BRE"s. The syntax of BREs is standardized by POSIX and is used in utilities like sed, grep (in its default mode, see below) and so on.

Notice that the GNU project extended this standard and developed their own variant of BREs, the GNU Basic Regular Expressions. The GNU variants of sed, grep and so on use these instead of the plain POSIX BREs. One example for the difference between GNU-BREs and POSIX-BREs is the quantifier "\+", which means "one or more of the previous expression". For instance, the regexp:

Code:
/Xa*Y/

will match "XaY", "XaaY" and so on, but also "XY". To exclude the latter and restrict the pattern to one or more "a" you would need to write

Code:
/Xaa*Y/         # POSIX, variant 1
/Xa\{1,\}Y/     # POSIX, variant 2
/Xa\+Y/         # GNU

Notice that the two POSIX variants are understood by all POSIX-conforming BRE engines, whereas the GNU variant is understood only by GNU tools. An unescaped "+" in a BRE is just a literal plus sign.
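
The variants can be compared with grep. This is a small sketch; the helper function sample is made up for the demonstration, and the last call assumes GNU grep:

```shell
# Three test lines: no "a", one "a", two "a"s.
sample() { printf '%s\n' XY XaY XaaY; }

sample | grep 'Xa*Y'        # zero or more "a": prints XY, XaY, XaaY
sample | grep 'Xaa*Y'       # POSIX, variant 1:  prints XaY, XaaY
sample | grep 'Xa\{1,\}Y'   # POSIX, variant 2:  prints XaY, XaaY
sample | grep 'Xa\+Y'       # GNU extension:     prints XaY, XaaY
```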

Then there are Extended Regular Expressions or EREs. EREs are basically a superset of BREs but with a few quirks. For instance you do not escape grouping or numerical quantifiers:

Code:
/Xa\{1,\}Y/     # BRE
/Xa{1,}Y/       # ERE
/X\(abc\)*Y/    # BRE
/X(abc)*Y/      # ERE

There is a POSIX standard for these and they are used in utilities like awk, grep -E (the -E option switches the used regexp engine from BRE to ERE), egrep (this is basically a grep with the -E option set and fixed) and so on.
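
The difference in escaping can be tried out with grep and grep -E. This is a sketch only; the helper function sample and its test lines are made up, and the last line of the sample contains literal braces on purpose:

```shell
sample() { printf '%s\n' XY XaY XabcY 'Xa{1,}Y'; }

sample | grep    'Xa\{1,\}Y'    # BRE quantifier:  prints XaY
sample | grep -E 'Xa{1,}Y'      # ERE quantifier:  prints XaY
sample | grep    'Xa{1,}Y'      # BRE, unescaped braces are literal: prints Xa{1,}Y
sample | grep    'X\(abc\)*Y'   # BRE grouping:    prints XY, XabcY
sample | grep -E 'X(abc)*Y'     # ERE grouping:    prints XY, XabcY
```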

Again, GNU has its own variant of EREs called GNU-ERE, which is used in the respective GNU tools (GNU awk, GNU egrep, etc.) but also in GNU sed when used with the "-E" or the equivalent "-r" switch.
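
As a sketch of that last point, the second command quoted above can be run with either switch. This assumes GNU sed, since "\B" (a position that is not a word boundary) and "\n" in the replacement are GNU extensions:

```shell
# Assumption: GNU sed. "\B" keeps the keyword at the very start of the
# line from matching, so no leading newline has to be removed afterwards.
input='ABC101testXYZ102UKMNO1092testing'

printf '%s\n' "$input" | sed -r 's/\B(ABC|XYZ|KMN)/\n&/g'
printf '%s\n' "$input" | sed -E 's/\B(ABC|XYZ|KMN)/\n&/g'
# both print:
# ABC101test
# XYZ102U
# KMNO1092testing
```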

I hope this helps.

bakunin