Isolate and Extract a Pattern Substring (Digits Only)

03-21-2009

Hi guys,

I have a text file report generated from egrepping multiple files.
The text files themselves are obtained after many successive refinements, so they already contain the desired number. However, the number is surrounded by unwanted characters, newlines, and spaces, and it is not always at the start of the line, as the sample below shows:
$ egrep '[0-9]{7}' *.txt
dte--0072.txt:1596223
dte--0073.txt:1560379
dte--0075.txt:!!! !!�!!!!!! !!�! !�� !!!!!!!!!! !!?! 1623749
dte--0076.txt:1596014
dte--0077.txt: 1791213
dte--0078.txt: 1767933
dte--0079.txt:_____1777023

What I need to generate is a clean report that looks like this:
desired clean report
dte--0072.txt:1596223
dte--0073.txt:1560379
dte--0075.txt:1623749
dte--0076.txt:1596014
dte--0077.txt:1791213
dte--0078.txt:1767933
dte--0079.txt:1777023

How can I do this? I am too new to regex, so I was hoping someone could help with negating the expression, or with a sed one-liner.

Note: The string is always of the same pattern:
- digits only
- the same number of digits (in this report a 7-digit number)
- there are no spaces or any other signs between the digit pattern, it is like 1234567
- only need the digits (nothing before or after the number pattern)
- would prefer to operate the command directly on the multiple files, as in the egrep, so the report file is already preserving the filenames on the same line with the contained number string

Can you please help?

Thanks!

Last edited by netfreighter; 03-22-2009 at 04:58 AM..

netfreighter

03-21-2009

Try this. It uses sed to remove everything after the colon that's not a digit.

Code:

 egrep '[0-9]{7}' *.txt | sed 's/:[^0-9]*/:/'

KenJackson

03-22-2009

Quote:

Originally Posted by KenJackson

Try this. It uses sed to remove everything after the colon that's not a digit.

Code:

 egrep '[0-9]{7}' *.txt | sed 's/:[^0-9]*/:/'

Ken, many thanks for that.

I did try the command, and it removes most characters trailing AFTER the number, except "_" underscore; however the characters BEFORE the digit pattern are still there, as can be seen from sample output:

dte--0070.txt:0015965_____
dte--0071.txt:0015709_____
dte--0072.txt:0015962
dte--0073.txt:0015603
dte--0075.txt:�!!!!!! !!�! !�� !!!!!!!!!! !!?! 0016237

So now what is left is to remove the noise before the number.

Perhaps there is a way to rather extract "only just what is a 7 digit pattern", rather than provide for all the possible trailing or preceding symbols? The files are obtained from non-English languages so there may be funny non-printing symbols that need to be removed.

Thank you!

netfreighter

03-22-2009

I am surprised that the last line got through. There must be something there that I'm not seeing.

But yes, we can pick out explicitly 7 characters. In fact, if all the filenames are similar enough you can drop egrep and use only sed. Try this:

Code:

sed 's/\(.*txt:\)[^0-9]*\([0-9]\{7\}\).*/\1\2/'

The escaped parentheses capture and copy .*txt: to \1 and [0-9]\{7\} to \2. The rest is dropped.
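
To see the two capture groups at work, here is a minimal sketch that feeds a few of the sample lines from the first post through the same sed command. The function name clean_report is made up for illustration. Note that the expression keeps exactly seven digits, so a longer number would be cut to its first seven.

```shell
# clean_report is a hypothetical wrapper around the sed command above.
# \1 holds everything up to and including "txt:", \2 holds the 7 digits;
# the non-digits in between and whatever follows the digits are dropped.
clean_report() {
    sed 's/\(.*txt:\)[^0-9]*\([0-9]\{7\}\).*/\1\2/'
}

# Sample lines taken from the report in the first post.
printf '%s\n' \
    'dte--0072.txt:1596223' \
    'dte--0077.txt: 1791213' \
    'dte--0079.txt:_____1777023' | clean_report
```

With plain ASCII noise like this, the three lines come out as dte--0072.txt:1596223, dte--0077.txt:1791213 and dte--0079.txt:1777023.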

KenJackson

03-22-2009

Something like this should suffice; it deletes every character that is not in the class of characters enclosed between the brackets:

Code:

sed 's/[^a-zA-Z0-9:.-]//g' file
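
A small sketch of this whitelist idea follows, under two assumptions that are not in the post. First, the literal hyphen is written last inside the brackets so that sed cannot read it as a range operator. Second, LC_ALL=C is set so that sed works byte by byte and stray non-ASCII bytes are also deleted. The name strip_noise is hypothetical.

```shell
# strip_noise is a hypothetical name. It deletes every byte that is not
# a letter, a digit, a colon, a dot or a hyphen.
strip_noise() {
    # LC_ALL=C: treat input as single bytes (assumption, see above).
    # The hyphen sits at the end of the bracket expression, so it is literal.
    LC_ALL=C sed 's/[^a-zA-Z0-9:.-]//g'
}

printf '%s\n' \
    'dte--0079.txt:_____1777023' \
    'dte--0075.txt:!!! !!?! 1623749' | strip_noise
```

One limitation of this approach: any ASCII letters or extra digits mixed into the noise survive, because they are on the whitelist.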

Regards

Franklin52

04-03-2009

SED command or REGEX to extract only the number from a textfile

Hello and thank you both for the answers.
I apologise for the delay in my answer, but I have been overwhelmed with tasks lately. It is now time to turn to this project and finish it off.

I have tried both commands, but they only partially remove the unwanted characters.
I also discovered that the odd characters are not really question marks; those are just replacement characters, shown because the console terminal cannot display the actual signs.

So, first I run

Code:

$  egrep [0-9]\{9\} *.txt > repfile

then

Code:

$ sed 's/\(.*txt:\)[^0-9]*\([0-9]\{7\}\).*/\1\2/' repfile

and the output file is showing matching lines that are only cleaned BEFORE the number, while after the number trailing characters remain:
repfile:dte--0055.txt:?! !?!!!!?! !!?! !!?! !!?!! !!!!!!!!! 001431616
repfile:dte--0056.txt:?? !?!!!!?! !???!!!?! 001548532______
repfile:dte--0057.txt:0015817
repfile:dte--0058.txt:!!!! ??!?? !!?! !!? )??? !!!!!!?! 001438615
repfile:dte--0059.txt:0016327
repfile:dte--0060.txt:!)!> !?!!!!?? ??!? !!!! ??! ??!? !?! 001467161

I opened the file in TextEdit.app on Mac OS X and I see strange characters like
"� �!!! ��)!��(� �? �!� ______ " (may not show correctly but it's symbols that look like superscript and foreign letters)

Since these numbers are going to be extracted from multilingual, non-English files, it is hard to predict which symbols will be encountered. So to clean up the number I guess I would need a sed command or a regex that extracts only the number from a text file.

Something that says, in effect, "remove anything that is NOT a 9-digit number".

I have a feeling that egrep or a regular expression could do that, but I do not know where to look.
Maybe some character classes like [:punct:] can be used?
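
On the question of whether grep alone can do this: one option is the -o flag, which prints only the matched text rather than the whole line. It is not required by POSIX, but GNU grep and BSD grep both provide it. The sketch below assumes such a grep. The name extract_ids is made up, -a is added so that files with odd bytes are not treated as binary, and LC_ALL=C is an assumption to force byte-wise matching.

```shell
# extract_ids N FILE... (hypothetical helper)
# Prints "filename:number" for each run of N digits found in the files.
extract_ids() {
    n=$1
    shift
    # -a: treat every file as text, even with non-printable bytes
    # -H: always print the filename prefix, even for a single file
    # -o: print only the part of the line that matched
    LC_ALL=C grep -a -H -o -E "[0-9]{$n}" "$@"
}

# Intended use, run in the directory holding the text files:
#   extract_ids 9 *.txt > repfile
```

This reads the original files directly, so the filename prefix comes from grep itself. It relies on the statement that each file holds only one such number: a longer digit run would be reported as its first N digits, and a second number in the same file would produce a second line.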

What the final result should look like:
repfile:0057.txt:001581743
repfile:0058.txt:001438615
repfile:0059.txt:001632790
repfile:0060.txt:001467161

Where that number is the only such number available, one per each textfile.
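
A possible way to get from the grep report to this kind of result with sed is sketched below. It rests on two guesses that the thread does not confirm. One is that the untouched lines contain bytes that are not valid in the terminal's locale, which can stop . and [^0-9] from matching them, hence LC_ALL=C. The other is that the digit count in the sed expression should match the 9 used in the egrep. The name extract_number is hypothetical.

```shell
# extract_number N (hypothetical helper)
# Reads "name.txt:noise<digits>noise" lines on stdin and prints
# "name.txt:<first N digits>". Lines without such a number are skipped.
extract_number() {
    n=$1
    # LC_ALL=C makes sed match byte by byte, so odd bytes count as
    # ordinary non-digit characters (assumption about the cause).
    LC_ALL=C sed -n "s/^\(.*txt:\)[^0-9]*\([0-9]\{$n\}\).*/\1\2/p"
}

# \351 and \277 are octal escapes for bytes outside ASCII.
printf 'dte--0055.txt:\351! !\351!! 001431616\n' | extract_number 9
printf 'dte--0056.txt:\277\277 001548532______\n' | extract_number 9
```

The sketch assumes that the first digit after the colon starts the wanted number; a stray digit inside the leading noise would make the line fail to match and be skipped.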

Thanks!

Last edited by netfreighter; 04-03-2009 at 11:31 AM..

netfreighter

04-03-2009

I also tried

Code:

awk ' /[[:digit:]]/ { print $1 } ' repfile

but $1 or $2 only map to "words", which means digit sequences that are set off by a separator or blank. Therefore only some of the desired numbers are extracted; the others, which are surrounded by garbage characters without a break or space, remain unfiltered.
Any idea how to awk the desired PATTERN only?
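
One way to make awk pick out a pattern instead of a field is its match() function, which sets RSTART and RLENGTH to the position and length of the first match. The following is a minimal sketch, assuming 9-digit numbers and report lines of the form "name.txt:text". The name extract_with_awk is hypothetical, and LC_ALL=C is an assumption to keep awk working on raw bytes.

```shell
# extract_with_awk (hypothetical helper)
# Prints "name.txt:<number>" for each input line containing 9 digits in a row.
extract_with_awk() {
    LC_ALL=C awk '
    {
        # Split off the "name.txt:" prefix so digits in the filename
        # are never considered part of the number.
        colon  = index($0, ":")
        prefix = substr($0, 1, colon)
        rest   = substr($0, colon + 1)
        # Nine explicit [0-9] classes avoid relying on {9} interval
        # support, which some older awk versions lack.
        if (match(rest, /[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/))
            print prefix substr(rest, RSTART, RLENGTH)
    }'
}

printf '%s\n' \
    'dte--0058.txt:!!!! ??!?? !!?! 001438615' \
    'dte--0056.txt:?? !?!!!!?! 001548532______' | extract_with_awk
```

Because match() searches the whole remainder of the line, the number is found whether or not blanks separate it from the surrounding characters.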

netfreighter
