filtering out duplicate substrings, regex string from a string


 
# 1  
Old 06-15-2010

My input consists of single-word lines.
From each line:
Quote:
a) I want to remove all text starting at 'dp', including the 'dp' itself.
Ex: prjgoodBlaBladpgoodBlaBla ---> prjgoodBlaBla
b) I also want to remove duplicate substrings.
Ex: prjtestBlaBlatestBlaBla ---> prjtestBlaBla
Logic I have in mind but am having a hard time implementing: take characters 4 through 10 [testBla]; if that substring is found again later in the string, remove all text starting from its second occurrence.
data.txt
Code:
 
prjtestBlaBlatestBlaBla
prjthisBlaBlathisBlaBla
prjthatBlaBladpthatBlaBla
prjgoodBlaBladpgoodBlaBla
prjgood1BlaBla123dpgood1BlaBla123


Desired output -->
data_out.txt
Code:
 
prjtestBlaBla
prjthisBlaBla
prjthatBlaBla
prjgoodBlaBla
prjgood1BlaBla123

I am able to get part a) of my requirement working using the following:
Code:
 
> sed 's/dp.*//' data.txt
prjtestBlaBlatestBlaBla
prjthisBlaBlathisBlaBla
prjthatBlaBla
prjgoodBlaBla
prjgood1BlaBla123

but not part b).

Last edited by kchinnam; 06-15-2010 at 12:19 PM.. Reason: formatting changes
# 2  
Old 06-15-2010
Code:
perl -pe 's/dp.*// || s/(\w+)\1/\1/' data.txt


Last edited by bartus11; 06-15-2010 at 12:42 PM.. Reason: little mistake in code, now should be fine
# 3  
Old 06-15-2010
bart, it's working. Thanks for the solution.

Can you explain what || and (\w+) do?

Can we get it working using sed? Can someone help?

Code:
> /usr/xpg4/bin/sed -e 's/dp.*//' -e 's/(\w+)\1/\1/' data.txt
sed: command garbled: s/(\w+)\1/\1/

# 4  
Old 06-15-2010
I don't know how to make it work in sed. In Perl, "||" is a short-circuit logical OR (not an exclusive or): the second substitution runs only if the first one did not match anything. (\w+)\1 matches the first occurrence of a run of word characters immediately followed by a copy of itself (it is an extension of "(\w)\1", which matches consecutive duplicate characters such as "aa", "bb" and so on).
# 5  
Old 06-15-2010
Thanks bart.

Does anyone know how to do this using sed? Does any other Unix shell or tool recognize (\w+) as consecutive duplicate strings?
# 6  
Old 06-15-2010
Quote:
Originally Posted by kchinnam
Thanks bart.

Does anyone know how to do this using sed? Does any other Unix shell or tool recognize (\w+) as consecutive duplicate strings?
It is not just (\w+), but (\w+)\1. That "\1" is important, as it matches the same string that (\w+) matched just before it. In other words, "\1" matches the duplicated part.
# 7  
Old 06-15-2010
Does anyone know how to achieve this using sed?
Code:
perl -pe 's/dp.*// || s/(\w+)\1/\1/' data.txt


Last edited by vgersh99; 06-15-2010 at 04:54 PM.. Reason: code tags, please!
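On the open sed question, one possible approach, offered as a sketch and not tested on Solaris. The attempt in post #3 is rejected because POSIX sed uses basic regular expressions: groups are written \( \), and neither + nor \w is available, so \1 refers to a group that does not exist. Back-references and \{m,n\} intervals are part of basic regular expressions, so the idea can be written as below. Two assumptions: strip_dup is a hypothetical helper name, and a duplicate must be at least 4 characters long (borrowed from the "4 thru 10 characters" idea in post #1). The minimum length matters because sed runs both commands on every line (there is no || short-circuit), and without it the "oo" in "good" would be collapsed.

```shell
#!/bin/sh
# strip_dup (hypothetical name): filters stdin line by line.
#   1st command: delete everything from the first "dp" to the end of the line.
#   2nd command: find a run of 4 or more characters that is immediately
#                followed by a copy of itself and keep only one copy.
strip_dup() {
    sed -e 's/dp.*//' -e 's/\(.\{4,\}\)\1/\1/'
}

# Sample lines from data.txt in post #1
printf '%s\n' \
    prjtestBlaBlatestBlaBla \
    prjthisBlaBlathisBlaBla \
    prjthatBlaBladpthatBlaBla \
    prjgoodBlaBladpgoodBlaBla \
    prjgood1BlaBla123dpgood1BlaBla123 | strip_dup
```

With a real file this would be used as strip_dup < data.txt > data_out.txt. Like the Perl version, it only handles duplicates that sit directly next to each other.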