Remove Special Characters and Numbers From a Wordlist


# 1  
I suck at this type of stuff. I have a huge wordlist, and I want to get rid of everything in each word except the letters: numbers and all special characters should go. Since this list was created using cewl, I somehow also picked up some Latin characters and would like to remove them as well. If there is a way to do this and someone gives me the string to use, could you also explain how that string works? I would love to learn how to do things like this myself.


Thanks in advance.
# 2  
Try:
Code:
$ echo "This134isastrangeword33 This134isàstrangërword33@%$" | sed 's/[^[:alpha:]]//g'
ThisisastrangewordThisisàstrangërword
$ echo "This134isastrangeword33 This134isàstrangërword33@%$" | sed 's/[^a-zA-Z]//g'
ThisisastrangewordThisisstrangrword

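^ at the start of a bracket expression means negation
[^[:alpha:]] means any non-letter (which characters count as letters depends on your locale, which is why à and ë survive in the first output above)
[^a-zA-Z] means any character that is not an ASCII letter
sed 's/[^[:alpha:]]//g' means delete every non-letter on each line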
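
A note on hand-written ranges, as a small sketch rather than anything from the reply itself: A-z with a lowercase z is easy to type by accident and is not the same as A-Z, because in ASCII order the characters [ \ ] ^ _ ` sit between Z and a. The sample strings below are made up, and LC_ALL=C is an assumption added here so that the ranges and [:alpha:] mean plain ASCII regardless of the system locale.

```shell
# Made-up sample: underscore, caret and digits mixed in with letters.
sample='pass_word^123'

# A-z (lowercase z) also spans [ \ ] ^ _ ` so those characters survive.
loose=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[^a-zA-z]//g')

# A-Z (uppercase Z) keeps only the 52 ASCII letters.
strict=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[^a-zA-Z]//g')

# In the C locale, [:alpha:] is ASCII-only too, so accented letters go.
ascii_only=$(printf '%s\n' 'isàstrangër' | LC_ALL=C sed 's/[^[:alpha:]]//g')

echo "$loose"       # pass_word^
echo "$strict"      # password
echo "$ascii_only"  # isstrangr
```

For the Latin characters in the original question, this suggests either spelling out a-zA-Z or pinning the locale to C when using [:alpha:].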
# 3  
So using this will remove anything not in the U.S. alphabet, if I understand you correctly?
# 4  
Probably worth adding space characters to the bracket expression so that they are kept rather than deleted.
# 5  
My intention is to clean out everything except the letters for now. Once that is done I will add back numbers that I can control. What I have right now is 81 GB of words pulled from sites. I am currently removing duplicates, which I know will cut the size down.



I am a little confused but I assume I should run this
Code:
$ echo | sed 's/[^[:alpha:]]//g'

and then this

Code:
$ echo | sed 's/[^a-zA-Z]//g'


and I will end up with this
Thisisastrangeword

I know I will be back with more questions. I am at work, so if I do not get back right away to let you know how grateful I am for your input, I want to thank you now. So thank you, and correct me if I am wrong in my input.
Thanks, Thanks, Thanks!
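
To clear up the echo part: in the reply above, echo only supplies a sample line for sed to work on, so it is not needed for a real file, and the two sed commands are alternatives rather than two steps. A minimal sketch, assuming one word per line and that only ASCII letters should survive. The file names words.txt and clean.txt are made up for illustration, and sort -u is just one common way to do the duplicate removal mentioned above.

```shell
# Scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)

# Hypothetical sample standing in for the real wordlist.
printf '%s\n' 'pass123' 'pass!' '2024' 'Word_list' 'pass' > "$tmpdir/words.txt"

# 1. Strip everything that is not an ASCII letter.
# 2. Delete lines that end up empty (they held no letters at all).
# 3. Sort and drop duplicate lines.
LC_ALL=C sed 's/[^a-zA-Z]//g; /^$/d' "$tmpdir/words.txt" \
  | LC_ALL=C sort -u > "$tmpdir/clean.txt"

cat "$tmpdir/clean.txt"
```

With this sample, the three variants of pass collapse into one line and the all-digit line disappears.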
# 6  
Do a test run that processes only the first 10 lines and writes the result to a file (with -n and the p flag, only the lines that were actually changed are written)
Code:
sed -n '1,10s/[^[:alpha:][:blank:]]//gp' your_file > tmp_file

Open it and, if you are satisfied with the result, run it on the whole file.
Code:
sed 's/[^[:alpha:][:blank:]]//g' your_file > tmp_file

--- Post updated at 20:13 ---

If the file is large, this is probably better for the test run, because sed quits after line 10 instead of reading the rest of the file
Code:
sed 's/[^[:alpha:][:blank:]]//g; 10q' your_file > tmp_file
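
The two test-run variants above do not print quite the same thing. A small sketch with made-up input, using 3 lines instead of 10 to keep it short (LC_ALL=C is an assumption added here for predictable results):

```shell
# Four sample lines; the first one is already clean.
input=$(printf '%s\n' 'abc' 'd1e' 'f2g' 'h3i')

# -n plus the p flag: prints only those lines in 1-3 where a substitution
# happened, and sed still reads the input to the end.
changed_only=$(printf '%s\n' "$input" \
  | LC_ALL=C sed -n '1,3s/[^[:alpha:][:blank:]]//gp')

# 3q: prints every line up to line 3, then quits without reading further.
first_three=$(printf '%s\n' "$input" \
  | LC_ALL=C sed 's/[^[:alpha:][:blank:]]//g; 3q')

echo "$changed_only"   # de, fg
echo "$first_three"    # abc, de, fg
```

On a file of many gigabytes, the early quit is what keeps the test run fast.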

# 7  
Interesting, so this
Code:
'1,10


denotes the number of lines? See, this is how I learn best. If I have different variations of code in front of me, along with what they do, then I can look at the differences and that sticks with me better.


Yes, it is a large list. It has a massive amount of numbers, Latin, and even Chinese characters. I have no idea where those came from because I scan only U.S. sites and U.S. newspapers online, normally only one or two links deep. But I got them from somewhere.


I am always ready to take in nuggets of information as I search for my gold. Thank you all.
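
On the 1,10 question: in sed, a pair of numbers before a command is a line address range, so 1,10 means "from line 1 through line 10" rather than a count as such, and the leading quote just opens the sed script. A short sketch with made-up input and a smaller range to show the effect:

```shell
# Five sample lines, each a letter followed by a digit.
input=$(printf '%s\n' 'a1' 'b2' 'c3' 'd4' 'e5')

# 2,4 is an address range: the s command runs on lines 2 through 4 only.
range=$(printf '%s\n' "$input" | sed '2,4s/[0-9]//g')

# A single number addresses just that one line.
single=$(printf '%s\n' "$input" | sed '2s/[0-9]//g')

echo "$range"    # a1 b c d e5, one per line
echo "$single"   # a1 b c3 d4 e5, one per line
```

Lines outside the address are passed through unchanged, which is why a1 and e5 keep their digits in the first result.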