get rid of non-alphanumeric characters
# 1  
Old 11-29-2010

Hi!
Could anyone kindly help me with code to remove, from a text file built by collecting and merging several web pages, every word (string) that contains a character other than a letter, a digit, an underscore, or a punctuation mark (i.e. anything outside a-zA-Z0-9, underscore, and punctuation)?

Thanks a lot to anyone who replies!
mjomba from Tanzania
# 2  
Old 11-29-2010
This could help:
Code:
sed 's/[^a-zA-Z0-9_:]/ /g' inputfile
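Note that this sed command works per character: each character outside the listed set becomes a space, and the rest of the word stays. A minimal sketch of that behaviour, assuming a made-up sample line and a hypothetical helper name `replace_chars` (LC_ALL=C is my addition, to keep the ranges byte-based):

```shell
# Hypothetical helper wrapping the sed command from this post.
# LC_ALL=C is an assumption, used so that a-z, A-Z and 0-9 mean plain ASCII.
replace_chars() {
    LC_ALL=C sed 's/[^a-zA-Z0-9_:]/ /g'
}

# Made-up sample input: the '@' is replaced by a space, and the word
# "b@r" is split into "b r" rather than removed as a whole.
printf 'foo b@r baz_1\n' | replace_chars
```

So if whole words have to go, a field-based tool such as awk is probably a better fit.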

This User Gave Thanks to michaelrozar17 For This Post:
# 3  
Old 11-29-2010
Try this out:

Code:
sed -e 's/ [a-zA-Z0-9]*[^a-zA-Z0-9][^a-zA-Z0-9]*[a-zA-Z0-9]* / /g' -e 's/[ ][a-zA-Z0-9]*[^a-zA-Z0-9 ][^a-zA-Z0-9 ]*[a-zA-Z0-9]*//g' inputFile
# 4  
Old 11-29-2010
Code:
awk '{for(i=1;i<=NF;i++)if($i~/[^[:graph:]]/)$i=x}1' file

or
Code:
awk '{for(i=1;i<=NF;i++)if($i~/[^[:alnum:]._]/)$i=x}1'

if you only want to include . and _ for example
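The awk one-liners blank out a field but leave its separator behind, so the output can contain doubled spaces. A rough sketch of one way to tidy that up, assuming whitespace-delimited words; the function name `drop_words` and the sample line are made up, and I spelled the class as explicit ASCII ranges under LC_ALL=C in case an older awk lacks the POSIX classes:

```shell
# Hypothetical helper: drop every whitespace-delimited word containing a
# character other than a letter, digit, '.' or '_'.
# Assumption: explicit ranges under LC_ALL=C stand in for [:alnum:].
drop_words() {
    LC_ALL=C awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /[^A-Za-z0-9._]/) $i = ""   # blank the offending word
        $0 = $0    # re-split the record so the emptied fields disappear
        $1 = $1    # rebuild the record with single spaces between fields
        print
    }'
}

# Made-up sample line: "b@d" is dropped, the other words are kept.
printf 'good b@d ok_1 v1.2\n' | drop_words
```

The side effect is that runs of whitespace in the input are also collapsed to single spaces, which may or may not be wanted.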
# 5  
Old 11-29-2010
A representative example would help. It depends on what you mean by "word", "punctuation" etc., and on whether you want to retain the line terminators.

One example of removing every character except those listed is:

Code:
cat oldfile | tr -cd '[:alnum:][:punct:][:space:]' > newfile

For definitions of the various character classes, see "Regex Tutorial - POSIX Bracket Expressions".

If you actually need to work on "words" it needs a clear definition of what constitutes a "word".
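To illustrate the difference, here is a small sketch of what the tr approach does to a made-up sample; the helper name `strip_chars` and the input bytes are hypothetical, and LC_ALL=C is assumed so that the classes cover ASCII only:

```shell
# Hypothetical helper wrapping the tr command above.
# Assumption: LC_ALL=C, so anything outside ASCII counts as unwanted.
strip_chars() {
    LC_ALL=C tr -cd '[:alnum:][:punct:][:space:]'
}

# Made-up input: "caf" followed by the two bytes of a UTF-8 'e-acute',
# then "ok", a control character (octal 001) and '!'.
# Only the unwanted bytes are deleted; the rest of each word is kept.
printf 'caf\303\251 ok\001!\n' | strip_chars
```

The offending bytes vanish but the words that contained them remain in shortened form, which is why the definition of "word" matters if whole words must be removed.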
# 6  
Old 12-17-2010
Thank you very much!
It does what I needed:
Code:
cat oldfile | tr -cd '[:alnum:][:punct:][:space:]' > newfile

mjomba


Last edited by Scott; 12-17-2010 at 08:16 AM.. Reason: Code tags