cat / sed process weird characters
# 1  
Old 08-03-2011

Hi everyone,
I'm trying to write a shell script that processes a log file. The log format is generally:
(8 digit hex of unix time),(system ID),(state)\n
My shell script gets the file from the web and saves it as a text file in a local directory. I then want to convert the hex to decimal, convert the Unix time to a day/month/year date in MST, and write the result out.

I have something that *mostly* works, by downloading the file, opening it with cat, piping the result to sed, using sed to get all the hex values and looping through them.
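For reference, the hex-to-date step described above could look roughly like the sketch below. The function names hex_to_secs and secs_to_mst_date are made up for illustration, and TZ=MST7 is an assumption that "MST" means a fixed UTC-7 offset with no daylight saving. GNU date and BSD/OS X date take the epoch value differently, so the sketch probes for the GNU form first.

```shell
# Convert an 8-digit hex Unix timestamp to decimal seconds.
# (hex_to_secs is a hypothetical helper name, not from the original script.)
hex_to_secs() {
    printf '%d\n' "0x$1"
}

# Format epoch seconds as day/month/year in MST.
# Assumption: MST is a fixed UTC-7 offset, expressed as the POSIX TZ string MST7.
secs_to_mst_date() {
    if date -d "@0" >/dev/null 2>&1; then
        TZ=MST7 date -d "@$1" '+%d/%m/%Y'    # GNU date syntax
    else
        TZ=MST7 date -r "$1" '+%d/%m/%Y'     # BSD / OS X date syntax
    fi
}
```

With these, secs_to_mst_date "$(hex_to_secs 3B698960)" handles one timestamp, so the loop only needs the cleaned list of hex values.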

Unfortunately, there's a bug in the software that produces the log and for some systems the id isn't defined (someone probably forgot to initialize that variable), and it produces a line that looks like: 3B6A7227,››ù√剃,0

When I open this file with cat, the output for lines like that usually just contains a lot of question marks. This is the line I'm using to isolate the hex values:

Code:
cat ~/Downloads/log.txt | sed 's/[^0-9A-Za-z,\n]//g' | sed 's/,.*,[0,1]$//'

Originally I just had the second "sed"; I added the first one in an attempt to remove all the "weird" characters. Unfortunately, when I run this, it comes out as a list of hex numbers EXCEPT for the weird entries. These entries now have their hex number, a comma, then a number of question marks (and sometimes a decimal number), then another comma and the state.

How can I get rid of these? I realize the bug in the logging code needs to be fixed, but I don't have control over that; I'm just trying to clean up the log file.

Thanks!
# 2  
Old 08-03-2011
I find "cat -vt" pretty handy for making invisible and odd characters visible. If they slip you a ^J (newline) or a comma inside the field, you are a goner, though:
Code:
$ all256|cat -vt
^@^A^B^C^D^E^F^G^H^I
^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^^^_ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~^?M-^@M-^AM-^BM-^CM-^DM-^EM-^FM-^GM-^HM-^IM-
M-^KM-^LM-^MM-^NM-^OM-^PM-^QM-^RM-^SM-^TM-^UM-^VM-^WM-^XM-^YM-^ZM-^[M-^\M-^]M-^^M-^_M- M-!M-"M-#M-$M-%M-&M-'M-(M-)M-*M-+M-,M--M-.M-/M-0M-1M-2M-3M-4M-5M-6M-7M-8M-9M-:M-;M-<M-=M->M-?M-@M-AM-BM-CM-DM-EM-FM-GM-HM-IM-JM-KM-LM-MM-NM-OM-PM-QM-RM-SM-TM-UM-VM-WM-XM-YM-ZM-[M-\M-]M-^M-_M-`M-aM-bM-cM-dM-eM-fM-gM-hM-iM-jM-kM-lM-mM-nM-oM-pM-qM-rM-sM-tM-uM-vM-wM-xM-yM-zM-{M-|M-}M-~M-^?

If you write a C or Perl app, you can ignore a ^J (newline) that arrives before two commas have been seen, and conversely ignore the extra commas when a line has more than two, resetting the garbled ID to a fixed value. Maybe the logger sometimes cannot determine the name from the ID number and writes the binary ID number instead?
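A rough shell/awk sketch of the "fixed value" idea, rather than a C or Perl program: treat the first field as the timestamp and the last field as the state, and replace anything in between that is not purely alphanumeric with a placeholder. The function name normalize_ids and the placeholder UNKNOWN are assumptions for illustration. This version only copes with stray commas; a record split by an embedded newline is not repaired, and any fragment with fewer than three fields is simply dropped.

```shell
# Replace a garbled middle field with a fixed placeholder.
# normalize_ids and "UNKNOWN" are hypothetical names chosen for this sketch.
normalize_ids() {
    LC_ALL=C awk -F, -v OFS=, '
        NF >= 3 {
            id = $2
            # Rejoin the pieces if the garbage itself contained commas.
            for (i = 3; i < NF; i++) id = id "," $i
            # Anything other than plain letters and digits counts as garbled.
            if (id !~ /^[0-9A-Za-z]+$/) id = "UNKNOWN"
            print $1, id, $NF
        }'
}
```

It reads standard input, for example normalize_ids < ~/Downloads/log.txt. Running awk under LC_ALL=C makes it treat the input as plain bytes.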
# 3  
Old 08-03-2011
Untested idea to clean the file by removing any characters except 0-9 A-Z a-z commas and newlines using the unix "tr" command.

Code:
cat ~/Downloads/log.txt | while read old_line
do
         echo "${old_line}"| tr -dc '[0-9][A-Z][a-z],\n'
done

If this does not solve your problem, please post sample data which shows a couple of good lines and a couple of bad lines when displayed by the unix "od" command (which will show exactly what characters are in the file). We don't need the whole file.
Code:
cat ~/Downloads/log.txt | od -xc

This User Gave Thanks to methyl For This Post:
# 4  
Old 08-03-2011
Methyl - your idea would work, but when I try it, I get an error: 'tr: illegal byte sequence'

Here is a sample data set:
(in plaintext):

Code:
3B698960,eSWPump,0
3B698C36,sHeatPre,1
3B698C36,ePHPump,1
3B698C36,eSWPump,1
3B698CB4,Ô!˛ˇ√8ÏÔ,1
3B698CB4,eHWRPump,1
3B698CB4,››ù√剃,1
3B698CB4,eDownHRV,1
3B698CB4,eUpHRV,1
3B698E91,sHeatPre,0
3B698E91,ePHPump,0
3B698E91,eSWPump,0

and


Code:
$ cat ~/Downloads/log.txt | od -xc
0000000      4233    3936    3938    3036    652c    5753    7550    706d
           3   B   6   9   8   9   6   0   ,   e   S   W   P   u   m   p
0000020      302c    000a    4233    3936    4338    3633    732c    6548
           ,   0  \n  \0   3   B   6   9   8   C   3   6   ,   s   H   e
0000040      7461    7250    2c65    0a31    4233    3936    4338    3633
           a   t   P   r   e   ,   1  \n   3   B   6   9   8   C   3   6
0000060      652c    4850    7550    706d    312c    330a    3642    3839
           ,   e   P   H   P   u   m   p   ,   1  \n   3   B   6   9   8
0000100      3343    2c36    5365    5057    6d75    2c70    0a31    4233
           C   3   6   ,   e   S   W   P   u   m   p   ,   1  \n   3   B
0000120      3936    4338    3442    ef2c    fe21    c3ff    ec38    2cef
           6   9   8   C   B   4   , 357   ! 376 377 303   8 354 357   ,
0000140      0a31    4233    3936    4338    3442    652c    5748    5052
           1  \n   3   B   6   9   8   C   B   4   ,   e   H   W   R   P
0000160      6d75    2c70    0a31    4233    3936    4338    3442    dd2c
           u   m   p   ,   1  \n   3   B   6   9   8   C   B   4   , 032
0000200      dd1a    c39d    e48c    2cc4    0a31    4233    3936    4338
         032 335 235   Ì  ** 344 304   ,   1  \n   3   B   6   9   8   C
0000220      3442    652c    6f44    6e77    5248    2c56    0a31    4233
           B   4   ,   e   D   o   w   n   H   R   V   ,   1  \n   3   B
0000240      3936    4338    3442    652c    7055    5248    2c56    0a31
           6   9   8   C   B   4   ,   e   U   p   H   R   V   ,   1  \n
0000260      4233    3936    4538    3139    732c    6548    7461    7250
           3   B   6   9   8   E   9   1   ,   s   H   e   a   t   P   r
0000300      2c65    0a30    4233    3936    4538    3139    652c    4850
           e   ,   0  \n   3   B   6   9   8   E   9   1   ,   e   P   H
0000320      7550    706d    302c    330a    3642    3839    3945    2c31
           P   u   m   p   ,   0  \n   3   B   6   9   8   E   9   1   ,
0000340      5365    5057    6d75    2c70    0030                        
           e   S   W   P   u   m   p   ,   0                            
0000351

thanks!

Last edited by bencpeters; 08-03-2011 at 09:49 PM.. Reason: wrong option on od...
# 5  
Old 08-03-2011
I can't reproduce your "tr" error. Please post the command you typed and the complete error message.

There are certainly some weird characters in the second comma-delimited field of certain records in this sample data. There is also a weird trailing null character at the end of the first record.

What Operating System and version are you running?
What Shell do you use?
# 6  
Old 08-03-2011
Here's the tr error:


Code:
$ cat ~/Downloads/log.txt | while read old_line; do echo "${old_line}" | tr -dc '[0-9][A-Z][a-z],\n'; done
3B698960,eSWPump,0

3B698C36,ePHPump,1
3B698C36,eSWPump,1
tr: Illegal byte sequence
3B698CB4,3B698CB4,eHWRPump,1
tr: Illegal byte sequence
3B698CB4,3B698CB4,eDownHRV,1
3B698CB4,eUpHRV,1
3B698E91,sHeatPre,0
3B698E91,ePHPump,0

I'm running OS X 10.6.5, bash version 3.2.48
# 7  
Old 08-03-2011
My first guess would be that some locale-aware code in the underlying C library that tr is using does not approve of certain byte sequences. You could try running tr in the C/POSIX locale: LC_ALL=C tr ...

Regards,
Alister
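A minimal sketch of the LC_ALL=C suggestion, assuming the goal is still to keep only digits, letters, commas and newlines. The function name clean_log is made up for illustration. In the C locale tr works on single bytes, so it has no multibyte sequences to reject.

```shell
# Delete every byte except 0-9, A-Z, a-z, comma and newline.
# LC_ALL=C makes tr treat the input as raw bytes rather than multibyte text.
# (clean_log is a hypothetical helper name.)
clean_log() {
    LC_ALL=C tr -dc '0-9A-Za-z,\n'
}
```

Used as clean_log < ~/Downloads/log.txt, this needs no while/read loop, and it also drops stray NUL bytes. Any ASCII letters or digits that happen to sit inside the garbage survive, so a bad ID can come out as a short string such as "8" rather than as an empty field.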
This User Gave Thanks to alister For This Post: