Post 303003858 by yifangt on Thursday 21st of September 2017 02:17:32 PM
How to find out the weird blank characters?

I have a text file downloaded from the web. I want to count the unique words used in the file, and to measure how much each person speaks in a conversation by counting the words between the opening and closing quotation marks, which are not the standard ASCII quote characters. I also found that the file contains some odd blank characters that are invisible on stdout: they are the entries with 118391 and 6380 occurrences in the example below.
Judging by the curly single/double quotes, I guess the file was processed on a Mac, but I am not sure. Here is the output from my Ubuntu terminal:
Code:
tr -d '[:blank:]' < infile.txt | grep -o "." | sort | uniq -c | head
      4 ·
   1089 ‘
   1098 ’
  12146 “
  12147 ”
 118391  
   6380 
  12237 about
     31 alot
    154 apple
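
To see what the two invisible entries really are, it may help to look at byte values rather than at the terminal output. Below is a minimal sketch, not a definitive method: `count_odd_bytes` is a hypothetical helper name, and the sample input (a UTF-8 no-break space, bytes c2 a0, plus a carriage return) is only an assumption about what such a file might contain.

```shell
# Hypothetical helper: count every byte that is NOT printable ASCII,
# a tab, or a newline. Reads stdin, prints "count hex-byte" lines.
count_odd_bytes() {
    LC_ALL=C tr -d '\11\12\40-\176' |   # drop tab, newline, space..tilde
        od -An -v -tx1 |                # dump the remaining bytes as hex
        tr -s ' ' '\n' |                # one hex byte per line
        grep . |                        # drop empty lines
        sort | uniq -c
}

# Assumed sample: "one", a no-break space (c2 a0), "two", then CR LF.
printf 'one\302\240two\r\n' | count_odd_bytes
```

Running `count_odd_bytes < infile.txt` on the real file would list each such byte with its count. The curly quotes account for the bytes e2 80 98, e2 80 99, e2 80 9c and e2 80 9d, and the middle dot for c2 b7, so anything beyond those is a candidate for the invisible characters.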

1) How do I identify the invisible "blank/empty" characters in the file so that I can remove them before counting the words?
2) How do I measure how long a person speaks in a conversation, using the opening/closing double quotation mark pairs? What I tried is:
Code:
grep "“.*”" infile.txt

This regex is too greedy: it sometimes combines adjacent dialogues into a single match.
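
One way around the greediness is to stop at the first closing quote after each opening quote, instead of the last one on the line. The following is a rough sketch in awk under stated assumptions: a quotation never spans more than one line, and `quote_lengths`, `oq` and `cq` are illustrative names, not anything from the original file or an existing tool.

```shell
# Hypothetical helper: print each “...” span on a line with its word count.
# Assumes a quotation never spans more than one input line.
quote_lengths() {
    awk -v oq='“' -v cq='”' '
    {
        line = $0
        # Find the next opening quote, then the first closing quote after it.
        while ((s = index(line, oq)) > 0) {
            rest = substr(line, s + length(oq))
            e = index(rest, cq)
            if (e == 0) break            # unmatched opening quote
            quote = substr(rest, 1, e - 1)
            n = split(quote, words)      # split on runs of whitespace
            printf "%d\t%s\n", n, quote
            line = substr(rest, e + length(cq))
        }
    }'
}

# Sample line with two adjacent dialogues (made up for illustration).
printf '%s\n' '“Hello there,” she said. “How are you today?”' | quote_lengths
```

For the sample line this prints the two dialogues separately, with 2 and 4 words. In a UTF-8 locale, `grep -o '“[^”]*”' infile.txt` should extract the same spans, since `[^”]*` cannot run past a closing quote.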
Thanks!

Last edited by yifangt; 09-21-2017 at 03:25 PM.
 
