Need to strip control-A characters from a column in a file


 
# 8  
Old 05-20-2015
Using builtins only, hardcoded for 5 (4 + 1) fields. Not sure how long this will take on such a huge single file though...
Your version would need IFS=$'\001' and would be hardcoded for 33 (32 + 1) fields; escaped line breaks would be needed to keep those long printf lines manageable. (A hedged 32-field sketch is appended after the results below.)
Hardware I/O will be a huge hit...
OS X 10.7.5, default bash terminal.
Code:
#!/bin/bash
echo '1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0' > /tmp/flatfile
# > /tmp/newflatfile
# IFS is not saved/restored for this test; using the hex value of ",".
IFS=$'\x2C'
while read line
do
	line=($line)
	if [ ${#line[@]} -eq 5 ]
	then

		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}${line[3]}$IFS${line[4]}\n" # >> /tmp/newflatfile
	else
		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}$IFS${line[3]}\n" # >> /tmp/newflatfile
	fi
done < /tmp/flatfile
# cat /tmp/newflatfile

Results:-
Code:
Last login: Wed May 20 08:01:31 on ttys000
AMIGA:barrywalker~> cd Desktop/Code/Shell
AMIGA:barrywalker~/Desktop/Code/Shell> chmod 755 ff.sh
AMIGA:barrywalker~/Desktop/Code/Shell> ./ff.sh
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
AMIGA:barrywalker~/Desktop/Code/Shell> _
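For the OP's real data (32 fields, ctrl-A delimiter, extra delimiters only in field 6) the same builtin-only idea might look something like the sketch below. This is untested and hedged: the field number (6), field count (32) and file paths are assumptions taken from the problem description, not verified against real data.
Code:
#!/bin/bash
# Hedged sketch only: builtin-based cleanup of 32 ctrl-A separated fields,
# gluing any extra splits back into field 6. Untested; adjust paths/counts.
IFS=$'\x01'
while read -r line
do
    fld=($line)
    extra=$(( ${#fld[@]} - 32 ))              # number of unwanted delimiters
    while (( extra > 0 ))
    do
        fld[5]="${fld[5]}${fld[6]}"           # merge the split piece back into field 6
        fld=("${fld[@]:0:6}" "${fld[@]:7}")   # drop the merged element
        (( extra-- ))
    done
    printf '%s\n' "${fld[*]}"                 # [*] re-joins with ctrl-A (first char of IFS)
done < /tmp/flatfile > /tmp/newflatfile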

# 9  
Old 05-20-2015
Quote:
Originally Posted by harsha1238
Hi All,

I currently have a flat file with 32 columns. The field delimiter is ctrl-A (\x01). The file was extracted from an Oracle table using a DataStage job. However, in the 6th field, the data contains additional ctrl-A characters which came in as part of the table data.

I need some help removing these ctrl-A characters in just this 6th field alone.

I tried using sed to replace the first 5 delimiters and the last 24 delimiters with another delimiter, like a |, and then using tr to strip off the remaining ctrl-A characters. But it is taking too long. Any help is appreciated.
I don't understand your logic here. If you have a file with 32 fields (or columns), then there should be 31 delimiters separating those fields. But in the last paragraph of your description you talk about handling the first 5 delimiters and the last 24 delimiters. Those 5 + 24 = 29 delimiters would only be right if your input file had 30 fields, not 32.

Not having any real sample data means that we can only guess at which number in the above is wrong.

It looks like wisecracker's code will work as long as there is no more than one added delimiter in the problem field (and that is all you demonstrated in your example), but I think you're saying that there can be zero or more unwanted delimiters in the problem field. (And, as he said, his script is easy for a 4-field test file, but gets awkward when you extend that logic to 30 or 32 fields.)

I tried Chubler_XL's code (with 32 changed to 4 globally and the printf "\x01" calls changed to printf "," globally) against a test file I set up similar to your sample using commas, and didn't get the results I was expecting.

Perhaps this alternative approach will help:
Code:
#!/bin/ksh
# Define real field delimiters and number of delimiters that should appear
BadField=6		# Field that may contain delimiters in data
Delim=$(printf '\x01')	# Delimiter character
Nfields=32		# # of desired fields 
Unused='|'		# A character that never appears in the data

# Fake values for sample input...  Remove these lines when processing real data.
BadField=3
Delim=","
Nfields=4

# awk script to clean up field # BadField...
awk -v B="$BadField" -v D="$Delim" -v N="$Nfields" -v U="$Unused" '
BEGIN {	DERE = "[" D "]"
	UERE = "[" U "]"
}
{	n = gsub(DERE, U) # Get delim count and change them to unused chars.
	for(i = 1; i < B; i++)
		sub(UERE, D)	# Change one initial unused char back to delim
	for(i = n - N + 1; i > 0; i--)
		sub(UERE, "")	# Delete one unused (extra delim) from field B
	gsub(UERE, D)		# Change remaining unused chars back to delim
}
1				# Print updated lines
' file				# Specify input file

and a sample input file named file containing:
Code:
1,A,USA,0
2,B,GERMANY,0
3,C,IN,DIA,1
4,D,CHI,NA,1
5,E,A,B,C,D,E,F,G,6

it produces the output:
Code:
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,1
4,D,CHINA,1
5,E,ABCDEFG,6

which seems to be what you are trying to do with your sample.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

Although written and tested using a Korn shell, this should work with any shell that recognizes basic POSIX shell command substitution and parameter expansion syntax (e.g., ash, bash, dash, ksh, zsh, and many others; but not csh and its derivatives and not an original Bourne shell).

If this does what you want with the sample data, remove the "Fake values for sample input" lines, verify that the remaining settings for BadField and Nfields are correct, and it should work for your files with ctrl-A as the field delimiter. Obviously, you need to unzip your input files and re-zip the output produced.
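Since the files are compressed, the whole cleanup can be run as a single pipeline so the data is only decompressed and recompressed once. The following is just a hedged sketch of that idea, using the script above with the real values plugged in; the file names, and gzip itself, are assumptions about how the files are actually packaged:
Code:
# Hedged sketch: decompress, clean field 6, recompress in one pass.
# "input.gz"/"output.gz" and gzip are assumptions; substitute your real
# file names and compression tool.
gzip -dc input.gz |
awk -v B=6 -v D="$(printf '\x01')" -v N=32 -v U='|' '
BEGIN { DERE = "[" D "]"; UERE = "[" U "]" }
{
    n = gsub(DERE, U)                              # count every delimiter, mark each as U
    for (i = 1; i < B; i++) sub(UERE, D)           # put the first B-1 back
    for (i = n - N + 1; i > 0; i--) sub(UERE, "")  # delete the extras inside field B
    gsub(UERE, D)                                  # put the remaining ones back
}
1' |
gzip -c > output.gz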
# 10  
Old 05-20-2015
Quote:
Originally Posted by Don Cragun
I tried Chubler_XL's code (with 32 changed to 4 globally and the printf "\x01" calls changed to printf "," globally) against a test file I set up similar to your sample using commas, and didn't get the results I was expecting.
Seems to work OK for me:

Code:
$ cat > infile2 <<EOF
> 1,A,USA,0
> 2,B,GERMANY,0
> 3,C,IND,IA,0
> 4,D,CH,INA,0
> EOF

$ awk -F$(printf ',') '
> NF>4{
>    E=NF-4
>    for(i=4;i<4+E;i++) $3=$3$i
>    for(i=4;i<=4;i++) $i=$(i+E)
>    NF=4
> } 1' OFS=$(printf ',') infile2
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0


Might be easier if I replace the hard-coded field numbers with variables:
  • STRIP=field number that contains extra FS chars and needs to be joined
  • FLDS=Required number of fields

Code:
awk -F$(printf ',') '
BEGIN{ FLDS=4; STRIP=3 }
NF>FLDS{
   E=NF-FLDS
   for(i=STRIP+1;i<STRIP+1+E;i++) $(STRIP)=$(STRIP)$i
   for(i=STRIP+1;i<=FLDS;i++) $i=$(i+E)
   NF=FLDS
} 1' OFS=$(printf ',') infile
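Applied to the real file, that might look roughly like this (untested sketch; the ctrl-A delimiter, 32 fields and field 6 are taken from the problem description, and it assumes an awk, such as gawk or /usr/xpg4/bin/awk, that honours assigning a smaller value to NF):

Code:
awk -F"$(printf '\001')" '
BEGIN{ FLDS=32; STRIP=6 }
NF>FLDS{
   E=NF-FLDS
   for(i=STRIP+1;i<STRIP+1+E;i++) $(STRIP)=$(STRIP)$i
   for(i=STRIP+1;i<=FLDS;i++) $i=$(i+E)
   NF=FLDS
} 1' OFS="$(printf '\001')" infile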

# 11  
Old 05-20-2015
Hi Chubler_XL,
Yes, sorry. I didn't get nearly enough sleep the last couple of nights. I missed changing the 7s.

- Don
# 12  
Old 05-21-2015
Hi Don.

I have only just noticed that the OP might have multiple extra delimiters in the same field; I was working from the OP's post #3.

The problem is not the coding but the time it will take to process such a huge file once the extra code needed to test for this/these new requirement(s) is added.

Maybe builtins are not the way to go, but I will rewrite using them when I get home from work tonight...

Thanks for your comments.

Bazza...

Last edited by wisecracker; 05-21-2015 at 11:58 AM..
# 13  
Old 05-21-2015
Quote:
Originally Posted by wisecracker
Hi Don.

I have only just noticed that the OP might have multiple extra delimiters in the same field; I was working from the OP's post #3.

The problem is not the coding but the time it will take to process such a huge file once the extra code needed to test for this/these new requirement(s) is added.

Maybe builtins are not the way to go, but I will rewrite using them when I get home from work tonight...

Thanks for your comments.

Bazza...
I completely agree that the submitter could have made it more clear whether only one "additional" field separator could be present in the line (since the examples only showed one), but the original post sounded to me like zero or more field separators could appear as part of the data in field 6 (which for some unknown reason was field 3 in the examples).

One thing you could do to speed up your script and make it more reliable would be to change:
Code:
while read line
do
	line=($line)
	if [ ${#line[@]} -eq 5 ]
	then

		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}${line[3]}$IFS${line[4]}\n" # >> /tmp/newflatfile
	else
		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}$IFS${line[3]}\n" # >> /tmp/newflatfile
	fi
done < /tmp/flatfile

to:
Code:
while read line
do
	line=($line)
	if [ ${#line[@]} -eq 5 ]
	then	printf '%s\n' "${line[0]}$IFS${line[1]}$IFS${line[2]}${line[3]}$IFS${line[4]}"
	else	printf '%s\n' "${line[0]}$IFS${line[1]}$IFS${line[2]}$IFS${line[3]}" 
	fi
done < /tmp/flatfile # >>/tmp/newflatfile

Adding the format strings to the printf statements protects your script in case an input line contains any percent-sign or backslash characters. And moving the redirection to the end of the loop (assuming that the #s commenting out the redirections will be removed at some point), instead of putting it on each printf inside the loop, should speed things up. The open() and close() calls are fast compared to the fork() and exec() needed to invoke an external utility, but doing millions of them to write a multi-GB file when only one of each is needed will make a significant difference in your script's running time.
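A quick, made-up illustration of the percent-sign/backslash point (the sample string is hypothetical; the exact behaviour of the unsafe form varies between shells and printf implementations):
Code:
# Data that happens to contain "%" and "\":
line='100% complete \today'

# Risky: the data itself becomes the format string, so "%" starts a
# conversion specification and "\t" becomes a tab -- output depends on the data.
printf "$line\n"

# Safe: the data is passed as an argument and printed literally.
printf '%s\n' "$line"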

Note also that the submitter hasn't given any indication of what OS or shell are being used. The awk utility is universally available on UNIX-like systems. But, since array handling is not required in the shell by the standards, I tend to avoid using shell arrays in suggestions until I've determined that the submitter is using a shell that supports arrays. (This is just a personal preference. There is no reason why you should avoid arrays in code you suggest as long as you specify what shell you're using, as you always do.)
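For what it's worth, an array-free variant of the same loop is possible with set -- and positional parameters, so it would run in any POSIX shell. A hedged, untested sketch using the 4-field comma test file from post #8:
Code:
#!/bin/sh
# Hedged sketch: same logic as the bash array loop, but using positional
# parameters so it works without shell arrays. Untested; the delimiter and
# field counts mirror the comma-separated 4-field test file.
Delim=,
while IFS= read -r line
do
    oldIFS=$IFS
    IFS=$Delim
    set -f            # avoid pathname expansion on the unquoted $line
    set -- $line
    set +f
    IFS=$oldIFS
    if [ $# -eq 5 ]
    then    printf '%s\n' "$1$Delim$2$Delim$3$4$Delim$5"
    else    printf '%s\n' "$1$Delim$2$Delim$3$Delim$4"
    fi
done < /tmp/flatfile > /tmp/newflatfile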