Need to strip control-A characters from a column in a file


 
# 8  
Old 05-20-2015
Using builtins only, hardcoded for 5 (4 + 1) fields. Not sure how long this will take on such a huge single file though...
Your version would need IFS=$'\001' and would be hardcoded for 33 (32 + 1) fields; escaped line breaks would be needed to keep those long printf lines manageable. (A hedged 32-field sketch is appended after the results below.)
Hardware I/O will be a huge hit...
OS X 10.7.5, default bash terminal.
Code:
#!/bin/bash
echo '1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0
1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0' > /tmp/flatfile
# > /tmp/newflatfile
# IFS is not saved/restored for this test; using the hex value of ",".
IFS=$'\x2C'
while read line
do
	line=($line)
	if [ ${#line[@]} -eq 5 ]
	then

		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}${line[3]}$IFS${line[4]}\n" # >> /tmp/newflatfile
	else
		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}$IFS${line[3]}\n" # >> /tmp/newflatfile
	fi
done < /tmp/flatfile
# cat /tmp/newflatfile

Results:-
Code:
Last login: Wed May 20 08:01:31 on ttys000
AMIGA:barrywalker~> cd Desktop/Code/Shell
AMIGA:barrywalker~/Desktop/Code/Shell> chmod 755 ff.sh
AMIGA:barrywalker~/Desktop/Code/Shell> ./ff.sh
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
AMIGA:barrywalker~/Desktop/Code/Shell> _
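For the OP's real data (32 fields, ctrl-A delimiter, extra delimiters only in field 6) the same builtin-only idea might look something like the sketch below. This is untested and hedged: the field number (6), field count (32) and file paths are assumptions taken from the problem description, not verified against real data.
Code:
#!/bin/bash
# Hedged sketch only: builtin-based cleanup of 32 ctrl-A separated fields,
# gluing any extra splits back into field 6. Untested; adjust paths/counts.
IFS=$'\x01'
while read -r line
do
    fld=($line)
    extra=$(( ${#fld[@]} - 32 ))              # number of unwanted delimiters
    while (( extra > 0 ))
    do
        fld[5]="${fld[5]}${fld[6]}"           # merge the split piece back into field 6
        fld=("${fld[@]:0:6}" "${fld[@]:7}")   # drop the merged element
        (( extra-- ))
    done
    printf '%s\n' "${fld[*]}"                 # [*] re-joins with ctrl-A (first char of IFS)
done < /tmp/flatfile > /tmp/newflatfile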

# 9  
Old 05-20-2015
Quote:
Originally Posted by harsha1238
Hi All,

I currently have a flat file with 32 columns. The field delimiter is ctrl-A (\x01). The file was extracted from an Oracle table using a DataStage job. However, in the 6th field, the data contains additional ctrl-A characters which came in as part of the table data.

I need some help removing these ctrl-A characters in just this 6th field alone.

I tried using sed to replace the first 5 delimiters and the last 24 delimiters with another delimiter, like a |, and then using tr to strip off the remaining ctrl-A characters. But it is taking too long. Any help is appreciated.
I don't understand your logic here. If you have a file with 32 fields (or columns), then there should be 31 delimiters separating those fields. But in the last paragraph of your description you talk about handling the first 5 delimiters and the last 24 delimiters. Those 5 + 24 = 29 delimiters would only be right if your input file had 30 fields, not 32.

Not having any real sample data means that we can only guess at which number in the above is wrong.

It looks like wisecracker's code will work as long as there is no more than one added delimiter in the problem field (and that is all you demonstrated in your example), but I think you're saying that there can be zero or more unwanted delimiters in the problem field. (And, as he said, his script is easy for a 4-field test file, but gets awkward when you extend that logic to 30 or 32 fields.)

I tried Chubler_XL's code (with 32 changed to 4 globally and the printf "\x01" calls changed to printf "," globally) against a test file I set up similar to your sample using commas, and didn't get the results I was expecting.

Perhaps this alternative approach will help:
Code:
#!/bin/ksh
# Define real field delimiters and number of delimiters that should appear
BadField=6		# Field that may contain delimiters in data
Delim=$(printf '\x01')	# Delimiter character
Nfields=32		# # of desired fields 
Unused='|'		# A character that never appears in the data

# Fake values for sample input...  Remove these lines when processing real data.
BadField=3
Delim=","
Nfields=4

# awk script to clean up field # BadField...
awk -v B="$BadField" -v D="$Delim" -v N="$Nfields" -v U="$Unused" '
BEGIN {	DERE = "[" D "]"
	UERE = "[" U "]"
}
{	n = gsub(DERE, U) # Get delim count and change them to unused chars.
	for(i = 1; i < B; i++)
		sub(UERE, D)	# Change one initial unused char back to delim
	for(i = n - N + 1; i > 0; i--)
		sub(UERE, "")	# Delete one unused (extra delim) from field B
	gsub(UERE, D)		# Change remaining unused chars back to delim
}
1				# Print updated lines
' file				# Specify input file

and a sample input file named file containing:
Code:
1,A,USA,0
2,B,GERMANY,0
3,C,IN,DIA,1
4,D,CHI,NA,1
5,E,A,B,C,D,E,F,G,6

it produces the output:
Code:
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,1
4,D,CHINA,1
5,E,ABCDEFG,6

which seems to be what you are trying to do with your sample.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

Although written and tested using a Korn shell, this should work with any shell that recognizes basic POSIX shell command substitution and parameter expansion syntax (e.g., ash, bash, dash, ksh, zsh, and many others; but not csh and its derivatives and not an original Bourne shell).

If this does what you want with the sample data, remove the "Fake values for sample input" lines, verify that the remaining settings for BadField and Nfields are correct, and it should work for your files with ctrl-A as the field delimiter. Obviously, you need to unzip your input files and re-zip the output produced.
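Since the files are compressed, the whole cleanup can be run as a single pipeline so the data is only decompressed and recompressed once. The following is just a hedged sketch of that idea, using the script above with the real values plugged in; the file names, and gzip itself, are assumptions about how the files are actually packaged:
Code:
# Hedged sketch: decompress, clean field 6, recompress in one pass.
# "input.gz"/"output.gz" and gzip are assumptions; substitute your real
# file names and compression tool.
gzip -dc input.gz |
awk -v B=6 -v D="$(printf '\x01')" -v N=32 -v U='|' '
BEGIN { DERE = "[" D "]"; UERE = "[" U "]" }
{
    n = gsub(DERE, U)                              # count every delimiter, mark each as U
    for (i = 1; i < B; i++) sub(UERE, D)           # put the first B-1 back
    for (i = n - N + 1; i > 0; i--) sub(UERE, "")  # delete the extras inside field B
    gsub(UERE, D)                                  # put the remaining ones back
}
1' |
gzip -c > output.gz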
# 10  
Old 05-20-2015
Quote:
Originally Posted by Don Cragun
I tried Chubler_XL's code (with 32 changed to 4 globally and the printf "\x01" calls changed to printf "," globally) against a test file I set up similar to your sample using commas, and didn't get the results I was expecting.
Seems to work OK for me:

Code:
$ cat > infile2 <<EOF
> 1,A,USA,0
> 2,B,GERMANY,0
> 3,C,IND,IA,0
> 4,D,CH,INA,0
> EOF

$ awk -F$(printf ',') '
> NF>4{
>    E=NF-4
>    for(i=4;i<4+E;i++) $3=$3$i
>    for(i=4;i<=4;i++) $i=$(i+E)
>    NF=4
> } 1' OFS=$(printf ',') infile2
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0


Might be easier if I replace the hard-coded field numbers with variables:
  • STRIP=field number that contains extra FS chars and needs to be joined
  • FLDS=Required number of fields

Code:
awk -F$(printf ',') '
BEGIN{ FLDS=4; STRIP=3 }
NF>FLDS{
   E=NF-FLDS
   for(i=STRIP+1;i<STRIP+1+E;i++) $(STRIP)=$(STRIP)$i
   for(i=STRIP+1;i<=FLDS;i++) $i=$(i+E)
   NF=FLDS
} 1' OFS=$(printf ',') infile
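Applied to the real file, that might look roughly like this (untested sketch; the ctrl-A delimiter, 32 fields and field 6 are taken from the problem description, and it assumes an awk, such as gawk or /usr/xpg4/bin/awk, that honours assigning a smaller value to NF):

Code:
awk -F"$(printf '\001')" '
BEGIN{ FLDS=32; STRIP=6 }
NF>FLDS{
   E=NF-FLDS
   for(i=STRIP+1;i<STRIP+1+E;i++) $(STRIP)=$(STRIP)$i
   for(i=STRIP+1;i<=FLDS;i++) $i=$(i+E)
   NF=FLDS
} 1' OFS="$(printf '\001')" infile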

# 11  
Old 05-20-2015
Hi Chubler_XL,
Yes, sorry. I didn't get nearly enough sleep the last couple of nights. I missed changing the 7s.

- Don
# 12  
Old 05-21-2015
Hi Don.

I have only just noticed that the OP might have multiple extra delimiters in the same field; I was working from the OP's post #3.

The problem is not the coding but the time it will take to process such a huge file once the extra code needed to test for this/these new requirement(s) is added.

Maybe builtins are not the way to go, but I will rewrite using them when I get home from work tonight...

Thanks for your comments.

Bazza...

Last edited by wisecracker; 05-21-2015 at 11:58 AM..
# 13  
Old 05-21-2015
Quote:
Originally Posted by wisecracker
Hi Don.

I have only just noticed that the OP might have multiple extra delimiters in the same field; I was working from the OP's post #3.

The problem is not the coding but the time it will take to process such a huge file once the extra code needed to test for this/these new requirement(s) is added.

Maybe builtins are not the way to go, but I will rewrite using them when I get home from work tonight...

Thanks for your comments.

Bazza...
I completely agree that the submitter could have made it more clear whether only one "additional" field separator could be present in the line (since the examples only showed one), but the original post sounded to me like zero or more field separators could appear as part of the data in field 6 (which for some unknown reason was field 3 in the examples).

One thing you could do to speed up your script and make it more reliable would be to change:
Code:
while read line
do
	line=($line)
	if [ ${#line[@]} -eq 5 ]
	then

		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}${line[3]}$IFS${line[4]}\n" # >> /tmp/newflatfile
	else
		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}$IFS${line[3]}\n" # >> /tmp/newflatfile
	fi
done < /tmp/flatfile

to:
Code:
while read line
do
	line=($line)
	if [ ${#line[@]} -eq 5 ]
	then	printf '%s\n' "${line[0]}$IFS${line[1]}$IFS${line[2]}${line[3]}$IFS${line[4]}"
	else	printf '%s\n' "${line[0]}$IFS${line[1]}$IFS${line[2]}$IFS${line[3]}" 
	fi
done < /tmp/flatfile # >>/tmp/newflatfile

Adding the format strings to the printf statements protects your script in case an input line contains any percent-sign or backslash characters. And moving the redirection to the end of the loop (assuming that the #s commenting out the redirections will be removed at some point), instead of putting it on each printf inside the loop, should speed things up. The open() and close() calls are fast compared to the fork() and exec() needed to invoke an external utility, but doing millions of them to write a multi-GB file when only one of each is needed will make a significant difference in your script's running time.
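A quick, made-up illustration of the percent-sign/backslash point (the sample string is hypothetical; the exact behaviour of the unsafe form varies between shells and printf implementations):
Code:
# Data that happens to contain "%" and "\":
line='100% complete \today'

# Risky: the data itself becomes the format string, so "%" starts a
# conversion specification and "\t" becomes a tab -- output depends on the data.
printf "$line\n"

# Safe: the data is passed as an argument and printed literally.
printf '%s\n' "$line"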

Note also that the submitter hasn't given any indication of what OS or shell are being used. The awk utility is universally available on UNIX-like systems. But, since array handling is not required in the shell by the standards, I tend to avoid using shell arrays in suggestions until I've determined that the submitter is using a shell that supports arrays. (This is just a personal preference. There is no reason why you should avoid arrays in code you suggest as long as you specify what shell you're using, as you always do.)
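For what it's worth, an array-free variant of the same loop is possible with set -- and positional parameters, so it would run in any POSIX shell. A hedged, untested sketch using the 4-field comma test file from post #8:
Code:
#!/bin/sh
# Hedged sketch: same logic as the bash array loop, but using positional
# parameters so it works without shell arrays. Untested; the delimiter and
# field counts mirror the comma-separated 4-field test file.
Delim=,
while IFS= read -r line
do
    oldIFS=$IFS
    IFS=$Delim
    set -f            # avoid pathname expansion on the unquoted $line
    set -- $line
    set +f
    IFS=$oldIFS
    if [ $# -eq 5 ]
    then    printf '%s\n' "$1$Delim$2$Delim$3$4$Delim$5"
    else    printf '%s\n' "$1$Delim$2$Delim$3$Delim$4"
    fi
done < /tmp/flatfile > /tmp/newflatfile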