Identify lines with wrong format in a file and fix


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Identify lines with wrong format in a file and fix
# 1  
Old 10-30-2015
Identify lines with wrong format in a file and fix

Gurus,

I have a data file which has a certain number of columns say 101. It has one description column which contains foreign characters and due to this some times, those special characters are translated to new line character and resulting in failing the process.

I am using the following awk command to indentify the number of columns

Code:
cat File name | awk -F "|" '{print NF}' | sort -u

If the file has foreign characters then it will display some thing like this if I run the above command.
Code:
101
102

I have written a small code to compare the counts of previous line with the next line and displays the line no. I have not done extensive unix scripting but the performance is not good as it takes for ever.

Code:
#!/usr/bin/ksh
filename=$1
count=0
prev_count=0
lineno=1
while IFS= read line
do
count=`echo $line| awk -F "|" '{print NF}'`
if [ $count -ne $prev_count ]
then
echo $prev_count
echo $count
echo $line_no
echo "previous count is not matching"
fi
prev_count=$count
line_no=`expr $line_no + 1`
done < $filename

Is there a better way to identify the issue record and replace the foreign character with space or remove it all together.

Here is the issue record:

bad:
Code:
98671|20150731|C||CNY||||||||||||2015|07||PT 9867100000258887|00099PPD-CD DDJ½|«~ª|20150731|20150731|||||||||164737||0|||PT||CCID|||WKN|||||CNY|D|00001.000000|+000000000000877.50|+000000000000877.50|164737||||||61|680006003000211008187000||||WKNPT150731A000099|OTH|||||||||||EXT|164737|CN||68|1008187000||0600300021|||||CNY||0600300021||||||||||||

good:

Code:
00001|20150731|C||CNY||||||||||||2015|07||FB 0000100000000001|FB - GA1890|20150731|20150731|||||||||164737||0|||FB||CCID|||WKN|||||CNY|C|00001.000000|+000000000791640.00|+000000000791640.00|164737||||||62|580088999000011008400000||||WKNFB1507315800001|OTH|||||||||||EXT|164737|CN||58|1008400000||8899900001|||||CNY||8899900001||||||||||||

00001|20150731|C||CNY||||||||||||2015|07||AT 0000100000000002||20150731|20150731|||||||||164737||0|||AT||CCID|||CAT|||||CNY|C|00001.000000|+000000000002000.00|+000000000002000.00|164737||||||62|680002888554511008010000||||CATEX1507310813829|OTH|||||||||||EXT|164737|CN||68|1008010000||0288855451|||||CNY||0288855451||||||||||||

When the above bad record is read by abinitio etl tool, it gets new line character and the record gets scrambled and will shifted to next column and process fails.

Any help or quick command will be helpful. Thanks.

Last edited by Don Cragun; 10-30-2015 at 05:35 PM.. Reason: Add CODE tags.
# 2  
Old 10-30-2015
Why not do the entire loop in awk? Try
Code:
awk '{print NR, NF}' FS="|" file

to find the field count of the lines with line-No.s.
The interpretation of non- ASCII characters is locale- dependent. This
Code:
LC_ALL=C sed -r s/[$'\200'-$'\377'']+/ /g' file

would do as you requested.

---------- Post updated at 22:06 ---------- Previous update was at 22:01 ----------

Actually, I can't see those non-ASCII chars modify the field count. Try also
Code:
awk '1; {print NR, NF, gsub ("[\200-\377]+", " ")} 1' FS="|" file

to compare the lines before - after the replacement
# 3  
Old 11-03-2015
RudiC,

Thanks for your suggestion. After searching for other threads in this forum, I have modified the code as follows. It identifies the problem records which have greater than 101 columns.

Code:
 awk -F "|" 'NF > 101 {print NR,$0}' <filename>

But to fix it is still a manual step. I am getting the output as shown below.

Code:
313301 98671|20150929|C||CNY||||||||||||2015|09||PT  9867100000303106|00283PPD-CDDDJ½|«~¿¬|20150929|20150929|||||||||164737||0|||PT||CCID|||WKN|||||CNY|D|00001.000000|+000000000000610.00|+000000000000610.00|164737||||||61|680006003000211008187000||||WKNPT150929A000283|OTH|||||||||||EXT|164737|CN||68|1008187000||0600300021|||||CNY||0600300021||||||||||||

Is there a way to identify the record which is modified to
Code:
|«~¿¬

in advance and fix it.
# 4  
Old 11-03-2015
I don't understand your request. The red pipe char is the normal ASCII 0X7C used as a field separator in your file. Do you imply that one is an artefact? Did you try the gsub from post#1?

Last edited by RudiC; 11-04-2015 at 01:20 PM.. Reason: included "red"
This User Gave Thanks to RudiC For This Post:
# 5  
Old 11-04-2015
RudiC,

I tried your command and it works. It removes the foreign characters.


< 313301 98671|20150929|C||CNY||||||||||||2015|09||PT 9867100000303106|00283PPD-CDDDJ½|«~¿¬|20150929|20150929|||||||||164737||0|||PT||CCID|||WKN|||||CNY|D|00001.000000|+000000000000610.00|+00 0000000000610.00|164737||||||61|680006003000211008187000||||WKNPT150929A000283|OTH|||||||||||EXT|164 737|CN||68|1008187000||0600300021|||||CNY||0600300021||||||||||||

> 313301 98671|20150929|C||CNY||||||||||||2015|09||PT 9867100000303106|00283PPD-CDDDJ | ~ |20150929|20150929|||||||||164737||0|||PT||CCID|||WKN|||||CNY|D|00001.000000|+000000000000610.00|+00 0000000000610.00|164737||||||61|680006003000211008187000||||WKNPT150929A000283|OTH|||||||||||EXT|164 737|CN||68|1008187000||0600300021|||||CNY||0600300021||||||||||||

Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Script Output coming in wrong format....

Hi team, getting output logs wrong in different format from telnet script ... getting Output.txt macro_outdoor_dist-6.0.0(v4_0_2) DN:1.3.903 (1101:100:11w:500:3:2:103:aa) macro_outdoor_dist-8.1.0(v3_1_0) DN:1.3.409 (N/A)... (3 Replies)
Discussion started by: Ganesh Mankar
3 Replies

2. Shell Programming and Scripting

How to identify whether the script is in Unix format or not ?

Hi All, I have the below scenario in my environment Developers used to copy file from windows to Linux box. Some time on the copied file developers miss to run the dos2unix utility. Because of this script gets failed during the execution. Most of the failures are due to the dos2unix format... (7 Replies)
Discussion started by: kalpeer
7 Replies

3. UNIX for Dummies Questions & Answers

How to identify broken lines in a file?

Hi, I have a 100 byte length fixed width file . In that three rows are broken and went off to next line. How can I identify the broken lines? E.g. ABCD1234MNRD4321 abcd1234mnrd 4321 As you can see in my example my second row with small case alphabets is broken... (5 Replies)
Discussion started by: okkadu
5 Replies

4. UNIX for Dummies Questions & Answers

Use Regex to identify / format a complex string

First of all, please have mercy on me. I am not a noob to programming, but I am about as noob as you can get with regex. That being said, I have a problem. I've got a string that looks something like this: Publication - Bob M. Jones, Tony X. Stark, and Fred D. Man, \"Really Awesome Article... (1 Reply)
Discussion started by: egill
1 Replies

5. UNIX for Dummies Questions & Answers

Script to display lines in a file in a particular format

I have a bunch of files in various folders. I want to go through each of them and display certain lines in a particular format All files have a similar format Date: Time: User: Message: Miscellaneous: (and some other stuff)I want to display to only the "Date:", "Time:" "User:" lines in... (7 Replies)
Discussion started by: newbiegal01
7 Replies

6. Shell Programming and Scripting

incrementing lines in the file & format output.

Hi All, I need read the file and out put format as below using ksh, I wrote below script its keep on repeating first line in the file. may i know the best way to get the below out put while incrementing line in the file. cat b.txt |awk '{print $0}' |while read line do aa=`cat $line |head -1... (7 Replies)
Discussion started by: ashanabey
7 Replies

7. UNIX for Dummies Questions & Answers

Format/Fix Timestamp Data in a File.

Hello Experts, I have a timestamp(6) column in a .csv data file , format of the data is as below:- ETCT,P,Elec, Inc.,abc,11/5/2010 4:16:09.000000 PM,Y,Y,Y I want the timestamp column to be properly formatted like 11/05/2010 04:16:09.000000 PM Currently the "0" is missing with... (3 Replies)
Discussion started by: mtlrsk
3 Replies

8. Shell Programming and Scripting

format lines on a file

Hi again: I have a log file wihch has always this format: DATA line 1 DATA line 2 ^^^^^ | Spaces or TABs The first line always begins from the start, but the second begins with spaces or TABs, Question: How can I add the second line to the first one? I mean this: DATA... (6 Replies)
Discussion started by: iga3725
6 Replies

9. Shell Programming and Scripting

Need help to print lines contains particular string format in a file

Hi, I want to print the lines in a file that matches particular string format using shell scripting. (4 Replies)
Discussion started by: sudhakaryadav
4 Replies

10. Shell Programming and Scripting

Problem with Script that writes max lines of a file - Any ideas how to fix?

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (4 Replies)
Discussion started by: mmiller99
4 Replies
Login or Register to Ask a Question