Cleanse Junk Data File - Using Shell or awk or sed


 
# 1  
Old 06-07-2006

Hello Shell Gurus, I need help solving this puzzle. We have a junk data file that needs to be fed into the database, and it has to be cleansed with a shell script first. I am not an expert, so I need help with this.

Here is what I need to do on the input file:

- Step 1: Replace all pipes '|' within the file with a space ' '.

- Step 2: Remove special characters and junk data within the file. The tricky part is that we do not have a defined set of special / junk characters. The solution would be to remove any character that cannot be typed on the keyboard.

Remove any character NOT IN [ A-Z, a-z, 0-9, `, ~, !, @, #, $, %, &, *, (, ), _, -, +, =, ., ", ', :, ;, {, }, [, ], <, >, ?, /, \, |, and the comma ]

Note: basically, remove any special character that is not on the keyboard.

- Step 3: Check the count of pipes on each line of the data to make sure we have the correct number. I should receive 4 pipes on each line. If there are fewer, we need to keep appending the next line (concatenating the lines below). This field is basically a memo in which the user typed a small paragraph, and it needs to be joined into a single line.

- Step 4: Replace all zzz with a pipe '|'.


Note: Below is a QA step to be embedded within the script after cleansing. It only writes an error log file that can be used to identify and fix records manually.

- Step 5: Check whether the length of the 2nd field is > 50 and the 3rd field is > 200. If so, write the line number and the record info to the error log file.

- Step 6: Check the number of fields or pipes within each line. If the number of fields is not equal to 4, write the line number and record info to the same error log.

Sample Broken Lines and data
-----------------------------


467zzzComputer|MonitorzzzPurchase Prise $150
Best Price $100
Cheapest Price $75
highest price $200zzzTzzz


Correct record would look like this
467|Computer Monitor|Purchase Prise $150 Best Price $100 Cheapest Price $75 highest price $200|T|

Note: Broken lines fixed. The '|' was replaced with a space where it read Computer|Monitor. The memo field was converted into a single line. Also, every zzz was replaced with a pipe.

Thanks
# 2  
Old 06-07-2006
Quote:
- Step 1: Replace all pipes '|' within the file with a space ' '.
Move this to after step 3 (the pipe count check).
Code:
 tr -s '|' ' ' < oldfile > newfile

Quote:
- Step 2: Remove special characters and junk data within the file. The tricky part is that we do not have a defined set of special / junk characters. The solution would be to remove any character that cannot be typed on the keyboard.

Remove any character NOT IN [ A-Z, a-z, 0-9, `, ~, !, @, #, $, %, &, *, (, ), _, -, +, =, ., ", ', :, ;, {, }, [, ], <, >, ?, /, \, |, and the comma ]
Code:
# Delete every byte outside printable ASCII (space through ~); LC_ALL=C makes sed work byte by byte
LC_ALL=C sed 's/[^[:print:]]//g' filename > newfile

Quote:
- Step 3: Check the count of pipes on each line of the data to make sure we have the correct number. I should receive 4 pipes on each line. If there are fewer, we need to keep appending the next line (concatenating the lines below). This field is basically a memo in which the user typed a small paragraph, and it needs to be joined into a single line.
Not sure about this step....
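One possible approach for this step is to let awk glue lines together until a record holds four delimiters. This is only a sketch: it assumes the delimiters are still zzz at this point (step 4 not yet applied), that every complete record contains exactly four of them as in the sample, and that blank lines can be skipped. The name join_records is made up for illustration.

```shell
# join_records: append continuation lines to the current record until it
# contains the expected number of "zzz" delimiters (4 in the sample data).
join_records() {
    awk -v want=4 '
    NF == 0 { next }                      # skip blank lines
    {
        # Add this physical line to the record being built.
        buf = (buf == "") ? $0 : buf " " $0
        # gsub on a scratch copy returns how many "zzz" the record holds.
        tmp = buf
        if (gsub(/zzz/, "", tmp) >= want) {
            print buf
            buf = ""
        }
    }
    END {
        # Emit an incomplete trailing record so a later check can flag it.
        if (buf != "") print buf
    }'
}

# Possible usage, followed by steps 1 and 4 from this thread:
#   join_records < oldfile | tr '|' ' ' | sed 's/zzz/|/g' > newfile
```

Counting zzz rather than real pipes means the stray '|' inside Computer|Monitor does not throw the count off, so step 1 can run either before or after the join.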
Quote:
- Step 4: Replace all zzz with a pipe '|'.
Code:
sed 's/zzz/|/g' oldfile > newfile
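
Steps 5 and 6 (the QA checks) could be handled with a short awk filter once the file has been cleansed. This is a rough sketch under a few assumptions: the file already uses real pipes, a good record has exactly four of them (so awk sees five fields, the last one empty), and the two length limits are checked independently. The names qa_check and error.log are hypothetical.

```shell
# qa_check: print one line per problem, prefixed with the record line number.
qa_check() {
    awk -F'|' '
    {
        # Step 6: with "|" as the separator, pipes = number of fields - 1.
        pipes = NF - 1
        if (pipes != 4) {
            print NR ": expected 4 pipes, found " pipes ": " $0
            next
        }
        # Step 5: length limits for the 2nd and 3rd fields.
        if (length($2) > 50)  print NR ": field 2 longer than 50 characters: " $0
        if (length($3) > 200) print NR ": field 3 longer than 200 characters: " $0
    }'
}

# Possible usage:
#   qa_check < newfile > error.log
```

Records with the wrong pipe count are reported once and skipped, since their field positions cannot be trusted for the length checks.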
