Hello community, I am getting log files from a system, and I need to clean the data and store it as txt files for reporting purposes. Since these files are generated on a Unix box, we have to write a shell script to handle the data cleansing.
This was the raw data, and I need to write a shell script to cleanse it:
1. Rows starting with # are comments and should be ignored, except for the #columns rows.
2. The #columns rows give the column names and the #rows rows give the actual data.
3. Unwanted data is highlighted in red and useful data is highlighted in black.
4. The header for the output file is always all the #headers in the data, along with InsertTime and DocID.
5. Assign the values as per the header, and add the InsertTime and DocID values too.
6. The data delimiter in the output file is |.
Please find the desired output:
Code:
InsertTime|DocID|TargetDoc|GRank|LRank|Priority|Loc ID|Rank|Check Name
201604070523|101|aaaaa|1|1|Slow|8gkahinka.01||
201604070523|101|aaaaa|1|0|Slow|7nlafnjbaflnbja.01||
201604070523|102|aa||||8gkahinka.01|1|xyz
201604070523|102|aax||||7nlafnjbaflnbja.01|1|none
201604070750|101|xxxx|1|1|Slow|bjkkacka.01||
201604070750|101|yyyy|1|0|Slow|jiafjklas.001||
Last edited by RudiC; 04-07-2016 at 11:11 AM..
Reason: Added code tags
We're running through the file twice, and I had to replace the spaces in the column headers with an underscore char to avoid additional field splits. Anybody out there want to try to pimp it up?
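For anyone curious, here is a minimal sketch of the underscore trick described above (not the actual script from the earlier post, which is not quoted here; the sample "#columns" line is made up to match the thread's data):
Code:
```shell
# Join space-containing quoted column names with underscores so each
# one survives awk's default field splitting as a single field.
echo "#columns 'Loc ID' 'Check Name' 'Rank'" |
awk -v SQ="'" '{
	# Split on the single quote; even-numbered pieces are the quoted names.
	n = split($0, a, SQ)
	for (i = 2; i <= n; i += 2)
		gsub(/ /, "_", a[i])
	# Reassemble the line with the quotes put back.
	out = a[1]
	for (i = 2; i <= n; i++)
		out = out SQ a[i]
	print out
}'
# prints: #columns 'Loc_ID' 'Check_Name' 'Rank'
```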
Hi pdathu,
Someone who has been posting questions to this forum for well over 4 years is not "a newbie who has never had a chance to work on the advanced commands".
You may have chosen to use this site as your unpaid programming staff instead of using it as a way to learn how to use basic UNIX, Linux, and BSD system utilities; but that doesn't qualify you to be someone who has never had a chance to work on the advanced commands.
In future threads, we will be very disappointed if you still claim to be a newbie and refuse to make any attempt to show us that you have invested any effort to solve problems on your own.
Hi RudiC,
I won't claim that this is any prettier than your suggestions (and it does read the input file twice), but it is a slightly different approach and doesn't need to convert spaces in field names to underscores. It does, however, depend on the fact that there are no spaces in any of the fixed header field names or data (i.e., those that are not defined by a "#columns" comment line), since it uses spaces and colons as field separators on those lines.
And, since it defines SQ outside of the awk script using "'" instead of using "\047", it will also work on mainframe systems using EBCDIC as well as on systems using codesets that are supersets of ASCII.
Code:
awk -v SQ="'" '
function AddHeader(name) {
	# Add a given header name (name) to the list of headers to be used by
	# this run of this script if this header name has not been seen before
	# and note the output field number to be used for this header name.
	if(name in OutField) return
	Hdr[++HdrCnt] = name
	OutField[name]
}
BEGIN {	# Add fixed header names. Note that fixed header names must not
	# contain any spaces and the 1st fixed header name must appear as the
	# first field on any input line supplying fixed header data!
	AddHeader("InsertTime")
	AddHeader("DocID")
	# Set number of fixed header names.
	FixedHdrCnt = HdrCnt
	# Set output field separator.
	OFS = "|"
}
FNR == NR && $1 == "#columns" {
	# On the 1st pass through the input file, gather the header names to be
	# processed on this run.
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		AddHeader(a[i])
	next
}
FNR == 1 && NR > 1 {
	# We are starting the 2nd pass through the input file. Print the
	# header.
	for(i = 1; i <= HdrCnt; i++)
		printf("%s%s", Hdr[i], (i == HdrCnt) ? ORS : OFS)
}
NR > FNR && $0 ~ ("^" Hdr[1] ":") {
	# Gather fixed header data for following row lines.
	n = split($0, a, /[: ]/)
	for(i = 1; i <= FixedHdrCnt; i++)
		Data[a[i * 2 - 1]] = a[i * 2]
	# Clear columnsFound so we do not try to process "#headers" data lines.
	columnsFound = 0
}
$1 == "#columns" {
	# Gather input field header names to be used for this section of the
	# input file.
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		FieldName[i] = a[i]
	# Note that we have found headers for this section. (There really
	# should be a separate check for a line starting with "#rows", but the
	# directions given state that that line has to be ignored.)
	columnsFound = 1
}
NR > FNR && $0 ~ ("^" SQ) && columnsFound {
	# Gather and print data from this input line:
	# Gather data...
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		Data[FieldName[i]] = a[i]
	# Print fixed data...
	for(i = 1; i <= FixedHdrCnt; i++)
		printf("%s%s", Data[Hdr[i]], OFS)
	# Print and delete line specific data...
	for(i = FixedHdrCnt + 1; i <= HdrCnt; i++) {
		printf("%s%s", Data[Hdr[i]], (i == HdrCnt) ? ORS : OFS)
		delete Data[Hdr[i]]
	}
}' file file
If anyone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
With the given sample input data, the above script produces the output:
Code:
InsertTime|DocID|TargetDoc|GRank|LRank|Priority|Loc ID|Rank|Check Name
201604070523|101|aaaaa|1|1|Slow|8gkahinka.01||
201604070523|101|aaaaa|1|0|Slow|7nlafnjbaflnbja.01||
201604070523|102|aa||||8gkahinka.01|1|xyz
201604070523|102|aax||||7nlafnjbaflnbja.01|1|none
201604070750|101|xxxx|1|1|Slow|bjkkacka.01||
201604070750|101|yyyy|1|0|Slow|jiafjklas.001||
as requested.
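As an aside, here is a minimal illustration of the SQ="'" technique: the quote character is handed to awk from the shell, so the awk program never hard-codes an ASCII-specific escape like "\047" (the sample string here is made up):
Code:
```shell
# SQ holds a literal single quote regardless of the codeset in use.
awk -v SQ="'" 'BEGIN {
	# Split a string on the single quote passed in from the shell.
	n = split("a" SQ "b" SQ "c", f, SQ)
	print n, f[2]
}'
# prints: 3 b
```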