Shell script Help - Data cleansing


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Shell script Help - Data cleansing
# 8  
Old 04-08-2016
Quote:
Originally Posted by Don Cragun
Hi pdathu,
Someone who has been posting questions to this forum for well over 4 years is not "a newbie who has never had a chance to work on the advanced commands".

You may have chosen to use this site as your unpaid programming staff instead of using it as a way to learn how to use basic UNIX, Linux, and BSD system utilities; but that doesn't qualify you to be someone who has never had a chance to work on the advanced commands.

In future threads, we will be very disappointed if you still claim to be a newbie and refuse to make any attempt to show us that you have invested any effort to solve problems on your own.

Hi RudiC,
I won't claim that this is any prettier than your suggestions (and it does read the input file twice), but it is a slightly different approach and doesn't need to convert spaces in field names to underscores. But, it does depend on the fact that there are no spaces in any of the fixed header fields names or data (i.e., those that are not defined by a "#columns" comment line) since it uses spaces and colons as field separators on the:
Code:
'FixedHeaderName1:FHN1data[ OtherFixedHeaderName:OTHERdata]...'

lines.

And, since it defines SQ outside of the awk script using "'" instead of using "\047", it will also work on mainframe systems using EBCDIC as well as on systems using codesets that are supersets of ASCII.

Code:
awk -v SQ="'" '
function AddHeader(name) {
	# Add a given header name (name) to the list of headers to be used by
	# this run of this script if this header name has not been seen before
	# and note the output field number to be used for this header name.
	if(name in OutField) return
	Hdr[++HdrCnt] = name
	OutField[name]
}
BEGIN {	# Add fixed header names.  Note that fixed header names must not
	# contain any spaces and the 1st fixed header name must appear as the
	# first field on any input line supplying fixed header data!
	AddHeader("InsertTime")
	AddHeader("DocID")

	# Set number of fixed header names.
	FixedHdrCnt = HdrCnt

	# Set output field separator.
	OFS = "|"
}
FNR == NR && $1 == "#columns" {
	# On the 1st pass through the input file, gather the header names to be
	# processed on this run.
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		AddHeader(a[i])
	next
}
FNR == 1 && NR > 1 {
	# We are starting the 2nd pass through the input file.  Print the
	# header.
	for(i = 1; i <= HdrCnt; i++)
		printf("%s%s", Hdr[i], (i == HdrCnt) ? ORS : OFS)
}
NR > FNR && $0 ~ ("^" Hdr[1] ":") {
	# Gather fixed header data for following row lines.
	n = split($0, a, /[: ]/)
	for(i = 1; i <= FixedHdrCnt; i++)
		Data[a[i * 2 - 1]] = a[i * 2]

	# Clear columnsFound so we do not try to process "#headers" data lines.
	columnsFound = 0
}
$1 == "#columns" {
	# Gather input field header names to be used for this section of the
	# input file.
	n = split($0, a, SQ)
	for(i = 2; i <= n;  i += 2)
		FieldName[i] = a[i]

	# Note that we have found headers for this section.  (There really
	# should be a separate check for a line starting with "#rows", but the
	# directions given state that that line has to be ignored.)
	columnsFound = 1
}
NR > FNR && $0 ~ ("^" SQ) && columnsFound {
	# Gather and print data from this input line:
	# Gather data...
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		Data[FieldName[i]] = a[i]

	# Print fixed data...
	for(i = 1; i <= FixedHdrCnt; i++)
		printf("%s%s", Data[Hdr[i]], OFS)

	# Print and delete line specific data...
	for(i = FixedHdrCnt + 1; i <= HdrCnt; i++) {
		printf("%s%s", Data[Hdr[i]], (i == HdrCnt) ? ORS : OFS)
		delete Data[Hdr[i]]
	}
}' file file

If anyone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

With the given sample input data, the above script produces the output:
Code:
InsertTime|DocID|TargetDoc|GRank|LRank|Priority|Loc ID|Rank|Check Name
201604070523|101|aaaaa|1|1|Slow|8gkahinka.01||
201604070523|101|aaaaa|1|0|Slow|7nlafnjbaflnbja.01||
201604070523|102|aa||||8gkahinka.01|1|xyz
201604070523|102|aax||||7nlafnjbaflnbja.01|1|none
201604070750|101|xxxx|1|1|Slow|bjkkacka.01||
201604070750|101|yyyy|1|0|Slow|jiafjklas.001||

as requested.
Hello Sir, I joined this community 4 yrs ago but my responsibilities changed to different OS & tech. Now, I am working with Unix OS & shell scripting.

Thanks for your code and its working fine for some test cases.

Can you suggest any books or online sites to improve my Unix OS & its scripting knowledge.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Shell script to correct the data

Hi, I have below data in my flat file.I would like to remove the quotes and comma necessary from the data.Below is the details I would like to have in my output. Could anybody help me providing the Unix shell script for this. Input : ABC,ABC,10/15/2012,"47,936,164.567 ","1,036,997.453... (2 Replies)
Discussion started by: sonu_pal
2 Replies

2. UNIX for Dummies Questions & Answers

Shell script to read lines in a text file and filter user data Shell Programming and Scripting

sxsaaas (3 Replies)
Discussion started by: VikrantD
3 Replies

3. Shell Programming and Scripting

Need a shell script to clean data

Hi, Appreciated if anyone can throw some hint I have a file format like this: old(1): PRCNCP 1 old(2): PRSKU ... (6 Replies)
Discussion started by: netbanker
6 Replies

4. UNIX for Advanced & Expert Users

Convert column data to row data using shell script

Hi, I want to convert a 3-column data to 3-row data using shell script. Any suggestion in this regard is highly appreciated. Thanks. (4 Replies)
Discussion started by: sktkpl
4 Replies

5. UNIX for Dummies Questions & Answers

Data Importing using shell script

Hi All, I have a .csv file pipe delimter.., I am using excel data import option for importing the data from a pipe delimter file to xls...I want to make this happen using shell script. Please let me know how can I do this using shell script. Regards, Deepti (2 Replies)
Discussion started by: gaur.deepti
2 Replies

6. Shell Programming and Scripting

Help with cleansing data

I have a file with 27 fields seperated by pipe. I have a field 17 that is defined as numeric and the data coming in might contain character and other miscellaneous data like (@,!,~,#,%,^,&,*,(,)). I have to make sure that the column strictly contains numeric data and if it contains any of the... (2 Replies)
Discussion started by: dsravan
2 Replies

7. Shell Programming and Scripting

reformat data with a shell script

Can anyone help me with a shell script that can do the following: I have a data in fasta format (first line is the header, followed by a sequence of characters). >ALLLY GGCCCCTCGAGCCTCGAACCGGAACCTCCAAATCCGAGACGCTCTGCTTATGAGGACCTC GAAATATGCCGGCCAGTGAAAAAATCTTGTGGCTTTGAGGGCTTTTGGTTGGCCAGGGGC... (5 Replies)
Discussion started by: manishabh
5 Replies

8. UNIX for Dummies Questions & Answers

cleansing file in unix

Hi Experts, Our requirement is to cleanse a specific formatted file in unix. For example : File pattern is : Job name.......................................... \\\\Jobs\Amey ABC PQRS ABCD XYZ Job name.......................................... WEQ RED AAA Desired Result: (2 Replies)
Discussion started by: Amey Joshi
2 Replies

9. Shell Programming and Scripting

Getting remote data through shell script

Hi, I need to get the details (File System status & Memory status) of a remote server. I am executing a shell script in ksh and preparing the report. Pls help. Regards, armohans. (1 Reply)
Discussion started by: armohans
1 Replies

10. Shell Programming and Scripting

Pipe data to shell script

Sorry about the noobish question but... How do I capture data thats piped to my script? For instance, ls -al | myscript.sh How do I access the output from ls -al in myscript.sh? (3 Replies)
Discussion started by: tomjones07
3 Replies
Login or Register to Ask a Question