Hello community, I am getting log files from a system, and I need to clean the data and store it as txt files for reporting purposes. Since these files are generated on a Unix box, we have to write a shell script to handle the data cleansing.
This was the raw data, and I need to write a shell script to cleanse it:
1. Rows starting with # are comments and should be ignored, except for the #columns rows.
2. The #columns rows give the column names and the #rows rows give the actual data.
3. Unwanted data is highlighted in red and useful data is highlighted in black.
4. The header for the output file is always all the #headers in the data, along with InsertTime and DocID.
5. Assign the values as per the header, and add the InsertTime and DocID values too.
6. The data delimiter in the output file is |.
Please find the desired output:
Code:
InsertTime|DocID|TargetDoc|GRank|LRank|Priority|Loc ID|Rank|Check Name
201604070523|101|aaaaa|1|1|Slow|8gkahinka.01||
201604070523|101|aaaaa|1|0|Slow|7nlafnjbaflnbja.01||
201604070523|102|aa||||8gkahinka.01|1|xyz
201604070523|102|aax||||7nlafnjbaflnbja.01|1|none
201604070750|101|xxxx|1|1|Slow|bjkkacka.01||
201604070750|101|yyyy|1|0|Slow|jiafjklas.001||
Last edited by RudiC; 04-07-2016 at 11:11 AM..
Reason: Added code tags
We're running through the file twice, and I had to replace the spaces in the column headers with an underscore char to avoid additional field splits. Anybody out there want to try to pimp it up?
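For anyone curious, here is a minimal sketch of the underscore trick described above (not the actual script from the earlier post, which is not quoted here; the sample "#columns" line is made up to match the thread's data):
Code:
```shell
# Join space-containing quoted column names with underscores so each
# one survives awk's default field splitting as a single field.
echo "#columns 'Loc ID' 'Check Name' 'Rank'" |
awk -v SQ="'" '{
	# Split on the single quote; even-numbered pieces are the quoted names.
	n = split($0, a, SQ)
	for (i = 2; i <= n; i += 2)
		gsub(/ /, "_", a[i])
	# Reassemble the line with the quotes put back.
	out = a[1]
	for (i = 2; i <= n; i++)
		out = out SQ a[i]
	print out
}'
# prints: #columns 'Loc_ID' 'Check_Name' 'Rank'
```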
Hi pdathu,
Someone who has been posting questions to this forum for well over 4 years is not "a newbie who has never had a chance to work on the advanced commands".
You may have chosen to use this site as your unpaid programming staff instead of using it as a way to learn how to use basic UNIX, Linux, and BSD system utilities; but that doesn't qualify you to be someone who has never had a chance to work on the advanced commands.
In future threads, we will be very disappointed if you still claim to be a newbie and refuse to make any attempt to show us that you have invested any effort to solve problems on your own.
Hi RudiC,
I won't claim that this is any prettier than your suggestions (and it does read the input file twice), but it is a slightly different approach and doesn't need to convert spaces in field names to underscores. It does, however, depend on the fact that there are no spaces in any of the fixed header field names or data (i.e., those that are not defined by a "#columns" comment line), since it uses spaces and colons as field separators on those lines.
And, since it defines SQ outside of the awk script using "'" instead of using "\047", it will also work on mainframe systems using EBCDIC as well as on systems using codesets that are supersets of ASCII.
Code:
awk -v SQ="'" '
function AddHeader(name) {
	# Add a given header name (name) to the list of headers to be used by
	# this run of this script if this header name has not been seen before
	# and note the output field number to be used for this header name.
	if(name in OutField) return
	Hdr[++HdrCnt] = name
	OutField[name]
}
BEGIN {	# Add fixed header names. Note that fixed header names must not
	# contain any spaces and the 1st fixed header name must appear as the
	# first field on any input line supplying fixed header data!
	AddHeader("InsertTime")
	AddHeader("DocID")
	# Set number of fixed header names.
	FixedHdrCnt = HdrCnt
	# Set output field separator.
	OFS = "|"
}
FNR == NR && $1 == "#columns" {
	# On the 1st pass through the input file, gather the header names to be
	# processed on this run.
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		AddHeader(a[i])
	next
}
FNR == 1 && NR > 1 {
	# We are starting the 2nd pass through the input file. Print the
	# header.
	for(i = 1; i <= HdrCnt; i++)
		printf("%s%s", Hdr[i], (i == HdrCnt) ? ORS : OFS)
}
NR > FNR && $0 ~ ("^" Hdr[1] ":") {
	# Gather fixed header data for following row lines.
	n = split($0, a, /[: ]/)
	for(i = 1; i <= FixedHdrCnt; i++)
		Data[a[i * 2 - 1]] = a[i * 2]
	# Clear columnsFound so we do not try to process "#headers" data lines.
	columnsFound = 0
}
$1 == "#columns" {
	# Gather input field header names to be used for this section of the
	# input file.
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		FieldName[i] = a[i]
	# Note that we have found headers for this section. (There really
	# should be a separate check for a line starting with "#rows", but the
	# directions given state that that line has to be ignored.)
	columnsFound = 1
}
NR > FNR && $0 ~ ("^" SQ) && columnsFound {
	# Gather and print data from this input line:
	# Gather data...
	n = split($0, a, SQ)
	for(i = 2; i <= n; i += 2)
		Data[FieldName[i]] = a[i]
	# Print fixed data...
	for(i = 1; i <= FixedHdrCnt; i++)
		printf("%s%s", Data[Hdr[i]], OFS)
	# Print and delete line specific data...
	for(i = FixedHdrCnt + 1; i <= HdrCnt; i++) {
		printf("%s%s", Data[Hdr[i]], (i == HdrCnt) ? ORS : OFS)
		delete Data[Hdr[i]]
	}
}' file file
If anyone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
With the given sample input data, the above script produces the output:
Code:
InsertTime|DocID|TargetDoc|GRank|LRank|Priority|Loc ID|Rank|Check Name
201604070523|101|aaaaa|1|1|Slow|8gkahinka.01||
201604070523|101|aaaaa|1|0|Slow|7nlafnjbaflnbja.01||
201604070523|102|aa||||8gkahinka.01|1|xyz
201604070523|102|aax||||7nlafnjbaflnbja.01|1|none
201604070750|101|xxxx|1|1|Slow|bjkkacka.01||
201604070750|101|yyyy|1|0|Slow|jiafjklas.001||
as requested.
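As an aside, here is a minimal illustration of the SQ="'" technique: the quote character is handed to awk from the shell, so the awk program never hard-codes an ASCII-specific escape like "\047" (the sample string here is made up):
Code:
```shell
# SQ holds a literal single quote regardless of the codeset in use.
awk -v SQ="'" 'BEGIN {
	# Split a string on the single quote passed in from the shell.
	n = split("a" SQ "b" SQ "c", f, SQ)
	print n, f[2]
}'
# prints: 3 b
```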