Shell script to apply functions to multiple columns dynamically


# 8  
Quote:
Originally Posted by RudiC
Please be aware that the md5sum of '10,abc' will NEVER be 73aca49763216fb96bbc2acef7b60afb as it is case sensitive.
Looks like you want a comma included. Try
Code:
awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
                 print $0, "HASHED COLUMNS", "HASHVALUE"
                 next
                }
                {TMP = ""
                 for (i=1; i<=CNT; i++) TMP = TMP "," $(COL[i])
                 ("echo -n " substr (TMP, 2) " | md5sum") | getline MD5
                 sub (/ *-/, "", MD5)
                 print $0, MCOL, MD5
                }
' OFS="|" MCOL="ID,NAME" file
ID|NAME|AGE|GENDER|HASHED COLUMNS|HASHVALUE
10|ABC|30|M|ID,NAME|73aca49763216fb96bbc2acef7b60afb
20|DEF|20|F|ID,NAME|9d6555fe65eb60b2f7d9174b56f667f5

Yes, thank you, you are right: I should be more careful with case, but the output is now as expected.
Also, can the last line, MCOL="ID,NAME", be parameterized? For example:

Code:
mcols=$1
filename=$2
awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
                 print $0, "HASHED COLUMNS", "HASHVALUE"
                 next
                }
                {TMP = ""
                 for (i=1; i<=CNT; i++) TMP = TMP "," $(COL[i])
                 ("echo -n " substr (TMP, 2) " | md5sum") | getline MD5
                 sub (/ *-/, "", MD5)
                 print $0, MCOL, MD5
                }
' OFS="|" MCOL="$(mcols)" file

and call the script like sh myscript.sh "ID,NAME"? (I think it is a silly question, but I am asking nevertheless.)

Also, I am trying to understand your code line by line. Could you point me to how to debug this code? I am not asking you to explain it line by line, but can you please point me in the right direction so I can better understand the code?

Thanks.
# 9  
Quote:
Originally Posted by mkathi
Yes, thank you, you are right: I should be more careful with case
Yes.

Quote:
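Also, can the last line, MCOL="ID,NAME", be parameterized? . . . and call the script like sh myscript.sh "ID,NAME"? (I think it is a silly question, but I am asking nevertheless.)
How about just trying MCOL="$1"? The syntax you used, $(mcols), is called "command substitution": the shell tries to run a command named mcols and inserts its output. To expand a variable, write "$mcols" (and likewise "$filename" instead of the literal file).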
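A minimal sketch of the parameterized script, wrapped in a function so it can be tried in place. The function name hash_cols is hypothetical; in a script file the body would use "$1" and "$2" directly. Two things differ from the posted code and are assumptions of this sketch: the command is closed after each line, and printf %s stands in for echo -n because not every sh handles echo -n the same way. It assumes GNU md5sum is installed and that the hashed values contain no spaces or shell metacharacters.

```shell
#!/bin/sh
# hash_cols (hypothetical name): $1 = comma-separated column names, $2 = input file
hash_cols() {
    mcols=$1        # e.g. "ID,NAME"
    filename=$2     # pipe-delimited file whose first line is the header
    awk -F\| '
    NR == 1 {   # header line: remember the positions of the requested columns
        for (i = 1; i <= NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
        print $0, "HASHED COLUMNS", "HASHVALUE"
        next
    }
    {           # data lines: join the selected values with commas and hash them
        TMP = ""
        for (i = 1; i <= CNT; i++) TMP = TMP "," $(COL[i])
        CMD = "printf %s " substr(TMP, 2) " | md5sum"
        CMD | getline MD5
        close(CMD)              # release the pipe before the next line
        sub(/ *-/, "", MD5)     # strip the trailing "  -" that md5sum prints for stdin
        print $0, MCOL, MD5
    }
    ' OFS="|" MCOL="$mcols" "$filename"
}
```

Called as hash_cols "ID,NAME" file, this should print the same layout as the output quoted above: the original line, then the column list, then the hash.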
Quote:
Also, I am trying to understand your code line by line. Could you point me to how to debug this code? I am not asking you to explain it line by line, but can you please point me in the right direction so I can better understand the code?
Thanks.
When operating on the first line (the header), I collect the target column numbers in an array. For all remaining lines, I assemble these column values, separated by commas, into a TMP variable, execute echo ... | md5sum on it, getline the result into a variable, and, after some massaging, print the desired output line.
Aside, if your file has more lines than awk allows for open files, you'd need to close the system calls after each use...
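One way to follow those steps is to print the intermediate values instead of running the hash. The sketch below is only a debugging aid, not part of the posted solution; debug_cols is a hypothetical name, and the "would run" text simply echoes the command string the script builds.

```shell
#!/bin/sh
# debug_cols (hypothetical name): $1 = column names, $2 = input file
debug_cols() {
    awk -F\| '
    NR == 1 {   # show which header fields were selected and where they sit
        for (i = 1; i <= NF; i++)
            if ("," MCOL "," ~ "," $i ",") {
                COL[++CNT] = i
                printf "header: COL[%d] = %d (%s)\n", CNT, i, $i
            }
        next
    }
    {           # show the command that would be run for each data line
        TMP = ""
        for (i = 1; i <= CNT; i++) TMP = TMP "," $(COL[i])
        printf "line %d: would run: echo -n %s | md5sum\n", NR, substr(TMP, 2)
    }
    ' MCOL="$1" "$2"
}
```

Running debug_cols "ID,NAME" file on the sample data lists COL[1] and COL[2] for the header, then one "would run" line per record.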
# 10  
Quote:
Originally Posted by RudiC
Yes.

How about just trying and using MCOL="$1"? The syntax you used is called "command substitution".
When operating on the first line (the header), I collect the target column numbers in an array. For all remaining lines, I assemble these column values, separated by commas, into a TMP variable, execute echo ... | md5sum on it, getline the result into a variable, and, after some massaging, print the desired output line.
Aside, if your file has more lines than awk allows for open files, you'd need to close the system calls after each use...
Yes, I will try that when I get to work tomorrow; the online Unix terminals are giving me a hard time.

Thanks for explaining the code. After a lot of googling I am at the stage where I can understand 80% of the code, except why the "," is needed in the if statement "," MCOL "," ~ "," $i ",". But I am learning awk and will figure it out soon.

I don't quite understand what
Quote:
you'd need to close the system calls after each use...
actually means. Is it a syntax, or is it like opening and closing cursors in PL/SQL? (Sorry, bad example, but SQL is the only language I am comfortable with for now.)

Thanks.
# 11  
Quote:
Originally Posted by mkathi
...

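why the "," in "," MCOL "," ~ "," $i "," in this if statement
Print out both sides and compare them, i.e. apply the matching operator ~ . The commas delimit each column name so that it matches only a complete entry in MCOL, not part of a longer name.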
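A small sketch of that comparison, with and without the surrounding commas. The extra header names AGE and AM and the function name match_demo are made up for illustration; only ID and NAME come from the thread.

```shell
#!/bin/sh
# match_demo (hypothetical name): compare a bare match with the comma-padded one
match_demo() {
    awk 'BEGIN {
        MCOL = "ID,NAME"
        n = split("ID NAME AGE AM", names, " ")
        for (k = 1; k <= n; k++) {
            # bare: the header name is used as a pattern anywhere inside MCOL
            bare   = (MCOL ~ names[k]) ? "yes" : "no"
            # padded: the name must appear as a whole comma-delimited entry
            padded = ("," MCOL "," ~ "," names[k] ",") ? "yes" : "no"
            print names[k], "bare=" bare, "padded=" padded
        }
    }'
}
```

A header named AM matches the bare string because NAME contains "AM", but it does not match once both sides are wrapped in commas.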


Quote:
actually means. Is it a syntax, or is it like opening and closing cursors in PL/SQL? (Sorry, bad example, but SQL is the only language I am comfortable with for now.)

...
awk allows for a not too small but limited number of open files, of which each echo ... | md5sum consumes one. So, once you reach that limit, action needs to be taken.
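The "action" is awk's close() function, called with exactly the same command string that was used to open the pipe. A minimal sketch of the pattern, with a made-up command and the hypothetical name close_demo:

```shell
#!/bin/sh
# close_demo (hypothetical name): open, read and close one command per iteration
close_demo() {
    awk 'BEGIN {
        for (n = 1; n <= 3; n++) {
            cmd = "echo line" n      # a different command string each time
            cmd | getline result     # opens a pipe identified by that exact string
            close(cmd)               # releases it before the next one is opened
            print result
        }
    }'
}
```

Without the close(cmd) line, each distinct command string would stay open until awk exits.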
# 12  
Quote:
Originally Posted by RudiC
Print out the two and compare / apply the matching operator ~ .

Makes sense, thanks.

awk allows for a not too small but limited number of open files, of which each echo ... | md5sum consumes one. So, once you reach that limit, action needs to be taken.
My files are huge, ranging from 10000 records to 30000000, but I am going to process one file at a time, I mean open one file at a time. Will that still be an issue?
# 13  
That script consumes an "open file" for every line in your input file. 10000 may, but 3000000 definitely will, be too many. Try



Code:
awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
                 print $0, "HASHED COLUMNS", "HASHVALUE"
                 next
                }
                {TMP = "echo -n " $(COL[1])
                 for (i=2; i<=CNT; i++) TMP = TMP "," $(COL[i])
                 TMP = TMP " | md5sum"
                 TMP | getline MD5
                 close (TMP)
                 sub (/ *-/, "", MD5)
                 print $0, MCOL, MD5
                }
 ' OFS="|" MCOL="ID,NAME,AGE" file


or


Code:
awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
                 print $0, "HASHED COLUMNS", "HASHVALUE"
                 next
                }
                {TMP = "echo -n "
                 for (i=1; i<=CNT; i++) TMP = TMP $(COL[i]) ","
                 sub (/,$/, " | md5sum", TMP)
                 TMP | getline MD5
                 close (TMP)
                 sub (/ *-/, "", MD5)
                 print $0, MCOL, MD5
                }
' OFS="|" MCOL="ID,NAME,AGE" file
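Both variants pass the column values to the shell unquoted, so a value containing spaces, quotes or other shell metacharacters would change or break the command. The following is a sketch of one possible guard, not something from the thread: the helper shquote and the wrapper safe_hash_cols are hypothetical names, printf %s replaces echo -n as an assumption for portability, and GNU md5sum is assumed.

```shell
#!/bin/sh
# safe_hash_cols (hypothetical name): $1 = column names, $2 = input file
safe_hash_cols() {
    awk -F\| '
    # Wrap s in single quotes for sh; \047 is the single quote character.
    # Each embedded quote becomes: close quote, backslash-quote, reopen quote.
    function shquote(s,    n, j, part, out) {
        n = split(s, part, "\047")
        out = part[1]
        for (j = 2; j <= n; j++) out = out "\047\\\047\047" part[j]
        return "\047" out "\047"
    }
    NR == 1 {
        for (i = 1; i <= NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
        print $0, "HASHED COLUMNS", "HASHVALUE"
        next
    }
    {
        TMP = ""
        for (i = 1; i <= CNT; i++) TMP = TMP "," $(COL[i])
        CMD = "printf %s " shquote(substr(TMP, 2)) " | md5sum"
        CMD | getline MD5
        close(CMD)
        sub(/ *-/, "", MD5)
        print $0, MCOL, MD5
    }
    ' OFS="|" MCOL="$1" "$2"
}
```

It still starts one shell and one md5sum per input line, so on files with millions of records it will be slow either way.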

# 14  
Quote:
Originally Posted by RudiC
That script consumes an "open file" for every line in your input file. 10000 may, but 3000000 definitely will, be too many. Try ...

Thanks, I will have a chance to run this tomorrow against a large dataset, and I will get back with the results.
Note: this script taught me a lot about indexes; thanks for that.