Complex awk problem


 
# 1  
Old 02-16-2013
Complex awk problem

hello,

I have a complex awk problem...

I have two tables, one with values (0 to 1) and their corresponding p-values, like this:

Table 1:
______________________________
value p-value
... ...
0.254 0.003
0.245 0.005
0.233 0.006
... ...
______________________________

and a second with millions of values (0 to 1), like this:
Table 2:
______________________________
...
0.252...
0.234...
0.256...
...
______________________________

now I have to map the second list onto the first table so that I get, for each value, the corresponding p-value (i.e. the p-value of the next LOWER value in Table 1).

expected output:
______________________________
... ...
0.252... 0.005
0.234... 0.006
0.256... 0.003
... ...
______________________________


one possibility would be to create an indexed array with 1000 entries in this way:
...
a[0.233]=0.006
a[0.234]=0.006
...
a[0.245]=0.005
...

then take a substring of each value of the second list, like x.xxx, and use this as the index into the array.
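Roughly something like this (untested sketch; table1 and table2 are placeholder names, and table1 would have to be sorted ascending by value and cover the whole range):

Code:
# untested sketch of the expansion idea above
awk 'NR==FNR {                                  # 1st file: one array entry per x.xxx step
         i = int($1 * 1000 + 1e-9)              # tiny epsilon: 0.057*1000 is 56.999... in floating point
         while (++last < i)
             a[sprintf("%.3f", last / 1000)] = pv   # carry the previous p-value into the gap
         pv = a[sprintf("%.3f", i / 1000)] = $2     # this row gets its own p-value
         next
     }
     {                                          # 2nd file: truncate the value to x.xxx and look it up
         key = substr($1, 1, 5)
         val = (key in a) ? a[key] : pv         # above the last table1 entry: keep its p-value
         print $1, val
     }' table1 table2
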

is there an easier way?

thank you,

dietmar

PS: I have to use AWK, because I already read 7 files of several gigabytes each and combine them into one table (in minutes), and I have to convert one column on the fly to the corresponding p-values. I think every other programming language would be slower (except perhaps C++, which I can't program).
# 2  
Old 02-16-2013
I don't see any easier way than what you suggested. Below is some (untested) code:

Code:
awk 'NR==FNR {p[$1]=$2; next} {v = substr($1,1,5); print $1, (v in p) ? p[v] : "Some warning"}' table1 table2

# 3  
Old 02-16-2013
I guessed that using associative arrays would be much slower than using integer-indexed arrays, so this one might be a bit faster than the other approach. (On the other hand, man awk says:
Quote:
. . . array[expr]. Expr is internally converted to string type, so, for example, A[1] and A["1"] are the same . . .
so my thoughts were probably wrong.) However, try
Code:
awk     '               {tmp = int ($1 * 1000)}                                # integer index 0..999 for field 1
         FNR==NR        {while (++s < tmp)  Ar[s] = p; p = Ar[tmp] = $2; next} # file1: fill the gap with the previous p-value, then store this one
         FNR==1 && NR>1 {while (++s < 1000) Ar[s] = p}                         # first line of file2: fill the rest of the array
                        {print $1, Ar[tmp]}                                    # file2: one single lookup per line
#        END            {for (i in Ar) print i, Ar[i]}                         # debug: dump the whole array
        ' file1 file2
0.25234 0.005
0.23487 0.006
0.256001 0.003

Make sure file1 is sorted in ascending order!
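If it is not, something like this should do the trick (assuming GNU sort is available; -g sorts by general numeric value and also copes with 1e-3 style notation):

Code:
sort -g file1 > file1.sorted     # ascending numeric sort on the leading value column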

# 4  
Old 02-16-2013
thank you both, but now I run into some other problems:

first, one question, user8:
Code:
'NR==FNR{p[$1]=$2;next}

reads the whole first file and only after that proceeds with the second one?
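(A quick throwaway test with two made-up files f1 and f2 suggests so:)

Code:
printf '1 a\n2 b\n' > f1 ; printf 'x\ny\n' > f2
awk 'NR==FNR {print "file1:", $0; next} {print "file2:", $0}' f1 f2
# file1: 1 a
# file1: 2 b
# file2: x
# file2: y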

my problem: I saved the p-value file with R and got something like 1e-3.
okay, I managed this by multiplying all values by 10000 (I now use four instead of three digits), so I have indices from 0 through 5 to 10000, which correspond to values from 0.0000 through 0.0005 to 1.0000 (also respecting RudiC's hint).

How do I get the integer from a value in awk?

Code:
 
int(value*10000)

I hope so - I will try...
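A quick test with made-up numbers suggests that int() truncates as expected and that awk reads the 1e-3 notation written by R numerically:

Code:
echo "1e-3
5e-4
0.2345678" | awk '{print $1, int($1 * 10000)}'
# 1e-3 10
# 5e-4 5
# 0.2345678 2345

One caveat: because of floating point, 0.0003 * 10000 evaluates to 2.9999999999999996, so int() would give 2 instead of 3 there; adding a tiny epsilon, e.g. int($1 * 10000 + 1e-9), guards against that.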
# 5  
Old 02-16-2013
Not sure I understand your new problems. Why don't you add a "0" to both occurrences of "1000", making them "10000", in the above script and give it a shot?
# 6  
Old 02-16-2013
thank you, RudiC

I do not understand your approach completely, but the while loop is run for every new value from file2 (?), and as I have hundreds of millions of values, this approach has to be slower than the array approach, where the array is loaded once and then only looked up for every value...

the script now looks like this and works perfectly (the additional p-value translation increases the runtime by only 1 minute, from 15 to 16 minutes):

Code:
dir='/home/ws/R_workspace/OVCAD'
fn='miR.MIC'
pv='pvalues_220.txt'
cd $dir
fname=${fn%.*}
echo $fname
echo -e "N1\tN2\tMIC\tMAS\tMEV\tMCN\tMICR2\tpearson\tnon-linearity\tp-value" > ${fname}.mine
 
gawk 'BEGIN { FS = "\t"; OFS="\t" } ; NR==FNR{p[$1]=$2;next} { NR==1;{ for (i = 1; i <= NF; i++) name[i]=$i } }; \
 { split(FILENAME,fname,"."); fn=fname[1]; getline < (fn".MAS"); getline < (fn".MEV"); getline < (fn".MCN"); getline < (fn".MICR2"); getline < (fn".cor"); getline < (fn".nl") } \
 { for(k=1; k <= NF; k++) { getline; split($0, MIC, "\t") ; \
 getline mas < (fn".MAS"); split(mas, MAS, "\t") ; \
 getline mev < (fn".MEV"); split(mev, MEV, "\t") ; \
 getline mcn < (fn".MCN"); split(mcn, MCN, "\t") ; \
 getline micr2 < (fn".MICR2"); split(micr2, MICR2, "\t") ; \
 getline cor < (fn".cor"); split(cor, COR, "\t") ; \
 getline nl < (fn".nl"); split(nl, NL, "\t") ; \
 for(j=k+1; j <= NF; j++)  \
 print name[k],name[j],MIC[j],MAS[j],MEV[j],MCN[j],MICR2[j],COR[j],NL[j],p[int(MIC[j]*10000)] } }' $pv ${fname}.MIC >> ${fname}.mine

# 7  
Old 02-16-2013
You are right, you seem not to understand the script. Did you try it at all?
file1 is read exactly once, populating the array from 0 to (the highest value in file1) * 1000. On the first line of file2, before any action on file2, the rest of the array (previous + 1 up to 999) is populated. Then, for every line (of the millions) in file2, field 1 is multiplied by 1000, truncated, and the match in the array is retrieved and printed. One single action per one single line.

I'd appreciate it if you could attach (in advanced edit) a representative sample of each of your files, not necessarily millions of values, and anonymized if need be, so I can test and time the scripts above myself. Thank you.