awk problem


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk problem
# 1  
Old 05-13-2010
awk problem

Hi, everyone.
I have a file called data.txt that looks like this:
Code:
list1 A
list1 B
list1 C
list2 A
list2 B
list2 F
list2 H
list3 A
list3 B
list3 D
list4 A
list4 B
list5 H
list5 F

col 1 represents a virtual folder
col 2 represents the name of a book
"list2 A" means book A is added to list2

I want to find books that are in virtual folders that contain book A. the books found should be sorted in descending order by their appearing frequency. and A is not included in the result.


Thank you!

Moderator's Comments:
Mod Comment Added code tags.

Last edited by radoulov; 05-13-2010 at 09:28 AM..
# 2  
Old 05-13-2010
Re-opened as per OP request (not a homework assignment ...).

---------- Post updated at 03:08 PM ---------- Previous update was at 02:29 PM ----------

In a private conversation the OP explained that this is not a homework and that the original data is inside MySQL database.

Assuming the following data:

Code:
mysql> desc t;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| vf    | varchar(10) | YES  |     | NULL    |       |
| b     | varchar(10) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> select * from t;
+-------+------+
| vf    | b    |
+-------+------+
| list1 | A    |
| list1 | B    |
| list1 | C    |
| list2 | A    |
| list2 | B    |
| list2 | F    |
| list2 | H    |
| list3 | A    |
| list3 | B    |
| list3 | D    |
| list4 | A    |
| list4 | B    |
| list5 | H    |
| list5 | F    |
+-------+------+
14 rows in set (0.00 sec)

If I'm not missing something, this should give you the desired output:
Code:
select a.b from t a, t b
where b.b = 'A'
and b.vf = a.vf
and a.b != 'A'
group by a.b
order by count(1) desc;

Code:
mysql> select a.b from t a, t b
    -> where b.b = 'A'
    -> and b.vf = a.vf
    -> and a.b != 'A'
    -> group by a.b
    -> order by count(1) desc;
+------+
| b    |
+------+
| B    |
| C    |
| F    |
| H    |
| D    |
+------+
5 rows in set (0.00 sec)


Last edited by radoulov; 05-13-2010 at 09:45 AM..
# 3  
Old 05-13-2010
OK! I think I didn't put it right in the private conversation.
I can get the result I want using SQL, but it is not efficient. you use "!=" in your select statement, that would kill the performance and it is definitely not possible to put this code into production use. I did say "we have a large table", I am sorry I didn't tell you how large it is. it is 50 million and it is growing!

I bet AWK can do much well here than SQL.
Thank you for your reply!

Last edited by kevintse; 05-13-2010 at 11:24 AM..
# 4  
Old 05-13-2010
Why do you think that the != operator would impact the performance ... ?

It has nothing to do with it ....

If you index the table correctly, the SQL solution will be the fastest.

Just my 0.02€ ...
# 5  
Old 05-13-2010
You didn't test your SQL with a table as large as 50 million rows. well you don't need to, but trust me, the time consumed for this kind of SQL to be executed is intolerable. I have tested a SQL like this "SELECT b, count(*) total FROM t GROUP BY vf ORDER BY total DESC" against a 50 million table, it took me 19 seconds.
# 6  
Old 05-13-2010
The performance of such statements depends on many factors (the most important being the indexes).

Just for fun, this is an awk/sort/cut solution:

Code:
awk 'END { for (B in b) print b[B], B }
NR == FNR { $2 == "A" && vf[$1]; next }
$2 != "A" && $1 in vf { b[$2]++ }
BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
' infile |
  sort -nr |
    cut -d\  -f2

You could add a final alphabetical ordering, if you wish.

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.

Anyway, you've been warned ...

Last edited by radoulov; 05-13-2010 at 10:49 AM.. Reason: Corrected ...
# 7  
Old 05-14-2010
Quote:
Originally Posted by radoulov
Just for fun, this is an awk/sort/cut solution:

Code:
awk 'END { for (B in b) print b[B], B }
NR == FNR { $2 == "A" && vf[$1]; next }
$2 != "A" && $1 in vf { b[$2]++ }
BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
' infile |
  sort -nr |
    cut -d\  -f2

You could add a final alphabetical ordering, if you wish.
Hi, radoulov
Today I have taken a look at an awk manual. I learned something, but I still can't quite understand the script you offered. Could you please explain it a little bit for me.

Why do you put "ARGV[ARGC++] = ARGV[ARGC-1]" in "BEGIN"?
And this line "NR == FNR { $2 == "A" && vf[$1]; next }". since there's only one input file, "NR == FNR" should always be true, right? so why it is necessary here? and what does "vf[$1]" mean(I guess it is collecting "virtual folders", but why it is written that way?)?

---------- Post updated 05-14-10 at 01:53 AM ---------- Previous update was 05-13-10 at 11:06 PM ----------

Cool! I have figured it out myself!

"ARGV[ARGC++] = ARGV[ARGC-1]" makes AWK process the input file twice.
"NR == FNR { $2 == "A" && vf[$1]; next }" is like "NR == FNR { if ($2 == "A") vf[$1]; next }" and "next" is like "continue" in the C programming language. "NR == FNR" means that only when the first time the input file is processed will "$1" be put in the "vf" array.

DONE!
Thank you again for the code.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk problem

Hi I have two columns and I would like to create a third column based on how many lines away from a value of 1 in column 2, for example I have 1,0 2,0 3,0 4,0 5,0 6,1 7,0 8,0 9,0 10,0 11,1 And I want an output (6 Replies)
Discussion started by: garethsays
6 Replies

2. Shell Programming and Scripting

awk problem

i have an email list in file.txt with comma separated line1 - FIELD1,pippo@gmail.com,darth@gmail.com line2 - FIELD2,pippo@gmail.com,darth@gmail.com,sampei@gmail.com output=(awk -F ',' -v var="$awkvar" '$1==var {print $2,$3,$4}' spreadsheet.txt)but awk delete some letters at the... (8 Replies)
Discussion started by: pasaico
8 Replies

3. Shell Programming and Scripting

awk problem - combining awk statements

i have a datafile that has several lines that look like this: 2,dataflow,Sun Mar 17 16:50:01 2013,1363539001,2990,excelsheet,660,mortar,660,4 using the following command: awk -F, '{$3=strftime("%a %b %d %T %Y,%s",$3)}1' OFS=, $DATAFILE | egrep -v "\-OLDISSUES," | ${AWK} "/${MONTH} ${DAY}... (7 Replies)
Discussion started by: SkySmart
7 Replies

4. UNIX for Dummies Questions & Answers

Little problem with AWK

I thought I had solved this problem but after testing the script I came to realize that it is not doing what I need. So, here it goes again. This is the code: awk '/\>/{F=$2; N=$3; split(FILENAME, A, "."); getline; x = ">"}{print ">" A"-" x++" "F" " N"\n" $0}' This is the input file: ... (5 Replies)
Discussion started by: Xterra
5 Replies

5. Shell Programming and Scripting

awk problem

Find the number of files with sizes > 100KB in /, /bin, /usr, /usr/bin and /usr/sbin directories and output them in a two column format with the name of the directory and the number of files. i tried with awk $>ls -lh | awk '/^-/ && $5 >= 100k {print $8 $5}' but it is not working pls tell... (3 Replies)
Discussion started by: abhikamune
3 Replies

6. Shell Programming and Scripting

Problem with awk awk: program limit exceeded: sprintf buffer size=1020

Hi I have many problems with a script. I have a script that formats a text file but always prints the same error when i try to execute it The code is that: { if (NF==17){ print $0 }else{ fields=NF; all=$0; while... (2 Replies)
Discussion started by: fate
2 Replies

7. Shell Programming and Scripting

Awk problem: How to express the single quote(') by using awk print function

Actually I got a list of file end with *.txt I want to use the same command apply to all the *.txt Thus I try to find out the fastest way to write those same command in a script and then want to let them run automatics. For example: I got the file below: file1.txt file2.txt file3.txt... (4 Replies)
Discussion started by: patrick87
4 Replies

8. Shell Programming and Scripting

problem using awk

Hi there every body I'm new to shell scripting and there is a problem facing me,, please look at the following piece of code: awk ' BEGIN{ FS="<assertion id=\1"; RS="<assertion id=\"2"}/<assertion id=\"1/{print FS$2 > "/home/ds2/test/output.txt"} ' filename all I wanna do is to... (6 Replies)
Discussion started by: senior_ahmed
6 Replies

9. Shell Programming and Scripting

Problem with AWK

Hi All, How can i store a value of the unix command executed in AWK with system command. devise=`cut -c1-3 dvgp.txt` I wrote this command in awk as awk'{ code= sprintf("devise=`cut -c1-3 dvgp.txt`"); system(code); }' Is this correct. can you please suggest me how the code can be... (1 Reply)
Discussion started by: krishna_gnv
1 Replies

10. UNIX for Dummies Questions & Answers

AWK Problem

Hi, I posted something here about this yesterday but I can't seem to find it. I needed help writting a script which would append a file with new lines after every so many charachters. Example: (my original flat file) L60 LETTER OF CREDIT 60 DAYS W00 ON RECEIPT WIRE TRANSFER W30 NET... (12 Replies)
Discussion started by: gseyforth
12 Replies
Login or Register to Ask a Question