awk problem

05-13-2010

Registered User

248, 28

Join Date: May 2010

Last Activity: 24 May 2013, 10:11 PM EDT

Location: GuangZhou, China

Posts: 248

Thanks Given: 8

Thanked 28 Times in 27 Posts

awk problem

Hi, everyone.
I have a file called data.txt that looks like this:

Code:

list1 A
list1 B
list1 C
list2 A
list2 B
list2 F
list2 H
list3 A
list3 B
list3 D
list4 A
list4 B
list5 H
list5 F

col 1 represents a virtual folder
col 2 represents the name of a book
"list2 A" means book A is added to list2

I want to find books that are in virtual folders that contain book A. the books found should be sorted in descending order by their appearing frequency. and A is not included in the result.

Thank you!

Moderator's Comments:

Added code tags.

Last edited by radoulov; 05-13-2010 at 09:28 AM..

kevintse

View Public Profile for kevintse

Find all posts by kevintse

05-13-2010

Registered User

5,690, 630

Join Date: Jan 2007

Last Activity: 9 January 2017, 4:40 AM EST

Location: Варна, България / Milano, Italia

Posts: 5,690

Thanks Given: 184

Thanked 630 Times in 587 Posts

Re-opened as per OP request (not a homework assignment ...).

---------- Post updated at 03:08 PM ---------- Previous update was at 02:29 PM ----------

In a private conversation the OP explained that this is not a homework and that the original data is inside MySQL database.

Assuming the following data:

Code:

mysql> desc t;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| vf    | varchar(10) | YES  |     | NULL    |       |
| b     | varchar(10) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> select * from t;
+-------+------+
| vf    | b    |
+-------+------+
| list1 | A    |
| list1 | B    |
| list1 | C    |
| list2 | A    |
| list2 | B    |
| list2 | F    |
| list2 | H    |
| list3 | A    |
| list3 | B    |
| list3 | D    |
| list4 | A    |
| list4 | B    |
| list5 | H    |
| list5 | F    |
+-------+------+
14 rows in set (0.00 sec)

If I'm not missing something, this should give you the desired output:

Code:

select a.b from t a, t b
where b.b = 'A'
and b.vf = a.vf
and a.b != 'A'
group by a.b
order by count(1) desc;

Code:

mysql> select a.b from t a, t b
    -> where b.b = 'A'
    -> and b.vf = a.vf
    -> and a.b != 'A'
    -> group by a.b
    -> order by count(1) desc;
+------+
| b    |
+------+
| B    |
| C    |
| F    |
| H    |
| D    |
+------+
5 rows in set (0.00 sec)

Last edited by radoulov; 05-13-2010 at 09:45 AM..

radoulov

View Public Profile for radoulov

Find all posts by radoulov

05-13-2010

Registered User

248, 28

Join Date: May 2010

Last Activity: 24 May 2013, 10:11 PM EDT

Location: GuangZhou, China

Posts: 248

Thanks Given: 8

Thanked 28 Times in 27 Posts

OK! I think I didn't put it right in the private conversation.
I can get the result I want using SQL, but it is not efficient. you use "!=" in your select statement, that would kill the performance and it is definitely not possible to put this code into production use. I did say "we have a large table", I am sorry I didn't tell you how large it is. it is 50 million and it is growing!

I bet AWK can do much well here than SQL.
Thank you for your reply!

Last edited by kevintse; 05-13-2010 at 11:24 AM..

kevintse

View Public Profile for kevintse

Find all posts by kevintse

05-13-2010

Registered User

5,690, 630

Join Date: Jan 2007

Last Activity: 9 January 2017, 4:40 AM EST

Location: Варна, България / Milano, Italia

Posts: 5,690

Thanks Given: 184

Thanked 630 Times in 587 Posts

Why do you think that the != operator would impact the performance ... ?

It has nothing to do with it ....

If you index the table correctly, the SQL solution will be the fastest.

Just my 0.02€ ...

radoulov

View Public Profile for radoulov

Find all posts by radoulov

05-13-2010

Registered User

248, 28

Join Date: May 2010

Last Activity: 24 May 2013, 10:11 PM EDT

Location: GuangZhou, China

Posts: 248

Thanks Given: 8

Thanked 28 Times in 27 Posts

You didn't test your SQL with a table as large as 50 million rows. well you don't need to, but trust me, the time consumed for this kind of SQL to be executed is intolerable. I have tested a SQL like this "SELECT b, count(*) total FROM t GROUP BY vf ORDER BY total DESC" against a 50 million table, it took me 19 seconds.

kevintse

View Public Profile for kevintse

Find all posts by kevintse

05-13-2010

Registered User

5,690, 630

Join Date: Jan 2007

Last Activity: 9 January 2017, 4:40 AM EST

Location: Варна, България / Milano, Italia

Posts: 5,690

Thanks Given: 184

Thanked 630 Times in 587 Posts

The performance of such statements depends on many factors (the most important being the indexes).

Just for fun, this is an awk/sort/cut solution:

Code:

awk 'END { for (B in b) print b[B], B }
NR == FNR { $2 == "A" && vf[$1]; next }
$2 != "A" && $1 in vf { b[$2]++ }
BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
' infile |
  sort -nr |
    cut -d\  -f2

You could add a final alphabetical ordering, if you wish.

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.

Anyway, you've been warned ...

Last edited by radoulov; 05-13-2010 at 10:49 AM.. Reason: Corrected ...

radoulov

View Public Profile for radoulov

Find all posts by radoulov

05-14-2010

Registered User

248, 28

Join Date: May 2010

Last Activity: 24 May 2013, 10:11 PM EDT

Location: GuangZhou, China

Posts: 248

Thanks Given: 8

Thanked 28 Times in 27 Posts

Quote:

Originally Posted by radoulov

Just for fun, this is an awk/sort/cut solution:

Code:

awk 'END { for (B in b) print b[B], B }
NR == FNR { $2 == "A" && vf[$1]; next }
$2 != "A" && $1 in vf { b[$2]++ }
BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
' infile |
  sort -nr |
    cut -d\  -f2

You could add a final alphabetical ordering, if you wish.

Hi, radoulov
Today I have taken a look at an awk manual. I learned something, but I still can't quite understand the script you offered. Could you please explain it a little bit for me.

Why do you put "ARGV[ARGC++] = ARGV[ARGC-1]" in "BEGIN"?
And this line "NR == FNR { $2 == "A" && vf[$1]; next }". since there's only one input file, "NR == FNR" should always be true, right? so why it is necessary here? and what does "vf[$1]" mean(I guess it is collecting "virtual folders", but why it is written that way?)?

---------- Post updated 05-14-10 at 01:53 AM ---------- Previous update was 05-13-10 at 11:06 PM ----------

Cool! I have figured it out myself!

"ARGV[ARGC++] = ARGV[ARGC-1]" makes AWK process the input file twice.
"NR == FNR { $2 == "A" && vf[$1]; next }" is like "NR == FNR { if ($2 == "A") vf[$1]; next }" and "next" is like "continue" in the C programming language. "NR == FNR" means that only when the first time the input file is processed will "$1" be put in the "vf" array.

DONE!
Thank you again for the code.

kevintse

View Public Profile for kevintse

Find all posts by kevintse

Shell Programming and Scripting

awk problem

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk problem

Discussion started by: garethsays

2. Shell Programming and Scripting

awk problem

Discussion started by: pasaico

3. Shell Programming and Scripting

awk problem - combining awk statements

Discussion started by: SkySmart

4. UNIX for Dummies Questions & Answers

Little problem with AWK

Discussion started by: Xterra

5. Shell Programming and Scripting

awk problem

Discussion started by: abhikamune

6. Shell Programming and Scripting

Problem with awk awk: program limit exceeded: sprintf buffer size=1020

Discussion started by: fate

7. Shell Programming and Scripting

Awk problem: How to express the single quote(') by using awk print function

Discussion started by: patrick87

8. Shell Programming and Scripting

problem using awk

Discussion started by: senior_ahmed

9. Shell Programming and Scripting

Problem with AWK

Discussion started by: krishna_gnv

10. UNIX for Dummies Questions & Answers

AWK Problem

Discussion started by: gseyforth