Find all lines in file such that each word on that line appears in at least n lines of the file

    #1  
1 Week Ago
uncleMonty (offline)
Registered User
Join Date: Jun 2017
Last Activity: 24 June 2017, 9:22 AM EDT
Posts: 5
Thanks: 3
Thanked 0 Times in 0 Posts

I have a file where every line includes four expressions with a caret in the middle (plus some other "words" or fields, always separated by spaces). I would like to extract from this file all those lines such that each of the four expressions containing a caret appears on at least four different lines of the whole file. Could anyone help me?

Here is a section of my file:

Code:
5^4 + 32^1 = 6^3 + 21^2    (625, 32, 216, 441)
5^4 + 34^2 = 12^3 + 53^1    (625, 1156, 1728, 53)
5^4 + 40^2 = 13^3 + 28^1    (625, 1600, 2197, 28)
5^4 + 42^1 = 7^3 + 18^2    (625, 42, 343, 324)
5^4 + 53^2 = 15^3 + 59^1    (625, 2809, 3375, 59)
5^4 + 56^1 = 8^3 + 13^2    (625, 56, 512, 169)
5^4 + 66^2 = 17^3 + 68^1    (625, 4356, 4913, 68)
5^4 + 75^1 = 6^3 + 22^2    (625, 75, 216, 484)
5^5 + 6^4 = 65^1 + 66^2    (3125, 1296, 65, 4356)
5^5 + 7^1 = 6^3 + 54^2    (3125, 7, 216, 2916)
5^5 + 7^4 = 50^1 + 74^2    (3125, 2401, 50, 5476)
5^5 + 8^3 = 37^1 + 60^2    (3125, 512, 37, 3600)
5^5 + 9^3 = 10^1 + 62^2    (3125, 729, 10, 3844)
5^5 + 10^3 = 8^4 + 29^1    (3125, 1000, 4096, 29)
5^5 + 16^2 = 6^1 + 15^3    (3125, 256, 6, 3375)
5^5 + 17^2 = 15^3 + 39^1    (3125, 289, 3375, 39)
5^5 + 18^2 = 15^3 + 74^1    (3125, 324, 3375, 74)
5^5 + 19^1 = 14^3 + 20^2    (3125, 19, 2744, 400)
5^5 + 20^1 = 6^4 + 43^2    (3125, 20, 1296, 1849)
5^5 + 27^1 = 7^3 + 53^2    (3125, 27, 343, 2809)
5^5 + 32^2 = 8^4 + 53^1    (3125, 1024, 4096, 53)
5^5 + 32^2 = 16^3 + 53^1    (3125, 1024, 4096, 53)
5^5 + 33^1 = 13^3 + 31^2    (3125, 33, 2197, 961)
5^5 + 43^2 = 17^3 + 61^1    (3125, 1849, 4913, 61)
5^5 + 47^1 = 12^3 + 38^2    (3125, 47, 1728, 1444)
5^5 + 55^1 = 11^3 + 43^2    (3125, 55, 1331, 1849)
5^5 + 59^2 = 9^4 + 45^1    (3125, 3481, 6561, 45)
5^5 + 60^1 = 7^4 + 28^2    (3125, 60, 2401, 784)
5^5 + 60^1 = 14^3 + 21^2    (3125, 60, 2744, 441)
5^6 + 8^4 = 27^3 + 38^1    (15625, 4096, 19683, 38)
5^6 + 16^1 = 10^3 + 11^4    (15625, 16, 1000, 14641)
5^6 + 20^4 = 9^1 + 56^3    (15625, 160000, 9, 175616)
5^6 + 35^2 = 7^5 + 43^1    (15625, 1225, 16807, 43)
5^6 + 45^2 = 26^3 + 74^1    (15625, 2025, 17576, 74)

So, in the output I want, the last line would be included only if each of "5^6", "45^2", "26^3" and "74^1" appears on at least four different lines of the entire file. Thanks for any help!
    #2  
1 Week Ago
Don Cragun (offline), Forum Staff
Administrator
Join Date: Jul 2012
Last Activity: 26 June 2017, 8:37 PM EDT
Location: San Jose, CA, USA
Posts: 10,396
Thanks: 527
Thanked 3,627 Times in 3,093 Posts
Is this a homework assignment? Homework and coursework questions can only be posted in the Homework & Coursework Questions forum under special homework rules.

Please review the rules, which you agreed to when you registered, if you have not already done so.

If this post is not homework, please explain the company you work for and the nature of the problem you are working on. And, tell us what operating system and shell you're using, and show us what you have tried to do to solve this problem on your own.

If you did post homework in the main forums, please review the guidelines for posting homework and repost.

Last edited by Don Cragun; 1 Week Ago at 02:16 AM.. Reason: Fix typo: s/ this is not post/this post is not/
    #3  
1 Week Ago
uncleMonty
Thanks for the friendly welcome Don. I haven't had any homework assignments for over 25 years. I'm a hobbyist working on a maths problem. I wrote a little C program to generate this data, and want to sort through it with shell tools as an intermediate step to solving the problem empirically (as a hint to myself, before I try to solve it mathematically).

I am using Bash by default, since it is the default shell on my laptop running OS 10.6, but other shells are available. What I have done so far: stared at it and realised I don't know how to do this kind of multi-line search with the handful of shell commands I have taught myself over the last 30 years (and only used very infrequently, when such problems come up).

I suppose I could also have tried to do this weeding out within my C program, but I can't see how to do it without having to hold everything in memory all at once (again, I write such programs very infrequently). So, it seems better to write it to a file then use some other tool in the shell to search that file. Hence my posting here. I'm sure there is a better way, but I break out my C and shell scripts about once every 6 months and at my age it's often easier to ask.

Is there anyone less suspicious who might be able to point me in a useful direction?
    #4  
1 Week Ago
RudiC (online), Forum Staff
Moderator
Join Date: Jul 2012
Last Activity: 27 June 2017, 4:40 PM EDT
Location: Aachen, Germany
Posts: 10,965
Thanks: 280
Thanked 3,371 Times in 3,104 Posts
No need to get ironic. This forum has a strong reputation for NOT helping students and/or candidates cheat their way through classwork or exams, so questions of that kind are appropriate and accepted.

Still: welcome to the forum.

For your problem, try

Code:
# Count each caret term (fields 1, 3, 5, 7) and report those seen more than 3 times.
awk '{CNT[$1]++; CNT[$3]++; CNT[$5]++; CNT[$7]++} END {for (c in CNT) if (CNT[c] > 3) print c, "occurs", CNT[c], "times."}' file
15^3 occurs 4 times.
5^4 occurs 8 times.
5^5 occurs 21 times.
5^6 occurs 5 times.

It doesn't check whether a term occurs twice on one line, but the chances of that happening are quite low, I believe.
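
If repeats within a line ever did matter, a variant along these lines might count each term at most once per line. This is only a sketch: the function name count_lines_per_term and the two-line sample are made up for illustration and are not from the thread. The output is piped through sort because awk does not guarantee any order for "for (c in CNT)".

```shell
# count_lines_per_term FILE [MIN]  (hypothetical helper name)
# Prints each caret term found on at least MIN distinct lines (default 4).
count_lines_per_term() {
    awk -v min="${2:-4}" '
    {
        # Fields 1, 3, 5 and 7 hold the caret terms. The seen[] check
        # makes sure a term is counted only once for a given line.
        for (i = 1; i <= 7; i += 2)
            if (!seen[NR, $i]++)
                CNT[$i]++
    }
    END {
        for (c in CNT)
            if (CNT[c] >= min)
                print c, "occurs on", CNT[c], "lines."
    }' "$1" | sort
}

# Made-up sample: 2^2 is deliberately repeated within the first line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2^2 + 2^2 = 8^1 + 0^5    (4, 4, 8, 0)
2^2 + 3^1 = 7^1 + 0^5    (4, 3, 7, 0)
EOF
count_lines_per_term "$sample" 2
rm -f "$sample"
```

With this sample, 2^2 is reported as occurring on 2 lines, whereas a plain per-field count would report 3.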
    #5  
1 Week Ago
uncleMonty
Thank you Rudi. I should learn awk, shouldn't I? That is a good way to count the occurrences. Is there a way, having counted the occurrences, to echo an entire line if and only if the 1st, 3rd, 5th, and 7th fields of that line all appear at least 4 times in the file? (For the smaller sample data I posted, it would find an answer if we searched for lines whose entries all appear at least twice, instead of four times.)

You are correct not to worry about repeats within a single line; this is ruled out by construction of the data.

P.S. Apologies if I overreacted. I think what was irritating was not that someone would want to make sure my question wasn't homework (I agree that a forum can quickly become useless to experts if it is overrun by homework questions), but instead the order to "please explain the company you work for and the nature of the problem you are working on", not only because it is intrusive, but because it suggests that only people who work for a company with a work-related problem can legitimately ask for scripting assistance here. But: your forum, your rules, ok.
    #6  
1 Week Ago
Don Cragun
If you don't mind reading the file twice, it is pretty simple with awk:

Code:
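awk -v cnt=2 '
# First pass (first copy of file): count each caret term in fields 1, 3, 5, 7.
FNR == NR {
	c[$1]++
	c[$3]++
	c[$5]++
	c[$7]++
	next
}
# Second pass: print lines whose four terms each occur at least cnt times.
c[$1] >= cnt && c[$3] >= cnt && c[$5] >= cnt && c[$7] >= cnt' file file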

With cnt set to 4, you don't get any output with your posted sample data. With cnt set to 2, this produces the output:

Code:
5^5 + 18^2 = 15^3 + 74^1    (3125, 324, 3375, 74)
5^5 + 32^2 = 8^4 + 53^1    (3125, 1024, 4096, 53)
5^5 + 60^1 = 14^3 + 21^2    (3125, 60, 2744, 441)

You haven't told us what operating system you're using... If you're using a Solaris/SunOS system, you'll need to change awk in the above to /usr/xpg4/bin/awk or nawk.
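
As a variation on the two-pass script above, the caret terms could be detected by content rather than by fixed field positions, in case the layout of a line ever changes. This is a sketch under that assumption only; the function name filter_by_caret_terms is hypothetical, and the sample is four lines taken from the posted data.

```shell
# filter_by_caret_terms FILE [MIN]  (hypothetical helper name)
# Prints each line whose caret-containing fields all occur at least MIN times.
filter_by_caret_terms() {
    awk -v cnt="${2:-4}" '
    FNR == NR {
        # Pass 1: count every whitespace-separated field containing a caret.
        for (i = 1; i <= NF; i++)
            if (index($i, "^") > 0)
                c[$i]++
        next
    }
    {
        # Pass 2: skip the line as soon as one caret term is too rare.
        for (i = 1; i <= NF; i++)
            if (index($i, "^") > 0 && c[$i] < cnt)
                next
        print
    }' "$1" "$1"
}

# Four lines from the posted data; only the first has all terms seen twice.
sample=$(mktemp)
cat > "$sample" <<'EOF'
5^5 + 18^2 = 15^3 + 74^1    (3125, 324, 3375, 74)
5^4 + 42^1 = 7^3 + 18^2    (625, 42, 343, 324)
5^5 + 17^2 = 15^3 + 39^1    (3125, 289, 3375, 39)
5^6 + 45^2 = 26^3 + 74^1    (15625, 2025, 17576, 74)
EOF
filter_by_caret_terms "$sample" 2
rm -f "$sample"
```

index() is used instead of a regular expression so that the caret never needs escaping.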

Last edited by Don Cragun; 1 Week Ago at 02:14 PM.. Reason: Fix typo: s/cnt[$5]/c[$5]/ in last line of awk script.
    #7  
1 Week Ago
RudiC
I'm certain Don Cragun will accept the apology. The forum maintainers' concern is less about the forum becoming useless (people here REALLY like to help, even with minor problems) than about keeping up the quality of IT education. If a student fills in the homework form, including institution, course, and professor, s/he will be helped to develop in the right direction and find a solution of his/her own; cf. http://www.unix.com/homework-and-coursework-questions/. By the way, a vague comment on a person's company like "chemical" or "administration" would have sufficed, or even you telling us you're a hobbyist.

Back to your problem. Printing every line that satisfies the condition means either keeping ALL lines in memory (demanding for BIG files) or running through the input file twice: once for counting, once for printing. The two-pass approach is used here:

Code:
awk 'NR == FNR {CNT[$1]++; CNT[$3]++; CNT[$5]++; CNT[$7]++; next} CNT[$1] > 1 && CNT[$3] > 1 && CNT[$5] > 1 && CNT[$7] > 1' file file
5^5 + 18^2 = 15^3 + 74^1    (3125, 324, 3375, 74)
5^5 + 32^2 = 8^4 + 53^1    (3125, 1024, 4096, 53)
5^5 + 60^1 = 14^3 + 21^2    (3125, 60, 2744, 441)

To raise the limit to at least four occurrences, change each 1 to 3 in the four comparisons in the second part.
And, yes, you're right: awk is a very powerful tool for text file analysis...
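
For completeness, the keep-everything-in-memory alternative mentioned above might look roughly like this. It is a sketch, not a recommendation for big files: every input line is stored until the end. The function name filter_in_memory is hypothetical. Because it reads standard input only once, it could also sit at the end of a pipe.

```shell
# filter_in_memory [MIN] < data  (hypothetical helper name)
# Single pass: store all lines, then print those whose four caret terms
# (fields 1, 3, 5, 7) each occur at least MIN times (default 4).
filter_in_memory() {
    awk -v cnt="${1:-4}" '
    {
        line[NR] = $0      # remember every line; memory grows with the input
        CNT[$1]++; CNT[$3]++; CNT[$5]++; CNT[$7]++
    }
    END {
        for (n = 1; n <= NR; n++) {
            split(line[n], f, " ")    # " " splits on runs of whitespace
            if (CNT[f[1]] >= cnt && CNT[f[3]] >= cnt &&
                CNT[f[5]] >= cnt && CNT[f[7]] >= cnt)
                print line[n]
        }
    }'
}

# Four lines from the posted data, fed through a pipe.
printf '%s\n' \
    '5^5 + 18^2 = 15^3 + 74^1    (3125, 324, 3375, 74)' \
    '5^4 + 42^1 = 7^3 + 18^2    (625, 42, 343, 324)' \
    '5^5 + 17^2 = 15^3 + 39^1    (3125, 289, 3375, 39)' \
    '5^6 + 45^2 = 26^3 + 74^1    (15625, 2025, 17576, 74)' |
    filter_in_memory 2
```

The trade-off is the one already described: one read of the data in exchange for holding all of it in memory.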