Comparing columns in two separate files


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Comparing columns in two separate files
# 1  
Old 05-02-2012
Comparing columns in two separate files

Hey all, I have a file structure that looks something like this:

file1
Code:
306708278
88954535
234167885

file2
Code:
2012-03-27T12:32:56+00:00	137		Orchotorena	184616310003601409	306708278	es	40.4777947	Majadahonda	-3.6416896333333333	0	false	atlante83	"<a href=""http://tapbots.com/tweetbot"" rel=""nofollow"">Tweetbot for iOS</a>"	@Orchotorena un negro en edición es una imágen sin luminancia o negra. Se puede usar intencionalmente para separar planos	false	184619011038920705	88954535
2012-03-27T12:32:56+00:00	137		elasadumz	184617507607429120	238467064	id	-6.168961	Surakarta, Indonesia	106.83324766666668	0	false	Eka_put3	web	@elasadumz loh aku og gx d sapa???	false	184619010195849216	386451538
2012-03-27T12:32:57+00:00	137					en	-3.109555	no ♥ da minha gatinha.	-60.009778	0	false	jessicavaanessa	"<a href=""http://foursquare.com"" rel=""nofollow"">foursquare</a>"	I just ousted Guilherme C. as the mayor of Bebedouro Do Ceme on @foursquare! http://t.co/CfIF2LTm	false	184619011361869825	192336667
2012-03-27T12:32:57+00:00	137					en	-3.109555	no ♥ da minha gatinha.	-60.009778	0	false	jessicavaanessa	"<a href=""http://foursquare.com"" rel=""nofollow"">foursquare</a>"	I'm at Bebedouro Do Ceme (Manaus, Amazonas) http://t.co/7ZM7caXy	false	184619011592552448	192336667
2012-03-27T12:32:57+00:00	137					en	59.0687737	Oslo, Norway	11.529263533333333	0	false	jonarnesen	web	...dømt i går, pågrepet i dag for nyere forhold, og noen klager på at prøvetiden på førerkort ikke fungerer... #mullahkrekar #krekar #mullah	false	184619011747741697	18520064
2012-03-27T12:32:57+00:00	137					ja	34.7054449	大阪府	135.49594045	0	false	mucchi0312	"<a href=""http://foursquare.com"" rel=""nofollow"">foursquare</a>"	I'm at Supporter's Field http://t.co/xdHMBFyC	false	184619011697426432	495949802
2012-03-27T12:32:57+00:00	137					it	39.391708799999996	CA-Sardinia Magic Island 	9.3025356	0	false	AaronnMax	web	Listenin.... Jimmy Reed	false	184619011689033728	445455678
2012-03-27T12:32:57+00:00	137					en	3.1484500499999997	kuala lumpur	101.71766273333333	0	false	RainaBalqis	web	Kalau dj tak suka sgt ngn aku, tukar je lah tempat duduk aku.Takyah nk simpan dlm benak kau tu.	false	184619011991027712	234167885

And, what I'm looking to do is to isolate the rows from file2 where file1 is matched in either file2's 6th or 18th columns. A simple grep won't do it because in reality, file1 is too large to just pass along - definitely more than a couple hundred thousand lines. So, anyone have an idea on this? Also, obviously, this is my first post, so if I screwed up my post, that's why. I've been looking for the past few hours at sed, awk, and grep and comm and all sorts of combinations and tweaks, but can't hack this one out, as I'm getting to the limits of understanding how shell scripting works.
# 2  
Old 05-02-2012
Actually, your first post is quite correct, well done and welcome...

Are you sure it is not the fifth column instead of the 5th. And I can find the second value only in the last column. If not, then you could try this:
Code:
awk 'NR==FNR{A[$1];next}$6 in A || $18 in A' file1 file2

is so try this:
Code:
awk 'NR==FNR{A[$1];next}$5 in A || $NF in A' file1 file2

This User Gave Thanks to Scrutinizer For This Post:
# 3  
Old 05-02-2012
You could probably try using egrep, but it depends how well you know the data. It may be possible to specify a pattern to match on the nth field and therefore match the records you want. It depends what you call a column. Do you mean fixed width, i.e. starting from character 6 or character 18 (seems unlikely with you sample) or is it space sparated, and if so, how do you handle multiple spaces.

If you mean "the last bit of a record" then you can use a $ to mark where the end of record is. You would code this as:-
Code:
string$

in file1. You may have to cater for it in other positions too.

Can you elaborate a bit more on what should match and what should not, including some variations in position/column on records. Perhaps someone better than I will come up with either a better pattern match or an awk to assist if I can go no further.

Thanks, in advance.


I hope that this helps,
Robin
Liverpool/Blackburn

---------- Post updated at 05:52 PM ---------- Previous update was at 05:50 PM ----------

Oh! I've been beaten by an awk already. I suppose it's better than being beaten by an Orc though! Smilie
This User Gave Thanks to rbatte1 For This Post:
# 4  
Old 05-02-2012
See I'm not sure, Scrutinizer: technically its the 6th column, because there is no content in the 5th column (notice the additional tab space there). Nohting is ever in that column in this stage of the data, however - perhaps we could refer to them as the 5th and 17th? If you apply the second version, you get multiple copies of the same lines - perhaps a final piping is needed before it is sent to file? If so, what?
# 5  
Old 05-02-2012
OK then it probably is TAB separated you would need to run iike this :
Code:
awk -F'\t' 'NR==FNR{A[$1];next}$18 in A || $NF in A' file1 file2


--
Yes, looks like it, the number of fields is 18 on every line:
Code:
$ awk -F'\t' '{print NF}' file2
18
18
18
18
18
18
18
18

This User Gave Thanks to Scrutinizer For This Post:
# 6  
Old 05-02-2012
hey rbatte1,

Yes, you have been beaten. This data is tab separated, delimited by \n. The basic idea here is that in file1, you have a list of records that are acceptable users that have posted something - the 6th (or 5th, see my above post) column is one user id (this is a tweet, folks), and the final column is another user id. Ideally, we would collect all data points where 6th is not empty and 18th is in the list, I figured this was a halfway-step towards that?

---------- Post updated at 11:59 AM ---------- Previous update was at 11:55 AM ----------

Ahh, ok! So Scrutinizer, that will give us a result where either user id is included in file 1; in this case, just changing over to && will make it perform the intended action. A final question: how can you uniq this output in the event where a line matches twice? just uniq?
# 7  
Old 05-02-2012
Quote:
what I'm looking to do is to isolate the rows from file2 where file1 is matched in either file2's 6th or 18th column
Did you mean to say both .. and ?
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

How to use "awk" to print columns from different files in separate columns?

Hi, I'm trying to copy and paste the sixth column from a bunch of files into a single file having each column pasted in separate columns (and not one after each other in just one column.) I tried this code but works only partially because it copied and pasted 50 rows of each column... (6 Replies)
Discussion started by: Frastra
6 Replies

2. Shell Programming and Scripting

Separate columns into different text files

Hi I have large text file consisting of five columns. Sample of the file is give below: ed 2-4 12.0 commons that they depended on. मानवों नष्ट किया जिन पर वो आधारित थे। ed 3-1 12.0 Almost E, but would be over. रचना करीब करीब ई तक जाती है, मगर तब तो नाटक ख़त्म हो... (2 Replies)
Discussion started by: my_Perl
2 Replies

3. UNIX for Dummies Questions & Answers

Intersect of two columns in two separate files

Hi, I have a file like this: abc def ghi jkl mno My second file is like this (tab delimited): adsad sdfsdf dfdf wads abc dfdsf sdsf jkl sfsdf dsfds sdfd reor zxczd dsf sff Now, I want the code to report the lines (from file2) which have common strings in column 2 with the first... (4 Replies)
Discussion started by: a_bahreini
4 Replies

4. Shell Programming and Scripting

Comparing two columns from two different files

Hi, I have a single-column file1 having records like: 00AB01/11 43TG22/00 78RC09/34 ...... ...... and a second file , file 2 having two columns like 78RC09/34 1 45FD11/11 2 00AB01/11 3 43TG22/00 4 ...... ...... (8 Replies)
Discussion started by: amarn
8 Replies

5. Shell Programming and Scripting

comparing two columns from two different files

Hello, I have two files as 1.txt and 2.txt with number as columns. 1.txt 0 53.7988 1 -30.0859 2 20.1632 3 14.2135 4 14.6366 5 -37.6258 . . . 31608 -8.57333 31609 -2.58554 31610 -24.2857 2.txt (1 Reply)
Discussion started by: AKD
1 Replies

6. UNIX for Dummies Questions & Answers

Comparing columns in two files

Hi, I have two files. File1.txt has 2 columns and looks like: 458739 122345 4456 122657 34200 122600 File2.txt has many columns with column 1 the same as column2 of File1.txt, but with lot more rows: 122786 abcdefg user1@email 122778 uuhjeufh user2@email... (1 Reply)
Discussion started by: ursaan
1 Replies

7. UNIX for Dummies Questions & Answers

Comparing 2 columns from 2 files

Hi, I have two files with the same number of columns. Basically I want to print the 2 columns that match between the two files. File1 looks like this: dr12 12 6 abn dr14 12 7 abn File2 looks something like this: dr12 12 8 abn dr12 14 7 abn So basically if the first... (1 Reply)
Discussion started by: kylle345
1 Replies

8. Shell Programming and Scripting

Comparing Counts Within Separate Files

Hey all, So I have this challenge where I am attempting to compare record counts from within several different log files. I want input and output counts for each file, and I want to compare that with the result of the input/output comparison from a separate--but related file. Example: ... (2 Replies)
Discussion started by: gator76
2 Replies

9. Shell Programming and Scripting

Comparing Columns of two FIles

Dear all, I have two files in UNIX File1 and File2 as below: File1: 1,1234,.,67.897,,0 1,4134,.,87.97,,4 0,1564,.,97.8,,1 File2: 2,8798,.,67.897,,0 2,8879,.,77.97,,4 0,1564,.,97.8,,1 I want to do the following: (1) Make sure that both the files have equal number of columns and if... (4 Replies)
Discussion started by: ggopal
4 Replies

10. UNIX for Advanced & Expert Users

Comparing Columns of two FIles

Dear all, I have two files in UNIX File1 and File2 as below: File1: 1,1234,.,67.897,,0 1,4134,.,87.97,,4 0,1564,.,97.8,,1 File2: 2,8798,.,67.897,,0 2,8879,.,77.97,,4 0,1564,.,97.8,,1 I want to do the following: (1) Make sure that both the files have equal number of columns and if... (1 Reply)
Discussion started by: ggopal
1 Replies
Login or Register to Ask a Question