Advanced: Sort, count data in column, append file name


 
# 1  
Old 08-08-2012

Hi. I am not sure the title gives an optimal description of what I want to do. Also, I tried to post this in the "UNIX for Dummies Questions & Answers", but it seems no-one was able to help out.

I have several text files that contain data in many columns. All the files are organized the same way, but the data in the columns might differ. I want to count the number of times data occur in specific columns, sort the output and make a new file. More specifically, I want to check several files for occurrences of the same data, count the number of times it occurs, append the file name to each one and make a new file sorted by the number of occurrences.

File 1:
Code:
xx xx xx aab rrt xx 
xx xx xx ccd bbt xx 
xx xx xx ggt iir xx

File 2:
Code:
xx xx xx ggt iir xx
xx xx xx ccd bbt xx

File 3:
Code:
xx xx xx aab rrt xx 
xx xx xx ggt iir xx

First I modified each file individually (any better way?) so that the file name appears in the first column:

Code:
sed 's/^/File1\t/' file1.temp > 1.txt

This gives files with:

File1:
Code:
File1 xx xx xx aab rrt xx 
File1 xx xx xx ccd bbt xx 
File1 xx xx xx ggt iir xx

File2:
Code:
File2 xx xx xx ggt iir xx
File2 xx xx xx ccd bbt xx

File3:
Code:
File3 xx xx xx aab rrt xx 
File3 xx xx xx ggt iir xx
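
On the "any better way?" question: the per-file sed step can probably be replaced by a single awk pass, because awk exposes the name of the file it is currently reading in the built-in variable FILENAME. This is only a sketch under assumptions: the sample files are recreated from the post, and the output name tagged.txt is made up for illustration. Note that the tag is the actual file name (file1.temp), not a hand-written label like File1.

```shell
# Sketch: tag every line with the name of the file it came from, in one pass.
# Sample data recreated from the post; "tagged.txt" is a hypothetical name.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'xx xx xx aab rrt xx\nxx xx xx ccd bbt xx\nxx xx xx ggt iir xx\n' > file1.temp
printf 'xx xx xx ggt iir xx\nxx xx xx ccd bbt xx\n' > file2.temp
printf 'xx xx xx aab rrt xx\nxx xx xx ggt iir xx\n' > file3.temp

# FILENAME holds the name of the current input file.
# The NF condition skips empty lines, so they do not become name-only records.
awk 'NF { print FILENAME, $0 }' file1.temp file2.temp file3.temp > tagged.txt

cat tagged.txt
```

With the file name already in column 1, the later `awk '{print $1,$5,$6}'` step works unchanged on tagged.txt.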

Then I extracted the columns of interest and sorted them and made a new file:

Code:
awk '{print $1,$5,$6}' *.txt |sort -k2 > output.txt

The output.txt file could look like this:

Code:
File1 aab rrt 
File3 aab rrt 
File1 ccd bbt 
File2 ccd bbt 
File2 ggt iir
File3 ggt iir 
File1 ggt iir

Now, I want to count the number of lines in which column 2 and column 3 are identical, and keep the column 1 information in the output file, separated by commas or similar. I want the result to look like this:

Code:
2 ccd bbt File1,File2
2 aab rrt File1,File3
3 ggt iir File1,File2,File3

It would be good (but not a requirement) to have the last column of the final file sorted: lane1, lane2, lane3, etc. (lane* being the file names, shown as File1, File2, File3 above). The lane* entries can also be split into separate columns if that is easier.

So far I have tried to use:

Code:
awk '{print $1,$5,$6}' *.txt |sort -k2|uniq -f1 -c|sort -g > final_output.txt

However, I am not able to get the file names from column 1 merged in the final output file. How should I go about doing that?

-James
# 2  
Old 08-08-2012
Something like this:
Code:
$ cat file[123]
File1 xx xx xx aab rrt xx
File1 xx xx xx ccd bbt xx
File1 xx xx xx ggt iir xx
File2 xx xx xx ggt iir xx
File2 xx xx xx ccd bbt xx
File3 xx xx xx aab rrt xx
File3 xx xx xx ggt iir xx
$
$ perl -lane '$x{"$F[4] $F[5]"} .= "$F[0],"; END{for(keys %x){$x{$_}=~s/,$//;print "$_ $x{$_}"}}' file1 file2 file3
ggt iir File1,File2,File3
ccd bbt File1,File2
aab rrt File1,File3
$

This User Gave Thanks to balajesuri For This Post:
# 3  
Old 08-08-2012
Wow! Thanks for the swift reply. I am almost there. However, upon running the perl script I got the following "comma" in the wrong place:

Code:
perl -lane '$x{"$F[4] $F[5]"} .= "$F[0],"; END{for(keys %x){$x{$_}=~s/,$//;print "$_ $x{$_}"}}' file1 file2 file3

Results in:
Code:
ggt iir File1,File2,File3
  ,
 ccd bbt File1,File2
aab rrt File1,File3

Also, I would like to have the count listed in the first column:
Code:
3 ggt iir File1,File2,File3
2 ccd bbt File1,File2
2 aab rrt File1,File3

Any way of implementing this? Do I need to run uniq -c before I run the perl script?
# 4  
Old 08-08-2012
Hi

You have empty lines in the files; that's why you have a comma in the "wrong" place.
To add the count and get the output you want, change the code to:
Code:
perl -lane '$c{"$F[4] $F[5]"}++; $x{"$F[4] $F[5]"} .= "$F[0]," if $F[5]; END{for(keys %x){$x{$_}=~s/,$//;print "$c{$_} $_ $x{$_}"}}' file[123]
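
To check whether blank lines really are the cause, or to strip them before running the one-liner, something like the following may help. It is a sketch, not part of the original answer: the sample file is invented to contain one blank line, and file1.clean is a hypothetical output name.

```shell
# Sketch: detect and remove blank lines before further processing.
workdir=$(mktemp -d)

# Invented sample with one blank line, like the files that produced the stray comma.
printf 'File1 xx xx xx aab rrt xx\n\nFile1 xx xx xx ccd bbt xx\n' > "$workdir/file1"

# NF is 0 for empty or whitespace-only lines, so !NF selects exactly those.
blank=$(awk '!NF { n++ } END { print n + 0 }' "$workdir/file1")
echo "blank lines: $blank"

# Keep only lines that have at least one field.
awk 'NF' "$workdir/file1" > "$workdir/file1.clean"
cat "$workdir/file1.clean"
```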

This User Gave Thanks to Chirel For This Post:
# 5  
Old 08-08-2012
Can the order of file1, file2, file3 vary?
For example, can the output contain an order like file2, file1, file3?
# 6  
Old 08-09-2012
Quote:
Originally Posted by Chirel
Hi

You have empty lines in the files; that's why you have a comma in the "wrong" place.
To add the count and get the output you want, change the code to:
Code:
perl -lane '$c{"$F[4] $F[5]"}++; $x{"$F[4] $F[5]"} .= "$F[0]," if $F[5]; END{for(keys %x){$x{$_}=~s/,$//;print "$c{$_} $_ $x{$_}"}}' file[123]

Thanks, that did what I wanted. What I did myself yesterday before reading your reply was to run:
Code:
awk -F "lane" '{print NF-1}' perl_output_file > new_count_file

followed by:
Code:
paste new_count_file perl_output_file > final_output_with_count_file
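
The two steps above could probably be folded into one awk call, since gsub() returns the number of matches it replaced. The sketch below assumes the Perl output lists file names of the form lane1, lane2, and so on; the sample lines are invented for illustration.

```shell
# Sketch: prepend the number of "lane" occurrences to each line in one pass.
workdir=$(mktemp -d)

# Hypothetical sample of the Perl output, using lane* file names.
printf 'ggt iir lane1,lane2,lane3\nccd bbt lane1,lane2\n' > "$workdir/perl_output_file"

# gsub() returns how many matches it replaced; "&" puts the matched text
# back, so the line itself is left unchanged. The tab mimics paste's default.
counted=$(awk '{ n = gsub(/lane/, "&"); print n "\t" $0 }' "$workdir/perl_output_file")
printf '%s\n' "$counted"
```

Like the `-F "lane"` version, this counts the string "lane" anywhere on the line, so it would overcount if a data column happened to contain it.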

But the Perl code is more impressive. From this exercise, being a biologist trying to do some simple bioinformatics, I really want to learn more Unix/script/shell programming. Wow, so powerful. :)
# 7  
Old 08-09-2012
Code:
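awk '{
    file = $1                   # remember the file name
    $1 = ""                     # blank it so $0 holds only the data columns
    if (!($0 in count))         # first time this data is seen:
        order[++n] = $0         #   record it to keep input order
    count[$0]++
    if (!seen[$0, file]++)      # list each file name only once per data key
        files[$0] = ($0 in files) ? files[$0] "," file : file
}
END {
    # blanking $1 leaves a leading space in the key, hence the double space
    for (k = 1; k <= n; k++)
        print count[order[k]], order[k], files[order[k]]
}' filename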

output is
Code:
2  aab rrt File1,File3
2  ccd bbt File1,File2
3  ggt iir File2,File3,File1

Sort on column two if you need the output sorted by that column.
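
The original post also asked for the result sorted by number of occurrences, with the file names in order. One possible way, sketched here under the assumption that the input is the three-column output.txt from post #1, is to sort before and after the awk step:

```shell
# Sketch: count, merge file names in sorted order, and sort the report by count.
workdir=$(mktemp -d)
cat > "$workdir/output.txt" <<'EOF'
File1 aab rrt
File3 aab rrt
File1 ccd bbt
File2 ccd bbt
File2 ggt iir
File3 ggt iir
File1 ggt iir
EOF

# 1. sort by the data columns, then by file name, so names arrive in order
# 2. awk merges consecutive lines that share the same data columns
# 3. the final sort orders the report by count, ties broken by the data columns
result=$(
  sort -k2,3 -k1,1 "$workdir/output.txt" |
  awk '{
      key = $2 " " $3
      if (key != prev) {                 # new group: flush the previous one
          if (NR > 1) print n, prev, list
          prev = key; n = 0; list = ""
      }
      n++
      list = (list == "") ? $1 : list "," $1
  }
  END { if (NR > 0) print n, prev, list }' |
  sort -k1,1n -k2,3
)
printf '%s\n' "$result"
```

The file names are ordered as plain strings, so File10 would come before File2. A file that contains the same pair twice is counted twice and listed twice, which matches the behaviour of the Perl one-liner earlier in the thread.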