Let me explain the output a little. For the row that starts with "A", the characters "B,C,D,F,H" are extracted from all the lists that contain "A", and they are sorted by how often they appear, in descending order.
...
The portion in red color is a repetition of the portion immediately above it in your file data1.txt.
(1) Can your actual file have data like that?
(2) If yes, can the "second" set be different? For example, is it possible to have two lines like so -
Code:
A:list1,list2,list3,list4
...
A:list1,list3,list5
(3) If yes, can there be more than two sets in data1.txt? Like so -
(4) In such a case, do you collect a unique set of "lists" for each character on the left?
For example, for A => list1, list2, list3, list4, list5, list7?
(5) The character "C" is associated with 3 lists in data1.txt:
Code:
C:list1,list2,list6
And these lists have the following set of characters:
Code:
list1:A,B,C
list2:A,B,C,F,H
...
list6:C
...
So, the distinct set of characters associated with list1, list2 and list6 should be => (A, B, C, F, H)
Your desired output for "C" is like so -
Code:
desired output:
...
C:A,B,F,H
Have you removed "C" from the right-hand-side list because it appears on both sides of the ":" character?
(6) Is that also the reason you've omitted "G:G" from your desired output?
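To make questions (4) to (6) concrete, here is a minimal sketch of the rule I am assuming: merge repeated keys from the first file, collect the distinct characters of every associated list, and drop the key itself. The function name cooccur and the second file name data2.txt are my own placeholders, not something from your post, and the sketch assumes that keys and list names contain no ":" or ",".

```sh
#!/bin/sh
# cooccur FILE1 FILE2   (hypothetical helper, for illustration only)
#   FILE1 lines look like "A:list1,list2"; a key may appear on several lines
#   FILE2 lines look like "list1:A,B,C"
# Prints "key:char,char,..." with the key itself removed and the
# characters in sorted order (no frequency ranking here).
cooccur() {
    LC_ALL=C awk -F: '
        NR == FNR {                        # first file: merge repeated keys
            lists[$1] = ($1 in lists) ? lists[$1] "," $2 : $2
            next
        }
        { chars[$1] = $2 }                 # second file: list -> characters
        END {
            for (key in lists) {
                n = split(lists[key], lst, ",")
                for (i = 1; i <= n; i++) {
                    m = split(chars[lst[i]], chr, ",")
                    for (j = 1; j <= m; j++)
                        if (chr[j] != key)  # assumed rule: drop the key itself
                            print key ":" chr[j]
                }
            }
        }
    ' "$1" "$2" |
    LC_ALL=C sort -u |                     # one line per distinct key:char pair
    awk -F: '
        NR > 1 && $1 == prev { line = line "," $2; next }
        { if (NR > 1) print line; prev = $1; line = $1 ":" $2 }
        END { if (NR > 0) print line }
    '
}
```

Called as cooccur data1.txt data2.txt on the three "C" lists shown above, this prints C:A,B,F,H. A character whose only list contains just itself produces no line at all, which would explain the missing "G:G" if that is indeed the rule.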
Quote:
...
This is the code I used to generate the output:
Code:
#!/bin/awk -f
BEGIN {
    FS = ":"
    datecmd = "date +%s"
    datecmd | getline ts1      # actually run date(1) to record the start time in seconds
    close(datecmd)
}
NR == FNR { bookLists[$1] = $2; next }    # the first file
{ books[$1] = $2 }                         # the second file
END {
    print "There are in total", length(bookLists), "book lists and", length(books), "books." > "make_data_log.txt"
    count = 0
    # bookLists: A:list1,list2
    # books: list1:A,B
    # sort by frequency (field 3), then join field 2 in that order;
    # collecting into an array and using "for (i in arr)" would lose the sort order
    cmd = "sort -k 3 -nr | awk ' { if (!b) b=$1; str = str ? str\",\"$2 : $2 } END { print b, str } '"
    for (i in bookLists) {
        #print "========== book " i "=========="
        split(bookLists[i], tmpBls, /,/)
        for (j in tmpBls) {
            split(books[tmpBls[j]], tmpBs, /,/)
            for (k in tmpBs)
                ++result[tmpBs[k]]    #print i":", tmpBs[k]
            delete tmpBs
        }
        for (l in result) {
            if (i != l) print i, l, result[l] | cmd
            #if (i != l) print i, l, result[l]
            if (++num == 20) break
        }
        close(cmd)    # close the pipe, or this "sort" will be delayed until the awk program ends
        delete result
        delete tmpBls
        num = 0
        if (++count % 100000 == 0)
            print count >> "make_data_log.txt"
    }
    datecmd | getline ts2      # end time in seconds
    close(datecmd)
    print ts2 - ts1, "seconds consumed." >> "make_data_log.txt"
}
...
What's the current response time for your actual data file "data1.txt" (the one that has more than 30,000 comma-delimited strings)?
And what is the acceptable response time for the same?
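One thing that may matter for the response time: the script above starts a new "sort | awk" pipeline for every key, because close(cmd) sits inside the loop. As a rough sketch only, and not a measured fix, the same ranking could be done with a single sort over all (key, character, count) triples. The function name rank_cooccur and the optional LIMIT argument are hypothetical, ties are broken by character name (my assumption), and keys are assumed to contain no spaces.

```sh
#!/bin/sh
# rank_cooccur FILE1 FILE2 [LIMIT]   (hypothetical helper, for illustration only)
#   FILE1: "A:list1,list2"    FILE2: "list1:A,B,C"
# Prints "key:char,char,..." ordered by descending frequency,
# keeping at most LIMIT characters per key (default 20).
rank_cooccur() {
    LC_ALL=C awk -F: '
        NR == FNR { lists[$1] = $2; next }     # first file
        { chars[$1] = $2 }                     # second file
        END {
            for (key in lists) {
                n = split(lists[key], lst, ",")
                for (i = 1; i <= n; i++) {
                    m = split(chars[lst[i]], chr, ",")
                    for (j = 1; j <= m; j++)
                        if (chr[j] != key) count[chr[j]]++
                }
                for (c in count) print key, c, count[c]
                split("", count)               # portable way to empty the array
            }
        }
    ' "$1" "$2" |
    # one sort for everything: by key, then count descending, then character
    LC_ALL=C sort -t ' ' -k1,1 -k3,3nr -k2,2 |
    awk -v limit="${3:-20}" '
        NR == 1 || $1 != prev {                # a new key starts here
            if (NR > 1) print line
            prev = $1; line = $1 ":" $2; kept = 1
            next
        }
        kept < limit { line = line "," $2; kept++ }
        END { if (NR > 0) print line }
    '
}
```

Unlike the "if (++num == 20) break" in the quoted script, which stops after 20 entries in whatever order awk happens to walk the array, the limit here is applied after sorting, so it keeps the most frequent characters.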
Hi all!
I am relatively new to UNIX stuff, and I have come across a problem:
I have a big directory, which contains 100 smaller ones. Each of the 100 contains a file ending in .txt , so there are 100 files ending in .txt
I want to split each of the 100 files in smaller ones, which will contain... (4 Replies)
$mystring = "name:blk:house::";
print "$mystring\n";
@s_format = split(/:/, $mystring);
for ($i=0; $i <= $#s_format; $i++) {
print "index is $i,field is $s_format[$i]";
print "\n";
}
$size = $#s_format + 1;
print "total size of array is $size\n";
I am expecting my size to be 5, why is it... (5 Replies)
I have gone through all the threads in the forum and tested out different things. I am trying to split a 3GB file into multiple files. Some files are even larger than this.
For example:
split -l 3000000 filename.txt
This is very slow and it splits the file with 3 million records in each... (10 Replies)
Hi,
I have some output in the form of:
#output:
abc123
def567
hij890
ghi324
The above is in one column, stored in the variable x (and if you want to know about x: x=sprintf(tolower(substr(someArray,1,1)substr(userArray,3,1)substr(userArray,2,1))) ).
When I simply print x (print x) I get... (7 Replies)
Hi... I have a question regarding the split function in Perl.
I have a very large CSV file (more than 80 million records). I need to extract a particular field (e.g. the 50th) from each line of the CSV file. I tried using the split function, but I realized that split takes a very long time.
Also... (1 Reply)
my @d =split('\|', $_);
west|ACH|3|Y|LuV|N||N||
Qt|UWST|57|Y|LSV|Y|Bng|N|KT|
It returns 8 elements in @d for the first line and 9 for the second. I want to process both; how do I handle this? (3 Replies)
Hello;
I have a file that consists of 4 columns separated by tabs. The problem is the third field. Some of the values are very long but can be split by the vertical bar "|". Also, some of them do not contain the string "UniProt", but I could ignore that for the moment and sort the file afterwards. Here is... (5 Replies)