Hello,
I have a file with the following structure:
Code:
word space Frequency
The file contains around 30,000 headwords, each followed by its frequency. The words have different lengths. What I need is a Perl or AWK script which sorts the file by the length of the headword, smallest to largest, and then sorts each set of words of the same length by frequency, highest frequency first.
At present I do this in Excel using the
Code:
=Len(text)
formula, but this is getting tedious.
Here is a sample input file:
Code:
the 29962169
and 14291859
you 12345509
for 3296048
not 3091071
but 2994482
say 2345958
she 2123744
get 2081392
one 1988291
can 1915289
out 1812292
him 1571291
who 1543711
are 1487971
now 1453264
was 1399013
that 7834407
have 5930242
with 3983564
this 3814998
what 3327049
they 2684414
your 2329896
know 2221467
from 2207336
like 1845600
just 1756270
here 1558771
come 1541623
when 1465219
there 1957160
about 1903238
right 1410555
think 1398723
would 1346905
The expected output would be:
Code:
the 29962169
and 14291859
you 12345509
for 3296048
not 3091071
but 2994482
say 2345958
she 2123744
get 2081392
one 1988291
can 1915289
out 1812292
him 1571291
who 1543711
are 1487971
now 1453264
was 1399013
that 7834407
have 5930242
with 3983564
this 3814998
what 3327049
they 2684414
your 2329896
know 2221467
from 2207336
like 1845600
just 1756270
here 1558771
come 1541623
when 1465219
there 1957160
about 1903238
right 1410555
think 1398723
would 1346905
As you can see, the file has been sorted on length and then on frequency.
Any help would spare me the tedium of loading the file into Excel each time. Many thanks in advance.
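Code:
awk '
{
    # Track the shortest and longest headword seen so far
    l = length($1)
    if (min == 0 || l < min)
        min = l
    if (l > max)
        max = l
    # Key each line by "length counter" so that keys stay unique
    A[l " " ++c] = $0
}
END {
    F = "tmp"
    for (i = min; i <= max; i++) {
        # Collect every line whose headword has length i
        R = ""
        for (j in A) {
            split(j, a)
            if (i == a[1])
                R = (R == "" ? A[j] : R RS A[j])
        }
        if (R == "")
            continue    # no headwords of this length
        # Write the group to a temporary file, then sort it by descending frequency
        print R > F
        close(F)
        cmd = "sort -nr -k2 " F
        while ((cmd | getline) > 0)
            print
        close(cmd)
    }
}' file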
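For comparison, the same ordering can probably be had with less machinery by prefixing each line with its headword length, letting sort handle both keys, and then cutting the prefix off again. This is only a sketch: the function name sort_by_len_freq is made up for illustration, and it assumes a Unix-like shell with awk, sort and cut available, and input lines of the form "word frequency".

```shell
# Sketch: sort "word frequency" lines by word length (ascending),
# then by frequency (descending) within each length.
# sort_by_len_freq is a hypothetical helper name, not part of any tool.
sort_by_len_freq() {
    awk '{ print length($1), $0 }' "$1" |   # decorate: prepend the headword length
        sort -k1,1n -k3,3nr |               # length ascending, frequency descending
        cut -d' ' -f2-                      # undecorate: drop the length column
}
```

It would be called as sort_by_len_freq file, with the sorted result written to standard output.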
Many thanks to both. Unfortunately I am on Windows, and pipes in Windows do not seem to work. I wonder why DOS does not allow efficient pipes. Maybe they want DOS to be separate from Linux.
Sorry for the bother. Is there any solution without pipes?
Many thanks.
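One way to avoid pipes altogether is to do the whole sort inside a single awk process. The following is a minimal sketch, not a tested production script: it holds all lines in memory and orders them with a hand-written quicksort, comparing first on headword length and then on frequency. The names sort_no_pipes, before and qsort are invented for this example. It is wrapped in a shell function here; under CMD, where quoting rules differ, it is probably easier to save the awk program text to a file and run it with awk -f.

```shell
# Sketch: length-then-frequency ordering done entirely inside one awk
# process, with no pipes and no external sort command.
# sort_no_pipes, before and qsort are hypothetical names.
sort_no_pipes() {
    awk '
    # Return 1 if line a must come before line b.
    function before(a, b,    fa, fb, la, lb) {
        split(a, fa); split(b, fb)
        la = length(fa[1]); lb = length(fb[1])
        if (la != lb)
            return (la < lb)                  # shorter headword first
        return ((fa[2] + 0) > (fb[2] + 0))    # then higher frequency first
    }
    # In-place quicksort of line[lo..hi] with a random pivot.
    function qsort(lo, hi,    i, last, t) {
        if (lo >= hi)
            return
        i = lo + int((hi - lo + 1) * rand())
        t = line[lo]; line[lo] = line[i]; line[i] = t
        last = lo
        for (i = lo + 1; i <= hi; i++)
            if (before(line[i], line[lo])) {
                last++
                t = line[last]; line[last] = line[i]; line[i] = t
            }
        t = line[lo]; line[lo] = line[last]; line[last] = t
        qsort(lo, last - 1)
        qsort(last + 1, hi)
    }
    NF { line[++n] = $0 }                     # keep non-empty lines only
    END {
        qsort(1, n)
        for (i = 1; i <= n; i++)
            print line[i]
    }' "$1"
}
```

The random pivot is a deliberate choice: the sample data is already close to sorted, which is the worst case for a quicksort that always picks the first element.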
When you request help for something that should run under DOS, you must specify that very clearly in the original problem statement. Also, when that's the case, there's a forum for DOS scripting. The rest of the site assumes UNIX, as per the domain name.
My solution should run fine on Windows with Cygwin.
Regarding pipes and DOS, pipes connect two processes. DOS was not designed as a multitasking system.
Quote:
Many thanks to both. Unfortunately I am on Windows and pipes in Windows do not seem to work.
In what way do they not seem to work? They work, they're just not the greatest implementation.
Quote:
I wonder why DOS does not allow efficient pipes.
Well, it's not DOS anymore, but it has to act like it. 30 years ago, DOS wasn't multitasking, and had to emulate pipes by running the commands one at a time with the output saved in between.
I do think there must be a way to get that solution working in CMD. In what way does it not work?
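To illustrate what "one at a time with the output saved in between" means, here is a rough sketch in Unix shell syntax (not CMD) that runs each stage to completion and hands its output to the next stage through a temporary file instead of a pipe. The function name sort_via_tempfile is hypothetical, and mktemp is assumed to be available.

```shell
# Sketch: emulate a three-stage pipeline with temporary files.
# Each command finishes before the next one starts.
# sort_via_tempfile is a hypothetical name; mktemp is assumed to exist.
sort_via_tempfile() {
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 1; }
    awk '{ print length($1), $0 }' "$1" > "$tmp1"   # stage 1: prepend headword length
    sort -k1,1n -k3,3nr "$tmp1" > "$tmp2"           # stage 2: length up, frequency down
    cut -d' ' -f2- "$tmp2"                          # stage 3: drop the length column
    rm -f "$tmp1" "$tmp2"
}
```

The result is the same as with real pipes; the difference is that no stage can start until the previous one has written its complete output to disk.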