Comparing lines within a word list

09-10-2015

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

Quote:

Originally Posted by RudiC

I'm afraid you'll have to search for (e.g. grep) for every single combination of chars, for above it would be grep "A.E" or grep "A[[:alpha:]]E", to be more precise.

Exactly. This was what i meant by "trivial".

Quote:

Originally Posted by RudiC

Writing an awk script might cause less effort for the OS running just one command, but still it would need to open and close many many files...

I don't think so. In fact, after running grep "A.E" the job would be done. The reason is that all hits this generates would differ from each other by exactly one character whereas hits from other runs (say "AB.") will sometimes differ by one character ("ABE" and "ABX") but sometimes by two ("ACE" from the first run and "ABX" from the second), so the property of all hits differering by exactly one character within the group would be lost upon any other run.

I hope this helps.

bakunin

This User Gave Thanks to bakunin For This Post:

bakunin

View Public Profile for bakunin

Find all posts by bakunin

09-10-2015

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Quote:

Originally Posted by bakunin

.
.
.
In fact, after running grep "A.E" the job would be done.
.
.
.

Yes - for that char combination/permutation. Doing so for the entire alphabet would require to run grep repeatedly, creating a new, different results file every time. An awk script might run once, but would have to open/reopen results files frequently, so the gain might be eaten up.

This User Gave Thanks to RudiC For This Post:

RudiC

View Public Profile for RudiC

Find all posts by RudiC

09-10-2015

Registered User

2,288, 480

Join Date: Apr 2007

Last Activity: 3 May 2020, 8:28 AM EDT

Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris

Posts: 2,288

Thanks Given: 430

Thanked 480 Times in 395 Posts

Hi.

Off the top of my head (but before more than one cup of coffee):

1) I agree with RudiC. The OP asked for all sets.

2) I think I'd try applying the idea of a sieve as in https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes

So I'd sort the list, and, starting from the top, record matches, and erase from the original the matched items. The move to the next. Perhaps hashes would be faster than going through the list a number of times, but I haven't thought about that in detail.

I'd save the sets in filenames like, for example, for the string abc: _bc, a_b, and ab_

The list is presumably from some work, perhaps a dictionary as mentioned by the OP, so it is not as large as a complete set of permutations could be (26 items taken 3 at a time, some 15K if I did that correctly). I tried a few initial stabs using a common list of words, 25143 items total, of which 800 were 3 characters long.

So perhaps an awk/perl code to start, but might need a compiled code if that's too slow.

3) I wish that the OP had given more information about why this is needed.

Best wishes ... cheers, drl

This User Gave Thanks to drl For This Post:

drl

View Public Profile for drl

Find all posts by drl

09-10-2015

Registered User

5, 0

Join Date: Sep 2015

Last Activity: 25 April 2016, 9:59 PM EDT

Posts: 5

Thanks Given: 7

Thanked 0 Times in 0 Posts

Thanks, RudiC. So it would like the indexing solution you proposed earlier? Would that happen in Perl or something else? Or is it not feasible anyway? I realize there's a lot of computing going on with this problem, but it strikes me as having a very simple solution in the right language.

---------- Post updated at 01:01 PM ---------- Previous update was at 12:53 PM ----------

Sorry all, I replied before I'd seen the thread had gone on to a second page. drl, I like the sieve idea and actually had thought something like that should be easy enough to implement. My problem is that I have no idea how to code it or which language even to use. Seemed to me there would be an easy way to make an ordered list of all permutations

AAA, AAB, AAC, AAD, . . . , AAY, AAZ; ABA, ABB, ABC, ABD, . . .

and ask which of these are real words by comparing it to a list of real words. As long as the matches are returned by order according to the all-permutations list, my results would already be grouped appropriately.

dtalvacchio

View Public Profile for dtalvacchio

Find all posts by dtalvacchio

09-10-2015

Registered User

2,202, 340

Join Date: Apr 2007

Last Activity: 10 May 2020, 8:59 AM EDT

Location: 44.21.48N 80.50.15W

Posts: 2,202

Thanks Given: 3

Thanked 340 Times in 306 Posts

Code:

spell <proposed.word.list >temp1
diff temp1 proposed.word.list

jgt

View Public Profile for jgt

Visit jgt's homepage!

Find all posts by jgt

09-10-2015

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

OK, the following is a sketch, with some parallelization to make it faster. Some blanks are left but maybe someone else can fill them in:

First the real "worker" scipt: it gets three parameters, one of which is supposed to be a ".", the other two are characters. It will cycle all characters and combine the two characters with these, consult a a word list and output the results to <stdout>. A typical call is "worker a b .", which will output all three-letter-combinations "ab<something>", which are words.

The routine "IsWord" is a blank and will have to be provided. It gets a word and returns 0 (true) for a word, >0 (false) for "not a word":

Code:

#! /bin/ksh

typeset -Ll1 aChar[1]="$1"
typeset -Ll1 aChar[2]="$2"
typeset -Ll1 aChar[3]="$3"

if   [ "${aChar[1]}" = "." ] ; then
     for chWord in {a..z}${aChar[2]}${aChar[3]} ; do
          if IsWord "$chWord" ; then
               print - "$chWord"
          fi
     done
elif [ "${aChar[2]}" = "." ] ; then
     for chWord in ${aChar[1]}{a..z}${aChar[3]} ; do
          if IsWord "$chWord" ; then
               print - "$chWord"
          fi
     done
else
     for chWord in {a..z}${aChar[2]}${aChar[3]} ; do
          if IsWord "$chWord" ; then
               print - "$chWord"
          fi
     done
fi

return 0

The main script calls the worker script with various combinations of parameters. The worker script is called in background to parallelize it, degree of parallelization depends on your system and will have to be figured out:

Code:

#! /bin/ksh

typeset      fResultDir="/some/place/for/results"
typeset -Ll2 chWord=""
typeset -i   iFanOut=15            # depends on your system

for chWord in {a..z}{a..z} ; do
     while [ $(jobs | wc-l) -ge $iFanOut ] ; do
          sleep 3                          # change this to something sensible
     done
     worker ${chWord%%?} ${chWord##?} "."  > "${fResultDir}/${chWord%%?}.${chWord##?}.@.result" &
     worker ${chWord%%?} "." ${chWord##?}  > "$[fResultDir}/${chWord%%?}.@.${chWord##?}.result" &
     worker "." ${chWord%%?} ${chWord##?}  > "$[fResultDir}/@.${chWord%%?}.${chWord##?}.result" &
done

exit 0

Notice that this is a SKETCH! No efforts on error checking (like full filesystems or whatever) are made. Note also that i have used "@" as a stand-in for "." in the filenames, otherwise some files would be hidden.

I hope this helps.

bakunin

Last edited by bakunin; 09-10-2015 at 06:09 PM..

bakunin

View Public Profile for bakunin

Find all posts by bakunin

09-12-2015

Registered User

2,288, 480

Join Date: Apr 2007

Last Activity: 3 May 2020, 8:28 AM EDT

Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris

Posts: 2,288

Thanks Given: 430

Thanked 480 Times in 395 Posts

Hi.

I wrote a perl code to try the sieve method. The input data (from a dictionary file), with non-alphabetics removed, one word per line were these 757 words:

Code:

AAA  ass  bus  den  etc  gap  hob  Jew  lop  Nan  Orr  PTA  run  spa  tor  wee
AAU  ate  but  Des  Eva  gar  hoc  jig  Los  nap  Ott  pub  rut  spy  tot  Wei
ABA  Aug  buy  dew  eve  gas  hoe  Jim  lot  Nat  our  PUC  rye  Sri  tow  wet
abc  auk  bye  dey  ewe  gay  hog  job  Lou  nay  out  pug  sac  SSE  toy  who
Abe  Ave  cab  did  eye  gee  hoi  Joe  low  NBC  ova  pun  sad  SST  TRW  why
Abo  awe  Cal  die  FAA  gel  Hom  jog  loy  NBS  owe  pup  sag  SSW  try  wig
ace  awl  cam  dig  fad  gem  hop  Jon  LSI  NCO  owl  pus  Sal  Stu  TTL  win
ACM  awn  can  dim  fag  get  hot  jot  Ltd  NCR  own  put  Sam  sub  TTY  wit
ACS  axe  cap  din  fan  gig  how  joy  LTV  Ned  pad  PVC  San  sud  tub  woe
act  aye  car  dip  far  Gil  hoy  jug  lug  nee  pal  QED  Sao  sue  tug  wok
Ada  bad  cat  Dis  fat  gin  hub  jut  lux  net  Pam  qua  sap  sum  tum  won
add  bag  caw  DNA  fay  GMT  hue  Kay  lye  new  pan  quo  sat  sun  tun  woo
ado  bah  CBS  DOD  FBI  GNP  hug  keg  Mac  nib  pap  Rae  saw  sup  TVA  wop
aft  bam  CDC  doe  FCC  gnu  huh  ken  mad  NIH  par  rag  sax  Sus  TWA  wow
age  ban  CEQ  dog  FDA  Goa  hum  key  Mae  nil  pat  raj  say  tab  two  wry
ago  bar  chi  don  Feb  gob  Hun  kid  man  nip  paw  ram  Sci  tad  TWX  yah
aid  bat  CIA  dot  fed  god  hut  Kim  Mao  nit  pax  ran  SCM  tag  ugh  yak
ail  bay  cit  Dow  fee  gog  Ian  kin  map  NNE  pay  rap  sea  tam  UHF  yam
aim  bed  cod  dry  few  GOP  IBM  kit  mar  NNW  Paz  rat  sec  tan  Uri  yap
air  bee  cog  dub  fib  got  Ibn  lab  mat  nob  PBS  raw  see  tao  urn  yaw
ala  beg  col  dud  fig  GPO  ICC  lac  maw  nod  PDP  ray  sen  tap  USA  yea
alb  bel  con  due  fin  GSA  ice  lad  max  non  pea  RCA  seq  tar  USC  yen
ale  Ben  coo  dug  fir  gum  icy  lag  Max  nor  pee  reb  set  tat  use  yet
Ali  bet  cop  dun  fit  gun  Ida  lam  may  not  peg  red  sew  tau  USN  yin
all  bey  cos  dye  fix  Gus  iii  Lao  MBA  Nov  pen  rep  sex  tax  van  yip
alp  bib  cot  ear  Flo  gut  Ike  lap  Meg  now  pep  ret  she  tea  vat  yon
AMA  bid  cow  eat  flu  guy  ill  law  Mel  NRC  per  rev  Shu  ted  vee  you
ami  big  cox  ebb  fly  gym  imp  lax  men  NSF  pet  Rex  shy  Ted  vet  yow
amp  bin  coy  EDT  FMC  gyp  Inc  lay  met  nun  pew  rho  sib  tee  vex  yuh
amy  bit  CPA  eel  fob  had  ink  lea  mew  nut  PhD  rib  sic  Tel  VHF  zag
Amy  biz  cpu  eft  foe  Hal  inn  led  mid  NYC  phi  rid  sin  ten  via  Zan
ana  BMW  CRT  egg  fog  ham  ion  lee  mig  NYU  pie  rig  sip  the  vie  zap
and  boa  cry  ego  fop  Han  Ira  leg  min  oaf  pig  rim  sir  thy  vii  Zen
ani  bob  cub  eke  for  hap  ire  Len  MIT  oak  pin  Rio  sis  tic  vis  zig
Ann  bog  cud  Eli  fox  hat  irk  Leo  mix  oar  pip  rip  sit  tid  viz  zip
ant  bon  cue  elk  FPC  haw  IRS  let  mob  oat  pit  RNA  six  tie  von  Zoe
any  boo  cup  ell  fro  hay  Ito  Lev  Moe  Oct  ply  rob  ski  til  vow  zoo
ape  bop  cur  elm  fry  hem  ITT  lew  moo  odd  pod  rod  sky  Tim  WAC
Apr  bow  cut  Ely  FTC  hen  ivy  Lew  mop  ode  Poe  roe  sly  tin  wad
APS  box  dab  end  fum  her  jab  lid  mot  off  poi  Ron  sob  tip  wag
apt  boy  dad  Eng  fun  hew  jag  lie  mow  oft  pol  rot  Soc  tit  wah
arc  BTL  dam  EPA  fur  hex  jam  lim  MPH  ohm  pop  row  sod  TNT  wan
are  BTU  Dan  era  gab  hey  Jan  Lin  Mrs  oil  pot  Roy  Sol  toe  war
ark  bub  Dar  ere  gad  hid  jar  lip  mud  old  pow  RPM  son  tog  was
arm  bud  day  erg  gag  him  jaw  lit  mug  one  ppm  rub  sop  Tom  wax
art  bug  Dec  err  gal  hip  jay  Liz  mum  opt  pro  rue  sou  ton  way
ash  bum  Dee  EST  gam  his  Jed  lob  nab  orb  pry  rug  sow  too  web
ask  bun  Del  eta  GAO  hit  jet  log  nag  ore  psi  rum  soy  top  wed

the output was 760 lines. Here are the first and last 25 lines:

Code:

Edges: 25:0:25 of 760 lines in file "t4"
 One-character differences for AAA (.AA)
FAA
 One-character differences for AAA (A.A)
ABA
AMA
 One-character differences for AAA (AA.)
AAU
 One-character differences for abc (a.c)
arc
 One-character differences for Abe (A.e)
Ave
 One-character differences for Abe (Ab.)
Abo
 One-character differences for ace (.ce)
ice
 One-character differences for ace (a.e)
age
ale
ape
are
ate
awe
axe
aye
 One-character differences for ace (ac.)
   ---
tic
 One-character differences for sic (si.)
sir
six
 One-character differences for ski (sk.)
sky
 One-character differences for sou (.ou)
you
 One-character differences for spa (sp.)
spy
 One-character differences for SSE (SS.)
SST
SSW
 One-character differences for thy (.hy)
why
 One-character differences for TVA (T.A)
TWA
 One-character differences for UHF (.HF)
VHF
 One-character differences for USC (US.)
USN
 One-character differences for wah (.ah)
yah
 One-character differences for wah (wa.)
was

Any word that has been matched in a set (say for example for abc: .bc, a.c, ab.) is removed from the list, then the next non-matched word is the basis for a new set of patterns, and the matching begins with the word after that in the list. Previously matched words are ignored.

Best wishes ... cheers, drl

drl

View Public Profile for drl

Find all posts by drl

UNIX for Dummies Questions & Answers

Comparing lines within a word list

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Comparing alternate lines of code

Discussion started by: cabled

2. Shell Programming and Scripting

Shell Script @ Find a key word and If the key word matches then replace next 7 lines only

Discussion started by: Rajeev_hbk

3. UNIX for Dummies Questions & Answers

Delete lines with a word and their above lines

Discussion started by: the_simpsons

4. UNIX for Dummies Questions & Answers

Comparing lines of data

Discussion started by: markymarkg123

5. Shell Programming and Scripting

Search the word to be deleted and delete lines above this word starting from P1 to P3

Discussion started by: vsachan

6. Shell Programming and Scripting

Comparing lines of two different files

Discussion started by: mira

7. Shell Programming and Scripting

comparing lines in file

Discussion started by: sigh2010

8. Shell Programming and Scripting

Word count of lines ending with certain word

Discussion started by: warlock129

9. Shell Programming and Scripting

comparing lines from 2 files

Discussion started by: vijaya2006

10. Shell Programming and Scripting

Comparing a distinct value in 1 list with another list

Discussion started by: manualvin