Comparing lines within a word list


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Comparing lines within a word list
# 8  
Old 09-10-2015
Quote:
Originally Posted by RudiC
I'm afraid you'll have to search for (e.g. grep) for every single combination of chars, for above it would be grep "A.E" or grep "A[[:alpha:]]E", to be more precise.
Exactly. This was what i meant by "trivial".

Quote:
Originally Posted by RudiC
Writing an awk script might cause less effort for the OS running just one command, but still it would need to open and close many many files...
I don't think so. In fact, after running grep "A.E" the job would be done. The reason is that all hits this generates would differ from each other by exactly one character whereas hits from other runs (say "AB.") will sometimes differ by one character ("ABE" and "ABX") but sometimes by two ("ACE" from the first run and "ABX" from the second), so the property of all hits differering by exactly one character within the group would be lost upon any other run.

I hope this helps.

bakunin
This User Gave Thanks to bakunin For This Post:
# 9  
Old 09-10-2015
Quote:
Originally Posted by bakunin
.
.
.
In fact, after running grep "A.E" the job would be done.
.
.
.
Yes - for that char combination/permutation. Doing so for the entire alphabet would require to run grep repeatedly, creating a new, different results file every time. An awk script might run once, but would have to open/reopen results files frequently, so the gain might be eaten up.
This User Gave Thanks to RudiC For This Post:
# 10  
Old 09-10-2015
Hi.

Off the top of my head (but before more than one cup of coffee):

1) I agree with RudiC. The OP asked for all sets.

2) I think I'd try applying the idea of a sieve as in https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes

So I'd sort the list, and, starting from the top, record matches, and erase from the original the matched items. The move to the next. Perhaps hashes would be faster than going through the list a number of times, but I haven't thought about that in detail.

I'd save the sets in filenames like, for example, for the string abc: _bc, a_b, and ab_

The list is presumably from some work, perhaps a dictionary as mentioned by the OP, so it is not as large as a complete set of permutations could be (26 items taken 3 at a time, some 15K if I did that correctly). I tried a few initial stabs using a common list of words, 25143 items total, of which 800 were 3 characters long.

So perhaps an awk/perl code to start, but might need a compiled code if that's too slow.

3) I wish that the OP had given more information about why this is needed.

Best wishes ... cheers, drl
This User Gave Thanks to drl For This Post:
# 11  
Old 09-10-2015
Thanks, RudiC. So it would like the indexing solution you proposed earlier? Would that happen in Perl or something else? Or is it not feasible anyway? I realize there's a lot of computing going on with this problem, but it strikes me as having a very simple solution in the right language.

---------- Post updated at 01:01 PM ---------- Previous update was at 12:53 PM ----------

Sorry all, I replied before I'd seen the thread had gone on to a second page. drl, I like the sieve idea and actually had thought something like that should be easy enough to implement. My problem is that I have no idea how to code it or which language even to use. Seemed to me there would be an easy way to make an ordered list of all permutations

AAA, AAB, AAC, AAD, . . . , AAY, AAZ; ABA, ABB, ABC, ABD, . . .

and ask which of these are real words by comparing it to a list of real words. As long as the matches are returned by order according to the all-permutations list, my results would already be grouped appropriately.
# 12  
Old 09-10-2015
Code:
spell <proposed.word.list >temp1
diff temp1 proposed.word.list

# 13  
Old 09-10-2015
OK, the following is a sketch, with some parallelization to make it faster. Some blanks are left but maybe someone else can fill them in:

First the real "worker" scipt: it gets three parameters, one of which is supposed to be a ".", the other two are characters. It will cycle all characters and combine the two characters with these, consult a a word list and output the results to <stdout>. A typical call is "worker a b .", which will output all three-letter-combinations "ab<something>", which are words.

The routine "IsWord" is a blank and will have to be provided. It gets a word and returns 0 (true) for a word, >0 (false) for "not a word":

Code:
#! /bin/ksh

typeset -Ll1 aChar[1]="$1"
typeset -Ll1 aChar[2]="$2"
typeset -Ll1 aChar[3]="$3"

if   [ "${aChar[1]}" = "." ] ; then
     for chWord in {a..z}${aChar[2]}${aChar[3]} ; do
          if IsWord "$chWord" ; then
               print - "$chWord"
          fi
     done
elif [ "${aChar[2]}" = "." ] ; then
     for chWord in ${aChar[1]}{a..z}${aChar[3]} ; do
          if IsWord "$chWord" ; then
               print - "$chWord"
          fi
     done
else
     for chWord in {a..z}${aChar[2]}${aChar[3]} ; do
          if IsWord "$chWord" ; then
               print - "$chWord"
          fi
     done
fi

return 0

The main script calls the worker script with various combinations of parameters. The worker script is called in background to parallelize it, degree of parallelization depends on your system and will have to be figured out:

Code:
#! /bin/ksh

typeset      fResultDir="/some/place/for/results"
typeset -Ll2 chWord=""
typeset -i   iFanOut=15            # depends on your system

for chWord in {a..z}{a..z} ; do
     while [ $(jobs | wc-l) -ge $iFanOut ] ; do
          sleep 3                          # change this to something sensible
     done
     worker ${chWord%%?} ${chWord##?} "."  > "${fResultDir}/${chWord%%?}.${chWord##?}.@.result" &
     worker ${chWord%%?} "." ${chWord##?}  > "$[fResultDir}/${chWord%%?}.@.${chWord##?}.result" &
     worker "." ${chWord%%?} ${chWord##?}  > "$[fResultDir}/@.${chWord%%?}.${chWord##?}.result" &
done

exit 0

Notice that this is a SKETCH! No efforts on error checking (like full filesystems or whatever) are made. Note also that i have used "@" as a stand-in for "." in the filenames, otherwise some files would be hidden.

I hope this helps.

bakunin

Last edited by bakunin; 09-10-2015 at 06:09 PM..
# 14  
Old 09-12-2015
Hi.

I wrote a perl code to try the sieve method. The input data (from a dictionary file), with non-alphabetics removed, one word per line were these 757 words:
Code:
AAA  ass  bus  den  etc  gap  hob  Jew  lop  Nan  Orr  PTA  run  spa  tor  wee
AAU  ate  but  Des  Eva  gar  hoc  jig  Los  nap  Ott  pub  rut  spy  tot  Wei
ABA  Aug  buy  dew  eve  gas  hoe  Jim  lot  Nat  our  PUC  rye  Sri  tow  wet
abc  auk  bye  dey  ewe  gay  hog  job  Lou  nay  out  pug  sac  SSE  toy  who
Abe  Ave  cab  did  eye  gee  hoi  Joe  low  NBC  ova  pun  sad  SST  TRW  why
Abo  awe  Cal  die  FAA  gel  Hom  jog  loy  NBS  owe  pup  sag  SSW  try  wig
ace  awl  cam  dig  fad  gem  hop  Jon  LSI  NCO  owl  pus  Sal  Stu  TTL  win
ACM  awn  can  dim  fag  get  hot  jot  Ltd  NCR  own  put  Sam  sub  TTY  wit
ACS  axe  cap  din  fan  gig  how  joy  LTV  Ned  pad  PVC  San  sud  tub  woe
act  aye  car  dip  far  Gil  hoy  jug  lug  nee  pal  QED  Sao  sue  tug  wok
Ada  bad  cat  Dis  fat  gin  hub  jut  lux  net  Pam  qua  sap  sum  tum  won
add  bag  caw  DNA  fay  GMT  hue  Kay  lye  new  pan  quo  sat  sun  tun  woo
ado  bah  CBS  DOD  FBI  GNP  hug  keg  Mac  nib  pap  Rae  saw  sup  TVA  wop
aft  bam  CDC  doe  FCC  gnu  huh  ken  mad  NIH  par  rag  sax  Sus  TWA  wow
age  ban  CEQ  dog  FDA  Goa  hum  key  Mae  nil  pat  raj  say  tab  two  wry
ago  bar  chi  don  Feb  gob  Hun  kid  man  nip  paw  ram  Sci  tad  TWX  yah
aid  bat  CIA  dot  fed  god  hut  Kim  Mao  nit  pax  ran  SCM  tag  ugh  yak
ail  bay  cit  Dow  fee  gog  Ian  kin  map  NNE  pay  rap  sea  tam  UHF  yam
aim  bed  cod  dry  few  GOP  IBM  kit  mar  NNW  Paz  rat  sec  tan  Uri  yap
air  bee  cog  dub  fib  got  Ibn  lab  mat  nob  PBS  raw  see  tao  urn  yaw
ala  beg  col  dud  fig  GPO  ICC  lac  maw  nod  PDP  ray  sen  tap  USA  yea
alb  bel  con  due  fin  GSA  ice  lad  max  non  pea  RCA  seq  tar  USC  yen
ale  Ben  coo  dug  fir  gum  icy  lag  Max  nor  pee  reb  set  tat  use  yet
Ali  bet  cop  dun  fit  gun  Ida  lam  may  not  peg  red  sew  tau  USN  yin
all  bey  cos  dye  fix  Gus  iii  Lao  MBA  Nov  pen  rep  sex  tax  van  yip
alp  bib  cot  ear  Flo  gut  Ike  lap  Meg  now  pep  ret  she  tea  vat  yon
AMA  bid  cow  eat  flu  guy  ill  law  Mel  NRC  per  rev  Shu  ted  vee  you
ami  big  cox  ebb  fly  gym  imp  lax  men  NSF  pet  Rex  shy  Ted  vet  yow
amp  bin  coy  EDT  FMC  gyp  Inc  lay  met  nun  pew  rho  sib  tee  vex  yuh
amy  bit  CPA  eel  fob  had  ink  lea  mew  nut  PhD  rib  sic  Tel  VHF  zag
Amy  biz  cpu  eft  foe  Hal  inn  led  mid  NYC  phi  rid  sin  ten  via  Zan
ana  BMW  CRT  egg  fog  ham  ion  lee  mig  NYU  pie  rig  sip  the  vie  zap
and  boa  cry  ego  fop  Han  Ira  leg  min  oaf  pig  rim  sir  thy  vii  Zen
ani  bob  cub  eke  for  hap  ire  Len  MIT  oak  pin  Rio  sis  tic  vis  zig
Ann  bog  cud  Eli  fox  hat  irk  Leo  mix  oar  pip  rip  sit  tid  viz  zip
ant  bon  cue  elk  FPC  haw  IRS  let  mob  oat  pit  RNA  six  tie  von  Zoe
any  boo  cup  ell  fro  hay  Ito  Lev  Moe  Oct  ply  rob  ski  til  vow  zoo
ape  bop  cur  elm  fry  hem  ITT  lew  moo  odd  pod  rod  sky  Tim  WAC
Apr  bow  cut  Ely  FTC  hen  ivy  Lew  mop  ode  Poe  roe  sly  tin  wad
APS  box  dab  end  fum  her  jab  lid  mot  off  poi  Ron  sob  tip  wag
apt  boy  dad  Eng  fun  hew  jag  lie  mow  oft  pol  rot  Soc  tit  wah
arc  BTL  dam  EPA  fur  hex  jam  lim  MPH  ohm  pop  row  sod  TNT  wan
are  BTU  Dan  era  gab  hey  Jan  Lin  Mrs  oil  pot  Roy  Sol  toe  war
ark  bub  Dar  ere  gad  hid  jar  lip  mud  old  pow  RPM  son  tog  was
arm  bud  day  erg  gag  him  jaw  lit  mug  one  ppm  rub  sop  Tom  wax
art  bug  Dec  err  gal  hip  jay  Liz  mum  opt  pro  rue  sou  ton  way
ash  bum  Dee  EST  gam  his  Jed  lob  nab  orb  pry  rug  sow  too  web
ask  bun  Del  eta  GAO  hit  jet  log  nag  ore  psi  rum  soy  top  wed

the output was 760 lines. Here are the first and last 25 lines:
Code:
Edges: 25:0:25 of 760 lines in file "t4"
 One-character differences for AAA (.AA)
FAA
 One-character differences for AAA (A.A)
ABA
AMA
 One-character differences for AAA (AA.)
AAU
 One-character differences for abc (a.c)
arc
 One-character differences for Abe (A.e)
Ave
 One-character differences for Abe (Ab.)
Abo
 One-character differences for ace (.ce)
ice
 One-character differences for ace (a.e)
age
ale
ape
are
ate
awe
axe
aye
 One-character differences for ace (ac.)
   ---
tic
 One-character differences for sic (si.)
sir
six
 One-character differences for ski (sk.)
sky
 One-character differences for sou (.ou)
you
 One-character differences for spa (sp.)
spy
 One-character differences for SSE (SS.)
SST
SSW
 One-character differences for thy (.hy)
why
 One-character differences for TVA (T.A)
TWA
 One-character differences for UHF (.HF)
VHF
 One-character differences for USC (US.)
USN
 One-character differences for wah (.ah)
yah
 One-character differences for wah (wa.)
was

Any word that has been matched in a set (say for example for abc: .bc, a.c, ab.) is removed from the list, then the next non-matched word is the basis for a new set of patterns, and the matching begins with the word after that in the list. Previously matched words are ignored.

Best wishes ... cheers, drl
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Comparing alternate lines of code

Hi gents, Have only a passing familiarity with linux/shell at this point, so please forgive simple question. I have text files that have lines something like the following: a b c d d d e f e f e f a b (6 Replies)
Discussion started by: cabled
6 Replies

2. Shell Programming and Scripting

Shell Script @ Find a key word and If the key word matches then replace next 7 lines only

Hi All, I have a XML file which is looks like as below. <<please see the attachment >> <?xml version="1.0" encoding="UTF-8"?> <esites> <esite> <name>XXX.com</name> <storeId>10001</storeId> <module> ... (4 Replies)
Discussion started by: Rajeev_hbk
4 Replies

3. UNIX for Dummies Questions & Answers

Delete lines with a word and their above lines

Hi, i have a file like this: A1 kdfjdljfdkljfdlf A2 lfjdlfkjddkjf A3 ***no hit*** A4 ldjfldjfdk A5 ***no hit*** A6 jldfjdlfjdlkfjd I want to remove the lines "***no hit*** and their above line to get an output file like this: (11 Replies)
Discussion started by: the_simpsons
11 Replies

4. UNIX for Dummies Questions & Answers

Comparing lines of data

Total UNIX Rookie, but I'm learning. I have columns of integer data separated by spaces, and I'm using a Mac terminal. What I want to do: 1. Compare "line 1 column 2" (x) to "line 2 column 2" (y); is y-x>=100? 2. If yes, display difference and y's line number 3. If no, increment x and y by... (9 Replies)
Discussion started by: markymarkg123
9 Replies

5. Shell Programming and Scripting

Search the word to be deleted and delete lines above this word starting from P1 to P3

Hi, I have to search a word in a text file and then I have to delete lines above from the word searched . For eg suppose the file is like this: Records P1 10,23423432 ,77:1 ,234:2 P2 10,9089004 ,77:1 ,234:2 ,87:123 ,9898:2 P3 456456 P1 :123,456456546 P2 abc:324234 (2 Replies)
Discussion started by: vsachan
2 Replies

6. Shell Programming and Scripting

Comparing lines of two different files

Hello, Please help me with this problem if you have a solution. I have two files: <file1> : In each line, first word is an Id and then other words that belong to this Id piMN-1 abc pqr xyz py12 niLM y12 FY4 pqs fiRLym F12 kite red <file2> : same as file1, but can have extra lds... (3 Replies)
Discussion started by: mira
3 Replies

7. Shell Programming and Scripting

comparing lines in file

i have 2 files and i want to compare i currently cat the files and awk print $1, $2 and doing if file1=file2 then fail, else exit 0 what i want to do is compare values, with column 1 being a reference i want to compare line by line and then still be able to do if then statement to see if worked... (1 Reply)
Discussion started by: sigh2010
1 Replies

8. Shell Programming and Scripting

Word count of lines ending with certain word

Hi all, I am trying to write a command that can help me count the number of lines in the /etc/passwd file ending in bash. I have read through other threads but am yet to find one indicating how to locate a specifc word at the end of a line. I know i will need to use the wc command but when i... (8 Replies)
Discussion started by: warlock129
8 Replies

9. Shell Programming and Scripting

comparing lines from 2 files

Hi Friends, I have 2 files A and B . I want to compare the 3rd line of file A and B . (I dont want to compare the 2 files, using diff or cmp). I just want to know whether 3rd line of A matches the 3 rd line of B. Can anybody share their knowledge on the same? Thanks , Vijaya (12 Replies)
Discussion started by: vijaya2006
12 Replies

10. Shell Programming and Scripting

Comparing a distinct value in 1 list with another list

Hi all, I need to compare the contents of 2 directories where the file contents are similar and take out the filenames whose contents does not exist within the 2 directories. Directory1 1 2 3 4 Directory2 54 55 56 57 Does anyone has a script which can do this? At the end of... (6 Replies)
Discussion started by: manualvin
6 Replies
Login or Register to Ask a Question