Addi: My desire, regarding the dev/lex thing, is to obtain a set of word partitions with a "familiarity" measure based on the "confines" of a simple lexicon holding common morphemes. It's another thing that has been put on the "no-one's paying me for this" shelf a few times...
Intuitively, a person familiar with Latin and Greek morphemes, as used in English, can decompose a word into "meaningful parts" - adjacent combining forms, but how is this done computationally?
Given words such as:
automobile paleolithic deoxyribonucleic
Possible decompositions might be:
auto . mobile                  self + capable of being moved
paleo . lith . ic              old, ancient + stone + characteristic of
de . oxy . ribo . nucle . ic   from, away + compound containing oxygen + sugar + a central mass or kernel + characteristic of
Each of these parts has its own little meaning, which, when combined, can loosely describe the meaning of the whole word. In a rigorous and narrow context like medical terminology, a word like encardiothoracichypoencephalotropy might not have to be so frightening. While these "literal" morpheme definitions are not perfect, they make a sort-of "definition" of the entire word meaning. I am curious about this specialized, lower level of symbolic linguistics.
Anyway, I decided a while ago to make a "dictionary of word parts" for myself. So I bought a HUGE Webster's and went to work compiling a lexicon - what I called a "morpheme-ary" at the time.
Naturally, I naively decided to make an automatic, interactive dictionary of word parts that would be a tight little database of stems and their definitions, with a single, happy little function to spit out the "analytical definition" of a science term. Combined with the fantasy of "synthetically building a word", this became the "Autolex".
The first task was to break a word into stems.
for my $i (0 .. length( $word ) - 1) {
    for my $j (1 .. length( $word ) - $i) {
        push @stems, substr $word, $i, $j;
    }
}
This algorithm produces all possible word parts. For "abcde" this is:
a, ab, abc, abcd, abcde, b, bc, bcd, bcde, c, cd, cde, d, de, e
If n is equal to the length of the word, then the number of substrings is the sum of n - i for i = 0 .. n - 1, which is n(n + 1)/2.

Which, in Perl, would be:
% perl -wle '$n = $ARGV[0]; $sum += $n - $_ for 0 .. $n - 1; print $sum' 5
15
What we need to do now is reconstruct the original word in all possible ways with these parts. The number of recombinations is exactly 2^(n-1), since each of the n - 1 gaps between adjacent letters either is or is not a break. Finding these is tricky and took me nearly forever to even come up with the right question. There is basically one main problem: find all, possibly overlapping, stems and their adjacent substrings and put them in the correct (original) order.
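A quick sanity check of that count, in the same one-liner spirit:

% perl -wle 'print 2 ** ($ARGV[0] - 1)' 5
16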
I'll spare you the painful details, but I tried a number of different avenues. With iterative algorithms, I ended up wallowing in a crazed entanglement of counters and indexes. Defining the loop boundary conditions was another joy-ride...
Eventually I struck upon using a simple recursive solution that just builds an individual stem-recombination by list splicing.
Here are the guts of the algorithm:
# Factor the word into all possible stems, as an array reference
# called $stems, then... ($stems->[$i] holds the stems that start at
# position $i, and $len is the length of the original word.)
my @parsed = ();
my @new    = ();
my $prev   = 0;

successors( $prev );
# Print out the recombinations and exit.

sub successors {
    my $i = shift;

    for ( @{ $stems->[$i] } ) {
        # Find the end-position of the stem.
        my $n = $i + length;

        ## Ugly mystery-hack:
        # Yank off the last two stems found, if we are at an "overlap point".
        splice @new, -2 if $prev > $i;
        $prev = $i;

        splice @new, $i, $n, $_;

        push @parsed, join( '.', @new ) if $n == $len;

        successors( $n );
    }
}
OK. So the splice calls just bug the crap out of me! I hope to find a more elegant answer...
Here is the result of parsing the string "abcde" with this "brute-force" technique:
 1) a.b.c.d.e    2) a.b.c.de     3) a.b.cd.e     4) a.b.cde
 5) a.bc.d.e     6) a.bc.de      7) a.bcd.e      8) a.bcde
 9) ab.c.d.e    10) ab.c.de     11) ab.cd.e     12) ab.cde
13) abc.d.e     14) abc.de      15) abcd.e      16) abcde
After finding this solution one night, at my favorite all-night breakfast joint, I immediately decided it was clunky and imperfect. ;-)
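For comparison, here is a bare-bones sketch of one splice-free way to do the same brute-force enumeration (toy code, not the Autolex guts): instead of patching a shared @new array, each call carries along the parts found so far plus the still-unsplit tail of the word.

use strict;
use warnings;

my $word = shift || 'abcde';
my @parsed;

recombine( $word, [] );
print "$_\n" for @parsed;

# Peel every possible leading stem off the unsplit tail and recurse
# on whatever is left, carrying the parts found so far.
sub recombine {
    my ( $tail, $so_far ) = @_;

    # Nothing left to split: one complete recombination.
    if ( $tail eq '' ) {
        push @parsed, join '.', @$so_far;
        return;
    }

    for my $n ( 1 .. length $tail ) {
        my $stem = substr $tail, 0, $n;
        recombine( substr( $tail, $n ), [ @$so_far, $stem ] );
    }
}

For "abcde" this prints the same sixteen recombinations.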
This method is just fine for words of 20 letters or less, but as you get into longer terminology (such as is common in medicine or chemistry) you wait an exponential wait. For a 36-letter word on my 64MB, 300MHz box it's like 69 billion hours!
Naturally, I have begun a search for ways to minimize this timeframe. The whole point is to get an idea of what words like "deseptokaraspiduflemationistic" mean! I believe Porter's stemming algorithm can be used to strip off well-known affixes, so that a brute-force parse will not be so painful, but...
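Just to illustrate the strip-off-the-affixes idea (this is not Porter's algorithm, and the suffix list below is invented for the example), even something this crude shaves characters off the brute-force search:

use strict;
use warnings;

# An invented handful of suffixes; a real table would be much bigger
# and would cover prefixes too.
my @suffixes = qw( istic ation ology ic ous );

my $word = shift || 'deseptokaraspiduflemationistic';
my @stripped;

# Keep peeling a matching suffix off the end until none fit.
my $changed = 1;
while ( $changed ) {
    $changed = 0;
    for my $s ( sort { length $b <=> length $a } @suffixes ) {
        if ( $word =~ s/\Q$s\E$// ) {
            unshift @stripped, $s;
            $changed = 1;
            last;
        }
    }
}

print "core to brute-force: $word\n";
print "stripped suffixes:   @stripped\n";

Only the shorter core would then go to the expensive all-recombinations step.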
I started out trying to parse a word by iterating over the lexicon and matching entries, rather than brute-force finding all possible recombinations. I have since brainstormed with a Lisp programmer friend and found a very similar algorithm.
In words this is:
1) Find the first known stem.
2) Make an entry into the list of found word-parts.
3) If we are at a boundary, do something important - like push a completed recombination onto the growing list.
4) Return a list of the "previous", "known fragment", and "following" chunks.
This method is limited in time by the number of entries in the lexicon, rather than the potentially astronomical "all-possible-recombinations".
Here is my latest attempt at this technique:
my $word = $ARGV[0] || die "Usage: stem string\n";

my $lex;
@{ $lex }{ qw( a ab abc bc c ) } = ();

# Fragments we have already factored, so we don't repeat them.
my $seen = {};

stem( $word );
exit;

sub stem {
    my $string = shift;

    for ( sort keys %$lex ) {
        # Look for this lexicon entry anywhere in the string.
        my $i = index $string, $_;
        next if $i < 0;

        my $pre  = substr $string, 0, $i;
        my $post = substr $string, $i + length( $_ ),
                          length( $string ) - length( $_ ) - $i;

        print "$_ ($i): ", join( '.', $pre, $_, $post ), "\n";

        # Factor the left-over chunks, once each.
        stem( $pre )  if $pre  and not $seen->{$pre}++;
        stem( $post ) if $post and not $seen->{$post}++;
    }
}
...But this is incomplete. It does not build recombinations; it just factors a string by iterating over the lexicon.
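One way it could be made to build recombinations (a sketch only, with the same toy lexicon; the single-letter fallback is my own kludge to keep the recursion from getting stuck): at each position, try every lexicon entry the remaining string starts with, recurse on the rest, and treat a lone unmatched letter as an "unknown" part.

use strict;
use warnings;

my %lex = map { $_ => 1 } qw( a ab abc bc c );

my $word = shift || 'abcde';
my @parsed;

segment( $word, [] );
print "$_\n" for @parsed;

sub segment {
    my ( $tail, $so_far ) = @_;

    if ( $tail eq '' ) {
        push @parsed, join '.', @$so_far;
        return;
    }

    # Try every lexicon entry the remaining string starts with.
    for my $stem ( sort keys %lex ) {
        next unless index( $tail, $stem ) == 0;
        segment( substr( $tail, length $stem ), [ @$so_far, $stem ] );
    }

    # Fall back to a single "unknown" letter, marked with parentheses,
    # unless that letter is itself a lexicon entry (handled above).
    my $letter = substr $tail, 0, 1;
    segment( substr( $tail, 1 ), [ @$so_far, "($letter)" ] )
        unless $lex{$letter};
}

For "abcde" and that lexicon it prints a.bc.(d).(e), a.(b).c.(d).(e), ab.c.(d).(e), and abc.(d).(e) - and with a small lexicon it explores far fewer candidates than the 2^(n-1) brute-force parse.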
Of course this is only the beginning. What I really want to do is score each possible recombination as to its "meaning familiarity" - a ratio of known to unknown, maybe. This will yield a weighted list of most familiar/plausible word recombinations from which to choose.
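As a rough sketch of that scoring (toy lexicon and toy recombinations, and a bare known-to-total ratio rather than anything clever):

use strict;
use warnings;

# A toy lexicon and a few candidate recombinations to rank.
my %lex = map { $_ => 1 } qw( auto mobile paleo lith ic );
my @parsed = qw( auto.mobile au.tomo.bile paleo.lith.ic pale.olith.ic );

# Score each recombination by the ratio of known parts to total parts.
my %score;
for my $combo ( @parsed ) {
    my @parts = split /\./, $combo;
    my $known = grep { $lex{$_} } @parts;
    $score{$combo} = $known / @parts;
}

# Print the weighted list, most familiar first.
for my $combo ( sort { $score{$b} <=> $score{$a} } keys %score ) {
    printf "%.2f  %s\n", $score{$combo}, $combo;
}

Here "auto.mobile" and "paleo.lith.ic" come out at 1.00, "pale.olith.ic" at 0.33, and "au.tomo.bile" at 0.00.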
Of course this is not the end either. What I would also like to be able to do is feed in a list of words and have the program (through a thesaurus-type technique) determine all the possible scientific terms that are individually described by the original list. That is, I'd like to submit a definition and get back a list of possible words.
-- Gene Boggs <gene@ology.net> - 00/06/19
Update! - 00/10/20

Following my Lightning Talk about this at YAPC 19100, a few discussions unfolded - mostly involving me trying to describe the problem to friends on #perl. The last one ended up with dngor saying, "Check this out.", which turned out to be exactly what I was thinking. YAY! :-)

- gb