Addi: My desire, regarding the dev/lex thing, is to obtain a set of word partitions with a "familiarity" measure based on the "confines" of a simple lexicon holding common morphemes. It's another thing that has been put on the "no-one's paying me for this" shelf a few times...
Intuitively, a person familiar with Latin and Greek morphemes, as used in English, can decompose a word into "meaningful parts" - adjacent combining forms, but how is this done computationally?
Given words such as:
automobile paleolithic deoxyribonucleic
Possible decompositions might be:
auto . mobile                  self + capable of being moved
paleo . lith . ic              old, ancient + stone + characteristic of
de . oxy . ribo . nucle . ic   from, away + compound containing oxygen + sugar + a central mass or kernel + characteristic of
Each of these parts has its own little meaning, which, when combined, can loosely describe the meaning of the whole word. In a rigorous and narrow context like medical terminology, a word like encardiothoracichypoencephalotropy might not have to be so frightening. While these "literal" morpheme definitions are not perfect, they make a sort-of "definition" of the entire word meaning. I am curious about this specialized, lower level of symbolic linguistics.
Anyway, I decided a while ago to make a "dictionary of word parts" for myself. So I bought a HUGE Webster's and went to work compiling a lexicon - what I called a "morpheme-ary" at the time.
Naturally, I naively decided to make an automatic, interactive dictionary of word parts that would be a tight little database of stems and their definitions, with a single, happy little function to spit out the "analytical definition" of a science term. Combined with the fantasy of "synthetically building a word", this became the "Autolex".
The first task was to break a word into stems.
for my $i (0 .. length( $word ) - 1) {
    for my $j (1 .. length( $word ) - $i) {
        push @stems, substr $word, $i, $j;
    }
}
This algorithm produces all possible word parts. For "abcde" this is:
a, ab, abc, abcd, abcde, b, bc, bcd, bcde, c, cd, cde, d, de, e
If n is equal to the length of the word, then the number of substrings is the sum of n - i for i = 0 .. n - 1, which is n(n + 1)/2.

Which, in Perl, would be:
% perl -wle '$n = $ARGV[0]; $sum += $n - $_ for 0 .. $n - 1; print $sum' 5
15
What we need to do now is reconstruct the original word in all possible ways with these parts. The number of recombinations is exactly 2^(n-1), since each of the n - 1 gaps between adjacent letters either is or is not a break. Finding these is tricky and took me nearly forever to even come up with the right question. There is basically one main problem: find all, possibly overlapping, stems and their adjacent substrings and put them in the correct (original) order.
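A quick sanity check of that count, in the same one-liner spirit:

% perl -wle 'print 2 ** ($ARGV[0] - 1)' 5
16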
I'll spare you the painful details, but I tried a number of different avenues. With iterative algorithms, I ended up wallowing in a crazed entanglement of counters and indexes. Defining the loop boundary conditions was another joy-ride...
Eventually I struck upon using a simple recursive solution that just builds an individual stem-recombination by list splicing.
Here are the guts of the algorithm:
# Factor the word into all possible stems, as an array reference
# called $stems, then... ($stems->[$i] holds the stems that start at
# position $i, and $len is the length of the original word.)
my @parsed = ();
my @new    = ();
my $prev   = 0;

successors( $prev );
# Print out the recombinations and exit.

sub successors {
    my $i = shift;

    for ( @{ $stems->[$i] } ) {
        # Find the end-position of the stem.
        my $n = $i + length;

        ## Ugly mystery-hack:
        # Yank off the last two stems found, if we are at an "overlap point".
        splice @new, -2 if $prev > $i;
        $prev = $i;

        splice @new, $i, $n, $_;

        push @parsed, join( '.', @new ) if $n == $len;

        successors( $n );
    }
}
OK. So the splice calls just bug the crap out of me! I hope to find a more elegant answer...
Here is the result of parsing the string "abcde" with this "brute-force" technique:
 1) a.b.c.d.e    2) a.b.c.de     3) a.b.cd.e     4) a.b.cde
 5) a.bc.d.e     6) a.bc.de      7) a.bcd.e      8) a.bcde
 9) ab.c.d.e    10) ab.c.de     11) ab.cd.e     12) ab.cde
13) abc.d.e     14) abc.de      15) abcd.e      16) abcde
After finding this solution one night, at my favorite all-night breakfast joint, I immediately decided it was clunky and imperfect. ;-)
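For comparison, here is a bare-bones sketch of one splice-free way to do the same brute-force enumeration (toy code, not the Autolex guts): instead of patching a shared @new array, each call carries along the parts found so far plus the still-unsplit tail of the word.

use strict;
use warnings;

my $word = shift || 'abcde';
my @parsed;

recombine( $word, [] );
print "$_\n" for @parsed;

# Peel every possible leading stem off the unsplit tail and recurse
# on whatever is left, carrying the parts found so far.
sub recombine {
    my ( $tail, $so_far ) = @_;

    # Nothing left to split: one complete recombination.
    if ( $tail eq '' ) {
        push @parsed, join '.', @$so_far;
        return;
    }

    for my $n ( 1 .. length $tail ) {
        my $stem = substr $tail, 0, $n;
        recombine( substr( $tail, $n ), [ @$so_far, $stem ] );
    }
}

For "abcde" this prints the same sixteen recombinations.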
This method is just fine for words of 20 letters or less, but as you get into longer terminology (such as is common in medicine or chemistry) you wait an exponential wait. For a 36-letter word on my 64MB, 300MHz box it's like 69 billion hours!
Naturally, I have begun a search for ways to minimize this timeframe. The whole point is to get an idea of what words like "deseptokaraspiduflemationistic" mean! I believe Porter's stemming algorithm can be used to strip off well-known affixes, so that a brute-force parse will not be so painful, but...
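Just to illustrate the strip-off-the-affixes idea (this is not Porter's algorithm, and the suffix list below is invented for the example), even something this crude shaves characters off the brute-force search:

use strict;
use warnings;

# An invented handful of suffixes; a real table would be much bigger
# and would cover prefixes too.
my @suffixes = qw( istic ation ology ic ous );

my $word = shift || 'deseptokaraspiduflemationistic';
my @stripped;

# Keep peeling a matching suffix off the end until none fit.
my $changed = 1;
while ( $changed ) {
    $changed = 0;
    for my $s ( sort { length $b <=> length $a } @suffixes ) {
        if ( $word =~ s/\Q$s\E$// ) {
            unshift @stripped, $s;
            $changed = 1;
            last;
        }
    }
}

print "core to brute-force: $word\n";
print "stripped suffixes:   @stripped\n";

Only the shorter core would then go to the expensive all-recombinations step.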
I started out trying to parse a word by iterating over the lexicon and matching entries, rather than brute-force finding all possible recombinations. I have since brainstormed with a Lisp programmer friend and found a very similar algorithm.
In words this is:
1) Find the first known stem.
2) Make an entry into the list of found word-parts.
3) If we are at a boundary, do something important - like push a completed recombination onto the growing list.
4) Return a list of the "previous", "known fragment", and "following" chunks.
This method is limited in time by the number of entries in the lexicon, rather than the potentially astronomical "all-possible-recombinations".
Here is my latest attempt at this technique:
my $word = $ARGV[0] || die "Usage: stem string\n";

my $lex;
@{ $lex }{ qw( a ab abc bc c ) } = ();

# Fragments we have already factored, so we don't repeat them.
my $seen = {};

stem( $word );
exit;

sub stem {
    my $string = shift;

    for ( sort keys %$lex ) {
        # Look for this lexicon entry anywhere in the string.
        my $i = index $string, $_;
        next if $i < 0;

        my $pre  = substr $string, 0, $i;
        my $post = substr $string, $i + length( $_ ),
                          length( $string ) - length( $_ ) - $i;

        print "$_ ($i): ", join( '.', $pre, $_, $post ), "\n";

        # Factor the left-over chunks, once each.
        stem( $pre )  if $pre  and not $seen->{$pre}++;
        stem( $post ) if $post and not $seen->{$post}++;
    }
}
...But this is incomplete. It does not build recombinations; it just factors a string by iterating over the lexicon.
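One way it could be made to build recombinations (a sketch only, with the same toy lexicon; the single-letter fallback is my own kludge to keep the recursion from getting stuck): at each position, try every lexicon entry the remaining string starts with, recurse on the rest, and treat a lone unmatched letter as an "unknown" part.

use strict;
use warnings;

my %lex = map { $_ => 1 } qw( a ab abc bc c );

my $word = shift || 'abcde';
my @parsed;

segment( $word, [] );
print "$_\n" for @parsed;

sub segment {
    my ( $tail, $so_far ) = @_;

    if ( $tail eq '' ) {
        push @parsed, join '.', @$so_far;
        return;
    }

    # Try every lexicon entry the remaining string starts with.
    for my $stem ( sort keys %lex ) {
        next unless index( $tail, $stem ) == 0;
        segment( substr( $tail, length $stem ), [ @$so_far, $stem ] );
    }

    # Fall back to a single "unknown" letter, marked with parentheses,
    # unless that letter is itself a lexicon entry (handled above).
    my $letter = substr $tail, 0, 1;
    segment( substr( $tail, 1 ), [ @$so_far, "($letter)" ] )
        unless $lex{$letter};
}

For "abcde" and that lexicon it prints a.bc.(d).(e), a.(b).c.(d).(e), ab.c.(d).(e), and abc.(d).(e) - and with a small lexicon it explores far fewer candidates than the 2^(n-1) brute-force parse.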
Of course this is only the beginning. What I really want to do is score each possible recombination as to its "meaning familiarity" - a ratio of known to unknown, maybe. This will yield a weighted list of most familiar/plausible word recombinations from which to choose.
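As a rough sketch of that scoring (toy lexicon and toy recombinations, and a bare known-to-total ratio rather than anything clever):

use strict;
use warnings;

# A toy lexicon and a few candidate recombinations to rank.
my %lex = map { $_ => 1 } qw( auto mobile paleo lith ic );
my @parsed = qw( auto.mobile au.tomo.bile paleo.lith.ic pale.olith.ic );

# Score each recombination by the ratio of known parts to total parts.
my %score;
for my $combo ( @parsed ) {
    my @parts = split /\./, $combo;
    my $known = grep { $lex{$_} } @parts;
    $score{$combo} = $known / @parts;
}

# Print the weighted list, most familiar first.
for my $combo ( sort { $score{$b} <=> $score{$a} } keys %score ) {
    printf "%.2f  %s\n", $score{$combo}, $combo;
}

Here "auto.mobile" and "paleo.lith.ic" come out at 1.00, "pale.olith.ic" at 0.33, and "au.tomo.bile" at 0.00.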
Of course this is not the end either. What I would also like to be able to do is feed in a list of words and have the program (through a thesaurus-type technique) determine all the possible scientific terms that are individually described by the original list. That is, I'd like to submit a definition and get back a list of possible words.
-- Gene Boggs <gene@ology.net> - 00/06/19
Update! - 00/10/20

Following my Lightning Talk about this at YAPC 19100, a few discussions unfolded - mostly involving me trying to describe the problem to friends on #perl. The last one ended up with dngor saying, "Check this out.", which turned out to be exactly what I was thinking. YAY! :-)

- gb