You will find this synopsis useful if you are working with one of these tools and do not remember what the right operator for your specific task is. Blanks in the table normally indicate the an operation for a given tool must be defined by more complex means. (This table is partly derived from a similar table by Jason Eisner.)
XFST | FSA Utilities | AT&T FSM lextools | Explanation |
Manual | Manual | FSM
Manual lextools Manual | Online Manuals |
xfst Web-based interface | fsa -tk | n/a | Interactive startup |
? | ? | [<sigma>] (if defined) | any alphabet symbol |
a..z (or predicates) | superclass defn in .sym file (lexmakelab compiles to .scl file for -S option) | symbol class | |
"*" (or %*) | '*' (or escape(*)) | \* | escaped literal symbol |
foo bar (whitespace separator) | foo<regex
op>bar (separated by regex op) | [foo][bar] | long symbol names |
(predicates) | [noun num=pl gender=fem | complex symbol | |
abc (Ambiguous: see Explanation below | abc (no symbol
separator) a b c (symbol separator=32) a+b+c (symbol separator=43) | (no notation for input string use fsmintersect or fsmcompose) | the input string "abc" |
a,bc (symbol separator=44) | the input | ||
E1 E2 ... En | [E1,E2,...,En], concat(E1,E2) | E1 E2 (fsmconcat) | concatenation |
{abcd} | [a,b,c,d] (no special notation) | abcd | character concat |
A* | A*, kleene_star(A) | A* | Kleene star |
A+ | A+, kleene_plus(A) | A+ | Kleene plus |
(A) | A^, option(A), {A,[]} | A? | option |
E1 | E2 | ... | En | {E1,E2,...,En}, union(E1,E2) | E1 | E2 (fsmunion) | union |
A & B | A & B, intersect[ion](A,B) | A & B (fsmintersect) | intersection |
~A | ~A, complement(A) | (fsmdifference [<sigma>]* A) | complement (negation) |
\a | ~a & ?, ? - a (expressed as `{a} in graphic interface) | !a (fsmdifference [<sigma>] a) | symbol complement |
A - B | A - B, difference(A,B) | E1 - E2 (fsmdifference) | relative complement |
~$[], \? | {} | (echo -n | fsmcompile) | the empty language |
0, [ ] | [ ] | 0 [<epsilon>] (if defined) (echo -n 0 | fsmcompile) | epsilon (the empty string) |
A .x. B, A:B | A x B, A:B | A:B | cross product |
A .o. B | A o B, compose(A,B) | E1 @ E2 (fsmcompose) | (reverse) composition |
R.1 | domain(R) | (fsmproject -i) | domain |
R.2 | range(R) | (fsmproject -o) | range |
A | A, identity(A) | A | identity transducer |
efree(A) | fsmrmepsilon | epsilon-removal | |
A! {t,w,wt}_determinize(A) | fsmdeterminize | determinization | |
A.i | invert(A) | fsminvert | invert (switch direction of transduction) |
$A | $A, [? *,A,? *] | contains | |
A/B | ignore(A,B) | ignore | |
A.r | reverse(A) | fsmreverse | reverse |
.#. | FSA Utilities Anchors | [<bos>][<eos>] | edge of string anchors |
A => L1 _ R1 A => L1 _ R1, L2 _ R2 | Restriction in FSA Utilities | restriction | |
A -> (exhaustive nondeterm.) A (->) B (optional) A @-> B (LR longest) A @> B (LR shortest) A ->@ B (RL longest) A >@ B (RL shortest) Reversse arrow swaps lower, upper A may have the form [.E.] (blocks repeat matches of epsilon in same place) B may have form L ... R (inserts L, R around A) || L _ R (upper upper) // L _ R (lower upper) \\ L _ R (upper lower) \/ L _ R (lower lower) Parallel replacement joined by , or by ,, if restricted | Replacement in FSA Utilites | lexrulecomp | replacement |
test equivalent/overlap/ sublangage/lower-bounded/ upper-universal/etc | Decidability Properties in FSA Utilities | fsmequiv, fsmprops Some complex expressions in fsm | decidable properties: equality, functionality determinizability, etc |
lexc | lexcomplex | Compilation of acyclic
automata from word lists: Comments on lexical compilation |
Examples: "[a,p,p,l,e,s^]" recognizes the strings "apple" and "apples".
Examples: "[? *,[<,'/'^,i,>]:[ ],? *]*" removes the "italics" tags from a string by transducing these sequences into the empty string.
Examples: "[a*,b*]" constructs a recognizer which accepts a, b, ab, aaaa, bbb, aabb but not aba.
Examples: "[a+,b+]" constructs a recognizer which accepts ab, aaaab, aabbb, aabb but not aba, a, b, aaa.
The operator corresponds to the question mark in common regular expression languages. A recognizer accepts zero or one occurrence of the symbol(s) which is / are in the scope of the operator.
Examples:
"[a,a^]" constructs a recognizer which accepts a and aa.
"[a, ([a,b])^]" constructs a recognizer which accepts a and aab.
Examples: "[g,e,? *,t] & $[{a, e, i, o, u},{a, e, i, o, u}]" recognizes regular past participle forms which contain at least one diphtong (gefeit, gebaut)
The complement of a language L is the set of all strings not in L. Do not confuse this with the common regular expression operator: [^...], which stands for any single symbol not in "...".
Examples: "[v,~c,v]" constructs a recognizer which accepts vv, vaaaav, vccv but not vcv
Examples: "[ziffer-{'0','1','2','3','4'},ziffer,ziffer] constructs a recognizer which accepts numbers between 500 and 999. Note that "ziffer" is a macro which expands to the union of zero to nine.
Examples: "[v,(~c & ?),v]" constructs a recognizer wich accepts vav, but not vv, vaaaav or vcv.
Examples: [h,'E' x{ä,ü},l,f] maps the input stem "hElf" (the base stem of the German verb "helfen") onto the two alternative past subjunctive stems hälf and hülf
Examples: domain([a,b:c,d]) ist abd, range([a,b:c,d]) ist acd.
Example: Assume that you work with a syllable recognizer, the alphabet of which is defined as the set {c,v} (consonant and vowel). A syllable is defined as a string which contains at least one vowel. For a sequence containing at least one syllable, the expression is "{v,c}+ & $v".
Examples: "ignore({'0','1','2','3','4','5','6','7','8','9'}+,{',' , '.' , ' ' , '-' , '/' , ':'})" produces a recognizer which accepts sequences of digits whether they contain commas, full stops, colon or white space or not. You see that this is a simple but powerful recognizer for date and time formats as well as for telephone numbers in different formats.
[[] x <begin anchor>, ? *, [] x <end anchor>] % add some anchors
o
<some recognizer or transducer>
o
[<begin anchor> x [], ? *, <end anchor> x []] % remove them
This rather awkward approach is easily simplified with suitable macros.
% Some macros from Kaplan & Kay. macro(if_p_then_s(L1,L2), % If prefix then suffix ~ [L1, ~L2]). macro(if_s_then_p(L1,L2), % If suffix then prefix ~ [~ L1,L2]. macro(p_iff_s(L1,L2), % prefix iff suffix if_p_then_s(L1,L2) & if_s_then_p(L1,L2)). macro(restrict(A,L,R), if_p_then_s([? *, A],R) & if_s_then_p(L,[A,? *])). macro(restrict(A,L1,R1,L2,R2), if_p_then_s([? *,A],{R1,R2}) & if_s_then_p(L1,[A,R1-R2]) & if_s_then_p(L2,[A,R2-R1]) & if_s_then_p({L1,L2},[A,R1&R2])).
The FSA Utilities does provide a a different sort of replacement operator with limited backreferences as described in Gerdemann & van Noord (1999)Transducers from Rewrite Rules with Backreferences. This operator may be found in the examples directory: .../Examples/GerdemannVannoord99/replace.pl.
The input "abc" to XFST is ambiguous. If multicharacter symbols are not used in the regex, then the string must be interpreted as the three symbol sequence: a b c. If multicharacter symbols are used, then the string is processed from left to right, using a longest-match strategy to find each symbol. For example, if the regex is:
[a:x b:y c:z]
Then the input "abc" is transduced to "xyz" (using "apply downward"). If however, the regex is:
[c:x ab:z] | [a:x b:y c:z]
Then the input "abc" (interpreted as: ab c) is no longer accepted.
Left and right contextual restrictions in XFST may be checked either on the input side (upper) or on the output side (lower). Kaplan & Kay showed that || (upper upper) corresponds simultaneous application, // (lower upper) corresponds to right-to-left application and \\ (upper lower) corresponds to right-to-left application. The fourth possibility \/ (lower lower) corresponds to no intuitively obvious mode of application. XFST provides no way to check restrictions partly on the input side and partly on the output side. Such multi-level contexts are possible in Kimmo-style two-level rule compilers, although only for length-preserving (or ``same-length'') transductions.
FSM provides only the predicate: fsmequiv. It is a useful exercise, however, to define some predicates on top of fsm. Defining a few simple predicates may serve as a model for how more complex operations may be defined using fsm.
First, let {epsilon} be interpreted as "true" and {} be interpreted as false:
echo -n 0 | fsmcompile > true.fsa
echo -n | fsmcompile > false.fsa
More generally, one could perhaps allow any language that accepts epsilon to be interpreted as "true".
Now let conjunction, disjunction and negation be defined as concatenation, union and complementation, respectively. With this interpretation, it is possible to design small programs using fsm that implement these operators. For example, using C++, one can implement negation with the following program:
// not.cc: note that a pipe is used so that the argument // can occur either on the command line or come from // standard input #include <iostream> #include <string> #include <pfstream.h> // g++ library extension for pipes int main(int argc, char* argv[]) { if (argc == 2) { string command = "cat "; command += argv[1]; command += " | fsmrmepsilon | fsmdeterminize"; command += " | fsmdifference true.fsa -"; system(command.c_str()); } else if (argc == 1) { opfstream pipe("|fsmrmepsilon|fsmdeterminize|fsmdifference true.fsa -"); char c; while (cin.get(c)) pipe.put(c); } }
The program is a little complicated since complementation is not a basic operation in fsm. One has to define the complementation of A in terms of sigma*-A. If we allow the alphabet to be empty, then sigma* is the same as our automaton: true.fsa. A further complication arises since the second argument to fsmdifference is required to be epsion free and deterministic.
Once negation is defined, we can use this to define the predicate empty:
// empty.cc: for simplicity, assume only a command-line // argument #include <iostream> #include <string> int main(int argc, char* argv[]) { string command = "lexcompre -S symbols.scl -l symbols.lab -s "; command += "'[<sigma>]*:[<epsilon>]'"; command += " | fsmcompose "; command += argv[1]; command += " - | fsmproject -o | not"; system(command.c_str()); }
Note that the central idea here is contained in the transducer turning sigma* into epsilon (= true.fsa). The only input that is not turned into epsilon is the empty language.
Once the predicate empty is defined, this can be used as the basis for defining other predicates such as totality, equality and subset.
Although general purpose finite-state tools may be used to construct a lexicon, it is generally preferable (easier or more efficient) to use a specialized tool. Both Xerox and AT&T provide such tools. Another set of tools for this purpose is the fsa package developed by Jan Daciuk. This package includes tools for lexicon construction, along with specialized tools for morphological analysis, spell checking, diacritic restoration, analysis of unknown words and perfect hashing.
Intersection and relative complemnt (subtraction) are generally defined only for languages, not relations. XFST allows these operations also for the special case of equal-length relations.
A question mark (?) in an XFST regular expression stands for any symbol. In an XFST network, however, the question mark stands for any unknown symbol. This distiction was also made in previous versions of the FSA Utilities. In the current version, the question mark always indicates any symbol (known or unknown).