Operators in XFST, FSA Utilities and FSM -- A synopsis

You will find this synopsis useful if you are working with one of these tools and do not remember what the right operator for your specific task is. Blanks in the table normally indicate the an operation for a given tool must be defined by more complex means. (This table is partly derived from a similar table by Jason Eisner.)

XFST FSA Utilities AT&T FSM
lextools Explanation
Manual Manual FSM Manual
lextools Manual Online Manuals
xfst
Web-based interface fsa -tk n/a Interactive startup
? ? [<sigma>] (if defined) any alphabet symbol
a..z
(or predicates) superclass defn
in .sym file
(lexmakelab compiles
to .scl file for -S option) symbol class
"*" (or %*) '*' (or escape(*)) \* escaped literal symbol
foo bar
(whitespace separator) foo<regex op>bar
(separated by regex op) [foo][bar] long symbol names
(predicates) [noun num=pl
gender=fem complex symbol
abc
(Ambiguous: see
Explanation below abc (no symbol separator)
a  b  c (symbol separator=32)
a+b+c (symbol separator=43) (no notation for input string
use fsmintersect or fsmcompose) the input string "abc"
a,bc (symbol separator=44) the input
E1 E2 ... En [E1,E2,...,En], concat(E1,E2) E1 E2 (fsmconcat) concatenation
{abcd} [a,b,c,d] (no special notation) abcd character concat
A* A*, kleene_star(A) A* Kleene star
A+ A+, kleene_plus(A) A+ Kleene plus
(A) A^, option(A), {A,[]} A? option
E1 | E2 | ... | En {E1,E2,...,En}, union(E1,E2) E1 | E2 (fsmunion) union
A & B A & B, intersect[ion](A,B) A & B (fsmintersect) intersection
~A ~A, complement(A) (fsmdifference [<sigma>]* A) complement (negation)
\a ~a & ?, ? - a
(expressed as `{a}
in graphic interface) !a (fsmdifference [<sigma>] a) symbol complement
A - B A - B, difference(A,B) E1 - E2 (fsmdifference) relative complement
~$[], \? {} (echo -n | fsmcompile) the empty language
0, [ ] [ ] 0
[<epsilon>] (if defined)
(echo -n 0 | fsmcompile) epsilon (the empty string)
A .x. B, A:B A x B, A:B A:B cross product
A .o. B A o B, compose(A,B) E1 @ E2 (fsmcompose) (reverse) composition
R.1 domain(R) (fsmproject -i) domain
R.2 range(R) (fsmproject -o) range
A A, identity(A) A identity transducer
efree(A) fsmrmepsilon epsilon-removal
A!
{t,w,wt}_determinize(A) fsmdeterminize determinization
A.i invert(A) fsminvert invert (switch direction of transduction)
$A $A, [? *,A,? *] contains
A/B ignore(A,B) ignore
A.r reverse(A) fsmreverse reverse
.#. FSA Utilities Anchors [<bos>][<eos>] edge of string anchors
A => L1 _ R1
A => L1 _ R1, L2 _ R2 Restriction in FSA Utilities restriction
A -> (exhaustive
nondeterm.)
A (->) B (optional)
A @-> B (LR longest)
A @> B (LR shortest)
A ->@ B (RL longest)
A >@ B (RL shortest)
Reversse arrow swaps
lower, upper
A may have the form [.E.]
(blocks repeat matches
of epsilon in same
place)
B may have form L ... R
(inserts L, R around A)
  || L _ R (upper upper)
  // L _ R (lower upper)
  \\ L _ R (upper lower)
  \/ L _ R (lower lower)
Parallel replacement
  joined by ,
  or by ,, if restricted Replacement in FSA Utilites lexrulecomp replacement
test equivalent/overlap/
sublangage/lower-bounded/
upper-universal/etc Decidability Properties
in FSA Utilities fsmequiv, fsmprops
Some complex expressions
in fsm decidable properties:
equality, functionality
determinizability, etc
lexc lexcomplex Compilation of acyclic automata from word lists:
Comments on lexical compilation

Explanation of operators for FSA Utilities

Grouping
This operator is the same as in common regular expression languages.
Concatenation
This operator is implicit in most common regular expression languages.
Examples: "[a,p,p,l,e,s^]" recognizes the strings "apple" and "apples".
Epsilon (empty string)
The empty string is implicit in common regular expression languages.
Examples: "[? *,[<,'/'^,i,>]:[ ],? *]*" removes the "italics" tags from a string by transducing these sequences into the empty string.
Kleene star
This operator is the same as in common regular expression languages, except that it also applies to transducers.
Examples: "[a*,b*]" constructs a recognizer which accepts a, b, ab, aaaa, bbb, aabb but not aba.
Kleene plus
This operator is the same as in common regular expression languages.
Examples: "[a+,b+]" constructs a recognizer which accepts ab, aaaab, aabbb, aabb but not aba, a, b, aaa.
Option

The operator corresponds to the question mark in common regular expression languages. A recognizer accepts zero or one occurrence of the symbol(s) which is / are in the scope of the operator.
Examples: "[a,a^]" constructs a recognizer which accepts a and aa.
"[a, ([a,b])^]" constructs a recognizer which accepts a and aab.
Intersection
Intersection combines a pair of recognizers. A string accepted by an intersection of acceptors must fulfill all the conditions imposed by
Examples: "[g,e,? *,t] & $[{a, e, i, o, u},{a, e, i, o, u}]" recognizes regular past participle forms which contain at least one diphtong (gefeit, gebaut)
Complement

The complement of a language L is the set of all strings not in L. Do not confuse this with the common regular expression operator: [^...], which stands for any single symbol not in "...".
Examples: "[v,~c,v]" constructs a recognizer which accepts vv, vaaaav, vccv but not vcv
Relative complement
The operator "subtracts" the set in the second argument from the set in the first argument. So (A - B) is equivalent to (A & ~B).
Examples: "[ziffer-{'0','1','2','3','4'},ziffer,ziffer] constructs a recognizer which accepts numbers between 500 and 999. Note that "ziffer" is a macro which expands to the union of zero to nine.
Term Complement
This operator is closer to the [^...]-operator as we know it from common regular expression languages.
Examples: "[v,(~c & ?),v]" constructs a recognizer wich accepts vav, but not vv, vaaaav or vcv.
Cross product
This operator relates an input language to an output language, to specify a transduction.
Examples: [h,'E' x{ä,ü},l,f] maps the input stem "hElf" (the base stem of the German verb "helfen") onto the two alternative past subjunctive stems hälf and hülf
Composition
The composition of transducers T1 o T2 is like a UNIX pipe; the output of T1 is taken as the input T2. In standard mathematical notation, this should be written as T2 o T1 but this would make a sequence of compositions quite hard to read. Most literature on finite state NLP uses this reversed interpretation of composition.
Domain, range
Domain and range are the two levels of a transducer (input and output level, or ``upper'' language and ``lower'' language).
Examples: domain([a,b:c,d]) ist abd, range([a,b:c,d]) ist acd.
Contains
This operator is similar to the "substring"-Operator in many programming languages. An acceptor using this operator accepts strings that contain the symbol(s) which are in the scope of the containment operator
Example: Assume that you work with a syllable recognizer, the alphabet of which is defined as the set {c,v} (consonant and vowel). A syllable is defined as a string which contains at least one vowel. For a sequence containing at least one syllable, the expression is "{v,c}+ & $v".
Ignore
This operator ignores the symbols in the second argument within the string in the first argument.
Examples: "ignore({'0','1','2','3','4','5','6','7','8','9'}+,{',' , '.' , ' ' , '-' , '/' , ':'})" produces a recognizer which accepts sequences of digits whether they contain commas, full stops, colon or white space or not. You see that this is a simple but powerful recognizer for date and time formats as well as for telephone numbers in different formats.
FSA Utilities Anchors
No special notation is provided for anchors (edge of string markers). When anchors are needed, they can be provided by:

    [[] x <begin anchor>, ? *, [] x <end anchor>]    % add some anchors
      o
    <some recognizer or transducer>
      o
    [<begin anchor> x [], ? *, <end anchor> x []]    % remove them

This rather awkward approach is easily simplified with suitable macros.
Restriction
In XFST, the notation [A => L _ R] denotes the set of strings such that for every substring matching A, this substring must be preceded by a substring matching L and followed by a substring matching R. The notation [A => L1 _ R1, L2 _ R2] denotes the set of strings such that for every substring matching A, this substring must either be preceded by a substring matching L1 and followed by a substring matching R1, or preceded by a substring matching R1 and followed by a substring matching R2. The FSA Utilities provides no such operator, however it is quite easy to define a macro for this purpose:
```
% Some macros from Kaplan & Kay.
macro(if_p_then_s(L1,L2),  % If prefix then suffix
      ~ [L1, ~L2]).
macro(if_s_then_p(L1,L2),  % If suffix then prefix
      ~ [~ L1,L2].
macro(p_iff_s(L1,L2),      % prefix iff suffix
      if_p_then_s(L1,L2) 
            & 
      if_s_then_p(L1,L2)).


macro(restrict(A,L,R),
   if_p_then_s([? *, A],R)
             &
   if_s_then_p(L,[A,? *])).


macro(restrict(A,L1,R1,L2,R2),
   if_p_then_s([? *,A],{R1,R2})
              &
   if_s_then_p(L1,[A,R1-R2])
              &
   if_s_then_p(L2,[A,R2-R1])
              &
   if_s_then_p({L1,L2},[A,R1&R2])).
```
Replacement
The FSA Utilities does not provide the wealth of replacement operators in XFST, though these can all be defined with macros (which would be a very instructive exercise).
The FSA Utilities does provide a a different sort of replacement operator with limited backreferences as described in Gerdemann & van Noord (1999)Transducers from Rewrite Rules with Backreferences. This operator may be found in the examples directory: .../Examples/GerdemannVannoord99/replace.pl.
Decidability Properties in FSA Utilities
Many useful properties of automata and transducers are decidable. Since this is a rather large topic, it is handled on a separate page in the online textbook.

XFST Input

The input "abc" to XFST is ambiguous. If multicharacter symbols are not used in the regex, then the string must be interpreted as the three symbol sequence: a b c. If multicharacter symbols are used, then the string is processed from left to right, using a longest-match strategy to find each symbol. For example, if the regex is:

[a:x b:y c:z]

Then the input "abc" is transduced to "xyz" (using "apply downward"). If however, the regex is:

[c:x ab:z] | [a:x b:y c:z]

Then the input "abc" (interpreted as: ab c) is no longer accepted.

Contextual restrictions in XFST

Left and right contextual restrictions in XFST may be checked either on the input side (upper) or on the output side (lower). Kaplan & Kay showed that || (upper upper) corresponds simultaneous application, // (lower upper) corresponds to right-to-left application and \\ (upper lower) corresponds to right-to-left application. The fourth possibility \/ (lower lower) corresponds to no intuitively obvious mode of application. XFST provides no way to check restrictions partly on the input side and partly on the output side. Such multi-level contexts are possible in Kimmo-style two-level rule compilers, although only for length-preserving (or ``same-length'') transductions.

Decidability Properties in FSM

FSM provides only the predicate: fsmequiv. It is a useful exercise, however, to define some predicates on top of fsm. Defining a few simple predicates may serve as a model for how more complex operations may be defined using fsm.

First, let {epsilon} be interpreted as "true" and {} be interpreted as false:

echo -n 0 | fsmcompile > true.fsa

echo -n | fsmcompile > false.fsa

More generally, one could perhaps allow any language that accepts epsilon to be interpreted as "true".

Now let conjunction, disjunction and negation be defined as concatenation, union and complementation, respectively. With this interpretation, it is possible to design small programs using fsm that implement these operators. For example, using C++, one can implement negation with the following program:

// not.cc: note that a pipe is used so that the argument 
// can occur either on the command line or come from 
// standard input
#include <iostream>
#include <string>
#include <pfstream.h>  // g++ library extension for pipes

int main(int argc, char* argv[]) {
  if (argc == 2) {
    string command = "cat ";
    command += argv[1];
    command += " | fsmrmepsilon | fsmdeterminize";
    command += " | fsmdifference true.fsa -";
    system(command.c_str());
  } else if (argc == 1) {
    opfstream 
      pipe("|fsmrmepsilon|fsmdeterminize|fsmdifference true.fsa -");
    char c;
    while (cin.get(c))
      pipe.put(c);
  }
}

The program is a little complicated since complementation is not a basic operation in fsm. One has to define the complementation of A in terms of sigma*-A. If we allow the alphabet to be empty, then sigma* is the same as our automaton: true.fsa. A further complication arises since the second argument to fsmdifference is required to be epsion free and deterministic.

Once negation is defined, we can use this to define the predicate empty:

// empty.cc: for simplicity, assume only a command-line 
// argument
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
  string command = 
    "lexcompre -S symbols.scl -l symbols.lab -s ";
  command += "'[<sigma>]*:[<epsilon>]'";
  command += " | fsmcompose ";
  command += argv[1];
  command += " - | fsmproject -o | not";
  system(command.c_str());
}

Note that the central idea here is contained in the transducer turning sigma* into epsilon (= true.fsa). The only input that is not turned into epsilon is the empty language.

Once the predicate empty is defined, this can be used as the basis for defining other predicates such as totality, equality and subset.

Lexicon Compilation

Although general purpose finite-state tools may be used to construct a lexicon, it is generally preferable (easier or more efficient) to use a specialized tool. Both Xerox and AT&T provide such tools. Another set of tools for this purpose is the fsa package developed by Jan Daciuk. This package includes tools for lexicon construction, along with specialized tools for morphological analysis, spell checking, diacritic restoration, analysis of unknown words and perfect hashing.

XFST Intersection and Relative Complement

Intersection and relative complemnt (subtraction) are generally defined only for languages, not relations. XFST allows these operations also for the special case of equal-length relations.

The ANY Symbol in XFST

A question mark (?) in an XFST regular expression stands for any symbol. In an XFST network, however, the question mark stands for any unknown symbol. This distiction was also made in previous versions of the FSA Utilities. In the current version, the question mark always indicates any symbol (known or unknown).

XFST	FSA Utilities	AT&T FSM lextools	Explanation
Manual	Manual	FSM Manual lextools Manual	Online Manuals
xfst Web-based interface	fsa -tk	n/a	Interactive startup
?	?	[<sigma>] (if defined)	any alphabet symbol
	a..z (or predicates)	superclass defn in .sym file (lexmakelab compiles to .scl file for -S option)	symbol class
"" (or %)	'' (or escape())	\*	escaped literal symbol
foo bar (whitespace separator)	foo<regex op>bar (separated by regex op)	[foo][bar]	long symbol names
	(predicates)	[noun num=pl gender=fem	complex symbol
abc (Ambiguous: see Explanation below	abc (no symbol separator) a b c (symbol separator=32) a+b+c (symbol separator=43)	(no notation for input string use fsmintersect or fsmcompose)	the input string "abc"
	a,bc (symbol separator=44)		the input
E1 E2 ... En	[E1,E2,...,En], concat(E1,E2)	E1 E2 (fsmconcat)	concatenation
{abcd}	[a,b,c,d] (no special notation)	abcd	character concat
A*	A*, kleene_star(A)	A*	Kleene star
A+	A+, kleene_plus(A)	A+	Kleene plus
(A)	A^, option(A), {A,[]}	A?	option
E1 \| E2 \| ... \| En	{E1,E2,...,En}, union(E1,E2)	E1 \| E2 (fsmunion)	union
A & B	A & B, intersect[ion](A,B)	A & B (fsmintersect)	intersection
~A	~A, complement(A)	(fsmdifference [<sigma>]* A)	complement (negation)
\a	~a & ?, ? - a (expressed as `{a} in graphic interface)	!a (fsmdifference [<sigma>] a)	symbol complement
A - B	A - B, difference(A,B)	E1 - E2 (fsmdifference)	relative complement
~$[], \?	{}	(echo -n \| fsmcompile)	the empty language
0, [ ]	[ ]	0 [<epsilon>] (if defined) (echo -n 0 \| fsmcompile)	epsilon (the empty string)
A .x. B, A:B	A x B, A:B	A:B	cross product
A .o. B	A o B, compose(A,B)	E1 @ E2 (fsmcompose)	(reverse) composition
R.1	domain(R)	(fsmproject -i)	domain
R.2	range(R)	(fsmproject -o)	range
A	A, identity(A)	A	identity transducer
	efree(A)	fsmrmepsilon	epsilon-removal
	A! {t,w,wt}_determinize(A)	fsmdeterminize	determinization
A.i	invert(A)	fsminvert	invert (switch direction of transduction)
$A	$A, [? ,A,? ]		contains
A/B	ignore(A,B)		ignore
A.r	reverse(A)	fsmreverse	reverse
.#.	FSA Utilities Anchors	[<bos>][<eos>]	edge of string anchors
A => L1 _ R1 A => L1 _ R1, L2 _ R2	Restriction in FSA Utilities		restriction
A -> (exhaustive nondeterm.) A (->) B (optional) A @-> B (LR longest) A @> B (LR shortest) A ->@ B (RL longest) A >@ B (RL shortest) Reversse arrow swaps lower, upper A may have the form [.E.] (blocks repeat matches of epsilon in same place) B may have form L ... R (inserts L, R around A) \|\| L _ R (upper upper) // L _ R (lower upper) \\ L _ R (upper lower) \/ L _ R (lower lower) Parallel replacement joined by , or by ,, if restricted	Replacement in FSA Utilites	lexrulecomp	replacement
test equivalent/overlap/ sublangage/lower-bounded/ upper-universal/etc	Decidability Properties in FSA Utilities	fsmequiv, fsmprops Some complex expressions in fsm	decidable properties: equality, functionality determinizability, etc
lexc		lexcomplex	Compilation of acyclic automata from word lists: Comments on lexical compilation

Operators in XFST, FSA Utilities and FSM -- A synopsis

Explanation of operators for FSA Utilities

Grouping

Concatenation

Epsilon (empty string)

Kleene star

Kleene plus

Option

Intersection

Complement

Relative complement

Term Complement

Cross product

Composition

Domain, range

Contains

Ignore

FSA Utilities Anchors

Restriction