<!--This file created 09/09/96 10:29 by Claris Home Page version 1.0-->
<HTML>
<HEAD>
   <TITLE>Collation Guide</TITLE>
   <META NAME=GENERATOR CONTENT="Claris Home Page 1.0">
   <X-SAS-WINDOW TOP=66 BOTTOM=762 LEFT=49 RIGHT=555>
</HEAD>
<BODY BGCOLOR="#FFFFFF">

<H2 ALIGN=CENTER><I>Collation Details</I></H2>

<P ALIGN=CENTER><A HREF="example1.html">Guide</A></P>

<P>If you are an advanced user and interested in trying out more
rules, here is a brief explanation of how they work. The Collation
Rules field is a list of rules, where each rule is of three forms:
</P>

<UL>
   <LI>&lt;modifier&gt;
   
   <LI>&lt;relation&gt; &lt;text-argument&gt;
   
   <LI>&lt;reset&gt; &lt;text-argument&gt;
</UL>

<H4>Text Argument</H4>

<P>A text argument is any sequence of characters, excluding special
characters (that is, whitespace characters and the characters used in
modifier, relation and reset). If those characters are desired, you
can put them in single quotes (as in the
<A HREF="example1.html#Ampersand">Ampersand example</A>).</P>

<H4>Modifier</H4>

<P>There is only a single modifier currently, which is used to
specify that all accents (secondary differences) are backwards.</P>

<P><TABLE BORDER=1 CELLPADDING=1 WIDTH="100%">
   <TR>
      <TD>
         <P ALIGN=CENTER>@
      </TD><TD COLSPAN=6>
         <P>Indicates that accents are sorted backwards, as in French
      
      </TD></TR>
</TABLE></P>

<H4>Relation</H4>

<P>The relations are the following:</P>

<P><TABLE BORDER=1 CELLPADDING=1 WIDTH="100%">
   <TR>
      <TD>
         <P ALIGN=CENTER>&lt;
      </TD><TD COLSPAN=6>
         <P>Greater, as a letter difference (primary)
      </TD></TR>
   <TR>
      <TD>
         <P ALIGN=CENTER>;
      </TD><TD COLSPAN=6>
         <P>Greater, as an accent difference (secondary)
      </TD></TR>
   <TR>
      <TD>
         <P ALIGN=CENTER>,
      </TD><TD COLSPAN=6>
         <P>Greater, as a case difference (tertiary)
      </TD></TR>
   <TR>
      <TD>
         <P ALIGN=CENTER>=
      </TD><TD COLSPAN=6>
         <P>Equal
      </TD></TR>
</TABLE></P>

<H4>Reset</H4>

<P>There is only a single reset currently, which is used primarily
for contractions and expansions, but which can also be used to add a
modification at the end of a set of rules.</P>

<P><TABLE BORDER=1 CELLPADDING=1 WIDTH="100%">
   <TR>
      <TD>
         <P ALIGN=CENTER>&amp;
      </TD><TD COLSPAN=6>
         <P>Indicates that the <I>next</I> rule follows the position
         to where the reset text-argument <I>would be</I> sorted.
         </P>
         
         <P>The reset does <I>not</I> put the text-argument into the
         sorting sequence.
      </TD></TR>
</TABLE></P>

<P>This sounds more complicated than it is in practice. For example,
the following are equivalent ways of expressing the same thing:</P>

<UL>
   <LI>a &lt; b &lt; c
   
   <LI>a &lt; b &amp; b &lt; c
   
   <LI>a &lt; c &amp; a &lt; b
</UL>

<P>Notice that the order is important, as the subsequent item goes
<I>immediately</I> after the text-argument. The following are
<I>not</I> equivalent:</P>

<UL>
   <LI>a &lt; b &amp; a &lt; c
   
   <LI>a &lt; c &amp; a &lt; b
</UL>

<P>Either the text-argument <I>must</I> already be present in the
sequence, or some initial substring of the text-argument must be
present. (e.g. "a &lt; b &amp; ae &lt; e" is valid since "a" is
present in the sequence <I>before</I> "ae" is reset). In this latter
case, "ae" is <B><I>not</I></B> entered and treated as a single
character; instead, "e" is sorted as if it were expanded to two
characters: "a" followed by an "e".</P>

<P>This difference appears in natural languages: in traditional
Spanish "ch" is treated as though it <I>contracts</I> to a single
character (expressed as "c &lt; ch &lt; d"), while in traditional
German "&auml;" (a-umlaut) is treated as though it <I>expands</I> to
two characters (expressed as "a &amp; ae ; &auml; &lt; b").</P>

<H4>Ignorable Characters</H4>

<P>The first rule must start with a relation (the examples we have
used above are really fragments; "a &lt; b" really should be "&lt; a
&lt; b"). If, however, the first relation is not "&lt;", then all the
all text-arguments up to the first "&lt;" are ignorable. For example,
", - &lt; a &lt; b" makes "-" an ignorable character, as we saw
earlier in the word "black-birds". In the samples for different
languages, you see that most accents are ignorable.</P>

<H4>Normalization and Accents</H4>

<P>The Collation object automatically normalizes text internally to
separate accents from base characters where possible. This is done
both when processing the rules, and when comparing two strings.
Collation also uses the Unicode canonical mapping to ensure that
combining sequences are sorted properly (for more information, see
<A HREF="http://www.awl.com/devpress/titles">The Unicode Standard, Version
2.0</A>.) and click on "U" for Unicode.</P>

<P>Most languages that use accents sort them in a consistent fashion,
immediatedly after the unmodified base character. This can be
achieved by making the accents ignorable, and putting them in the
right order at the beginning of the collation rules. When this is
done, only special cases like the German "&auml;" need to be handled
by explicit rules.</P>

<H4>Errors</H4>

<P>The following are errors:</P>

<UL>
   <LI>A text-argument not preceded by either a reset or relation
   character (e.g. "a &lt; b c &lt; d")
   
   <LI>A relation or reset character not followed by a text-argument
   (e.g. "a &lt; , b")
   
   <LI>A reset where the text-argument (or an initial substring of
   the text-argument) is not already in the sequence (e.g. "a &lt; b
   &amp; e &lt; f")
</UL>

<P>If you produce one of these errors, a message at the bottom of the
screen will tell you where the error is at, and select the incorrect
text (note: on some browsers, the selection will not appear
correctly).</P>

<HR>
<BR><a href="http://www.taligent.com/Copyright.html">&copy; Copyright 
1997</a>. All rights reserved. Taligent, Inc., IBM Corp.
</BODY>
</HTML>
