here is(1)a corrected elliptic curve # points(2) remarks on factzn
          The Number of Points on an Elliptic Curve Modulo a Prime

                               by


                          A.O.L. Atkin

      (This is an expanded and improved version of a 9-page ms dated
      January 1988. )

 0.^  The elliptic curve we consider is

       E:  y**2=x**3 + A*x +B                 (modulo P)        (0)

   where P is a (large) prime,certainly not 2 or 3,and usually bigger than
any auxiliary prime q,where currently 2.le.q.le.311 . We summarize briefly
some routine material.

 ^1.     Matters of Proof.

   In the method we finally describe,it is often convenient not to prove
      some of the intermediate steps; we arrive at a putative number of points,
 certainly correct,but legally speculative.Thus in this section we address
the problem:

            Prove that the number of points on (0) is P + 1 + c, where
   c is given.                                          (1.1)

       First use trial divisors on P + 1 +/- c(simultaneously) up to some
moderate bound B(perhaps 10**7,depending on the size of P). The general area
of interest here is with P from 30 to 100 decimal digits,so after each
(found factor AND 10000 primes)a Selfridge may be performed. This will
(in the worst case-often better with obvious consequences)leave us with

      P +1 + c = F1 * U1    ,  P + 1  -c = F2 * U2   ,
where F1 and F2 are (factorized) products of known primes,and U1 and U2 are
composite numbers with no prime factor less than B. We also have
    (F1,U1)=(F1,U2)=(F2,U1)=(F2,U2)=(U1,U2)=1,
of which the last needs a gcd; and we hope that
                 (F1,F2) is rather small.

  Now find + and - twists of E and points (a1 and a2) on them. By the usual
techniques,if necessary repeating with a new point/twist,we expect to show
      that there is a subgroup of the group of points on E mod P which
        has order F1*X1 on E,where X1 divides U1 and is not 1,and similarly
an order F2*X2 on E' ,the twist of E by (*/ P).  If the true numbers
of points on E and E' are P + 1 +/- C,then we have shown that (C-c) is
divisible by lcm(F1,F2) *X1 *X2 . But the absolute value of (C-c) is less
than 4*sqrt(P),so that our c is certainly proved correct when this is
 less than our proven divisor.   In particular,since X1 and X2 are both
grater than B,we have that

                   lcm(F1,F2)*B*B . gt. 4*sqrt(P)           (1.2)
 is sufficient to prove that c is correct.
      We can improve on this with rather more work. Let us ostensibly
factorize P by the elliptic curve method,using a first stage with a1**F1,
and a second stage up to B2(.gt.B) . Notice that,since P is prime and we
can identify a match without a gcd,we may use the Shanks baby-giant steps
with sort and match,as in Shanks's quadratic forms factorization method. Using
the modification introduced in SPAR saves a factor of 6 or 7,and it is quite
fast to work up to B2=10**10,and 10**12 only takes a few minutes. This will
either produce a factor of U1,or else prove that X1 is greater than B2.
A similar computation on E' will prove that X2 is greater than B2,and we
can now replace B by B2 in (1.2) above.
        In the unfortunate (and unlikely) event that these expedients fail,
we resort to ECM and possibly sieve factorizations of U1 or U2. Factorization
takes considerably longer than the techniques described above. Of course,the
congruences modulis the small primes q described below are in many cases
proved without reservation,and so give rise to further conditions on
(C-c). The difficulty is that knowing (C-c) is,say,0 or 4 mod 13 and
divisible by an UNKOWN large prime,is unhelpful: 0 is all right,but not
4.


2^.    The Construction of Modular Equations

2.1.For our present purpose,we take q to be an odd prime,and seek an algebraic
equation connecting j =x**-1 + 744 + 196884*x+...  (x=exp(2*pie*i*tau))
and some suitable function f on the subgroup GZ(q)  ("gamma sub zero q"). We
prefer this function to have as low a valence v as possible,so that the
equation has degree (q+1) in f and degree v in j. One canonically available
function is f= eta**s(tau)/ eta**s(q*tau) ,where s is the smallest even
integer such that 24 divides s*(q-1). Unfortunately its valence is
unreasonably large when q is not congruent to 1(mod 12). Another less
numerically canonical function is the function of lowest valence v
on GZ*(q) =(GZ(q),W(q)),where W(q) is the involution (0,-1 ; q,0).
This has valence 2v on GZ(q),of course,but v is usually very small,since
GZ*(q) almost always has a  Weierstrass point at infinity(I conjecture
that this is so for q.gt.389). My own package has a series of special
purpose routines for different q,but there is one universal method,
involving hypergeometric series and Berlekamp factorization,which gives
total conviction but no proof.
           The learned reader will recognize supersingular equations,Hasse
invariants,&c,in what follows,but I shall not allow myself to start explaining
anything.
       (for seq: see Weber III or author's 1975 Michigan/Japan talk)
     When q=1(mod 4),consider the Gauss hypergeometric series

      F(1/12,5/12; 1; 1728/j)  ,

 (which happens to be the fourth root of E4).  Modulo q this series is congruen
t
to  H(1/j)*H(1/j**q)*H(1/j**q**2)*  ....        (for this & more,see Carlitz)
  where H has degree d; then j**d *H(1/j) is a monic polynomial SSE(j) of
degree d= approx. q/12 .
    When q=3(mod 4),we employ F(7/12,11/12; 1; 1728/j) similarly.
 SSE(j) has only nonrepeated linear and quadratic factors. Remove the
linear factors to produce SSE*(j) of degree 2g,where g is the genus of
GZ*(q),and SSE* has precisely g quadratic factors modulo q. We do NOT
wish to find these,however.
        It is easy to prove that every function with a pole of order m
at infinity and leading term k*x**-m+...on GZ*(q) is congruent modulo q
to a polynomial in j (i.e.,the corresponding Fourier coefficients are
termwise congruent). This polynomial has the following property: it has
constant remainder(no j-term)when divided by ANY of the quadratic factors
of SSE*. That is to say,the (infinite)ring of such polynomials is precisely
that of the Berlekamp polynomials in Berlekamp's factorization method. Thus we
ostensibly factorize SSE* by B.,but never proceed to find the constants and
the factors;we stop when the Berlekamp matrix has been echelonized,and now the
(columns as I do it)are the polynomials in j(and,though we do not need it
here,the rows *dj are the differentials)modulo q.
        There is however one caveat which I have never been able to overcome
(nor,I believe,have Serre and many other people).  If one finds that,in
terms of x,the polynomial in j of smallest degree begins(for example)
    x**-3(1,1,1,2,2,2,3,3,4,5,6,7,9,10,.....)           ,

it is tempting to suppose that one has PROVED that 3 is the smallest
possible valence on GZ*(q). We are in fact sure,and even that 1,1,1,2,..
&c are right over Z,since how could it go 1,1,1+179,2-2*179,...? But
in theory we may have been misled. The least degree function over Z
(over C directly,but deep theorems of Shimura imply over Z)may be
x**-4(179,1,1,1,2,...)   and of course a compensating differential
may start  x**3(179,1,-1,...). Thus the "mod q gap series"may not
be the true gap series over C at infinity. I have of course checked
numerous cases,and it always has been,but the annoying theoretical
possiblity remains that they may differ. However,it is clear that the
mod q situation is more Weierstrassy than the "true" one,so that when
we have no WP mod q,we have no WP a fortiori over C. In this way(using
the full SSE)one can show that infinity(and hence also 0)is never a WP
of GZ(q).
  ( There are many other interesting corollaries and issues related to the
      supersingular equation,which we defer.)

           Now even if we are prepared to forgo proof,we cannot afford to be
      wrong. The speculative inference of an exact function

        x**-21(1,1,2,2,4,5,8,10,14,....
      from its reduction modulo q for,say,q=701,starts off very securely,but
      what happens as the presumed terms become larger? Our position is
      analogous to that of the tax evader,led step by step to wilder and
      more extravagant lies and guesses until finally the IRS gives him a
      wrong coefficient. Our remedy should be sought,not from the ineptitudes
      of Government,but from the technology of organized crime,in the
      process LAUNDR described in the next paragraph: for a small commission
      we can have our coefficients laundered.
          The reader might wish to try her hand at the following. We can
      be sure that the 125 after 681 is really 826,the 287 998,&c,but at
      some point discrepancy is inevitable. The'inevitability'of Bach's
      next chord given the last chord and the next note of the chorale
      does not enable us to write his works ourselves.
Q= 701 FUNCTION X**- 21(1,   1   2   2   4   5   8  10  14  18  25
  31  42  52  68  84 108 133 168 205 256 311 384 463 567 681
 125 287 489  14 294 607 289  19 540 426 430 524  67 435 296
 321 596 387 498 197 302  83 384 474 521 503 618 147  31 259
 413 501 153  98  35  11 500 188 335 373 269 208 284  63 481
 494 527 344 579 453 142 448 452 469 640 160 279 528 571 333
  67 212 309 698 365 358 654 389 687 337 458 382  48 323 246
      (Solution at end of this essay)
               Assume now that the valence v of our function f is less than
      (q+1)/24,as is almost always the case(163 is the biggest
      exception.)  Then f*eta(tau)*eta(q*tau)  is a CUSPform of
      weight one and multiplier system (*/q)*some 12-th root of unity ,and
      hence by Serre-Deligne of prescribed type and with very small coefficients
      after we subtract if needed some constant multiple of eta(tau)*(eta(q*tau
).
      These can be identified with certainty(and are almost always based on
      definite quadratic forms). Now modulo any of our Chinese primes we can
      divide back by our weight one multiplier,to get a laundered or sanitized
      function.
            Now I have to confess that I do not do this myself. For once one
      has used the SSE and the above to identify the function as a ratio of
      forms of weight one,the numerators can be readily identified as binary
      theta series,and the algorithm expressed in that form. This is also
      much faster than computing j modulo q out to several thousand terms;
      however the POWMAT and GEGAJ described below take so much longer
      relatively that the cost of computing the function is immaterial.

2.2^.    The Modular Equation(Star Type)  (routine GEGAJ)
    We have an algebraic relation

           Sigma a(r,s) . f**r . j**s   =0

       (0.le.s.le.2*v and 0.le.r.le.q+1)

      For fixed f,both j and j(q*tau) are among the solutions;for fixed j,
      the solutions are f and f((tau+l)/q),l=0,q-1. If we plan most of the
      work using j(q*tau),we see that a(r,s) multiplies a power series
      with leading term x**-(r+q*s). Two such terms are only equal if
      r=r'(mod q). Also the coefficient of f**q as a polynomial in j is
      plainly equal to -f +constant +O(terms in x and higher powers),which
      enables us to commence our equation as

           f**(q+1) + f**q *(-j**v +...+const) + f**(q-1)*...... =0  :
    This start achieved,we have r=r' and s=s' in the above,so that
      every remaining coefficient can be found by solving echelon-form
      using f**r and j(q*tau)**s. For this we need all the series
      f**r (1.le.r.le.q+1) up to(at most) q*(2v+1) terms; unfortunately
      the succesive use does not enable us to save much on storage here.
              POWMAT  is a general purpose program which produces all powers
      from 1 to k of a given power series up to N terms in each power. It
      uses the fast FFTLIB which is part of LMA4064V,and is too long to describ
e     here. Basically we find d of size sqrt(k )+a little,and get the powers
      from 1 to d and store their FFTs. Then successively the FFTs of f**2d,
      f**3d,...,are found,and each is used to find d powers of f  using
      only one inverse FFT per power. This is often quicker than
      the subsequent echelon solving. The latter could be speeded up by
      the usual "block"technique,but in the present case that was too much
      of a pain to program.

2.3^. The Modular Equation  (Eta Type)      (routine GEGEN3)

     Here we have f=eta**s/eta(q*tau)**s of valence v,say,and j. Again,
 the involution shows that q**(s/2) / f and j(q*tau) satisfy the same
      relation as f and j; this would enable a"both ends to the middle"
type of echelon solving to be applied. But we chose instead to use the
      relation between j and g=f(tau/q): for fixed j the roots are
      f((tau+l)/q),l=0,q-1, and q**(s/2)/ f, and we then used routines
      for multn/divn by eta series,for powers of sparse coefficient
      series,and Newton's formula.

3^.     The Use of the Modular Equation

     Our indifference above to the precise function used to define GZ(q)
      with j will now be explained. Given our original curve E modulo P,
      we compute the usual j modulo P as

                   j = 6912*A**3 / ( 4*A**3 + 27* B**2)

       Substitution of this value of j in the modular equation yields an
      equation f**(q+1)....=0 of degree q+1 modulo P.  (In practice,we
      leave the modular equation as a set of equations modulis enough
      Chinese primes,and reduce modulo P at this point.)
      We are concerned ONLY with the DEGREES of the irreducible factors
      of this last equation modulo P,and not with the factors themselves.
      More precisely,recalling always that q and P are distinct,we have:

      (Definition: the "split" modulo P is the permutation of (q+1) letters
      obtained by treating each irreducible factor of degree d as a cycle
      of length d;"alternating" means split in A(q+1);"symmetric" means
      split not in A(q+1). )
        (3.1) If (P/q)=1,the split is alternating.
        (3.2) If (P/q)=-1,the split is symmetric.

               If the number of points on E mod P is P + 1 + c,and
      the roots of t**2 -t*c + P =0 (mod q) are a and b,then:
        (3.3) if a=b(or c**2=4*P mod q),then the split is (1q) or (111..11)
        (3.3.1) The last case (all factors linear) occurs only when c**2 -4*P
      is divisible by q**2.
                      Now assume a and b are distinct mod q.
         (3.4) If a and b are in the field of q elements(so that (c**2-4*P)/q)
      =1),then the split is (11rrrr..rr),where rs=q-1 .     (r.gt.1)
         (3.5) If a and b are not in the field of q elements(so that
      (c**2-4*P)/q)=-1),then the split is (rrrrr.rr) with rs=q+1.(r.gt.1)
        In the last two cases,r is the EXACT ORDER of (a/b) in the fields
      of q,q**2,elements,respectively.
           Conversely,splits as prescribed above imply the appropriate
      conditions on a and b,and hence on c.  Note that we cannot
      distinguish +/- c by these criteria,which is to be expected since
      we have only used j mod P,and not A and B, and j is invariant by
      the twist of E.
                      The number of possible c mod q remaining is equal
      to (Euler)phi(r). This is always less than q/2 and in some cases
      we hope to find very small values of r.

            Computations with Polynomials Modulo P

       For large degree (q+1) and large P(here perhaps up to 100 decimal
      digits)we run into serious time. Thus efficiency is important,even
      constant improvements like 20%.
          Our prime P consists of P(1) 32-bit words starting at P(2).
      For degree N,we have three kinds of object,each represented by a
      string of N*P(1) words: the monic polynomial x**N+ a(1)*x**(N-1)+..
      ...+a(N),the same as x**N=a(1)*x**(N-1)+...+a(N),and finally the
      expression   a(1)*...without the x**N.
             Our first task is to form x**P-x modulis (P,f(x)); for this
      we do the usual left-to-right square and multiply. "Multiply" is
      trivial(only by x). For squaring we construct an auxiliary table
      of powers x**N,...x**(2N-2) modulis (P,f),in an (N-1)by N BLNMAT
      matrix of P(1)-word numbers. We also add an N-th row(see below),
      which requires a total storage of P(1)*N*N words.
           To square a(1)*x**(N-1)+...+a(N) we proceed in two stages.
    First,we have to square to get a polynomial of degree 2N-2(FFT is
      pretty useless here). Assuming  that the basic number multiplication
      is about twice the squaring time for P(1).gt.4,say,it is beneficial
      to regard a(i)*a(j) as ((a(i)+a(j))**2-a**2(i)-a**2(j))/2 ,once N
      is over about 10 or so. The same remark applies to the next stage
      (essentially forming a vector as vector*matrix): we have to find
       N dot products,and at the overhead cost of working out 1 extra
      sum of squares(N were already done in BLNMAT),we halve the dot
      product time.
             At this point,the gcd (x**P-x,f) is trivial,and after
      removal of the linear factors we need to contemplate x**(P**a) - x .
     We know that at some point(namely a=r)this will suddenly vanish.  P
      is so large that exponentiation with square and(now long)multiply
      would be absurd. We therefore form another matrix BLQMAT which
      contains the powers x**P,x**2P,...,x**((N-1)*P),as polynomials of
      degree.le.N-1 in x (and again an N-th row for sums of squares.)
      This is done by the usual "square and halve and square again"
      algorithm for forming a list of powers,when FFT is not relevant,
      and when squaring is cheaper than multiplying. Briefly,we compute
      powers as follows  1,2,4,8,16,3,6,12,24,5,10,20,7,14,9,18,11,22,
      13,15,17,19,21,23,(for N-1=25),with only the 25-th power needing
      a multiply. For,e.g.,3= half( (1+2)**2-2-4),17=half((8+9)**2-16-18).
      &c. It should be said that all these devices have been programmed,
      and do lead in practice to substantial improvements. This is not
      computer science.
           The auxiliary storage used(which really needs to be in core for
      time efficiency)is in an array with a safe counter. When it is clear
      that BLQMAT(which needs BLNMAT)would overrun storage,a trap PNOMAT
      is set,and BLNMAT is not used,at the cost of greater execution time.
       (Note that the storage for computing the modular equation can now be
      thrown away. In a perfect world one would store all the modular
      equations on disk: in practice we check legally once-for-all that
      the equation is correct if we use w Chinese primes,and only store w
      for later purposes,recomputing the resiudes.)

5^.   Some Examples of the Method before the Sieve.

5.1^.  P= 1 87668 40755  53521 23529   ,
      E is Y**2=X**3 + 105*X + 78153

      q=       degree of (x**P-x,f)   Possible r     Actual r   List of c (+/-)

      13           1                   ....           ...        3
      17           0                    3,9            3         1
      19           2                   2,6,18           6        3
      23            0                2,3,4,6,12         6        5
      29           2                  4,28             28        1,2,5,7,8,14
      31           2                  3,5,15            15        9,10,13,14
      37           2                2,3,6,9,18        9           5,10,15
      41          0                   3,7,21         7            6,9,11

    For example,when q=37,the second column shows that we are in
      case (3.4 ). Since (P/37)=1,we need (rrr..rr)to be alternating,so
 that r=4,12,and 36,are impossible. At x**(P**9) we found our running
      x**(P**a) was reduced to x,giving r=9; the final c calculation is
trivial. Note that in the cases q=29 and q=31 the value of r was the largest
      possible. In theory this can be deduced at the penultimate stage
      (when q=31 and a=5 does NOT give x**(P**5)=x,we know r must be 15:
      one can even stop after one non zero entry is found as we begin the
      a=5 computation.)  However,once BLQMAT has been done,the cost of
      continuing to the end is relatively small,and any error in any
single excluding list is so disastrous,that we often do so as a check.
            Finally notice that as a matter of global strategy,we are
interested in retaining those q and corresponding lists of c for which
phi(r)/q is small. Thus we might very well leave q=29 at a=8,since we
know we do not wish to use the list of values of c even if we double
      check it.

   5.2^. P = 10**99 + 17**2
         E is Y**2 = X**3  + 105* X + 78153
    (At present(May 17th,1991)there is no reasonable prospect that our methods
      will work for a 100-digit prime. The information modulis q can be
      gathered,but the sieve is not feasible.)
            For q=211(and in fact v=7 in (2.2))the modular equation has
      2333 terms. We need 21 Chinese primes @ 29.5" each,for a total of
10.24',to compute it. It took 14.60' to find X**P-X,with a linear part of
      degree 2. At this point
          r=3,5,7,15,21,35,and 105, were still possible.
      BLQMAT and BLNMAT together would have needed storage beyond the 4096K
      available,so that we did BLQMAT without BLNMAT,at some cost in terms of
      time,which was 23.02'. In fact r=7 was now found after only 10.54"
      more,giving

          c= +/-  47,95,105  (mod 211)    for total 48' CPU time on an IBM3090.

      (With this prime,the largest q for which BLQMAT was possible with BLNMAT
      was 197: the timings there were 357" for modular equation,12.71' for
      X**P - X,and 460" for BLQMAT. In this case r was 196,so that the check
       took substantial time,namely 302").

6^ . Other methods.

                   Before describing the sieve,sort,and match,we first discuss
      what else is available,and what are the criteria for using various method
s.
      We continue to regard the operation of our group of points as
     multiplication; many parts of the program are the same for quadratic
      forms and finite fields,where multiplication is standard.

  6.1 ^  Modular forms.

            If the modular form corresponding to E taken over Q can be identifi
ed,
 then its P-th Fourier coefficient will be -c. Obviously this is an absurd way
of getting one particular value of c for one P,if P is small(under 10**5),and
an impossible way if P is medium or larger. But if one wants the values of c
for all P up to some limit,it is probably the best method when available.

      6.2^  Direct Computation of the Character Sum

        Although this takes time P,the implied constant is very good,and with
     treble differencing of x**3 +a*x +b,double differencing to create a table
      of the Legendre symbol,and careful use of all tricks(e.g. on the IBM 3093
      one can difference y**2/2 down to zero using BCTR,&c),the time is only
      a few milliseconds up to P=10**5.

      6.3 ^ Baby-Giant Steps as in Shanks.
       In what follows r and s are integers to be suitably chosen
      later; r*s should be minimally greater than sqrt(P).

       Find a twist +/- of E with an initial point A. Form and"store"the powers
      of A from 1 to r. If r*P is small enough,this presents no problem. If
      storage becomes tight,the following technique is used(we shall not repeat
      thse remarks in section 7,and other analogous places.) We store the
      penultimate word of X,where A**r=(X,Y). We thus have an array of r
      32-bit words. We now perform a keysort(very fast) on this so as to make
      the array in ascending"logical"order(i.e. as unsigned 32-bit integers).
      If necessary,the key can be sent off to secondary storage. We now
     successively compute A**(P+1+i*r)  and match the penultimate word of its
      "X" against our sorted list,and similarly A**(P+1-i*r)  ,where i runs
      from 1 through odd integers up to 2*s-1.(Remember that,using the Weierstra
      form,we always know when any power we compute happens to become the group
      identity).
             If a match is found on the penultimate word,we find r from our key
,and directly recompute A**r for a full comparison of the "X"s. If it is a true
match,then the "Y"s are equal or equal and opposite,and we have
      found a nontrivial power of A which is equal to the group identity.One
      expects in large cases to find a number of bogus matches; the situation
      is similar to the birthday problem(we have 2**32 days in the year,so that
      once we have 2**16 or more students in our class two may well have the
      same birthday.)  However,there are not likely to be enough bogus matches
   to cause serious inefficiency.  Notice that we prefer the penultimate
      word; the last word may have some 2-adic congruences,and the first
      word may be small. (If first=penultimate we have only two words in
      all,when storage is unlikely to be a problem).
           Now P + 1 + c,the order of our group of points,lies between P +1
      + 2*Sqrt(P) and P + 1 - 2*sqrt(P). Provided that r*(2*s) is greater than
      or equal to the integer part of 2*sqrt(P),or more or less r*s.ge.sqrt(P),
      then every power of A which is equal to the identity,and which lies betwe
en    the given bounds,will be found by the above process. In particular,if
      we carry out the full search,and find exactly one match,that must be the
      group order. It is probably best,however,to leave the sieve once the
      first match is found,and to make a quick attempt to prove it correct
      by the methods of section 1; if we fail,come back to the sieve. Again,
      if the matches are too numerous,and the order of the point A is
      disastrously low,we should try again with a new point.
         Finally,excerpts of data from different places can often
      be combined logically to advantage. For example,we may quickly
      find some information on c modulo 60.  Now if our first success S
      is NOT the true order of the group,then our point has order d where
      d divides S,and the possible orders of the group are S (mod d). Als
      any number which is S (mod d) in the range over which we have
      already searched would have been a success,and it was not; this
      plainly restricts d.
 (Example: suppose that our first success is P+1 +k*sqrt(P)=S  (k.gt.0),
  following the algorithm described above. Then we know d.ge.2k*sqrt(P).
      If we know that c is odd,and in fact S is odd,the nearest possible
      multiple of d is at least 6k*sqrt(P) away. Thus if 6k.gt.k+2,or
      k.gt.2/5,we have shown that S is the true order. )


           It remains to choose r and s. They are both P**.25
      apart from constants,so that the match time is fixed.(a single
      match is O(log r). ) The major time is thus the computation
      of the powers of A. If we are going to go through the whole sieve,
      that time is proportional to r+2*s  (For each new power to be
      matched,the time to find its (X,Y) is very much greater than the
      match itself,so we do not bother with (2+ .01*log r) *s ). This
      is minimal when r=2*s approximately. However,we know the distribution
      of c/(2*sqrt(P)): if this is cos(theta) with 0.le.theta.le.pie,then
      the element probability is sin**2(theta).d(theta). We find that,assuming
      there is only one multiple of the order of A in our range(as is usually
      the case),the expected value of i where we find it is 4*(2*s)/(3*pie),
       in time 4*s/(3*pie),recalling that i runs over odd values only. Thus
      our expected effort now becomes r+ 4*s/(3*pie),which is minimal whe
       -n  r=P**.25 *sqrt(4/3pie). It is certainly worth distinguishing
      the case when one intends to leave the sieve after the first success;
      it is problematical whether it is worth bothering with the exact
      distribution of c; the assumption of uniform distribution leads to
      sqrt(1/2) in place of sqrt(4/3pie),and makes the program slightly
      easier to write.


     6.4^ The Ostensible Factorization of P by ECM.

        The method of 6.3 is certain to succeed,but is not feasible once r and
      s reach about 10**6,suggesting that P of 25 decimal digits is the limit.
      For P between 25 and 35 decimal digits(with decreasing chance of success
      as P increases)we may have recourse to the following method. Find a point
      A on some twist of E,and raise A to the usual "Large Product of Small
      Primes" power L(however,the algorithm should perhaps keep A**(iL/10),
      for i =1 through 9,even though this might normally be slightly inefficien
t.)
      A**L may itself be the identity. If not,we perform a second stage as in
      SPAR using sort and match,and perhaps find a large(often prime)B such tha
t     A**(L*B) is the identity. It is now quite easy to disentangle the exact
      order of A,which almost always defines the group order uniquely. We
      have a second chance(but no more)if we fail the first time,using a
      different twist of E with different quadratic character.

    6.5 ^ The Method of Schoof.

      Schoof's beautiful paper (Math.Comp.1985) is so well known that I do not
      here summarize it. It is an effective polynomial time algorithm,and has
      many elegant corollaries. I did in fact attempt to implement it in 1987,
      and was led to the method of the present paper,but I do not regard
      my method as any kind of "implementation" of Schoof. My Chinese
      grandfather here is rather Lehmer,with his Delay Line Sieve,who had
      partial information modulis an arbitrary number of Chinese primes,
      but a problem in its effective application.  Naturally the idea of
      using Frobenius with auxiliary primes q I learned from Schoof. More
      detailed comparisons are best left until after my method has been
      fully described.

    7 ^ Determination of the number of points on E modulo P.

      The method we now describe is almost certain to work(though I cannot
      prove this)for P up to about 50 decimal digits. From there up to 65
digits(my current record),the chances of success decrease,or strictly speaking
      an excessive time is needed in "unlucky" cases. We suppose that by the
      methods described above in section 3,we have found a number of
      limitations on c of the form c congruent to (list) modulo q,for various
      primes q(perhaps including q=8 or 9 by special purpose programs.)
             For practical purposes,we collate this information in the form

              c  =  L3  (modulo M3)      (L3 is usually zero)

        c= one of LIST1(L1) (modulo M1)       (N1 values of L1)

        c = one of LIST2(L2)  (modulo M2)      (N2 values of L2).
    (We suppose now w.l.o.m.g. that L3=0). Write

          c = M3 * (M1*R2 + M2*R1),
    where we shall also suppose that

           M1*M2*M3 .gt.4*sqrt(P)          .

     There certainly exists an R1 with abs(R1).le.M1/2,and (since we have
      that abs(c).le.M3*M1*M2/2 )the corresponding R2 has abs(R2).le.M2   .
  The congruences for c modulis M1 and M2 are equivalent to congruences for
      R1 modulo M1 and R2 modulo M2.

           We now find some point A on some twist of E and form A2=A**(M3*m1).
      The N2 possible powers A2**R2 are found,and the penultimate words of the
      relevant X are stored together with the corresponding values of R2.Finall
y     the X-values are sorted by size and a key to the R2 values is retained.
          Next we form A1 =A**(M3*M2),and form successively the powers

      A**(P+1) * A1**R1, where R1 runs through  N1 possible residues modulo M1,
      and at each stage we match the penultimate word of X against our sorted
      list. When a match is found,the list word is wholly recomputed using the
      key and a further check made for a true match on X,which implies a match
      on +/-Y. (As in section 6.3,a number of bogus matches is to be expected,
      perhaps two or three hundred in a run of several hundred thousand
      comparisons,but not enough to cause any serious inefficiency.) Once the
      match is found,we have essentially "won",and revert to section 1,or
      possibly carry out the search to the end as previously described.

 7.1^ More Technical Details .

 7.1.2. ^  The Sorted List.
          The preference here(for "large" P of 40 or more decimal digits;
      everything is quite fast for less)is to make M2 as large as possible
      (but of double IBM word size,i.e.,under 2**64)for a given N2,by using
      for M2 a product of those q which have very few possible values of c.
      In that case the gaps between successive values of R2 are large,and
      we speed up the computation of A2**R2 by the following device.
           Choose a parameter SIZ such that 2*SIZ**2 exceeds almost all the
      successive differences of the R2(SIZ =200 is often suitable). Form a
      table of A2,A2**2,...,A2**SIZ,and A2**(2*SIZ*L),L=1,SIZ .
        Then almost every power A2**R2 can be computed from the previous one
      by  two multiplications(of points on the elliptic curve); recall that
      division is multiplication by (X,-Y). The very few large gaps are
      computed by applying the usual power algorithm.
           Since we always have +/-c mod q together in the method,a value R2
      occurs if and only if M2-R2 occurs. Thus we apply the above only to
      the R2 under M2/2; after that we complete the table at the cost of a
      single division(the point A2**M2 having been worked out once for all.)

   7.1.2.  The Matching List.
         We find it expedient here to write M1 as a product M1K1*M1K2,and
      to rewrite the typical residue R1 as LISTK1(K1) +M1K1*LISTK2(K2),
      where the residues LISTK1 modulo M1K1 form an outer loop,and the
      residues LISTK2 modulo M1K2 the inner loop. Both our M1K1 and M1K2
      are currently single length,and  certainly we require M1K2 to be so.
      We prefer there to be a large number of residues modulo M1K2,so that
      the possible differences betweeen successive ones are limited in number
      to at most 2000.
              For different LISTK1 in the outer loop,the precise residues in
      the inner loop are different. However,their differences modulo M1K2
      form a small set. Thus before commencing the double loop,we prepare a
      table of our relevant point raised to the powers of all these differences
..     Then in the inner loop,each successive power is obtained by a single
      multiplication of points on the elliptic curve.


    7.2 ^  The Global Strategy .

      It is now clear that we require a set of values of q ,with their
      corresponding numbers of possible c,which satisfy two criteria. First,
      the product of all the q must be at least 4*sqrt(P). Second,the product
      of the numbers of c for all these q must not exceed the time and storage
      constraints of our sort and match. For the sorted list,if we have N2
      distinct residues,we need storage of 3*N2 words(list plus double word key
      ). One might send off the key to secondary storage,but it is a mild
      nuisance. At present,N2 of around 300000 is the most that I use.
          With the two-tier matching list,the critical constraint is the
      time; so long as the inner loop has about 10000 residues it is very
      efficient and needs only moderate storage for the powers of differences
      table. But the outer loop is there,and the number of residues it
      contains is a multiplier of the total time.
             The entire program is now automatic except for the
      assignment  of q between M2,M1(and M1K1 and M1K2). As we proceed
      with the computation of q,any q with c=0 is automatically assigned
      to M3,and is pure good. Otherwise,we rank the qs by the ratio
      (q/ number of c),and as soon as the product of the leading q
      exceeds 4*sqrt(P) and the  product of the numbers of c is less
      than some bound depending on P(with 50 digits,we might carry on
      a bit to make a shorter second stage; with 70,we are only too glad
      to have found something),and on the storage and time constraints
      of our program,we decide to go into the sort and match stage.
           At this point,a hand assignment of q is made.

 7.3 ^  Prospects.
      As things stand,it is quite feasible to do all q up to 200 for a 100-
      digit prime in about 4 hours. The apparent restriction of information
      (measured as the product of (possible c / all c ) modulo (product q))
      is enormous. For every q it is less than 1/2,and usually much better.
      But there seems to be no good way of using this information effectively.
              In primality proving one can in effect show very quickly for
      numerous q that any prime factor of a putative prime is congruent to
      one of (q-1)/2 values modulo q,but the information becomes unwieldy
      and the Delay Line Sieve to find the least number satisfying the
      conditions is needed,and is not effective beyond a certain size. The
      situation here is analogous.
              Of course Schoof's method does not have this drawback,and
      actually knowing c uniquely modulo each q gives an unique answer
      as soon as we have enough q. But the price he pays for this is
      (in effect)that he must work on the principal congruence subgroup
      Gamma(q) versus our GZ(q),and deal with equations of degree q**2
      versus our (q+1). The equivalent computer effort that we need to
      get up to q=251 would only give at most q=17 for Schoof's
      algorithm.
                These remarks are in no way intended to detract from a
      very remarkable piece of mathematics,merely to point out that with
      our programming skills his method does not work(and,by all accounts,
      the programming skills of our complement in the world do no better.)
            However,a combination of the two methods might yield another
      10 decimal digits;we could start off with an unique c modulo
      8*9*5*7*11*13*17,and then use the method above with a much bigger M3.
       It may thus be hoped that someone will now be enterprising enough
    to program Schoof's method up to,say,17; the fact that by themselves
      the results would have no practical application need no longer
      act as a deterrent.

     8. ^ Various appendices and numerical data.

     8.1 ^   Examples(q=173 and q=257) of SSE-derived functions,showing the
      laundry process. Notice that the two Fourier series have the same
      coefficients(though a different leading power of x)for a large number
      of terms. Why is this? (Ramanujan worshippers please note what the
      master did himself,not the fumblings of his apostolic succession.)

Q= 173 FUNCTION X**-  4(1,   1   1   2   3   3   5   6   8  10  13
  16  21  25  31  38  47  56  69  82  99 118 141 166  26  59
 102 148  32  92 168  76   1 108  62  28  22  28  68 129  56
   8  12  47 143 110 150  71  86 166  12 120  14  16  12 140
 122  99 141  45  69 170 103   6 153 162 163 105 136  42 152
  83  21  88 154  12  61  98  53  56  76  93  97  86 111 158
 169 164 120  82 119 106  17 127 130   0  74 165  46 134 122
        (multiply by eta(tau) *eta(173*tau)  )
Q= 173 X**( 13/  4)*(1,   0 -1  0  0 -1  0  0  0  0  0  1  0  0  0
  0  0  1  0  0  0  0  0  0  0 -1  1 -1  0  0
  0  1  0  0 -1  0 -1  0  0  0  0  0  0  0 -1
  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0
  0  0  1  0  1  0  0  0  0  0  0 -1  0  0 -1
  0  0  1  1  0 -1  0  0  0  0  0  0 -1  0 -1
  0 -1  0  0  0 -1  0  0  0  0  1
        (above in sparse notation: coeff*x**(power)/4)
 -1(   21) -1(   33)  1(   57)  1(   81) -1(  113)  1(  117) -1(  121)  1(  137)
-1(  149) -1(  157) -1(  189)  1(  209)  1(  213)  1(  261)  1(  269) -1(  297)
-1(  309)  1(  321)  1(  325) -1(  333) -1(  361) -1(  369) -1(  377) -1(  393)
 1(  413)
LAUNDERED            1       1       2       3       3       5       6       8
      10      13      16      21      25      31      38      47      56      69
      82      99     118     141     166     199     232     275     321     378
     438     514     595     693     800     927    1066    1233    1412    1625
    1859    2132    2430    2780    3161    3603    4089    4648    5261    5968
    6740    7624    8597    9702   10915   12295   13807   15519   17399   19517
   21843   24462   27331   30551   34087   38040   42374   47219   52524   58437
   64917   72120   80009   88770   98352  108971  120593  133444  147494  163019
  179976  198680  219111  241605  266160  293173  322630  354992  390279  428987
  471161  517389  567719  622817  682785  748355  819674  897598  982286 1074722
 1175150 1284647
      (end 173  begin 257)
Q= 257 FUNCTION X**-  6(1,   1   1   2   3   3   5   6   8  10  13
  16  21  25  31  38  47  56  69  82  99 118 141 166 199 233
  18  65 122 183   2  84 182  34 162  46 214 139  97  78  97
 143 240 115  49  31  84 196 144 162  34   1 100  57 181 189
 139  12 124 199  56 178 140 180 128 222  63 134  52  55  24
 202 243 123  36 235 153  48 175  12 105 217 145 186 197 212
 147  79 216 147 173 151 200 240 199  47  93  76 172 215 220
      (multiply by eta*eta(257))
Q= 257 X**( 19/  4)*(1,   0 -1  0  0 -1  0  0  0  0  0  1  0  0  0
  0  0  1  0  0  0  0  0  0  0  0  0 -1  0  0
  0  0  0  0  0  0 -1 -1  1  0  0  0  0  1  0
  0 -1  0  0  0  1  0  0  0  0  0 -1  0  0  0
  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0
  0  1  0  0  0 -1  0  0 -1  0  0  0  0  0  0
  0  0  0  0  0 -1  0  0  0  0  0
        (sparse notation)
 -1(   27) -1(   39)  1(   63)  1(   87) -1(  127) -1(  163) -1(  167)  1(  171)
 1(  191) -1(  203)  1(  219) -1(  243)  1(  263)  1(  267)  1(  323) -1(  339)
-1(  351) -1(  399)
LAUNDERED            1       1       2       3       3       5       6       8
      10      13      16      21      25      31      38      47      56      69
      82      99     118     141     166     199     233     275     322     379
     440     516     598     696     805     933    1074    1242    1424    1639
    1877    2153    2456    2810    3199    3647    4143    4710    5336    6055
    6844    7744    8739    9866   11108   12517   14067   15816   17745   19913
   22301   24985   27934   31237   34875   38935   43398   48379   53847   59933
   66618   74040   82185   91221  101124  112088  124109  137391  151935  167996
  185566  204934  226120  249435  274919  302943  333541  367143  403826  444055
  487933  536018  588424  645784  708275  776596  850974  932232 1020623 1117094
 1221993 1336363

   8.2 Examples of application of method.

   8.2.1.  Smallest prime of 65 digits(copy of notice in NMBRTHRY net.)
       (February 21st,1991)
    In early 1988 I found a method for finding the number of points on
an elliptic curve modulo a prime P,and gave an example where P had 44
decimal digits; this was described and distributed in a 9-page memorandum.
  Recently,without any essentially new ideas,but applying greater technic
-al competence at each stage of the method,I have found that:

      The number of points on the INRIA curve  y**2=x**3 + 105*x + 78153

  modulo the smallest prime of sixtyfive decimal digits,which is 10**64+57,

  is equal to  P + 1 + C,where C is MINUS the following

                      179  72908 11790  97384 74641  25900 78222    .

     The method of finding this number leads to its own proof of correctness
(" only one possible number between P + 1 +/- 2*root(P) "),but two more direct
proofs can be obtained by factorizing the numbers of points on the curve and
on its twist modulo P. The latter number is  120 times an easily
provable prime; for the former one must work harder,and I am indebted to
Francois Morain for using his quadratic sieve to factorize it for me(since
at the moment the authorities will not give me enough storage to use SPAR);
the factors are 92,

                       27  70973 85393  30085 85690  21436 13511,
       and

                        3  92265 16706  26008 12655  88031 99403.
      ***********************FINIS  *************************************

               *******************************************
    (Q=701 extrapolation solution)
Q= 701 FUNCTION X**- 21(1,   1   2   2   4   5   8  10  14  18  25
  31  42  52  68  84 108 133 168 205 256 311 384 463 567 681
 125 287 489  14 294 607 289  19 540 426 430 524  67 435 296
 321 596 387 498 197 302  83 384 474 521 503 618 147  31 259
 413 501 153  98  35  11 500 188 335 373 269 208 284  63 481
 494 527 344 579 453 142 448 452 469 640 160 279 528 571 333
  67 212 309 698 365 358 654 389 687 337 458 382  48 323 246
      **** multiply by eta * eta(701)*******
Q= 701 X**( 33/  4)*(1,   0  0 -1  0  0  0  0 -1  0  0  0  0  0  0  0  0  1  0
   0  0
  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0
  0 -1  0  1  0  0  0  0  0  0  1  0  0  0  0  1 -1  0  0  0  0
  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  1  0  0
 sparse notation
 -1(   45) -1(   65)  1(  101)  1(  141) -1(  201) -1(  261) -1(  289)  1(  297)
 1(  325)  1(  345) -1(  349) -1(  405)  1(  425)
LAUNDERED            1       2       2       4       5       8      10      14
      18      25      31      42      52      68      84     108     133     168
     205     256     311     384     463     567     681     826     988    1190
    1416    1696    2009    2392    2823    3344    3931    4636    5431    6376
    7445    8708   10135   11812   13706   15920   18423   21332   24618   28424
   32720   37674   43264   49688   56928   65224   74565   85234   97239  110911
  126278  143740  163344  185564  210488  238675  270258  305905  345801  390741
  440992  497490  560593  631427  710457  799018  897733 1008180 1131161 1268561
 1421396 1591910 1781401 1992521 2226904 2487719 2776994 3098487 3454740 3850201
            Berlekamp and Lehmer Factorization of Polynomials Modulo q

                            A.O.L.Atkin

      To simplify our exposition,we shall make some unessential assumptions.
            Consider a polynomial f(x) without repeated factors,of degree n,
      monic. Our finite field will be the prime field modulo q(prime),where
      q is large in that finding roots of an equation mod q by trial and
      error is not feasible,but not very large in that storage becomes a
      problem. Typically we are interested in "Chinese" size,q just less
than positive single word size,thus under 2^31 and often for convenience
      under 2^30 on IBM machines. Our first task is to compute

              x^q   -   x      (modulis q,f(x))

      which is done by the usual power algorithm left to right. Multiplication
      by x is trivial; squaring is best done with an auxiliary table of powers
      from x^n to x^(2n-2) as linear combinations of 1,x,x2,...,x^(n-1).
      (If multiplication mod q is slower than squaring mod q further considerat
      -ions apply; see my Ellpts Essay). Then

                  gcd(x^q -x ,f(x))

      will  yield the linear factors of f(x) as a block. If what remains has
      degree.gt.3 we proceed further with the leftover part,which we rename as
      f(x),of degree n.
          We now create a matrix of the powers x^q,x^(2q),...,x^((n-1)*q) ,
       expressed as linear combinations of 1,x,x2,...,x^(n-1). f^q requires
       only a reduction from the previous computation. We form the remaining
          powers in the order(e.g.,for n=37) ::  2,4,8,16,32,3( as (1+2)^2
      -2 -4 divided by 2),6,12,24,5( as (2+3)^2 -4 -6 divided by 2),10,20,
      7,14,28,9,18,36,11,22,13,26,15,30,17,34,19,21,23,25,27,29,31,33,35,and
      finally 37 which needs a multiplication. At this point the methods
      diverge substantially,and we describe Berlekamp's first.
            It should first be said that much of the original literature
      assumes q small(especially 2),and n relatively large; plainly one
      must modify some of our algorithms in this case. Berlekamp finds the
      nullspace of this (n-1)-by-n matrix,in the form of a basis consisting of
1 and polynomials B1(x),.,Br(x),which nullspace we call the"Berlekamp polynomial
       B(x) with the property that B^q = B (modulis q,f(x))and 0.lt.deg(g).lt.n
           If r=0,so that the matrix has rank n-1,then f(x) is irreducible.
      Otherwise f(x) is the product of (r+1) distinct factors. Observe that
      the polynomials B form not merely a vector space,but a ring(with
      multiplication defined modulis q,f(x)),and for fixed B(x) and each
      factor fi(x) of f(x) there is a constant c such that  fi(x) divides
      B(x)-c. The product of all (B(x)-c) must be zero modulis q,f(x); but
      it is also a polynomial in B of degree.le.r+1(usually=) ,which can be
      found directly from an (r+1-by-n) matrix of B,B^2,...B^(r+1),as linear
      combinations of 1,x,...x^(n-1). Thus subject to FINDING THE LINEAR
      FACTORS OF A POLYNOMIAL MODULO q(to which we return later),we may
      choose some B(usually the Bj of least degree) and gcd all its
      appropriate (B-c) with f(x). If the number of c is less than (r+1),
      we must repeat partially with another B,&c.
                For a fair comparison with the other method,one should program
      a less theoretically elegant but practically faster algorithm using
      "first out",where the matrix is not processed fully to find its rank,
      but the first element of the nullspace is taken as B(x). This is especial
ly    advantageous when the number of factors is large(in the extreme  super-
      singular case,they all have degree 2,and this is known); but ad hoc
      methods may be needed to PROVE that all the factors found are irreducible
..     In a random case,the miniscule saving in matrix processing is not worth
      the bother involved.
              In the Lehmer method,we compute successively x^(q^a)-x,for a=2,3,
      ... For each a,we form the gcd with f(x),which necessarily extracts those
      factors of f(x) of precisely degree a. The powers are computed using the
      matrix(it is extremely inefficient to use the power algorithm for large
      q),which in general we do not modify even when a factor is found.We do
      however peel each successive gcd off f(x),and when 2*a exceeds the
      degree of the remaining f(x),it must be irreducible. Notice that in our
      gcd algorithm,which is likely to proceed as in Euclid with decreasing
      degrees,we can stop when our degree falls below a.
           The problem remains as to how to separate a block of factors
      all of which have the same degree a. This is of infrequent occurrence
      in a random factorization,and usually the number of factors is small.
      We perform a subsidiary Berlekamp on the block,but with the advantage
  that we already know some members of the Berlekamp nullspace,namely the
      symmetric functions of x,x^q,x^(q^2),...x^(q^(a-1)),which have already
      been computed and (when potentially needed)saved; usually the sum is
      sufficient to split the block.
          A refinement which I have recently(1988) adopted with some success is
      as follows. The time of the gcd is significant compared to the time of
      creating the next x^(q^a); thus for some values of a we defer the gcd
      and proceed directly to the next power. At the end,when 2*a exceeds
 the degree of the remaining f(x),we backtrack and clean up("Coda"). An array
      DEGLIS(a) keeps the degree down to which we must take our gcd algorithm
      at a(a=-1 conventionally says no gcd). The choice of DEGLIS is a matter
      of judgment; an example follows for degree n=101(the default value is
      DEGLIS(a)=a,and we exhibit only the other values):
      DEGLIS(a)=-1 for a=4,5,6,7,8,9,10,11,13,14,15,17,19,21,23,24,25
      DEGLIS(a)=b for (a,b)=  (12,4)  (16,8)  (18,9)  (20,5)  (22,11)
(26,13)  (28,7)  (30,15)  (34,17)  (38,19)  (42,21) (46,23) (48,24) (50,25).
         Thus three factors of degrees 4,6,and 12,would show up as a block
      of degree 22 at a=12; but they are easily separated using the (saved)
      x^(q^4)-x and x^(q^6)-x.
           In a typical example with n=99 we found factors of degrees 12,9,
      and 10,at stages a=12,18,and 20,respectively. The stages were continued
      up to a=34,at which point a "logic" subprogram determined what splits
      were still possible. For as the program proceeds,the array DEGLIS is
      changed; once a=12 is done,then factors of degree 4,6,and 12,are no
      longer possible,so we reset DEGLIS(a)=a for a=4,6,12. Now a search
      for all a with DEGLIS(a).lt.a may require some actual work,in the shape
      of an actual gcd with the saved x^(q^a)-x,or sometimes an inference
      from the implied cofactor.
                In fact at the coda stage we have a residual polynomial
      and a number of remaining a with DEGLIS(a)=-1. Any irreducible
      factor of our residual polynomial must have degree one of these a,and
      its cofactor must have degree in the semigroup of nonnegative addends of
      these a. The second condition is sometimes enough to eliminate cases
      without further work; otherwise judicious gcds are performed,and DEGLIS
      constantly updated.
                                 In the above example work was required for
      a=19,21,23,and 24. Thus compared with the standard method of taking gcds
      for each a,we saved gcds for a=4,5,6,7,8,9,10,11,13,14,15,but lost
      by carrying degree 9 for 9 stages and degree 10 for 10 stages,in our f(x)
..     It would be foolish to break one's head trying to assess these effects
      for a random split of n into degrees of factors of f(x),especially
      since they are machine dependent. Far better to run both programs,and
      let the machine tell which is better(we give some data at the end).
                    Whether or not one adopts our modification, it is
      certainly inefficient to recompute the matrix for a smaller f(x)
      except very near the beginning(I do it only if factors are found at
      a=2 or a=3). On balance the saving on gcds seems to exceed the loss
      on carrying a larger f(x) for longer in most cases,and of course
      for irreducible f(x) the saving is considerable.
               Finally,I have found that,even doing all gcds,Lehmer is
      faster than Berlekamp. However it is fair to say that an optimal
      matrix processing routine in which one proceeds with,say,10-by-10
      blocks and does scalar products in the registers before division
      by q might speed up the Berlekamp by 20%. A postfinal observation:
      since the first common part of the program is 60% or more of the whole
      in many cases,the overall saving provided by the modification may
      be only 5%.  But 5% seems a lot in the State of Illinois.

          Obtaining Linear Factors Modulo q .

      Suppose without loss of generality that we have f(x) of degree n modulo q
      and that it is known that all the factors of f(x) are linear. As before,
      we think of q as large,possibly Chinese of 10 digits,primality proving
      of 500-1000 digits,factorizing of 100-150 digits,&c. We describe below
      the principles; the exact algorithm depends on the precise size of q and
      the machine,which influence the relative cost of finding x^(approximately
      q) (modulis q,f(x)) and that of performing a gcd with two polynomials
      of degree n mod q.
            The first task is to find all the small factors of (q-1),where
      "small" may be up to 1000 if q has 500 digits. If for example q-1=
      4*167*rough and n=9,we would form and store x^rough. Then gcd f(x)
      successively with x^(rough*167)-w,for w=i,-1,-i,1 mod q . (The cost
      of finding fourth roots mod q is small compared with the cost of
      polynomial operations.) With reasonable luck we should now have split
      f(x) into at most quadratic pieces,which can be finished using only a
      square root algorithm(and our previously found i becomes useful again.)
      I have in fact got a special purpose n=3 and n=4 routine,using Tartaglia
      and Euler,but these are probably not worth writing in most contexts.
          If however there remains an unfactorized residual polynomial,say of
      degree 5,it would be a mistake for large q to do what many textbooks
      recommend,namely to form y^((q-1)/4) for various translates y=x+small
      (some authors actually recommend using (q-1)/2 only!). For this
      involves a repeat of the longest part of the algorithm. It is better
      to gcd with x^((q-1)/167)-w,where w runs over 167-th roots of unity
      modulo q,with various options as to one's action after partial success
      (depending on relative cost of square roots & expected remaining gcds).
 Obviously whether it is specifically worth using 167 depends on n and q and the
machine; it is a nice exercise in probability. But my observation has been
      that the "intuitive" programmer,including myself,does not calculate
      this strategy precisely unless forced to do so.


        Some Timings


     Comparison of Berlekamp and Lehmer(Unmodified)

         (Modulo 2^30-35  on the IBM 3090 with previous "slow"processor)

      Degree of Polynomial  Time in " for  Berlekamp   Lehmer    Split
                            x^q & matrix      ending    ending
       125                   2.63            2.00      1.29      71,34,14,2,2,2
       124                   2.50            1.88      1.44     102,13,5,3,1

       225                  14.44            9.08      6.73     153,45,26,1
       224                  14.48            9.16      8.73     200,13,6,5

       350                  58.78           45.41     25.78  125,105,83,15,6,3,
                                                                            3,1

      Comparison of Lehmer(unmodified)   and  Lehmer(modified)
      (modulo 2^30-35 on IBM 3090 with fast processor June 1991)

      Degree  x^q & mat      Unmodified      Modified       Split
      170       2.44          1.157          0.834         13,47,110
      169       2.37          0.854          0.923         1,31,37,41,59
      168       2.30          1.378          1.125         1,1,2,3,3,15,143
      167       2.30          0.799          0.798         1,6,7,17,21,31,84
      166       2.27          1.360          1.135         29,137
      165       2.19          1.186          0.888         1,2,44,118
      164       2.17          1.057          0.816         1,2,7,8,10,22,114

       Only in one case does the modification lose,when the extremely "unlucky"
      decision to have DEGLIS(a)=-1 for 31,37,41, involved that these factors
      were all found in the Coda. On balance,it is clearly superior,and there
      remains the possibility of improving the actual  DEGLIS strategy.