here is(1)a corrected elliptic curve # points(2) remarks on factzn The Number of Points on an Elliptic Curve Modulo a Prime by A.O.L. Atkin (This is an expanded and improved version of a 9-page ms dated January 1988. ) 0.^ The elliptic curve we consider is E: y**2=x**3 + A*x +B (modulo P) (0) where P is a (large) prime,certainly not 2 or 3,and usually bigger than any auxiliary prime q,where currently 2.le.q.le.311 . We summarize briefly some routine material. ^1. Matters of Proof. In the method we finally describe,it is often convenient not to prove some of the intermediate steps; we arrive at a putative number of points, certainly correct,but legally speculative.Thus in this section we address the problem: Prove that the number of points on (0) is P + 1 + c, where c is given. (1.1) First use trial divisors on P + 1 +/- c(simultaneously) up to some moderate bound B(perhaps 10**7,depending on the size of P). The general area of interest here is with P from 30 to 100 decimal digits,so after each (found factor AND 10000 primes)a Selfridge may be performed. This will (in the worst case-often better with obvious consequences)leave us with P +1 + c = F1 * U1 , P + 1 -c = F2 * U2 , where F1 and F2 are (factorized) products of known primes,and U1 and U2 are composite numbers with no prime factor less than B. We also have (F1,U1)=(F1,U2)=(F2,U1)=(F2,U2)=(U1,U2)=1, of which the last needs a gcd; and we hope that (F1,F2) is rather small. Now find + and - twists of E and points (a1 and a2) on them. By the usual techniques,if necessary repeating with a new point/twist,we expect to show that there is a subgroup of the group of points on E mod P which has order F1*X1 on E,where X1 divides U1 and is not 1,and similarly an order F2*X2 on E' ,the twist of E by (*/ P). If the true numbers of points on E and E' are P + 1 +/- C,then we have shown that (C-c) is divisible by lcm(F1,F2) *X1 *X2 . But the absolute value of (C-c) is less than 4*sqrt(P),so that our c is certainly proved correct when this is less than our proven divisor. In particular,since X1 and X2 are both grater than B,we have that lcm(F1,F2)*B*B . gt. 4*sqrt(P) (1.2) is sufficient to prove that c is correct. We can improve on this with rather more work. Let us ostensibly factorize P by the elliptic curve method,using a first stage with a1**F1, and a second stage up to B2(.gt.B) . Notice that,since P is prime and we can identify a match without a gcd,we may use the Shanks baby-giant steps with sort and match,as in Shanks's quadratic forms factorization method. Using the modification introduced in SPAR saves a factor of 6 or 7,and it is quite fast to work up to B2=10**10,and 10**12 only takes a few minutes. This will either produce a factor of U1,or else prove that X1 is greater than B2. A similar computation on E' will prove that X2 is greater than B2,and we can now replace B by B2 in (1.2) above. In the unfortunate (and unlikely) event that these expedients fail, we resort to ECM and possibly sieve factorizations of U1 or U2. Factorization takes considerably longer than the techniques described above. Of course,the congruences modulis the small primes q described below are in many cases proved without reservation,and so give rise to further conditions on (C-c). The difficulty is that knowing (C-c) is,say,0 or 4 mod 13 and divisible by an UNKOWN large prime,is unhelpful: 0 is all right,but not 4. 2^. The Construction of Modular Equations 2.1.For our present purpose,we take q to be an odd prime,and seek an algebraic equation connecting j =x**-1 + 744 + 196884*x+... (x=exp(2*pie*i*tau)) and some suitable function f on the subgroup GZ(q) ("gamma sub zero q"). We prefer this function to have as low a valence v as possible,so that the equation has degree (q+1) in f and degree v in j. One canonically available function is f= eta**s(tau)/ eta**s(q*tau) ,where s is the smallest even integer such that 24 divides s*(q-1). Unfortunately its valence is unreasonably large when q is not congruent to 1(mod 12). Another less numerically canonical function is the function of lowest valence v on GZ*(q) =(GZ(q),W(q)),where W(q) is the involution (0,-1 ; q,0). This has valence 2v on GZ(q),of course,but v is usually very small,since GZ*(q) almost always has a Weierstrass point at infinity(I conjecture that this is so for q.gt.389). My own package has a series of special purpose routines for different q,but there is one universal method, involving hypergeometric series and Berlekamp factorization,which gives total conviction but no proof. The learned reader will recognize supersingular equations,Hasse invariants,&c,in what follows,but I shall not allow myself to start explaining anything. (for seq: see Weber III or author's 1975 Michigan/Japan talk) When q=1(mod 4),consider the Gauss hypergeometric series F(1/12,5/12; 1; 1728/j) , (which happens to be the fourth root of E4). Modulo q this series is congruen t to H(1/j)*H(1/j**q)*H(1/j**q**2)* .... (for this & more,see Carlitz) where H has degree d; then j**d *H(1/j) is a monic polynomial SSE(j) of degree d= approx. q/12 . When q=3(mod 4),we employ F(7/12,11/12; 1; 1728/j) similarly. SSE(j) has only nonrepeated linear and quadratic factors. Remove the linear factors to produce SSE*(j) of degree 2g,where g is the genus of GZ*(q),and SSE* has precisely g quadratic factors modulo q. We do NOT wish to find these,however. It is easy to prove that every function with a pole of order m at infinity and leading term k*x**-m+...on GZ*(q) is congruent modulo q to a polynomial in j (i.e.,the corresponding Fourier coefficients are termwise congruent). This polynomial has the following property: it has constant remainder(no j-term)when divided by ANY of the quadratic factors of SSE*. That is to say,the (infinite)ring of such polynomials is precisely that of the Berlekamp polynomials in Berlekamp's factorization method. Thus we ostensibly factorize SSE* by B.,but never proceed to find the constants and the factors;we stop when the Berlekamp matrix has been echelonized,and now the (columns as I do it)are the polynomials in j(and,though we do not need it here,the rows *dj are the differentials)modulo q. There is however one caveat which I have never been able to overcome (nor,I believe,have Serre and many other people). If one finds that,in terms of x,the polynomial in j of smallest degree begins(for example) x**-3(1,1,1,2,2,2,3,3,4,5,6,7,9,10,.....) , it is tempting to suppose that one has PROVED that 3 is the smallest possible valence on GZ*(q). We are in fact sure,and even that 1,1,1,2,.. &c are right over Z,since how could it go 1,1,1+179,2-2*179,...? But in theory we may have been misled. The least degree function over Z (over C directly,but deep theorems of Shimura imply over Z)may be x**-4(179,1,1,1,2,...) and of course a compensating differential may start x**3(179,1,-1,...). Thus the "mod q gap series"may not be the true gap series over C at infinity. I have of course checked numerous cases,and it always has been,but the annoying theoretical possiblity remains that they may differ. However,it is clear that the mod q situation is more Weierstrassy than the "true" one,so that when we have no WP mod q,we have no WP a fortiori over C. In this way(using the full SSE)one can show that infinity(and hence also 0)is never a WP of GZ(q). ( There are many other interesting corollaries and issues related to the supersingular equation,which we defer.) Now even if we are prepared to forgo proof,we cannot afford to be wrong. The speculative inference of an exact function x**-21(1,1,2,2,4,5,8,10,14,.... from its reduction modulo q for,say,q=701,starts off very securely,but what happens as the presumed terms become larger? Our position is analogous to that of the tax evader,led step by step to wilder and more extravagant lies and guesses until finally the IRS gives him a wrong coefficient. Our remedy should be sought,not from the ineptitudes of Government,but from the technology of organized crime,in the process LAUNDR described in the next paragraph: for a small commission we can have our coefficients laundered. The reader might wish to try her hand at the following. We can be sure that the 125 after 681 is really 826,the 287 998,&c,but at some point discrepancy is inevitable. The'inevitability'of Bach's next chord given the last chord and the next note of the chorale does not enable us to write his works ourselves. Q= 701 FUNCTION X**- 21(1, 1 2 2 4 5 8 10 14 18 25 31 42 52 68 84 108 133 168 205 256 311 384 463 567 681 125 287 489 14 294 607 289 19 540 426 430 524 67 435 296 321 596 387 498 197 302 83 384 474 521 503 618 147 31 259 413 501 153 98 35 11 500 188 335 373 269 208 284 63 481 494 527 344 579 453 142 448 452 469 640 160 279 528 571 333 67 212 309 698 365 358 654 389 687 337 458 382 48 323 246 (Solution at end of this essay) Assume now that the valence v of our function f is less than (q+1)/24,as is almost always the case(163 is the biggest exception.) Then f*eta(tau)*eta(q*tau) is a CUSPform of weight one and multiplier system (*/q)*some 12-th root of unity ,and hence by Serre-Deligne of prescribed type and with very small coefficients after we subtract if needed some constant multiple of eta(tau)*(eta(q*tau ). These can be identified with certainty(and are almost always based on definite quadratic forms). Now modulo any of our Chinese primes we can divide back by our weight one multiplier,to get a laundered or sanitized function. Now I have to confess that I do not do this myself. For once one has used the SSE and the above to identify the function as a ratio of forms of weight one,the numerators can be readily identified as binary theta series,and the algorithm expressed in that form. This is also much faster than computing j modulo q out to several thousand terms; however the POWMAT and GEGAJ described below take so much longer relatively that the cost of computing the function is immaterial. 2.2^. The Modular Equation(Star Type) (routine GEGAJ) We have an algebraic relation Sigma a(r,s) . f**r . j**s =0 (0.le.s.le.2*v and 0.le.r.le.q+1) For fixed f,both j and j(q*tau) are among the solutions;for fixed j, the solutions are f and f((tau+l)/q),l=0,q-1. If we plan most of the work using j(q*tau),we see that a(r,s) multiplies a power series with leading term x**-(r+q*s). Two such terms are only equal if r=r'(mod q). Also the coefficient of f**q as a polynomial in j is plainly equal to -f +constant +O(terms in x and higher powers),which enables us to commence our equation as f**(q+1) + f**q *(-j**v +...+const) + f**(q-1)*...... =0 : This start achieved,we have r=r' and s=s' in the above,so that every remaining coefficient can be found by solving echelon-form using f**r and j(q*tau)**s. For this we need all the series f**r (1.le.r.le.q+1) up to(at most) q*(2v+1) terms; unfortunately the succesive use does not enable us to save much on storage here. POWMAT is a general purpose program which produces all powers from 1 to k of a given power series up to N terms in each power. It uses the fast FFTLIB which is part of LMA4064V,and is too long to describ e here. Basically we find d of size sqrt(k )+a little,and get the powers from 1 to d and store their FFTs. Then successively the FFTs of f**2d, f**3d,...,are found,and each is used to find d powers of f using only one inverse FFT per power. This is often quicker than the subsequent echelon solving. The latter could be speeded up by the usual "block"technique,but in the present case that was too much of a pain to program. 2.3^. The Modular Equation (Eta Type) (routine GEGEN3) Here we have f=eta**s/eta(q*tau)**s of valence v,say,and j. Again, the involution shows that q**(s/2) / f and j(q*tau) satisfy the same relation as f and j; this would enable a"both ends to the middle" type of echelon solving to be applied. But we chose instead to use the relation between j and g=f(tau/q): for fixed j the roots are f((tau+l)/q),l=0,q-1, and q**(s/2)/ f, and we then used routines for multn/divn by eta series,for powers of sparse coefficient series,and Newton's formula. 3^. The Use of the Modular Equation Our indifference above to the precise function used to define GZ(q) with j will now be explained. Given our original curve E modulo P, we compute the usual j modulo P as j = 6912*A**3 / ( 4*A**3 + 27* B**2) Substitution of this value of j in the modular equation yields an equation f**(q+1)....=0 of degree q+1 modulo P. (In practice,we leave the modular equation as a set of equations modulis enough Chinese primes,and reduce modulo P at this point.) We are concerned ONLY with the DEGREES of the irreducible factors of this last equation modulo P,and not with the factors themselves. More precisely,recalling always that q and P are distinct,we have: (Definition: the "split" modulo P is the permutation of (q+1) letters obtained by treating each irreducible factor of degree d as a cycle of length d;"alternating" means split in A(q+1);"symmetric" means split not in A(q+1). ) (3.1) If (P/q)=1,the split is alternating. (3.2) If (P/q)=-1,the split is symmetric. If the number of points on E mod P is P + 1 + c,and the roots of t**2 -t*c + P =0 (mod q) are a and b,then: (3.3) if a=b(or c**2=4*P mod q),then the split is (1q) or (111..11) (3.3.1) The last case (all factors linear) occurs only when c**2 -4*P is divisible by q**2. Now assume a and b are distinct mod q. (3.4) If a and b are in the field of q elements(so that (c**2-4*P)/q) =1),then the split is (11rrrr..rr),where rs=q-1 . (r.gt.1) (3.5) If a and b are not in the field of q elements(so that (c**2-4*P)/q)=-1),then the split is (rrrrr.rr) with rs=q+1.(r.gt.1) In the last two cases,r is the EXACT ORDER of (a/b) in the fields of q,q**2,elements,respectively. Conversely,splits as prescribed above imply the appropriate conditions on a and b,and hence on c. Note that we cannot distinguish +/- c by these criteria,which is to be expected since we have only used j mod P,and not A and B, and j is invariant by the twist of E. The number of possible c mod q remaining is equal to (Euler)phi(r). This is always less than q/2 and in some cases we hope to find very small values of r. Computations with Polynomials Modulo P For large degree (q+1) and large P(here perhaps up to 100 decimal digits)we run into serious time. Thus efficiency is important,even constant improvements like 20%. Our prime P consists of P(1) 32-bit words starting at P(2). For degree N,we have three kinds of object,each represented by a string of N*P(1) words: the monic polynomial x**N+ a(1)*x**(N-1)+.. ...+a(N),the same as x**N=a(1)*x**(N-1)+...+a(N),and finally the expression a(1)*...without the x**N. Our first task is to form x**P-x modulis (P,f(x)); for this we do the usual left-to-right square and multiply. "Multiply" is trivial(only by x). For squaring we construct an auxiliary table of powers x**N,...x**(2N-2) modulis (P,f),in an (N-1)by N BLNMAT matrix of P(1)-word numbers. We also add an N-th row(see below), which requires a total storage of P(1)*N*N words. To square a(1)*x**(N-1)+...+a(N) we proceed in two stages. First,we have to square to get a polynomial of degree 2N-2(FFT is pretty useless here). Assuming that the basic number multiplication is about twice the squaring time for P(1).gt.4,say,it is beneficial to regard a(i)*a(j) as ((a(i)+a(j))**2-a**2(i)-a**2(j))/2 ,once N is over about 10 or so. The same remark applies to the next stage (essentially forming a vector as vector*matrix): we have to find N dot products,and at the overhead cost of working out 1 extra sum of squares(N were already done in BLNMAT),we halve the dot product time. At this point,the gcd (x**P-x,f) is trivial,and after removal of the linear factors we need to contemplate x**(P**a) - x . We know that at some point(namely a=r)this will suddenly vanish. P is so large that exponentiation with square and(now long)multiply would be absurd. We therefore form another matrix BLQMAT which contains the powers x**P,x**2P,...,x**((N-1)*P),as polynomials of degree.le.N-1 in x (and again an N-th row for sums of squares.) This is done by the usual "square and halve and square again" algorithm for forming a list of powers,when FFT is not relevant, and when squaring is cheaper than multiplying. Briefly,we compute powers as follows 1,2,4,8,16,3,6,12,24,5,10,20,7,14,9,18,11,22, 13,15,17,19,21,23,(for N-1=25),with only the 25-th power needing a multiply. For,e.g.,3= half( (1+2)**2-2-4),17=half((8+9)**2-16-18). &c. It should be said that all these devices have been programmed, and do lead in practice to substantial improvements. This is not computer science. The auxiliary storage used(which really needs to be in core for time efficiency)is in an array with a safe counter. When it is clear that BLQMAT(which needs BLNMAT)would overrun storage,a trap PNOMAT is set,and BLNMAT is not used,at the cost of greater execution time. (Note that the storage for computing the modular equation can now be thrown away. In a perfect world one would store all the modular equations on disk: in practice we check legally once-for-all that the equation is correct if we use w Chinese primes,and only store w for later purposes,recomputing the resiudes.) 5^. Some Examples of the Method before the Sieve. 5.1^. P= 1 87668 40755 53521 23529 , E is Y**2=X**3 + 105*X + 78153 q= degree of (x**P-x,f) Possible r Actual r List of c (+/-) 13 1 .... ... 3 17 0 3,9 3 1 19 2 2,6,18 6 3 23 0 2,3,4,6,12 6 5 29 2 4,28 28 1,2,5,7,8,14 31 2 3,5,15 15 9,10,13,14 37 2 2,3,6,9,18 9 5,10,15 41 0 3,7,21 7 6,9,11 For example,when q=37,the second column shows that we are in case (3.4 ). Since (P/37)=1,we need (rrr..rr)to be alternating,so that r=4,12,and 36,are impossible. At x**(P**9) we found our running x**(P**a) was reduced to x,giving r=9; the final c calculation is trivial. Note that in the cases q=29 and q=31 the value of r was the largest possible. In theory this can be deduced at the penultimate stage (when q=31 and a=5 does NOT give x**(P**5)=x,we know r must be 15: one can even stop after one non zero entry is found as we begin the a=5 computation.) However,once BLQMAT has been done,the cost of continuing to the end is relatively small,and any error in any single excluding list is so disastrous,that we often do so as a check. Finally notice that as a matter of global strategy,we are interested in retaining those q and corresponding lists of c for which phi(r)/q is small. Thus we might very well leave q=29 at a=8,since we know we do not wish to use the list of values of c even if we double check it. 5.2^. P = 10**99 + 17**2 E is Y**2 = X**3 + 105* X + 78153 (At present(May 17th,1991)there is no reasonable prospect that our methods will work for a 100-digit prime. The information modulis q can be gathered,but the sieve is not feasible.) For q=211(and in fact v=7 in (2.2))the modular equation has 2333 terms. We need 21 Chinese primes @ 29.5" each,for a total of 10.24',to compute it. It took 14.60' to find X**P-X,with a linear part of degree 2. At this point r=3,5,7,15,21,35,and 105, were still possible. BLQMAT and BLNMAT together would have needed storage beyond the 4096K available,so that we did BLQMAT without BLNMAT,at some cost in terms of time,which was 23.02'. In fact r=7 was now found after only 10.54" more,giving c= +/- 47,95,105 (mod 211) for total 48' CPU time on an IBM3090. (With this prime,the largest q for which BLQMAT was possible with BLNMAT was 197: the timings there were 357" for modular equation,12.71' for X**P - X,and 460" for BLQMAT. In this case r was 196,so that the check took substantial time,namely 302"). 6^ . Other methods. Before describing the sieve,sort,and match,we first discuss what else is available,and what are the criteria for using various method s. We continue to regard the operation of our group of points as multiplication; many parts of the program are the same for quadratic forms and finite fields,where multiplication is standard. 6.1 ^ Modular forms. If the modular form corresponding to E taken over Q can be identifi ed, then its P-th Fourier coefficient will be -c. Obviously this is an absurd way of getting one particular value of c for one P,if P is small(under 10**5),and an impossible way if P is medium or larger. But if one wants the values of c for all P up to some limit,it is probably the best method when available. 6.2^ Direct Computation of the Character Sum Although this takes time P,the implied constant is very good,and with treble differencing of x**3 +a*x +b,double differencing to create a table of the Legendre symbol,and careful use of all tricks(e.g. on the IBM 3093 one can difference y**2/2 down to zero using BCTR,&c),the time is only a few milliseconds up to P=10**5. 6.3 ^ Baby-Giant Steps as in Shanks. In what follows r and s are integers to be suitably chosen later; r*s should be minimally greater than sqrt(P). Find a twist +/- of E with an initial point A. Form and"store"the powers of A from 1 to r. If r*P is small enough,this presents no problem. If storage becomes tight,the following technique is used(we shall not repeat thse remarks in section 7,and other analogous places.) We store the penultimate word of X,where A**r=(X,Y). We thus have an array of r 32-bit words. We now perform a keysort(very fast) on this so as to make the array in ascending"logical"order(i.e. as unsigned 32-bit integers). If necessary,the key can be sent off to secondary storage. We now successively compute A**(P+1+i*r) and match the penultimate word of its "X" against our sorted list,and similarly A**(P+1-i*r) ,where i runs from 1 through odd integers up to 2*s-1.(Remember that,using the Weierstra form,we always know when any power we compute happens to become the group identity). If a match is found on the penultimate word,we find r from our key ,and directly recompute A**r for a full comparison of the "X"s. If it is a true match,then the "Y"s are equal or equal and opposite,and we have found a nontrivial power of A which is equal to the group identity.One expects in large cases to find a number of bogus matches; the situation is similar to the birthday problem(we have 2**32 days in the year,so that once we have 2**16 or more students in our class two may well have the same birthday.) However,there are not likely to be enough bogus matches to cause serious inefficiency. Notice that we prefer the penultimate word; the last word may have some 2-adic congruences,and the first word may be small. (If first=penultimate we have only two words in all,when storage is unlikely to be a problem). Now P + 1 + c,the order of our group of points,lies between P +1 + 2*Sqrt(P) and P + 1 - 2*sqrt(P). Provided that r*(2*s) is greater than or equal to the integer part of 2*sqrt(P),or more or less r*s.ge.sqrt(P), then every power of A which is equal to the identity,and which lies betwe en the given bounds,will be found by the above process. In particular,if we carry out the full search,and find exactly one match,that must be the group order. It is probably best,however,to leave the sieve once the first match is found,and to make a quick attempt to prove it correct by the methods of section 1; if we fail,come back to the sieve. Again, if the matches are too numerous,and the order of the point A is disastrously low,we should try again with a new point. Finally,excerpts of data from different places can often be combined logically to advantage. For example,we may quickly find some information on c modulo 60. Now if our first success S is NOT the true order of the group,then our point has order d where d divides S,and the possible orders of the group are S (mod d). Als any number which is S (mod d) in the range over which we have already searched would have been a success,and it was not; this plainly restricts d. (Example: suppose that our first success is P+1 +k*sqrt(P)=S (k.gt.0), following the algorithm described above. Then we know d.ge.2k*sqrt(P). If we know that c is odd,and in fact S is odd,the nearest possible multiple of d is at least 6k*sqrt(P) away. Thus if 6k.gt.k+2,or k.gt.2/5,we have shown that S is the true order. ) It remains to choose r and s. They are both P**.25 apart from constants,so that the match time is fixed.(a single match is O(log r). ) The major time is thus the computation of the powers of A. If we are going to go through the whole sieve, that time is proportional to r+2*s (For each new power to be matched,the time to find its (X,Y) is very much greater than the match itself,so we do not bother with (2+ .01*log r) *s ). This is minimal when r=2*s approximately. However,we know the distribution of c/(2*sqrt(P)): if this is cos(theta) with 0.le.theta.le.pie,then the element probability is sin**2(theta).d(theta). We find that,assuming there is only one multiple of the order of A in our range(as is usually the case),the expected value of i where we find it is 4*(2*s)/(3*pie), in time 4*s/(3*pie),recalling that i runs over odd values only. Thus our expected effort now becomes r+ 4*s/(3*pie),which is minimal whe -n r=P**.25 *sqrt(4/3pie). It is certainly worth distinguishing the case when one intends to leave the sieve after the first success; it is problematical whether it is worth bothering with the exact distribution of c; the assumption of uniform distribution leads to sqrt(1/2) in place of sqrt(4/3pie),and makes the program slightly easier to write. 6.4^ The Ostensible Factorization of P by ECM. The method of 6.3 is certain to succeed,but is not feasible once r and s reach about 10**6,suggesting that P of 25 decimal digits is the limit. For P between 25 and 35 decimal digits(with decreasing chance of success as P increases)we may have recourse to the following method. Find a point A on some twist of E,and raise A to the usual "Large Product of Small Primes" power L(however,the algorithm should perhaps keep A**(iL/10), for i =1 through 9,even though this might normally be slightly inefficien t.) A**L may itself be the identity. If not,we perform a second stage as in SPAR using sort and match,and perhaps find a large(often prime)B such tha t A**(L*B) is the identity. It is now quite easy to disentangle the exact order of A,which almost always defines the group order uniquely. We have a second chance(but no more)if we fail the first time,using a different twist of E with different quadratic character. 6.5 ^ The Method of Schoof. Schoof's beautiful paper (Math.Comp.1985) is so well known that I do not here summarize it. It is an effective polynomial time algorithm,and has many elegant corollaries. I did in fact attempt to implement it in 1987, and was led to the method of the present paper,but I do not regard my method as any kind of "implementation" of Schoof. My Chinese grandfather here is rather Lehmer,with his Delay Line Sieve,who had partial information modulis an arbitrary number of Chinese primes, but a problem in its effective application. Naturally the idea of using Frobenius with auxiliary primes q I learned from Schoof. More detailed comparisons are best left until after my method has been fully described. 7 ^ Determination of the number of points on E modulo P. The method we now describe is almost certain to work(though I cannot prove this)for P up to about 50 decimal digits. From there up to 65 digits(my current record),the chances of success decrease,or strictly speaking an excessive time is needed in "unlucky" cases. We suppose that by the methods described above in section 3,we have found a number of limitations on c of the form c congruent to (list) modulo q,for various primes q(perhaps including q=8 or 9 by special purpose programs.) For practical purposes,we collate this information in the form c = L3 (modulo M3) (L3 is usually zero) c= one of LIST1(L1) (modulo M1) (N1 values of L1) c = one of LIST2(L2) (modulo M2) (N2 values of L2). (We suppose now w.l.o.m.g. that L3=0). Write c = M3 * (M1*R2 + M2*R1), where we shall also suppose that M1*M2*M3 .gt.4*sqrt(P) . There certainly exists an R1 with abs(R1).le.M1/2,and (since we have that abs(c).le.M3*M1*M2/2 )the corresponding R2 has abs(R2).le.M2 . The congruences for c modulis M1 and M2 are equivalent to congruences for R1 modulo M1 and R2 modulo M2. We now find some point A on some twist of E and form A2=A**(M3*m1). The N2 possible powers A2**R2 are found,and the penultimate words of the relevant X are stored together with the corresponding values of R2.Finall y the X-values are sorted by size and a key to the R2 values is retained. Next we form A1 =A**(M3*M2),and form successively the powers A**(P+1) * A1**R1, where R1 runs through N1 possible residues modulo M1, and at each stage we match the penultimate word of X against our sorted list. When a match is found,the list word is wholly recomputed using the key and a further check made for a true match on X,which implies a match on +/-Y. (As in section 6.3,a number of bogus matches is to be expected, perhaps two or three hundred in a run of several hundred thousand comparisons,but not enough to cause any serious inefficiency.) Once the match is found,we have essentially "won",and revert to section 1,or possibly carry out the search to the end as previously described. 7.1^ More Technical Details . 7.1.2. ^ The Sorted List. The preference here(for "large" P of 40 or more decimal digits; everything is quite fast for less)is to make M2 as large as possible (but of double IBM word size,i.e.,under 2**64)for a given N2,by using for M2 a product of those q which have very few possible values of c. In that case the gaps between successive values of R2 are large,and we speed up the computation of A2**R2 by the following device. Choose a parameter SIZ such that 2*SIZ**2 exceeds almost all the successive differences of the R2(SIZ =200 is often suitable). Form a table of A2,A2**2,...,A2**SIZ,and A2**(2*SIZ*L),L=1,SIZ . Then almost every power A2**R2 can be computed from the previous one by two multiplications(of points on the elliptic curve); recall that division is multiplication by (X,-Y). The very few large gaps are computed by applying the usual power algorithm. Since we always have +/-c mod q together in the method,a value R2 occurs if and only if M2-R2 occurs. Thus we apply the above only to the R2 under M2/2; after that we complete the table at the cost of a single division(the point A2**M2 having been worked out once for all.) 7.1.2. The Matching List. We find it expedient here to write M1 as a product M1K1*M1K2,and to rewrite the typical residue R1 as LISTK1(K1) +M1K1*LISTK2(K2), where the residues LISTK1 modulo M1K1 form an outer loop,and the residues LISTK2 modulo M1K2 the inner loop. Both our M1K1 and M1K2 are currently single length,and certainly we require M1K2 to be so. We prefer there to be a large number of residues modulo M1K2,so that the possible differences betweeen successive ones are limited in number to at most 2000. For different LISTK1 in the outer loop,the precise residues in the inner loop are different. However,their differences modulo M1K2 form a small set. Thus before commencing the double loop,we prepare a table of our relevant point raised to the powers of all these differences .. Then in the inner loop,each successive power is obtained by a single multiplication of points on the elliptic curve. 7.2 ^ The Global Strategy . It is now clear that we require a set of values of q ,with their corresponding numbers of possible c,which satisfy two criteria. First, the product of all the q must be at least 4*sqrt(P). Second,the product of the numbers of c for all these q must not exceed the time and storage constraints of our sort and match. For the sorted list,if we have N2 distinct residues,we need storage of 3*N2 words(list plus double word key ). One might send off the key to secondary storage,but it is a mild nuisance. At present,N2 of around 300000 is the most that I use. With the two-tier matching list,the critical constraint is the time; so long as the inner loop has about 10000 residues it is very efficient and needs only moderate storage for the powers of differences table. But the outer loop is there,and the number of residues it contains is a multiplier of the total time. The entire program is now automatic except for the assignment of q between M2,M1(and M1K1 and M1K2). As we proceed with the computation of q,any q with c=0 is automatically assigned to M3,and is pure good. Otherwise,we rank the qs by the ratio (q/ number of c),and as soon as the product of the leading q exceeds 4*sqrt(P) and the product of the numbers of c is less than some bound depending on P(with 50 digits,we might carry on a bit to make a shorter second stage; with 70,we are only too glad to have found something),and on the storage and time constraints of our program,we decide to go into the sort and match stage. At this point,a hand assignment of q is made. 7.3 ^ Prospects. As things stand,it is quite feasible to do all q up to 200 for a 100- digit prime in about 4 hours. The apparent restriction of information (measured as the product of (possible c / all c ) modulo (product q)) is enormous. For every q it is less than 1/2,and usually much better. But there seems to be no good way of using this information effectively. In primality proving one can in effect show very quickly for numerous q that any prime factor of a putative prime is congruent to one of (q-1)/2 values modulo q,but the information becomes unwieldy and the Delay Line Sieve to find the least number satisfying the conditions is needed,and is not effective beyond a certain size. The situation here is analogous. Of course Schoof's method does not have this drawback,and actually knowing c uniquely modulo each q gives an unique answer as soon as we have enough q. But the price he pays for this is (in effect)that he must work on the principal congruence subgroup Gamma(q) versus our GZ(q),and deal with equations of degree q**2 versus our (q+1). The equivalent computer effort that we need to get up to q=251 would only give at most q=17 for Schoof's algorithm. These remarks are in no way intended to detract from a very remarkable piece of mathematics,merely to point out that with our programming skills his method does not work(and,by all accounts, the programming skills of our complement in the world do no better.) However,a combination of the two methods might yield another 10 decimal digits;we could start off with an unique c modulo 8*9*5*7*11*13*17,and then use the method above with a much bigger M3. It may thus be hoped that someone will now be enterprising enough to program Schoof's method up to,say,17; the fact that by themselves the results would have no practical application need no longer act as a deterrent. 8. ^ Various appendices and numerical data. 8.1 ^ Examples(q=173 and q=257) of SSE-derived functions,showing the laundry process. Notice that the two Fourier series have the same coefficients(though a different leading power of x)for a large number of terms. Why is this? (Ramanujan worshippers please note what the master did himself,not the fumblings of his apostolic succession.) Q= 173 FUNCTION X**- 4(1, 1 1 2 3 3 5 6 8 10 13 16 21 25 31 38 47 56 69 82 99 118 141 166 26 59 102 148 32 92 168 76 1 108 62 28 22 28 68 129 56 8 12 47 143 110 150 71 86 166 12 120 14 16 12 140 122 99 141 45 69 170 103 6 153 162 163 105 136 42 152 83 21 88 154 12 61 98 53 56 76 93 97 86 111 158 169 164 120 82 119 106 17 127 130 0 74 165 46 134 122 (multiply by eta(tau) *eta(173*tau) ) Q= 173 X**( 13/ 4)*(1, 0 -1 0 0 -1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 -1 1 -1 0 0 0 1 0 0 -1 0 -1 0 0 0 0 0 0 0 -1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 -1 0 0 -1 0 0 1 1 0 -1 0 0 0 0 0 0 -1 0 -1 0 -1 0 0 0 -1 0 0 0 0 1 (above in sparse notation: coeff*x**(power)/4) -1( 21) -1( 33) 1( 57) 1( 81) -1( 113) 1( 117) -1( 121) 1( 137) -1( 149) -1( 157) -1( 189) 1( 209) 1( 213) 1( 261) 1( 269) -1( 297) -1( 309) 1( 321) 1( 325) -1( 333) -1( 361) -1( 369) -1( 377) -1( 393) 1( 413) LAUNDERED 1 1 2 3 3 5 6 8 10 13 16 21 25 31 38 47 56 69 82 99 118 141 166 199 232 275 321 378 438 514 595 693 800 927 1066 1233 1412 1625 1859 2132 2430 2780 3161 3603 4089 4648 5261 5968 6740 7624 8597 9702 10915 12295 13807 15519 17399 19517 21843 24462 27331 30551 34087 38040 42374 47219 52524 58437 64917 72120 80009 88770 98352 108971 120593 133444 147494 163019 179976 198680 219111 241605 266160 293173 322630 354992 390279 428987 471161 517389 567719 622817 682785 748355 819674 897598 982286 1074722 1175150 1284647 (end 173 begin 257) Q= 257 FUNCTION X**- 6(1, 1 1 2 3 3 5 6 8 10 13 16 21 25 31 38 47 56 69 82 99 118 141 166 199 233 18 65 122 183 2 84 182 34 162 46 214 139 97 78 97 143 240 115 49 31 84 196 144 162 34 1 100 57 181 189 139 12 124 199 56 178 140 180 128 222 63 134 52 55 24 202 243 123 36 235 153 48 175 12 105 217 145 186 197 212 147 79 216 147 173 151 200 240 199 47 93 76 172 215 220 (multiply by eta*eta(257)) Q= 257 X**( 19/ 4)*(1, 0 -1 0 0 -1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 -1 -1 1 0 0 0 0 1 0 0 -1 0 0 0 1 0 0 0 0 0 -1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -1 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 (sparse notation) -1( 27) -1( 39) 1( 63) 1( 87) -1( 127) -1( 163) -1( 167) 1( 171) 1( 191) -1( 203) 1( 219) -1( 243) 1( 263) 1( 267) 1( 323) -1( 339) -1( 351) -1( 399) LAUNDERED 1 1 2 3 3 5 6 8 10 13 16 21 25 31 38 47 56 69 82 99 118 141 166 199 233 275 322 379 440 516 598 696 805 933 1074 1242 1424 1639 1877 2153 2456 2810 3199 3647 4143 4710 5336 6055 6844 7744 8739 9866 11108 12517 14067 15816 17745 19913 22301 24985 27934 31237 34875 38935 43398 48379 53847 59933 66618 74040 82185 91221 101124 112088 124109 137391 151935 167996 185566 204934 226120 249435 274919 302943 333541 367143 403826 444055 487933 536018 588424 645784 708275 776596 850974 932232 1020623 1117094 1221993 1336363 8.2 Examples of application of method. 8.2.1. Smallest prime of 65 digits(copy of notice in NMBRTHRY net.) (February 21st,1991) In early 1988 I found a method for finding the number of points on an elliptic curve modulo a prime P,and gave an example where P had 44 decimal digits; this was described and distributed in a 9-page memorandum. Recently,without any essentially new ideas,but applying greater technic -al competence at each stage of the method,I have found that: The number of points on the INRIA curve y**2=x**3 + 105*x + 78153 modulo the smallest prime of sixtyfive decimal digits,which is 10**64+57, is equal to P + 1 + C,where C is MINUS the following 179 72908 11790 97384 74641 25900 78222 . The method of finding this number leads to its own proof of correctness (" only one possible number between P + 1 +/- 2*root(P) "),but two more direct proofs can be obtained by factorizing the numbers of points on the curve and on its twist modulo P. The latter number is 120 times an easily provable prime; for the former one must work harder,and I am indebted to Francois Morain for using his quadratic sieve to factorize it for me(since at the moment the authorities will not give me enough storage to use SPAR); the factors are 92, 27 70973 85393 30085 85690 21436 13511, and 3 92265 16706 26008 12655 88031 99403. ***********************FINIS ************************************* ******************************************* (Q=701 extrapolation solution) Q= 701 FUNCTION X**- 21(1, 1 2 2 4 5 8 10 14 18 25 31 42 52 68 84 108 133 168 205 256 311 384 463 567 681 125 287 489 14 294 607 289 19 540 426 430 524 67 435 296 321 596 387 498 197 302 83 384 474 521 503 618 147 31 259 413 501 153 98 35 11 500 188 335 373 269 208 284 63 481 494 527 344 579 453 142 448 452 469 640 160 279 528 571 333 67 212 309 698 365 358 654 389 687 337 458 382 48 323 246 **** multiply by eta * eta(701)******* Q= 701 X**( 33/ 4)*(1, 0 0 -1 0 0 0 0 -1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 -1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 1 0 0 sparse notation -1( 45) -1( 65) 1( 101) 1( 141) -1( 201) -1( 261) -1( 289) 1( 297) 1( 325) 1( 345) -1( 349) -1( 405) 1( 425) LAUNDERED 1 2 2 4 5 8 10 14 18 25 31 42 52 68 84 108 133 168 205 256 311 384 463 567 681 826 988 1190 1416 1696 2009 2392 2823 3344 3931 4636 5431 6376 7445 8708 10135 11812 13706 15920 18423 21332 24618 28424 32720 37674 43264 49688 56928 65224 74565 85234 97239 110911 126278 143740 163344 185564 210488 238675 270258 305905 345801 390741 440992 497490 560593 631427 710457 799018 897733 1008180 1131161 1268561 1421396 1591910 1781401 1992521 2226904 2487719 2776994 3098487 3454740 3850201 Berlekamp and Lehmer Factorization of Polynomials Modulo q A.O.L.Atkin To simplify our exposition,we shall make some unessential assumptions. Consider a polynomial f(x) without repeated factors,of degree n, monic. Our finite field will be the prime field modulo q(prime),where q is large in that finding roots of an equation mod q by trial and error is not feasible,but not very large in that storage becomes a problem. Typically we are interested in "Chinese" size,q just less than positive single word size,thus under 2^31 and often for convenience under 2^30 on IBM machines. Our first task is to compute x^q - x (modulis q,f(x)) which is done by the usual power algorithm left to right. Multiplication by x is trivial; squaring is best done with an auxiliary table of powers from x^n to x^(2n-2) as linear combinations of 1,x,x2,...,x^(n-1). (If multiplication mod q is slower than squaring mod q further considerat -ions apply; see my Ellpts Essay). Then gcd(x^q -x ,f(x)) will yield the linear factors of f(x) as a block. If what remains has degree.gt.3 we proceed further with the leftover part,which we rename as f(x),of degree n. We now create a matrix of the powers x^q,x^(2q),...,x^((n-1)*q) , expressed as linear combinations of 1,x,x2,...,x^(n-1). f^q requires only a reduction from the previous computation. We form the remaining powers in the order(e.g.,for n=37) :: 2,4,8,16,32,3( as (1+2)^2 -2 -4 divided by 2),6,12,24,5( as (2+3)^2 -4 -6 divided by 2),10,20, 7,14,28,9,18,36,11,22,13,26,15,30,17,34,19,21,23,25,27,29,31,33,35,and finally 37 which needs a multiplication. At this point the methods diverge substantially,and we describe Berlekamp's first. It should first be said that much of the original literature assumes q small(especially 2),and n relatively large; plainly one must modify some of our algorithms in this case. Berlekamp finds the nullspace of this (n-1)-by-n matrix,in the form of a basis consisting of 1 and polynomials B1(x),.,Br(x),which nullspace we call the"Berlekamp polynomial B(x) with the property that B^q = B (modulis q,f(x))and 0.lt.deg(g).lt.n If r=0,so that the matrix has rank n-1,then f(x) is irreducible. Otherwise f(x) is the product of (r+1) distinct factors. Observe that the polynomials B form not merely a vector space,but a ring(with multiplication defined modulis q,f(x)),and for fixed B(x) and each factor fi(x) of f(x) there is a constant c such that fi(x) divides B(x)-c. The product of all (B(x)-c) must be zero modulis q,f(x); but it is also a polynomial in B of degree.le.r+1(usually=) ,which can be found directly from an (r+1-by-n) matrix of B,B^2,...B^(r+1),as linear combinations of 1,x,...x^(n-1). Thus subject to FINDING THE LINEAR FACTORS OF A POLYNOMIAL MODULO q(to which we return later),we may choose some B(usually the Bj of least degree) and gcd all its appropriate (B-c) with f(x). If the number of c is less than (r+1), we must repeat partially with another B,&c. For a fair comparison with the other method,one should program a less theoretically elegant but practically faster algorithm using "first out",where the matrix is not processed fully to find its rank, but the first element of the nullspace is taken as B(x). This is especial ly advantageous when the number of factors is large(in the extreme super- singular case,they all have degree 2,and this is known); but ad hoc methods may be needed to PROVE that all the factors found are irreducible .. In a random case,the miniscule saving in matrix processing is not worth the bother involved. In the Lehmer method,we compute successively x^(q^a)-x,for a=2,3, ... For each a,we form the gcd with f(x),which necessarily extracts those factors of f(x) of precisely degree a. The powers are computed using the matrix(it is extremely inefficient to use the power algorithm for large q),which in general we do not modify even when a factor is found.We do however peel each successive gcd off f(x),and when 2*a exceeds the degree of the remaining f(x),it must be irreducible. Notice that in our gcd algorithm,which is likely to proceed as in Euclid with decreasing degrees,we can stop when our degree falls below a. The problem remains as to how to separate a block of factors all of which have the same degree a. This is of infrequent occurrence in a random factorization,and usually the number of factors is small. We perform a subsidiary Berlekamp on the block,but with the advantage that we already know some members of the Berlekamp nullspace,namely the symmetric functions of x,x^q,x^(q^2),...x^(q^(a-1)),which have already been computed and (when potentially needed)saved; usually the sum is sufficient to split the block. A refinement which I have recently(1988) adopted with some success is as follows. The time of the gcd is significant compared to the time of creating the next x^(q^a); thus for some values of a we defer the gcd and proceed directly to the next power. At the end,when 2*a exceeds the degree of the remaining f(x),we backtrack and clean up("Coda"). An array DEGLIS(a) keeps the degree down to which we must take our gcd algorithm at a(a=-1 conventionally says no gcd). The choice of DEGLIS is a matter of judgment; an example follows for degree n=101(the default value is DEGLIS(a)=a,and we exhibit only the other values): DEGLIS(a)=-1 for a=4,5,6,7,8,9,10,11,13,14,15,17,19,21,23,24,25 DEGLIS(a)=b for (a,b)= (12,4) (16,8) (18,9) (20,5) (22,11) (26,13) (28,7) (30,15) (34,17) (38,19) (42,21) (46,23) (48,24) (50,25). Thus three factors of degrees 4,6,and 12,would show up as a block of degree 22 at a=12; but they are easily separated using the (saved) x^(q^4)-x and x^(q^6)-x. In a typical example with n=99 we found factors of degrees 12,9, and 10,at stages a=12,18,and 20,respectively. The stages were continued up to a=34,at which point a "logic" subprogram determined what splits were still possible. For as the program proceeds,the array DEGLIS is changed; once a=12 is done,then factors of degree 4,6,and 12,are no longer possible,so we reset DEGLIS(a)=a for a=4,6,12. Now a search for all a with DEGLIS(a).lt.a may require some actual work,in the shape of an actual gcd with the saved x^(q^a)-x,or sometimes an inference from the implied cofactor. In fact at the coda stage we have a residual polynomial and a number of remaining a with DEGLIS(a)=-1. Any irreducible factor of our residual polynomial must have degree one of these a,and its cofactor must have degree in the semigroup of nonnegative addends of these a. The second condition is sometimes enough to eliminate cases without further work; otherwise judicious gcds are performed,and DEGLIS constantly updated. In the above example work was required for a=19,21,23,and 24. Thus compared with the standard method of taking gcds for each a,we saved gcds for a=4,5,6,7,8,9,10,11,13,14,15,but lost by carrying degree 9 for 9 stages and degree 10 for 10 stages,in our f(x) .. It would be foolish to break one's head trying to assess these effects for a random split of n into degrees of factors of f(x),especially since they are machine dependent. Far better to run both programs,and let the machine tell which is better(we give some data at the end). Whether or not one adopts our modification, it is certainly inefficient to recompute the matrix for a smaller f(x) except very near the beginning(I do it only if factors are found at a=2 or a=3). On balance the saving on gcds seems to exceed the loss on carrying a larger f(x) for longer in most cases,and of course for irreducible f(x) the saving is considerable. Finally,I have found that,even doing all gcds,Lehmer is faster than Berlekamp. However it is fair to say that an optimal matrix processing routine in which one proceeds with,say,10-by-10 blocks and does scalar products in the registers before division by q might speed up the Berlekamp by 20%. A postfinal observation: since the first common part of the program is 60% or more of the whole in many cases,the overall saving provided by the modification may be only 5%. But 5% seems a lot in the State of Illinois. Obtaining Linear Factors Modulo q . Suppose without loss of generality that we have f(x) of degree n modulo q and that it is known that all the factors of f(x) are linear. As before, we think of q as large,possibly Chinese of 10 digits,primality proving of 500-1000 digits,factorizing of 100-150 digits,&c. We describe below the principles; the exact algorithm depends on the precise size of q and the machine,which influence the relative cost of finding x^(approximately q) (modulis q,f(x)) and that of performing a gcd with two polynomials of degree n mod q. The first task is to find all the small factors of (q-1),where "small" may be up to 1000 if q has 500 digits. If for example q-1= 4*167*rough and n=9,we would form and store x^rough. Then gcd f(x) successively with x^(rough*167)-w,for w=i,-1,-i,1 mod q . (The cost of finding fourth roots mod q is small compared with the cost of polynomial operations.) With reasonable luck we should now have split f(x) into at most quadratic pieces,which can be finished using only a square root algorithm(and our previously found i becomes useful again.) I have in fact got a special purpose n=3 and n=4 routine,using Tartaglia and Euler,but these are probably not worth writing in most contexts. If however there remains an unfactorized residual polynomial,say of degree 5,it would be a mistake for large q to do what many textbooks recommend,namely to form y^((q-1)/4) for various translates y=x+small (some authors actually recommend using (q-1)/2 only!). For this involves a repeat of the longest part of the algorithm. It is better to gcd with x^((q-1)/167)-w,where w runs over 167-th roots of unity modulo q,with various options as to one's action after partial success (depending on relative cost of square roots & expected remaining gcds). Obviously whether it is specifically worth using 167 depends on n and q and the machine; it is a nice exercise in probability. But my observation has been that the "intuitive" programmer,including myself,does not calculate this strategy precisely unless forced to do so. Some Timings Comparison of Berlekamp and Lehmer(Unmodified) (Modulo 2^30-35 on the IBM 3090 with previous "slow"processor) Degree of Polynomial Time in " for Berlekamp Lehmer Split x^q & matrix ending ending 125 2.63 2.00 1.29 71,34,14,2,2,2 124 2.50 1.88 1.44 102,13,5,3,1 225 14.44 9.08 6.73 153,45,26,1 224 14.48 9.16 8.73 200,13,6,5 350 58.78 45.41 25.78 125,105,83,15,6,3, 3,1 Comparison of Lehmer(unmodified) and Lehmer(modified) (modulo 2^30-35 on IBM 3090 with fast processor June 1991) Degree x^q & mat Unmodified Modified Split 170 2.44 1.157 0.834 13,47,110 169 2.37 0.854 0.923 1,31,37,41,59 168 2.30 1.378 1.125 1,1,2,3,3,15,143 167 2.30 0.799 0.798 1,6,7,17,21,31,84 166 2.27 1.360 1.135 29,137 165 2.19 1.186 0.888 1,2,44,118 164 2.17 1.057 0.816 1,2,7,8,10,22,114 Only in one case does the modification lose,when the extremely "unlucky" decision to have DEGLIS(a)=-1 for 31,37,41, involved that these factors were all found in the Coda. On balance,it is clearly superior,and there remains the possibility of improving the actual DEGLIS strategy.