Main/src/FFT/src.64/README
=========================================

Generic FFT
===========

generate_dft.py <-- everything is there including the templates
                     it can distinguish between SmallPrimeField
                     and GeneralizedFermatPrimeField

dft_general.cpp <-- top-level algorithm generated by generate_dft.py

dft16.cpp  <-- base case on 16 points generated by generate_dft.py
dft32.cpp  
dft8.cpp
dft64.cpp 

What is missing:
- At BPAS compile time, we want to determine the largest size
  of a Generalized Fermat prime that we can use.
  This size must be a power of two, k = 2^e machine words,
  and a base-case FFT on a vector of size K = 2*k must
  fit in cache.  Note that this vector will have a size of
  K * k = 2 * k^2  machine words:
           for k=1,  that would be 2 words
           for k=2,  that would be 8 words
           for k=4,  that would be 32 words
           for k=8,  that would be 128 words
           for k=16, that would be 512 words   <-- good for stego
           for k=32, that would be 2048 words  <-- good for dinosaur
           for k=64, that would be 8192 words

- C case ==>
   needs the C support for FFT
   needs to be discussed with Lin-Xiao
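
The size table in the first item above can be reproduced, and the
largest admissible k selected, with a few lines of Python. This is
only a sketch: the function names and the cache budget of 4096
machine words are assumptions chosen for illustration, not values
taken from BPAS.

```python
def base_case_words(k):
    """Words needed by a base-case FFT on K = 2*k elements of k words each."""
    return 2 * k * k


def largest_k(cache_words):
    """Largest k = 2^e with 2*k^2 <= cache_words (0 if even k = 1 does not fit)."""
    if cache_words < base_case_words(1):
        return 0
    k = 1
    while base_case_words(2 * k) <= cache_words:
        k *= 2
    return k


if __name__ == "__main__":
    for e in range(7):
        k = 2 ** e
        print("k=%d: %d words" % (k, base_case_words(k)))
    # hypothetical cache budget, expressed in machine words
    print("largest k for 4096 words:", largest_k(4096))
```

A real build would presumably obtain the cache budget from the
target machine rather than from a hard-coded constant.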




Svyatoslas's FFT: 
=================

generate_fft_furer.py & generate_fft.py
---------------------------------------
Each of the Python files generate_fft.py and generate_fft_furer.py
implements a 1D FFT code generator; they are respectively called
the "old" and the "new" generator.

Both use the HTHRESHOLD environment variable.

The first one uses .config file while the other uses .config_furer

The first one generates the files fft_iter1.cpp and fft_iter2.cpp
The second one generates the files fft_furer1.cpp and fft_furer2.cpp

The file .config (used to configure
the old 1D FFT code generator) has the format:
<file name_1>
<prime_1>
<file name_2>
<prime_2>

The file .config_furer (used to configure
the new 1D FFT code generator) has the format:
<radix>
<prime_1>
<generator_1>
<prime_2>
<generator_2>

where <file name_1> and <file name_2> are the names
of the generated files and
each of <prime_1>, <generator_1>, <prime_2>, <generator_2>
is a positive integer such that:

<radix> is the FFT radix, which is a power of 2;
        8 is often used.

<prime_1> and <prime_2> are 64-bit prime numbers of max bit
such that p-1 is divisible by HTHRESHOLD (where p is either
<prime_1> or <prime_2>).

<generator_1> and <generator_2> are generators of the unit
group of Z/pZ, for p = <prime_1> and p = <prime_2> respectively.
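
The constraints above can be checked mechanically before running the
generator. The following is a minimal sketch, not code taken from
generate_fft_furer.py: the function names, the returned dictionary
and the default HTHRESHOLD of 256 are assumptions, and the generators
are parsed but not verified.

```python
import os


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def parse_config_furer(text, hthreshold):
    """Parse the five-line .config_furer layout and run basic sanity checks."""
    fields = [line.strip() for line in text.splitlines() if line.strip()]
    if len(fields) != 5:
        raise ValueError("expected 5 non-empty lines, got %d" % len(fields))
    radix, p1, g1, p2, g2 = (int(f) for f in fields)
    if not is_power_of_two(radix):
        raise ValueError("radix must be a power of 2")
    for p in (p1, p2):
        if (p - 1) % hthreshold != 0:
            raise ValueError("p - 1 must be divisible by HTHRESHOLD")
    # hypothetical result layout: the radix plus (prime, generator) pairs
    return {"radix": radix, "primes": [(p1, g1), (p2, g2)]}


if __name__ == "__main__":
    # Toy values only: 257 and 65537 are Fermat primes and 3 generates
    # both unit groups; the real primes are 64-bit numbers.
    sample = "8\n257\n3\n65537\n3\n"
    h = int(os.environ.get("HTHRESHOLD", "256"))  # default is an assumption
    print(parse_config_furer(sample, h))
```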

Principles behind the two 1D FFT generators
-------------------------------------------
The old and new generators use two different strategies:
- Radix-two Cooley-Tukey number theoretic transform (R2CTNTTF)
- Furer Complexity Class Transform (FCCT)
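
For readers unfamiliar with the first strategy, here is a minimal
recursive radix-two Cooley-Tukey transform over Z/pZ. It is a sketch
only: the generated code is iterative and unrolled and, as far as this
README says, works in Montgomery representation, none of which is
reproduced here. The function names and the toy prime 257 are
assumptions made for the example.

```python
def ntt(a, w, p):
    """Radix-2 Cooley-Tukey transform of a (length a power of 2) modulo p.

    w must be a primitive len(a)-th root of unity modulo p.
    """
    n = len(a)
    if n == 1:
        return list(a)
    w2 = w * w % p
    even = ntt(a[0::2], w2, p)
    odd = ntt(a[1::2], w2, p)
    out = [0] * n
    t = 1  # running twiddle factor w^i
    for i in range(n // 2):
        x = t * odd[i] % p
        out[i] = (even[i] + x) % p
        out[i + n // 2] = (even[i] - x) % p
        t = t * w % p
    return out


def inverse_ntt(a, w, p):
    """Inverse transform: use w^-1 as the root and scale by n^-1 (p prime)."""
    n = len(a)
    w_inv = pow(w, p - 2, p)
    n_inv = pow(n, p - 2, p)
    return [x * n_inv % p for x in ntt(a, w_inv, p)]


if __name__ == "__main__":
    p, g, n = 257, 3, 8          # toy prime; 3 generates the unit group of Z/257Z
    w = pow(g, (p - 1) // n, p)  # primitive n-th root of unity
    a = [1, 2, 3, 4, 0, 0, 0, 0]
    print(inverse_ntt(ntt(a, w, p), w, p) == a)
```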

Moreover, (currently)
- the R2CTNTTF uses the Montgomery trick
- the FCCT uses the Montgomery trick with a sparse radix prime number
  p1 = (256*255)^4 + 1; however, multiplying by low powers of the
  primitive root has not been optimized yet,
  in particular when multiplying by w = 255*256.
  Using p2 = (2^32 - 2^16)^2 + 1 and w = sqrt(2^32 - 2^16),
  multiplying by w^2 is cheap and is used 3 times in an 8-point
  butterfly.
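
Why multiplying by w^2 is cheap for p2 can be seen from a small
experiment. Writing r = w^2 = 2^32 - 2^16, we have p2 = r^2 + 1, hence
r^2 = -1 (mod p2) by construction. If x = x0 + x1*r, then
x*r = x0*r - x1 (mod p2), and x0*r itself is two shifts and a
subtraction. The sketch below uses Python integers and hypothetical
names; it illustrates the identity only and is not the code emitted by
generate_fft_furer.py.

```python
R = 2**32 - 2**16  # r = w^2 in the notation above
P = R * R + 1      # p2 = (2^32 - 2^16)^2 + 1


def mul_by_r(x):
    """Return x * R mod P for 0 <= x < P without a general multiplication."""
    x1, x0 = divmod(x, R)             # x = x0 + x1*R, 0 <= x0 < R, 0 <= x1 <= R
    y = (x0 << 32) - (x0 << 16) - x1  # x0*R - x1, since R^2 = -1 (mod P)
    return y + P if y < 0 else y


if __name__ == "__main__":
    x = 1234567890123456789 % P
    print(mul_by_r(x) == x * R % P)
```

Applying mul_by_r twice negates its argument modulo P, which is
another way of reading r^2 = -1.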


=================================================================
generate_fft.py 
=================================================================

It reads .config
It takes HTHRESHOLD as input
Then executes  generate_dft_iter_notunrolled
     -> generates the header for a specified prime        
     -> generates code following the template

                        
         generate_block(log,H,depth,code):
                        // block of size 2^log
                        // depth is the row number in the butterfly graph
                        // code is where we write

The shuffling code (recursive algorithm) is in the template code; see
the code of DFT_eff. Warning: Shuffle provides the base case.

  Code that is generated
  FFT_CASES: FFT on 32, ... up to 1024 points  (will have loops)
  // Note that the unrolled FFTs (on 2, 4, 8, 16 points)
  // are in the template


       generate_dft_iterative
                        // size is the number of points

                      // "generate_16_point_fft"
                      // creates a layer of blocks of size 16

                        // the block size may not divide the input size;
                        // this explains why we have
                        // the function generate_last_block

          generate_Array_Bit_Reversal
                      //  "header" is the header file
                      //  Array is computed by the Python code
                      //  with the "ArrayBitReversalSpe"
                      //  "name" is the size associated with
                      //  "ArrayBitReversalSpe"

            generate_first_block_fft  IS  NOT   USED
                       // similar to one 16-point FFT


                
            generate_everything   
                        // is the top-level function
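
As noted for generate_Array_Bit_Reversal above, the bit-reversal array
is computed on the Python side and written into the generated header.
A minimal sketch of such a computation follows; the function names,
the C array name and the output layout are hypothetical, and the real
generator may well build and format its table differently.

```python
def bit_reversal_table(log_n):
    """Return rev where rev[i] is i with its log_n low bits reversed."""
    n = 1 << log_n
    rev = [0] * n
    for i in range(1, n):
        # reuse the entry for i >> 1, then move the lowest bit of i to the top
        rev[i] = (rev[i >> 1] >> 1) | ((i & 1) << (log_n - 1))
    return rev


def emit_c_array(name, table):
    """Format the table as a C/C++ array definition (hypothetical layout)."""
    body = ", ".join(str(v) for v in table)
    return "static const int %s[%d] = {%s};" % (name, len(table), body)


if __name__ == "__main__":
    print(emit_c_array("RevBidMap8", bit_reversal_table(3)))
```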


=================================================================
generate_fft.py 
=================================================================


basic_routine.cpp     operations on bivariate polynomials
-----------------
sfixn * TwoConvolutionModNew(sfixn *Ap, sfixn *Bp,
			       sfixn d1, sfixn d2, sfixn K, 	     
			       MONTP_OPT2_AS_GENE *pPtr, 
			       int H, int *RevBidMap,sfixn num)

void CyclicConvolution(sfixn *res, sfixn s, 
			 sfixn es1, sfixn es2, 
			 sfixn K, sfixn dims2,  
			 sfixn *A, sfixn *B, sfixn dA, sfixn dB, 
			 sfixn *KRT, sfixn *dRT,
			 sfixn *invKRT, sfixn *invdRT,
			 MONTP_OPT2_AS_GENE *pPtr,
			 int H, int *RevBidMap, sfixn invn, sfixn invn1,sfixn num)

 void AdaptiveEvaluation(sfixn *res, sfixn es1, sfixn es2, 
			  sfixn K, sfixn dims2, 
			  sfixn *A, sfixn dA,  
			  sfixn *KRT, sfixn *dRT,
			  MONTP_OPT2_AS_GENE *pPtr,
			  int H, int *RevBidMap,sfixn num)

void AdaptiveInterpolation(sfixn *res, sfixn K, sfixn es1,
			     sfixn es2, sfixn dims2, 
			     sfixn *invKRT, sfixn *invdRT, 
			     MONTP_OPT2_AS_GENE *pPtr,
			     int H, int *RevBidMap,sfixn invn,sfixn invn1,sfixn num)

void NegacyclicConvolution(sfixn *res, sfixn s, 
			     sfixn es1, sfixn es2, 
			     sfixn K, sfixn dims2, 
			     sfixn *A, sfixn *B, 
			     sfixn dA, sfixn dB, sfixn K2,
			     sfixn *thetaPtr, 
			     sfixn *KRT, sfixn *dRT,
			     sfixn *invKRT, sfixn *invdRT,
			     MONTP_OPT2_AS_GENE *pPtr,
			     int H, int *RevBidMap,sfixn invn,sfixn invn1,sfixn num)

void WeightVectorAdaptiveEvaluation(sfixn *res, sfixn es1, 
				      sfixn es2, sfixn K, 
				      sfixn dims2,  
				      sfixn *A, sfixn dA,  
				      sfixn *thetaPtr, 
				      sfixn *KRT, sfixn *dRT,
				      MONTP_OPT2_AS_GENE *pPtr,
				      int H, int *RevBidMap,sfixn num)

 void AdaptiveInterpolationWeightVector(sfixn *res, sfixn K, 
					 sfixn es1, sfixn es2, 
					 sfixn dims2, 
					 sfixn *thetaPtr, 
					 sfixn *invKRT, 
					 sfixn *invdRT,
					 MONTP_OPT2_AS_GENE *pPtr,
					 int H, int *RevBidMap,sfixn invn,sfixn invn1,sfixn num)

general_routine.cpp
-------------------
// consecutive powers of the primitive root
void EX_Mont_GetNthRoots_OPT2_AS_GENE(sfixn e, sfixn n, 
					sfixn * rootsPtr, 
					MONTP_OPT2_AS_GENE * pPtr)
// Pairwise multiplication
void EX_Mont_PairwiseMul_OPT2_AS(sfixn n, sfixn * APtr, sfixn * BPtr, sfixn p)
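
What the two helpers above compute can be sketched in plain modular
arithmetic. The sketch assumes ordinary residues rather than the
Montgomery representation suggested by the C++ names, and it assumes
that the pairwise product overwrites the first array; neither point is
confirmed by this README. The Python names are hypothetical.

```python
def nth_root_powers(n, w, p):
    """Return [w^0, w^1, ..., w^(n-1)] modulo p by repeated multiplication."""
    roots = [1] * n
    for i in range(1, n):
        roots[i] = roots[i - 1] * w % p
    return roots


def pairwise_mul(a, b, p):
    """Overwrite a[i] with a[i] * b[i] mod p (in-place update is an assumption)."""
    for i in range(len(a)):
        a[i] = a[i] * b[i] % p
    return a


if __name__ == "__main__":
    p = 257                     # toy prime; 3 generates the unit group of Z/257Z
    w = pow(3, (p - 1) // 8, p)  # primitive 8th root of unity
    print(nth_root_powers(8, w, p))
    print(pairwise_mul([1, 2, 3], [4, 5, 6], 7))
```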

 //Matteo's rectangular matrix transpose---------------
  //out-of-place transpose A[i0..i1][j0..j1] into B[j0..j1][i0..i1]
  //then copy back to A
  // n: size of A
  //row major layout
void transpose(sfixn *A, sfixn lda, sfixn *B, sfixn ldb,
		 sfixn i0, sfixn i1, sfixn j0, sfixn j1) 
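
A recursive, cache-oblivious transpose with this argument list can be
sketched as follows. The half-open index ranges, the cutoff parameter
and the use of flat Python lists are assumptions made for the example;
the copy back into A mentioned in the comments above is left out.

```python
def transpose(A, lda, B, ldb, i0, i1, j0, j1, cutoff=4):
    """Write A[i][j] into B[j][i] for i0 <= i < i1 and j0 <= j < j1.

    A and B are flat row-major lists with leading dimensions lda and ldb.
    cutoff (hypothetical, must be >= 1) is the block size handled directly.
    """
    di, dj = i1 - i0, j1 - j0
    if di <= cutoff and dj <= cutoff:
        for i in range(i0, i1):
            for j in range(j0, j1):
                B[j * ldb + i] = A[i * lda + j]
    elif di >= dj:
        im = (i0 + i1) // 2  # split the longer dimension in half
        transpose(A, lda, B, ldb, i0, im, j0, j1, cutoff)
        transpose(A, lda, B, ldb, im, i1, j0, j1, cutoff)
    else:
        jm = (j0 + j1) // 2
        transpose(A, lda, B, ldb, i0, i1, j0, jm, cutoff)
        transpose(A, lda, B, ldb, i0, i1, jm, j1, cutoff)


if __name__ == "__main__":
    rows, cols = 3, 5
    A = list(range(rows * cols))
    B = [0] * (rows * cols)
    transpose(A, cols, B, rows, 0, rows, 0, cols)
    print(B)
```

Splitting the longer side keeps the sub-blocks close to square, so
that both the reads from A and the writes to B stay within a few cache
lines once a block is small enough.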

  /* Matteo: Traverse the trapezoidal space (i, j) where
   i0 <= i < i1
   j0 + (i - i0) * dj0 <= j < j1 
 */
  //square matrix A in place transposition
  void sqtranspose(sfixn *A, sfixn lda,
		   sfixn i0, sfixn i1,
		   sfixn j0, sfixn dj0, sfixn j1 /*, int dj1 = 0 */)


 //new DFT
  /*
   * n=2^r
   * 
   */
  void DFT_eff(int n, int r, 
	       sfixn *A, 
	       sfixn *W, 
	       MONTP_OPT2_AS_GENE *pPtr,
	       int H, int *RevBidMap,
	       sfixn *B,sfixn whichprime)

  void InvDFT_eff_keepMontgomery(int n, int r, 
		  sfixn *A, 
		  sfixn *W, 
		  MONTP_OPT2_AS_GENE *pPtr,
		  int H, int *RevBidMap,
		  sfixn *B,sfixn invn,sfixn whichprime)

  void InvDFT_eff(int n, int r, 
		  sfixn *A, 
		  sfixn *W, 
		  MONTP_OPT2_AS_GENE *pPtr,
		  int H, int *RevBidMap,
		  sfixn *B,sfixn invn,sfixn whichprime)


generate_tft_tree_template.cpp
-------------------------------

SLP code for TFT:  
void  TFT_AddSubSpeSSEModInplace(sfixn* a0,sfixn* a1, sfixn* a2, sfixn* a3)
inline TFT_2POINT(sfixn *A,sfixn *W)
...................................
inline TFT_iter32(sfixn *A,sfixn *W)

Main subroutines of TFT_Core:
TFT_twiddle
TFT_Basecase
TFT_Core