Solved AES-GCM GF(2^128)

freejlr · Dec 10, 2021

Hi guys.

I am trying to do a multiplication in GF to a 128 bit block.

But I have completely different results than I should. For example, if multiplied(Case 2: Test vectors):

H = 0x66e94bd4ef8a2c3b884cfa59ca342b2e
C = 0x0388dace60b6a392f328c2b971b2fe78

H * C = 0x5e2ec746917062882c85b0685353deb7

I'm confused, I don't know if I have a problem with my code, or I didn't understand any part of the algorithm. I don't know if I have a problem when passing the two values H and C, such as the small endian and the big endian ...

But in GCM theory for bytes it uses big endian.

Code:

.section .text
.globl _start
.align 16

_start:
                     
# xmm0 | Z
# rsp  | Y
# rsp-16  | V

pxor %xmm0, %xmm0
movq $-16, %rax                         # byte position

_l01:

xorq %r8, %r8                           # bit position
movq $-8, %r9                           # bit count

_l0:

btq %r8, 16(%rsp, %rax)                 # check the bit posotion 0 - 7
jnc _l1                                 # bit equal to 0,  jump to _l1

movdqu -16(%rsp), %xmm3
pxor %xmm3, %xmm0                       # bit equal to 1, Z = Z xor V

_l1:

movq -8(%rsp), %r10
movq %r10, %rbx
shlq $63, %rbx
shrq $1, -8(%rsp)
shrq $1, -16(%rsp)                       #Rightshift V in memory and xored by R
orq %rbx, -16(%rsp)

btq $63, %r10
jnc _l3
xorb $0xe1, -1(%rsp)

_l3:

incq %r8                                 #next bit position
incq %r9                                 #bit count addition
js _l0                                   #r9 equal to signed, next bit
                                         # else go to the next byte

incq %rax                                #byte addition, rax equal to 0, block finished
js _l01                                  # else next byte


# END - ret xmm0 entire block 128 bits



#Explanation of Rightshift V in memory

#rsp-16 = 0x80000000000000010000000000000000

#movq -8(%rsp), %rbx  //rbx =    0x8000000000000001
#shlq $63, %rbx       //rbx =    0x8000000000000000
#shrq $1, -8(%rsp)    //rsp-8 =  0x4000000000000000
#shrq $1, -16(%rsp)   //rsp-16 = 0x0000000000000000
#orq %rbx, -16(%rsp)  //rsp-16 = 0x8000000000000000
#btq $63, %r10        // check the bit V[127]
#jnc _l3              // V[127] equal to 0, no xored
#xorb $0xe1, -1(%rsp) // else, V[127] equal to 1, xored by R rsp-1 =  0xa100000000000000

#result rsp-16 = 0xa1000000000000008000000000000000

GCM document

The code is oriented on algorithm 1 on page 8.

I have checked that the bit movements in my code were fine, and the counter, so I don't understand how I get wrong results, what is wrong?

Regards.

unitrunker · Dec 10, 2021

Test case 2 applies to AES, not GCM. Scroll down one more page.

freejlr · Dec 11, 2021

unitrunker said:
Test case 2 applies to AES, not GCM. Scroll down one more page.

I get a bad result, again.

But I think my code is completely wrong, and I didn't understand the polynomial part. I have found this implementation, on stackoverflow, but in this case the user performs a multiplication to a 24 bit block, not 128.

A user told him that this was wrong, but I am going to focus on the code itself, not on the finite field.

GCM mul C

Well, I am going to describe the parts of the code "" I am not a master in C "".

Code:

 for(j=0;j<3;j++,++val_2)   
    {
        Z_val[j]=0x00;     
        V_val[j]=*val_2;   
    }
    
    // Z_val is Z 0^128, but in 24bits
    // V_val is Y, val_2 == Y, but in 24bits
    //Okay, is equal to my:
    
# rsp  | Y
# rsp-16  | V

Ok, in the next part I see something different from my code:

Code:

    for(i=0;i<24;i++)
    {
        if (*val_1 & mask_b)
        {
            for(j=0;j<3;j++)   
                Z_val[j]^=V_val[j];
        }

It is worth a for with 24 iterations, I am doing it in 128, since that are the bits of the block, well but what it sees or different is the if, since it does not compare the first bit of the byte, if not the most significant, since the mask is 0x80.

Code:

btq %r8, 16(%rsp, %rax)

There is this, I would be doing it completely the other way around, I would be starting, comparing the least significant bit.

Could this be wrong? Because I think I have confused the representation of polynomials.

Code:

 if (!(V_val[2] & 0x01))
        {//if LSB of V_val is 0
            for(j=0;j<3;j++)
            { //V_val = rightshift(V_val)
                if (j!=0)
                    if (V_val[2-j] & 0x01)
                        V_val[3-j] |= 0x80;
                V_val[2-j]>>=1;         
            }
        }

btq $0, %r10
jnc _l3
xorb $0xe1, -1(%rsp)

Here another difference to my code, is checking the least significant bit, but the last byte. There is this, I am checking the first bit of the 128-bit block. It is completely different.

Code:

 {//if LSB of V_val is 1
            for(j=0;j<3;j++)
            {//V_val = rightshift(V_val)
                if (j!=0)
                    if (V_val[2-j] & 0x01)
                        V_val[3-j] |= 0x80;
                V_val[2-j]>>=1;         
            }
            V_val[0]^=R_val; //V_val = rightshift(V_val) ^ R               
        }

The same but with xor, but he does the xor to the first byte, I to the last ...

Code:

if(mask_b & 0x01)   { val_1++; rnd=1;} 
        mask_b >>= 1;                       
        if (rnd)    { mask_b=0x80; rnd=0; }
    }

Here I would go increasing the value of x, and passing to the next bit, in the mask.

And all the bit shifts are unique for the bytes, what I mean is that my shifts go to the next bits, I don't know if I explain myself, in this case the bits are lost, and they don't modify the next bytes.

I think I completely misunderstood the algorithm, what do you think?

Thanks.

unitrunker · Dec 12, 2021

Your initial x64 code makes me think you want high speed.

The instructions in your pdf look reasonable. I'd rather stick to that than something from SO.

I may have a go at this in C over a cup of coffee.

I usually work with small-ish ECC codes so this is new.

Alain De Vos · Dec 12, 2021

What is weird in the algorithm p8 is the if V127 inside the loop. I would expect this outside the loop?

unitrunker · Dec 12, 2021

Alain De Vos said:
What is weird in the algorithm p8 is the if V127 inside the loop. I would expect this outside the loop?

That bit can potentially change on each iteration.

if V127 = 0 then V ← rightshift(V ) else V ← rightshift(V ) ⊕ R

unitrunker · Dec 13, 2021

The code becomes much simpler if you can use a 128 bit integral type like int128, __int128 or int128_t (whatever your compiler allows).

Alain De Vos · Dec 13, 2021

R=11100001 corresponds to : 1+ x^2+x^3+x^7. If i'm correct for V127=1 you then flip bits : 127, 126 ,125 ,121 of V which were before the rightshift bits 126, 125 ,124 ,120.
I will first try a GF(2^8) finite field. PS: I see ocaml has a uint128...
PS: I think you only need two operators , bitshift to left , and xor.

freejlr · Dec 14, 2021

unitrunker said:
Your initial x64 code makes me think you want high speed.

The instructions in your pdf look reasonable. I'd rather stick to that than something from SO.

I may have a go at this in C over a cup of coffee.

I usually work with small-ish ECC codes so this is new.

If my idea is that it is fast, but it can be improved much more if the M tables are computed, which appears on page 12, there you can see the performance of working with and without tables.

But it's still too early for that, since I'm having a problem with the product.

Alain De Vos said:
R=11100001 corresponds to : 1+ x^2+x^3+x^7. If i'm correct for V127=1 you then flip bits : 127, 126 ,125 ,121 of V which were before the rightshift bits 126, 125 ,124 ,120.
I will first try a GF(2^8) finite field. PS: I see ocaml has a uint128...

Yes, for 2^128 is f = 1 + α + α2 + α7 + α128. -- R = 11100001 + 0^120

And when V 127 is 1, it is divided with the xor by that polynomial.

Next week the student day ends for Christmas holidays, I will be able to delve into it, and see if with your help I can solve it.

Regards.

unitrunker · Dec 14, 2021

Here's some output I get for multiplication:

80000000000000000000000000000000*80000000000000000000000000000000=80000000000000000000000000000000 80000000000000000000000000000000*40000000000000000000000000000000=c0000000000000000000000000000000 80000000000000000000000000000000*20000000000000000000000000000000=e0000000000000000000000000000000 80000000000000000000000000000000*10000000000000000000000000000000=f0000000000000000000000000000000 80000000000000000000000000000000*08000000000000000000000000000000=f8000000000000000000000000000000 80000000000000000000000000000000*04000000000000000000000000000000=fc000000000000000000000000000000 80000000000000000000000000000000*02000000000000000000000000000000=fe000000000000000000000000000000 80000000000000000000000000000000*01000000000000000000000000000000=ff000000000000000000000000000000 80000000000000000000000000000000*00800000000000000000000000000000=ff800000000000000000000000000000 80000000000000000000000000000000*00400000000000000000000000000000=ffc00000000000000000000000000000 80000000000000000000000000000000*00200000000000000000000000000000=ffe00000000000000000000000000000 80000000000000000000000000000000*00100000000000000000000000000000=fff00000000000000000000000000000 80000000000000000000000000000000*00080000000000000000000000000000=fff80000000000000000000000000000 80000000000000000000000000000000*00040000000000000000000000000000=fffc0000000000000000000000000000 80000000000000000000000000000000*00020000000000000000000000000000=fffe0000000000000000000000000000 80000000000000000000000000000000*00010000000000000000000000000000=ffff0000000000000000000000000000 80000000000000000000000000000000*00008000000000000000000000000000=ffff8000000000000000000000000000 80000000000000000000000000000000*00004000000000000000000000000000=ffffc000000000000000000000000000 80000000000000000000000000000000*00002000000000000000000000000000=ffffe000000000000000000000000000 80000000000000000000000000000000*00001000000000000000000000000000=fffff000000000000000000000000000 80000000000000000000000000000000*00000800000000000000000000000000=fffff800000000000000000000000000 80000000000000000000000000000000*00000400000000000000000000000000=fffffc00000000000000000000000000 80000000000000000000000000000000*00000200000000000000000000000000=fffffe00000000000000000000000000 80000000000000000000000000000000*00000100000000000000000000000000=ffffff00000000000000000000000000 80000000000000000000000000000000*00000080000000000000000000000000=ffffff80000000000000000000000000 80000000000000000000000000000000*00000040000000000000000000000000=ffffffc0000000000000000000000000 80000000000000000000000000000000*00000020000000000000000000000000=ffffffe0000000000000000000000000 80000000000000000000000000000000*00000010000000000000000000000000=fffffff0000000000000000000000000 80000000000000000000000000000000*00000008000000000000000000000000=fffffff8000000000000000000000000 80000000000000000000000000000000*00000004000000000000000000000000=fffffffc000000000000000000000000 80000000000000000000000000000000*00000002000000000000000000000000=fffffffe000000000000000000000000 80000000000000000000000000000000*00000001000000000000000000000000=ffffffff000000000000000000000000 80000000000000000000000000000000*00000000800000000000000000000000=ffffffff800000000000000000000000 80000000000000000000000000000000*00000000400000000000000000000000=ffffffffc00000000000000000000000 80000000000000000000000000000000*00000000200000000000000000000000=ffffffffe00000000000000000000000 80000000000000000000000000000000*00000000100000000000000000000000=fffffffff00000000000000000000000 80000000000000000000000000000000*00000000080000000000000000000000=fffffffff80000000000000000000000 80000000000000000000000000000000*00000000040000000000000000000000=fffffffffc0000000000000000000000 80000000000000000000000000000000*00000000020000000000000000000000=fffffffffe0000000000000000000000 80000000000000000000000000000000*00000000010000000000000000000000=ffffffffff0000000000000000000000 80000000000000000000000000000000*00000000008000000000000000000000=ffffffffff8000000000000000000000 80000000000000000000000000000000*00000000004000000000000000000000=ffffffffffc000000000000000000000 80000000000000000000000000000000*00000000002000000000000000000000=ffffffffffe000000000000000000000 80000000000000000000000000000000*00000000001000000000000000000000=fffffffffff000000000000000000000 80000000000000000000000000000000*00000000000800000000000000000000=fffffffffff800000000000000000000 80000000000000000000000000000000*00000000000400000000000000000000=fffffffffffc00000000000000000000 80000000000000000000000000000000*00000000000200000000000000000000=fffffffffffe00000000000000000000 80000000000000000000000000000000*00000000000100000000000000000000=ffffffffffff00000000000000000000 80000000000000000000000000000000*00000000000080000000000000000000=ffffffffffff80000000000000000000 80000000000000000000000000000000*00000000000040000000000000000000=ffffffffffffc0000000000000000000 80000000000000000000000000000000*00000000000020000000000000000000=ffffffffffffe0000000000000000000 80000000000000000000000000000000*00000000000010000000000000000000=fffffffffffff0000000000000000000 80000000000000000000000000000000*00000000000008000000000000000000=fffffffffffff8000000000000000000 80000000000000000000000000000000*00000000000004000000000000000000=fffffffffffffc000000000000000000 80000000000000000000000000000000*00000000000002000000000000000000=fffffffffffffe000000000000000000 80000000000000000000000000000000*00000000000001000000000000000000=ffffffffffffff000000000000000000 80000000000000000000000000000000*00000000000000800000000000000000=ffffffffffffff800000000000000000 80000000000000000000000000000000*00000000000000400000000000000000=ffffffffffffffc00000000000000000 80000000000000000000000000000000*00000000000000200000000000000000=ffffffffffffffe00000000000000000 80000000000000000000000000000000*00000000000000100000000000000000=fffffffffffffff00000000000000000 80000000000000000000000000000000*00000000000000080000000000000000=fffffffffffffff80000000000000000 80000000000000000000000000000000*00000000000000040000000000000000=fffffffffffffffc0000000000000000 80000000000000000000000000000000*00000000000000020000000000000000=fffffffffffffffe0000000000000000 80000000000000000000000000000000*00000000000000010000000000000000=ffffffffffffffff0000000000000000 80000000000000000000000000000000*00000000000000008000000000000000=ffffffffffffffff8000000000000000 80000000000000000000000000000000*00000000000000004000000000000000=ffffffffffffffffc000000000000000 80000000000000000000000000000000*00000000000000002000000000000000=ffffffffffffffffe000000000000000 80000000000000000000000000000000*00000000000000001000000000000000=fffffffffffffffff000000000000000 80000000000000000000000000000000*00000000000000000800000000000000=fffffffffffffffff800000000000000 80000000000000000000000000000000*00000000000000000400000000000000=fffffffffffffffffc00000000000000 80000000000000000000000000000000*00000000000000000200000000000000=fffffffffffffffffe00000000000000 80000000000000000000000000000000*00000000000000000100000000000000=ffffffffffffffffff00000000000000 80000000000000000000000000000000*00000000000000000080000000000000=ffffffffffffffffff80000000000000 80000000000000000000000000000000*00000000000000000040000000000000=ffffffffffffffffffc0000000000000 80000000000000000000000000000000*00000000000000000020000000000000=ffffffffffffffffffe0000000000000 80000000000000000000000000000000*00000000000000000010000000000000=fffffffffffffffffff0000000000000 80000000000000000000000000000000*00000000000000000008000000000000=fffffffffffffffffff8000000000000 80000000000000000000000000000000*00000000000000000004000000000000=fffffffffffffffffffc000000000000 80000000000000000000000000000000*00000000000000000002000000000000=fffffffffffffffffffe000000000000 80000000000000000000000000000000*00000000000000000001000000000000=ffffffffffffffffffff000000000000 80000000000000000000000000000000*00000000000000000000800000000000=ffffffffffffffffffff800000000000 80000000000000000000000000000000*00000000000000000000400000000000=ffffffffffffffffffffc00000000000 80000000000000000000000000000000*00000000000000000000200000000000=ffffffffffffffffffffe00000000000 80000000000000000000000000000000*00000000000000000000100000000000=fffffffffffffffffffff00000000000 80000000000000000000000000000000*00000000000000000000080000000000=fffffffffffffffffffff80000000000 80000000000000000000000000000000*00000000000000000000040000000000=fffffffffffffffffffffc0000000000 80000000000000000000000000000000*00000000000000000000020000000000=fffffffffffffffffffffe0000000000 80000000000000000000000000000000*00000000000000000000010000000000=ffffffffffffffffffffff0000000000 80000000000000000000000000000000*00000000000000000000008000000000=ffffffffffffffffffffff8000000000 80000000000000000000000000000000*00000000000000000000004000000000=ffffffffffffffffffffffc000000000 80000000000000000000000000000000*00000000000000000000002000000000=ffffffffffffffffffffffe000000000 80000000000000000000000000000000*00000000000000000000001000000000=fffffffffffffffffffffff000000000 80000000000000000000000000000000*00000000000000000000000800000000=fffffffffffffffffffffff800000000 80000000000000000000000000000000*00000000000000000000000400000000=fffffffffffffffffffffffc00000000 80000000000000000000000000000000*00000000000000000000000200000000=fffffffffffffffffffffffe00000000 80000000000000000000000000000000*00000000000000000000000100000000=ffffffffffffffffffffffff00000000 80000000000000000000000000000000*00000000000000000000000080000000=ffffffffffffffffffffffff80000000 80000000000000000000000000000000*00000000000000000000000040000000=ffffffffffffffffffffffffc0000000 80000000000000000000000000000000*00000000000000000000000020000000=ffffffffffffffffffffffffe0000000 80000000000000000000000000000000*00000000000000000000000010000000=fffffffffffffffffffffffff0000000 80000000000000000000000000000000*00000000000000000000000008000000=fffffffffffffffffffffffff8000000 80000000000000000000000000000000*00000000000000000000000004000000=fffffffffffffffffffffffffc000000 80000000000000000000000000000000*00000000000000000000000002000000=fffffffffffffffffffffffffe000000 80000000000000000000000000000000*00000000000000000000000001000000=ffffffffffffffffffffffffff000000 80000000000000000000000000000000*00000000000000000000000000800000=ffffffffffffffffffffffffff800000 80000000000000000000000000000000*00000000000000000000000000400000=ffffffffffffffffffffffffffc00000 80000000000000000000000000000000*00000000000000000000000000200000=ffffffffffffffffffffffffffe00000 80000000000000000000000000000000*00000000000000000000000000100000=fffffffffffffffffffffffffff00000 80000000000000000000000000000000*00000000000000000000000000080000=fffffffffffffffffffffffffff80000 80000000000000000000000000000000*00000000000000000000000000040000=fffffffffffffffffffffffffffc0000 80000000000000000000000000000000*00000000000000000000000000020000=fffffffffffffffffffffffffffe0000 80000000000000000000000000000000*00000000000000000000000000010000=ffffffffffffffffffffffffffff0000 80000000000000000000000000000000*00000000000000000000000000008000=ffffffffffffffffffffffffffff8000 80000000000000000000000000000000*00000000000000000000000000004000=ffffffffffffffffffffffffffffc000 80000000000000000000000000000000*00000000000000000000000000002000=ffffffffffffffffffffffffffffe000 80000000000000000000000000000000*00000000000000000000000000001000=fffffffffffffffffffffffffffff000 80000000000000000000000000000000*00000000000000000000000000000800=fffffffffffffffffffffffffffff800 80000000000000000000000000000000*00000000000000000000000000000400=fffffffffffffffffffffffffffffc00 80000000000000000000000000000000*00000000000000000000000000000200=fffffffffffffffffffffffffffffe00 80000000000000000000000000000000*00000000000000000000000000000100=ffffffffffffffffffffffffffffff00 80000000000000000000000000000000*00000000000000000000000000000080=ffffffffffffffffffffffffffffff80 80000000000000000000000000000000*00000000000000000000000000000040=ffffffffffffffffffffffffffffffc0 80000000000000000000000000000000*00000000000000000000000000000020=ffffffffffffffffffffffffffffffe0 80000000000000000000000000000000*00000000000000000000000000000010=fffffffffffffffffffffffffffffff0 80000000000000000000000000000000*00000000000000000000000000000008=fffffffffffffffffffffffffffffff8 80000000000000000000000000000000*00000000000000000000000000000004=fffffffffffffffffffffffffffffffc 80000000000000000000000000000000*00000000000000000000000000000002=fffffffffffffffffffffffffffffffe 80000000000000000000000000000000*00000000000000000000000000000001=ffffffffffffffffffffffffffffffff

freejlr · Dec 18, 2021

unitrunker said:
Here's some output I get for multiplication:

If they are multiplications done with my code, they will be wrong.

I have programmed the rijndael algorithm, and in the modular reductions, I do not invade other bytes. I think it is a very bad way to explain it, what I mean is that when doing a multiplication if the byte exceeds 8 bits I have to do a modular reduction with the polynomial 0x1b if I remember correctly.

Right now what I don't see as correct are the divisions of my code in GCM, but next week I'll get to it.

Regards.

freejlr · Dec 21, 2021

I am following in the understanding of the algorithm, the following thread has solved some doubts for me.

Link

GCM uses big-endian for bits and little-endian for bytes, and of course the bits are inverted, there are examples in the link, for example the polynomial:

f = 1 + α + α2 + α7 = 0x87(10000111)

It really is 0xe1 (11100001) that would be the little-endian of 0x87 of the polynomial f.

Then V[127] It is not 0x80 (10000000) is 0x1(00000001), so when i compare bit 127 with bt i'm doing it completely wrong.

I think that every time I understand better, I hope to have something this week, but hey, I have a lot of Christmas work.

Regards.

unitrunker · Dec 22, 2021

This is a line-for-line translation of algorithm 1.

From 2.5 "The leftmost bit is X0, and the rightmost bit is X127".

Note that in the code below, the leftmost bit is the most significant bit and the rightmost bit is the least significant bit. This allows the ">>" operator to align with the "rightshift" operation called for in algo 1.

To test bit X127, a bit-wise '&' with '1' will isolate that bit. Likewise, to test bit X0, a bit-wise '&' with '1' left-shifted 127 bits will isolate the desired bit.

C++:

static __int128_t multiply(__int128_t X, __int128_t Y)
{
  static constexpr __int128_t R = ((__int128_t)0xE1) << 120;
  __int128_t Z = 0;
  __int128_t V = X;

  for (unsigned char i = 0; i < 128; i++)
  {
    const __int128_t mask = ((__int128_t)1) << (127 - i);
    if ( (Y & mask) != 0)
    {
      Z = Z ^ V;
    }
    if ((V & 1) == 0)
    {
      V >>= 1;
    }
    else
    {
      V = (V >> 1) ^ R;
    }
  }
  return Z;
}

I've used 'constexpr' so technically this is C++ (and I'm using clang, not GCC). I'm not aware of a format specifier for int128 so I did this:

C++:

static void print(__int128_t value)
{
  long long hi = value >> 64;
  long long lo = value & 0xFFFFFFFFFFFFFFFFLL;
  fprintf(stdout, "%016llx%016llx", hi, lo);
}

Note: I've updated the output in post 10 above. I was printing "V" instead of "Z".

freejlr · Dec 23, 2021

Ok, I am going to correct, and I'm going to check yours code with the vector tests in the document.

I found more information about multiplication test in an intel pdf on page 71-72.

Document

Regards.

freejlr · Jan 16, 2022

Hi unitrunker Apologies for not writing earlier.

I have read your code, it works perfectly, and I have taken the liberty of modifying a problem.

C:

#include <stdio.h>

void print(__uint128_t value) {

long long hi = value >> 64;
long long lo = value & 0xffffffffffffffffll;

fprintf(stdout, "%016llx%016llx", hi, lo);

}

__uint128_t multiply(__uint128_t X, __uint128_t Y) {

__uint128_t R = ((__uint128_t)0xe1) << 120;
__uint128_t Z = 0;
__uint128_t V = X;

for (unsigned char i=0; i < 128; i++) {

__uint128_t mask = ((__uint128_t)1) << (127 - i);

if( (Y & mask) != 0) {

Z = Z ^ V;

}

if( (V & 1) == 0) {

V >>= 1;

}

else {

V = (V >> 1) ^ R;

}

}

return Z;

}

int main (int argc, char *argv[]) {

__uint128_t R = ((__uint128_t)0x952b2a56a5604ac0 << 64) | 0xb32b6656a05b40b6;
__uint128_t S = ((__uint128_t)0xdfa6bf4ded81db03 << 64) | 0xffcaff95f830f061;

print(multiply(R,S));

}

Pass your code to C, I like it better than C++, I hope you don't mind, it compiles for example with a simple clang example.c.

Well as you can see I changed the __int128_t to __uint128_t, the reason is that if you treat the data as an int, I think it would be signed by default, the result is wrong.

By treating the data as unsigned, it is as if it were in assembler, it does not treat the most significant bit as negative. I'm sorry if I don't explain myself well.

You can check it, if I change the __uint128_t by __int128_t of your code, the result is wrong, that's why it treats the data as signed, but I don't know why, I don't get along very well with medium and high level languages...

I have used the test of the intel document that I shared in my previous message on page 71, there you can check how the result is correct, and its algorithm is correct.

When I write my code in ASM I will share it, but I don't know whether to do the GCM algorithm, or study the creation of M tables, to speed up the calculation of multiplications.

I'm going to take things more calmly, to try not to have errors in understanding the algorithm, apart from that I have to study XSL (I hate it), I have an exam next week, and I didn't study at all.

Thanks for the help, really.

unitrunker · Jan 16, 2022

Oh crud. It was the signed right shift inside print() (and the other right-shift on V). That carries the sign bit down. Nice work freejlr.

freejlr · Jan 28, 2022

The code:

Code:

.section .data

R: .quad 0xb32b6656a05b40b6,0x952b2a56a5604ac0
S: .quad 0xffcaff95f830f061,0xdfa6bf4ded81db03

.section .text
.globl _start
.align 16

_start:

pxor %xmm0, %xmm0                        # xmm0 | Z
movdqu R, %xmm1
movdqu S, %xmm2
movdqu %xmm1, (%rsp)                     # rsp  | Y
movdqu %xmm2, -16(%rsp)                  # rsp-16  | V

_l02:

movq $15, %rax                           # byte position

_l01:

movq $7, %r8

_l0:

btq %r8, (%rsp, %rax)                 # check the bit posotion 0 - 7
jnc _l1                                 # bit equal to 0,  jump to _l1

movdqu -16(%rsp), %xmm3
pxor %xmm3, %xmm0                        # bit equal to 1, Z xor V

_l1:

btq $0, -16(%rsp)                        # check V[127] equal to 1 V = rightshift(V) xor R
jc _l2                                   # equal to 0, V = rightshift(V)


movq -8(%rsp), %rbx
shlq $63,%rbx
shrq $1, -8(%rsp)
shrq $1, -16(%rsp)                       #Rightshift V in memory
orq %rbx, -16(%rsp)

jmp _l3

_l2:

movq -8(%rsp), %rbx
shlq $63, %rbx
shrq $1, -8(%rsp)
shrq $1, -16(%rsp)                       #Rightshift V in memory and xored by R
orq %rbx, -16(%rsp)
xorb $0xe1, -1(%rsp)

_l3:

decq %r8                                   #next bit position
jns _l0                                     #r9 equal to 0, jump to next byte

decq %rax                                  #byte addition, rax equal to 0, block finished
jns _l01


movd %xmm0, %rdi
movdqu %xmm0, (%rsp)
movq 8(%rsp), %rdi
movq $1, %rax
syscall

The parts:

Code:

pxor %xmm0, %xmm0                        # xmm0 | Z
movdqu R, %xmm1
movdqu S, %xmm2
movdqu %xmm1, (%rsp)                     # rsp  | Y
movdqu %xmm2, -16(%rsp)                  # rsp-16  | V


movd %xmm0, %rdi
movdqu %xmm0, (%rsp)
movq 8(%rsp), %rdi
movq $1, %rax
syscall

They are the parameter pass, and I use the syscall exit to check the result of the mul. Yes, it is a very bad method, but I have not written any routine in ASM to transform integers to strings. So I can call printf, and I refuse to use any transformation function.They have to be ignored they don't belong to the crowd

Likewise, the parameter taking can be done differently and the return to xmm0 of course.

I follow the construction of the algorithm whenever I can. I think that programming the complete GCM algorithm, and then studying the M tables to speed up the calculation of the mul

Regards.

freejlr · Jan 28, 2022

Hi

Note, I was in class tutoring today, And they had nothing to say, so I

And it occurred to me to look at a pdf I have from Agner Fog that talks about instruction latencies, performance, etc...

Agner tables

And as I feared, the btq %r8, (%rsp, %rax) instruction is very heavy when it works with a register and a memory address, but the btq $0, -16(%rsp) below, which is with a constant and a memory address, is not as slow as the previous one.

So to eliminate the upper bt, it occurred to me to use shlb, since when doing a bit shift, the last bit shifted will be saved in the CF flag, which is the same operation as bt, but well, I also need the counter register r8, If someone comes up with something to improve this, I guess working with registers and not so much with memory.

Well looking at Agner's tables, the change is much better, shlb is much less heavy working with constant and memory than bt with register and memory.

The modified part is this:

Code:

btq %r8, (%rsp, %rax)                 # check the bit posotion 0 - 7
jnc _l1                               # bit equal to 0,  jump to _l1

//For this

shlb $1, (%rsp, %rax)                 # check the bit posotion 0 - 7
jnc _l1                                 # bit equal to 0,  jump to _l1

Well guys I think I'm going to leave multiplication, I'll continue when I can with the rest, but next week I have a lot of exams and work.

But surely they will have news about any questions you have in the future.

Regards.