For some homework I need to vectorise the k-loop of this code:
```
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```
to which I've produced this:
```
void q1_vec_k() {
    //note: B array is transposed during the init routine
    __m256 cV, bV, aV;
    __m128 xmm1; //vector for extracting significant 128 bits of cV vector to use SS command
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            cV = _mm256_setzero_ps(); //sets the value of all C vector elements to 0
            for (int k = 0; k < (N/8) * 8; k += 8) {
                bV = _mm256_loadu_ps(&kB[k][j]); //loads 8 values of the transposed B-Array (column order of the original array)
                aV = _mm256_loadu_ps(&A[i][k]); //loads 8 values of the A array (row order)
                cV = _mm256_fmadd_ps(aV, bV, cV); //cv += (aV * bV)
            }
            cV = _mm256_hadd_ps(cV, cV);
            cV = _mm256_hadd_ps(cV, cV);
            cV = _mm256_hadd_ps(cV, cV); //at this point 1 element of cV is the sum of 8 elements
            xmm1 = _mm256_extractf128_ps(cV, 0); //extracts the first 4 elements of cV, converting it to a 128 bit vector, allowing for use of store_ss
            _mm_store_ss(&kC[i][j], xmm1);
            for (int k = (N / 8) * 8; k < N; k++) {
                kC[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```
but this doesn't produce the same result: the values are off by roughly 20 to 70 from what they should be, so the check against the expected result returns false, and it's really stressing me out. Can anyone help?
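For reference, here is a minimal scalar sketch of what a k-loop vectorised 8 floats at a time is meant to compute. It is only a model under stated assumptions, not the homework code: `BT`, `LANES`, `matmul_ref`, `matmul_k_model`, `model_max_abs_diff` and the size `N = 11` are all hypothetical names and values chosen for illustration, and `BT` is assumed to hold B transposed so that `BT[j][k] == B[k][j]`.

```c
#define N 11      /* hypothetical size, deliberately not a multiple of 8 */
#define LANES 8   /* number of floats held by one __m256 */

/* Plain triple loop from the question: C += A * B. */
static void matmul_ref(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Scalar model of the vectorised k-loop.
   Assumption: BT is B transposed, i.e. BT[j][k] == B[k][j]. */
static void matmul_k_model(float A[N][N], float BT[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float acc[LANES] = {0};               /* stands in for the __m256 accumulator */
            int k = 0;
            for (; k + LANES <= N; k += LANES)
                for (int l = 0; l < LANES; l++)   /* one "FMA": every lane steps along k */
                    acc[l] += A[i][k + l] * BT[j][k + l];
            float sum = 0.0f;
            for (int l = 0; l < LANES; l++)       /* horizontal sum over all 8 lanes */
                sum += acc[l];
            for (; k < N; k++)                    /* scalar tail, same transposed indexing */
                sum += A[i][k] * BT[j][k];
            C[i][j] += sum;
        }
    }
}

/* Fills small matrices, runs both versions and returns the largest
   absolute difference between the two results. */
static float model_max_abs_diff(void) {
    static float A[N][N], B[N][N], BT[N][N], C1[N][N], C2[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)(i + 2 * j) * 0.5f;
            B[i][j] = (float)(j - i) * 0.25f;
            BT[j][i] = B[i][j];                   /* build the transposed copy */
            C1[i][j] = 0.0f;
            C2[i][j] = 0.0f;
        }
    matmul_ref(A, B, C1);
    matmul_k_model(A, BT, C2);
    float worst = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float d = C1[i][j] - C2[i][j];
            if (d < 0.0f) d = -d;
            if (d > worst) worst = d;
        }
    return worst;
}
```

Calling `model_max_abs_diff()` from a test harness gives the largest gap between the plain loop and the blocked model. The three points the model makes explicit are that every lane advances along k, that the final reduction covers all eight lanes, and that the tail loop uses the same transposed indexing as the blocked part. Checking the intrinsics version against each of these in turn may help narrow down where the two results diverge.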
Full code listed below: