# An efficient way to get the bytes of a bitstype, e.g. convert `Float32` to `NTuple{4, UInt8}`?

ripe herald
#

I'm working on some C interop code that packs structs into very specific memory layouts.

#

Ideally it would be something like reinterpret(NTuple{4, UInt8}, my_float), but that doesn't actually work. I could reinterpret as UInt32 and pull the bytes out by hand, but that's a) not as cheap as a no-op reinterpret cast, b) I think it assumes the endianness of the platform, c) it wouldn't extend to more complex bits types that are larger than 4 bytes
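
For reference, the "reinterpret as UInt32 and pull the bytes out by hand" route could be sketched as below. The helper name `le_bytes` is made up for illustration. Because the bytes come from shifts on the integer value, the result is in little-endian order regardless of the host's byte order, and the function only handles 4-byte floats, which is the limitation described in (c).

```julia
# Hypothetical helper: bytes of a Float32, least significant byte first.
function le_bytes(x::Float32)
    u = reinterpret(UInt32, x)  # same bits, viewed as an unsigned integer
    # Byte i is bits 8(i-1) .. 8i-1 of `u`; `% UInt8` keeps the low 8 bits.
    return ntuple(i -> (u >> (8 * (i - 1))) % UInt8, Val(4))
end

le_bytes(2.5f0)  # (0x00, 0x00, 0x20, 0x40), since 2.5f0 has the bit pattern 0x40200000
```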

errant ice
#

f(x) = Tuple(reinterpret(reshape, UInt8, [x])) seems to work, the assembly looks pretty bad though

ripe herald
#

Putting it into an array definitely looks bad, I wonder if putting it into a Ref works better?

errant ice
#

not supported

#

maybe you can see how reinterpret is implemented and use it?

ripe herald
#

My current implementation is this:

let f = Ref(f)
  ptr = Base.unsafe_convert(Ptr{Float32}, f)
  ptrBytes = Base.unsafe_convert(Ptr{NTuple{4, UInt8}}, ptr)
  unsafe_load(ptrBytes)
end
#

Which I think amounts to a memcpy? But I don't really know how to read assembly
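
One caveat with the snippet above: a raw pointer obtained from a `Ref` is only valid while the `Ref` is kept alive, and `GC.@preserve` is the usual way to guarantee that. A minimal sketch of the same idea with that guard added; the function name `float_bytes` is hypothetical.

```julia
# Hypothetical helper: same pointer trick as above, with the Ref kept rooted.
function float_bytes(x::Float32)
    r = Ref(x)
    # Keep `r` alive for as long as its raw pointer is in use.
    GC.@preserve r begin
        ptr = Base.unsafe_convert(Ptr{Float32}, r)
        # Reinterpret the address as a pointer to four bytes and load them.
        unsafe_load(convert(Ptr{NTuple{4, UInt8}}, ptr))
    end
end

float_bytes(2.5f0)  # bytes in the host's native byte order
```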

errant ice
#

it just looks like a bunch of shifts to me

```asm
pushq %rbp
movq %rsp, %rbp
; │ @ REPL[27]:5 within hack
; │┌ @ pointer.jl:111 within unsafe_load @ pointer.jl:111
vmovd %xmm0, %ecx
movq %rdi, %rax
movl %ecx, %edx
movl %ecx, %esi
shrl $24, %esi
shrl $16, %edx
; │└
movb %sil, 3(%rdi)
movb %dl, 2(%rdi)
movb %ch, 1(%rdi)
movb %cl, (%rdi)
popq %rbp
retq
; └
; ┌ @ REPL[27]:5 within <invalid>
nopw %cs:(%rax,%rax)
```

ripe herald
#
const F = 2.5f0
f() = let f = Ref(F)
    ptr = Base.unsafe_convert(Ptr{Float32}, f)
    ptrBytes = Base.unsafe_convert(Ptr{NTuple{4, UInt8}}, ptr)
    unsafe_load(ptrBytes)
end

@code_llvm f()
;  @ REPL[54]:1 within `f`
; Function Attrs: uwtable
define [4 x i8] @julia_f_637() #0 {
top:
;  @ REPL[54]:4 within `f`
  ret [4 x i8] c"\00\00 @"
}

@code_native f()
        .text
        .file   "f"
        .globl  julia_f_659                     # -- Begin function julia_f_659
        .p2align        4, 0x90
        .type   julia_f_659,@function
julia_f_659:                            # @julia_f_659
; ┌ @ REPL[54]:1 within `f`
        .cfi_startproc
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movq    %rcx, %rax
; │ @ REPL[54]:4 within `f`
        movl    $1075838976, (%rcx)             # imm = 0x40200000
        popq    %rbp
        retq
.Lfunc_end0:
        .size   julia_f_659, .Lfunc_end0-julia_f_659
        .cfi_endproc
; └
                                        # -- End function
        .type   .L_j_const1,@object             # @_j_const1
        .section        .rodata.cst4,"aM",@progbits,4
        .p2align        2
.L_j_const1:
        .long   0x40200000                      # float 2.5
        .size   .L_j_const1, 4

        .section        ".note.GNU-stack","",@progbits

errant ice
#

I think the function has just been optimized out as you've declared F to be a constant
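
A small way to see this effect in isolation, sketched with a plain `reinterpret` rather than the pointer code: one function reads a `const` global, the other takes its input as an argument. The names `bits_const` and `bits_arg` are hypothetical. `code_llvm` from `InteractiveUtils` is called in its function form so the IR can be captured as a string.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

const F = 2.5f0

# Input is a compile-time constant, so the compiler can fold the whole body.
bits_const() = reinterpret(UInt32, F)

# Input is only known at run time, so the conversion has to stay in the code.
bits_arg(x::Float32) = reinterpret(UInt32, x)

ir_const = sprint(code_llvm, bits_const, Tuple{})
ir_arg = sprint(code_llvm, bits_arg, Tuple{Float32})

println(ir_const)
println(ir_arg)
```

Comparing the two printouts should show the constant version reduced to returning a literal, which is why inspecting the argument-taking form is more informative.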

ripe herald
#

Oh yeah, here it is not optimized out:

;  @ REPL[61]:1 within `f`
; Function Attrs: uwtable
define [4 x i8] @julia_f_681() #0 {
top:
  %0 = load atomic i32*, i32** inttoptr (i64 2739371800856 to i32**) unordered, align 8
; ┌ @ refpointer.jl:136 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
    %1 = load i32, i32* %0, align 4
; └└
;  @ REPL[61]:4 within `f`
; ┌ @ pointer.jl:111 within `unsafe_load` @ pointer.jl:111
   %.sroa.0.0.extract.trunc = trunc i32 %1 to i8
   %.sroa.0.1.extract.shift = lshr i32 %1, 8
   %.sroa.0.1.extract.trunc = trunc i32 %.sroa.0.1.extract.shift to i8
   %.sroa.0.2.extract.shift = lshr i32 %1, 16
   %.sroa.0.2.extract.trunc = trunc i32 %.sroa.0.2.extract.shift to i8
   %.sroa.0.3.extract.shift = lshr i32 %1, 24
   %.sroa.0.3.extract.trunc = trunc i32 %.sroa.0.3.extract.shift to i8
; └
  %.fca.0.insert = insertvalue [4 x i8] zeroinitializer, i8 %.sroa.0.0.extract.trunc, 0
  %.fca.1.insert = insertvalue [4 x i8] %.fca.0.insert, i8 %.sroa.0.1.extract.trunc, 1
  %.fca.2.insert = insertvalue [4 x i8] %.fca.1.insert, i8 %.sroa.0.2.extract.trunc, 2
  %.fca.3.insert = insertvalue [4 x i8] %.fca.2.insert, i8 %.sroa.0.3.extract.trunc, 3
  ret [4 x i8] %.fca.3.insert
}
#

Here's the same code but for a struct of Int32, Float32, and Float64:

;  @ REPL[68]:1 within `f`
; Function Attrs: uwtable
define void @julia_f_704([16 x i8]* noalias nocapture noundef nonnull sret([16 x i8]) align 1 dereferenceable(16) %0, { i32, float, double }* nocapture noundef nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
; ┌ @ refpointer.jl:136 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
    %2 = bitcast { i32, float, double }* %1 to i8*
; └└
;  @ REPL[68]:4 within `f`
  %3 = getelementptr inbounds [16 x i8], [16 x i8]* %0, i64 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 1 dereferenceable(16) %3, i8* noundef nonnull align 8 dereferenceable(16) %2, i64 16, i1 false)
  ret void
}
#

So yeah, looks like a memcpy. To be fair, I think that's also the best you can do in C++ without type punning which technically invokes UB
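
For the struct case, a generic version of the same pointer trick might look like the sketch below. The field names of `S` and the helper name `raw_bytes` are assumptions; only the field types come from the discussion. Note that for structs with padding, the padding bytes would hold unspecified values; this particular layout has none.

```julia
# Field names are hypothetical; the field types match the struct discussed above.
struct S
    a::Int32
    b::Float32
    c::Float64
end

# Hypothetical helper: copy the in-memory representation of any isbits value.
function raw_bytes(x::T) where {T}
    isbitstype(T) || throw(ArgumentError("$T is not an isbits type"))
    r = Ref(x)
    # Keep `r` alive while its raw pointer is in use.
    GC.@preserve r begin
        ptr = Base.unsafe_convert(Ptr{T}, r)
        unsafe_load(convert(Ptr{NTuple{sizeof(T), UInt8}}, ptr))
    end
end

s = S(4, 3.5, -20.1)
bytes = raw_bytes(s)  # 16 bytes in the host's native byte order

# Byte offsets of the fields within the struct: (0, 4, 8)
offsets = ntuple(i -> Int(fieldoffset(S, i)), fieldcount(S))
```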

#

Although to test that in Julia I have to write code which goes on to use this data, and see if it still does the memcpy

abstract grail
#

I don't have code for this because I don't know the Julia C bindings that well, but I would allocate a 4-element array of UInt8s, get the pointer to its first element with jl_array_ref(, 0), and then hard-copy the float to that location. Then construct the tuple from the array, which shouldn't allocate afaik.

That way it's one copy of 32 bits and one allocation of 32 bits, which is optimal

#

since the values are stored in order, it should overwrite the previous memory, so the array now has the bits from the float

ripe herald
#

It looks like Julia is able to optimize it into a pointer load instead of actually copying everything! Or am I misreading this?

julia> g(s::S) = begin
          data = f(s)
          return data[3] # Grab 3rd byte
       end
g (generic function with 1 method)

julia> @code_llvm g(S(4, 3.5, -20.1))
;  @ REPL[75]:1 within `g`
; Function Attrs: uwtable
define i8 @julia_g_709({ i32, float, double }* nocapture noundef nonnull readonly align 8 dereferenceable(16) %0) #0 {
top:
;  @ REPL[75]:2 within `g`
; ┌ @ REPL[73]:1 within `f`
; │┌ @ refpointer.jl:136 within `Ref`
; ││┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
     %.sroa.2.0..sroa_raw_cast = bitcast { i32, float, double }* %0 to i8*
     %.sroa.2.0..sroa_raw_idx = getelementptr inbounds i8, i8* %.sroa.2.0..sroa_raw_cast, i64 2
     %.sroa.2.0.copyload = load i8, i8* %.sroa.2.0..sroa_raw_idx, align 2
; └└└
;  @ REPL[75]:3 within `g`
  ret i8 %.sroa.2.0.copyload
}
#

Is Julia able to optimize out the heap allocation for that array?

errant ice
#

for float to a tuple of 4 UInt8 there was no heap allocation.

#

nvm

#

wait yes there was no allocation

ripe herald
#

I was asking about @abstract grail's approach using an array

errant ice
#

ahh mb

abstract grail
#

idk how to read llvm but this was my idea:

julia> array = Array{UInt8}(undef, 4);

julia> float = Ref{Float32}(1234.5678)
Base.RefValue{Float32}(1234.5677f0)

julia> ptr = ccall(:jl_arrayref, Ptr{Cvoid}, (Any,), array)
Ptr{Nothing} @0x00007f5441a0f810

julia> ccall(:memcpy, Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t), ptr, float, 4)

julia> array
4-element Vector{UInt8}:
 0x2b
 0xb2
 0xfa
 0x38

julia> Tuple(array)
(0x2b, 0xb2, 0xfa, 0x38)
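
The same array-based idea can also be sketched without `ccall`, using `pointer(array)` and `unsafe_copyto!` from Base in place of `jl_arrayref` and `memcpy`. This is a swapped-in variant, not the session above, and the function name `bytes_via_array` is hypothetical.

```julia
# Hypothetical helper: copy the float's bytes into a byte buffer, then build a tuple.
function bytes_via_array(x::Float32)
    buf = Vector{UInt8}(undef, sizeof(x))
    r = Ref(x)
    # Keep both the buffer and the Ref alive while raw pointers are in use.
    GC.@preserve buf r begin
        src = convert(Ptr{UInt8}, Base.unsafe_convert(Ptr{Float32}, r))
        unsafe_copyto!(pointer(buf), src, sizeof(x))
    end
    # A fixed-length ntuple keeps the return type concrete, unlike Tuple(buf).
    return ntuple(i -> buf[i], Val(4))
end

bytes_via_array(1234.5678f0)

# Reports how many bytes a call allocated; useful for checking whether
# the temporary buffer was optimized away on a given Julia version.
@allocated bytes_via_array(1234.5678f0)
```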