# An efficient way to get the bytes of a bitstype, e.g. convert `Float32` to `NTuple{4, UInt8}`?
Ideally it would be something like `reinterpret(NTuple{4, UInt8}, my_float)`, but that doesn't actually work. I could reinterpret as `UInt32` and pull the bytes out by hand, but a) that's not as cheap as a no-op reinterpret cast, b) I think it assumes the endianness of the platform, and c) it wouldn't extend to more complex bits types that are larger than 4 bytes.
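On point b), here is a minimal sketch of the by-hand version, assuming a `Float32` input; the names `bytes_lsb_first` and `bytes_msb_first` are made up for illustration. Because the shifts act on the integer value rather than on memory, the order of the result should not depend on the platform's endianness. It is the pointer- and array-based reinterpretations whose byte order follows the platform.

```julia
# Hypothetical helper: extract the four bytes of a Float32 by shifting,
# least-significant byte first.
function bytes_lsb_first(x::Float32)
    u = reinterpret(UInt32, x)  # same 32 bits, viewed as an unsigned integer
    # `% UInt8` truncates to the low 8 bits after each shift.
    return ntuple(i -> (u >> (8 * (i - 1))) % UInt8, Val(4))
end

# Same bytes, most-significant byte first (network order).
bytes_msb_first(x::Float32) = reverse(bytes_lsb_first(x))
```

For example, `bytes_lsb_first(2.5f0)` gives `(0x00, 0x00, 0x20, 0x40)`, since `2.5f0` has the bit pattern `0x40200000`.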
`f(x) = Tuple(reinterpret(reshape, UInt8, [x]))` seems to work, though the assembly looks pretty bad.
Putting it into an array definitely looks bad, I wonder if putting it into a Ref works better?
My current implementation is this:
let f = Ref(f)
    ptr = Base.unsafe_convert(Ptr{Float32}, f)
    ptrBytes = Base.unsafe_convert(Ptr{NTuple{4, UInt8}}, ptr)
    unsafe_load(ptrBytes)
end
Which I think amounts to a memcpy? But I don't really know how to read assembly
it just looks like a bunch of shifts to me
```asm
pushq %rbp
movq %rsp, %rbp
; │ @ REPL[27]:5 within hack
; │┌ @ pointer.jl:111 within unsafe_load @ pointer.jl:111
vmovd %xmm0, %ecx
movq %rdi, %rax
movl %ecx, %edx
movl %ecx, %esi
shrl $24, %esi
shrl $16, %edx
; │└
movb %sil, 3(%rdi)
movb %dl, 2(%rdi)
movb %ch, 1(%rdi)
movb %cl, (%rdi)
popq %rbp
retq
; └
; ┌ @ REPL[27]:5 within <invalid>
nopw %cs:(%rax,%rax)
```
const F = 2.5f0
f() = let f = Ref(F)
    ptr = Base.unsafe_convert(Ptr{Float32}, f)
    ptrBytes = Base.unsafe_convert(Ptr{NTuple{4, UInt8}}, ptr)
    unsafe_load(ptrBytes)
end
@code_llvm f()
; @ REPL[54]:1 within `f`
; Function Attrs: uwtable
define [4 x i8] @julia_f_637() #0 {
top:
; @ REPL[54]:4 within `f`
ret [4 x i8] c"\00\00 @"
}
@code_native f()
.text
.file "f"
.globl julia_f_659 # -- Begin function julia_f_659
.p2align 4, 0x90
.type julia_f_659,@function
julia_f_659: # @julia_f_659
; ┌ @ REPL[54]:1 within `f`
.cfi_startproc
# %bb.0: # %top
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
movq %rcx, %rax
; │ @ REPL[54]:4 within `f`
movl $1075838976, (%rcx) # imm = 0x40200000
popq %rbp
retq
.Lfunc_end0:
.size julia_f_659, .Lfunc_end0-julia_f_659
.cfi_endproc
; └
# -- End function
.type .L_j_const1,@object # @_j_const1
.section .rodata.cst4,"aM",@progbits,4
.p2align 2
.L_j_const1:
.long 0x40200000 # float 2.5
.size .L_j_const1, 4
.section ".note.GNU-stack","",@progbits
I think the function has just been constant-folded, since you've declared `F` to be a `const`.
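One way to sidestep the folding, sketched here under the assumption that the input is a `Float32`, is to pass the value as an argument instead of reading a `const` global. The name `float_bytes` is hypothetical, and the `GC.@preserve` is an addition that keeps the `Ref` rooted while the raw pointer is in use.

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

# Hypothetical variant of the Ref-based approach that takes the value as an argument.
function float_bytes(x::Float32)
    r = Ref(x)
    GC.@preserve r begin  # keep `r` alive while we hold a raw pointer into it
        p = Base.unsafe_convert(Ptr{Float32}, r)
        unsafe_load(Ptr{NTuple{4, UInt8}}(p))
    end
end

# Code is generated for the argument type, not the value,
# so the body cannot collapse to a single constant.
@code_llvm debuginfo=:none float_bytes(2.5f0)
```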
Oh yeah, here it is not optimized out:
; @ REPL[61]:1 within `f`
; Function Attrs: uwtable
define [4 x i8] @julia_f_681() #0 {
top:
%0 = load atomic i32*, i32** inttoptr (i64 2739371800856 to i32**) unordered, align 8
; ┌ @ refpointer.jl:136 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
%1 = load i32, i32* %0, align 4
; └└
; @ REPL[61]:4 within `f`
; ┌ @ pointer.jl:111 within `unsafe_load` @ pointer.jl:111
%.sroa.0.0.extract.trunc = trunc i32 %1 to i8
%.sroa.0.1.extract.shift = lshr i32 %1, 8
%.sroa.0.1.extract.trunc = trunc i32 %.sroa.0.1.extract.shift to i8
%.sroa.0.2.extract.shift = lshr i32 %1, 16
%.sroa.0.2.extract.trunc = trunc i32 %.sroa.0.2.extract.shift to i8
%.sroa.0.3.extract.shift = lshr i32 %1, 24
%.sroa.0.3.extract.trunc = trunc i32 %.sroa.0.3.extract.shift to i8
; └
%.fca.0.insert = insertvalue [4 x i8] zeroinitializer, i8 %.sroa.0.0.extract.trunc, 0
%.fca.1.insert = insertvalue [4 x i8] %.fca.0.insert, i8 %.sroa.0.1.extract.trunc, 1
%.fca.2.insert = insertvalue [4 x i8] %.fca.1.insert, i8 %.sroa.0.2.extract.trunc, 2
%.fca.3.insert = insertvalue [4 x i8] %.fca.2.insert, i8 %.sroa.0.3.extract.trunc, 3
ret [4 x i8] %.fca.3.insert
Here's the same code but for a struct of Int32, Float32, and Float64:
; @ REPL[68]:1 within `f`
; Function Attrs: uwtable
define void @julia_f_704([16 x i8]* noalias nocapture noundef nonnull sret([16 x i8]) align 1 dereferenceable(16) %0, { i32, float, double }* nocapture noundef nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
; ┌ @ refpointer.jl:136 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
%2 = bitcast { i32, float, double }* %1 to i8*
; └└
; @ REPL[68]:4 within `f`
%3 = getelementptr inbounds [16 x i8], [16 x i8]* %0, i64 0, i64 0
call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 1 dereferenceable(16) %3, i8* noundef nonnull align 8 dereferenceable(16) %2, i64 16, i1 false)
ret void
}
So yeah, looks like a memcpy. To be fair, I think that's also the best you can do in C++ without type punning which technically invokes UB
Although to test that in Julia I have to write code which goes on to use this data, and see if it still does the memcpy
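A minimal sketch of that kind of test, generalised to any isbits type. The struct `S` below is only a stand-in with the field types mentioned above (its real definition isn't shown in the thread), and `to_bytes` and `count_nonzero` are hypothetical names.

```julia
# Stand-in for the struct discussed above (Int32, Float32, Float64); assumed layout.
struct S
    a::Int32
    b::Float32
    c::Float64
end

# Copy the raw bytes of any isbits value into a tuple of UInt8.
function to_bytes(x::T) where {T}
    isbitstype(T) || throw(ArgumentError("to_bytes needs an isbits type, got $T"))
    r = Ref(x)
    GC.@preserve r begin  # keep `r` rooted while we hold a raw pointer into it
        p = Base.unsafe_convert(Ptr{T}, r)
        unsafe_load(Ptr{NTuple{sizeof(T), UInt8}}(p))
    end
end

# A consumer that actually uses every byte, so the copy cannot simply be discarded.
count_nonzero(s::S) = count(!iszero, to_bytes(s))
```

Running `@code_llvm count_nonzero(S(4, 3.5, -20.1))` should then show whether the `memcpy` survives once the bytes are consumed.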
I don't have code for this because I don't know the Julia C bindings that well, but I would allocate a 4-element array of `UInt8`s, get the pointer to its first element with jl_array_ref(, 0), copy the float's bytes to that location, and then construct the tuple from the array, which shouldn't allocate afaik.
That way it's one copy of 32 bits and one allocation of 32 bits, which is optimal
since the values are stored in order it should overwrite the previous memory so the array now has the bits from the float
It looks like Julia is able to optimize it into a pointer load instead of actually copying everything! Or am I misreading this?
julia> g(s::S) = begin
           data = f(s)
           return data[3] # Grab 3rd byte
       end
g (generic function with 1 method)
julia> @code_llvm g(S(4, 3.5, -20.1))
; @ REPL[75]:1 within `g`
; Function Attrs: uwtable
define i8 @julia_g_709({ i32, float, double }* nocapture noundef nonnull readonly align 8 dereferenceable(16) %0) #0 {
top:
; @ REPL[75]:2 within `g`
; ┌ @ REPL[73]:1 within `f`
; │┌ @ refpointer.jl:136 within `Ref`
; ││┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
%.sroa.2.0..sroa_raw_cast = bitcast { i32, float, double }* %0 to i8*
%.sroa.2.0..sroa_raw_idx = getelementptr inbounds i8, i8* %.sroa.2.0..sroa_raw_cast, i64 2
%.sroa.2.0.copyload = load i8, i8* %.sroa.2.0..sroa_raw_idx, align 2
; └└└
; @ REPL[75]:3 within `g`
ret i8 %.sroa.2.0.copyload
}
Is Julia able to optimize out the heap allocation for that array?
for float to a tuple of 4 UInt8 there was no heap allocation.
nvm
wait yes there was no allocation
I was asking about @abstract grail's approach using an array
ahh mb
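To compare the two approaches directly, one option is to measure allocations with `@allocated`. This is only a sketch: `bytes_array`, `bytes_ref`, and `report` are hypothetical names, and the numbers will depend on the Julia version, so none are claimed here.

```julia
# Array-based variant from earlier in the thread.
bytes_array(x::Float32) = Tuple(reinterpret(reshape, UInt8, [x]))

# Ref-based variant, with GC.@preserve added to keep the Ref rooted.
function bytes_ref(x::Float32)
    r = Ref(x)
    GC.@preserve r begin
        p = Base.unsafe_convert(Ptr{Float32}, r)
        unsafe_load(Ptr{NTuple{4, UInt8}}(p))
    end
end

# Report the bytes allocated by each variant.
function report(x::Float32)
    bytes_array(x); bytes_ref(x)  # warm up so compilation is not measured
    a = @allocated bytes_array(x)
    r = @allocated bytes_ref(x)
    return (array = a, ref = r)
end

println(report(2.5f0))
```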
idk how to read llvm but this was my idea:
julia> array = Array{UInt8}(undef, 4);
julia> float = Ref{Float32}(1234.5678)
Base.RefValue{Float32}(1234.5677f0)
julia> ptr = ccall(:jl_arrayref, Ptr{Cvoid}, (Any,), array)
Ptr{Nothing} @0x00007f5441a0f810
julia> ccall(:memcpy, Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t), ptr, float, 4)
julia> array
4-element Vector{UInt8}:
0x2b
0xb2
0xfa
0x38
julia> Tuple(array)
(0x2b, 0xb2, 0xfa, 0x38)
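As far as I can tell, `jl_arrayref` returns the (boxed) element at an index rather than a pointer into the array's storage, so the `memcpy` above may not be writing into `array` at all. On a little-endian machine `1234.5678f0` has the bit pattern `0x449a522b`, so the expected bytes would be `(0x2b, 0x52, 0x9a, 0x44)`. A minimal sketch of the same idea using `pointer(buf)` instead, with the hypothetical name `bytes_via_array`:

```julia
# Hypothetical helper: copy a Float32's bytes into a temporary Vector{UInt8},
# then build the tuple from it.
function bytes_via_array(x::Float32)
    buf = Vector{UInt8}(undef, sizeof(x))
    src = Ref(x)
    GC.@preserve buf src begin  # keep both objects alive while raw pointers are in use
        dst = pointer(buf)  # Ptr{UInt8} to the first element of the array data
        p = Ptr{UInt8}(Base.unsafe_convert(Ptr{Float32}, src))
        unsafe_copyto!(dst, p, sizeof(x))  # copies 4 bytes
    end
    return ntuple(i -> buf[i], Val(4))
end
```

This still heap-allocates the temporary buffer, which the `Ref`-based version avoids.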