#archived-dots
no worries, im stuck coding my own project and Im trying to map out the component architecture in my head
That only exists by itself... you mean that doesn't have any components?
Normally, you'll have an entity archetype (like a class) with a lot of entities constructed from it (like an array of identical class type but different properties)
Aaaah
A singleton entity is a single entity of a single archetype (class). Like when you create an object from a class and only that 1 object.
I think with your last message I understood better
How do you create an archetype of entities?
And can an archetype inherit from another
Or I'm thinking like object oriented haha
No. Archetypes are different even if they share some of the same components
Just completely throw inheritance out the window
There is "polymorphism" but it's completely pointer based and reinterpreting things.
There is no inheritance, there is, largely, no references. Everything is value typed and isolated.
Those components, are you simulating real life? xD
Economic simulator
Wow
That is possibly the most pure Object Oriented coding I have ever seen. And Im converting it to DOTS
The inheritance and event based programming is insane
It makes my eyes water seeing the garbage code
It's not mine, it's some guy's academic model
yea
I think it's perfect for DOTS, once I somehow wrap my head around some of the relational aspects that Im struggling to convert to DOTS
Sounds cool!
HPC# is not 'C# wanting to be like c++'. it's a dialect of c# and why and how you can read all about here: https://lucasmeijer.com/posts/cpp_unity/
Anyways, an "archetype" is like a class name and the components are like the properties of that class. You have to manually construct the archetype though which is what that image is doing.
And what's the use of an archetype
It's the "object" type used to define the array that the data will be located in
So you run a system in all the entities that belong to an archetype?
Like how you make var arrayOfStrings = new String[15]
That string is the archetype of that array
Yep. That's how it's used. Like running a foreach over an array.
Generally, the translation from Unity DOTS terminology to OOP terms is roughly this: archetype = type, component = property, chunk = array.
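A minimal sketch of that analogy in plain C#, not the Unity Entities API. The names `Position`, `Wealth`, `CitizenChunk` and `TaxSystem` are hypothetical and exist only for illustration: components are data-only structs, the "chunk" is a set of parallel arrays for one archetype, and a system is a loop over them.

```csharp
using System;

// Components are plain value types: data only, no inheritance.
public struct Position { public float X, Y; }
public struct Wealth { public int Coins; }

// Hypothetical stand-in for a chunk: parallel arrays, one per component type,
// holding every entity that shares the same archetype (Position + Wealth).
public sealed class CitizenChunk
{
    public readonly Position[] Positions;
    public readonly Wealth[] Wealths;
    public int Count;

    public CitizenChunk(int capacity)
    {
        Positions = new Position[capacity];
        Wealths = new Wealth[capacity];
    }

    // "Creating an entity" claims the next slot in every array and returns its index.
    public int Add(Position p, Wealth w)
    {
        Positions[Count] = p;
        Wealths[Count] = w;
        return Count++;
    }
}

public static class TaxSystem
{
    // A "system" is a loop over every entity of the archetype.
    public static void Run(CitizenChunk chunk, int tax)
    {
        for (int i = 0; i < chunk.Count; i++)
            chunk.Wealths[i].Coins -= tax;
    }
}
```

Real chunks are fixed-size memory blocks managed by the Entities package; the arrays here only mirror the "archetype decides which array the data lives in" idea.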
@robust scaffold not really sure how you make a game without structural changes
bitflags, re-interpretation / polymorphism, and buffer elements.
of course, structural changes will have to happen, because how else are the entities created in the first place, but minimizing them is my goal
polymorphism
At initialization, it sets index (since the target entity has not yet been created). At the end, it reads index, lookup the actual entity corresponding to that index, and sets the entity. No structural change, no entity shifting. Can all be done in parallel.
And Franco, ignore what I'm doing. This is advanced DOTS
bitflags is like what unity does to represent a LayerMask?
yes
yep. In DOTS you want to pack as much data together as you can, minimizing unused and wasted memory space, to fit an entire chunk into the legendary "cpu cache" for maximum performance
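For readers new to bitflags, a small sketch of the idea: several booleans packed into one byte so that changing state is a bit operation instead of adding or removing a component. The enum `CitizenFlags` and its members are hypothetical.

```csharp
using System;

// Up to eight independent booleans stored in a single byte.
[Flags]
public enum CitizenFlags : byte
{
    None = 0,
    Employed = 1 << 0,
    Homeowner = 1 << 1,
    Retired = 1 << 2,
}

public static class FlagOps
{
    // Turn a bit on.
    public static CitizenFlags Set(CitizenFlags flags, CitizenFlags bit) => flags | bit;

    // Turn a bit off.
    public static CitizenFlags Clear(CitizenFlags flags, CitizenFlags bit) => flags & ~bit;

    // Test whether a bit is on.
    public static bool Has(CitizenFlags flags, CitizenFlags bit) => (flags & bit) != 0;
}
```

Flipping a flag this way leaves the entity in the same archetype, which is the reason it avoids a structural change.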
that is beyond my comprehension at this point 🥲
Entity is 2 ints, 8 bytes wide. Does it really "need" to be an entity at any point? No. You can reinterpret that data at any point to be completely different so long as it fits within 8 byte size.
in that example, ushort is 2 bytes wide, the other 6 bytes are ignored.
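The screenshot being discussed is not preserved, so here is a hedged sketch of the same idea in plain .NET. `EntityLike` is a hypothetical stand-in for an 8-byte, two-int struct, and `Unsafe.As` reinterprets its first two bytes as a `ushort` while the remaining six bytes are left alone. Which bytes of `Index` those are depends on endianness.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical stand-in for an 8-byte entity handle: two ints, laid out in order.
[StructLayout(LayoutKind.Sequential)]
public struct EntityLike
{
    public int Index;
    public int Version;
}

public static class Reinterpret
{
    // Read the first two bytes of the struct as a ushort; no copy or conversion happens.
    public static ushort ReadLow16(ref EntityLike e) =>
        Unsafe.As<EntityLike, ushort>(ref e);

    // Overwrite only the first two bytes; the other six bytes are untouched.
    public static void WriteLow16(ref EntityLike e, ushort value) =>
        Unsafe.As<EntityLike, ushort>(ref e) = value;
}
```

In Unity code the same effect is usually reached through pointer casts or the collections' reinterpret helpers; the snippet above only shows the memory-level idea.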
tbh with a working fps prototype, the major slowdowns are really physics and hybrid unityengine stuff, not any jobs/systems doing structural changes
Yea, true. I'm using pure Entities as a simulation foundation then developing a game off the frame-end data via UI visualization. No actual connection to engine stuff.
So with my 2.5 million entities, every little bit of micro-optimization matters.
@robust scaffold understandable, thats totally not my use case
When ya think about, DOTS inheritance is simultaneously non-existent and extremely fluid compared to C#.
All one has to do is identify which struct is sized larger than the other and where each property is located, then you can switch between any two struct types completely free.
But probably when you re-interpret the data to the other struct it won't make sense. Unless the other struct has the same data, and more
More than it won't make sense, it won't be meaningful
Like if they're totally 2 different structs
kinda, sorta. If you already know what each byte corresponds to, you can piece together which set of bytes corresponds to which type using manual pointer reads.
Something like that. An initial enum identifies which type the following data corresponds to. Then a switch statement properly accesses the data within each struct type
what's field offset for? any docs? haven't actually seen that anywhere
Thats how one can mimic polymorphism within a component without structural changes.
FieldOffset is mainly for explicit struct layout: https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.structlayoutattribute?view=net-5.0
You've been coding in DOTS and dont know what field offset is? What have you been doing?
https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.structlayoutattribute?view=net-5.0
Ha, sniped
The C# way of writing unions
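A minimal sketch of such a union, assuming a made-up `Payload` component: an enum tag at offset 0 says which interpretation is valid, and two fields share the same four bytes at offset 4. `PayloadKind`, `Payload` and `PayloadOps` are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;

// Tag that says how the shared bytes should be read.
public enum PayloadKind : byte { Count, Ratio }

// Explicit layout: Count and Ratio overlap at offset 4, like a C union.
[StructLayout(LayoutKind.Explicit, Size = 8)]
public struct Payload
{
    [FieldOffset(0)] public PayloadKind Kind;
    [FieldOffset(4)] public int Count;
    [FieldOffset(4)] public float Ratio;
}

public static class PayloadOps
{
    // The switch picks the interpretation that matches the tag.
    public static double AsDouble(in Payload p)
    {
        switch (p.Kind)
        {
            case PayloadKind.Count: return p.Count;
            case PayloadKind.Ratio: return p.Ratio;
            default: throw new ArgumentOutOfRangeException(nameof(p));
        }
    }
}
```

Because both fields occupy the same bytes, writing `Ratio` changes what `Count` reads, which is exactly why the tag is needed.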
have absolutely not needed it, and i've converted lots of things to dots
Ah, I knew there was a term for what im doing but I couldnt remember
but it might be useful, that's why i asked
True. I'm here trying to micro optimize my burst code to vectorize everything, so identifying byte offsets at coding time in the IDE and hardcoding span jumps is key.
yeah, i'll definitely do that as well. just did not know about fieldoffset until now.
So... if you have one value stored there in that piece of memory, can you interpret that piece of memory, without changing it as more than one struct?
There's also LayoutKind.Sequential, I use it for manual component based ring buffers. LayoutKind.Auto is not supported by Burst so dont use it.
Or is to re-use that piece of memory, after changing some values, and then re-interpret it as another component
yep. That is what i do to maintain identical shared and chunk components
Well that isnt the union example I put up earlier but it's along the same strategy
TShared and TDestination are identical structs, just typed differently because SharedComponentData and ComponentData can not be one component.
So I use the magic of pointer casting to just reinterpret TShared to TDestination and set the chunk component as TShared without any other changes.
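The original code is not preserved in the log. A hedged sketch of the technique in plain .NET, with hypothetical struct names `ProvinceShared` and `ProvinceChunk` standing in for TShared and TDestination: two types with identical layout are viewed through the same memory without copying.

```csharp
using System;
using System.Runtime.CompilerServices;

// Two hypothetical structs with identical fields but different type names.
public struct ProvinceShared { public int ProvinceId; public float TaxRate; }
public struct ProvinceChunk { public int ProvinceId; public float TaxRate; }

public static class Mirror
{
    // View the same bytes as another unmanaged type. The size check is a
    // minimal guard; it does not prove that the field layouts really match.
    public static ref TTo As<TFrom, TTo>(ref TFrom source)
        where TFrom : unmanaged
        where TTo : unmanaged
    {
        if (Unsafe.SizeOf<TFrom>() != Unsafe.SizeOf<TTo>())
            throw new InvalidOperationException("Struct sizes differ");
        return ref Unsafe.As<TFrom, TTo>(ref source);
    }
}
```

The returned reference aliases the source, so a write through one type is visible through the other.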
If I could access chunk components by dynamic type handles, this would definitely be an event based system.
Well, it is an event based system
it requires the existence of a singleton containing ProvinceChangedEnable as a flag, then runs two systems mirroring the shared onto the chunk component. Since this is to run maybe once a minute or more, it's not a buffer element event
I think I'm not ready to completely understand this haha. Anyways, I'll leave the chat now, thank you for your super kindness, wish you the best in your interesting-looking project! Probably I'll ask more next week or something, thanks!
Yep. DOTS is very deep and basically a new coding language (it's a different coding style for sure). Dont worry about not understanding it yet. Frankly, DOTS wont be a production capable way of coding in Unity for a few more years at least
My last question is... (sorry can't avoid it xD)
im here mostly all day, my code is garbo anyways
and I need to scrap and rewrite the entire thing
Do you think that DOTS is more complicated or hard to code than regular C# and Monobehaviours, or just different, and looks complicated because I'm just not familiarized with it?
nah i doubt that. some shipped games already use dots and while the way how you write code for dots most likely will change a few more times, migration to newer versions will get easier and easier
I would in fact say it's easier once you understand how it works. Definitely makes "doing" something to a game object incredibly simple to append to a current game loop since everything is forced to be compartmentalized (largely). Parallelization across multiple threads is also a "first option" rather than an optimization tacked on near the midpoint of a game's development cycle.
it's very different. and gives you a lot more granularity. but writing correct dots code will get you cleaner and more easily reusable code than OOP. it's harder to get into it when you come from oop but no. i wouldn't say it's harder than normal c#
@glacial hazel I think dots is way easier to code for than old monobehaviours, I dont think kornflaks makes it easy on himself but I cant go back to non dots workflows, Ive tried several times and dots just is much more simple and elegant imo. Might take time to adjust though
I mean, the "boilerplate" and overly verbose DOTS coding doesnt help since C# was designed as an OOP language
I mean, look at the code I need to write just to reset all of an entity's census component values to 0:
On the other hand, that operating over about 2.5 million entities is so fast, i had difficulties finding the actual operation time
0.01ms and even then it was a blip.
boilerplate code will be reduced even more as dots is developed btw
hopefully, a major feature Unity has promised is improved code gen to allow for C# macros.
what would a c# macro be? i know cpp macros but i doubt that's what you're referring to
But yea, people are hypothesizing on the forums that was the reason why 0.18/0.19 was scrapped and unity went dark.
Unity maps methods to C++ then codegens from there.
0.18 .... was not scrapped. they still had it internally.
Yea, public release of 0.18 was scrapped. They're on 0.20 internally
They're still coding, we're stuck on 0.17 with Entities.ForEach stuck codegen'ing to IJobChunk code.
they said they want to move loads of things to source generators. that's what i'm really looking forward to
Microsoft these days are focusing heavily on source generation integration. C#9.0 or something talked a lot about source generators
When I read the preliminary patch notes, I can smell Unity all over it
i've written a lot of source generators already. they are so awesome. and 2 days ago i realized they just dropped the 2.0 api for source generators.
I heard rumors that they got rid of IL weaving completely and moved to source generators, really excited for that
Yep. Will take a few years though for 9.0 improvements to trickle down to unity. Hence why DOTS wont be production ready for a few years, along with other issues.
wdym 9.0 improvements?
I had to manually rewrite all my Entities.ForEach into custom jobs because IL weaving is so abysmally slow
Nowadays I don't even touch Entities.ForEach anymore..
I think it's C#9, it's one of the newest C# versions. I dont have much experience with source gens but people on the forums were estimating that a lot of Unity's issues stem from the lack of comprehensive source gen integration into C# compared to something like C++.
And the newest C# just dropped those very same source gen improvements. Just my thoughts personally but I think unity may have greased some wheels over at the Gates headquarters to push through what they want in C#
c# doesn't do improvements to code really. that's .NET. c# 9 is already available for 2022 and .net 6 will also soon be available
that are source generators
No clue what those do and they wont be available for Unity programmers for years but yea. Maybe that was what Unity was waiting for?
but it's not really linked to the C# version actually
they are already available in unity 2021
I was pretty sure Unity was still partially implementing 7.0 features like inline and switch match case syntax into IL2CPP. Then again, nothing's stopping them from picking and choosing
I am 100% sure record structs are not allowed in unity and that came in.... 8.0?
il2cpp is......good but not related to dots or anything really
no, record structs just dropped now in C# 10 which only dropped a few days ago
did you mean record classes?
Correct me if Im wrong but Mono/IL2CPP is the fundamental coding language Unity is built on. It's not Microsoft's C#. Unity needs to take microsoft's C# then patch it into their own C# version that Unity uses.
and they are on it and are actively working to integrate CoreCLR into unity.
9.0, Record value typed components that contain pointers / references instead of pure values.
those are not structs btw. these are immutable classes
You can also create record types with mutable properties and fields:
Not a class, they can be unmanaged and burstable
Basically just C# implementation of pointers
no they literally get lowered down to a class
I swear they're value typed like structs and enums
wait a moment please
public abstract record
Fuck, you're right. They're a wrapper around a class
if you want record structs those exist in C# 10
but they too have nothing to do with pointers
Turbo Makes Games has some tutorials which aren't outdated
I thought they return pointers while also containing the backing value. Allowing for direct read-write operation on arrays of record type properties?
no
Well, not pointers. References.
Damn, I must be confusing them with the Span<> implementation then
what i am looking forward to in C# 10 is `with` support for normal structs
that will also be insanely awesome for dots
is this on the c# 10 spec page? Nvm found it in C# 9 lol
`with`? Isnt that just inline creation of a temporary local copy with properties changed?
`with` shouldn't be in c# 9
yes. then you don't have to make a temp copy of a component and then assign it again. you can just say `with`
I guess that can inline changing a property within a struct access through an array into one line. Yea
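A small sketch of what that looks like, assuming C# 10 or later (where `with` works on ordinary structs). `Census` fields and the `CensusStore` class are hypothetical; the store only hands out copies, the way a by-value component getter does.

```csharp
using System;

public struct Census
{
    public int Population;
    public int Employed;
    public float Inflation;
}

// Hypothetical store that only returns copies, like a by-value component getter.
public sealed class CensusStore
{
    private readonly Census[] _data;
    public CensusStore(int capacity) { _data = new Census[capacity]; }
    public Census Get(int entity) => _data[entity];
    public void Set(int entity, Census value) => _data[entity] = value;
}

public static class CensusOps
{
    // Older style: copy out, mutate the temporary, write it back.
    public static void SetInflationOld(CensusStore store, int entity, float inflation)
    {
        Census temp = store.Get(entity);
        temp.Inflation = inflation;
        store.Set(entity, temp);
    }

    // C# 10 style: `with` builds the modified copy in a single expression.
    public static void SetInflationWith(CensusStore store, int entity, float inflation)
    {
        store.Set(entity, store.Get(entity) with { Inflation = inflation });
    }
}
```

Both methods still copy the struct; `with` only removes the named temporary from the source code.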
i'm totally gonna use it loads and loads
I just reinterpret all my native arrays into ints or floats then jump indices to access single properties but that also works
Finally, got this thing to vectorize
why don't you inline the method anyway?
This is not vectorized, length from batch in chunk does not result in vectorization for some reason
Im reading the burst inspector for vectorization. If I inline it, it'll get lost inside the miles of assembly
ahhh. ok. so that's something you'll change for release?
Yea. Just for debugging here. I'll change it to AggressiveInlining when I'm happy with it.
I isolate the loop functions so I can see what it's compiling into using burst and it's a real simple NoInlining to check it
What the fuck, using batchInChunk.Count is vectorized now
XD
maybe it thinks it aliases with something?
yeah idk. was just thinking
Okay, now it's not vectorized. Now Im paranoid. I need to check all my other functions if they're still vectorized
Now it's not vectorized and using the native-array's length
Burst. Why you do this?
burst is fully deterministic. there must be something that burst can't reliably get enough information of
something i'm curious about is how dots will be moddable.......it seems pristine for it......yet i am not sure how i would inject new code into a compiled project
Burst docs has a section including modding
If Burst can support modding, DOTS can
i did not see any
how'd i miss that
I figured out the difference in vectorization / not vectorized. There is no vectorized set. Only vectorized math functions. *= 0 is equivalent to set 0... but that doesnt make sense. It's still being set to the ultimate value.
wait
Vectorized set
Burst. What the fuck
and now it's not vectorized anymore... identical code. Just refreshed the assets
hm, seems to have been added more recently-ish. haven't read too much into burst recently. so maybe that's why i missed it
it's been around since 1.5, released in uhhhhh late 2020? Last year ish. 1.6 definitely fleshed it out because I remember the section on modding support being fairly bare bones
i think the last i read burst in depth was with 1.5
Honestly at the level of effort for bursted mods, I might as well fork the codebase and modify the source code myself. If it's open source...
um, maybe try setting burst compilation to synchronous only. because otherwise i think the problem you face is a side effect.
It's basically merging burst outputs. Fairly dangerous on the security side. Literally no scripting limits
that's what i like though tbh
i hate modding apis that are like 'ohhhh, only visual scripting for youuuuu'
cough vrchat cough
heh. Ya gotta think how it's gonna affect the game publisher though. If someone releases basically a raw DLL asking for people to download to run in their game. That's asking for a virus.
that's not my concern as a gamedev though. cause i don't regulate this / support this and it's 3rd party. anyone who downloads mods is responsible for any damage themselves
Bonus points if you allow automatic mod syncing from the server
You can say that but when someone posts a screenshot of your game locked down by legitimate anti-viruses and they dont post context that they're using a mod, that reflects very badly
most likely not for dedicated servers
uhhhh how exactly? you can literally photoshop something like that and it has the exact same effect
You could but the average person wont. The average person will spam social media about how their computer has a virus (even from a mod) and their technological illiteracy will immediately assume that the game gave them a virus
there's nothing you can do about that anyway. you'll even have enough of these people that are just ranting because
and tbh. i can very much do without such people
The average person wont assume that the mod they're loading is a raw DLL. After all, nearly every other popular game they play uses Lua or some other scripting language to enforce a "safety" on the mods they download. So they'll assume the same.
99% of people will have no fucking clue about mod architecture anyway. they don't know what a dll is and they don't know what lua is
I mean, power to ya if you can somehow communicate the risk of the mods you load in your game as unprotected and raw compiled code but I just open source my game and ask whoever bothers to play to compile it themselves
that doesn't work for titles you wanna get some profit from
I dont need to know what a DLL or Lua scripting language is, I in fact dont. I do know that the former can give me a virus and the latter can not. That is what the average person will know.
and srsly. you can literally make a whole popup in your game about the risk of using mods which the user has to agree to so there's your risk communication
Ha, thats why Im an engineer and dont want to make programming a job. Imagine earning money doing this.
lua can also give you a virus if you know what you're doing
if the programmer of the game Lua is hooking onto is incompetent. There is no way for the programmer using a bursted DLL to enforce sandboxed safety
and there really shouldn't be
Well, that question is for lawyers, the legal team, and how much liability one assumes by providing a program that blindly trusts DLL extensions provided to it upon non-admin confirmation from a user.
wth. the user AGREES to use the dll because he has to separately download it, put it in the folder and start the game. and you have a whole section in your terms of service (AGB) and a huge clause at the start of the game where the user has to agree to know the risks of using mods
That seems reasonable, especially if you don't advertise or build additional tooling for it yourself. If you were doing some sort of mod syncing setup, I believe it would be a good idea to take more responsibility than that.
yes definitely. that's a whole different scenario
@robust scaffold did synchronous compilation reap more expected results?
I have no clue why but no
Hm strange
I added a blank new line, now it's vectorized
definitely a race condition somewhere
This reliably vectorizes, when an array operates on another
but not when one of them is constant
I think I figured it out. The burst inspector lies
The method stamps are not the same
Even if I reload unity a few times / refresh assets
Burst doesnt want to update
This is the proper stamp, with un-vectorized application of 0
ah, so it keeps a prev version in the inspector?
I actually dont know any more. I've long since renamed that method to see if renaming things update the inspector
it still has the nonvectorizedclear name
im gonna try and restart unity, see if that helps
hm, seems like a bug to me. best file a bug report on the forum
There ya go. Now it refreshed. It's just a text stamp. I dont think it matters
Yea. It's not supposed to be vectorized. the burst inspector just seemed to stall at the old version using text logic
Is that a bug? Yes. Worth reporting? Nah. Restarting unity thankfully isnt bad enough to be worth writing up a bug report
Alright, several restarts later to verify the method stamp remains constant. I think I got a reliable vectorization set value code
By instead initializing the value that the array is being set to outside the vectorized code section, it seems to now know reliably that the setting of the array can be vectorized
i think it's still worth to report so it gets fixed and doesn't stall other people
adding the in parameter to the parameter list changes the code drastically. I dont know how to read it though. Give me a sec to wait for unity to restart
psh, other people dont read the burst inspector and I'm sure the devs already know
i do use the burst inspector. a lot. and i think others do also. and if the devs know already, your bug report will be closed in no time
Using the in parameter changes the vmovdqu to vmovups
with int you copy the int to the method. with in you reference a readonly pointer to the int for the method
for int that does not really matter since an int is no larger than a pointer
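The two signatures being compared look roughly like this in plain C#. `CensusFill` and its parameter names are hypothetical, and the instruction-level differences discussed here come from Burst, so this sketch only shows the source-level difference, not the resulting assembly.

```csharp
using System;

public static class CensusFill
{
    // The value is copied into the method as an ordinary argument.
    public static void ByValue(Span<int> census, int resetValue)
    {
        for (int i = 0; i < census.Length; i++)
            census[i] = resetValue;
    }

    // The value is passed as a read-only reference to the caller's int.
    public static void ByIn(Span<int> census, in int resetValue)
    {
        for (int i = 0; i < census.Length; i++)
            census[i] = resetValue;
    }
}
```

In both versions the value originates outside the loop body, which is the pattern the conversation later settles on.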
dqu or ups? which one is faster?
i have no idea.
removing the in parameter, it has less code but also has that additional vmovd in front
i'd imagine dqu though
yeah, vmovd probably copies the int to the method since it's a mov
where are the docs? would be appreciated if you can link that^^ i'm interested
that's for intel, which my computer is using
AMD also has a version as well but it's not as nice. It's a pure text file
what, how does that make sense?
yea, and the loop goes around the dqu, so the small upfront cost of vmovd probably is insignificant... probably
actually yeah
fuck all people who say micro optimization is irrelevant. WE'RE DOING NANO OPTIMIZATION
yea, it's insignificant
yeah
assembly level optimization, hours upon hours of changing a single line and seeing how burst reacts
and this is setting entities to 0. Before vectorization, it already took 0.01ms singlethreaded
Even if I cut the time of operation in half, it'll still show up as 0.01ms as the profiler doesnt show nano-seconds
BUT PURPLE.
with 0 entities your code performs worse than ups, with 1 entity it performs equally, and with 2+ it performs better than ups
See, micro-optimization. Reducing runtime by microseconds. Nano-optimization, reducing runtime by nano-seconds (counting them CPU cycles)
Yep, if one wants a vectorized loop that sets an array, declare the value as a const outside the vectorized method. Do not directly set values within the vectorized function.
That's the ultimate vectorized code. Spent....3 hours figuring that out.
original code replaced reset value with 0.
damn, I wanted to ask half an hour ago if you tried ExpectVectorized
was that all it took now?
ExpectVectorized will throw an error
it cant detect much beyond simple addition of 2 arrays
and im working on a different one now
ah ok, i think expectvectorized does nothing else. total troll method lol
Ive replaced the inline conditional with a math.select and the constant values to the method parameter
If you havent read the conversation above, i dont blame ya, the key takeaway is that the burst inspector lies
Read the method call and check the parameter list
if it isnt updated to the list in the code itself, the burst output failed to update
Restart unity
haven't really followed it. oh sheet, that sucks.
and it'll fix itself
for example, that doesnt match the code image above
so I need to restart unity
and then check the burst inspector again
if you can make a small repo, post it in the forum and be famous like tertle!
nah, i'll accept my anonymity.
See, now it updates after a restart
lets see what changed...
absolutely nothing at all
well no, theres no vectorized commands before the loop like in the old version...
Alright, my nearing 3.5 hour journey into burst nano-optimization. Overall verdict: do not initialize any data within vectorized methods. Any and all data must originate outside the function. That includes constant defined variables.
seems like an oversight
so census[i] = 0 doesn't vec but census[i] = resetValue as parameter does vec?
interesting find
Wait really?!
100% certain. I've stared at this section of code for hours
vmovss is not a good command. Let me see if using a direct pointer helps instead of ref NativeReference
Nevermind, it's literally identical. Well fuck
Whats that command again to have the burst compiler complain if something isnt vectorized?
Loop.ExpectVectorized(). It doesnt work outside the most basic examples
What the fuck, now it's unrolled
And its using packed singles instead as well
I HAVE DONE NOTHING DIFFERENT EXCEPT RESTART UNITY AND IT UNROLLS ITSELF FOR ME????????
BURST, WHY
So are you sure that inlined 0 was the issue with your example above?
the reset census application of 0 to census[] array? 100%
inlining the 0 into the for loop will break vectorization
replacing it with a parameter whose value is set by a constant integer outside the function will result in a vectorized function
As a joke, could you XOR it with itself, and see how it likes that
as there is a vxorps
good idea, I was doing multiplied by 0 and seeing how it looks but that works. *= actually resulted in an occasional vectorized result but it was unreliable. Not compared to the so far 100% consistent vectorization that using a parameter variable is right now
I guess the issue with the hardcoded 0, is that there might not be a vmovps that takes an immediate value
hrm, i cant find the function.........
and with the const instead it becomes a memory load
Nope
vxorps exists, i know it does. Unity is doing vxorps against itself to set values to 0 in their actual code
but detecting it in my custom code doesnt work
could you split reading and writing census into 2 lines?
Like that?
no change in resulting burst
Yes that
no change in resulting burst
Which I guess is a weird way of writing 0 haha
I should really setup a test project for messing with this some time
wait
is that just straight up doing memset on your whole array?
removing the loop
Which would make sense
well there is no loop in there right?
yes
So it figured out you are setting the same value to the whole array
and instead uses memset to just write 0 into that entire memory block
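The same idea can be written down directly instead of hoping the compiler spots it. A hedged sketch in plain .NET, where `Span<T>.Clear()` zeroes the whole block in one call; `CensusReset` is a hypothetical name. In Unity the counterpart would, as far as I recall, be `UnsafeUtility.MemClear`, but treat that as an assumption and check the docs.

```csharp
using System;

public static class CensusReset
{
    // Element-by-element reset: relies on the compiler to recognize the pattern.
    public static void LoopReset(int[] census)
    {
        for (int i = 0; i < census.Length; i++)
            census[i] = 0;
    }

    // Bulk reset: asks for a block clear explicitly, no loop in user code.
    public static void BulkReset(int[] census) => census.AsSpan().Clear();
}
```

Both leave the array in the same state; the second just makes the "memset the whole block" intent explicit.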
vectorizing is not some magical improvement
well I just wasted 3 hours
hell, it might sometimes be SLOWER
god, I wish i knew assembly
some basic assembly knowledge is always nice to have
Well, I have none. I'm a nuclear engineer, not a comp sci
Ask me about neutrons, and I can do whatever ya need. Ask me about assembly, and I have no clue
The idea of "vector" code, is that you can do the same instruction on multiple things
using (one of) a specialized instruction set that the target CPU supports
basically any memX stuff you can do, memset, memcpy tends to be the fastest thing there is really for cpus
right, identify mem-X. Really should color it something special, like uhhhhhh red?
would be helpful yeah
My next task is this massive chunk, see what I can do to improve it. Now I doubt i can do anything mem-X related to it
doesn't look like it at a glance
now mad is not doing it in one line. It's cutting it up into vmulss vaddss vdivss
Is this vectorizing tho?
it's purple
Sure, but its not using packed singles anywhere
Well, it's doing so a bit further down I guess
at the top it moves all to xmm4 etc.
Again, it's all single scalar, purple or not
the purple just means it's an AVX instruction right?
yep
That doesn't mean it's better sadly
yeah, zero has a point
hrm, true
If you want, I can run you through the assembly you have there on a short voice chat?
Give you a bit more of an understanding of what it does, what it means
wait, voice? uh sorry no
All good. It's just too much to write here haha.
just take a glance through that. Other than the lack of singles packing, anything else I can do to that?
But in short, the "vectorized" stuff, are the assembly instructions that you see end in PS, not SS
alright, aim for maximum number of PS
PD is also fine (for doubles)
p stands for packed I assume
Yes
Also, this kind of optimization, is really only worth it in your hot loop, if the profiler says it's something that needs work
Though it's a fun puzzle for learning
I want to get as much using PS as possible so I can identify what patterns result in it
this of course doesnt, and I need to figure out what does
you could see it like this, you could take paper and draw 4 vertical columns in it, write the input at the top, and do the EXACT same step on every of those 4 inputs, until you get to the result, on all 4, at the same time. Then it can be vectorized
vmovps ymm0, .... this loads the packed single floats into ymm0 (a ymm register holds 8 floats, an xmm register holds 4)
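The "four columns on paper" picture can be sketched with `System.Numerics.Vector4`, which holds four floats and applies each operation to all four at once. `Lanes.MulAdd` is a hypothetical helper; whether the runtime actually emits packed (PS) instructions for it depends on the JIT and the hardware.

```csharp
using System;
using System.Numerics;

public static class Lanes
{
    // result[i] = a[i] * b[i] + c[i], processed four elements per step.
    public static void MulAdd(float[] a, float[] b, float[] c, float[] result)
    {
        int i = 0;

        // Main loop: load four inputs per array, do the same steps on all four.
        for (; i + 4 <= a.Length; i += 4)
        {
            var va = new Vector4(a[i], a[i + 1], a[i + 2], a[i + 3]);
            var vb = new Vector4(b[i], b[i + 1], b[i + 2], b[i + 3]);
            var vc = new Vector4(c[i], c[i + 1], c[i + 2], c[i + 3]);
            (va * vb + vc).CopyTo(result, i);
        }

        // Scalar tail for the leftover elements that do not fill a group of four.
        for (; i < a.Length; i++)
            result[i] = a[i] * b[i] + c[i];
    }
}
```

The scalar tail is the part that corresponds to the SS instructions; the main loop is the part that can map to PS instructions.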
This is a loop within the loop over inflation.length right?
you mean that it works with the loop being 1, but not with it being 2
give me a sec, burst inspector broke again
Thats what it looks like with values > 1.
no more v_ps
and i know there's a packed (ps) version of the multiply-add
thats replacing the scalar (ss) one
Pretty sure it's unrolling that loop when it's 1
well, it's not a loop at 1
though I agree, it should be able to make it into a mad thats done twice with the loop. dunno why not
what happens if you literally just copy the line twice, instead of the loop
now it identifies the v_ps version of madd
kek
burst 1.7.0
Change the optimization pipeline to run the loop unroller exclusively after the loop vectorizer. This improves codegen in a lot of cases (mostly because the SLP vectorizer is unable to vectorize all the code that the loop unroller could have).
I guess this is screwing this specific example over
time to downgrade
Would be interesting to see if this behaviour "fixes" itself in an older version
Because it feels like a bug
However, I've bashed my head against stupid bugs enough today. Time for sleep! Lemme know how the downgrade went
np, time to keep hammering away at this
well manually typing it out does retain vps
1.6.2, produces the unrolled variant properly
lol my charactercontroller job burst output is 80k lines, i dont know how one would find the time to optimize this
part by part
yep, im going line by line
and 1.7.0 inspector is very buggy, 1.6.2 is good
found the limit
it's actually 99 loops
nooo, i cant outsmart the unroller
Looks like I need to manually type out the v_ps functions. Fun
Assuming direct control over the assembly:
seems like the compiler can identify what patterns result in MAD float operations. Huh. Maybe manually coding out vectorized code is the way forward
This program can now only run on windows based intel computers within the last 3 years. Support for other computers will require payment
my test has some shocking results honestly
spellstats1 profile marker is nearly double the time of spellstats2
yea, the lookup mem-copies every single struct you call to a "local" variable they then hand to the job
the exposed returns a pointer you directly access
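A hedged illustration of the difference in plain C#, not the actual ComponentDataFromEntity code. `SpellStats` fields and the `StatsLookup` class are hypothetical: one accessor returns a copy of the struct, the other returns a reference into the backing array.

```csharp
using System;

public struct SpellStats
{
    public float Damage;
    public float Cooldown;
    public float Range;
    public float Cost;
}

// Hypothetical lookup with two access styles over the same backing array.
public sealed class StatsLookup
{
    private readonly SpellStats[] _data;
    public StatsLookup(SpellStats[] data) { _data = data; }

    // Copying accessor: the whole struct is copied out on every call.
    public SpellStats GetCopy(int index) => _data[index];

    // Reference accessor: no copy, reads and writes go straight to the array.
    public ref SpellStats GetRef(int index) => ref _data[index];
}
```

Writes to a value obtained from `GetCopy` never reach the array, while writes through `GetRef` do. That is both the cost (a copy per access) and the safety property (no accidental writes) being argued about here.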
unity has made some terrible design with this. wow
I have to change so much code
but also quite happy that i can improve a LOT
it's to enforce ECS style upon entities. If you have a direct reference, you can place methods on components and have them get called on their properties. You can even have inheritance in them with direct reference
also god damn, it is surprisingly really hard to manually vectorize something
Not sure I can follow what you mean with ECS style. I want to read fast, CDFE can't read fast when it copies. That's terrible design and I see no safety reasons, or other excuse really
Components are never supposed to have methods on them that write to the component. Reading is fine but writing is not. That goes against "ECS design". Returning a copy of the struct is what prevents write methods on the struct from working.
Why is writing not fine?
unity can not monitor what happens when a reference is returned from an entity. So they cant enforce thread safety. Why they didnt give us another option I have no clue.
I get it when it would skip chunk version increments but you can get a RW pointer just fine, it's what they do anyway
They can't ensure that anyway
I mean, they prevent it pretty strictly unless you put the Disable tag on it. Then it's do what you want, even race conditions
Yea, i dont know why they didnt ship an option like exposed CDFE's get reference or something.
with their design they just prevent using Interlocked easily
I saw it myself, such issues are handled really well with it. No reason to circle around with writing, reading and iterating on endless data
I give up manually vectorizing my code
I cant get that to work. Well the automatic definitely does but not the other manual additions
the compile works fine but not all entities are added, some are skipped, some are double added
I think you have the greatest usage of burst and vectorization with that project. Hardly any game code looks like this. It's financial software if I remember correctly, right?
yea, simulating it
that's really unique for unity. I think Burst has never been tested that much lol
i just wanna get my toes wet manually vectorizing something so when I actually need it, like in the very long and twisting calculations, I can just take over from burst and vectorize it myself
Interested what the forum guys have to say about my find. So many use CDFE and so many fuck their performance with it
oh boy, the drama
but I got my shitty manually vectorized thing to work at least. Packing consists of 8 floats, not 4.
ah cool!
lets see about that performance...
my 2 big systems rely quite heavily on CDFE. Gonna rewrite all this tomorrow and see how much it'll improve. But now I'm off to bed. Have a good night o/
Yep, good night
I've definitely found cdfe to be horribly slow, so that wouldn't surprise me. Are you sure you have safety checks/leak detection disabled as well, just to be sure?
What's dots lol, I just saw this image and it scared me
dyor
https://forum.unity.com/threads/dots-releases-latest-release-dots-0-17.1044523/
Should we be sticking with 2020.3.9f1 rather than any of the newer subversions of 2020.3?
And should I stick to these versions of eg burst, collections etc listed in this thread? I'm trying to track down the cause of a crash and I want to rule out having the wrong versions of any of the relevant packages!
dots devs say you should however dots is workable(minor issues exist) even on newer unity versions(2021.2 and even 2022.1) as you can read from this thread (btw i think dots is working well on the latest versions of 2020.3): https://forum.unity.com/threads/dots-and-2021-2-known-issues.1183456/
Don't use .ForEach and use the IForEntityBatch(WithIndex) instead. That way you don't have to use CDFEs
I keep updating to the latest 2020 and it hasnt been an issue so far
You have to use CDFE if you need random data lookups. Even in EntityBatch jobs.
Great answer from Joachim. Looking forward what they come up with. Really glad they are acknowledging the problem. It cropped up a few times in the forum.
Later on I'll come around in changing all CDFEs to read only pointers. Really excited where it'll end up.
I already have a tight barrier around my code to not have any structural changes in there. Personally I'm completely on the safe side
I have a value from 0 to 7. I need to generate a 32 byte (or 8 int) wide mask from it efficiently.
Chunky
Please note that the source hints you see in there, dont always exactly line up with what the assembly is for
works without burst, doesnt work when burst is on...
i dont know if that's really worth it... I couldnt get rid of the for loop anyways
What is this for, and what is it supposed to do?
overly elaborate adding 1 to every value in the array
that's the remainder factor for the 1 - 7 values
When I need to manually vectorize something, i'll need to figure this out
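For reference, a minimal sketch of one way to get that 8-lane mask from a count of 0 to 7. This is plain .NET (System.Runtime.Intrinsics), not the Burst code from the screenshots, and TailMask is a made-up name. The same compare should map onto Burst's mm256_cmpgt_epi32, but that part is an assumption and untested here.

```cs
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class TailMask
{
    // Scalar reference: lane i is all ones (-1) when i < count, otherwise 0.
    public static int[] BuildScalar(int count)
    {
        if (count < 0 || count > 8)
            throw new ArgumentOutOfRangeException(nameof(count));

        var mask = new int[8];
        for (int lane = 0; lane < 8; lane++)
            mask[lane] = lane < count ? -1 : 0;
        return mask;
    }

    // Branch-free variant: broadcast count and compare it against the lane indices 0..7.
    // The compare writes all ones into every lane where count > laneIndex.
    // Only call this when Avx2.IsSupported is true.
    public static Vector256<int> BuildAvx2(int count)
    {
        Vector256<int> laneIndex = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
        return Avx2.CompareGreaterThan(Vector256.Create(count), laneIndex);
    }
}
```

The mask can then feed a masked load/store, so the leftover elements go through the same vector code as the full groups.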
alright, I think I broke burst. 1.7.0 is really unstable... well no shit
I dont think you understand what mm256_add_epi32 does
no?
Adding 2 values lengthwise?
And I really recommend against using these intrinsics without a very in depth understanding
it cant be that hard to understand, i mean it works without burst
I mean, is that really that hard to understand?
It's resulting in the same exact code (well not unrolled), what's wrong with using add_epi32?
btw beware that default(v256) is old syntax. new code should be written with just v256 mask = default
I'm just following the syntax used inside Burst source code. Should I be using the latter?
yeah. it produces the same output so it doesn't change anything but lots of people who worked with c# for years are slow in catching up with new syntax
alright, good to know
imma try and eliminate the for loop with macro bitshifting
the main reason why I'm doing this very elaborate removal of a for loop is that the code I actually intend to type out the assembly for won't be as simple as addition. It'll be located in a macro function taking only v256 indices.
c# is adopting more and more target typing anyways, as is recommended in the roslyn guidelines
Should I stop using var and instead strongly type my variables?
that of course depends on the developer. for the compiler it's the same either way. i personally never use var, i just wanna know what type to expect without having to rely on an ide. and most often you're just shifting the type from the left side to the right side anyway. the roslyn guidelines specify exactly when to use var:
use var when the type is immediately visible, e.g. var name = "Mind" or var person = new Person()
do NOT use var in cases like var name = new Person().Name or similar
use var in for loops
do NOT use var in foreach loops
i personally see even less reason to use var now because var person = new Person() can now easily be written as Person person = new()
"do NOT use var in foreach loops" - explain pls
True. I use an IDE that posts the type in a small icon next to the variable so it's never confusing but I can see why if someone has to comb through my code in github or something
that is stated in the roslyn guidelines and it makes a lot of sense. you might have elements that are Person but you want to explicitly work on IPerson both can be done with foreach but only if you explicitly specify WHAT you want.
yeah rider does something like this too. and apparently even visual studio has the option to too. i think it clutters up my ide experience and what's the need if i can just write it out anyway. code is easily readable with AND without ide
The clutter does get annoying at times, like defining archtypes.
yeah i personally dislike that
nearly doubles the length of the actual code
exactly
I adopted var with ECS, too much clutter yeah
And hours later, I've successfully developed a "generic" operation to properly handle a variable sized input array and allow for manual vectorization
All of that is identical to this:
Well, slower by about 20% from just glancing through the profiler
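Since the screenshots aren't in the log, here is a rough sketch of the general shape such an operation usually takes: full groups of 8 first, then an element-wise loop for the 0 to 7 leftovers. It's plain scalar C# standing in for the intrinsics, and BlockedAdd is a hypothetical name, not the code that was posted.

```cs
public static class BlockedAdd
{
    public const int Lanes = 8; // ints per 256-bit register

    // Adds 1 to every element: full groups of 8 first, then the leftover tail.
    public static void AddOne(int[] values)
    {
        int fullEnd = values.Length - values.Length % Lanes;

        // Main body: each outer iteration covers one register's worth of data.
        // In a manually vectorized version this inner loop is a single packed add.
        for (int i = 0; i < fullEnd; i += Lanes)
            for (int lane = 0; lane < Lanes; lane++)
                values[i + lane] += 1;

        // Tail: plain element-wise loop over the remaining 0-7 values.
        for (int i = fullEnd; i < values.Length; i++)
            values[i] += 1;
    }
}
```

With 19 elements, the main body handles indices 0 to 15 and the tail handles 16 to 18.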
I now understand why everything should be packed into 32 byte wide structs. Makes life so much easier.
ya know, that is an option. Literally pad your structs to be 32 bytes wide
of course, ya need to fill it with something because thats room for 8 floats and if ya need only 1 float, thats 7 floats of wasted memory. Or 3 doubles
or use [StructLayout(LayoutKind.Sequential, Size = 64)]
64's too large, 32 is actually the width of v256
you want it to be implicitly convertible to v256 without size difference
Well yea, that works. Or explicit. Still, that's 28 bytes of wasted space if you stick with floats
isn't that kind of a given with padding?
just something to keep in mind when designing components. Either keep them single value only or expand to fill 32 bytes.
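A small sketch of the padding idea, assuming a component holding a single float. PaddedValue and PaddingInfo are made-up names, and this uses plain .NET interop rather than anything Unity specific.

```cs
using System.Runtime.InteropServices;

// Size = 32 forces the struct to occupy 32 bytes, the width of one 256-bit register.
[StructLayout(LayoutKind.Sequential, Size = 32)]
public struct PaddedValue
{
    public float Value; // 4 bytes of real data, the rest is padding
}

public static class PaddingInfo
{
    // Total size of the padded struct in bytes.
    public static int SizeOfPadded() => Marshal.SizeOf<PaddedValue>();

    // Bytes that carry no data when only one float is stored.
    public static int WastedBytes() => SizeOfPadded() - sizeof(float);
}
```

SizeOfPadded() gives 32 and WastedBytes() gives 28, which is the 7 floats of wasted memory mentioned above.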
when ya think about it, padding 28 bytes onto a float is completely worthless, turning the vectorized function into literally single operation.
I was thinking about maybe merging 4 entities into 1... somehow
yeah, doesn't help I think
4 or 8 entities
all of this wouldn't be a problem if I could somehow force chunk sizes to contain multiples of 4 or 8 entities, so the arrays produced from them are aligned with no remainder when converted to v256.
hrm, thats an idea
the chunk layout could certainly be designed for that. how would the last chunk be handled by vectorization when there are only 2 entities left?
and I see you have lots of iterations, what's your general count?
the additional padding of more entities to result in evenly divisible by 4 (or 8) entities
so you would ensure even the last chunk has a multiple of 4 or 8?
those entities will be tagged with PaddingEntity component or something so the values within them dont actually result in changes to gameplay
Each chunk is 16KB, that's inherently divisible by 8. By organizing the components so that every full chunk holds a multiple of 8 entities, and keeping the overall entity count a multiple of 8, we can assume that the last partial chunk also holds a multiple of 8 entities.
im trying to logic out the reasoning right now but that's the gist of it. And seeing if there's any hint to dynamically assigned chunk sizes in the training code that may fuck with it
sure, if you can make sure the entity count fits. or do you mean that by padding? essentially creating useless entities just to keep the multiples up
yep
could be a pain to handle but yeah, I think the logic is sound
dummy entities of the exact same archetype except with a component containing a bool determining if it's a dummy padding entity or not
there's a way to change the chunk size. I've tried that once but it didn't do anything so /shrug
because the archetype must be identical between legit entities and not, there cant be a zero sized padding tag since that changes archetype
yeah right, that would break it
the main problem with my code right now is that I use a loooooooot of shared component data. Well, not a lot, just one. But it fractures my chunks into about 3000 possibilities.
So the maximum padding entity count is 3000 x 7 (where 1 legit entity exists per shared component possibility). Thats 21,000 entities wasted.
in the grand scheme of things, 21,000 entities out of 2,500,000 entities is barely a blip...
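The arithmetic behind those numbers, as a quick sketch. PaddingMath is a hypothetical helper, and the 3000 groups, 8 lanes and 2.5M entities are the figures from the messages above.

```cs
public static class PaddingMath
{
    // Dummy entities needed to round count up to the next multiple of lanes.
    public static int PaddingFor(int count, int lanes) => (lanes - count % lanes) % lanes;

    // Worst case: every shared-component group sits one entity past a multiple,
    // so each group needs lanes - 1 dummies.
    public static long WorstCasePadding(long groups, int lanes) => groups * (lanes - 1);

    // Share of the total entity count that the padding represents.
    public static double Overhead(long padding, long totalEntities) =>
        (double)padding / totalEntities;
}
```

WorstCasePadding(3000, 8) is 21,000, and Overhead(21000, 2500000) is 0.0084, so under 1% extra entities in the worst case.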
I will need to profile extensively the performance comparison between the tail implementation I screenshotted above and using padding entities.
that's not good. can't you bring the data into something else? like a nhm
and I see you have lots of iterations, what's your general count? - is your usual iteration count 2.5M?
if I were to instead transfer the data into a NHM, I would require twice the memory now. One storing the original entities and now one massive collection containing the copy and an additional <7 indices to make it divisible by 8. That's unworkable.
Each one of my systems are operating on all 2.5M entities yes.
that's quite bonkers. and you have less than 0.1ms for it?
the operation above? No, it's ~6 ms.
just adding 1 to a component of all 2.5M entities takes 6 ms
ok, kind of relieved. I think I would've deleted my code otherwise haha
Actually no, I'm off by an order of magnitude
it's 0.6ms
i didnt turn on burst
really fast, and you're also writing that amount back
thats with manual vectorization
so with normal code Burst couldn't figure it out?
What? Of course burst can. It's += 1 every index
what do you mean with manual vectorization then?
I just need a really simple case where I can compare the burst outputs so everything is aligned as expected
I type out the assembly in the code
ah i see
is identical to
except there's about 12 more lines to support the manual assembly version compared to burst automatic really simple code
that's what the screenshot above was doing
actually, manual vectorization is about 1ms faster (in total frame accumulated time) than burst's automatic implementation. I think it's the various safety checks required
While manual vectorization is accessing the data directly
Yea the one you shared. I got it
```cs
using Unity.Collections.LowLevel.Unsafe;
using Unity.Entities;

public static class ComponentDataFromEntityExtensions
{
    // Returns a direct reference to the component instead of the copy the CDFE indexer hands back.
    public static unsafe ref T GetAsRef<T>(this ComponentDataFromEntity<T> componentDataFromEntity,
        Entity entity) where T : struct, IComponentData
    {
        // Reinterpret the CDFE as the layout-identical struct whose fields are public.
        var entityPrivate =
            (ExposedComponentDataFromEntity<T>*) UnsafeUtility.AddressOf(ref componentDataFromEntity);
#if ENABLE_UNITY_COLLECTIONS_CHECKS
        AtomicSafetyHandle.CheckWriteAndThrow(entityPrivate->m_Safety);
#endif
        entityPrivate->m_EntityComponentStore->AssertEntityHasComponent(entity, entityPrivate->m_TypeIndex);
        entityPrivate->CheckComponentIsZeroSized();
        // Read-write pointer to the component data inside the chunk.
        void* ptr = entityPrivate->m_EntityComponentStore->GetComponentDataWithTypeRW(entity,
            entityPrivate->m_TypeIndex, entityPrivate->m_GlobalSystemVersion,
            ref entityPrivate->m_Cache);
        return ref UnsafeUtility.AsRef<T>(ptr);
    }
}
```
ah, that's the one from the other guy (forgot his name) are you using that in your main code base?
Yep. works well without needing to import another package
Unity should really just expose it, it's literally a single line removed from the normal CDFE...
Also hrm, I disabled the safety checks in burst and restarted unity to clean the burst cache, yet it's still about maybe 0.5ms slower than manually vectorized. Even with my absolute shit tail method call.
where do you have the struct ExposedComponentDataFromEntity code? it's bugging me about protection levels
```cs
using System;
using System.Diagnostics;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Entities;

// Mirrors the field layout of ComponentDataFromEntity<T>, with everything made public.
internal struct ExposedComponentDataFromEntity<T> where T : struct, IComponentData
{
#if ENABLE_UNITY_COLLECTIONS_CHECKS
    public readonly AtomicSafetyHandle m_Safety;
#endif
    [NativeDisableUnsafePtrRestriction] public readonly unsafe EntityComponentStore* m_EntityComponentStore;
    public readonly int m_TypeIndex;
    public readonly uint m_GlobalSystemVersion;
#if ENABLE_UNITY_COLLECTIONS_CHECKS
    public readonly bool m_IsZeroSized; // cache of whether T is zero-sized
#endif
    public LookupCache m_Cache;

    [Conditional("ENABLE_UNITY_COLLECTIONS_CHECKS")]
    public void CheckComponentIsZeroSized()
    {
#if ENABLE_UNITY_COLLECTIONS_CHECKS
        if (m_IsZeroSized)
            throw new ArgumentException(
                $"ComponentDataFromEntity<{typeof(T)}> indexer can not index the component because it is zero sized, you can use Exists instead.");
#endif
    }
}
```
basically CDFE but with everything public
You need to add these 3 lines in a separate .asmref file located in the folder those functions are in:
```json
{
    "reference": "Unity.Entities"
}
```
that bypasses the "internal" function protection nonsense
ahh, that's it. thanks!
Np, i am starving. Im gonna go find some food then maybe actually set up Unity Performance Tests instead of just looking at the profiler and eyeballing it
enjoy, don't code while starving! I've worked with the unity perf test and it's wonky from timings
hope you have more stable results with it
I might just use the profiler markers - recorder built in methods to just output data into the debug logger.
Does performance test hook onto profiler markers?
no it doesnt, alright I'll roll my own performance testing solution. Profile markers work in burst anyways and Im pretty sure Unity's performance test cant.
if you profile the editor they will show up
but I'd roll with your own, let it really run, much more stable
performance: 2 > 3 > 1
shame ref is a little slower than the pointer. it's so much easier to write and read
the ToSpellStats extension seems to make no difference if I use "this ref" or not. Which would mean if the struct is already a ref there's no by value parameter involved. good that it works like this, wouldn't have expected it though
those profiler markers are really unreliable
or the cpu/worker threads are
i have 4 ways now and from measuring it, I honestly can't tell which one is the fastest
well, the first is always the slowest but the other ones are fluctuating like crazy
i have direct pointer access, casting the pointer with ref instead of UnsafeUtility and one with UnsafeUtility.AsRef
I should write a test for it. this seems pointless
or I should stop wasting time because it's clear anyway which is fastest. the one most annoying to write -.-
You're not supposed to get the RO pointer, that's read only
But yea. Getting and modifying the pointer directly skips a lot of safety checks
I only read from it, in the process a new struct is created
Ah. Hrm. Interesting
the SpellStats struct
Mark spellstats as volatile. Just a prefix. Dont know of itll change anything
I'd need to make an isolated test to really see what's going on with regards to safety checks, etc...
spellstats is a local to the thread. but good point, I need to read into volatile and how it could be useful to me in some contexts
Volatile will just prevent the compiler from optimizing away previous actions. For example, you're doing nothing but writing to spellstats
The compiler may be deleting actions by 1 and 2 that may skew the results
ok, I see, that may help indeed
Local variables cannot be declared volatile
well, let's see if it makes a difference as property
hm, doesn't really work in that context.
can't really set it as a private property for the job, i'll try anyway
i remember not being able to set anything private in jobs but let's see
a volatile field can't be of the type "spellstats". lol screw this
ha, guess RIP
have you ever worked with ants profiler or something? where you can profile and see the timing of every single line of code. that's so fucking useful, never got it to work for unity and for burst I didn't even try anymore. that would be so cool to have
I'm honestly a little lost where the cpu time is going
ants profiler?
really popular for c# development, app or web
lol, don't remember how much it was back then when I used it. Not much money for a company but for an individual, yeah. I also hate the subscriptions that are so common now -.-
you don't own anything anymore, kind of sucks
tight! visual studio can do the same, let's see how much I can get out of it
just the execute isn't that useful
ok, VS can profile all my code. holy shit, also with burst
not quite sure how to interpret this
No clue...
```cs
protected override void OnUpdate()
{
    if (_recorder.enabled)
    {
        // Store this frame's sample until the buffer is full.
        _times[_index++] = _recorder.elapsedNanoseconds;
        if (_index < _times.Length)
            return;
        _recorder.enabled = false;
        var mean = _times.Average();
        // Sample standard deviation: divide by n - 1 (99 for the 100 recorded frames).
        var standardDev = math.sqrt(1d / (_times.Length - 1) *
                                    _times.Select(time => time - mean).Select(innerPow => innerPow * innerPow)
                                        .Sum());
        // Nanoseconds to milliseconds, rounded to three decimals.
        Debug.Log($"[Results] Mean: {math.round(mean / 1e3d) / 1e3d}ms." +
                  $" SD: {math.round(standardDev / 1e3d) / 1e3d}ms.");
    }
    // Space (re)starts a recording run.
    if (!Input.GetKey(KeyCode.Space))
        return;
    _index = 0;
    _recorder.enabled = true;
}
```
Just hammered out a performance "recorder". Press spacebar to output general statistics of the area of code recorded by the profiler marker
it records 100 frames then logs the statistics to the debug log
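The statistics part on its own, as a standalone sketch without the Unity recorder. SampleStats is a made-up name and the samples are assumed to be nanosecond timings stored as long, which the log doesn't actually confirm.

```cs
using System;
using System.Linq;

public static class SampleStats
{
    // Arithmetic mean of the recorded samples.
    public static double Mean(long[] samples) => samples.Average();

    // Sample standard deviation: squared deviations from the mean, divided by n - 1.
    public static double SampleStdDev(long[] samples)
    {
        if (samples.Length < 2)
            throw new ArgumentException("Need at least two samples.", nameof(samples));

        double mean = samples.Average();
        double sumOfSquares = samples.Sum(s => (s - mean) * (s - mean));
        return Math.Sqrt(sumOfSquares / (samples.Length - 1));
    }
}
```

Dividing by n - 1 instead of a hardcoded 99 keeps the result correct if the sample count ever changes from 100.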
i dont know if the numbers mean what ya think it means
because why would setting a local variable take so much numbers / time?
Also thats what the output looks like
it's just setting values from a struct to a local variable. no idea why it's so high
VS calls the value cpu unit
maybe similar to tick?
it's reliable but this part sticks out
it shouldnt take that long though
did you run the game in order to get those values?
attached to the editor
well, all numbers seem kind of expected. the only thing that sticks out is the var source I sent before, that one I don't get
Also that's automatic burst vectorization's performance
nearly 8 ms difference, why do I not believe that?
manual yea
Thats what the profiler says, 8ms is even more than the total runtime
markers dont take any overhead, well any more than when the profiler is looking at it
not true, my frames take 5 times longer with markers
interesting, I have very different results with sampling. maybe i'm a bit excessive, 13 x 250k
yea, you really only need what, 100 frames? 500 at most. A good sample size. Not 250k....
depends on what you want to measure. i'm measuring the different code paths inside the method
ah, im just measuring addition so its not much
k, seems negligible then
I once had a bad experience with declaring struct fields inside of a struct of the same type (with the intention of making some kind of node structure).
Maybe you can already figure this did not go very well.
However, nesting NativeCollections inside of other NativeCollections should not produce any problems, right? (E.g. NativeArray<NativeArray<int>>)
nope. nesting bad
Oh?
i think it's the pointer to a pointer problem
Generally you want to linearize your data to maintain cache coherency. Burst will optimize by just adding to the pointer the type size. By nesting, you prevent that optimization.
which means non-linear memory layout
Use case:
NativeHashMap<Entity, NativeArray<VertexData>> to associate large amount of vertex data with a particular Entity
Use a NativeMultiHashMap
use nativemultihashmap
Hmm...
Also if VertexData for a particular Entity is swapped in and out frequently?
But if you require that the order of data remain constant (NMHM does not guarantee order when set in parallel), use NativeHashMap<Entity, UnsafeList<VertexData>>
You can not nest Native-X containers. They're structs, but they carry managed safety check fields (like DisposeSentinel), so they aren't blittable.
VertexData elements for a particular Entity should appear in sequence, but nothing more
Does the vertex data change? Like a single value in that list change?
I vaguely recall encountering an issue when attempting to put a NativeArray inside an IComponentData. However, the IComponentData will do nothing more than hold a pointer/reference to the array, right?
Do you know the vertex data at load time and never expect to change it? Maybe swap for a different vertex set?
You can not put class / managed types in IComponentData. NativeArray counts as one because of its managed safety check fields.
the vertex data being swapped in and out frequently is a problem. can you work around it?
Ahh that was probably the problem then
Vertex data is not meant to be placed in native containers. Far too large
If vertex data does not change often (maybe once every few seconds at least), bake in vertex data into Blob Assets.
BlobAssets are basically read only massive native containers that you can also nest things with. And they're intended to contain nested arrays
Unfortunately, they're also just arrays of blittable types. So no strings (unless you code it into char[]) and no pointers.
And they were designed for vertex and mesh data in mind. Something that should change very rarely because they're very expensive to set up.
The problem is not necessarily one of performance, but one of ease-of-maintainance. Right now I have IComponentData that acts as a key/index to a range of VertexData within a NativeArray shared by multiple mesh Entities.
The reason for this shared NativeArray: My mesh calculation jobs operate on a batch of meshes, as its wasteful to do 1000 jobs for meshes that consist of just a few triangles.
However I am not very pleased with how complex everything has become. So I wish to simplify/streamline it as much as possible. Which is why I'm exploring options such as NativeHashMap<Entity, NativeArray<VertexData>> and such.
Hrm, I really dont have much experience with mesh processing. But if you're gonna stick an array of vertex data into a hash map with Entity as key, just attach a component with an UnsafeList<VertexData>() to the component instead.
that will mean a unique set of VertexData for every entity though unless you start fragmenting your chunks
I'll look into the UnsafeList. So far I have not used it before
UnsafeList is basically NativeList with all its safety checks stripped out
These jobs can last multiple frames so I'm not that concerned about how the chunks will behave and such
So just faster? Or can I do something extra with them?
Ah, if they last multiple frames, you can not attach Unsafe directly to a component
Oh wait so unlike NativeList, UnsafeList can be put in an IComponentData?
Jobs involving entity data can not last longer than a single frame. So yea, the only way forward is NativeHashMap<Entity, UnsafeList<VertexData>>
yes. UnsafeList is basically a raw pointer
I see
Hmm, it would seem so yes
NativeList / NativeArray are also raw pointers to buffers but with a little bit special features
Burst is basically built around operating on NativeArrays so other containers are basically second class. They could work but dont optimize well. I dont believe they can even autovectorize.
Which is what I'm largely discovering with my tests on dynamic buffers. A bit of an issue
DynamicBuffers have autovectorization issues is what youre saying?
DynamicBuffers, as far as I can tell, are like integrated NativeLists. Raw pointers to storage either inside the chunk or outside of it on the heap. Getting something like just adding 1 to every value in a dynamic buffer to vectorize just does not work
We can get a pointer to the actual buffer using .AsNativeArray().GetUnsafePointer() so manual vectorization should work...
But if I could do it, Burst should be able to as well.
Hmm
Alright, the profiler marker can measure singlethreaded very well. Why is multithreaded so high?
Well that's just clear proof, auto-vectorized is garbage for singlethreaded job. Or I forgot to turn off safety checks
Yes. I did forget to turn off safety checks. Yea, that checks out with my understanding.
Automatic vectorization
Manual vectorization, what the fuck
it's faster than adding one to every index????
oh wow, by a lot
It's adding every value a random number, doesnt really matter what it is since it'll constantly overflow but still
weren't you saying you used the same operations that burst made?
yea, except Automatic isnt vectorized at all. Burst didnt seem to understand the NextInt command applied to all values
ah
im very suspicious of the manually vectorized results anyways. Somehow faster than += 1 to every index? No fucking way
well i mean manual vectorization is nice, but with that you lose platform dependent compilation, don't you?
And yes, i've turn off leak detection and safety checks for both
True, unless I wanted to write out commands for every platform
Only intel windows 10 computers from the last 3 years can run this program, sorry
so i mean i am glad that you are pushing the boundaries but is it really worth the effort in the long run? starting to doubt it
i am all about performance, but if it breaks on any future machine without rewriting pretty much everything.......
5.7 ms down to 1.6 ms (maybe) just by spending a few hours typing out the assembly yourself? That's big
and it's not even the assembly as burst handles a lot of the pain as well even using manual commands
You can use a switch statement to fall back on default burst assembly if an intel processor isnt detected
but burst is aot
The burst processor detector is compile time constant so all the extra code is stripped
Burst compiles it all then passes the relevant data based on whatever starts up the program
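A sketch of that fallback pattern using plain .NET intrinsics, since the Burst version isn't in the log. Avx2.IsSupported plays the role that Burst's IsAvx2Supported check plays there, and AddOneDispatch is a hypothetical name. The JIT treats the check as a constant, so the unused branch shouldn't cost anything at runtime.

```cs
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class AddOneDispatch
{
    // Adds 1 to every element, using AVX2 where available and scalar code otherwise.
    public static void AddOne(int[] values)
    {
        int done = 0;

        if (Avx2.IsSupported)
        {
            // Reinterpret the leading multiple-of-8 part of the array as 256-bit vectors.
            int fullEnd = values.Length - values.Length % 8;
            Span<Vector256<int>> blocks =
                MemoryMarshal.Cast<int, Vector256<int>>(values.AsSpan(0, fullEnd));

            Vector256<int> one = Vector256.Create(1);
            for (int i = 0; i < blocks.Length; i++)
                blocks[i] = Avx2.Add(blocks[i], one); // packed 32-bit add (vpaddd)

            done = fullEnd;
        }

        // Scalar path: the whole array without AVX2, or just the 0-7 element tail with it.
        for (int i = done; i < values.Length; i++)
            values[i] += 1;
    }
}
```

Both paths produce the same result, so a machine without AVX2 only loses the speedup.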
hmmm. i mean i see that it definitely can make a big difference, but i don't see bigger studios (or even small ones) writing burstable code with manual vectorization
And honestly all you need to do is support AVX2 which combines all the sub-instructions:
There's also Neon but psh, who has an AMD computer in 2014 anyways
who has an amd pc in 2021 XD
"Those taiwanese chip developers are dead and will never develop a viable chip anyways, hahahaha" - Intel Execs circa 201X.
i'm definitely still intrigued though.....keep going
arm's gonna get big for pc too eventually. i bet on it
someday....
microprocessors from Intel and AMD proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge[1] processor shipping in Q1 2011 and later on by AMD with the Bulldozer[2] processor shipping in Q3
Yea, intel and AMD are collaborating and running AVX2 instructions. So that is all I need to bother to support
yeah, seems decent
ARM / Neon controls over 98% of the mobile market though. So if you need to optimize for that, use ARM instructions
Each feature level above provides a compile-time check to test if the feature level is present at compile-time:
yeah ok. i literally can't wait for my colleages to freak when seeing advanced code though. XD gonna be insanely funny
Yea, it's compile time constant. So no need to worry about 150+ lines of manually vectorized code to just add 1 to every index of an array slowing ya down. Burst got ya covered.
ok nice
I believe it's set by the build target as well. Guess that does do something in the end.
I mean, the manual vectorization does result in regular C# code if you follow it all the way down in the source. Except it's structured in a way that the Burst pattern recognition can easily map.
yeah that's of course understood
So if you dont leave a fallback method and still compile for a platform that doesnt have that instruction set, it'll still work. Maybe not as well optimized though due to the massive amount of extra code.
meh. it's still bursted code. will still be pretty good
yea, I repeated the tests and confirmed that each entity is being randomly set a value. That's 2.5x faster than regular bursted code.
Manual
pristine
Automatic
massive difference, manual unrolled the random value generation into a series of identical code and with the AVX2 commands as well whereas automatic just has the generic mov.
And the AVX2 command includes vpaddd, so it's a packed addition. Full vectorization.
Are you using 1.7.0 burst?
think so
I dont recommend it if you plan on doing manual vectorization. Burst eventually stopped updating its cache of burst function outputs and wouldnt recompile, even if I reset the editor and turned burst off and on again
I think manual vectorization broke something and it didnt reset even after reload. I had to go back to 1.6.2 for everything to work properly.
1.7.0 introduced "cached" versions of bursted functions to speed up burst compilation inside a file and I think I broke something
hm yeah. hope that get's fixed in a release soon
Well yea, Burst 1.7.0 isnt even available for the "public" yet and only accessible through manual version request
Still, the arrows are beautiful. God I miss them in 1.6.2
oh definitely
Also a big tip I've largely discovered today. Use [AssumeRange()] everywhere you can. Burst loves it and drastically changes the resulting compiled assembly for the better with it.
oh i'm kinda an advocate to use any attribute whenever available as much as possible
That assume range starting 1 removes a wasted "if empty" return check in the assembly.
oh awesome
Just one of the attributes I've seen actually do something
[NoAlias]
And the int remainder max value of 7 occasionally allows Burst to unroll and fully vectorize things as well. 8 ints is a magic number and allows for easy packing
If I have more than 1 pointer of identical types in a parameter list, that goes before them yep.
Im pretty sure I can also vectorize the random number generation.... pretty sure
Alright, once you multithread it, the gap between manual (top) and automatic (bottom) Burst isnt that large. ~0.1 ms so a 15% improvement when you manually vectorize. Pretty much what Burst themselves say to expect.
TLDR version: If you have a singlethreaded very mathematics heavy job, manually vectorize. You can get massive speedups by doing so. Multithreaded jobs like those working on Entities, ehhhhh. Massive amount of time for a 15% improvement (that will get larger though the fewer cores the computer has). Performance critical items should be manually vectorized as expected.
And the fact that it has taken me ~20 hours to create a functional manually vectorized job to add 1 to every index of an array kinda tells ya the amount of effort a beginner in assembly (no previous knowledge) has to put in to do something so incredibly simple.
i think we're going in a more core direction though in the next years. just saying. but yeah, good points.
none said it'd be trivial
DOTS is here for the core utilization. Burst is here for the micro-optimization. DOTS itself is fairly static, I doubt I can get any better performance without ripping apart the package any more than me and enzi has already done with the direct pointer access CDFE. Burst though is a land of opportunity for hours of wasted time to remove 1% of computation time.
well....burst still is part of dots
Yea, I wish it was though. But that's autovectorization. As I mentioned earlier today, I wish there was a way to define chunk entity sizes. Multiples of 8 would make manual vectorization literally as easy as just automatic vectorization.
maybe in the future
The vast majority of my time coding was spent trying to get the remaining 1 - 7 components vectorized without using a value-wise for loop
Well fuck, valuewise for loop is faster
Have you thought about putting float4 or int4 in a comp instead of float and int? might be easier to get at least the multiple of 4
That was the padding I was discussing earlier.
If I only need 1 float, I'm wasting 7 floats' worth of space if I pad it out
Entity index determines the size of the native array I'm computing over
a float4 in an entity is equivalent to 4 entities with 1 float
that's true with pointer access
And that is what the native array is, a linear pointer access point
when the cache line is 64 bytes, reading a native array at [0], the next 4, assuming ints should be free, right?
eh, more but you get the meaning
It reads [0] and [1] then to fill 64
Vectorized random int generation. Pulled from the NextInt() function itself
wellll, no. This will result in 8 identical ints
I just put a marker in a job that has no code in between and it's still measuring like 1.87ms
What?
Yeah, I don't know what it's measuring exactly. Itself? lol
code sample?
let me try....
mine are called about 4,000 times
it should be literally 0
that's annoying
0.078/4*250 = 4.875ms, so it's not a linear cost at least
as long as the number of entities and method calls are the same, the overhead should be constant
comparisons still work, just not perfect comparisons
well, at least I can tell now that the variable assign from a struct that VS was so heavily profiling is garbage measurement. i mean, it was weird to have it be like 1.7ms, but when an empty marker has 1.5ms it's quite okay.
and it's disappointing what VS is measuring. i really dislike these kinds of false positives that i'm chasing sometimes. just a waste of time -.-
lots of garbage then, huh
huuuuh, I think random is broken. They're missing pointers
NextInt() from random gives the same exact value every time
the Unity.Mathematics.Random?
use it with ref
otherwise it returns the same value
oh
i had the same thing in the beginning and was sooo weirded out because sometimes all my casters were just missing the target!
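A minimal sketch of why the by-value copy keeps repeating itself, in plain C# so it runs outside Unity. `TinyRandom` and `Demo` are hypothetical stand-ins, not the Unity.Mathematics.Random source; the only thing assumed to be shared is the shape being discussed: a struct whose state lives in a field that every draw mutates.

```csharp
using System;

// Hypothetical stand-in for a struct-based generator (xorshift on one field).
struct TinyRandom
{
    public uint state;

    public TinyRandom(uint seed) { state = seed; }

    public uint NextUInt()
    {
        uint result = state;   // hand back the current state...
        state ^= state << 13;  // ...then advance it
        state ^= state >> 17;
        state ^= state << 5;
        return result;
    }
}

static class Demo
{
    // Receives a copy of the struct: the caller's state never advances.
    public static uint DrawByValue(TinyRandom rng) => rng.NextUInt();

    // Receives the caller's struct itself: the state advances on every call.
    public static uint DrawByRef(ref TinyRandom rng) => rng.NextUInt();

    static void Main()
    {
        var rng = new TinyRandom(1u);
        Console.WriteLine(DrawByValue(rng) == DrawByValue(rng));     // True
        Console.WriteLine(DrawByRef(ref rng) == DrawByRef(ref rng)); // False
    }
}
```

The same thing happens when the generator is a field of a job struct or is passed into a helper method: unless the mutated copy is written back (or passed with `ref`), the next call starts from the old state.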
Where do I dispose of a referenced native array in a job? the calling function or inside the job itself? [BurstCompile] private struct CreatePathJob : IJob { public NativeArray<AstarNode> nodeArray;
i think you know by now but state is a field
Generally outside. There's an overload of native container dispose that accepts job handles
oh yea, me dumb
but the math isnt compiled away so it did its job in making the compiler do things
thanks
it's a job that schedules disposal of the container so just chain it onto the back of the job using it
Absolutely beautiful
oh nice, so vectorized random is working now?
no, it'll keep returning the same value
what it's doing is preventing Burst from memset-ing away all the logic
Like this?
```
private void FindPath(Vector3Int start, Vector3Int end)
{
    NativeArray<AstarNode> nativeArray = new NativeArray<AstarNode>(nodeArray.Length, Allocator.Temp);
    for (int i = 0; i < nodeArray.Length; i++)
    {
        nativeArray[i] = nodeArray[i];
    }
    Vector3Int startCell = navigationMap.WorldToCell(start);
    Vector3Int endCell = navigationMap.WorldToCell(end);
    CreatePathJob job = new CreatePathJob
    {
        nodeArray = nativeArray,
        start = new int2(startCell.x, startCell.y),
        end = new int2(endCell.x, endCell.y),
        bounds = bounds
    };
    job.Run();
    nativeArray.Dispose();
}
```
so this is replicating a complex and expensive task that I manually vectorized, with code otherwise identical to the autovectorized (or not) version.
One, the NativeArray constructor takes a parameter after the Temp allocator: NativeArrayOptions.UninitializedMemory. It'll skip the memclear of the new array, which will improve speed very slightly
the rewrite to use CDFE pointers shaved off around 1ms. I had hopes it would be more but it's honestly not that bad. as it's multi threaded it's really over 8ms overall
Two, NativeArray also has a mem-copy method, .CopyFrom(T[]), that you can use instead of that for loop
random access will never be good, it's just how random access is
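3, use int3 instead of Vector3Int, it honestly won't change much but maybe Burst might do something with it.
job.Run() executes the job immediately on the calling thread and returns nothing. job.Schedule() is the one that returns an object known as JobHandle, which tells you if a job is complete and lets you chain cleanup onto it. You never got a handle to store.
Basically you need to add var pathJobHandle = job.Schedule(); and then on the next line pathJobHandle = nativeArray.Dispose(pathJobHandle); (and allocate with Allocator.TempJob, since Temp allocations can't be handed to a scheduled job)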
I've thought a lot about going around the random access but it usually means, write out data and start another iteration which is always more expensive than just taking the hit. some mechanics will never be that fast but I still can't stop thinking about solving this
Then do something expensive on the main thread. Then near the end of the Update(), call pathJobHandle.Complete() and you just multithreaded that job. Congrats.
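Pulling the allocation, copy, schedule, and dispose tips together, a minimal sketch could look like the following. The component, the job, and the data are all hypothetical (a trivial sum instead of the A* job), and it assumes a Unity project with the Burst, Collections, and Jobs packages installed.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical component; names are illustrative only.
public class ScheduleAndDisposeSketch : MonoBehaviour
{
    [BurstCompile]
    private struct SumJob : IJob
    {
        [ReadOnly] public NativeArray<int> input;
        public NativeArray<int> result; // length 1, receives the sum

        public void Execute()
        {
            int sum = 0;
            for (int i = 0; i < input.Length; i++) sum += input[i];
            result[0] = sum;
        }
    }

    private readonly int[] managedSource = { 1, 2, 3, 4 };
    private NativeArray<int> result;

    private void OnEnable() { result = new NativeArray<int>(1, Allocator.Persistent); }
    private void OnDisable() { result.Dispose(); }

    private void Update()
    {
        // TempJob rather than Temp, because the array goes to a scheduled job.
        // UninitializedMemory is safe here only because CopyFrom overwrites every element.
        var input = new NativeArray<int>(managedSource.Length, Allocator.TempJob,
                                         NativeArrayOptions.UninitializedMemory);
        input.CopyFrom(managedSource); // bulk copy instead of a per-element loop

        JobHandle handle = new SumJob { input = input, result = result }.Schedule();
        handle = input.Dispose(handle); // disposal is chained after the job

        // ... other main-thread work goes here ...

        handle.Complete(); // wait for the job and the chained disposal
        Debug.Log(result[0]);
    }
}
```

Once `Dispose(handle)` has been called, the input array is no longer usable from the main thread, so anything the job produces has to land in a separate container such as `result` here.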