#Strange Timing Question

warm tusk
#

EDIT: summary of progress here #1403846020504358932 message, so you don't have to read the whole thread

original message:

I've got a vulkan game that I'm profiling using tracy, and am confused about the results. I'm using a pretty bog standard game loop, and the specific bits that I'm profiling seem well within the 16ms budget. what's strange is I'm getting a cadence where my fence wait is very long every other frame, and then both frames on my swapchain flush in quick sequence.

any idea what gives?

(note: the stackframes I've "zonescoped" are sparse, but I promise there's no mysterious CPU load being unaccounted for. the "earth" GPU frames are my most expensive renderpass, but the rest of my renderpasses are also there, just way small by comparison)

warm tusk
#

oof. turns out I was using 2 frames in flight when I should have been using 3. whoops!

warm tusk
#

update, nope: that didn't account for it. turns out it's a timing thing, and I happened to toggle some stuff that degraded performance which "fixed" it. still no idea what's wrong 🙁

hoary zinc
#

Looks like it's CPU bound here. Try adding dedicated tracy zones for: vkAcquireNextImageKHR, vkResetFences, vkWaitForFences, vkQueueSubmit, vkQueuePresentKHR.

warm tusk
#

ty for answer, sorry for splitting this thread (I wasn't getting any traction here before 😛 )
but I have a more up to date tracy screenshot here #beginners message (I have tracy zones scoped around the fence waits, and the acquire + waits + resets + submit + present, from CPU side, all happen within a tiny sliver, with the exception of the fence waits)

#

dumping post from #beginners here:

I'm running into absolutely wacky frame timings, and I just cannot figure it out for the life of me. I'm using a bog standard frame loop, doing nothing fancy, I've copy/pasted my loop into chat GPT and it can't find anything wrong with it, I've pored over it line by line making sure I understand what's happening, and something just still isn't making sense. anyone wanna take a stab at what I could possibly be doing wrong?
(attached: a tracy profile screenshot showing the ridiculousness; gray = waiting on fence)
some details:

  • VK_PRESENT_MODE_FIFO
  • 2 frames in flight
  • 3 swapchain images
  • extremely minimal CPU (~0.5ms/frame) and GPU (~3ms/frame) workload
  • 1 fence per fif (wait at top of frame, trigger w/ submit)
  • also keep track of frame/swapchain index fence associations and make sure to also wait on that fence
  • 1 semaphore per fif (submit pWaitSemaphores, triggered on vkAcquireNextImage)
  • 1 semaphore per swapchain img (present pWaitSemaphores, triggered on submit pSignalSemaphores)

with such small workloads, what I'd expect to happen is, there'd be constant pressure on the sync objects to immediately snap up rendering w/ every vblank. but I'm getting these crazy waits.

anyone see something obviously wrong in my logic?

hoary zinc
#

roger. could u show the code for fencewait and fenceimagewait?

warm tusk
#

here's my whole loop code:

{
  vkWaitForFences(gvCore.device, 1, &gvLoop.frameResourceFences[gvLoop.curfif], VK_TRUE, UINT64_MAX);

  VkResult swapchainResult = VK_SUCCESS;
  {
    u32 imageIndex;
    swapchainResult = vkAcquireNextImageKHR(gvCore.device, gvWindow.swapchain, UINT64_MAX, gvLoop.frameImageAvailableSemaphores[gvLoop.curfif], VK_NULL_HANDLE, &imageIndex);

    if(swapchainResult == VK_SUCCESS || swapchainResult == VK_SUBOPTIMAL_KHR)
    {
      if (gvLoop.imageFrameResourceFences[imageIndex] != VK_NULL_HANDLE)
      {
        vkWaitForFences(gvCore.device, 1, &gvLoop.imageFrameResourceFences[imageIndex], VK_TRUE, UINT64_MAX);
      }
      gvLoop.imageFrameResourceFences[imageIndex] = gvLoop.frameResourceFences[gvLoop.curfif];
      vkResetFences(gvCore.device, 1, &gvLoop.frameResourceFences[gvLoop.curfif]);

      uploadEnvData();
      VkCommandBuffer commandBuffer = startRenderCommandBuffer();
      appendPreRenderCommands(commandBuffer);
      appendRenderCommands(commandBuffer, imageIndex, true, true);
      endRenderCommandBuffer(commandBuffer);

      VkSubmitInfo submitInfo={};
      submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
      submitInfo.waitSemaphoreCount = 1;
      submitInfo.pWaitSemaphores = &gvLoop.frameImageAvailableSemaphores[gvLoop.curfif];
      VkPipelineStageFlags waitStages[] = {VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT};
      submitInfo.pWaitDstStageMask = waitStages;

      submitInfo.commandBufferCount = 1;
      submitInfo.pCommandBuffers = &gvLoop.commandBuffers[gvLoop.curfif];
      submitInfo.signalSemaphoreCount = 1;
      submitInfo.pSignalSemaphores = &gvLoop.imageRenderFinishedSemaphores[imageIndex];

      VkPresentInfoKHR presentInfo={};
      presentInfo.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
      presentInfo.waitSemaphoreCount = 1;
      presentInfo.pWaitSemaphores = &gvLoop.imageRenderFinishedSemaphores[imageIndex];

      presentInfo.swapchainCount = 1;
      presentInfo.pSwapchains = &gvWindow.swapchain;
      presentInfo.pImageIndices = &imageIndex;
      presentInfo.pResults = nul;

      CHECK_VKCMD(vkQueueSubmit(gvCore.graphicsQueue, 1, &submitInfo, gvLoop.frameResourceFences[gvLoop.curfif]),"failed to submit draw command buffer");

      swapchainResult = vkQueuePresentKHR(gvCore.presentQueue, &presentInfo);
    }
    else
    {
      vkResetFences(gvCore.device, 1, &gvLoop.frameResourceFences[gvLoop.curfif]);
    }
  }

  // swapchain result is always VK_SUCCESS in my problematic case, so don't worry about it
}
gvLoop.curfif = (gvLoop.curfif+1)%gvConfig.nfif;
#

(I took out the VkZoneScoped calls and other profiling stuff for clarity, but they're scoped around the two vkWaitForFences calls)

hoary zinc
#

The issue I see: u use the same index for the swapchain image and the FIF. This is the top mistake in Vulkan. gvLoop.curfif

Solution: for the swapchain image u need to use the dedicated index returned by vkAcquireNextImageKHR.

For the FIF index u need to brainlessly increment your own index and wrap it back to 0.

#

@hildarthedorf described the legit strategy for organizing FIF and swapchain using OG binary semaphores.

For each FIF:

  • semaphore (acquire)
  • fence (for submit, wait and reset)
  • command pool + command buffer (for best practices)

For each swapchain image:

  • semaphore (render end)
  • framebuffer (connected to swapchain image)

The workflow

  1. vkWaitForFences
  2. vkAcquireNextImageKHR (fif->acquire)
  3. vkResetFences
  4. vkResetCommandPool
  5. Begin command buffer
  6. Issue commands
  7. End command buffer
  8. vkQueueSubmit(wait sem: fif->acquire, signal sem: swapchain->renderEnd)
  9. vkQueuePresentKHR(wait sem: swapchain->renderEnd)

Frame in flight index is incremented and reset to zero. Swapchain image index is returned by vkAcquireNextImageKHR.

FIF and swapchain image count could be anything. For example in my Android app I have two FIF and four swapchain images.
▶️ This could be simplified by using timeline semaphores. Unfortunately I've never touched them because my target devices are limited to Vulkan 1.1 for mobile tile GPUs. Note that the VkFramebuffer and VkRenderPass crap features are not needed on Desktop Vulkan. You should adjust the approach to use dynamic rendering.
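
A minimal sketch of the index bookkeeping described above, with no Vulkan calls at all. `FrameSyncSlots` and `FrameCursor` are hypothetical names invented for this illustration; the only point is that the FIF index is an app-owned counter, while the image index is whatever vkAcquireNextImageKHR hands back.

```cpp
#include <cassert>
#include <cstdint>

// Which slot of each per-FIF / per-image array one frame would use.
// (Hypothetical helper type, not part of Vulkan.)
struct FrameSyncSlots {
    uint32_t fence;         // per-FIF fence: waited at top of frame, signaled by submit
    uint32_t acquireSem;    // per-FIF semaphore: signaled by vkAcquireNextImageKHR
    uint32_t renderEndSem;  // per-swapchain-image semaphore: waited on by present
    uint32_t framebuffer;   // per-swapchain-image framebuffer
};

class FrameCursor {
public:
    explicit FrameCursor(uint32_t framesInFlight) : count_(framesInFlight) {
        assert(framesInFlight > 0);
    }

    // Per-FIF objects follow the app-owned counter; per-image objects follow
    // the index returned by the acquire call.
    FrameSyncSlots slotsFor(uint32_t acquiredImageIndex) const {
        return FrameSyncSlots{fif_, fif_, acquiredImageIndex, acquiredImageIndex};
    }

    // Called once per frame, after present: increment and wrap back to 0.
    void advance() { fif_ = (fif_ + 1) % count_; }

    uint32_t current() const { return fif_; }

private:
    uint32_t count_;
    uint32_t fif_ = 0;
};
```

With two FIF and four swapchain images, for example, `FrameCursor cursor(2); cursor.slotsFor(3)` picks fence and acquire semaphore 0 but render-end semaphore and framebuffer 3.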

warm tusk
#

reading through the hildar post now, but I think I'm already doing what you are describing? I have independent curfif and imageIndex; I cycle curfif (just after/outside of this loop, sorry I failed to include that), and use the imageIndex returned from the acquire

#

(adding this now to the loop above: gvLoop.curfif = (gvLoop.curfif+1)%gvConfig.nfif;)

hoary zinc
#

I'm looking at the code: line 12, line 6 and line 61. Obviously the same index

warm tusk
#

there are two vkWaitForFences, though: line 2 and line 12. line 2 waits on the curfif fence, and line 12 waits on the fence that was last associated with that imageIndex (note: line 14 associates the fences in frameResourceFences with the currently used image index, they're not two separate lists of fences)

hoary zinc
#

let me compare code flow more carefully...

#

line 18 is strange. what's the purpose here? U only need to wait on one fence per frame.

warm tusk
#

sorry, I've been tinkering w/ the posted code to get rid of extraneous info. what are you thinking is line 18?

hoary zinc
#

I have no clue what's going on. I'm comparing with my implementation for Windows and Android platforms. I have none of this. So maybe u could explain why the additional vkWaitForFences is needed?

warm tusk
#

by "none of this" you mean just the second vkWaitForFences? or is there other stuff you're not doing?

the situation is, I have a fence for every fif- I don't have any fences other than that. the second list is just an association of existing fences with their corresponding imageIndex.

so what might happen is:

- frame 1: fif:0, imageIndex 0
- frame 2: fif:1, imageIndex 0

and I'd wait on fif:1 fence for frame 2, but I also would wait on fif:0 fence for frame 2, because I'm trying to use an index that was associated with a previous frame.

I'd think this would be rare, but my tracy profiling shows that it hits, so 🤷
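
A rough sketch of that association logic, using plain integers instead of VkFence handles. `ImageFenceTracker` and `kNoFence` are hypothetical names for this illustration. One deliberate difference from the posted loop: the sketch skips the second wait when the image's previous owner is the same FIF slot, whereas the posted loop simply waits again on the already-signaled fence.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for VK_NULL_HANDLE: this image has not been used by any frame yet.
constexpr int kNoFence = -1;

class ImageFenceTracker {
public:
    explicit ImageFenceTracker(uint32_t imageCount) : owner_(imageCount, kNoFence) {}

    // Fence slots to wait on before frame slot `fif` renders into `imageIndex`:
    // always the FIF's own fence, plus the fence of whichever FIF last used
    // this image, if that was a different one.
    std::vector<int> fencesToWait(int fif, uint32_t imageIndex) const {
        assert(imageIndex < owner_.size());
        std::vector<int> waits;
        waits.push_back(fif);
        const int previousOwner = owner_[imageIndex];
        if (previousOwner != kNoFence && previousOwner != fif) {
            waits.push_back(previousOwner);
        }
        return waits;
    }

    // Record that `imageIndex` is now guarded by the fence of frame slot `fif`.
    void claim(int fif, uint32_t imageIndex) {
        assert(imageIndex < owner_.size());
        owner_[imageIndex] = fif;
    }

private:
    std::vector<int> owner_;  // per swapchain image: FIF slot whose fence guards it
};
```

Replaying the two frames from the example (fif 0 with image 0, then fif 1 with image 0) makes the second frame wait on both fence 1 and fence 0.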

hoary zinc
#

Roger. I suggest to implement the simple case first. Here is my suggestion:

warm tusk
#

I guess there's no need for me to stall CPU side while waiting for a submit involving an in-use swapchain image index, because presumably the semaphore would hold off on letting that submit do anything until the swapchain actually gives up the image anyways. but ftr, if I just comment out that additional wait, I still get the ridiculous timing

#

looking at your code now 👀

hoary zinc
#

are u rendering a complex scene or a single triangle? It could be a wrong GPU timeline because of a missing tracy zone.

warm tusk
#

a complex scene, but not that complex. I'm TracyVkZoneing all the render command appendations (there's one "big" one, called "earth", and a bunch of small ones)

hoary zinc
#

do u do any cross command buffer synchronizations? because I see at least two parallel command buffer executions on the GPU timeline

warm tusk
#

nope, one big command buffer and one big submit

warm tusk
#

ok I'm actually still working through hildar's post very slowly and carefully to not miss anything, and I think the only inconsistency with what I have is that hildar has "one framebuffer for each swapchain image", where I have "one framebuffer for each fif X swapchain image"

white echo
#

it doesn't impact perf but like ... why?

warm tusk
#

I have a set of frame resources per fif, so that's uniform buffers, storage buffers, intermediate images, depth buffers, etc..., and a set of present images per nswapchainimages. so the thought is, I might get assigned to present to one swapchain image while also being assigned a different set of other frame resources?

white echo
#

yes but sets and framebuffers are separate objects

#

the only concern here is depth or any other attachment, and that isn't a concern on a single queue cuz every frame is sequential and you don't care about the previous frame's data

#

also why you only need one depth image total usually

warm tusk
#

ok. I mean that'd be a nice GPU memory optimization to take advantage of in the future, but for now I just want to figure out this stutter lol

white echo
#

soo quick qn, why are you waiting on fences after acquiring and why are you doing this

if (gvLoop.imageFrameResourceFences[imageIndex] != VK_NULL_HANDLE)
      {
        vkWaitForFences(gvCore.device, 1, &gvLoop.imageFrameResourceFences[imageIndex], VK_TRUE, UINT64_MAX);
      }
      gvLoop.imageFrameResourceFences[imageIndex] = gvLoop.frameResourceFences[gvLoop.curfif];
      vkResetFences(gvCore.device, 1, &gvLoop.frameResourceFences[gvLoop.curfif]);
#

seems to me that it is likely the source of your problem cuz you might be waiting on a fence that waits for 2 frames

warm tusk
white echo
#

move the wait before acquire and then only use gvLoop.frameResourceFences[gvLoop.curfif]
delete the whole gvLoop.imageFrameResourceFences[imageIndex]

warm tusk
#

I already have a wait before the acquire (using frameResourceFences[curfif]); the imageFrameResourceFences[imageIndex] is an additional wait, and removing it doesn't solve the issue

white echo
#

ohhhh missed the first line lmao

#

well I don't see anything else wrong with the code you have shown 🤔

warm tusk
#

yeah, damn. wth am I doing wrong... 🤔

warm tusk
#

ok so I've wrapped a TracyVkZone around my whole frame's command buffer (which verifies that yes, the "earth" commands are 99% of the time spent), but I've also implemented VK_EXT_calibrated_timestamps, so the timing of the GPU and CPU stuff should be ~ in sync. and when I hover over one of my GPU loads, it highlights (vertical thin pink line on the left side of the screenshot) where on the CPU the command was issued. and so it looks like I'm issuing two frames of commands, then chilling on a fence for a very long time, while presumably a semaphore is preventing the corresponding fif's commands from flushing(?), then the semaphore relents, the commands are flushed, the fence is triggered, and I can again issue two fifs.

so what would be holding on to a semaphore for 31ms, with no visible GPU load? is there a way I can get tracy to see some more internals there?

warm tusk
#

equal parts fascinating and frustrating: if I start my loop with a vkDeviceWaitIdle(gvCore.device); right before my vkWaitForFences, I get a rock solid 60fps 🤦

#

(I don't count this as "problem solved", btw. clearly there's something wrong w/ my sync logic that this fixes it...)

hoary zinc
#

wait could u try another app? Make sure that every single implicit layer is off. Especially Riva Tuner.

#

Maybe the issue is not with the code but with your environment.

warm tusk
#

how can I try another app? I'd have to build it with tracy integration?

hoary zinc
#

vk cube

#

i think u can distinguish 30 fps vs 60 fps by eye

warm tusk
#

the issue isn't merely that it's 30fps to 60fps, but that the 30fps I experience is super stuttery and terrible. but that should work too- how can I quickly and easily run vk cube?

hoary zinc
#

it should be inside VulkanSDK package. Are u on Windows or Linux?

warm tusk
#

windows

#

(but also heads up, I really doubt I'm running anything that would universally hamstring vulkan applications. happy to run this quick test to prove it though)

#

ok yep found Vulkan Cube.exe, and it looks like it's running 60fps

hoary zinc
warm tusk
#

windows

hoary zinc
#

bad. we need a console output. vulkan-1.dll could print loading order if u run app with VK_LOADER_DEBUG=all

#

layer stack for VkInstance and VkDevice

#

@white echo do u know how to output this with vkconfig maybe?

warm tusk
#

I just ran it through the vulkan configurator and got:


Vulkan Development Status:
- Vulkan Layers Controlled by "Validation" configuration
- Environment variables:
    - VULKAN_SDK: C:\VulkanSDK\1.3.283.0
    - VK_LOCAL: C:\Users\phild\VulkanSDK
- Vulkan Loader version: 1.4.309
- User-Defined Layers locations:
    - VK_LAYER_PATH variable: None
    - Per-configuration paths: None
    - VK_ADD_LAYER_PATH variable: None
- `vk_layer_settings.txt` uses the default platform path:
    C:\Users\phild\AppData\Local\LunarG\vkconfig\override
- Available Layers:
    - VK_LAYER_NV_optimus
    - VK_LAYER_NV_present
    - VK_LAYER_RENDERDOC_Capture
    - VK_LAYER_OBS_HOOK
    - VK_LAYER_VALVE_steam_overlay
    - VK_LAYER_VALVE_steam_fossilize
    - VK_LAYER_LUNARG_api_dump
    - VK_LAYER_LUNARG_gfxreconstruct
    - VK_LAYER_KHRONOS_synchronization2
    - VK_LAYER_KHRONOS_validation
    - VK_LAYER_LUNARG_monitor
    - VK_LAYER_LUNARG_screenshot
    - VK_LAYER_KHRONOS_profiles
    - VK_LAYER_KHRONOS_shader_object
- Physical Devices:
    - Intel(R) Iris(R) Xe Graphics with Vulkan 1.3.293
        - deviceUUID: 8680A0A7040000000002000000000000
        - driverUUID: 33322E302E3130312E36303738000000
    - NVIDIA GeForce RTX 4050 Laptop GPU with Vulkan 1.4.312
        - deviceUUID: 4FCCE0F1BC31734D47A646C685209BCF
        - driverUUID: 2A9A7E7F0F015AF8B3D06EE0131CF715

Launching Vulkan Application:
- Application: nightondeck.exe
- Executable: C:\archive\projects\github\nightondeck\package\release\nightondeck.exe
- Working Directory: C:\archive\projects\github\nightondeck\package\release
- Log file: C:\Users\phild\VulkanSDK\nightondeck.txt

257 countries

  [+0.01s] ini_borders

  [+0.15s] ini_window
Vulkan Instance Extensions Requesting:
  VK_KHR_surface
  VK_KHR_win32_surface

WARNING-CreateInstance-status-message(INFO / SPEC): msgNum: 601872502 - Validation Information: [ WARNING-CreateInstance-status-message ] Object 0: handle = 0x2b239d4c050, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0x23dfd876 | vkCreateInstance():  Khronos Validation Layer Active:
    Settings File: Found at C:\Users\phild\AppData\Local\LunarG\vkconfig\override\vk_layer_settings.txt specified by VkConfig application override.
    Current Enables: None.
    Current Disables: None.

    Objects: 1
        [0] 0x2b239d4c050, type: 1, name: NULL

  [+0.06s] ini_graphics
Vulkan Device Extensions Requesting:
  VK_EXT_calibrated_timestamps
  VK_KHR_swapchain

Vulkan API Version: 1.4.312
Vulkan Driver Version: 580.352.0

WARNING-vkGetDeviceProcAddr-device(WARN / SPEC): msgNum: 582089644 - Validation Warning: [ WARNING-vkGetDeviceProcAddr-device ] | MessageID = 0x22b1fbac | vkGetDeviceProcAddr(): pName is trying to grab vkGetPhysicalDeviceCalibrateableTimeDomainsEXT which is an instance level function
    Objects: 0

shared staging used: 184.7MB of 512.0MB (36.1%)

  [+0.92s] ini_graphicscontent
  [+0.00s] ini_audio

There are 0 game controller(s) attached (0 joystick(s))
  [+0.00s] ini_game
[+1.24s] init

writing file pipelines.cache

total time 20423380762 (7.00s)
[+1.24s] init
  [+0.01s] ini_borders
  [+0.15s] ini_window
  [+0.06s] ini_graphics
  [+0.92s] ini_graphicscontent
    [+0.12s] ini_vulkanshared
    [+0.00s] ini_vulkanoffscreens
  [+0.00s] ini_audio
  [+0.00s] ini_game

arena memory used (7063 allocs, 2 arenas) 100% 262144b (0.2MB) of 262144b (0.2MB)
arena memory used (1 allocs) 0% 0b (0.0MB) of 131072b (0.1MB)

Process terminated
hoary zinc
#

does not count. it needs to be run via the app itself. The app is responsible for requesting explicit layers, for example

#

could u quickly add a console entry point for ur app? I mean a main function, and switch the linker settings to produce a console version of your app. U could get the HINSTANCE via GetModuleHandleW(nullptr); The rest of the WinAPI code should not change.

white echo
#

you need to compile as a console app basically

#

vkconfig doesn't output VK_LOADER_DEBUG iirc

warm tusk
#

ok, I have it building and running as a console app. what do you want me to do with it?

hoary zinc
#
  1. open terminal (cmd.exe)
  2. execute set VK_LOADER_DEBUG=all
  3. run your app in that terminal
  4. show the Vulkan loader output for VkInstance layers and VkDevice layers
warm tusk
hoary zinc
#
vkCreateDevice layer callstack setup to:
   <Application>
     ||
   <Loader>
     ||
   VK_LAYER_NV_optimus
           Type: Implicit
           Enabled By: Implicit Layer
               Disable Env Var:  DISABLE_LAYER_NV_OPTIMUS_1
           Manifest: C:\WINDOWS\System32\DriverStore\FileRepository\nvdmi.inf_amd64_a2b59b092685856e\nv-vk64.json
           Library:  C:\WINDOWS\System32\DriverStore\FileRepository\nvdmi.inf_amd64_a2b59b092685856e\.\nvoglv64.dll
     ||
   VK_LAYER_NV_present
           Type: Implicit
           Enabled By: Implicit Layer
               Disable Env Var:  DISABLE_LAYER_NV_GR2608_1
           Manifest: C:\WINDOWS\System32\DriverStore\FileRepository\nvdmi.inf_amd64_a2b59b092685856e\nv-vk64.json
           Library:  C:\WINDOWS\System32\DriverStore\FileRepository\nvdmi.inf_amd64_a2b59b092685856e\.\nvoglv64.dll
     ||
   VK_LAYER_OBS_HOOK
           Type: Implicit
           Enabled By: Implicit Layer
               Disable Env Var:  DISABLE_VULKAN_OBS_CAPTURE
           Manifest: C:\ProgramData\obs-studio-hook\obs-vulkan64.json
           Library:  C:\ProgramData\obs-studio-hook\.\graphics-hook64.dll
     ||
   <Device>
       Using "NVIDIA GeForce RTX 4050 Laptop GPU" with driver: "C:\WINDOWS\System32\DriverStore\FileRepository\nvdmi.inf_amd64_a2b59b092685856e\.\nvoglv64.dll"

Ur layers.

#

let's try to disable every single one of them. Do u see Disable Env Var:?

#

so the Idea is to define it as env variable while running your app. U can use same trick as before with set VK_LOADER_DEBUG=all
For example here is my config in visual studio

#

=1 is intentional here.

#

after that validate that nothing except the GPU driver is below ur app
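
A hedged sketch of setting those same variables from inside the app instead of from the terminal. The three variable names are copied from the "Disable Env Var" lines in the loader output above; `setEnvVar` and `disableImplicitLayersSeenInLog` are hypothetical helpers. It assumes the call happens before vkCreateInstance and that the loader picks up variables set in-process, which is worth confirming by re-checking the VK_LOADER_DEBUG=all output.

```cpp
#include <cassert>
#include <cstdlib>

// Portable wrapper around the platform call; returns true on success.
inline bool setEnvVar(const char* name, const char* value) {
    assert(name != nullptr && value != nullptr);
#ifdef _WIN32
    return _putenv_s(name, value) == 0;
#else
    return setenv(name, value, 1) == 0;  // 1 = overwrite an existing value
#endif
}

// Sets every "Disable Env Var" reported in the loader output to "1".
// Returns how many variables were set successfully.
inline int disableImplicitLayersSeenInLog() {
    static const char* const kDisableVars[] = {
        "DISABLE_LAYER_NV_OPTIMUS_1",  // VK_LAYER_NV_optimus
        "DISABLE_LAYER_NV_GR2608_1",   // VK_LAYER_NV_present
        "DISABLE_VULKAN_OBS_CAPTURE",  // VK_LAYER_OBS_HOOK
    };
    int succeeded = 0;
    for (const char* name : kDisableVars) {
        if (setEnvVar(name, "1")) {
            ++succeeded;
        }
    }
    return succeeded;
}
```

Doing it in code keeps the test reproducible across terminals and IDE launch configs, but the terminal route is the one actually suggested in this thread.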

warm tusk
#

yep, that did not change anything :/

hoary zinc
#

alr. What's VVL status? Any sync validation/core validation errors/warnings?

warm tusk
#

oh wait nvm one sec

hoary zinc
#

for RivaTuner (usually installed with MSI Afterburner) u need to do this

warm tusk
#

agh ok sorry. I missed one of the settings, and then briefly thought it was fixed, but it quickly went back to its old stuttery pattern

#

what is rivatuner?

hoary zinc
#

if u do not use it - ignore it. It's special because it can be active without even the Vulkan Loader detecting it. Usually this stuff is used for things like this example for a DX game.

warm tusk
#

ah. yeah, I don't use it. but anyways, I've also disabled the layers, and no luck 🙁

warm tusk
#

no validation warnings (lol I actually fixed the one I had missed when enabling the sync stuff for the vk tracy, that was exposed when running it through the configurator)

hoary zinc
#

NVIDIA 580.97 driver?

warm tusk
#

yep

#

ah, nope

#

but close

#

I'll load that one up now

#

(I usually keep pretty up to date. I'd be surprised if mine is more than a few months old)

#

updated, problem persists

hoary zinc
#

out of ideas 😢

warm tusk
#

me too 😮‍💨
thanks for taking the time to troubleshoot with me though

flint moon
#

@warm tusk I encountered the same issue some 2-3 years ago, too, and by now I am pretty sure that this is a bug in Tracy. Tracy sometimes calls vkGetCalibratedTimestampsEXT excessively (1000+ calls per frame) because the calibrated timestamp deviation may change at any time and Tracy does not take this into account correctly. See here: https://github.com/wolfpld/tracy/blob/master/public/tracy/TracyVulkan.hpp#L366. If m_deviation is close to 0, it may take a lot of calls to vkGetCalibratedTimestampsEXT to finish the loop, which is obviously slow.

warm tusk
#

I appreciate the suggestion, but I don't think that's it in my case. I was getting problematic profiles before I enabled the calibrated timestamps ext, and can see the general pattern of every-other spiky frame times from my in-game frame timer (with tracy disabled). :/

warm tusk
#

putting together a summary message, so anyone else doesn't need to read through the whole thread 😛

issue: (see tracy screenshot) I'm seeing alternating fence waits of ~30ms and ~1ms, even with extremely minimal CPU & GPU workloads

my current loop (stripped down for simplicity):

v0 loop()
{
  vkWaitForFences(gvCore.device, 1, &gvLoop.frameResourceFences[gvLoop.curfif], VK_TRUE, UINT64_MAX);

  u32 imageIndex;
  VkResult swapchainResult = vkAcquireNextImageKHR(gvCore.device, gvWindow.swapchain, UINT64_MAX, gvLoop.frameImageAvailableSemaphores[gvLoop.curfif], VK_NULL_HANDLE, &imageIndex);
  //in practice, swapchainResult = VK_SUCCESS and problem persists, so don't worry about it

  vkResetFences(gvCore.device, 1, &gvLoop.frameResourceFences[gvLoop.curfif]);

  uploadEnvBuffs();

  VkCommandBuffer commandBuffer = startRenderCommandBuffer();
  appendRenderCommands(commandBuffer);
  endRenderCommandBuffer(commandBuffer);

  VkSubmitInfo submitInfo={};
  submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  submitInfo.waitSemaphoreCount = 1;
  submitInfo.pWaitSemaphores = &gvLoop.frameImageAvailableSemaphores[gvLoop.curfif];
  VkPipelineStageFlags waitStages[] = {VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT};
  submitInfo.pWaitDstStageMask = waitStages;

  submitInfo.commandBufferCount = 1;
  submitInfo.pCommandBuffers = &gvLoop.commandBuffers[gvLoop.curfif];
  submitInfo.signalSemaphoreCount = 1;
  submitInfo.pSignalSemaphores = &gvLoop.imageRenderFinishedSemaphores[imageIndex];

  VkPresentInfoKHR presentInfo={};
  presentInfo.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
  presentInfo.waitSemaphoreCount = 1;
  presentInfo.pWaitSemaphores = &gvLoop.imageRenderFinishedSemaphores[imageIndex];

  presentInfo.swapchainCount = 1;
  presentInfo.pSwapchains = &gvWindow.swapchain;
  presentInfo.pImageIndices = &imageIndex;
  presentInfo.pResults = nul;

  CHECK_VKCMD(vkQueueSubmit(gvCore.graphicsQueue, 1, &submitInfo, gvLoop.frameResourceFences[gvLoop.curfif]),"failed to submit draw command buffer");

  swapchainResult = vkQueuePresentKHR(gvCore.presentQueue, &presentInfo);
  gvLoop.curfif = (gvLoop.curfif+1)%gvConfig.nfif;
}

loop details:

  • 2 fif (issue also occurs with 3)
  • 3 swapchain images
  • VK_PRESENT_MODE_FIFO_KHR
  • < 3ms CPU/frame, < 3ms GPU/frame
  • no validation warnings
  • disabled all layers
  • updated graphics driver
  • if I throw a vkDeviceWaitIdle right before the wait for fences, I get a rock solid 60fps (but would obviously like to not need to do this)

what might I be doing wrong? I'd be willing to do anything (well, not anything...) to get to the bottom of this- if someone wants to zoom control my screen, or if you'd like compensation, etc...

toxic pewter
#

Do you also see it with your GPU manufacturer profiler?

warm tusk
#

I haven't tried (I've never used nvidia's profiler before), but I'll give that a go.
though I'm not worried about this being a fault of the profiler: I can feel it in-game, even with any profilers disabled (and can also verify it w/ the basic in-game frametime display)

warm tusk
#

ok I've pulled it up in NVidia NSight Systems, and not going to lie, I don't really understand how to interpret the results lol. reading through the "getting started" docs on their website, but is there anything specific you think I should be looking for with that?

#

ah shoot looks like NSight Systems is different from NSight Graphics.

warm tusk
#

ok welp, I launched NSight Graphics and it looks like that's a single-frame-capture tool? (similar to renderdoc?) which isn't showing me anything especially useful.

warm tusk
#

AHA! an update! I tried just logging my fif and imageIndex per frame (omg I should have done this way earlier...), and I'm getting this, starting from frame 0:

0 0
1 0
0 1
1 2
0 0
1 0
0 1
1 1
0 0
1 0
0 1
1 1
0 0
1 0
0 1
1 1
0 0
1 0
0 1
1 1
0 0
1 0
0 1
1 1
0 0
1 0

WTF. fif is oscillating correctly, but imageIndex is... why the hell is it giving me repeats and rarely using the third frame?!?!
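
A small sketch for summarizing a log like the one above instead of eyeballing it. `AcquireStats` and `summarizeAcquires` are hypothetical helpers; the input is just the second column of the log (the image index returned by each acquire).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AcquireStats {
    std::vector<std::size_t> usesPerImage;  // how often each image index came back
    std::size_t backToBackRepeats = 0;      // acquires that returned the same index as the previous one
};

// Tallies a logged sequence of image indices for a swapchain with `imageCount` images.
inline AcquireStats summarizeAcquires(const std::vector<uint32_t>& imageIndices,
                                      uint32_t imageCount) {
    assert(imageCount > 0);
    AcquireStats stats;
    stats.usesPerImage.assign(imageCount, 0);
    for (std::size_t i = 0; i < imageIndices.size(); ++i) {
        assert(imageIndices[i] < imageCount);
        ++stats.usesPerImage[imageIndices[i]];
        if (i > 0 && imageIndices[i] == imageIndices[i - 1]) {
            ++stats.backToBackRepeats;
        }
    }
    return stats;
}
```

Feeding it the first eight logged image indices (0, 0, 1, 2, 0, 0, 1, 1) reports three back-to-back repeats and only one use of image 2, while a plain 0, 1, 2 round-robin reports no repeats at all.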

toxic pewter
#

You control frame in flight yourself so it's normal that it's "oscillating"

#

imageIndex comes from vkAcquireNextImageKHR which has no guarantee on the order

warm tusk
#

"no guarantee", right- in the same way that an OS "doesn't guarantee" that any time will be spent on any given thread.
so, I have to be able to handle any out of order frames (which I thankfully do- my game doesn't crash or anything). but if it's giving me 001100110011 out of 3 requested swapchain images, then something is not working as intended

warm tusk
#

in case anyone is curious, this might be the culprit: #synchronization message

fallow wind
warm tusk
#

the yellow/green/red bars at the very top are frame times (each bar is a frame). the purple highlight in that zone is the 5-or-so frames I'm zoomed into in the bottom part of the diagram.

#

the bottom part then, is a timeline broken up into two areas: the red lines are on the GPU, and the purple/gray are on the CPU

#

looking at the purple, you'll see one long frame (mostly taken up by fencewait), then one very short frame (so short it's cut off fra|)

fallow wind
#

The red bars up top, why are they in two rows?

#

Or, I mean the red bars in the middle, I guess.

warm tusk
#

my mouse is hovering over a GPU submit workload, which shows a red sliver ~ 30ms prior highlighting when on the CPU it was submitted

#

dunno, two vulkan "contexts"? (not totally sure what that means under the hood)

#

but it (in this case anyways, not sure if this is true by rule) corresponds to a fif's workload

#

(so one blob is one fif workload, and the blob next to it but on another row corresponds to the next fif's workload)

fallow wind
#

So FiF0 is on top and FiF1 is on the bottom? That sort of thing?

warm tusk
#

yep

fallow wind
#

And the brown fencewait bar is the wait on the FiF's fence up at the top of your rendering loop?

warm tusk
#

yes

fallow wind
#

Where in that loop do you increment curfif?

warm tusk
#

right at the end

#

oh whoops, I cut it off in my copy/paste

#

let me fix that

#

fixed

fallow wind
#

So weird. That makes no sense at all.

#

What platform is this on? Like this is where I start seriously wondering about broken drivers and stuff like that...

#

If the swapchain is well behaved then the only reason it should be handing back the same image index twice (in FIFO mode especially) is if you stalled long enough for the desktop compositor to have entirely finished using that image you last presented before you next call acquire.

misty glade
#

You mentioned how using vkDeviceWaitIdle() fixes the issue, what about vkQueueWaitIdle()?

warm tusk
#

but yes, this is on an up to date windows 11 laptop using an up to date nvidia driver on a laptop 4050 GPU

toxic pewter
#

Yeah OS + vendor can help

#

Yeah ok it shouldn't happen on this setup

warm tusk
fallow wind
#

Is your EXE accidentally named some_game_that_had_broken_codde_that_nv_hacked_around.exe or something? Clutching at straws here, I know...

warm tusk
toxic pewter
#

Broken implicit layer maybe?

misty glade
#

Just trying ideas here

warm tusk
warm tusk
warm tusk
warm tusk
misty glade
#

^_^

#

Hmm

#

Maybe try reducing the problem space

#

How about two swapchain images

#

instead of three

#

or did you already try that lol

#

Aren't you supposed to use the current frame index for the proper semaphore for VkPresentInfoKHR, not the imageIndex?

fallow wind
#

I'm still a bit confused about the fencewait bar. Something is off there. So you're waiting to rerecord the FiF0 data, and that fence correctly blocks you until your last submitted FiF0 clears the GPU, Okay. But you don't start trying to rerecord FiF0 until a while after FiF1 has already finished rendering, which seems like it's too late. Recording the next FiF0 should happen on top of FiF1 being recorded...

misty glade
# misty glade Aren't you supposed to use the current frame index for the proper semaphore for ...

This is what my engine does:

Semaphore[] finishedSemaphores = lastPass.getFinishedSemaphores();

VkPresentInfoKHR presentInfo = VkPresentInfoKHR.calloc(stack);
presentInfo.sType(VK_STRUCTURE_TYPE_PRESENT_INFO_KHR);
presentInfo.pWaitSemaphores(stack.longs(
        ((VulkanSemaphore[]) finishedSemaphores)[frameIndex].getHandle()
));
presentInfo.swapchainCount(1);
presentInfo.pSwapchains(stack.longs(swapchain.getHandle()));
presentInfo.pImageIndices(pImageIndex);

vkQueuePresentKHR(presentQueue, presentInfo);

frameIndex = (frameIndex + 1) % FRAMES_IN_FLIGHT;
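
A toy model of the two indexing choices for the present-wait semaphore, with no Vulkan calls. It rests on one stated assumption: the app can only be sure a present has finished waiting on its semaphore once the image from that present has been acquired again. `SemIndexing` and `countReuseHazards` are hypothetical names, and a "hazard" here only means the app would re-signal a semaphore without that proof. It says nothing about the stutter itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class SemIndexing { PerFrameInFlight, PerSwapchainImage };

// Walks a sequence of acquired image indices and counts frames where the
// render-finished semaphore about to be signaled is still "pending", i.e. the
// image it was last presented with has not been re-acquired since.
inline int countReuseHazards(const std::vector<uint32_t>& acquiredImages,
                             uint32_t framesInFlight, uint32_t imageCount,
                             SemIndexing mode) {
    assert(framesInFlight > 0 && imageCount > 0);
    const uint32_t semCount =
        (mode == SemIndexing::PerFrameInFlight) ? framesInFlight : imageCount;
    std::vector<int> pendingImage(semCount, -1);  // -1: semaphore known to be free
    int hazards = 0;
    uint32_t fif = 0;
    for (uint32_t image : acquiredImages) {
        assert(image < imageCount);
        // Re-acquiring an image proves its previous present consumed its semaphore.
        for (std::size_t s = 0; s < pendingImage.size(); ++s) {
            if (pendingImage[s] == static_cast<int>(image)) pendingImage[s] = -1;
        }
        const uint32_t sem = (mode == SemIndexing::PerFrameInFlight) ? fif : image;
        if (pendingImage[sem] != -1) ++hazards;
        pendingImage[sem] = static_cast<int>(image);
        fif = (fif + 1) % framesInFlight;
    }
    return hazards;
}
```

Under this model, indexing by image never flags a hazard, because the semaphore for an image is always released by the acquire that just returned that image. Indexing by frame in flight can flag hazards whenever the image count differs from the FIF count.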
warm tusk
#

so we're waiting on the fence which is waiting on the completion of a GPU workload which is waiting to even begin processing until its corresponding acquired swapchain frame is available which is waiting on vsync

#

you can see in the diagram (below the frame timing bars) are two rows of "frame" spans. the top row is the span of the CPU frame (== to the width of the parent-most purple bar), and the bar below it is vsync, locked at 16.6ms.

you'll notice both GPU workloads are "released" to begin work right on a vsync frame boundary, which makes sense

#

I could mark up that diagram with notes on exactly when each fence/semaphore is taken/released, and it will all check out and be correct.

the issue is that I'm for some godforsaken reason being dished 001100110011.

misty glade
warm tusk
#

(I wouldn't be surprised if the situation driving this madness is just that I'm working with such tiny GPU workloads, and that's... confusing the driver that's responsible for predictively picking which swapchain image to assign next, which is preferring to try to reuse frames to minimize... GPU memory cache thrashing? I dunno... and it's mistaking the slow frame for something that took a long time? 🤷)

#

(^ pulling 100% out of my ass)

fallow wind
#

Do you have a way of marking which fencewait belongs to which of the red GPU bars and where each of those GPU workloads was submitted?

warm tusk
fallow wind
#

Oh, okay, I think I figured it out. The tiny pink sliver to the right of fencewait is recording that frame.

#

And the submit is at the end of that.

#

Got it.

misty glade
#

that also wouldn't fix vkAcquireNextImageKHR returning weird indices

#

that's so weird

warm tusk
misty glade
warm tusk
misty glade
#

Maybe try it on a different machine

#

Or use your iGPU as the selected device

warm tusk
#

ah iGPU is a good call 👀

misty glade
#

My nvidia gpu handled some glsl stuff weirdly compared to my igpu so it might help

warm tusk
#

iGPU works perfect. 0,1,2,0,1,2,0,1,2

misty glade
#

How long did you leave it running?

warm tusk
#

this whole time since you suggested it

#

rock solid

#

ridiculous 🤦

#

man and app bootup time is like 5x faster w/ igpu too 😭

#

I could have just been using that this whole time

#

(don't get me wrong, I want to fix the sync issues in the general case, not just for my machine)

#

so if there is still an issue in my setup, I do want to work to understand and fix it. but is it really as simple as "4050 drivers expect massive workloads and their swapchain distribution predictive logic gets confused when that's not the case"?

misty glade
warm tusk
#

would you feel comfortable running an .exe from me? ๐Ÿ˜›

misty glade
#

no, sorry lol

warm tusk
#

hah fair

misty glade
#

I don't have an nvidia gpu with me rn anyway

#

Mine's an Intel Iris Xe

warm tusk
#

I mean the test case is something other than my graphics card. no need to be nvidia.

misty glade
#

yeah

#

does your app build for linux?

warm tusk
#

it does, need to quick run it through the build machine first

#

one sec

fallow wind
#

I think you've just got stupid drivers that're going into some weird compat mode or something like that.

#

Like your code that you've posted looks right.

misty glade
#

you could run your app under linux with the official nvidia drivers

if it works, then it's a windows 4050 driver issue, and if it doesn't, that means your vulkan api usage is still broken and maybe the igpu was just not caring

warm tusk
#

I don't have linux installed on this machine :/

misty glade
#

I think you'll run out of space pretty quickly though

warm tusk
#

still haven't got around to testing it on linux, for now I'm just assuming it's an odd behavior of my graphics driver (if anyone wants to volunteer to test it on their system, lmk!)

one thing I realized, though, is that, because it is still hitting every present with a new frame, the only real "issue" is an input/physics sampling one.

that is, when I pass actual delta times into my game loop:

  • ~2ms passes
  • get input
  • advance physics ~2ms
  • frame drawn from this pov
  • ~30ms passes
  • get input
  • advance physics ~30ms
  • frame drawn from this pov
  • repeat

but the frames are presented precisely 16ms apart, which looks super juddery because it's alternating between showing 2ms of advanced physics every 16ms and 30ms of advanced physics every 16ms.

if instead, I just hardcode 16ms into advancing the world, regardless of the fact that actually only either 2ms has passed or 30ms has passed, then it actually looks totally smooth. the only real "issue" is the really inconsistent input sampling, up to +30ms more latency than I'd get otherwise (which isn't good, but in my particular case, is "acceptable").
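
A minimal sketch of the difference described above, assuming an object moving at a constant velocity and the alternating ~2ms / ~30ms deltas from the list. `samplePositions` is a hypothetical helper for illustration, not part of the actual game loop; it returns the position that would be drawn in each presented frame.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: advance a 1D position by each step and record the
// position that would be drawn for the corresponding presented frame.
std::vector<double> samplePositions(const std::vector<double>& stepsMs,
                                    double velocityPerMs) {
    assert(velocityPerMs >= 0.0);
    std::vector<double> positions;
    double x = 0.0;
    for (double dt : stepsMs) {
        x += velocityPerMs * dt;  // physics advanced by this frame's delta
        positions.push_back(x);   // what the frame shows, 16ms after the last one
    }
    return positions;
}
```

With a velocity of 1 unit per ms, the measured deltas `{2, 30, 2, 30}` give positions 2, 32, 34, 64, so the object jumps by uneven amounts between frames that are shown 16ms apart. The fixed deltas `{16, 16, 16, 16}` give 16, 32, 48, 64. In this idealized case both end at the same place after four frames, so the hardcoded step looks smooth without falling behind.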

#

(all that said: if someone here knows somebody at nvidia, and thinks they'd be interested in correcting a vulkan acquire image degeneracy on laptop-4050's, I'd be happy to cooperate with them to fix it!)

storm crypt
#

I have the exact same issue and did some digging.
It seems like just a rather common encounter / misconception, found like two related posts on the first page in the Vulkan subreddit. (FIFO giving random images / Vsync causes extreme stuttering)

Don't think it's a driver bug or anything, the AcquireNextImage thing is mostly a red herring. The real issue is related to Frames In Flight, explained in detail by Erfan Ahmadi here.

If I understand correctly, it can happen when the host is submitting frames way faster than the device can present. Sometimes multiple frames are executed in quick succession (the short delay) before any present can happen, so the following frames need to wait (the long delay) until the presentation engine is ready, causing severe visual and input lag. It's fine in Mailbox or other modes 'cause any extra frames are consumed immediately instead of being queued as in FIFO.
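
A rough sketch of the FIFO vs Mailbox difference mentioned above, under simplifying assumptions: frames become ready at the given times (sorted ascending), the queue is unbounded, and one vsync tick occurs every `vsync` ms. `framesShown` and `PresentMode` are hypothetical names for illustration, not Vulkan API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class PresentMode { Fifo, Mailbox };

// For each vsync tick 1..ticks, return the index of the frame on screen
// (-1 if nothing has been shown yet). readyTimes must be sorted ascending.
std::vector<int> framesShown(const std::vector<std::int64_t>& readyTimes,
                             std::int64_t vsync, int ticks, PresentMode mode) {
    assert(vsync > 0);
    std::vector<int> shown;
    int current = -1;      // frame currently on screen
    std::size_t next = 0;  // oldest frame not yet shown or dropped
    for (int tick = 1; tick <= ticks; ++tick) {
        const std::int64_t now = tick * vsync;
        if (mode == PresentMode::Fifo) {
            // FIFO: at most one queued frame is dequeued per vsync, oldest first.
            if (next < readyTimes.size() && readyTimes[next] <= now) {
                current = static_cast<int>(next++);
            }
        } else {
            // Mailbox: only the newest ready frame survives, older ones are dropped.
            while (next < readyTimes.size() && readyTimes[next] <= now) {
                current = static_cast<int>(next++);
            }
        }
        shown.push_back(current);
    }
    return shown;
}
```

With four frames ready at 0, 1, 2, 3 ms and a 16ms vsync, FIFO shows frames 0, 1, 2, 3 on successive ticks, so the last frame reaches the screen about 61ms after it was rendered. Mailbox shows frame 3 on the first tick and keeps it.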

Removing FiF or adding waits around submits can solve this but then it loses the advantage of using FiF in the first place.
The above blog post and nvpro_core2 apparently fix it using timeline semaphores or a similar technique.

Total beginner btw, I have no idea about most of it but hope it helps. 🙂

white echo
#

any way of syncing with the cpu that you can do with timeline semaphores you can also do with fences

#

the signaling order in the nvpro reddit thread you linked is also the standard FIF fence signaling order

white echo
#

also your first link only relates to input latency; the issue we are talking about here is that there is a 2-vsync gpu pause every FIF number of frames, as in,
draw frame 0 -> draw frame 1 -> gpu goes to sleep for 2 vsyncs -> draw frame 2 -> draw frame 3

storm crypt
#

I guess the main point is that some (me) would expect the process to get blocked on the fence or acquireImage when the queue is full, but it's actually the GPU waiting (the 2 vsyncs) on the semaphore until the image is usable.

#

Yeah, the timeline sema is probably just different way to do the same thing. I just thought maybe they somehow prevent the host from going too far ahead.

warm tusk
#

if vkAcquireNextImageKHR would just hand out images approximately cyclically (012012012 rather than 0011001100) for me, then I wouldn't have any issue. in that world, even if I'm able to push through both frames in flight essentially instantly: I'll constantly be two frames ahead, and I'll be spending a vast majority of the time just waiting at the door for each "present" to free up my next submit, which would finish instantly and leave me back at the door waiting for the next present. in other words, it'd render 1 frame every 16ms, get its work done instantly, and chill for most of the frame. perfect!

but, since it's alternating doubles, I get the absolute worst case situation. I need to wait on a fence to issue a submit, but the submit also has to wait for its corresponding swapchain image to finish any presents it's still busy with to begin work.

so here's the timing:

  • A: fif 0
  • A: acquire swapchain 0
  • A: submit (fif:0,sc:0)
  • A: GPU processes (fif:0,sc:0) [fif 0 released]
  • A: GPU holds on presenting swapchain 0, waiting for vsync
  • B: fif 1
  • B: acquire swapchain 1
  • B: submit (fif:1,sc:1)
  • B: GPU processes (fif:1,sc:1) [fif 1 released]
  • B: GPU holds on presenting swapchain 1, waiting for vsync
  • C: fif 0
  • C: acquire swapchain 1 <- why 1?!?!
  • C: submit (fif:0,sc:1)
  • C: GPU holds on processing (fif:0,sc:1), because swapchain 1 hasn't actually yet been released, because B hasn't yet presented
  • D: fif 1
  • D: acquire swapchain 0
  • D: submit (fif:1,sc:0)
  • D: GPU holds on processing (fif:1,sc:0), because swapchain 0 hasn't actually yet been released, because A hasn't yet presented
  • E: CPU holds on fif 0 because C hasn't processed

ok phew. at this point, everything is saturated, and we're waiting on our first vsync signal to trigger a present and start releasing the holds. continuing on:

  • vsync signal
  • A: presents [swapchain 0 released]
  • D would be free to start processing, but it can't, because command buffers submitted on the same queue are processed in order, and C is still held up

so we've presented a frame, but nothing was actually freed to resume! now we wait a full 16ms for the next vsync. after that wait:

  • vsync signal
  • B: presents [swapchain 1 released]
  • C: GPU processes (fif:0,sc:1) [fif 0 released]
  • D: GPU processes (fif:1,sc:0) [fif 1 released]

at this point, all swapchains are released, and all fifs are free, so we start over.
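
The timeline above can be reproduced with a small model. This is only a sketch under the same simplifications as the walkthrough: CPU recording and GPU work take zero time, acquire never blocks the CPU, a single queue executes submits in order, and an image counts as released at the vsync where it is presented. `simulate` and `FrameTimes` are hypothetical names, not part of the actual loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FrameTimes {
    std::int64_t cpuSubmit;  // when the CPU gets past the fif fence and submits
    std::int64_t gpuStart;   // when the GPU can run the submit (zero-length work)
    std::int64_t present;    // vsync at which the frame is shown
};

// imageOrder[k] is the swapchain image handed out for frame k.
std::vector<FrameTimes> simulate(const std::vector<int>& imageOrder, int imageCount,
                                 std::size_t framesInFlight, std::int64_t vsync) {
    assert(imageCount > 0 && framesInFlight > 0 && vsync > 0);
    std::vector<std::int64_t> imageFree(static_cast<std::size_t>(imageCount), 0);
    std::vector<FrameTimes> out;
    std::int64_t lastPresent = 0;
    for (std::size_t k = 0; k < imageOrder.size(); ++k) {
        const int img = imageOrder[k];
        assert(img >= 0 && img < imageCount);
        // CPU: wait on the fence of the frame that last used this fif slot.
        std::int64_t cpu = (k > 0) ? out[k - 1].cpuSubmit : 0;
        if (k >= framesInFlight) cpu = std::max(cpu, out[k - framesInFlight].gpuStart);
        // GPU: needs the image released and all earlier submits done (in-order queue).
        std::int64_t gpu = std::max(cpu, imageFree[static_cast<std::size_t>(img)]);
        if (k > 0) gpu = std::max(gpu, out[k - 1].gpuStart);
        // Present: one frame per vsync, at the first tick after the frame is ready.
        const std::int64_t ready = std::max(gpu, lastPresent);
        const std::int64_t present = (ready / vsync + 1) * vsync;
        lastPresent = present;
        imageFree[static_cast<std::size_t>(img)] = present;
        out.push_back({cpu, gpu, present});
    }
    return out;
}
```

With 2 fif, a 16ms vsync, and the image order 0,1,1,0,0,1,1,0, the CPU submit times come out as 0,0,0,0,32,32,64,64: pairs of frames 32ms apart, even though a frame is presented at every vsync. With the cyclic order 0,1,2,0,1,2,0,1 the submit times settle to 0,0,0,0,0,16,32,48, one frame per vsync.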

#

^ looks more complicated than it is. the dastardly line is C's acquiring the swapchain used by B (resulting in D acquiring the swapchain used by A). so we saturate the semaphores and fences, but when A fully completes, it does nothing to release C.

white echo
#

@warm tusk add a fence to acquire, you can prob reuse the FIF fence since you have to wait for it before recording/submitting
when you get 0,0,1,1, it should cancel out FIF but when you get 0,1,2 it should let FIF work but you may get less cpu time than without it

warm tusk
#

lol isn't that what we were discussing here? #1403846020504358932 message

but what do you mean "{in the degenerate case} it should cancel out FIF"? regardless, putting that back in doesn't resolve the issue

cobalt hound
#

Something to keep in mind with frames in flight is input delay. With triple buffering, one frame could be presenting while two more wait to be presented, and the two waiting could carry the same input information as the one being presented. For a game running at 60fps that adds a fluctuating input delay of roughly 16ms-48ms, which is noticeable and could be interpreted as lag, especially when there is additional input delay from other sources.

cobalt hound
#

also:
when outputting the image indices returned from vkAcquireNextImageKHR from my current project running on an RTX 3080 Ti:

idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0
idx: 0

some context:
i'm doing double buffering.
before writing to the swapchain image i asynchronously wait on the previous VkSwapchainPresentFenceInfoKHR fence to be signaled, because that was the only way i got rid of some validation error ... it doesn't matter though, because while i'm waiting on that i do physics and render the gbuffers; writing to the swapchain image is the post process step, and the work before it takes more than enough time for the previous image to be presented

warm tusk
#

double buffering meaning "two frames in flight"? because you are not double buffering present with that swapchain index pattern lol. and if by "async wait on previous present fence" you mean you're sitting on a fence (CPU waiting) until present of previous swapchain image to acquire the next swapchain image and submit the post process write to that image, then you're kinda forcing a not-double-buffering pattern.

cobalt hound
#

well to be fair, i'm not doing double buffering on the presentation side; however, the post process step takes less than a millisecond for me, so it is fine to do it like this, since the stuff that is done before takes significantly more time

#

the point i wanted to make is that the indices returned from vkAcquireNextImageKHR on nvidia drivers don't seem to be in an oscillating order, so the issue you're having might not be from that

warm tusk
#

oh, I can also get 0,0,0,0... if I use 1 fif (in other words, if I wait CPU side for the completion of the previous submit before acquiring the swapchainimg and launching the next one). and that works! the driver clearly notices that all presentation frames are free so there's no need to cycle.
the issue is that I'm running on a 4050, so of course it'll be able to perform a full frame's work in series. but I'd like to ship this to run on a 4050 and a 950, and I'm frustrated that the only reason the 4050 is stalling is because of an incoherent swapchain presentation order (the 0,1,1,0,0,1,1,0...).

I should be able to saturate the fences/semaphores, and when the presentation engine releases resources, I snap them up and immediately re-saturate, regardless of how fast I can resaturate. it should be the periodic release of resources that drives the period of the whole system. I feel like this is a pretty fundamental ask?

cobalt hound
warm tusk
#

yeah I've been trying to figure out what word is most correct to use there, and that was the best I could think of. but you're right that's not quite correct 😛

what I'm trying to get at is, if I'm able to produce @ a faster rate than needed, I can put up gates at various parts of the pipeline to limit output. I'm calling a gate "saturated" here if the thing it's gating is waiting on the other side of that gate.

#

like w/ a multithreaded queue: you can consider the producer/consumer sides of the queue. the producer fills the queue to the brim, but is then stuck at the gate, unable to fill it more until a consumer comes along and frees up some resources in the queue. in this case, my application is the producer, and vsync is the consumer. the producer has "saturated" this queue because it's in wait 99% of the time w/ a full queue, and just responds to any release of resources by immediately resaturating.

cobalt hound
#

i see

warm tusk
#

(the opposite side of that would be if my game can't produce content to fill the queue fast enough to keep up with the consumer- so it's the consumer that just sits around waiting for something to plop in the queue that it can grab, and the producer is just constantly doing work to add to the queue and immediately starting work on the next thing. unfortunately, in this situation, you'd say "the producer thread is saturated" because it's 100% busy doing work. so me overloading the word to say "the gate is saturated" to mean the opposite is... fairly considered confusing 😛 )
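
The "saturated gate" idea can be sketched as a bounded queue with an instant producer and a periodic consumer. This is only an illustration of the analogy under those assumptions; `producerPushTimes` is a hypothetical helper and the consumer is assumed to pop exactly one item every `consumerPeriod` ms, starting at the first period.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Times at which an infinitely fast producer manages to push each item into a
// bounded queue drained by a consumer that pops one item per period.
std::vector<std::int64_t> producerPushTimes(std::size_t capacity,
                                            std::int64_t consumerPeriod,
                                            std::size_t items) {
    assert(capacity > 0 && consumerPeriod > 0);
    std::vector<std::int64_t> pushTimes;
    for (std::size_t i = 0; i < items; ++i) {
        // The first `capacity` items fit immediately. After that, item i has to
        // wait until item (i - capacity) is popped, which happens at tick
        // (i - capacity + 1) because the queue is never empty.
        const std::int64_t t =
            (i < capacity) ? 0
                           : static_cast<std::int64_t>(i - capacity + 1) * consumerPeriod;
        pushTimes.push_back(t);
    }
    return pushTimes;
}
```

With a capacity of 2 and a 16ms consumer, the push times are 0, 0, 16, 32, 48, 64: after the initial fill, the producer's cadence is set entirely by the consumer, which is the behavior being asked for from the swapchain.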

cobalt hound
warm tusk
#

that's what I have here #1403846020504358932 message (where curfif is what you've called frameResourceIndex)

the issue is the submit wait semaphore on the swapchain's present, which is 1. necessary, because I can't be rendering to an image that is already in queue to present, and 2. causing every other frame to wait for two vsyncs to be able to get to work (frame A:swap index 0, B:1, C:1- A and B can get submitted immediately, C needs to wait for both A and B to finish presenting before it can get to work). see #1403846020504358932 message

cobalt hound
#

and when you disable vsync, e.g. use VK_PRESENT_MODE_IMMEDIATE_KHR or VK_PRESENT_MODE_FIFO_RELAXED_KHR?

warm tusk
#

lol well IMMEDIATE just means I'm rendering/updating at like 1k fps, so yeah, no stuttering (but battery drain will be insane). and I'm actually not sure about FIFO_RELAXED- what is that?

cobalt hound
cobalt hound
warm tusk
#

have I tried IMMEDIATE? yes. and it does what I've said.

#

I'm reading the spec on RELAXED now, but already I notice that it's not a guaranteed presentation mode :/

#

also, it looks like RELAXED's value is that it won't wait past a missed vsync. so a late frame has the opportunity to tear into a current one. which is not my problem

cobalt hound
#

and by that do: vkWaitForFences(gvCore.device, 1, &gvLoop.frameResourceFences[gvLoop.curfif], VK_TRUE, UINT64_MAX); with curfif=0, e.g. wait for A to be presented

warm tusk
#

It’s fif 0,1,0,1 because I’m only using 2 fif (to limit input latency). But the problem persists even w/ 3 fif (just with a slightly more complicated gate saturation timeline that I’d rather not write out 😛).

#

But C fence doesn’t wait on A to be presented despite sharing a fif index, because A’s command buffer submission has actually already finished processing and released the gate

cobalt hound
#

oh its the cmd buffer

warm tusk
#

each frame submits a command buffer, and presents a frame. The fences unlock at the completion of the command buffer submission’s GPU processing, but do not care about the present.

cobalt hound
warm tusk
#

That is the full function. Or do you mean “including the various profiling macros and recovery code paths that don’t get hit in this case”?

cobalt hound
warm tusk
#

Yep, on those grounds (removed profiling macros and unhit code paths). But I can post the actual full thing if you’d like as well- just heads up it’s the same info but more obfuscated 😛

cobalt hound
#

would still like to see it

warm tusk
cobalt hound
warm tusk
#

yeah, and that actually makes vkAcquireNextImageKHR just return 0,0,0,0...

cobalt hound
# warm tusk yeah, and that actually makes vkAcquireNextImageKHR just return 0,0,0,0...
VkSubmitInfo submitInfo = {};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.waitSemaphoreCount = 1;
submitInfo.pWaitSemaphores = &gvLoop.frameImageAvailableSemaphores[gvLoop.curfif];
VkPipelineStageFlags waitStages[] = { VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT };
submitInfo.pWaitDstStageMask = waitStages;

submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &gvLoop.commandBuffers[gvLoop.curfif];
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &gvLoop.imageRenderFinishedSemaphores[imageIndex];

in your submit info the only thing you're waiting on is the frameImageAvailableSemaphores semaphore.
which means that your gpu is free to render frame A and B at the same time.

theory:
in frame 0 and 1 your cpu submits rendering commands for fif's A and B, since there is no dependency your gpu decides to render them both at once.
in frame 2 you wait for A to finish, once that happens you submit the rendering commands for the next frame.
in frame 3 you wait for B to finish, however since it was rendered along with A it's already finished and the next commands can be submitted.
in frame 4 you wait for A to finish, while rendering B as well ... round and round it goes ...

however when you do a HARD_WAIT with vkDeviceWaitIdle B is not submitted before A has completed ...

warm tusk
#

I'm confused. lol I've already outlined exactly what happens here: #1403846020504358932 message

and there also is no "both at once"- I'm using a single graphics queue. but it does process A and Bs commands in quick succession (#1403846020504358932 message)

the issue is that C's command buffer cannot start processing on the GPU (even though it can be submitted!) until B's swapchain image finishes presenting. D also can't start processing because it was enqueued after C's, even though once A presents all semaphores are unlocked.

so we have

  • enqueue/process both A and B (in quick sequence)
  • enqueue C and eventually D but can't process
  • A presents, but that unlocks nothing
  • B presents, which unlocks C
  • C and D process (in quick sequence)
#

maybe I should make another thread that consolidates all the most up to date information. but the current question I want to answer is:

How can I force acquire swapchain to not give degenerate frame ordering?

or

How can I architect my game loop to process input/physics with a periodicity driven by an easily met vsync even in the case of degenerate frame ordering?

cobalt hound
#

from vulkans Implicit Synchronization Guarantees:

Submission order is a fundamental ordering in Vulkan, giving meaning to the order in which action and synchronization commands are recorded and submitted to a single queue. Explicit and implicit ordering guarantees between commands in Vulkan all work on the premise that this ordering is meaningful. This order does not itself define any execution or memory dependencies; synchronization commands and other orderings within the API use this ordering to define their scopes.
warm tusk
cobalt hound
cobalt hound
warm tusk
cobalt hound
#

my theory is that the rendering of A and B overlaps because there is no explicit synchronization between them, or any kind of dependency

#

if it was so, the fix would be to make any submission wait on the previous submission

warm tusk
# cobalt hound also, the order in which anything is submitted to a queue, might be executed out...

let's slow down. my above descriptions describe the situation which justifies every single wait. I'm no longer asking any questions about why anything is waiting. if you want to point to any wait in the profiling screenshot, I will be able to tell you exactly what it's waiting on and why.

the issue is that the waits are "long short long short long short", which if I drive the processing between snapshots using those waits, I end up with snapshots after 2ms of physics processing and after 30ms of physics processing, displayed every 16ms, which looks like stuttering

#

and your diagram's annotations are not correct. the red chunks on the upper timelines (not the frame graph) are alternating frame submit loads

#

one sec, I'll annotate it myself

cobalt hound
#

they are just to illustrate the point im trying to make

#

that your gpu is working on two frames at once

warm tusk
#

they literally are not. what the hell.
the red blobs are each a frame's GPU submit being processed.

#

in the profiler I can hover over the GPU submit loads and it shows the corresponding frame from which they were submitted. notice how each blob is executed sequentially.

cobalt hound
#

submission order != execution order

warm tusk
#

not by constraint, but here, it literally is

cobalt hound
#

whats the name of these red blobs?

warm tusk
#

"earth", but that's because that's the name of the renderpass they're processing.

but yo. I can hover over them, and see the corresponding CPU frame they were submitted on.

cobalt hound
#

wheres the corresponding tracy scope for those?

warm tusk
#

like on the CPU? it's submitted with the submit. what?

cobalt hound
#

what is responsible for them showing up?

#

is it some tracy option, or something you do in code?

warm tusk
#

yes, it flags renderpasses in the command buffer that get submitted during record

cobalt hound
#

nvm

warm tusk
#

also worth noting that that isn't the most up to date profiling screenshot (it's just the first one I came across when looking for a link that showed the long,short,long,short behavior). the one linked includes the fenceImageWait step, which is commented out in my latest loop. (it also might have been taken before I integrated the gpu sync extension which lets tracy ~precisely place the GPU and CPU timelines relative to each other- though the difference between before/after isn't significant)

the image here #1403846020504358932 message is more up to date

cobalt hound
#

how does the tracy capture look with HARD_WAIT?

#

as comparison that might help

warm tusk
#

what question are you trying to answer? I have already accounted for why every wait exists. do you not believe my analysis?

cobalt hound
warm tusk
#

I appreciate the help, but this post is growing to 300 posts long, and I have had back and forths on the topic with multiple people, and you are not showing a willingness to first understand existing findings

cobalt hound
#

what controls these?

cobalt hound
white echo
warm tusk
# cobalt hound the question i'm trying to answer is whether the information tracy gives you is ...

ok. thanks for the time you have put in, truly. but I don't think we're even close to on the same page, and respectfully, I can't spend any more time assisting this line of inquiry. the behavior tracy shows is extremely consistent with the synchronization data sent to vulkan (again, please just check this out, if you're interested in understanding that: #1403846020504358932 message).

a spec can explicitly not guarantee something like execution order, but a driver can absolutely implement the spec in a way that happens to often respect that order. that non-discrepancy is not enough for me to start questioning a widely used tool used in the most trivial way, especially when an out-of-order swapchain image distribution fully explains the issue.

if I'm overlooking something, and there's more to it that would justify this line of inquiry (like if you're the developer of tracy and are familiar with some shortcoming or something?), then ok. but you've gotta sell me more than just demanding various profiles.

warm tusk
white echo
#

or it's still in the signaled state

warm tusk
#

I do reset the fences before reuse?

white echo
#

ohhh I thought you pass it to acquire, missed the brackets lol

#

VkResult vkAcquireNextImageKHR(
VkDevice device,
VkSwapchainKHR swapchain,
uint64_t timeout,
VkSemaphore semaphore,
VkFence fence,
uint32_t* pImageIndex);

#

that's what I meant

#

not your image index fence stuff

warm tusk
#

OH got it. no I haven't tried that. but wait why would that help? what am I trying to hold off on doing CPU side until the image is acquired?

white echo
#

so just wait for FIF then reset then acquire with FIF then wait and reset again

white echo
#

if you get 0,1,2,0,1,2 then FIF still should work
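
A sketch of the call ordering being suggested, as I read it: wait and reset the FIF fence, acquire while passing that same fence, wait and reset it again, then submit signaling it as usual. To keep the example self-contained without the Vulkan headers, the calls are modeled as callbacks; `SyncApi` and `runFrame` are hypothetical names, and the comments note which real Vulkan call each callback stands in for.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-ins for the Vulkan calls named in the comments.
// Fences and semaphores are identified by their frame-in-flight index.
struct SyncApi {
    std::function<void(int fence)> waitFence;                    // vkWaitForFences
    std::function<void(int fence)> resetFence;                   // vkResetFences
    std::function<std::uint32_t(int semaphore, int fence)> acquire;  // vkAcquireNextImageKHR
    std::function<void(int waitSemaphore, std::uint32_t image, int fence)> submit;  // vkQueueSubmit
    std::function<void(std::uint32_t image)> present;            // vkQueuePresentKHR
};

// One frame of the loop with the extra CPU-side wait on the acquire fence.
std::uint32_t runFrame(const SyncApi& api, int curfif) {
    assert(curfif >= 0);
    api.waitFence(curfif);   // previous submit that used this fif slot is done
    api.resetFence(curfif);
    // Pass the semaphore (GPU-side wait) and the fence (CPU-side wait) together.
    const std::uint32_t image = api.acquire(curfif, curfif);
    api.waitFence(curfif);   // block the CPU until the image is actually available
    api.resetFence(curfif);  // must be unsignaled again before the submit reuses it
    api.submit(curfif, image, curfif);
    api.present(image);
    return image;
}
```

The cost is that the CPU now blocks until the presentation engine has really released an image, so it can no longer run ahead of it, which is the "less cpu time" tradeoff mentioned earlier in the thread.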

warm tusk
#

k gimme a sec to think through this

warm tusk
#

trying it out, seeing some very strange behavior. one is that I'm spending a majority of time on this new acquire fence wait, which does trigger a 0,1,2,0,1,2... swapchain image distribution, but every once in a while I get stuck on a frame for 10s

white echo
#

I think I can declare you as cursed lmao

warm tusk
#

ah wait the big 10s thing might be the tracy collection lag issue someone else was posting about earlier

white echo
#

lmao

#

well try it without tracy and a standard fps monitor and graph it out

#

should not go 30 fps sometimes

#

if everything looks fine then we should prob move on from this cursed thread lol

warm tusk
#

ok wow yep, looks great. total bummer that I have to do that, and also that for some reason that kills tracy performance. but yeah is this a step that should be added to the "standard frame sync logic" to encourage cyclic acquire distribution?

#

and also ftr, I don't necessarily need to reuse the fif fence, right? I don't even need multiple fences for this. I could just have one "acquire fence" that I reuse every frame?

white echo
#

I mean needing to care about when you're drawing nothing is rather niche, judging by how no one knows how to solve this lmao

warm tusk
#

(because it's always added, waited on, reset in the full frame)

white echo
warm tusk
#

ok sure. I was just making sure I was understanding the usage pattern here 👍

white echo
#

lol 2 textured spheres and some 2D sprites is fairly close to nothing for the gpu

warm tusk
#

sure, but a lot of people make games rendering even less

#

(my point is it shouldn't be that niche)

white echo
#

well you can blog about it

warm tusk
#

LOL you mean write a blog about this thread?

white echo
#

yes

warm tusk
#

ugh more work...