#Improve performance in a conversion BGRA -> RGB operation


young kayak
#

I'm capturing the screen using the Windows DXGI Desktop Duplication API through a library; pic_data contains the raw pixel data of a frame as a vector of u8 in BGRA format. I realized that my screen capture was slow because of the time it takes to convert the pixel data to the correct color type. How could I improve the following code to reduce the time the conversion takes?

Time taken for RGB conversion: 337.2225ms

let rgb_data: Vec<u8> = pic_data
    .as_bgra()
    .to_owned()
    .iter()
    .copied()
    .flat_map(|pixel| [pixel.r, pixel.g, pixel.b])
    .collect();

Time taken for RGB conversion: 113.4304ms

for rgba in pic_data.chunks(4) {
    rgb_data.push(rgba[2]); // Red
    rgb_data.push(rgba[1]); // Green
    rgb_data.push(rgba[0]); // Blue
}
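
A minimal sketch of the same loop with the output vector sized up front, assuming pic_data is a tightly packed &[u8] of BGRA pixels. The name bgra_to_rgb_simple is hypothetical, and whether this beats the loop above needs to be measured in a release build.

```rust
/// Convert tightly packed BGRA bytes to RGB, dropping the alpha channel.
/// (Hypothetical helper name, used only for illustration.)
pub fn bgra_to_rgb_simple(bgra: &[u8]) -> Vec<u8> {
    // Reserve the exact output size so the vector never reallocates
    // while it is being filled: 3 output bytes per 4 input bytes.
    let mut rgb = Vec::with_capacity(bgra.len() / 4 * 3);
    // `chunks_exact(4)` yields only complete 4-byte pixels; a trailing
    // partial pixel is ignored instead of causing an out-of-bounds panic.
    for px in bgra.chunks_exact(4) {
        rgb.extend_from_slice(&[px[2], px[1], px[0]]); // R, G, B
    }
    rgb
}
```

Unlike chunks(4), chunks_exact(4) never hands the loop body a short slice, so the indexing cannot panic on input whose length is not a multiple of four.
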
young kayak
#

I'm not sure if this post is applicable:
https://users.rust-lang.org/t/converting-a-bgra-u8-to-rgb-u8-n-for-images/67938/13

Apparently the swizzle method performed fastest there: bgra×4_to_rgb×4 (by ref)/swizzle (-39.6%)

manic quail
#

does it matter what the alpha value is?

#

chopping off the A could result in a decent perf decrease

#

since you either need to realloc or remove a bunch of stuff from the vec of data

#

going from bgra to rgba is just one swap

#

it seems to take about 7 times longer on my machine to allocate rgb data to a new vec, compared to just swapping each R and B value

#

oh wow don't even ask about removing individual elements
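
The in-place swap described above can be sketched as follows, assuming the frame is a mutable slice of tightly packed BGRA bytes. The function name is hypothetical. It only produces RGBA, so it helps only if the consumer accepts a 4-byte pixel format.

```rust
/// Swap the B and R bytes of every 4-byte pixel in place (BGRA -> RGBA).
/// No allocation is needed because the buffer keeps its size.
pub fn bgra_to_rgba_in_place(pixels: &mut [u8]) {
    for px in pixels.chunks_exact_mut(4) {
        px.swap(0, 2); // exchange blue and red; green and alpha stay put
    }
}
```
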

young kayak
#

I would like to avoid complexity and additional libraries as much as possible.

modest ivy
#

why do you have to_owned, then iter then copied

#
let rgb_data: Vec<u8> = pic_data
    .as_bgra()
    .iter()
    .copied()
    .flat_map(|pixel| [pixel.r, pixel.g, pixel.b])
    .collect();
manic quail
#

yeah that’s probably close to the fastest you could get(?)

#

i misread and thought pic data was already a collection of u8

#

if pic_data internally holds just a Vec of u8, you ideally interact with that since you want a u8 out of it

stable veldt
#

where is as_bgra from?

#

like, what is the type of pic_data

modest ivy
#

i'm pretty sure that's from rgb crate, because i suggested that crate to them before

#

of which i have already suggested a solution here #rust-discussions-1 message

#

hmm why did i do to_owned().iter().copied lol

#

i just copy pasted the code lol

#

hmm why did you do flat_map instead of map(RGBA::from) as i suggested?

young kayak
#

But, I'll take a look at the code again when I get back home.

stable veldt
#
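use core::ptr;

// Reorder four BGRA pixels (16 bytes) into 12 RGB bytes followed by the
// four alpha bytes. The indices are constants, so the compiler can turn
// this into a single byte shuffle.
fn shuffle_bgra4_to_rgb4_a4(v: [u8; 16]) -> [u8; 16] {
    [
        v[2], v[1], v[0],
        v[6], v[5], v[4],
        v[10], v[9], v[8],
        v[14], v[13], v[12],
        v[3], v[7], v[11], v[15]
    ]
}

pub fn bgra_to_rgb(bgra: &[u8]) -> Vec<u8> {
    // 3 output bytes per pixel, plus 16 bytes of slack because every store
    // writes a full 16-byte block even though only 12 bytes are kept.
    let mut out: Vec<u8> = Vec::with_capacity(bgra.len() * 3 / 4 + 16);

    unsafe {
        let mut i = 0;
        let mut out_len = 0;
        let out_ptr = out.as_mut_ptr();
        while i + 16 <= bgra.len() {
            let bgra4 = *<&[u8; 16]>::try_from(&bgra[i..i+16]).unwrap();
            let rgb4_a4 = shuffle_bgra4_to_rgb4_a4(bgra4);
            // The 4 trailing alpha bytes are overwritten by the next store.
            ptr::copy_nonoverlapping(rgb4_a4.as_ptr(), out_ptr.add(out_len), 16);
            out_len += 12;
            i += 16;
        }

        // Tail: fewer than four pixels remain; pad them into a zeroed block.
        let mut bgra4 = [0u8; 16];
        ptr::copy_nonoverlapping(bgra.as_ptr().add(i), bgra4.as_mut_ptr(), bgra.len() - i);
        let rgb4_a4 = shuffle_bgra4_to_rgb4_a4(bgra4);
        ptr::copy_nonoverlapping(rgb4_a4.as_ptr(), out_ptr.add(out_len), 16);
        // Each remaining pixel contributes 3 output bytes (not 1).
        out_len += (bgra.len() - i) / 4 * 3;

        out.set_len(out_len);
    }

    out
}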
#

this should be fast on x86-64

modest ivy
#

okay damn

stable veldt
#

although it needs SSSE3 (for the byte shuffle instruction)
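
For comparison, a safe sketch of the same 16-byte block idea, which can also serve as a reference when testing the unsafe version. The name bgra_to_rgb_blocks is hypothetical, and whether the compiler vectorizes this as well as the pointer-based code is an assumption that needs a release-mode benchmark.

```rust
/// Safe variant: process four BGRA pixels (16 bytes) per iteration,
/// then handle the remaining 0 to 3 pixels one at a time.
pub fn bgra_to_rgb_blocks(bgra: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(bgra.len() / 4 * 3);
    let mut blocks = bgra.chunks_exact(16);
    for b in &mut blocks {
        // Constant indices into a fixed-size block give the optimizer a
        // chance to emit a byte shuffle; the alpha bytes are simply skipped.
        let rgb: [u8; 12] = [
            b[2], b[1], b[0],
            b[6], b[5], b[4],
            b[10], b[9], b[8],
            b[14], b[13], b[12],
        ];
        out.extend_from_slice(&rgb);
    }
    // Bytes that did not fill a whole 16-byte block.
    for px in blocks.remainder().chunks_exact(4) {
        out.extend_from_slice(&[px[2], px[1], px[0]]);
    }
    out
}
```
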

stable veldt
#

yes

young kayak
#

Nice, I'll have to have a look and benchmark it.

stable veldt
#

it could be a lot better but I ran into issues with the Rust compiler unrolling the loop

young kayak
#

down from 113.4304ms

stable veldt
#

@young kayak could you try this version?

#

I can't get it to compile properly without the black_box which hurts performance sadly

#

oh I have an idea perhaps

#

yay it worked

#

try this @young kayak

#

nvm that's invalid let me fix bug

young kayak
#

I think copy_bgra4_to_rgb4_a4() is not called

stable veldt
#

it's not it's old

#

this should be right

#

@young kayak please ignore the earlier two versions, try the latest one

young kayak
#

Time taken for RGB conversion: 18.0167ms

#

it's improved

stable veldt
#

that's the latest one?

young kayak
#

yep

stable veldt
#

what compiler flags do you use to compile?

young kayak
#

npm run tauri dev

stable veldt
#

no idea what that is

#

...are you targeting wasm?

young kayak
#

I'm using Tauri framework

stable veldt
#

@young kayak could you do tauri dev --verbose and see what commands it runs

young kayak
#
npm verb cli C:\Program Files\nodejs\node.exe C:\Users\user\AppData\Roaming\npm\node_modules\npm\bin\npm-cli.js
npm info using npm@9.6.7
npm info using node@v18.16.0
npm verb title npm run tauri dev
npm verb argv "run" "tauri" "dev" "--loglevel" "verbose"
npm verb logfile logs-max:10 dir:C:\Users\user\AppData\Local\npm-cache\_logs\2023-06-21T14_27_15_522Z-
npm verb logfile C:\Users\user\AppData\Local\npm-cache\_logs\2023-06-21T14_27_15_522Z-debug-0.log
stable veldt
#

it should run cargo

#

@young kayak ah you're on windows

#

could you try

#
set RUSTFLAGS=-C target-cpu=native
#

then make sure tauri re-builds

#

maybe you need a clean command or something

#

@young kayak wait have you not been using tauri dev --release so far?

young kayak
#

no i have not

stable veldt
#

well that's going to make a huge difference then

#

try first tauri dev --release without the RUSTFLAGS

#

and then again with

#

(to reset the rustflags just do set RUSTFLAGS=)

young kayak
#

tauri build ?

stable veldt
#

idk how you can get tauri to clean in between, perhaps you have to just delete the target folder

#

npm run tauri dev --release

#

looking at performance not in release mode is always folly
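
One way to guard against timing a debug build by accident is to have the timing code report the profile it was compiled under. A minimal sketch, assuming Cargo's default profiles (debug_assertions on for dev, off for release); build_profile and timed are hypothetical helper names.

```rust
use std::time::Instant;

/// "debug" when compiled with debug assertions (the default for the dev
/// profile), "release" otherwise. Assumes the default Cargo profiles.
pub fn build_profile() -> &'static str {
    if cfg!(debug_assertions) { "debug" } else { "release" }
}

/// Run `f`, print how long it took together with the build profile,
/// and return whatever `f` returned.
pub fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{} [{} build]: {:?}", label, build_profile(), start.elapsed());
    out
}
```

Wrapping the conversion call as timed("RGB conversion", || bgra_to_rgb(&pic_data)) would then make it obvious in the log whether a given number came from an optimized build.
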

young kayak
#

ok I deleted the target folder

#

ok building is going to take a while

#

I've reset the flags, deleted the target folder, and am now running npm run tauri dev --release, which is rebuilding.

#

Time taken for RGB conversion: 18.1744ms it's about the same.

#

I'll try with npm run tauri build

#

Time taken for RGB conversion: 6.002ms
with npm run tauri build

stable veldt
#

target-cpu=native sets all the flags the cpu on the host supports

young kayak
#

Time taken for RGB conversion: 3.8135ms using set RUSTFLAGS=-C target-cpu=native

young kayak
#
fn encode_frames(frames: Vec<Vec<u8>>, width: u32, height: u32) -> Result<()> {
    let mut file = File::create("../vid/captured_video.h264").unwrap();
    let mut encoder = Encoder::builder()
        .fps(60, 1)
        .build(Colorspace::RGB, width as _, height as _)
        .unwrap();
    {
        let headers = encoder.headers().unwrap();
        file.write_all(headers.entirety()).unwrap();
    }
    for (index, pic_data) in frames.into_iter().enumerate() {
        let image = Image::rgb(width as _, height as _, &pic_data);
        let (data, _) = encoder.encode((60 * index) as _, image).unwrap();
        file.write_all(data.entirety()).unwrap();
    }
    {
        let mut flush = encoder.flush();
        while let Some(result) = flush.next() {
            let (data, _) = result.unwrap();
            file.write_all(data.entirety()).unwrap();
        }
    }
    Ok(())
}

async fn capture_frames(dupl: &mut DesktopDuplicationApi, width: u32, height: u32) -> Result<Vec<Vec<u8>>> {
    let (device, ctx) = dupl.get_device_and_ctx();
    let mut texture_reader = TextureReader::new(device, ctx);
    let mut frames: Vec<Vec<u8>> = vec![];
    let start_time = Instant::now();

    while start_time.elapsed() < Duration::from_secs(10) {
        let tex = dupl.acquire_next_vsync_frame().await;
        if let Ok(tex) = tex {
            let mut pic_data: Vec<u8> = vec![0; (width * height * 4) as usize];
            texture_reader.get_data(&mut pic_data, &tex).unwrap();
            // use bgra_to_rgb() to efficiently convert BGRA to RGB
            frames.push(bgra_to_rgb(&pic_data));
        }
    }
    Ok(frames)
}
#
match capture_frames(&mut dupl, width, height).await {
    Ok(frames) => {
        match encode_frames(frames, width, height) {
            Ok(()) => println!("Video capture and encoding successful."),
            Err(e) => println!("Error during encoding: {:?}", e),
        }
    },
    Err(e) => println!("Error during frame capture: {:?}", e),
}

Is it necessary to return frames? Is there a way to return the output of the conversion bgra_to_rgb(&pic_data) without having to create a new vector?

manic quail
#

why is it a vec of vec?

#

@young kayak

#

ohh wait nvm i see

#

creating a new vec is going to be the fastest

young kayak
#

the encode part is from the x264 rust library

manic quail
#

i mean again ideally you don't have to get rid of the A value

#

since that's just a mem::swap() for every four elements

#

will easily be the fastest

#

but no, in this case, it's fastest just to create a new Vec

young kayak
#

are you referring to frames 2d vec ?

manic quail
#

hm?

young kayak
#

mem::swap() for every four elements what do you mean by this ?

manic quail
#

im just talking about the bgra to rgba conversion

#

you wouldn't need to create a new vec if you didn't have to get rid of the A value

young kayak
#

ah, right. The instruction provided by @stable veldt was the fastest.

manic quail
#

yes

#

for bgra to rgb that's basically the fastest you can get

young kayak
#

yep, I wonder if there's a way to capture the correct format rather than BGRA...

manic quail
#

yeahhh

young kayak
#

I'm using this library

#

that captures as BGRA

#

wouldn't the encoder want YUV instead of RGB? I see there is YUV420 there.

young kayak
#
let supported_formats = [DXGI_FORMAT_B8G8R8A8_UNORM, DXGI_FORMAT_R10G10B10A2_UNORM, DXGI_FORMAT_R16G16B16A16_FLOAT];
let device: IDXGIDevice4 = dev.cast().unwrap();
let dupl: WinResult<IDXGIOutputDuplication> = unsafe { output.as_raw_ref().DuplicateOutput1(&device, 0, &supported_formats) };

I think this tells it to select BGRA.

#

would it be faster to convert from RGBA -> RGB?

#

or capture in RGBA format and have the encoder ignore the last "alpha" value, so that we don't need to convert at all?

young kayak
#

I wouldn't mind waiting a bit longer in the encoding process as long as the screen capturing isn't dropping frames.

manic quail
#

but i don't think that's the main bottleneck

young kayak
#

Hmm, I'm pretty sure i measured the conversion time and in release build it's like 3ms for 1 frame.

#

I'm not sure if that's fine or bad

manic quail
#

that's less than 1/5th of one frame

#

at 60fps

#

basically no noticeable stutter

young kayak
#

What about higher framerate? Like 100 or 120?
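
The arithmetic behind that estimate can be sketched as below, taking the roughly 3 ms conversion time quoted above as the input. The function names are hypothetical. A frame lasts 1000 / fps milliseconds, so 3 ms is about 18% of the budget at 60 FPS, 30% at 100 FPS, and 36% at 120 FPS.

```rust
/// Time available per frame, in milliseconds, at a given frame rate.
pub fn frame_budget_ms(fps: f64) -> f64 {
    1000.0 / fps
}

/// Fraction of one frame's time consumed by a task taking `task_ms`.
pub fn budget_fraction(task_ms: f64, fps: f64) -> f64 {
    task_ms / frame_budget_ms(fps)
}
```
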

young kayak
#

I've got great news, I'm able to avoid color type conversion entirely

#

I can feed the encoder (x264) BGRA color format without converting.

young kayak
#

Now, despite having while start_time.elapsed() < Duration::from_secs(10), my recording runs longer than 10 seconds.

young kayak
#

Thanks everyone for your help in this matter.
I've now run into another problem:
#1121586405068525630

young kayak
#
Instantaneous frame rate: 166.76952 FPS
Instantaneous frame rate: 162.62807 FPS
Instantaneous frame rate: 170.23015 FPS
Instantaneous frame rate: 161.93546 FPS
Instantaneous frame rate: 170.75044 FPS
Instantaneous frame rate: 165.93379 FPS
Instantaneous frame rate: 348.09247 FPS
Instantaneous frame rate: 196.537 FPS
Instantaneous frame rate: 168.1916 FPS
Instantaneous frame rate: 165.23463 FPS

let start_time = Instant::now();
while start_time.elapsed() < Duration::from_secs(10) {
    let frame_start_time = Instant::now();

    // DXGI AcquireNextFrame()
    let tex = dupl.acquire_next_vsync_frame().await;

    if let Ok(tex) = tex {
        let mut pic_data: Vec<u8> = vec![0; (width * height * 4) as usize];
        texture_reader.get_data(&mut pic_data, &tex).unwrap();
        frames.push(pic_data);
    }

    // Calculate frame rate for this frame and write to file
    let elapsed_secs = frame_start_time.elapsed().as_secs_f32();
    let frame_rate = 1.0 / elapsed_secs;
    writeln!(file, "Instantaneous frame rate: {} FPS", frame_rate);
}

I don't know why it's capturing more frames than the display's refresh rate, which is 165 Hz.
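
One way to cross-check those readings, offered as a sketch and not as a diagnosis: each logged value is 1 / (duration of one loop iteration), so a single short iteration of about 2.9 ms reads as roughly 348 FPS. The loop above also logs a rate on every iteration, including ones where acquire_next_vsync_frame returned an error and no frame was stored. Comparing against the average over the whole window, counting only stored frames, may be more telling. The function names below are hypothetical.

```rust
use std::time::Duration;

/// Average frame rate over a whole capture window: stored frames
/// divided by the total elapsed time.
pub fn average_fps(frames: usize, elapsed: Duration) -> f64 {
    frames as f64 / elapsed.as_secs_f64()
}

/// Rate implied by a single loop iteration, which is what the
/// per-iteration log lines report.
pub fn instantaneous_fps(iteration: Duration) -> f64 {
    1.0 / iteration.as_secs_f64()
}
```

For example, average_fps(frames.len(), start_time.elapsed()) evaluated once after the loop gives a single number to compare with the 165 Hz refresh rate.
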