#frame by frame analysis of footage of someone typing, need advice


dim coyote
#

Might be doing frame-by-frame analysis of what someone typed into a laptop, based on video footage. The footage is decent: we can clearly see his fingers and hands and where they are placed on the keyboard. We know the model of the laptop and its keyboard layout, and we were able to superimpose the keys over the video.

Question is now, what's the most effective way to do this sort of analysis? Has anyone done this before, or does anyone know someone who might be able to advise?

fringe relic
#

Hi! Could you DM me so I can ask you some details about who you are and why you're trying to do this? We just want to make sure that you're not a stalker or something

Everyone else, please refrain from answering in this thread until I've posted a message saying it's OK to reply

torpid raft
#

If @fringe relic has given the ok, I can help.

fringe relic
#

Hi, this is fine, please go ahead

abstract totem
#

If it's an extensive amount of surveillance footage, the best thing would likely be to map the hands into motion capture software and then animate the typing.
After that you could either set it up so it actually registers the key presses, or you could manually move the camera so you can just record the key presses by hand (depending on how much footage there is).

I honestly don't have any idea how to do it in practice, but there is likely software good enough to transfer the position of the hands to the movement model without too much manual work needed.
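To make the "register the key presses" step a bit more concrete, here is a minimal sketch of one way it could work once you have a per-frame vertical position for a fingertip. It assumes the camera angle lets you see the finger dip toward the keys, that y grows downward as in image coordinates, and that `min_depth` (a made-up threshold) is tuned by eye on real footage. It is an illustration of the idea, not something anyone in this thread has tested.

```python
from statistics import median


def detect_presses(y_positions, min_depth=0.02):
    """Return frame indices where a fingertip appears to bottom out on a key.

    y_positions: per-frame vertical fingertip position in image coordinates
    (larger y means lower in the frame).
    min_depth: hypothetical threshold, how far below the resting height the
    fingertip must travel before the dip counts as a press.
    """
    if len(y_positions) < 3:
        return []
    baseline = median(y_positions)  # rough estimate of the resting height
    presses = []
    for i in range(1, len(y_positions) - 1):
        prev_y, y, next_y = y_positions[i - 1], y_positions[i], y_positions[i + 1]
        # A press is the lowest point of a dip: strictly lower than the frame
        # before it and at least as low as the frame after it.
        is_lowest_point = y > prev_y and y >= next_y
        if is_lowest_point and (y - baseline) >= min_depth:
            presses.append(i)
    return presses


if __name__ == "__main__":
    trace = [0.50, 0.50, 0.51, 0.55, 0.51, 0.50, 0.50, 0.56, 0.50, 0.50]
    print(detect_presses(trace))
```

Every frame index this returns is only a candidate. You would still want to look at those frames by eye, since hovering, resting fingers, and camera shake can all produce similar dips.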

dim coyote
#

I asked seyebermancer for a copy of the code where he implemented the above pipeline, which he responded to with this:

sure-

here is the documentation- you can feed that into GPT-4 and it should get you going pretty quick if you don't feel like reading through it all.
https://developers.google.com/mediapipe/solutions/vision/hand_landmarker/python
https://colab.research.google.com/github/googlesamples/mediapipe/blob/main/examples/hand_landmarker/python/hand_landmarker.ipynb

this is overkill but this is my visual command center with facial recognition, hand gesture, obj detection and ocr for live stream.. i removed the joint mapping overlay on that to not bog down the live stream but it may give you an idea on implementation. the OCR could def be improved upon with some image super enhancing models but this was more of a proof of concept build out.
think this has like 7 recognizable gestures but if you differentiate between L/R and account for sequences you would have more than enough possibilities. do a fist to activate your whisper transcription model feed to transcribe your speech into a shellgpt to do voice to command line code executions and then an open palm to trigger a system wipe code script and you're good to go lol.
https://github.com/LJPearson176/SEAL-See-All

for some of your personal projects you might get some value out of this NLP GUI i put together with PySide-
i haven't implemented all of the best algorithms yet but the color coded POS tagging and POS chunking work pretty well for some exploratory data analysis.
https://github.com/LJPearson176/Book-Worm-NLP

wonder if there would be any value using tkinter and the specs of the laptop to overlay some keyboard grid lines on top of the video? that combined with the hand landmarks might help increase your comfort/confidence in the manual analysis.
Credit: seyebermancer
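The keyboard grid idea at the end of that message can be sketched without any GUI at all: describe the keys as rectangles, then look up which rectangle a fingertip falls in. Everything below is an assumption for illustration: the layout is just the three letter rows of a generic US QWERTY board with guessed row offsets, `origin_px` and `key_size_px` are hypothetical calibration values you would read off your own overlay, and the conversion ignores perspective, so it only holds if the camera looks at the keyboard more or less straight on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    label: str
    x: float  # left edge, measured in key widths
    y: float  # top edge, measured in rows
    w: float = 1.0
    h: float = 1.0


def build_layout():
    """Letter rows of a generic US QWERTY layout (row offsets are assumed)."""
    rows = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
    keys = []
    for row_index, (letters, offset) in enumerate(rows):
        for col, letter in enumerate(letters):
            keys.append(Key(letter, offset + col, float(row_index)))
    return keys


def to_key_units(px, py, origin_px, key_size_px):
    """Convert a pixel position to key units, assuming no perspective distortion.

    origin_px is the pixel position of the top-left corner of the 'q' key and
    key_size_px is the on-screen width of one key; both are hypothetical
    calibration values taken from the overlay.
    """
    return (px - origin_px[0]) / key_size_px, (py - origin_px[1]) / key_size_px


def key_at(keys, x, y):
    """Return the label of the key containing (x, y), or None if off the grid."""
    for key in keys:
        if key.x <= x < key.x + key.w and key.y <= y < key.y + key.h:
            return key.label
    return None


if __name__ == "__main__":
    layout = build_layout()
    x, y = to_key_units(150, 230, origin_px=(100, 200), key_size_px=20)
    print(key_at(layout, x, y))
```

For a real laptop you would replace `build_layout` with the measured geometry of that model's keyboard, and for angled footage the pixel-to-key conversion would need a proper perspective correction instead of the simple scale and offset used here.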

#

Lots of really interesting feedback here. I wish I had the time to turn this into a real, practical proof of concept, but sadly I do not. If anyone is planning on trying out any of these ideas, please reach out to me so I can stay in touch with you and learn from your work!

#

This is definitely going on my "hobby project" list though. I really want to work on this (just for fun).

dim coyote
#

Okay I can't resist, this can't be that hard to script up in python.
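A possible starting point for such a script, offered as a rough sketch and not as anyone's actual code from this thread. It uses the MediaPipe Hand Landmarker from the documentation linked earlier to pull fingertip positions out of each video frame. Assumptions: the `mediapipe` and `opencv-python` packages are installed, the `hand_landmarker.task` model file has been downloaded as described in those docs, and the function names (`fingertips`, `track_fingertips`) are invented here for illustration.

```python
# Landmark indices of the five fingertips in MediaPipe's 21-point hand model.
FINGERTIP_INDEX = {"thumb": 4, "index": 8, "middle": 12, "ring": 16, "pinky": 20}


def fingertips(hand_landmarks):
    """Pick the fingertip (x, y) pairs out of one hand's 21 landmarks."""
    return {
        name: (hand_landmarks[i].x, hand_landmarks[i].y)
        for name, i in FINGERTIP_INDEX.items()
    }


def track_fingertips(video_path, model_path="hand_landmarker.task"):
    """Yield (frame_index, handedness_label, fingertips) for each detected hand."""
    # Imported lazily so the helper above can be used without these packages.
    import cv2
    import mediapipe as mp
    from mediapipe.tasks import python as mp_python
    from mediapipe.tasks.python import vision

    options = vision.HandLandmarkerOptions(
        base_options=mp_python.BaseOptions(model_asset_path=model_path),
        running_mode=vision.RunningMode.VIDEO,
        num_hands=2,
    )
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    try:
        with vision.HandLandmarker.create_from_options(options) as landmarker:
            frame_index = 0
            while True:
                ok, frame_bgr = capture.read()
                if not ok:
                    break
                # OpenCV decodes frames as BGR; MediaPipe expects RGB.
                frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
                image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
                # Video mode needs increasing timestamps in milliseconds.
                timestamp_ms = int(frame_index * 1000 / fps)
                result = landmarker.detect_for_video(image, timestamp_ms)
                for hand, handedness in zip(result.hand_landmarks, result.handedness):
                    yield frame_index, handedness[0].category_name, fingertips(hand)
                frame_index += 1
    finally:
        capture.release()
```

The coordinates come back normalized to the 0..1 range of the frame, so they would need scaling by the frame width and height before being compared against a pixel-based keyboard overlay.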