This is great but does anyone know of essentially the same thing BUT it can actually get content visually? ie., it is great to have a transcript but unless the models can 'see' the video they will miss valuable info like if the in-video captions show a github repo link or something
There are many models that extracts context from real time video. But they often just rely on identifying simpel object and tasks. Not longer context spanning a video.