Hacker News

- LLaVA is not the SOTA open VLM; InternVL-1.5 is: https://huggingface.co/spaces/opencompass/open_vlm_leaderboa...

You need to compare the evals against strong open VLMs, including InternVL-1.5 and CogVLM.

- This is not the "first-ever multimodal model built on top of Llama3"; there's already a LLaVA built on Llama3-8B: https://huggingface.co/lmms-lab



As with InternVL, the lack of llama.cpp support severely limits its applications. Close-to-GPT-4V performance, runnable locally on any machine (no GPU needed), would be huge for the accessibility community.


Very curious how it performs on OCR tasks compared to InternVL. To be competitive at reading text you need tiling support, and InternVL does tiles exceptionally well.
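To make the tiling point concrete: instead of downscaling the whole page to the encoder's input size (which destroys small text), the image is cut into encoder-resolution crops that are each encoded at native resolution. A minimal sketch of the idea; the 448-px tile size and the overlap are illustrative assumptions, not InternVL's actual dynamic-resolution scheme:

```python
def tile_boxes(width, height, tile=448, overlap=32):
    """Compute overlapping crop boxes (left, top, right, bottom) that
    cover an image, so each crop can be fed to the vision encoder at
    native resolution instead of downscaling the whole page.

    tile/overlap values are illustrative, not any specific model's."""
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes

# A 1000x800 scan yields a 3x2 grid of crops:
boxes = tile_boxes(1000, 800)
```

Models that do this well typically also encode a downscaled overview of the full image alongside the crops, so global layout isn't lost.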


I think CogVLM2 is even better than InternVL at OCR (my use case is extracting information from invoices).


After some superficial testing with bad-quality scans you can find on Kaggle, I can't confirm that. CogVLM2 refuses to handle scans that InternVL-1.5 can still comprehend.


Thank you for the link! Our initial testing suggests MiniCPM outperforms InternVL for GUI understanding: https://github.com/OpenAdaptAI/OpenAdapt/issues/637#issuecom...

(InternVL appears to hallucinate more.)


I'm going to be saying "First Ever AI [something]" for the next 15 years for clout and capital; I'm not going to listen to anybody's complicated ten-step funnel if they're not doing the obvious.



