{"id":230357,"date":"2024-06-10T09:30:48","date_gmt":"2024-06-10T09:30:48","guid":{"rendered":"https:\/\/michigandigitalnews.com\/index.php\/2024\/06\/10\/dragonfly-enhanced-vision-language-model-with-multi-resolution-zoom-launched-by-together-ai\/"},"modified":"2025-06-25T17:17:28","modified_gmt":"2025-06-25T17:17:28","slug":"dragonfly-enhanced-vision-language-model-with-multi-resolution-zoom-launched-by-together-ai","status":"publish","type":"post","link":"https:\/\/michigandigitalnews.com\/index.php\/2024\/06\/10\/dragonfly-enhanced-vision-language-model-with-multi-resolution-zoom-launched-by-together-ai\/","title":{"rendered":"Dragonfly: Enhanced Vision-Language Model with Multi-Resolution Zoom Launched by Together.ai"},"content":{"rendered":"<p> [ad_1]<br \/>\n<\/p>\n<div>\n<figure class=\"figure mt-2\">&#13;<br \/>\n                        <a href=\"https:\/\/image.blockchain.news:443\/features\/D8E08E86F8EDBDDCD68414CF49BDD8B1401B11A69515DFF98E6B2B03EE9CF9D7.jpg\" data-glightbox=\"\" data-gallery=\"image-popup\">&#13;<br \/>\n                            <img decoding=\"async\" class=\"rounded\" src=\"https:\/\/image.blockchain.news:443\/features\/D8E08E86F8EDBDDCD68414CF49BDD8B1401B11A69515DFF98E6B2B03EE9CF9D7.jpg\" alt=\"Dragonfly: Enhanced Vision-Language Model with Multi-Resolution Zoom Launched by Together.ai\"\/>&#13;<br \/>\n&#13;<br \/>\n                        <\/a>&#13;<br \/>\n                    <\/figure>\n<p>Together.ai has announced the launch of Dragonfly, an innovative vision-language model designed to enhance fine-grained visual understanding and reasoning about image regions. The architecture leverages multi-resolution zoom-and-select capabilities to optimize multi-modal reasoning while maintaining context efficiency, according to <a rel=\"nofollow\" href=\"https:\/\/www.together.ai\/blog\/dragonfly-v1\">Together AI<\/a>.<\/p>\n<h2>Dragonfly Model Architecture<\/h2>\n<p>Dragonfly employs two primary strategies: multi-resolution visual encoding and zoom-in patch selection. These techniques enable the model to focus on fine-grained details of image regions, enhancing its commonsense reasoning capabilities. The architecture processes images at multiple resolutions\u2014low, medium, and high\u2014dividing each image into sub-images that are encoded into visual tokens. These tokens are then projected into a language space, forming a concatenated sequence that feeds into the language model.<\/p>\n<p><strong>Zoom-in Patch Selection:<\/strong> Dragonfly employs a selective approach for high-resolution images, identifying and retaining only the sub-images that provide the most significant visual information. This targeted selection reduces redundancy and improves the overall model efficiency.<\/p>\n<h2>Performance and Evaluation<\/h2>\n<p>Dragonfly demonstrates promising performance on several vision-language benchmarks, including commonsense visual question answering and image captioning. 
## Performance and Evaluation

Dragonfly demonstrates promising performance on several vision-language benchmarks, including commonsense visual question answering and image captioning. The model achieved competitive results on AI2D, ScienceQA, MMMU, MMVet, and POPE, showcasing its effectiveness at fine-grained understanding of image regions.

**Benchmark Performance:**

| Model | AI2D | ScienceQA | MMMU | MMVet | POPE |
| --- | --- | --- | --- | --- | --- |
| VILA | – | 68.2 | – | 34.9 | 85.5 |
| LLaVA-v1.5 (Vicuna-7B) | 54.8 | 70.4 | 35.3 | 30.5 | 85.9 |
| LLaVA-v1.6 (Mistral-7B) | 60.8 | 72.8 | 33.4 | 44.8 | 86.7 |
| QWEN-VL-chat | 52.3 | 68.2 | 35.9 | – | – |
| Dragonfly (LLaMA-8B) | 63.6 | 80.5 | 37.8 | 35.9 | 91.2 |

## Dragonfly-Med

In collaboration with Stanford Medicine, Together.ai has also introduced Dragonfly-Med, a version fine-tuned on 1.4 million biomedical image-instruction pairs. This model excels at high-resolution medical imaging tasks, outperforming previous models such as Med-Gemini on multiple medical imaging benchmarks.

## Evaluation on Medical Benchmarks

Dragonfly-Med was evaluated on visual question answering and clinical report generation, achieving state-of-the-art results on several benchmarks:

| Dataset | Metric | Med-Gemini | Dragonfly-Med (LLaMA-8B) |
| --- | --- | --- | --- |
| VQA-RAD | Acc (closed) | 69.7 | 77.4 |
| SLAKE | Acc (closed) | 84.8 | 90.4 |
| Path-VQA | Acc (closed) | 83.3 | 92.3 |
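The "Acc (closed)" metric in the table above is exact-match accuracy over closed-ended questions (e.g., yes/no answers in VQA-RAD or SLAKE). Below is a minimal sketch of how such a score is computed; the normalization step is a simplifying assumption, not the benchmarks' official scoring code.

```python
# Sketch of the "Acc (closed)" metric reported above: exact-match
# accuracy over closed-ended questions (e.g. yes/no in VQA-RAD, SLAKE,
# Path-VQA). The normalization is a simplifying assumption, not the
# benchmarks' official scoring scripts.

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and a trailing period."""
    return answer.strip().lower().rstrip(".")


def closed_accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of closed-ended questions answered exactly right."""
    assert len(predictions) == len(references) and references
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Dragonfly-Med's 77.4 on VQA-RAD means ~77% of closed questions matched.
print(closed_accuracy(["Yes", "no", "yes"], ["yes", "no", "No"]))  # ≈ 66.7
```

Closed-ended subsets are reported separately because open-ended answers and clinical report generation require softer, overlap-based metrics rather than exact match.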
## Conclusion and Future Work

Dragonfly's architecture offers a new research direction: zooming in on image regions to capture more fine-grained visual information. Together.ai plans to continue improving the model's capabilities and to explore new architectures and visual encoding strategies that can benefit broader scientific fields.

The collaboration with Stanford Medicine and the use of resources such as Meta's LLaMA3 and OpenAI's CLIP have been crucial in developing Dragonfly. The model's codebase also builds on the foundations of Otter and LLaVA-UHD.

*Image source: Shutterstock*

[Source link](https://blockchain.news/news/dragonfly-vision-language-model-launch)