InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
Overview of our InternSVG family
Abstract
Vector graphics, represented in Scalable Vector Graphics (SVG) format, serve as a core medium for digital design and web rendering. Existing works on SVG tasks often focus on isolated subtasks such as generation, editing, or understanding. In this paper, we propose InternSVG, a unified framework based on multimodal large language models that jointly addresses SVG-related tasks across perception and creation. By representing SVGs as structured sequences and aligning them with textual descriptions and raster renderings, InternSVG enables a generalizable interface for vector reasoning, generation, and manipulation. Extensive experiments demonstrate its versatility and performance across diverse SVG benchmarks.
SAgoge: A Comprehensive Multimodal SVG Dataset
We introduce SAgoge, a large-scale and comprehensive dataset for SVG tasks with more than 16 million training samples spanning icons, illustrations, chemical structures, and animations.
Raw SVGs are gathered from the web and a custom synthesis pipeline, then normalized to a 128 × 128 canvas and simplified to shorten code. The rendered images or videos, processed SVG code, and handcrafted prompts are fed to an MLLM to synthesize high-quality training samples for understanding, editing, and generation.
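The canvas normalization step can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `normalize_canvas`, the helper `_to_float`, and the choice to wrap the content in a scaled `<g>` (instead of rewriting every coordinate) are assumptions made for illustration. Only plain numbers and `px` lengths are handled, and the code-simplification step is not covered.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize without an "ns0:" prefix


def _to_float(value: str) -> float:
    # Hypothetical helper: accepts "200" or "200px"; other units are not handled.
    value = value.strip()
    return float(value[:-2] if value.endswith("px") else value)


def normalize_canvas(svg_text: str, size: int = 128) -> str:
    """Rescale an SVG so that its longer side fits a size x size canvas."""
    root = ET.fromstring(svg_text)
    view_box = root.get("viewBox")
    if view_box:
        min_x, min_y, width, height = (
            float(v) for v in view_box.replace(",", " ").split()
        )
    else:
        min_x, min_y = 0.0, 0.0
        width = _to_float(root.get("width", str(size)))
        height = _to_float(root.get("height", str(size)))

    scale = size / max(width, height)  # uniform scale keeps the aspect ratio
    tx, ty = 0.0 - min_x, 0.0 - min_y  # move the old origin to (0, 0)

    # Reuse the root's namespace prefix ("{...}" or "") for the new group.
    ns = root.tag[: -len("svg")]
    group = ET.Element(
        ns + "g", {"transform": f"scale({scale:g}) translate({tx:g} {ty:g})"}
    )
    for child in list(root):
        root.remove(child)
        group.append(child)
    root.append(group)

    root.set("viewBox", f"0 0 {size} {size}")
    root.set("width", str(size))
    root.set("height", str(size))
    return ET.tostring(root, encoding="unicode")


example = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 256 512">'
    '<rect width="256" height="512"/></svg>'
)
normalized = normalize_canvas(example)
```

The sketch anchors non-square drawings at the top-left corner. A production pipeline would more likely bake the scale into the coordinates themselves, which also shortens the code.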
InternSVG: A Unified MLLM for SVG Understanding, Editing, and Generation
InternSVG follows the “ViT–MLP–LLM” paradigm, using InternViT-300M as the vision encoder and Qwen2.5-7B as the language model. We further design SVG-specific special tokens and introduce a tailored embedding initialization strategy to incorporate SVG content effectively.
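The text above does not spell out the initialization strategy. As a rough illustration of one common way to initialize embeddings for newly added special tokens, the sketch below sets each new token's vector to the mean of the embeddings of the subwords that spell it. This may differ from what InternSVG does; the toy vocabulary, the `tokenize` callback, and the function name are all hypothetical.

```python
def init_new_token_embeddings(embeddings, vocab, new_tokens, tokenize):
    """Append one embedding row per new token.

    Each row is the mean of the embeddings of the existing subwords that
    spell the token (an assumed strategy, used here for illustration only).
    """
    dim = len(embeddings[0])
    for token in new_tokens:
        piece_ids = [vocab[piece] for piece in tokenize(token)]
        mean = [
            sum(embeddings[i][d] for i in piece_ids) / len(piece_ids)
            for d in range(dim)
        ]
        vocab[token] = len(embeddings)  # the new token takes the next free id
        embeddings.append(mean)
    return embeddings, vocab


# Toy setup: a 4-entry vocabulary with 2-dimensional embeddings.
toy_vocab = {"<": 0, "path": 1, ">": 2, "circle": 3}
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]


def toy_tokenize(token):
    # Hypothetical tokenizer: splits "<path>" into "<", "path", ">".
    return ["<", token[1:-1], ">"]


toy_embeddings, toy_vocab = init_new_token_embeddings(
    toy_embeddings, toy_vocab, ["<path>"], toy_tokenize
)
```

Starting a new token near the embeddings of its textual spelling gives it a meaningful starting point, whereas a random vector has to be learned from scratch.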
See InternSVG in Action!
SArena: A Companion Benchmark
To enable systematic evaluation across SVG understanding, editing, and generation, we introduce SArena, a benchmark that aligns with the domains and difficulty spectrum covered by SAgoge and provides standardized tasks and metrics. SArena includes four sub-benchmarks: icons, illustrations, chemical structures, and animation.
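Several of the image-level metrics reported below are standard. As an illustration, PSNR between a rendered prediction and its reference can be sketched as follows. The flat pixel lists, the 8-bit value range, and the cap of 100 for identical images are assumptions made for this sketch, not a description of SArena's evaluation code.

```python
import math


def psnr(pred, ref, max_val=255.0, cap=100.0):
    """Peak signal-to-noise ratio between two equally sized pixel sequences.

    `cap` is an assumed upper bound so that identical images yield a finite
    score instead of infinity.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred)
    if mse == 0:
        return cap
    return min(cap, 10.0 * math.log10(max_val**2 / mse))


identical = psnr([0, 128, 255], [0, 128, 255])
opposite = psnr([0, 0], [255, 255])
```

Higher is better: the score grows as the mean squared error between the two renderings shrinks.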
Generation Performance (SArena-Icon)
| Model | Text-to-SVG FID ↓ | Text-to-SVG FID-C ↓ | Text-to-SVG CLIP-T2I ↑ | Text-to-SVG CLIP-I2I ↑ | Text-to-SVG Tokens | Image-to-SVG DINO ↑ | Image-to-SVG SSIM ↑ | Image-to-SVG LPIPS ↓ | Image-to-SVG PSNR ↑ | Image-to-SVG Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B | 19.428 | 11.247 | 21.863 | 71.859 | 280 | -- | -- | -- | -- | -- |
| Qwen2.5-VL-7B | 24.781 | 15.454 | 21.538 | 71.384 | 249 | 0.781 | 0.506 | 0.378 | 6.534 | 281 |
| Keye-VL-8B | 21.961 | 14.393 | 21.557 | 71.167 | 227 | 0.801 | 0.531 | 0.368 | 6.939 | 286 |
| GLM-4.1V-9B | 22.684 | 10.447 | 22.562 | 73.197 | 269 | 0.820 | 0.539 | 0.345 | 7.329 | 289 |
| InternVL3-8B | 23.061 | 14.303 | 21.897 | 71.450 | 269 | 0.812 | 0.557 | 0.361 | 7.220 | 256 |
| Llama-3.2-11B | 28.156 | 14.345 | 21.711 | 71.485 | 261 | 0.759 | 0.467 | 0.389 | 5.908 | 216 |
| Gemma-3-12B | 17.137 | 10.409 | 22.023 | 71.622 | 290 | 0.821 | 0.576 | 0.352 | 7.632 | 360 |
| InternVL3-14B | 18.996 | 13.224 | 22.066 | 71.493 | 227 | 0.825 | 0.562 | 0.359 | 7.343 | 216 |
| Kimi-VL-A3B | 30.807 | 16.996 | 21.439 | 70.536 | 228 | 0.798 | 0.562 | 0.362 | 7.179 | 245 |
| Gemma-3-27B | 15.145 | 9.303 | 22.526 | 73.277 | 249 | 0.826 | 0.595 | 0.354 | 7.833 | 267 |
| Qwen2.5-VL-32B | 20.043 | 10.393 | 22.783 | 73.228 | 317 | 0.836 | 0.562 | 0.357 | 7.503 | 309 |
| InternVL3-38B | 18.014 | 11.042 | 22.795 | 73.077 | 251 | 0.829 | 0.549 | 0.351 | 7.305 | 230 |
| Grok-3 | 21.967 | 8.694 | 24.122 | 76.797 | 346 | -- | -- | -- | -- | -- |
| Llama-3.1-70B | 18.032 | 8.300 | 22.747 | 73.876 | 255 | -- | -- | -- | -- | -- |
| Llama-3.1-405B | 16.794 | 8.390 | 22.822 | 73.920 | 236 | -- | -- | -- | -- | -- |
| DeepSeek-V3 | 24.990 | 8.803 | 23.790 | 76.470 | 251 | -- | -- | -- | -- | -- |
| GPT-4o | 15.178 | 6.763 | 24.617 | 77.742 | 246 | 0.874 | 0.616 | 0.316 | 8.435 | 231 |
| Gemini-2.5-Flash | 16.720 | 5.208 | 24.658 | 78.218 | 451 | 0.876 | 0.587 | 0.316 | 8.324 | 533 |
| Claude-Sonnet-3.7 | 14.383 | 3.499 | 25.294 | 80.786 | 417 | 0.909 | 0.647 | 0.290 | 9.259 | 389 |
| Claude-Sonnet-4 | 15.840 | 4.291 | 25.421 | 80.579 | 444 | 0.915 | 0.665 | 0.276 | 9.855 | 541 |
| Llama-3.2-90B | 19.309 | 8.550 | 22.841 | 74.006 | 249 | 0.757 | 0.437 | 0.377 | 5.777 | 192 |
| Llama-4-Scout | 17.908 | 9.382 | 22.849 | 73.563 | 256 | 0.844 | 0.582 | 0.346 | 7.736 | 246 |
| Llama-4-Maverick | 14.931 | 6.526 | 23.570 | 75.816 | 265 | 0.863 | 0.596 | 0.329 | 8.027 | 255 |
| GLM-4.5V | 16.641 | 5.093 | 24.450 | 78.349 | 372 | 0.872 | 0.627 | 0.315 | 8.666 | 322 |
| Step3-321B | 20.061 | 9.706 | 23.053 | 74.184 | 308 | 0.834 | 0.555 | 0.340 | 7.516 | 301 |
| Qwen2.5-VL-72B | 15.948 | 9.875 | 22.946 | 73.681 | 275 | 0.837 | 0.584 | 0.346 | 7.834 | 372 |
| InternVL3-78B | 17.580 | 10.596 | 22.805 | 73.123 | 252 | 0.850 | 0.584 | 0.339 | 7.802 | 234 |
| Starvector 8B | -- | -- | -- | -- | -- | 0.871 | 0.623 | 0.206 | 13.595 | 951 |
| LLM4SVG 7B | 21.939 | 8.611 | 19.458 | 70.726 | 705 | 0.748 | 0.472 | 0.409 | 5.375 | 485 |
| OmniSVG 3B | 28.292 | 11.318 | 21.679 | 74.831 | 1.7k | 0.894 | 0.756 | 0.186 | 12.669 | 2.4k |
| InternSVG 8B | 8.715 | 1.876 | 23.916 | 80.911 | 1.0k | 0.949 | 0.811 | 0.127 | 18.226 | 1.3k |
Simple Editing Performance
| Model | Low-level Color Editing DINO↑ | Low-level Color Editing SSIM↑ | Low-level Color Editing LPIPS↓ | Low-level Color Editing PSNR↑ | Cropping DINO↑ | Cropping SSIM↑ | Cropping LPIPS↓ | Cropping PSNR↑ | Flipping DINO↑ | Flipping SSIM↑ | Flipping LPIPS↓ | Flipping PSNR↑ | Rotation DINO↑ | Rotation SSIM↑ | Rotation LPIPS↓ | Rotation PSNR↑ | Scaling DINO↑ | Scaling SSIM↑ | Scaling LPIPS↓ | Scaling PSNR↑ | Adding Stroke DINO↑ | Adding Stroke SSIM↑ | Adding Stroke LPIPS↓ | Adding Stroke PSNR↑ | Translation DINO↑ | Translation SSIM↑ | Translation LPIPS↓ | Translation PSNR↑ | Transparency DINO↑ | Transparency SSIM↑ | Transparency LPIPS↓ | Transparency PSNR↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 0.958 | 0.892 | 0.061 | 73.123 | 0.870 | 0.673 | 0.270 | 10.087 | 0.852 | 0.636 | 0.313 | 9.683 | 0.919 | 0.803 | 0.152 | 47.833 | 0.902 | 0.653 | 0.262 | 12.466 | 0.917 | 0.728 | 0.180 | 25.767 | 0.908 | 0.634 | 0.295 | 13.257 | 0.966 | 0.889 | 0.073 | 50.893 |
| InternVL3-8B | 0.963 | 0.903 | 0.055 | 75.568 | 0.884 | 0.705 | 0.257 | 10.271 | 0.842 | 0.704 | 0.259 | 23.198 | 0.979 | 0.818 | 0.157 | 48.211 | 0.923 | 0.684 | 0.231 | 12.403 | 0.933 | 0.791 | 0.150 | 35.333 | 0.916 | 0.708 | 0.222 | 27.231 | 0.982 | 0.954 | 0.026 | 67.912 |
| InternVL3.5-8B | 0.999 | 0.992 | 0.007 | 88.473 | 0.881 | 0.761 | 0.195 | 11.376 | 0.905 | 0.704 | 0.241 | 13.358 | 0.886 | 0.697 | 0.246 | 21.118 | 0.932 | 0.710 | 0.234 | 16.638 | 0.936 | 0.721 | 0.162 | 20.350 | 0.917 | 0.660 | 0.276 | 12.508 | 0.989 | 0.967 | 0.024 | 59.713 |
| Gemma-3-27B | 1.000 | 1.000 | 0.000 | 99.057 | 0.885 | 0.619 | 0.297 | 14.116 | 0.995 | 0.982 | 0.008 | 96.554 | 0.991 | 0.945 | 0.041 | 85.314 | 0.943 | 0.846 | 0.100 | 67.280 | 0.968 | 0.857 | 0.116 | 40.216 | 0.962 | 0.896 | 0.045 | 82.705 | 0.883 | 0.687 | 0.141 | 63.444 |
| InternVL3.5-30B | 0.999 | 0.995 | 0.005 | 91.706 | 0.889 | 0.732 | 0.235 | 10.902 | 0.916 | 0.769 | 0.195 | 23.892 | 0.869 | 0.708 | 0.262 | 18.751 | 0.930 | 0.693 | 0.236 | 14.118 | 0.949 | 0.769 | 0.135 | 27.933 | 0.947 | 0.746 | 0.222 | 32.944 | 0.992 | 0.968 | 0.024 | 63.038 |
| Qwen2.5-VL-32B | 0.967 | 0.914 | 0.044 | 88.400 | 0.903 | 0.657 | 0.306 | 9.062 | 0.919 | 0.807 | 0.154 | 35.634 | 0.986 | 0.959 | 0.024 | 90.586 | 0.917 | 0.673 | 0.236 | 19.639 | 0.932 | 0.739 | 0.139 | 33.796 | 0.934 | 0.748 | 0.191 | 31.632 | 0.980 | 0.949 | 0.029 | 80.879 |
| Llama-4-Scout | 0.969 | 0.925 | 0.049 | 87.067 | 0.879 | 0.652 | 0.283 | 9.134 | 0.901 | 0.755 | 0.206 | 21.027 | 0.974 | 0.926 | 0.051 | 80.043 | 0.925 | 0.705 | 0.226 | 18.068 | 0.960 | 0.840 | 0.104 | 38.360 | 0.926 | 0.686 | 0.251 | 18.387 | 0.983 | 0.957 | 0.028 | 66.797 |
| Llama-4-Maverick | 0.998 | 0.996 | 0.006 | 94.874 | 0.903 | 0.677 | 0.301 | 9.404 | 0.955 | 0.914 | 0.074 | 76.565 | 0.989 | 0.967 | 0.024 | 88.142 | 0.927 | 0.776 | 0.194 | 23.361 | 0.970 | 0.886 | 0.073 | 52.249 | 0.956 | 0.741 | 0.226 | 31.710 | 0.996 | 0.991 | 0.006 | 94.987 |
| Qwen2.5-VL-72B | 0.995 | 0.986 | 0.008 | 97.542 | 0.909 | 0.668 | 0.307 | 9.174 | 0.948 | 0.874 | 0.090 | 52.671 | 0.992 | 0.949 | 0.045 | 82.266 | 0.901 | 0.678 | 0.267 | 11.492 | 0.965 | 0.875 | 0.105 | 44.055 | 0.951 | 0.704 | 0.256 | 18.695 | 0.995 | 0.992 | 0.010 | 72.101 |
| InternVL3-78B | 0.995 | 0.987 | 0.008 | 96.985 | 0.909 | 0.682 | 0.299 | 9.599 | 0.936 | 0.833 | 0.129 | 32.765 | 0.994 | 0.974 | 0.017 | 92.534 | 0.931 | 0.695 | 0.238 | 12.792 | 0.947 | 0.790 | 0.145 | 37.317 | 0.957 | 0.831 | 0.134 | 46.221 | 0.992 | 0.984 | 0.015 | 68.573 |
| InternVL3.5-241B | 0.983 | 0.956 | 0.021 | 91.262 | 0.904 | 0.763 | 0.225 | 11.763 | 0.896 | 0.754 | 0.165 | 30.961 | 0.901 | 0.783 | 0.188 | 39.965 | 0.919 | 0.661 | 0.245 | 11.857 | 0.948 | 0.762 | 0.136 | 27.335 | 0.928 | 0.750 | 0.160 | 25.850 | 0.956 | 0.882 | 0.059 | 64.399 |
| GPT-4o | 0.995 | 0.987 | 0.007 | 98.406 | 0.913 | 0.688 | 0.300 | 9.556 | 0.994 | 0.976 | 0.017 | 87.340 | 0.995 | 0.986 | 0.010 | 94.845 | 0.947 | 0.811 | 0.163 | 45.845 | 0.966 | 0.864 | 0.093 | 48.913 | 0.982 | 0.928 | 0.060 | 72.016 | 0.990 | 0.977 | 0.014 | 85.619 |
| Gemini-2.5-Flash | 1.000 | 1.000 | 9.761 | 99.057 | 0.885 | 0.619 | 0.297 | 14.116 | 0.995 | 0.982 | 0.008 | 96.554 | 0.991 | 0.945 | 0.041 | 85.314 | 0.943 | 0.846 | 0.100 | 67.280 | 0.968 | 0.857 | 0.116 | 40.216 | 0.962 | 0.896 | 0.045 | 82.705 | 0.883 | 0.687 | 0.141 | 63.444 |
| Claude-Sonnet-4 | 1.000 | 1.000 | 0.000 | 100.000 | 0.928 | 0.696 | 0.291 | 9.626 | 0.944 | 0.943 | 0.055 | 73.786 | 0.999 | 0.994 | 0.006 | 96.676 | 0.953 | 0.833 | 0.138 | 50.330 | 0.982 | 0.907 | 0.055 | 51.913 | 0.999 | 0.997 | 0.002 | 87.758 | 0.999 | 1.000 | 0.000 | 97.535 |
| InternSVG 8B | 1.000 | 1.000 | 0.000 | 100.000 | 1.000 | 1.000 | 0.000 | 100.000 | 0.996 | 0.987 | 0.005 | 98.672 | 1.000 | 1.000 | 0.000 | 99.692 | 0.999 | 1.000 | 0.000 | 98.655 | 1.000 | 1.000 | 0.000 | 99.488 | 1.000 | 1.000 | 0.000 | 100.000 | 1.000 | 1.000 | 0.000 | 99.968 |
Hard Editing Performance
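| Model | Semantic-level Color Editing DINO↑ | Semantic-level Color Editing SSIM↑ | Semantic-level Color Editing LPIPS↓ | Semantic-level Color Editing PSNR↑ | Style Transfer DINO↑ | Style Transfer SSIM↑ | Style Transfer LPIPS↓ | Style Transfer PSNR↑ |
|---|---|---|---|---|---|---|---|---|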
| Qwen2.5-VL-7B | 0.919 | 0.768 | 0.166 | 23.902 | 0.889 | 0.658 | 0.193 | 11.940 |
| InternVL3-8B | 0.903 | 0.728 | 0.184 | 22.071 | 0.917 | 0.728 | 0.158 | 13.457 |
| Gemma-3-27B | 0.981 | 0.920 | 0.072 | 53.068 | 0.869 | 0.591 | 0.210 | 12.174 |
| Qwen2.5-VL-32B | 0.926 | 0.769 | 0.158 | 28.290 | 0.910 | 0.723 | 0.162 | 14.283 |
| Llama-4-Scout | 0.964 | 0.860 | 0.120 | 27.852 | 0.963 | 0.848 | 0.119 | 15.417 |
| Llama-4-Maverick | 0.975 | 0.891 | 0.099 | 41.222 | 0.969 | 0.855 | 0.105 | 16.765 |
| Qwen2.5-VL-72B | 0.975 | 0.888 | 0.100 | 42.759 | 0.957 | 0.836 | 0.113 | 16.771 |
| InternVL3-78B | 0.955 | 0.857 | 0.105 | 27.033 | 0.912 | 0.705 | 0.175 | 13.429 |
| GPT-4o | 0.972 | 0.912 | 0.073 | 54.651 | 0.952 | 0.819 | 0.117 | 18.173 |
| Gemini-2.5-Flash | 0.981 | 0.920 | 0.072 | 53.068 | 0.869 | 0.591 | 0.210 | 12.174 |
| Claude-Sonnet-4 | 0.991 | 0.944 | 0.050 | 56.741 | 0.976 | 0.867 | 0.097 | 18.374 |
| InternSVG 8B | 0.996 | 0.959 | 0.041 | 69.875 | 0.952 | 0.808 | 0.139 | 18.100 |
Overall Editing Performance
| Model | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ | Tokens |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 0.909 | 0.728 | 0.192 | 25.402 | 1.0k |
| InternVL3-8B | 0.921 | 0.761 | 0.170 | 29.615 | 1.2k |
| Gemma-3-27B | 0.942 | 0.815 | 0.113 | 54.200 | 1.3k |
| Qwen2.5-VL-32B | 0.933 | 0.782 | 0.148 | 37.737 | 1.0k |
| Llama-4-Scout | 0.949 | 0.825 | 0.138 | 34.070 | 1.3k |
| Llama-4-Maverick | 0.966 | 0.870 | 0.109 | 46.944 | 1.3k |
| Qwen2.5-VL-72B | 0.961 | 0.849 | 0.124 | 41.006 | 1.2k |
| InternVL3-78B | 0.958 | 0.848 | 0.116 | 40.533 | 1.2k |
| GPT-4o | 0.968 | 0.887 | 0.088 | 55.255 | 1.2k |
| Gemini-2.5-Flash | 0.942 | 0.815 | 0.113 | 54.200 | 1.3k |
| Claude-Sonnet-4 | 0.979 | 0.915 | 0.071 | 57.595 | 1.3k |
| InternSVG 8B | 0.989 | 0.952 | 0.036 | 77.331 | 1.4k |
Understanding Performance
| Model | Overall | Color | Geometry | Quantity | Semantic |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 52.8 | 69.3 | 50.4 | 34.9 | 56.4 |
| InternVL3-8B | 59.5 | 79.1 | 59.3 | 38.2 | 61.3 |
| Gemma-3-27B | 59.5 | 82.2 | 67.6 | 43.6 | 44.7 |
| Qwen2.5-VL-32B | 65.5 | 82.8 | 65.5 | 47.7 | 66.1 |
| Llama-4-Scout | 57.5 | 82.4 | 57.0 | 41.6 | 49.0 |
| Llama-4-Maverick | 64.7 | 87.5 | 62.0 | 47.2 | 62.3 |
| Qwen2.5-VL-72B | 63.4 | 82.4 | 65.1 | 44.6 | 61.6 |
| InternVL3-78B | 65.3 | 86.4 | 71.0 | 48.8 | 54.9 |
| GPT-4o | 71.0 | 88.2 | 78.5 | 47.5 | 69.6 |
| Gemini-2.5-Flash | 73.0 | 90.1 | 81.9 | 53.0 | 67.2 |
| Claude-Sonnet-4 | 77.1 | 91.5 | 82.4 | 53.8 | 80.6 |
| InternSVG 8B | 85.1 | 93.0 | 85.8 | 61.9 | 99.7 |
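The Overall column in this table is numerically consistent with an unweighted mean of the four category scores. The short check below, using rows copied from the table, illustrates this. It is an observation about the reported numbers, not a statement of how the benchmark computes the score, and the tolerance is an assumption that absorbs rounding to one decimal place.

```python
# (overall, [color, geometry, quantity, semantic]) copied from the table above.
ROWS = {
    "Qwen2.5-VL-7B": (52.8, [69.3, 50.4, 34.9, 56.4]),
    "GPT-4o": (71.0, [88.2, 78.5, 47.5, 69.6]),
    "Claude-Sonnet-4": (77.1, [91.5, 82.4, 53.8, 80.6]),
    "InternSVG 8B": (85.1, [93.0, 85.8, 61.9, 99.7]),
}


def macro_average(scores):
    """Unweighted mean over categories."""
    return sum(scores) / len(scores)


def matches_overall(overall, scores, tol=0.06):
    # tol covers the rounding of published values to one decimal place.
    return abs(macro_average(scores) - overall) <= tol


consistent = {name: matches_overall(o, s) for name, (o, s) in ROWS.items()}
```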
Generation Performance (SArena-Illustration)
| Model | Text-to-SVG FID ↓ | Text-to-SVG FID-C ↓ | Text-to-SVG CLIP-T2I ↑ | Text-to-SVG CLIP-I2I ↑ | Text-to-SVG Tokens | Image-to-SVG DINO ↑ | Image-to-SVG SSIM ↑ | Image-to-SVG LPIPS ↓ | Image-to-SVG PSNR ↑ | Image-to-SVG Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 37.903 | 28.455 | 18.069 | 61.928 | 756 | 0.739 | 0.513 | 0.413 | 7.732 | 1.2k |
| InternVL3-8B | 36.736 | 25.682 | 18.493 | 61.964 | 493 | 0.772 | 0.569 | 0.397 | 8.542 | 716 |
| InternVL3.5-8B | 70.837 | 35.776 | 18.095 | 63.357 | 3.6k | 0.721 | 0.306 | 0.410 | 5.283 | 2.5k |
| InternVL3.5-14B | 65.967 | 34.912 | 18.131 | 63.496 | 3.5k | 0.722 | 0.296 | 0.414 | 5.130 | 2.8k |
| Gemma-3-27B | 27.838 | 13.766 | 21.486 | 67.255 | 613 | 0.824 | 0.617 | 0.379 | 9.920 | 764 |
| InternVL3.5-30B | 68.438 | 33.285 | 18.354 | 63.910 | 3.8k | 0.739 | 0.331 | 0.404 | 5.778 | 3.0k |
| Qwen2.5-VL-32B | 32.115 | 17.804 | 19.773 | 64.555 | 779 | 0.816 | 0.591 | 0.382 | 9.297 | 828 |
| InternVL3.5-38B | 42.172 | 21.556 | 18.221 | 65.511 | 4.3k | 0.755 | 0.393 | 0.400 | 6.540 | 3.8k |
| Llama-4-Scout | 35.489 | 18.647 | 20.299 | 64.182 | 524 | 0.807 | 0.599 | 0.360 | 9.549 | 574 |
| Llama-4-Maverick | 30.835 | 14.831 | 21.872 | 67.366 | 551 | 0.839 | 0.644 | 0.340 | 10.469 | 608 |
| Qwen2.5-VL-72B | 29.521 | 18.407 | 20.923 | 65.349 | 527 | 0.808 | 0.628 | 0.363 | 9.900 | 886 |
| InternVL3-78B | 30.457 | 19.195 | 20.577 | 64.826 | 454 | 0.830 | 0.638 | 0.348 | 9.985 | 514 |
| InternVL3.5-241B | 43.339 | 23.061 | 18.191 | 65.689 | 2.9k | 0.792 | 0.480 | 0.378 | 8.093 | 3.1k |
| GPT-4o | 28.124 | 14.150 | 23.637 | 70.696 | 473 | 0.850 | 0.663 | 0.327 | 10.723 | 484 |
| Gemini-2.5-Flash | 28.865 | 8.894 | 24.800 | 74.796 | 1.2k | 0.829 | 0.516 | 0.359 | 9.091 | 1.8k |
| Claude-Sonnet-4 | 27.294 | 7.640 | 23.094 | 74.525 | 1.0k | 0.901 | 0.670 | 0.305 | 11.731 | 1.3k |
| Starvector 8B | -- | -- | -- | -- | -- | 0.650 | 0.070 | 0.447 | 1.990 | 2.6k |
| LLM4SVG 7B | 48.704 | 29.568 | 15.468 | 62.933 | 1.2k | 0.713 | 0.494 | 0.413 | 6.221 | 476 |
| OmniSVG 3B | 42.756 | 22.885 | 16.861 | 64.815 | 4.5k | 0.797 | 0.656 | 0.330 | 10.433 | 6.7k |
| InternSVG 8B | 22.397 | 5.141 | 21.116 | 74.662 | 8.1k | 0.924 | 0.716 | 0.188 | 14.644 | 7.7k |
Generation Performance (SArena-Chemistry)
| Model | Text-to-SVG FID ↓ | Text-to-SVG FID-C ↓ | Text-to-SVG CLIP-I2I ↑ | Text-to-SVG Tokens | Image-to-SVG DINO ↑ | Image-to-SVG SSIM ↑ | Image-to-SVG LPIPS ↓ | Image-to-SVG PSNR ↑ | Image-to-SVG Tokens |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 56.248 | 73.698 | 51.814 | 907 | 0.769 | 0.468 | 0.274 | 7.501 | 996 |
| InternVL3-8B | 33.613 | 61.675 | 56.856 | 910 | 0.865 | 0.783 | 0.203 | 13.840 | 805 |
| Gemma-3-27B | 29.937 | 49.967 | 60.776 | 776 | 0.887 | 0.823 | 0.190 | 14.959 | 683 |
| Qwen2.5-VL-32B | 53.047 | 56.431 | 58.428 | 1.2k | 0.821 | 0.570 | 0.225 | 10.005 | 900 |
| Llama-4-Scout | 33.781 | 46.584 | 62.522 | 849 | 0.866 | 0.734 | 0.205 | 12.984 | 624 |
| Llama-4-Maverick | 26.844 | 31.924 | 69.643 | 747 | 0.908 | 0.798 | 0.173 | 14.977 | 687 |
| Qwen2.5-VL-72B | 32.307 | 44.540 | 63.931 | 620 | 0.846 | 0.647 | 0.215 | 12.106 | 716 |
| InternVL3-78B | 29.216 | 40.080 | 65.969 | 698 | 0.911 | 0.813 | 0.177 | 15.375 | 545 |
| GPT-4o | 24.505 | 19.297 | 76.599 | 640 | 0.920 | 0.791 | 0.174 | 14.673 | 533 |
| Gemini-2.5-Flash | 27.708 | 21.777 | 75.897 | 1.4k | 0.934 | 0.817 | 0.155 | 15.539 | 1.1k |
| Claude-Sonnet-4 | 21.252 | 15.240 | 78.308 | 1.2k | 0.957 | 0.871 | 0.132 | 17.554 | 956 |
| Starvector 8B | -- | -- | -- | -- | 0.977 | 0.841 | 0.147 | 17.419 | 1.2k |
| InternSVG 8B | 9.974 | 0.877 | 93.931 | 981 | 0.994 | 0.873 | 0.138 | 17.722 | 931 |
Generation Performance (SArena-Animation)
| Model | Text-to-SANI FVD ↓ | Text-to-SANI CLIP-T2V ↑ | Text-to-SANI CLIP-V2V ↑ | Text-to-SANI Tokens | Video-to-SANI DINO ↑ | Video-to-SANI SSIM ↑ | Video-to-SANI LPIPS ↓ | Video-to-SANI PSNR ↑ | Video-to-SANI Tokens |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 214.379 | 19.118 | 50.649 | 296 | 0.787 | 0.716 | 0.273 | 11.758 | 423 |
| InternVL3-8B | 310.066 | 17.017 | 43.856 | 433 | 0.780 | 0.612 | 0.286 | 9.883 | 415 |
| Gemma-3-27B | 159.119 | 21.105 | 59.309 | 533 | 0.824 | 0.733 | 0.265 | 12.290 | 516 |
| Qwen2.5-VL-32B | 128.299 | 20.535 | 59.188 | 537 | 0.823 | 0.696 | 0.273 | 11.417 | 505 |
| Llama-4-Scout | 167.932 | 21.014 | 62.929 | 505 | 0.831 | 0.742 | 0.259 | 12.427 | 426 |
| Llama-4-Maverick | 141.470 | 22.304 | 67.615 | 563 | 0.841 | 0.754 | 0.246 | 12.858 | 447 |
| Qwen2.5-VL-72B | 151.682 | 20.376 | 59.454 | 433 | 0.834 | 0.721 | 0.261 | 11.931 | 402 |
| InternVL3-78B | 169.159 | 20.263 | 60.896 | 409 | 0.828 | 0.704 | 0.264 | 11.336 | 385 |
| GPT-4o | 286.352 | 22.808 | 70.608 | 404 | 0.860 | 0.743 | 0.250 | 12.260 | 400 |
| Gemini-2.5-Flash | 151.983 | 22.239 | 66.554 | 986 | 0.847 | 0.701 | 0.257 | 12.015 | 917 |
| Claude-Sonnet-4 | 169.484 | 24.070 | 74.179 | 907 | 0.867 | 0.760 | 0.240 | 13.189 | 866 |
| InternSVG 8B | 99.474 | 22.572 | 73.162 | 812 | 0.876 | 0.754 | 0.237 | 14.168 | 888 |
SGP-Bench
To further validate the effectiveness of SAgoge in enhancing model capabilities for SVG modeling, we conduct comparative experiments on SGP-Bench, a benchmark specifically designed to evaluate semantic and structural understanding of symbolic graphic programs.
| Model | Semantics ↑ | Count ↑ | Color ↑ | Shape ↑ | Reasoning ↑ | Overall ↑ |
|---|---|---|---|---|---|---|
| Gemma-1.1-2B | 32.1 | 33.3 | 25.0 | 35.6 | 28.7 | 31.7 |
| InternLM2.5-7B | 27.3 | 31.7 | 59.8 | 51.5 | 28.2 | 42.1 |
| Keye-VL-8B | 41.4 | 47.5 | 71.4 | 54.9 | 40.6 | 52.2 |
| GLM-4.1V-9B | 41.6 | 55.6 | 79.1 | 61.5 | 40.0 | 57.1 |
| InternVL3-8B | 33.7 | 46.5 | 69.8 | 59.1 | 36.1 | 50.6 |
| Gemma-3-12B | 24.8 | 30.8 | 47.2 | 25.7 | 22.8 | 30.5 |
| DeepSeek-Coder-V2-16B | 30.9 | 37.9 | 63.7 | 54.8 | 26.8 | 45.1 |
| InternVL3-14B | 38.2 | 52.9 | 74.4 | 54.1 | 41.7 | 52.9 |
| Kimi-VL-A3B-2506 | 31.1 | 41.5 | 67.0 | 47.4 | 32.4 | 44.9 |
| Gemma-3-27B | 36.7 | 51.4 | 76.3 | 62.1 | 39.4 | 54.7 |
| Qwen2.5-VL-32B | 40.0 | 55.7 | 76.3 | 61.2 | 43.9 | 56.5 |
| InternVL3-38B | 40.8 | 58.7 | 82.2 | 63.6 | 43.9 | 59.1 |
| GPT-4o | 45.9 | 56.8 | 87.3 | 75.2 | 50.4 | 64.8 |
| Gemini-2.5-Flash | 53.8 | 57.8 | 88.1 | 75.6 | 55.5 | 67.6 |
| Claude-Sonnet-4 | 55.9 | 67.6 | 89.5 | 79.0 | 58.9 | 71.5 |
| GLM-4.5V | 47.3 | 63.7 | 87.3 | 72.3 | 55.8 | 66.1 |
| Qwen2.5-VL-72B | 40.2 | 55.1 | 80.1 | 62.0 | 41.1 | 57.1 |
| InternVL3-78B | 41.0 | 59.1 | 84.0 | 65.2 | 47.0 | 60.3 |
| Step3-321B-A38B | 35.9 | 54.0 | 82.8 | 63.2 | 38.6 | 56.5 |
| InternSVG 8B | 54.6 | 70.7 | 85.5 | 82.4 | 57.5 | 72.3 |
Comparison with Baselines
We compare the generated SVGs with those produced by baseline methods to assess visual quality.
SArena-Icon: Text-to-SVG, Image-to-SVG
SArena-Illustration: Text-to-SVG, Image-to-SVG
SArena-Chemistry: Text-to-SVG, Image-to-SVG
SArena-Animation: Text-to-SVG, Image-to-SVG
BibTeX
@misc{wang2025internsvgunifiedsvgtasks,
  title={InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models},
  author={Haomin Wang and Jinhui Yin and Qi Wei and Wenguang Zeng and Lixin Gu and Shenglong Ye and Zhangwei Gao and Yaohui Wang and Yanting Zhang and Yuanqi Li and Yanwen Guo and Wenhai Wang and Kai Chen and Yu Qiao and Hongjie Zhang},
  year={2025},
  eprint={2510.11341},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.11341},
}