DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation
TL;DR
We improve visual design representation learning by generating code in declarative languages (e.g., HTML) that describes structures such as slides and UIs, with dynamic labels embedded for more flexible data creation and improved model performance across tasks.
Summary
Enabling machines to understand structured visuals like slides and user interfaces is essential for making them accessible to people with disabilities. However, achieving such understanding computationally has required manual data collection and annotation, which is time-consuming and labor-intensive. To overcome this challenge, we present a method to generate synthetic, structured visuals with target labels using code generation. Our method allows people to create datasets with built-in labels and train models with a small number of human-annotated examples. We demonstrate performance improvements in three tasks for understanding slides and UIs: recognizing visual elements, describing visual content, and classifying visual content types.
Learning Design Representation with Generative Code Semantics
DreamStruct constructs synthetically labeled datasets for visual designs such as slides and UIs by generating declarative, semantic code abstractions in which annotations are specified a priori and visual elements are synthesized accordingly. The methodology follows three core design principles: adherence to common design guidelines, representation of prevalent visual design patterns, and alignment with targeted semantic labels. Our multi-stage pipeline (see the sketch after this list) operates as follows:
- Ideation: Design concepts are synthesized by prompting a large language model (LLM) with design guidelines, common patterns, and curated example sets, yielding diverse design representations.
- Generation: Each design concept is programmatically rendered as code, with structured annotations embedded in the HTML to semantically label visual elements and to ensure compliance with the specified guidelines, patterns, and target labels.
- Production: Placeholder visual elements are replaced with real-world data or AI-generated assets, followed by post-processing and filtering to ensure high-quality outputs. The resulting DreamStruct dataset encompasses around 20K annotated code-visual pairs with real-world data distributions and a diverse range of design patterns.
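The sketch below is a minimal, hypothetical illustration of this pipeline in Python. The prompt strings, the `data-label` attribute convention, and the example markup are assumptions made for illustration, not the exact prompts or markup used by DreamStruct; the parsing step shows how labels embedded at generation time can be recovered automatically, without manual annotation.

```python
from html.parser import HTMLParser

# --- Ideation: a prompt like this would be sent to an LLM of choice ---------
IDEATION_PROMPT = (
    "Propose 3 distinct slide layouts that follow common design guidelines "
    "(clear hierarchy, ample whitespace) and include these element types: "
    "title, image, chart, caption."
)

# --- Generation: ask for HTML with labels embedded a priori -----------------
# The data-label attribute is an assumed convention for carrying the target
# annotation of each element inside the generated code.
GENERATION_PROMPT_TEMPLATE = (
    "Write self-contained HTML for this slide concept:\n{concept}\n"
    "Tag every visual element with a data-label attribute naming its type."
)

# Example of what a generated, label-annotated document might look like.
EXAMPLE_GENERATED_HTML = """
<div class="slide" data-label="title-and-content">
  <h1 data-label="title">Quarterly Results</h1>
  <img src="placeholder_chart.png" data-label="chart">
  <p data-label="caption">Revenue grew 12% quarter over quarter.</p>
</div>
"""

# --- Production: recover the embedded labels from the generated code --------
class LabelExtractor(HTMLParser):
    """Collects (tag, label) pairs from data-label attributes."""
    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-label" in attrs:
            self.labels.append((tag, attrs["data-label"]))

if __name__ == "__main__":
    extractor = LabelExtractor()
    extractor.feed(EXAMPLE_GENERATED_HTML)
    # Placeholder assets (e.g., placeholder_chart.png) would next be swapped
    # for real or AI-generated images, then rendered and filtered.
    print(extractor.labels)
    # [('div', 'title-and-content'), ('h1', 'title'), ('img', 'chart'), ('p', 'caption')]
```

Because each label travels with its element in the code, rendering the HTML yields a screenshot whose annotations are known by construction.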
Overview of Example Synthetic Datasets: DreamSlides and DreamUI
DreamStruct generated a total of 10,053 slide-code pairs (DreamSlides) and 9,774 UI-code pairs (DreamUI). The datasets include highly detailed descriptions for slides and UIs, with each description averaging 57.23 words and 19.66 named entities, a substantial increase over human-annotated datasets, which average 5.74 words and 3.37 named entities. Element compositions, such as images, charts, and diagrams, remain consistent between DreamStruct and human-annotated datasets. For instance, DreamStruct samples contain an average of 2.32 images, 0.21 charts, and 0.24 diagrams per sample, while human-annotated samples feature 2.48 images, 0.07 charts, and 0.12 diagrams. The total number of elements per sample is nearly identical, with 8.74 for DreamStruct and 8.83 for human-annotated datasets. Despite not explicitly controlling the distribution of element types during generation, the distribution in DreamStruct aligns closely with real-world datasets. One notable difference is the rare occurrence of the 'upper task bar' element in DreamStruct's UI samples, which typically shows the time and network status and is less relevant to the core content of the UIs.
Example Use Case of Design Modeling: Element Recognition
Element recognition is a crucial task for understanding visual designs, especially slides and UIs. DreamStruct's synthetic data boosts model performance over human-annotated data when used as a training or pretraining source. For slide element recognition, models trained on our synthetic data achieved 55.3% mean Average Precision (mAP), compared to 44.3% for models trained on human-annotated datasets at the same data scale. For UI element recognition, we used a two-stage training approach: pretraining models on synthetic data and finetuning them with just 50 human-annotated samples. This approach improved performance by 5.2%, showing that synthetic data can fill gaps where real-world data is sparse or imbalanced.
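As a rough sketch of this two-stage recipe, the code below pretrains an off-the-shelf detector on synthetic screenshots and then finetunes it on a small human-annotated set. The Faster R-CNN model, the class count, and the DreamUIDetection/HumanUIDetection dataset names are illustrative assumptions, not the exact components used in the paper.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 25 + 1  # hypothetical number of UI element types + background

def collate(batch):
    return tuple(zip(*batch))  # detection batches keep variable-size targets

def build_model():
    # Off-the-shelf detector with a new box predictor for UI element classes.
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
    return model

def run_stage(model, dataset: Dataset, epochs: int, lr: float, device: str):
    """One training stage: used first for synthetic pretraining, then for
    finetuning on a handful of human-annotated screenshots at a lower LR."""
    loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate)
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=5e-4)
    model.to(device).train()
    for _ in range(epochs):
        for images, targets in loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            losses = model(images, targets)  # dict of detection losses
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Hypothetical datasets yielding (image_tensor, {"boxes": ..., "labels": ...}):
# synthetic_ds = DreamUIDetection(...)   # thousands of generated screenshots
# human_ds     = HumanUIDetection(...)   # ~50 human-annotated screenshots
# device = "cuda" if torch.cuda.is_available() else "cpu"
# model = build_model()
# run_stage(model, synthetic_ds, epochs=10, lr=5e-3, device=device)  # pretrain
# run_stage(model, human_ds,     epochs=20, lr=5e-4, device=device)  # finetune
```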
Example Use Case of Design Modeling: Image Captioning
Image captioning generates descriptions for screenshots of slides and UIs, enabling better accessibility for blind and low-vision users. We compared models trained on DreamStruct's synthetic datasets with models fine-tuned on human-written captions. Our model outperformed these baselines, achieving a 77.9% win rate for slide captioning and a 64.8% win rate for UI captioning. When compared against GPT-4V, a state-of-the-art model, our smaller model achieved win rates of 31.6% for slides and 37.3% for UIs. These results show the impact of our synthetic data in training models that generate more relevant and detailed captions, and in improving the capabilities of open models.
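To ground the task, the snippet below generates a caption for a screenshot with an open captioning model from Hugging Face. BLIP is an illustrative stand-in rather than the model trained in this work, and screenshot.png is a hypothetical input file.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Open captioning model used purely for illustration; this work instead trains
# on DreamStruct's detailed synthetic descriptions.
MODEL_ID = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("screenshot.png").convert("RGB")  # hypothetical slide/UI screenshot
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=60)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```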
Example Use Case of Design Modeling: Image Classification
Image classification is key to identifying themes and visual patterns in slides and UIs. Models trained with DreamStruct classify slides with 67.9% accuracy, significantly higher than the 33.5% accuracy of models trained on human-annotated datasets. Similarly, UI classification accuracy rose to 55.2% with synthetic training data, compared to 31% with human-annotated data. By providing a larger, more diverse dataset, our method helps models recognize design elements in ways that generalize across both slides and UIs. Proprietary models such as GPT-4V still lead in overall performance, but DreamStruct's data narrows the gap.
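A minimal sketch of the classification setup follows, assuming a hypothetical DreamSlidesCategories dataset of (screenshot, category) pairs with built-in labels; the frozen ResNet-50 backbone and linear-probe head are illustrative choices, not necessarily the paper's.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision.models import resnet50, ResNet50_Weights

NUM_CATEGORIES = 12  # hypothetical number of slide/UI content categories

def build_classifier():
    # Frozen ImageNet backbone with a new classification head (linear probe).
    model = resnet50(weights=ResNet50_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, NUM_CATEGORIES)
    return model

def train(model, dataset, epochs=5, lr=1e-3, device="cpu"):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.AdamW(model.fc.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Hypothetical synthetic-labeled dataset of (image_tensor, category_index) pairs:
# model = train(build_classifier(), DreamSlidesCategories(...))
```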
Reference
@inproceedings{peng2024dreamstruct,
  title     = {DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation},
  author    = {Peng, Yi-Hao and Huq, Faria and Jiang, Yue and Wu, Jason and Li, Amanda Xin Yue and Bigham, Jeffrey and Pavel, Amy},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024}
}
Acknowledgements
This work was funded in part by the National Science Foundation.
This webpage template was inspired by the BlobGAN project.