DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation

1Carnegie Mellon University
2The University of Texas at Austin
3Aalto University
The 18th European Conference on Computer Vision (ECCV 2024)
Figure: Overview of the DreamStruct workflow for synthetically generating labeled visuals. Top: a 'Recipe Finder' UI, where a generated description of the interface (search filters and recipe cards such as 'Creamy Pasta' and 'Garden Salad') is turned into HTML code, paired with images retrieved via image search, and rendered into the final UI. Bottom: a slide titled 'Common Verb Conjugations,' where a generated description of a bar graph and a table of Japanese verb transformations is rendered via generated HTML, CSS, and JavaScript. Because the metadata is generated first, the rendered visuals come with built-in semantic labels that support structured visual representation learning.

TL;DR

We improve visual design representation learning by generating declarative code (e.g., HTML) that describes structured visuals like slides and UIs, with labels embedded during generation for more flexible data creation and improved model performance across tasks.

Summary

Enabling machines to understand structured visuals like slides and user interfaces is essential for making them accessible to people with disabilities. However, achieving such understanding computationally has required manual data collection and annotation, which is time-consuming and labor-intensive. To overcome this challenge, we present a method to generate synthetic, structured visuals with target labels using code generation. Our method allows people to create datasets with built-in labels and train models with a small number of human-annotated examples. We demonstrate performance improvements in three tasks for understanding slides and UIs: recognizing visual elements, describing visual content, and classifying visual content types.

Learning Design Representation with Generative Code Semantics

DreamStruct constructs synthetically labeled datasets for visual designs such as slides and UIs by generating declarative, semantic code abstractions in which annotations are specified a priori and visual elements are synthesized accordingly. The methodology follows three core design principles: adherence to common design guidelines, representation of prevalent visual design patterns, and alignment with targeted semantic labels. The multi-stage pipeline first generates a textual description with target labels, then produces declarative code (e.g., HTML, CSS, and JavaScript), retrieves supporting images, and renders the result into a labeled visual, as sketched below.
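The listing below is a minimal sketch of the final rendering step, not the paper's implementation. It assumes labels are embedded as `data-label` attributes in LLM-generated HTML (a hard-coded toy snippet stands in for generated code here) and uses Playwright to render a screenshot and read bounding boxes back from the DOM.

import json
from playwright.sync_api import sync_playwright

# Hard-coded stand-in for HTML an LLM would generate; `data-label` is an
# assumed convention for embedding target labels in the generated code.
html = """
<html><body style="margin:0;width:375px">
  <div data-label="text_button" style="padding:12px;background:#3478f6;color:#fff">Search recipes</div>
  <img data-label="image" src="https://via.placeholder.com/375x200" alt="Creamy Pasta">
  <p data-label="text" style="padding:12px">Creamy Pasta - ready in 20 minutes.</p>
</body></html>
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 375, "height": 667})
    page.set_content(html)
    page.screenshot(path="synthetic_ui.png")  # the training image
    annotations = []
    for el in page.query_selector_all("[data-label]"):
        box = el.bounding_box()  # {'x': ..., 'y': ..., 'width': ..., 'height': ...}
        if box:
            annotations.append({"label": el.get_attribute("data-label"), "bbox": box})
    browser.close()

print(json.dumps(annotations, indent=2))  # ready-made labels for the rendered screenshot

Because the labels travel with the generated code, the screenshot and its annotations stay consistent by construction, which is what removes the need for manual labeling.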

Overview of Example Synthetic Datasets: DreamSlides and DreamUI

DreamStruct generated a total of 10,053 slide-code pairs (DreamSlides) and 9,774 UI-code pairs (DreamUI). The datasets include highly detailed descriptions for slides and UIs, with each description averaging 57.23 words and 19.66 named entities, a substantial increase over human-annotated datasets, which average 5.74 words and 3.37 named entities. Element compositions, such as images, charts, and diagrams, remain consistent between DreamStruct and human-annotated datasets: DreamStruct samples contain an average of 2.32 images, 0.21 charts, and 0.24 diagrams per sample, while human-annotated samples contain 2.48 images, 0.07 charts, and 0.12 diagrams. The total number of elements per sample is nearly identical, with DreamStruct at 8.74 and human-annotated datasets at 8.83. Although we did not explicitly control the distribution of element types during generation, the distribution in DreamStruct aligns closely with real-world datasets. One notable difference is the rare occurrence of the 'upper task bar' element in DreamStruct's UI samples, which typically displays the time and network status indicators but is less relevant to the core content of the UIs.
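As a rough illustration of how the per-description word and named-entity counts above could be measured, the snippet below uses spaCy's small English model on placeholder descriptions; these metric choices are assumptions, not the paper's exact evaluation code.

import spacy

nlp = spacy.load("en_core_web_sm")

# Placeholder descriptions; a real run would iterate over DreamSlides/DreamUI metadata.
descriptions = [
    "A recipe app screen with a search bar, filter chips, and cards for Creamy Pasta and Garden Salad.",
    "A slide titled Common Verb Conjugations showing a bar chart and a table of Japanese verb forms.",
]

word_counts = [len(d.split()) for d in descriptions]          # whitespace tokenization
entity_counts = [len(nlp(d).ents) for d in descriptions]      # spaCy named entities

print(f"avg words per description: {sum(word_counts) / len(word_counts):.2f}")
print(f"avg named entities per description: {sum(entity_counts) / len(entity_counts):.2f}")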

Example Use Case of Design Modeling: Element Recognition

Element recognition is a crucial task for understanding visual design, especially in slides and UIs. Using DreamStruct's synthetic data as a training or pretraining source boosts model performance over training on human-annotated data alone. For slide element recognition, models trained on our synthetic data achieved 55.3% mean Average Precision (mAP), compared to 44.3% for models trained on human-annotated datasets at the same data scale. For UI element recognition, we used a two-stage approach: pretraining models on synthetic data and finetuning them with just 50 human-annotated samples. This improved performance by 5.2%, showing that synthetic data can fill in gaps where real-world data is sparse or imbalanced.
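The sketch below illustrates the two-stage recipe with a torchvision Faster R-CNN; the detector choice, class count, and data loaders are assumptions for illustration, not the paper's exact setup.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 26  # assumed: 25 UI element types + background

def build_detector(num_classes: int):
    # Start from a COCO-pretrained backbone and swap in a new box predictor.
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

def train_one_stage(model, loader, epochs, lr):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            losses = model(images, targets)   # dict of detection losses
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

model = build_detector(NUM_CLASSES)
# Stage 1: pretrain on synthetic screenshots with generated boxes.
# Stage 2: finetune on ~50 human-annotated samples with a smaller learning rate.
# `synthetic_loader` and `human_loader` are hypothetical DataLoaders yielding
# (images, targets) in the standard torchvision detection format.
# model = train_one_stage(model, synthetic_loader, epochs=10, lr=0.005)
# model = train_one_stage(model, human_loader, epochs=5, lr=0.0005)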

Example Use Case of Design Modeling: Image Captioning

Image captioning generates descriptions for screenshots of slides and UIs, improving accessibility for blind and low-vision users. We compared models trained on DreamStruct's synthetic datasets with models fine-tuned on human-written captions. Our model outperformed these baselines, achieving a 77.9% win rate for slide captioning and a 64.8% win rate for UI captioning. When compared against GPT-4V, a state-of-the-art proprietary model, our smaller model achieved 31.6% and 37.3% win rates for slides and UIs, respectively. These results show that our synthetic data helps train models that produce more relevant, detailed captions and improves the capabilities of open models.
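As an illustration of fine-tuning an open captioner on synthetic (screenshot, description) pairs, the sketch below uses BLIP via Hugging Face Transformers; BLIP, the file paths, and the hyperparameters are stand-in assumptions rather than the paper's configuration.

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One synthetic training pair; a real run would iterate over DreamSlides/DreamUI.
image = Image.open("synthetic_ui.png").convert("RGB")
caption = "A recipe finder screen with a search bar and cards for Creamy Pasta and Garden Salad."

model.train()
inputs = processor(images=image, text=caption, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # language-modeling loss on the caption
outputs.loss.backward()
optimizer.step()

# Generate a caption for a (new) screenshot.
model.eval()
with torch.no_grad():
    gen = model.generate(**processor(images=image, return_tensors="pt"), max_new_tokens=40)
print(processor.decode(gen[0], skip_special_tokens=True))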

Example Use Case of Design Modeling: Image Classification

Image classification is key to identifying themes and visual patterns in slides and UIs. With DreamStruct, models classify slides with 67.9% accuracy, significantly higher than the 33.5% accuracy of models trained on human-annotated datasets. Similarly, UI classification accuracy rose to 55.2% with synthetic training data, compared to 31% with human-annotated data. By using a larger, more diverse dataset, our method helps models recognize design elements in ways that generalize across both slides and UIs. Proprietary models like GPT-4V still lead in overall performance, but DreamStruct's data narrows the gap.

Reference

@inproceedings{peng2024dreamstruct,
        title={DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation},
        author={Peng, Yi-Hao and Huq, Faria and Jiang, Yue and Wu, Jason and Li, Amanda Xin Yue and Bigham, Jeffrey and Pavel, Amy},
        booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
        year={2024}
}

Acknowledgements

This work was funded in part by the National Science Foundation.

This webpage template was inspired by the BlobGAN project.