Large Language Model-Brained GUI Agents: A Survey
- Let's go through the papers to see how the data should be collected.


- 205 tasks
- This is a benchmark, so it is not a paper that trains a model

- 3 collections of data
  - web UI data crawled from the internet
    - HTML code
    - elements that display visible text content
    - elements with a special “title” attribute that display descriptive text when hovering
    - approximately 300k web pages from the latest Common Crawl repository
  - mobile UI data reorganized from public datasets
    - 3 types of data
      - widget captioning
      - mobile UI grounding
      - mobile UI summarization
    - 20k screenshots, 40k widgets, and 100k descriptions
  - general vision-language instruction-following data
- 30 task-specific prompts for each added GUI task, resulting in a 1M dataset to train
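
The web UI bullets above (elements with visible text, elements with a "title" attribute) could be approximated with a sketch like the one below. This is an assumption about how such filtering might be done, not the paper's actual pipeline: the notes do not say which parser was used, and the names `TextAndTitleCollector` and `extract_candidates` are hypothetical. It only reads static HTML with Python's standard `html.parser`; getting screenshots and element positions would additionally need a rendering browser, which is out of scope here.

```python
from html.parser import HTMLParser

# Tags whose text content is never shown to the user.
SKIP_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextAndTitleCollector(HTMLParser):
    """Collect (kind, tag, text) tuples for the two element types in the notes."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open (non-void) tags
        self.found = []   # collected candidates, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)
        # Type 2: elements with a non-empty "title" attribute (hover text).
        title = dict(attrs).get("title")
        if title and title.strip():
            self.found.append(("title", tag, title.strip()))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Type 1: elements that directly contain visible text.
        text = data.strip()
        if not text or not self.stack:
            return
        if any(t in SKIP_TAGS for t in self.stack):
            return
        self.found.append(("text", self.stack[-1], text))


def extract_candidates(html):
    """Return candidate elements from one HTML document."""
    parser = TextAndTitleCollector()
    parser.feed(html)
    parser.close()
    return parser.found


sample = (
    '<body><style>p{color:red}</style>'
    '<a href="/" title="Go home">Home</a>'
    '<img src="logo.png" title="Site logo">'
    '<p>Hello <b>world</b></p></body>'
)
for kind, tag, text in extract_candidates(sample):
    print(kind, tag, text)
```

Each tuple pairs an element with a text description, which is the raw material for the grounding-style samples the notes describe. Visibility here is only approximated by skipping `script`, `style`, and `noscript`; CSS-hidden elements are not detected.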



- 162,859 captions created by human workers for 61,285 UI elements across 21,750 unique screens from 6,470 mobile apps.
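
To get a feel for the annotation density implied by these counts, the averages can be worked out directly. This is only arithmetic on the four numbers above. The variable names are made up for illustration, and the averages say nothing about how captions are actually distributed across elements.

```python
# Counts copied from the notes above.
captions = 162_859
ui_elements = 61_285
screens = 21_750
apps = 6_470


def ratio(numerator, denominator):
    """Average count per unit, rounded to two decimals."""
    return round(numerator / denominator, 2)


captions_per_element = ratio(captions, ui_elements)  # about 2.66
elements_per_screen = ratio(ui_elements, screens)    # about 2.82
screens_per_app = ratio(screens, apps)               # about 3.36

print(f"captions per UI element: {captions_per_element}")
print(f"annotated elements per screen: {elements_per_screen}")
print(f"screens per app: {screens_per_app}")
```

So on average each element has between two and three captions, which is a useful reference point when estimating how many annotators per element a similar collection effort would need.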
