AgentStudio: A Toolkit for Building General Virtual Agents

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

https://github.com/njucckevin/SeeClick/blob/main/readme_data.md
GUI grounding benchmark ScreenSpot, encompassing more than 1200 instructions from various GUI platforms

3 collections of data
- web UI data crawled from the internet
  - HTML code
    - elements that display visible text content
    - elements with a special “title” attribute that display descriptive text when hovering
  - approximately 300k web pages from the latest Common Crawl repository
- mobile UI data reorganized from public datasets
  - 3 types of data
    - widget captioning
    - mobile UI grounding
    - mobile UI summarization
  - 20k screenshots, 40k widgets, and 100k descriptions
- general vision-language instruction-following data
30 task-specific prompts for each added GUI task, resulting in a 1M dataset for training
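
The two kinds of HTML elements listed under the web UI data (visible text, and a "title" attribute shown on hover) can be picked out with a small parser. This is a minimal sketch using only Python's standard `html.parser`, not SeeClick's actual pipeline. The class name `GroundingCandidateParser` and the `(kind, tag, text)` tuple layout are hypothetical, and a real grounding pipeline would also need each element's on-screen position from a rendered page, which this sketch does not compute.

```python
from html.parser import HTMLParser


class GroundingCandidateParser(HTMLParser):
    """Collect (kind, tag, text) for elements that show text or carry a title attribute."""

    SKIP = {"script", "style"}  # contents are never rendered as visible text
    VOID = {"img", "br", "hr", "input", "meta", "link"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._stack = []  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self._stack.append(tag)
        title = dict(attrs).get("title")
        if title:  # hover text
            self.candidates.append(("title", tag, title.strip()))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack and not any(t in self.SKIP for t in self._stack):
            self.candidates.append(("text", self._stack[-1], text))


parser = GroundingCandidateParser()
parser.feed('<a href="/" title="Go home">Home</a><img src="l.png" title="Logo">')
for kind, tag, text in parser.candidates:
    print(kind, tag, text)
```

Each candidate's text can then serve as the instruction that is paired with the element it came from.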

162,859 captions created by human workers for
- 61,285 UI elements
- across 21,750 unique screens
- from 6,470 mobile apps.
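
To get a feel for the density of this annotation, the counts above can be turned into average ratios. The variable names are only illustrative, and the ratios are plain averages derived from the four figures, not numbers reported by the source.

```python
# Counts copied from the notes above.
captions = 162_859
ui_elements = 61_285
screens = 21_750
apps = 6_470

# Simple averages; these say nothing about how the counts are distributed.
captions_per_element = captions / ui_elements
elements_per_screen = ui_elements / screens
screens_per_app = screens / apps

print(f"captions per UI element: {captions_per_element:.2f}")
print(f"annotated elements per screen: {elements_per_screen:.2f}")
print(f"screens per app: {screens_per_app:.2f}")
```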