# ScreenSpot

## GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows, and Web environments, along with annotated element types (Text or Icon/Widget).

This evaluation allows for both tasks (a schematic example of each follows the list):
- `screenspot_rec_test`: the original evaluation of `{img} {instruction} --> {bounding box}`, called grounding or Referring Expression Comprehension (REC);
- `screenspot_reg_test`: the new evaluation of `{img} {bounding box} --> {instruction}`, called instruction generation or Referring Expression Generation (REG).
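
For illustration only, a single ScreenSpot sample and the two task directions can be sketched as follows. The field names below are hypothetical and do not mirror the actual dataset schema:

```python
# Hypothetical sample layout; field names are illustrative, not the dataset's schema.
sample = {
    "image": "macos_screenshot.png",          # screenshot of the UI
    "instruction": "open the settings menu",  # natural-language description of the target
    "bbox": [0.82, 0.05, 0.95, 0.12],         # normalized (x1, y1, x2, y2) of the target element
    "data_type": "icon",                      # annotated element type: text or icon/widget
}

# screenspot_rec_test (REC): image + instruction --> predicted bounding box
rec_input = (sample["image"], sample["instruction"])

# screenspot_reg_test (REG): image with the bbox highlighted --> generated instruction
reg_input = (sample["image"], sample["bbox"])
```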

### REC Metrics

REC/Grounding requires that a model output a bounding box for the target element in the image. The evaluation metrics are (a minimal sketch of each follows the list):
- `IoU`: Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box.
- `ACC@IoU`: `IoU` is used to derive `ACC@IoU` metrics at different IoU thresholds, where an output with an IoU above the threshold is considered correct.
- `CENTER ACC`: the predicted bounding box is considered correct if its center lies within the ground-truth bounding box. This is the metric reported in the paper.
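
Here is a minimal sketch of how these metrics can be computed for a single prediction, assuming boxes are `(x1, y1, x2, y2)` tuples in the same coordinate space (the function names are ours, not the evaluation code's):

```python
def iou(pred, gt):
    """Intersection over Union between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = pred_area + gt_area - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(pred, gt, threshold=0.5):
    """ACC@IoU for one sample: correct if the IoU clears the threshold."""
    return int(iou(pred, gt) >= threshold)


def center_acc(pred, gt):
    """CENTER ACC for one sample: correct if the center of the predicted box lies inside the ground-truth box."""
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return int(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])
```

Dataset-level scores are then simply the means of these per-sample values.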

### REG Metrics

REG/Generation requires that a model output the instruction that describes the target element in the image. Currently, this element is highlighted in red in the image. The evaluation metric is (a brief sketch follows):
- `CIDEr`: CIDEr is used to evaluate the quality of the generated instruction. As the paper does not consider this task, we selected this metric as a standard for evaluating the quality of generated instructions. This matches what other works such as ScreenAI have done for instruction generation on the RICO datasets.
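
As a rough sketch, CIDEr can be computed with the `pycocoevalcap` package; this is one common implementation, and the actual task code may use a different one:

```python
# Illustrative only: the benchmark may compute CIDEr with a different implementation.
from pycocoevalcap.cider.cider import Cider

# Both dicts map a sample id to a list of strings.
references = {"0": ["click the search icon in the top right corner"]}
predictions = {"0": ["press the magnifying glass button at the top right"]}

scorer = Cider()
corpus_score, per_sample_scores = scorer.compute_score(references, predictions)
print(f"CIDEr: {corpus_score:.3f}")
```

Note that CIDEr is a corpus-level metric (its TF-IDF weights are computed over the reference set), so scores are only meaningful when computed across the full split rather than per sample.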

## Baseline Scores

As a baseline, here is how LLaVA-v1.5-7b performs on the ScreenSpot dataset:
- `IoU`: 0.051
- `CENTER ACC`: 0.097
- `CIDEr`: 0.097

## References

- ArXiv: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
- GitHub: [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)

```bibtex
@misc{cheng2024seeclick,
      title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
      author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
      year={2024},
      eprint={2401.10935},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}
```