# ScreenSpot

## GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1,200 instructions from iOS, Android, macOS, Windows, and Web environments, along with annotated element types (Text or Icon/Widget).

This evaluation allows for both:
- `screenspot_rec_test`: the original evaluation of `{img} {instruction} --> {bounding box}`, called grounding or Referring Expression Completion (REC);
- `screenspot_reg_test`: the new evaluation of `{img} {bounding box} --> {instruction}`, called instruction generation or Referring Expression Generation (REG).

### REC Metrics

REC/Grounding requires that a model output a bounding box for the target element in the image. The evaluation metrics are listed below, followed by a minimal code sketch:
- `IoU`: Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box.
- `ACC@IoU`: We use `IoU` to compute `ACC@IoU` metrics at different IoU thresholds, where an output with an IoU above the threshold is considered correct.
- `CENTER ACC`: The predicted bounding box is considered correct if its center lies within the ground-truth bounding box. This is the metric reported in the paper.
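
The sketch below is illustrative only: the helper names and the `(x1, y1, x2, y2)` box format are assumptions, not the evaluation harness's actual code.

```python
# Hedged sketch of the REC metrics; assumes boxes are (x1, y1, x2, y2) in pixels.
# Helper names are illustrative, not the harness's real functions.

def box_iou(pred, gt):
    """Intersection over Union between a predicted and a ground-truth box."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(pred, gt, threshold=0.5):
    """ACC@IoU: the prediction counts as correct if its IoU clears the threshold."""
    return float(box_iou(pred, gt) >= threshold)


def center_acc(pred, gt):
    """CENTER ACC: correct if the predicted box's center falls inside the ground-truth box."""
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return float(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])
```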

### REG Metrics

REG/Generation requires that a model output the instruction that describes the target element in the image. Currently, this element is highlighted in red in the image. The evaluation metric is:
- `CIDEr`: The CIDEr metric is used to evaluate the quality of the generated instruction. As the paper does not consider this task, we have selected CIDEr as the standard for judging generated instructions; this matches what other works, such as ScreenAI, have done for instruction generation on the RICO datasets. A small example of computing CIDEr is sketched below.
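
The example below is a hedged sketch, assuming the `pycocoevalcap` package (`pip install pycocoevalcap`) and simple whitespace tokenization; the sample ids and strings are made up, and the harness may tokenize and aggregate differently.

```python
# Illustrative CIDEr scoring of generated instructions with pycocoevalcap.
from pycocoevalcap.cider.cider import Cider

# Both dicts map a sample id to a list of lowercased, whitespace-tokenized strings;
# each prediction list must contain exactly one candidate.
references = {
    "sample_0": ["click the settings icon in the top right corner"],
    "sample_1": ["open the search bar at the top of the screen"],
}
predictions = {
    "sample_0": ["click the gear icon at the top right"],
    "sample_1": ["tap the search bar at the top"],
}

scorer = Cider()
corpus_score, per_sample_scores = scorer.compute_score(references, predictions)
print(f"CIDEr: {corpus_score:.3f}")
```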

## Baseline Scores

As a baseline, here is how LLaVA-v1.5-7b performs on the ScreenSpot dataset:
- `IoU`: 0.051
- `CENTER ACC`: 0.097
- `CIDEr`: 0.097

## References

- ArXiv: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
- GitHub: [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)

```bibtex
@misc{cheng2024seeclick,
      title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
      author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
      year={2024},
      eprint={2401.10935},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}
```
