Submissions will be evaluated on a private test set of individual surgical video files at 1 FPS, using the grand-challenge automated Docker submission and evaluation system. The test data and labels are hidden from participants. The following metrics will be used to evaluate algorithm performance and will be displayed on the leaderboards.

Category 1: Surgical tool detection

The standard COCO bounding box detection metric (mAP@[0.5:0.05:0.95], see https://cocodataset.org/#detection-eval) will be used for evaluation and for determining the winners. This is the mean average precision averaged over ten IoU thresholds, from 0.50 to 0.95 in steps of 0.05.
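To make the IoU thresholds concrete, here is a minimal sketch of how a predicted box is compared against a ground-truth box. It is not the official evaluation code. It assumes boxes in the COCO [x, y, width, height] format, and the function names iou_xywh and thresholds_matched are hypothetical helpers introduced only for illustration.

```python
def iou_xywh(box_a, box_b):
    """Intersection over union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds implied by mAP@[0.5:0.05:0.95].
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(pred_box, gt_box):
    """Return the IoU thresholds at which pred_box would count as a match."""
    iou = iou_xywh(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if iou >= t]
```

A prediction that overlaps the ground truth only loosely counts as a match at the lower thresholds but not at the higher ones, so tighter boxes score better under this metric. The full metric also involves confidence-ranked matching and per-class averaging, which this sketch does not cover.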

Category 2: Surgical step recognition
The mean weighted F1-score across all classes will be used for evaluation and for determining the winners.
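As an illustration, the sketch below computes a support-weighted F1-score from per-frame step labels. It assumes that "weighted" means each class's F1 is weighted by the number of ground-truth frames of that class, similar to average="weighted" in scikit-learn's f1_score; the organizers' exact definition and any averaging across videos may differ. The function name weighted_f1 is hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # ground-truth frame count per class
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p where it was not present
            fn[t] += 1  # missed the true class t
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        f1 = 2 * tp[cls] / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of frames
    return score
```

For example, weighted_f1([0, 0, 1, 1, 1, 2], [0, 1, 1, 1, 2, 2]) gives 2/3, because every class in that toy input has a per-class F1 of 2/3. Under this weighting, frequent steps influence the score more than rare ones.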