All of our development kits can be found here. For more details, check out the task-specific information below.
python convert_to_kitti_tracking.py --input json_folder --output output_dir --depth 0 # 2d tracking
python convert_to_kitti_tracking.py --input json_folder --output output_dir --depth 1 # 2d tracking
python scripts/run_jrdb_2d.py --gt gt_folder --tracker tracker_folder # 2d tracking
python scripts/run_jrdb_3d.py --gt gt_folder --tracker tracker_folder # 3d tracking
python scripts/run_jrdb_2d.py --gt data/gt/jrdb/jrdb_2d_box_train --tracker data/tracker/jrdb/jrdb_2d_box_train # 2d tracking
python scripts/run_jrdb_3d.py --gt data/gt/jrdb/jrdb_3d_box_train --tracker data/tracker/jrdb/jrdb_3d_box_train # 3d tracking
python convert_dataset_to_KITTI.py -i JRDB -o KITTI_converted_JRDB
g++ -O3 -o evaluate_object evaluate_object.cpp # compile the cpp file first
./evaluate_object path/to/groundtruth path/to/results 1 output_file.txt # then run the compiled binary with gt, predictions, and the output file
python ospa_2d_det.py --gt gt_folder --pred pred_folder # 2d detections
python ospa_3d_det.py --gt gt_folder --pred pred_folder # 3d detections
Human Pose Development Kit
The pose development kit contains code for evaluating predictions on JRDB-Pose.
Human Trajectory Forecasting Development Kit
The human trajectory forecasting development kit contains code for evaluation of forecast trajectories.
Panoptic Segmentation and Tracking
The panoptic segmentation and tracking development kit contains code for evaluation of panoptic segmentation and tracking.
We have also created a visualization toolkit to make it easy to visualize your predictions on JRDB. Check out the Visualisation Toolkit, which has been adapted from KITTI Object Visualisation.
We adopt well-established metrics and criteria from KITTI and AVA. Details about the criteria can be found in the following document:
Evaluation of Tracking: As with most datasets in TrackEval, we use several metric families to evaluate results: OSPA, Clear-MOT, HOTA, and Identity. Each of them contains a set of metrics. Additional metrics may be included later in the challenge.
Evaluation of Detection: We will use OSPA and precision to evaluate the performance of each detection submission. However, we will also report recall and AOS for 2D detection. Additional metrics may be included later in the challenge.
Evaluation of Action/Group/Activity Detection: We use mean Average Precision (mAP) to evaluate the performance of each task. We also provide detailed AP results per-sequence and per-category.
Evaluation of Pose Detection: We use both Average Precision (based on thresholded OKS) as well as OSPA-Pose to evaluate the performance of each task. We further provide detailed AP results per-sequence and per-category. Since we only label some people in a scene (tiny people will not be labeled), we forgive predicted poses for unlabeled people by matching poses with all ground-truth boxes.
Evaluation of Trajectory Forecasting: We use both EFE (End-to-end Forecasting Error) and OSPA-Trajectory to evaluate the performance of each submission. Since some people disappear in the hidden future, we forgive forecast trajectories for disappeared people by matching trajectories with all ground-truth trajectories.
2D/3D Tracking Benchmark:
The primary metric we use to evaluate tracking is MOTA, which combines false positives, false
negatives, and id switches. We also report MOTP, which is a measure of the localisation
accuracy of the tracking algorithm. Rank is determined by MOTA. We require
intersection-over-union to be greater than 50% for 2D tracking and 30% for 3D tracking.
MOTA is given by:

MOTA = 1 - ( Σ_{t=1}^{T} (FP_t + FN_t + IDS_t) ) / ( Σ_{t=1}^{T} GT_t )

where t indicates the frame number, T is the total number of frames, FP_t is the number of false positives in frame t, FN_t the number of false negatives in frame t, IDS_t the number of ID switches in frame t, and GT_t is the number of ground-truth objects in frame t.
MOTP is given by:

MOTP = ( Σ_{i=1}^{T} Σ_{j=1}^{c_i} d_{i,j} ) / ( Σ_{i=1}^{T} c_i )

where i indicates the frame number, T is the total number of frames, M_i is the total number of objects in frame i, c_i is the number of matches between predictions and ground truth in frame i, and d_{i,j} is the intersection-over-union distance (1 - IoU) of match j in frame i.
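As a minimal illustration of these definitions (not the official devkit code), the sketch below computes MOTA and MOTP from per-frame counts; all inputs are assumed to be lists indexed by frame.

# Minimal sketch (illustrative): MOTA and MOTP from per-frame counts,
# following the definitions above.

def mota(fp, fn, ids, gt):
    """fp, fn, ids, gt: per-frame counts of false positives, false negatives,
    ID switches, and ground-truth objects."""
    return 1.0 - (sum(fp) + sum(fn) + sum(ids)) / float(sum(gt))

def motp(distances, matches):
    """distances: per-frame lists of (1 - IoU) values, one per matched pair;
    matches: per-frame match counts c_i."""
    return sum(d for frame in distances for d in frame) / float(sum(matches))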
To evaluate 2D tracking, we run Clear-MOT metrics using an IoU threshold of 0.5. To evaluate 3D tracking, the 3D IoU is calculated using a combination of the Sutherland-Hodgman algorithm, which clips one polygon against another, and the shoelace formula (surveyor's formula), which gives the area of the resulting intersection polygon. A 3D-IoU threshold of 0.5 is used to determine matches.
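As a conceptual illustration only (not the official evaluation code), the sketch below computes the bird's-eye-view IoU of two convex boxes with Sutherland-Hodgman clipping plus the shoelace formula; corners are assumed counter-clockwise and degenerate/parallel-edge cases are ignored. For the full 3D IoU, this BEV intersection would additionally be combined with the overlap along the height axis.

# Minimal sketch: BEV IoU of two convex boxes via Sutherland-Hodgman + shoelace.

def shoelace_area(poly):
    """Area of a simple polygon [(x, y), ...] via the shoelace formula."""
    area = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def clip(subject, clipper):
    """Clip polygon `subject` against convex CCW polygon `clipper` (Sutherland-Hodgman)."""
    def inside(p, a, b):
        # p lies on the left of directed edge a->b (i.e. inside for a CCW clipper).
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of segment p-q with the infinite line through a-b.
        x1, y1, x2, y2 = *p, *q
        x3, y3, x4, y4 = *a, *b
        den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        px = ((x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)) / den
        py = ((x1 * y2 - y1 * x2) * (y3 - y4) - (y1 - y2) * (x3 * y4 - y3 * x4)) / den
        return (px, py)

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inputs, output = output, []
        if not inputs:
            break
        s = inputs[-1]
        for e in inputs:
            if inside(e, a, b):
                if not inside(s, a, b):
                    output.append(intersect(s, e, a, b))
                output.append(e)
            elif inside(s, a, b):
                output.append(intersect(s, e, a, b))
            s = e
    return output

def bev_iou(poly_a, poly_b):
    """IoU of two convex polygons given as CCW corner lists."""
    inter_poly = clip(poly_a, poly_b)
    inter = shoelace_area(inter_poly) if len(inter_poly) >= 3 else 0.0
    union = shoelace_area(poly_a) + shoelace_area(poly_b) - inter
    return inter / union if union > 0 else 0.0

# Two unit squares offset by 0.5 in x: IoU = 0.5 / 1.5 = 1/3.
print(bev_iou([(0, 0), (1, 0), (1, 1), (0, 1)],
              [(0.5, 0), (1.5, 0), (1.5, 1), (0.5, 1)]))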
Our benchmark also reports OSPA and HOTA.
Preparing Tracking Submissions:
Your submission will consist of a single zip file. Please ensure that the sequence folders are directly zipped and that you do not zip their parent folder. The folder structure and content of this file (e.g. result files) have to comply with the KITTI tracking format described here.
Expected Directory Structure of 2D/3D Tracking Submissions:
CIWT/data/0000.txt
                  /0001.txt
                  /0002.txt
...
                  /0026.txt
Each txt file corresponds to a test sequence, ordered alphabetically, e.g. 0000.txt corresponds to sequence cubberly-auditorium-2019-04-22_1 and 0026.txt corresponds to sequence tressider-2019-04-26_3.
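Since the mapping from result-file names to sequences is purely alphabetical, here is a small illustrative sketch of that mapping; only the two sequence names mentioned above are filled in, and the full list comes from the JRDB test split.

# Minimal sketch: map alphabetically ordered test-sequence names to the
# zero-padded result-file names (0000.txt ... 0026.txt) expected above.
# Only two sequence names from the text are shown; with the full 27-sequence
# test list, tressider-2019-04-26_3 maps to 0026.txt.
test_sequences = sorted([
    "cubberly-auditorium-2019-04-22_1",
    # ... remaining 25 test sequences ...
    "tressider-2019-04-26_3",
])
for idx, name in enumerate(test_sequences):
    print(f"{idx:04d}.txt -> {name}")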
2D/3D Tracking File and Label Format:
All values (numerical or strings) are separated via spaces and each row corresponds to one object. The 18 columns (17 values + 1 score value) represent:
frame, track id, type, truncated, occluded, alpha, bb_left, bb_top, bb_width, bb_height, x, y, z, height, width, length, rotation_y, score
The details are given below:
#Values  Name        Description
----------------------------------------------------------------------------
   1     frame       Frame within the sequence where the object appears
   1     track id    Unique tracking id of this object within this sequence
   1     type        Describes the type of object: 'Pedestrian' only
   1     truncated   Integer (0,1,2) indicating the level of truncation.
                     Note that this is in contrast to the object detection
                     benchmark, where truncation is a float in [0,1].
   1     occluded    Integer (0,1,2,3) indicating occlusion state:
                     0 = fully visible, 1 = partly occluded,
                     2 = largely occluded, 3 = unknown
   1     alpha       Observation angle of object, ranging [-pi..pi]
   4     bbox        2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
   3     location    3D object location x,y,z in camera coordinates (in meters)
   3     dimensions  3D object dimensions: height, width, length
                     (in camera coordinates - y_size, x_size, z_size; in meters)
   1     rotation_y  Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1     score       Only for results: float, indicating confidence in
                     detection, needed for p/r curves; higher is better.
The conf value contains the detection confidence in the det.txt files. For a submission, it acts as a flag indicating whether the entry is to be considered: a value of 0 means that this particular instance is ignored in the evaluation, while any other value marks it as active. For submitted results, all lines in the .txt file with a confidence of 1 are considered. Fields which are not used, such as the 2D bounding box for 3D tracking, or location, dimensions, and rotation_y for 2D tracking, must be set to -1. Note that an incorrect submission format may result in errors during evaluation or abnormal results.
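As a format illustration only (not the official devkit), the sketch below writes one row of a 2D tracking result file with the unused 3D fields set to -1 and the confidence flag set to 1. The 2D box columns follow the bb_left, bb_top, bb_width, bb_height naming in the column list above; the table instead lists left/top/right/bottom, so check the devkit for the exact convention.

# Minimal sketch: write one row of a 2D tracking result file in the format
# described above. The file would live under <tracker_name>/data/ per the
# expected directory structure.

def format_2d_track_row(frame, track_id, bb_left, bb_top, bb_width, bb_height):
    values = [
        frame, track_id, "Pedestrian",
        0,                                   # truncated (assumed placeholder)
        0,                                   # occluded (assumed placeholder)
        -1,                                  # alpha (unused for 2D tracking)
        bb_left, bb_top, bb_width, bb_height,  # 2D box
        -1, -1, -1,                          # x, y, z (unused for 2D tracking)
        -1, -1, -1,                          # height, width, length (unused)
        -1,                                  # rotation_y (unused)
        1,                                   # confidence flag: 1 = considered
    ]
    return " ".join(str(v) for v in values)

with open("0000.txt", "w") as f:
    f.write(format_2d_track_row(0, 1, 100.0, 50.0, 40.0, 120.0) + "\n")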
2D Object Detection Benchmark:
The goal in the 2D object detection task is to train object detectors for pedestrians in a 360° panorama image. The object detectors must provide as output the 2D 0-based bounding box in the image using the format specified above, as well as a detection score indicating the confidence in the detection. All other values must be set to their default values.
In our evaluation, we only evaluate detections on 2D bounding boxes larger than 500^2 pixel^2 in the image that are not fully occluded. For the evaluation criterion, inspired by PASCAL, we use 41-point interpolated AP and require the intersection-over-union of bounding boxes to be larger than 30%, 50%, or 70% for an object to be detected correctly.
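For clarity, here is a minimal sketch of 41-point interpolated AP: precision is interpolated at 41 equally spaced recall levels (0.0, 0.025, ..., 1.0) by taking the maximum precision at any recall at or above that level. This mirrors the standard KITTI-style computation and is not the exact devkit code.

# Minimal sketch (illustrative): 41-point interpolated average precision.
import numpy as np

def interpolated_ap_41(recall, precision):
    """recall, precision: per-detection arrays sorted by descending score."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 41):
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 41.0
    return ap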
3D Object Detection Benchmark:
The goal in the 3D object detection task is to train object detectors for pedestrians in lidar point clouds. The object detectors must provide the 3D bounding box (in the format specified above, i.e. 3D dimensions and 3D location) and the detection score/confidence. All other values must be set to their default values.
In our evaluation, we only evaluate detections on 3D bounding boxes which enclose more than 10 lidar points and lie within 25 meters in bird's eye view. For the evaluation criterion, inspired by PASCAL, we use 41-point interpolated AP and require the intersection-over-union of bounding boxes to be larger than 30%, 50%, or 70% for an object to be detected correctly.
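As a small illustration of these criteria, the sketch below checks whether a box would be evaluated. It assumes camera coordinates for the location fields, so the bird's-eye-view range is taken as sqrt(x^2 + z^2); the devkit's exact convention may differ.

# Minimal sketch (illustrative): 3D-evaluation filter per the criteria above.
import math

def is_evaluated_3d(num_points, x, y, z):
    """True if the box encloses more than 10 lidar points and its centre lies
    within 25 m in bird's-eye view (assumed camera coordinates: range from x, z;
    y is the vertical axis and is unused here)."""
    bev_range = math.hypot(x, z)
    return num_points > 10 and bev_range <= 25.0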
Preparing Detection Submissions:
Your submission will consist of a single zip file. Please ensure that the sequence folders are directly zipped and that you do not zip their parent folder. The folder structure and content of this file (e.g. result files) have to comply with the KITTI format described in:
Geiger, Andreas, Lenz, Philip, and Urtasun, Raquel. "Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite." 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012. http://www.cvlibs.net/datasets/kitti/index.php
The evaluation script expects a folder in the following structure:
cubberly-auditorium-2019-04-22_1/image_stitched/000000.txt
                                  /000001.txt
                                  /000002.txt
                                  ...
                                  /001078.txt
...
tressider-2019-04-26_3/image_stitched/000000.txt
                        /000001.txt
                        /000002.txt
                        ...
                        /001658.txt
Each subfolder represents a sequence and each text file within the subfolder is a label file of a given frame. The label files contain the following information. All values (numerical or strings) are separated by spaces, each row corresponds to one object.
The details are given below:
#Values  Name        Description
----------------------------------------------------------------------------
   1     type        Describes the type of object: 'Pedestrian' only
   1     truncated   Integer 0 (non-truncated) and 1 (truncated), where
                     truncated refers to the object leaving image boundaries.
                     * May be an arbitrary value for evaluation.
   1     occluded    Integer (0, 1, 2, 3) indicating occlusion state:
                     0 = fully visible, 1 = mostly visible,
                     2 = severely occluded, 3 = fully occluded.
                     * May be an arbitrary value for evaluation.
   1     num_points  Integer, number of points within a 3D bounding box.
                     * May be an arbitrary value for evaluation.
                     * May be a negative value to indicate a 2D bounding box
                       without a corresponding 3D bounding box.
   1     alpha       Observation angle of object, ranging [-pi..pi]
   4     bbox        2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
   3     dimensions  3D object dimensions: height, width, length
                     (in camera coordinates - y_size, x_size, z_size; in meters)
   3     location    3D object location x,y,z in camera coordinates (in meters)
   1     rotation_y  Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1     score       Only for results: float, indicating confidence in
                     detection, needed for p/r curves; higher is better.
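For convenience, here is a minimal, illustrative parser for one row of this label format; it simply maps the 17 columns above to named fields and is not part of the official devkit. The score column only appears in result files.

# Minimal sketch: parse one detection label row into named fields.
DET_FIELDS = [
    "type", "truncated", "occluded", "num_points", "alpha",
    "bbox_left", "bbox_top", "bbox_right", "bbox_bottom",
    "height", "width", "length",          # dimensions
    "x", "y", "z",                        # location
    "rotation_y", "score",
]

def parse_detection_row(line):
    tokens = line.split()
    row = {"type": tokens[0]}
    row.update({name: float(v) for name, v in zip(DET_FIELDS[1:], tokens[1:])})
    return row

# Example row with placeholder numbers.
row = parse_detection_row(
    "Pedestrian 0 0 120 -0.2 100.0 50.0 140.0 170.0 1.7 0.6 0.5 2.0 1.0 8.0 0.1 0.9"
)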
Individual Action Detection
The goal in this challenge is to train a classifier to predict the set of individual action labels
for each detected bounding box in the keyframes of each video sequence. We utilize task-1 to
evaluate the performance of the trained model.
The expected text files must be named as det_action.txt and gt_action.txt.
Social Group Detection
The goal in this challenge is to train a model to divide existing bounding boxes into different social groups, each indicated by a unique id.
We utilize task-2 and task-3 to evaluate the performance of the trained model.
The expected text files must be named as det_group.txt and gt_group.txt.
Social Activity Detection
The goal in this challenge is to train a classifier to predict the set of social activity labels for
each detected social group in the
keyframes of each video sequence. We utilize task-4 and task-5 to evaluate the performance of the
trained model.
The set of social activity labels for each group consists of the individual actions that are being performed by more than two people in that group.
The expected text files must be named as det_activity.txt and gt_activity.txt.
Preparing Action/Group/Activity Submissions:
Your submission will consist of a single zip file named e.g. "det_action.zip", "det_group.zip", or "det_activity.zip", depending on which challenge you are entering.
The evaluation script expects a folder in the following structure:
det_action/det_action.txt #action submission
det_group/det_group.txt #group submission
det_activity/det_activity.txt #activity submission
When preparing and evaluating your results on the training split on your own computer, the
ground truth data should be structured in the same manner, except you don't need to zip the folder.
For each challenge, the evaluation script expects a det.txt and a gt.txt file in the following structure:
#Values  Name                      Description
----------------------------------------------------------------------------
   1     sequence_id               Integer between 0 and 26, indicating the sequence id.
   1     keyframe-id               Integer, indicating the key-frame id in the specified sequence.
                                   * Evaluation is performed on key-frames, which are sampled
                                     every one second [15, 30, 45, ...].
   4     bounding-box coordinates  Float values of [x1, y1, x2, y2] in the image.
   1     social group id           Integer, indicating the social group id of the box.
                                   * Must be > 0, and boxes within the same social group should
                                     have the same group id.
                                   * An arbitrary value in task_1 and task_4.
   1     individual action id /    Integer, indicating the individual action or social activity
         social activity id        id of the box.
                                   * Must be > 0.
                                   * An arbitrary value in task_2 and task_3.
   1     score (Pred) / Diff (GT)  Float; in gt.txt it indicates the difficulty level of the
                                   label being evaluated. In det.txt, it indicates the confidence
                                   score of the predicted label being evaluated. In the social
                                   grouping challenge, it must be the confidence score of the
                                   detected bounding boxes.
All values are separated via spaces and each row corresponds to one individual action, social group id, or social activity label for a box. For the individual action and social activity challenges, each box can have multiple rows in the gt.txt and det.txt files, since each box can have multiple action/activity labels. However, in the social grouping challenge, each box must have exactly one row in the txt files, indicating its social group id.
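As an illustration of this row layout, the sketch below writes det_action.txt rows with the column order taken from the table above; the ids and scores are placeholders, and the README.txt inside the toolkit remains the authoritative reference for the exact format.

# Minimal sketch: write action-detection rows (sequence_id, keyframe_id,
# x1, y1, x2, y2, group_id, action_id, score), one row per predicted label.
import os

def action_row(seq_id, keyframe_id, box, group_id, action_id, score):
    x1, y1, x2, y2 = box
    vals = [seq_id, keyframe_id, x1, y1, x2, y2, group_id, action_id, score]
    return " ".join(str(v) for v in vals)

os.makedirs("det_action", exist_ok=True)
with open("det_action/det_action.txt", "w") as f:
    # A box with two predicted action labels gets two rows.
    f.write(action_row(0, 15, (100.0, 50.0, 140.0, 170.0), 1, 3, 0.9) + "\n")
    f.write(action_row(0, 15, (100.0, 50.0, 140.0, 170.0), 1, 7, 0.6) + "\n")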
For more information regarding the structure of text files, please refer to the README.txt file inside the toolkit. A guide on the utilized metrics and the evaluation strategy can be found here.
Human Pose Detection
The goal of this challenge is to train a model to predict the poses for all people in a scene.
The predicted keypoints should be the same ones as in the training data, as specified in the JRDB-Pose dataset details.
We evaluate predictions based on AP and OSPA-Pose. Since small people are not annotated with poses, we ignore predicted poses that are sufficiently similar to an unlabeled ground-truth box. Thus, we do not penalize predictions for unlabeled people in the scene.
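For reference, here is a minimal sketch of the COCO-style Object Keypoint Similarity (OKS) that underlies the thresholded-OKS AP mentioned above. The per-keypoint constants k and the visibility handling here are assumptions; the official values and matching logic are those in the JRDB-Pose devkit.

# Minimal sketch (illustrative): COCO-style OKS between a predicted and a
# ground-truth pose.
import numpy as np

def oks(pred, gt, visibility, area, k):
    """pred, gt: (N, 2) keypoint arrays; visibility: (N,) flags;
    area: object scale (s^2); k: (N,) per-keypoint constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = d2 / (2.0 * area * (k ** 2) + 1e-9)
    vis = visibility > 0
    if not vis.any():
        return 0.0
    return float(np.mean(np.exp(-e[vis])))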
Preparing Pose Detection Submissions:
We evaluate models on stitched images. Your submission will consist of a single zip file.
The evaluation script expects a folder in the following structure:
submission_folder
-- /cubberly-auditorium-2019-04-22_1.json
-- /discovery-walk-2019-02-28_0.json
-- /...
Each json file represents COCO-style annotations for each scene. For reference, this is the exact same
format in which we provide the annotations. Please make sure that you use the correct image ids.
When preparing and evaluating your results on the training split on your own computer, the
ground truth data should be structured in the same manner, except you don't need to zip the folder.
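Purely as a layout illustration, the sketch below writes one COCO-style json file per test scene, as expected above. The internal schema is not reproduced here and should be copied from the released JRDB-Pose annotation files; the coco_style_predictions_for helper is a hypothetical stand-in.

# Minimal sketch: lay out the expected submission folder, one json per scene.
import json
import os

def coco_style_predictions_for(scene):
    # Hypothetical placeholder: return a COCO-style dict for this scene,
    # mirroring the structure of the released JRDB-Pose annotation files.
    return {}

scenes = [
    "cubberly-auditorium-2019-04-22_1",
    "discovery-walk-2019-02-28_0",
    # ... remaining test scenes ...
]

os.makedirs("submission_folder", exist_ok=True)
for scene in scenes:
    with open(os.path.join("submission_folder", f"{scene}.json"), "w") as f:
        json.dump(coco_style_predictions_for(scene), f)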
End-to-end Human Trajectory Forecasting in the Wild
The goal of this challenge is to train a model to forecast the trajectories for all people in a scene given raw image and point clouds.
We evaluate predictions based on EFE and OSPA-2, with detailed definitions available in our JRDB-Traj paper.
To evaluate the performance of trajectory forecasting, we need to establish associations between predicted trajectories and ground-truth trajectories based on their corresponding IDs.
This enables us to measure the distance between these trajectories accurately.
In order to accomplish this, we adapt the OSPA metric specifically for the trajectory forecasting task, introducing the End-to-end Forecasting Error (EFE).
One crucial aspect of EFE is that it takes into account the possibility of individuals disappearing in the hidden part of the video clips in the ground truth, ensuring that the network is not penalized for such occurrences. In short, EFE determines the associations between predicted and ground-truth trajectories, measures their distances, and penalizes any mismatches in the number of trajectories.
For implementation details, please refer to the development kit documentation.
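The official EFE and OSPA-2 code lives in the development kit. Purely as a conceptual illustration of the ingredients described above (association, localization error, and a penalty for mismatched cardinality), here is a generic first-order OSPA distance between two finite point sets with cutoff c and order p; this is not the official EFE or OSPA-2 implementation, which operate on whole trajectories.

# Minimal sketch: generic OSPA distance between two sets of (x, y) points.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=1.0, p=1):
    X = np.asarray(X, dtype=float).reshape(-1, 2)
    Y = np.asarray(Y, dtype=float).reshape(-1, 2)
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m == 0 or n == 0:
        return float(c)                      # pure cardinality error
    if m > n:                                # ensure m <= n
        X, Y = Y, X
        m, n = n, m
    # Cutoff distances, then optimal assignment of the smaller set.
    D = np.minimum(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1), c) ** p
    rows, cols = linear_sum_assignment(D)
    cost = D[rows, cols].sum() + (c ** p) * (n - m)
    return float((cost / n) ** (1.0 / p))

print(ospa([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.1]], c=2.0))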
Preparing Human Trajectory Forecasting Submissions:
Your submission will consist of a single zip file.
Please ensure that the sequence folders are directly zipped and that you do not zip their parent folder.
The folder structure and content of this file (e.g. result files) have to comply with the KITTI tracking format, and the x, y locations are in the lidar coordinate system, as described here.
The evaluation script expects a folder in the following structure:
submission_folder
-- data
---- 0000.txt
---- 0001.txt
---- ...
---- 0026.txt
Each txt file corresponds to a test sequence, ordered alphabetically, e.g. 0000.txt corresponds to sequence cubberly-auditorium-2019-04-22_1 and 0026.txt corresponds to sequence tressider-2019-04-26_3.
Each row of a submission file should have the following 12 columns:
frame, track id, type, 0, 0, -1, -1, -1, -1, -1, x, y
The details are given below:
#Values  Name        Description
----------------------------------------------------------------------------
   1     frame       Frame within the sequence where the object appears
   1     track id    Unique tracking id of this object within this sequence
   1     type        Describes the type of object: 'Pedestrian' only
   2     location    Object location (x, y) in lidar coordinates (in meters),
                     i.e. the center of the person's bounding box on the ground
* Note that an incorrect submission format may result in errors during evaluation or abnormal results.
You can download and observe the outputs of one trained Social-LSTM from the Trajectory Forecasting leaderboard as a reference example for the submission format.
* For reading from the dataset, please refer to the 2D/3D Tracking Benchmark.
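As a format illustration only, the sketch below writes one row of a forecasting submission file using the fixed 12-column layout above; the x, y values are placeholders in lidar coordinates (meters).

# Minimal sketch: write one forecasting submission row
# (frame, track id, type, 0, 0, -1, -1, -1, -1, -1, x, y).
import os

def forecast_row(frame, track_id, x, y):
    vals = [frame, track_id, "Pedestrian", 0, 0, -1, -1, -1, -1, -1, x, y]
    return " ".join(str(v) for v in vals)

os.makedirs("submission_folder/data", exist_ok=True)
with open("submission_folder/data/0000.txt", "a") as f:
    f.write(forecast_row(120, 3, 4.2, -1.5) + "\n")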
Important notes of the challenge: