All of our development kits can be found here. For more details, check out the task-specific information below.
python convert_to_kitti_tracking.py --input json_folder --output output_dir --depth 0 # 2d tracking
python convert_to_kitti_tracking.py --input json_folder --output output_dir --depth 1 # 2d tracking
python scripts/run_jrdb_2d.py --gt gt_folder --tracker tracker_folder # 2d tracking
python scripts/run_jrdb_3d.py --gt gt_folder --tracker tracker_folder # 3d tracking
python scripts/run_jrdb_2d.py --gt data/gt/jrdb/jrdb_2d_box_train --tracker data/tracker/jrdb/jrdb_2d_box_train # 2d tracking
python scripts/run_jrdb_3d.py --gt data/gt/jrdb/jrdb_3d_box_train --tracker data/tracker/jrdb/jrdb_3d_box_train # 3d tracking
python convert_dataset_to_KITTI.py -i JRDB -o KITTI_converted_JRDB
g++ -O3 -o evaluate_object evaluate_object.cpp # compile the cpp file first
./evaluate_object path/to/groundtruth path/to/results 1 output_file.txt # then run the compiled binary with gt, predictions, and the output file
python ospa_2d_det.py --gt gt_folder --pred pred_folder # 2d detections
python ospa_3d_det.py --gt gt_folder --pred pred_folder # 3d detections
Human Pose Development Kit
The pose development kit contains code for evaluating predictions on JRDB-Pose.
Human Trajectory Forecasting Development Kit
The human trajectory forecasting development kit contains code for evaluation of forecast trajectories.
Panoptic Segmentation and Tracking
The panoptic segmentation and tracking development kit contains code for evaluation of panoptic segmentation and tracking.
We have also created a visualization toolkit to make it easy to visualize your predictions on JRDB. Check out the Visualisation Toolkit, which has been adapted from KITTI Object Visualisation.
We adopt well-established metrics and criteria from KITTI and AVA. Details about the criteria can be found in the following document:
Evaluation of Tracking: As with most datasets in TrackEval, we use several metric families to evaluate results: OSPA, Clear-MOT, HOTA, and Identity. Each of them contains a set of metrics. Additional metrics may be included later in the challenge.
Evaluation of Detection: We will use OSPA and precision to evaluate the performance of each detection submission. However, we will also report recall and AOS for 2D detection. Additional metrics may be included later in the challenge.
Evaluation of Action/Group/Activity Detection: We use mean Average Precision (mAP) to evaluate the performance of each task. We also provide detailed AP results per-sequence and per-category.
Evaluation of Pose Detection: We use both Average Precision (based on thresholded OKS) as well as OSPA-Pose to evaluate the performance of each task. We further provide detailed AP results per-sequence and per-category. Since we only label some people in a scene (tiny people will not be labeled), we forgive predicted poses for unlabeled people by matching poses with all ground-truth boxes.
Evaluation of Trajectory Forecasting: We use both EFE (End-to-end Forecasting Error) and OSPA-Trajectory to evaluate the performance of each submission. Since some people disappear in the hidden future, we forgive forecast trajectories for disappeared people by matching trajectories with all ground-truth trajectories.
2D/3D Tracking Benchmark:
The primary metric we use to evaluate tracking is MOTA, which combines false positives, false
negatives, and id switches. We also report MOTP, which is a measure of the localisation
accuracy of the tracking algorithm. Rank is determined by MOTA. We require
intersection-over-union to be greater than 50% for 2D tracking and 30% for 3D tracking.
MOTA is given by:

MOTA = 1 - ( Σ_{t=1}^{T} (FP_t + FN_t + IDS_t) ) / ( Σ_{t=1}^{T} GT_t )

where t indicates the frame number, T is the total number of frames, FP_t is the number of false positives in frame t, FN_t the number of false negatives in frame t, IDS_t the number of ID switches in frame t, and GT_t is the number of ground-truth objects in frame t.
MOTP is given by:

MOTP = ( Σ_{i=1}^{T} Σ_{j=1}^{c_i} d_{i,j} ) / ( Σ_{i=1}^{T} c_i )

where i indicates the frame number, T is the total number of frames, M_i is the total number of objects in frame i, c_i is the number of matches between predictions and ground truth in frame i, and d_{i,j} is the intersection-over-union distance (1 - IoU) of match j in frame i.
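As a minimal illustration of these definitions (not the official devkit code), the sketch below computes MOTA and MOTP from per-frame counts; all inputs are assumed to be lists indexed by frame.

# Minimal sketch (illustrative): MOTA and MOTP from per-frame counts,
# following the definitions above.

def mota(fp, fn, ids, gt):
    """fp, fn, ids, gt: per-frame counts of false positives, false negatives,
    ID switches, and ground-truth objects."""
    return 1.0 - (sum(fp) + sum(fn) + sum(ids)) / float(sum(gt))

def motp(distances, matches):
    """distances: per-frame lists of (1 - IoU) values, one per matched pair;
    matches: per-frame match counts c_i."""
    return sum(d for frame in distances for d in frame) / float(sum(matches))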
To evaluate 2D tracking, we run Clear-MOT metrics using an IoU threshold of 0.5. To evaluate 3D tracking, the 3D IoU is calculated using a combination of the Sutherland-Hodgman algorithm, which clips one polygon against another, and the shoelace formula (surveyor's formula), which gives the area of the resulting intersection polygon. A 3D-IoU threshold of 0.5 is used to determine matches.
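As a conceptual illustration only (not the official evaluation code), the sketch below computes the bird's-eye-view IoU of two convex boxes with Sutherland-Hodgman clipping plus the shoelace formula; corners are assumed counter-clockwise and degenerate/parallel-edge cases are ignored. For the full 3D IoU, this BEV intersection would additionally be combined with the overlap along the height axis.

# Minimal sketch: BEV IoU of two convex boxes via Sutherland-Hodgman + shoelace.

def shoelace_area(poly):
    """Area of a simple polygon [(x, y), ...] via the shoelace formula."""
    area = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def clip(subject, clipper):
    """Clip polygon `subject` against convex CCW polygon `clipper` (Sutherland-Hodgman)."""
    def inside(p, a, b):
        # p lies on the left of directed edge a->b (i.e. inside for a CCW clipper).
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of segment p-q with the infinite line through a-b.
        x1, y1, x2, y2 = *p, *q
        x3, y3, x4, y4 = *a, *b
        den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        px = ((x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)) / den
        py = ((x1 * y2 - y1 * x2) * (y3 - y4) - (y1 - y2) * (x3 * y4 - y3 * x4)) / den
        return (px, py)

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inputs, output = output, []
        if not inputs:
            break
        s = inputs[-1]
        for e in inputs:
            if inside(e, a, b):
                if not inside(s, a, b):
                    output.append(intersect(s, e, a, b))
                output.append(e)
            elif inside(s, a, b):
                output.append(intersect(s, e, a, b))
            s = e
    return output

def bev_iou(poly_a, poly_b):
    """IoU of two convex polygons given as CCW corner lists."""
    inter_poly = clip(poly_a, poly_b)
    inter = shoelace_area(inter_poly) if len(inter_poly) >= 3 else 0.0
    union = shoelace_area(poly_a) + shoelace_area(poly_b) - inter
    return inter / union if union > 0 else 0.0

# Two unit squares offset by 0.5 in x: IoU = 0.5 / 1.5 = 1/3.
print(bev_iou([(0, 0), (1, 0), (1, 1), (0, 1)],
              [(0.5, 0), (1.5, 0), (1.5, 1), (0.5, 1)]))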
Our benchmark also reports OSPA and HOTA.
Preparing Tracking Submissions:
Your submission will consist of a single zip file. Please ensure that the sequence folders are directly zipped and that you do not zip their parent folder. The folder structure and content of this file (e.g. result files) have to comply with the KITTI tracking format described here.
Expected Directory Structure of 2D/3D Tracking Submissions:
CIWT/data/0000.txt
                  /0001.txt
                  /0002.txt
...
                  /0026.txt
Each txt file corresponds to a test sequence, ordered alphabetically, e.g. 0000.txt corresponds to sequence cubberly-auditorium-2019-04-22_1 and 0026.txt corresponds to sequence tressider-2019-04-26_3.
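Since the mapping from result-file names to sequences is purely alphabetical, here is a small illustrative sketch of that mapping; only the two sequence names mentioned above are filled in, and the full list comes from the JRDB test split.

# Minimal sketch: map alphabetically ordered test-sequence names to the
# zero-padded result-file names (0000.txt ... 0026.txt) expected above.
# Only two sequence names from the text are shown; with the full 27-sequence
# test list, tressider-2019-04-26_3 maps to 0026.txt.
test_sequences = sorted([
    "cubberly-auditorium-2019-04-22_1",
    # ... remaining 25 test sequences ...
    "tressider-2019-04-26_3",
])
for idx, name in enumerate(test_sequences):
    print(f"{idx:04d}.txt -> {name}")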
2D/3D Tracking File and Label Format:
All values (numerical or strings) are separated via spaces and each row corresponds to one object. The 18 columns (17 values + 1 score value) represent:
frame, track id, type, truncated, occluded, alpha, bb_left, bb_top, bb_width, bb_height, x, y, z, height, width, length, rotation_y, score
The details are given below:
#Values  Name        Description
----------------------------------------------------------------------------
   1     frame       Frame within the sequence where the object appears
   1     track id    Unique tracking id of this object within this sequence
   1     type        Describes the type of object: 'Pedestrian' only
   1     truncated   Integer (0,1,2) indicating the level of truncation.
                     Note that this is in contrast to the object detection
                     benchmark, where truncation is a float in [0,1].
   1     occluded    Integer (0,1,2,3) indicating occlusion state:
                     0 = fully visible, 1 = partly occluded,
                     2 = largely occluded, 3 = unknown
   1     alpha       Observation angle of object, ranging [-pi..pi]
   4     bbox        2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
   3     location    3D object location x,y,z in camera coordinates (in meters)
   3     dimensions  3D object dimensions: height, width, length
                     (in camera coordinates - y_size, x_size, z_size; in meters)
   1     rotation_y  Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1     score       Only for results: float, indicating confidence in
                     detection, needed for p/r curves; higher is better.
The conf value contains the detection confidence in the det.txt files. For a submission, it acts as a flag indicating whether the entry is to be considered: a value of 0 means that this particular instance is ignored in the evaluation, while any other value marks it as active. For submitted results, all lines in the .txt file with a confidence of 1 are considered. Fields which are not used, such as the 2D bounding box for 3D tracking, or location, dimensions, and rotation_y for 2D tracking, must be set to -1. Note that an incorrect submission format may result in errors during evaluation or abnormal results.
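As a format illustration only (not the official devkit), the sketch below writes one row of a 2D tracking result file with the unused 3D fields set to -1 and the confidence flag set to 1. The 2D box columns follow the bb_left, bb_top, bb_width, bb_height naming in the column list above; the table instead lists left/top/right/bottom, so check the devkit for the exact convention.

# Minimal sketch: write one row of a 2D tracking result file in the format
# described above. The file would live under <tracker_name>/data/ per the
# expected directory structure.

def format_2d_track_row(frame, track_id, bb_left, bb_top, bb_width, bb_height):
    values = [
        frame, track_id, "Pedestrian",
        0,                                   # truncated (assumed placeholder)
        0,                                   # occluded (assumed placeholder)
        -1,                                  # alpha (unused for 2D tracking)
        bb_left, bb_top, bb_width, bb_height,  # 2D box
        -1, -1, -1,                          # x, y, z (unused for 2D tracking)
        -1, -1, -1,                          # height, width, length (unused)
        -1,                                  # rotation_y (unused)
        1,                                   # confidence flag: 1 = considered
    ]
    return " ".join(str(v) for v in values)

with open("0000.txt", "w") as f:
    f.write(format_2d_track_row(0, 1, 100.0, 50.0, 40.0, 120.0) + "\n")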
2D Object Detection Benchmark:
The goal in the 2D object detection task is to train object detectors for pedestrians in a 360° panorama image. The object detectors must provide as output the 2D 0-based bounding box in the image using the format specified above, as well as a detection score indicating the confidence in the detection. All other values must be set to their default values.
In our evaluation, we only evaluate detections on 2D bounding boxes larger than 500^2 pixel^2 in the image that are not fully occluded. For the evaluation criterion, inspired by PASCAL, we use 41-point interpolated AP and require the intersection-over-union of bounding boxes to be larger than 30%, 50%, or 70% for an object to be detected correctly.
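For clarity, here is a minimal sketch of 41-point interpolated AP: precision is interpolated at 41 equally spaced recall levels (0.0, 0.025, ..., 1.0) by taking the maximum precision at any recall at or above that level. This mirrors the standard KITTI-style computation and is not the exact devkit code.

# Minimal sketch (illustrative): 41-point interpolated average precision.
import numpy as np

def interpolated_ap_41(recall, precision):
    """recall, precision: per-detection arrays sorted by descending score."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 41):
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 41.0
    return ap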
3D Object Detection Benchmark:
The goal in the 3D object detection task is to train object detectors for pedestrians in lidar point clouds. The object detectors must provide the 3D bounding box (in the format specified above, i.e. 3D dimensions and 3D location) and the detection score/confidence. All other values must be set to their default values.
In our evaluation, we only evaluate detections on 3D bounding boxes which enclose more than 10 lidar points and lie within 25 meters in bird's eye view. For the evaluation criterion, inspired by PASCAL, we use 41-point interpolated AP and require the intersection-over-union of bounding boxes to be larger than 30%, 50%, or 70% for an object to be detected correctly.
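As a small illustration of these criteria, the sketch below checks whether a box would be evaluated. It assumes camera coordinates for the location fields, so the bird's-eye-view range is taken as sqrt(x^2 + z^2); the devkit's exact convention may differ.

# Minimal sketch (illustrative): 3D-evaluation filter per the criteria above.
import math

def is_evaluated_3d(num_points, x, y, z):
    """True if the box encloses more than 10 lidar points and its centre lies
    within 25 m in bird's-eye view (assumed camera coordinates: range from x, z;
    y is the vertical axis and is unused here)."""
    bev_range = math.hypot(x, z)
    return num_points > 10 and bev_range <= 25.0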
Preparing Detection Submissions:
Your submission will consist of a single zip file. Please ensure that the sequence folders are directly zipped and that you do not zip their parent folder. The folder structure and content of this file (e.g. result files) have to comply with the KITTI format described in:
Geiger, Andreas, Lenz, Philip, and Urtasun, Raquel. "Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite." 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012. http://www.cvlibs.net/datasets/kitti/index.php
The evaluation script expects a folder in the following structure:
cubberly-auditorium-2019-04-22_1/image_stitched/000000.txt
                                  /000001.txt
                                  /000002.txt
                                  ...
                                  /001078.txt
...
tressider-2019-04-26_3/image_stitched/000000.txt
                        /000001.txt
                        /000002.txt
                        ...
                        /001658.txt
Each subfolder represents a sequence and each text file within the subfolder is a label file of a given frame. The label files contain the following information. All values (numerical or strings) are separated by spaces, each row corresponds to one object.
The details are given below:
#Values  Name        Description
----------------------------------------------------------------------------
   1     type        Describes the type of object: 'Pedestrian' only
   1     truncated   Integer 0 (non-truncated) and 1 (truncated), where
                     truncated refers to the object leaving image boundaries.
                     * May be an arbitrary value for evaluation.
   1     occluded    Integer (0, 1, 2, 3) indicating occlusion state:
                     0 = fully visible, 1 = mostly visible,
                     2 = severely occluded, 3 = fully occluded.
                     * May be an arbitrary value for evaluation.
   1     num_points  Integer, number of points within a 3D bounding box.
                     * May be an arbitrary value for evaluation.
                     * May be a negative value to indicate a 2D bounding box
                       without a corresponding 3D bounding box.
   1     alpha       Observation angle of object, ranging [-pi..pi]
   4     bbox        2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
   3     dimensions  3D object dimensions: height, width, length
                     (in camera coordinates - y_size, x_size, z_size; in meters)
   3     location    3D object location x,y,z in camera coordinates (in meters)
   1     rotation_y  Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1     score       Only for results: float, indicating confidence in
                     detection, needed for p/r curves; higher is better.
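For convenience, here is a minimal, illustrative parser for one row of this label format; it simply maps the 17 columns above to named fields and is not part of the official devkit. The score column only appears in result files.

# Minimal sketch: parse one detection label row into named fields.
DET_FIELDS = [
    "type", "truncated", "occluded", "num_points", "alpha",
    "bbox_left", "bbox_top", "bbox_right", "bbox_bottom",
    "height", "width", "length",          # dimensions
    "x", "y", "z",                        # location
    "rotation_y", "score",
]

def parse_detection_row(line):
    tokens = line.split()
    row = {"type": tokens[0]}
    row.update({name: float(v) for name, v in zip(DET_FIELDS[1:], tokens[1:])})
    return row

# Example row with placeholder numbers.
row = parse_detection_row(
    "Pedestrian 0 0 120 -0.2 100.0 50.0 140.0 170.0 1.7 0.6 0.5 2.0 1.0 8.0 0.1 0.9"
)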
Individual Action Detection
The goal in this challenge is to train a classifier to predict the set of individual action labels
for each detected bounding box in the keyframes of each video sequence. We utilize task-1 to
evaluate the performance of the trained model.
The expected text files must be named as det_action.txt and gt_action.txt.
Social Group Detection
The goal in this challenge is to train a model to divide existing bounding boxes into different social groups, each indicated by a unique id.
We utilize task-2 and task-3 to evaluate the performance of the trained model.
The expected text files must be named as det_group.txt and gt_group.txt.
Social Activity Detection
The goal in this challenge is to train a classifier to predict the set of social activity labels for
each detected social group in the
keyframes of each video sequence. We utilize task-4 and task-5 to evaluate the performance of the
trained model.
The set of social activity labels for each group consists of the individual actions that are being performed by more than two people in that group.
The expected text files must be named as det_activity.txt and gt_activity.txt.
Preparing Action/Group/Activity Submissions:
Your submission will consist of a single zip file named e.g. "det_action.zip", "det_group.zip", or "det_activity.zip", depending on which challenge you are entering.
The evaluation script expects a folder in the following structure:
det_action/det_action.txt #action submission
det_group/det_group.txt #group submission
det_activity/det_activity.txt #activity submission
When preparing and evaluating your results on the training split on your own computer, the
ground truth data should be structured in the same manner, except you don't need to zip the folder.
For each challenge, the evaluation script expects a det.txt and a gt.txt file in the following structure:
#Values  Name                      Description
----------------------------------------------------------------------------
   1     sequence_id               Integer between 0 and 26, indicating the sequence id.
   1     keyframe-id               Integer, indicating the key-frame id in the specified sequence.
                                   * Evaluation is performed on key-frames, which are sampled
                                     every one second [15, 30, 45, ...].
   4     bounding-box coordinates  Float values of [x1, y1, x2, y2] in the image.
   1     social group id           Integer, indicating the social group id of the box.
                                   * Must be > 0, and boxes within the same social group should
                                     have the same group id.
                                   * An arbitrary value in task_1 and task_4.
   1     individual action id /    Integer, indicating the individual action or social activity
         social activity id        id of the box.
                                   * Must be > 0.
                                   * An arbitrary value in task_2 and task_3.
   1     score (Pred) / Diff (GT)  Float; in gt.txt it indicates the difficulty level of the
                                   label being evaluated. In det.txt, it indicates the confidence
                                   score of the predicted label being evaluated. In the social
                                   grouping challenge, it must be the confidence score of the
                                   detected bounding boxes.
All values are separated via spaces and each row corresponds to one individual action, social group id, or social activity label for a box. For the individual action and social activity challenges, each box can have multiple rows in the gt.txt and det.txt files, since each box can have multiple action/activity labels. However, in the social grouping challenge, each box must have exactly one row in the txt files, indicating its social group id.
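As an illustration of this row layout, the sketch below writes det_action.txt rows with the column order taken from the table above; the ids and scores are placeholders, and the README.txt inside the toolkit remains the authoritative reference for the exact format.

# Minimal sketch: write action-detection rows (sequence_id, keyframe_id,
# x1, y1, x2, y2, group_id, action_id, score), one row per predicted label.
import os

def action_row(seq_id, keyframe_id, box, group_id, action_id, score):
    x1, y1, x2, y2 = box
    vals = [seq_id, keyframe_id, x1, y1, x2, y2, group_id, action_id, score]
    return " ".join(str(v) for v in vals)

os.makedirs("det_action", exist_ok=True)
with open("det_action/det_action.txt", "w") as f:
    # A box with two predicted action labels gets two rows.
    f.write(action_row(0, 15, (100.0, 50.0, 140.0, 170.0), 1, 3, 0.9) + "\n")
    f.write(action_row(0, 15, (100.0, 50.0, 140.0, 170.0), 1, 7, 0.6) + "\n")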
For more information regarding the structure of text files, please refer to the README.txt file inside the toolkit. A guide on the utilized metrics and the evaluation strategy can be found here.
Human Pose Detection
The goal of this challenge is to train a model to predict the poses for all people in a scene.
The predicted keypoints should be the same ones as in the training data, as specified in the JRDB-Pose dataset details.
We evaluate predictions based on AP and OSPA-Pose. Since small people are not annotated with poses, we ignore predicted poses that are sufficiently similar to an unlabeled ground-truth box. Thus, we do not penalize predictions for unlabeled people in the scene.
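For reference, here is a minimal sketch of the COCO-style Object Keypoint Similarity (OKS) that underlies the thresholded-OKS AP mentioned above. The per-keypoint constants k and the visibility handling here are assumptions; the official values and matching logic are those in the JRDB-Pose devkit.

# Minimal sketch (illustrative): COCO-style OKS between a predicted and a
# ground-truth pose.
import numpy as np

def oks(pred, gt, visibility, area, k):
    """pred, gt: (N, 2) keypoint arrays; visibility: (N,) flags;
    area: object scale (s^2); k: (N,) per-keypoint constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = d2 / (2.0 * area * (k ** 2) + 1e-9)
    vis = visibility > 0
    if not vis.any():
        return 0.0
    return float(np.mean(np.exp(-e[vis])))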
Preparing Pose Detection Submissions:
We evaluate models on stitched images. Your submission will consist of a single zip file.
The evaluation script expects a folder in the following structure:
submission_folder
-- /cubberly-auditorium-2019-04-22_1.json
-- /discovery-walk-2019-02-28_0.json
-- /...
Each json file represents COCO-style annotations for each scene. For reference, this is the exact same
format in which we provide the annotations. Please make sure that you use the correct image ids.
When preparing and evaluating your results on the training split on your own computer, the
ground truth data should be structured in the same manner, except you don't need to zip the folder.
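Purely as a layout illustration, the sketch below writes one COCO-style json file per test scene, as expected above. The internal schema is not reproduced here and should be copied from the released JRDB-Pose annotation files; the coco_style_predictions_for helper is a hypothetical stand-in.

# Minimal sketch: lay out the expected submission folder, one json per scene.
import json
import os

def coco_style_predictions_for(scene):
    # Hypothetical placeholder: return a COCO-style dict for this scene,
    # mirroring the structure of the released JRDB-Pose annotation files.
    return {}

scenes = [
    "cubberly-auditorium-2019-04-22_1",
    "discovery-walk-2019-02-28_0",
    # ... remaining test scenes ...
]

os.makedirs("submission_folder", exist_ok=True)
for scene in scenes:
    with open(os.path.join("submission_folder", f"{scene}.json"), "w") as f:
        json.dump(coco_style_predictions_for(scene), f)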
End-to-end Human Trajectory Forecasting in the Wild
The goal of this challenge is to train a model to forecast the trajectories for all people in a scene given raw image and point clouds.
We evaluate predictions based on EFE and OSPA-2, with detailed definitions available in our JRDB-Traj paper.
To evaluate the performance of trajectory forecasting, we need to establish associations between predicted trajectories and ground-truth trajectories based on their corresponding IDs.
This enables us to measure the distance between these trajectories accurately.
In order to accomplish this, we adapt the OSPA metric specifically for the trajectory forecasting task, introducing the End-to-end Forecasting Error (EFE).
One crucial aspect of EFE is that it takes into account the possibility of individuals disappearing in the hidden part of the video clips in the ground truth, ensuring that the network is not penalized for such occurrences. In short, EFE determines the associations between predicted and ground-truth trajectories, measures their distances, and penalizes any mismatches in the number of trajectories.
For implementation details, please refer to the development kit documentation.
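The official EFE and OSPA-2 code lives in the development kit. Purely as a conceptual illustration of the ingredients described above (association, localization error, and a penalty for mismatched cardinality), here is a generic first-order OSPA distance between two finite point sets with cutoff c and order p; this is not the official EFE or OSPA-2 implementation, which operate on whole trajectories.

# Minimal sketch: generic OSPA distance between two sets of (x, y) points.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=1.0, p=1):
    X = np.asarray(X, dtype=float).reshape(-1, 2)
    Y = np.asarray(Y, dtype=float).reshape(-1, 2)
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m == 0 or n == 0:
        return float(c)                      # pure cardinality error
    if m > n:                                # ensure m <= n
        X, Y = Y, X
        m, n = n, m
    # Cutoff distances, then optimal assignment of the smaller set.
    D = np.minimum(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1), c) ** p
    rows, cols = linear_sum_assignment(D)
    cost = D[rows, cols].sum() + (c ** p) * (n - m)
    return float((cost / n) ** (1.0 / p))

print(ospa([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.1]], c=2.0))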
Preparing Human Trajectory Forecasting Submissions:
Your submission will consist of a single zip file.
Please ensure that the sequence folders are directly zipped and that you do not zip their parent folder.
The folder structure and content of this file (e.g. result files) have to comply with the KITTI tracking format, and the x, y locations are in the lidar coordinate system, as described here.
The evaluation script expects a folder in the following structure:
submission_folder
-- data
---- 0000.txt
---- 0001.txt
---- ...
---- 0026.txt
Each txt file corresponds to a test sequence, ordered alphabetically, e.g. 0000.txt corresponds to sequence cubberly-auditorium-2019-04-22_1 and 0026.txt corresponds to sequence tressider-2019-04-26_3.
Each row of a submission file should have the following 12 columns:
frame, track id, type, 0, 0, -1, -1, -1, -1, -1, x, y
The details are given below:
#Values  Name        Description
----------------------------------------------------------------------------
   1     frame       Frame within the sequence where the object appears
   1     track id    Unique tracking id of this object within this sequence
   1     type        Describes the type of object: 'Pedestrian' only
   2     location    Object location (x, y) in lidar coordinates (in meters),
                     i.e. the center of the person's bounding box on the ground
* Note that an incorrect submission format may result in errors during evaluation or abnormal results.
You can download and observe the outputs of one trained Social-LSTM from the Trajectory Forecasting leaderboard as a reference example for the submission format.
* For reading from the dataset, please refer to the 2D/3D Tracking Benchmark.
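As a format illustration only, the sketch below writes one row of a forecasting submission file using the fixed 12-column layout above; the x, y values are placeholders in lidar coordinates (meters).

# Minimal sketch: write one forecasting submission row
# (frame, track id, type, 0, 0, -1, -1, -1, -1, -1, x, y).
import os

def forecast_row(frame, track_id, x, y):
    vals = [frame, track_id, "Pedestrian", 0, 0, -1, -1, -1, -1, -1, x, y]
    return " ".join(str(v) for v in vals)

os.makedirs("submission_folder/data", exist_ok=True)
with open("submission_folder/data/0000.txt", "a") as f:
    f.write(forecast_row(120, 3, 4.2, -1.5) + "\n")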
Important notes of the challenge: