TL;DR: Let’s build a new dataset that pushes models to understand and segment events, the way people chunk what they watch into meaningful moments, rather than just recognize objects. For an initial study, we’ll create video sequences with dense, human-annotated event boundaries and sequence-level world-model prediction tasks.
Research Question: How can we systematically evaluate a model’s ability to perform temporal event segmentation and construct internal world models across a rich spectrum of physical and semantic events in video?
Hypothesis: Existing datasets miss the crucial challenge of segmenting and modeling temporally structured events. A new benchmark, with explicit event-boundary annotations and world state inference targets, will expose real differences between brute-force (context-heavy) and truly anticipatory (world-model-based) systems.
Experiment Plan:
- Dataset Build: Curate or synthesize video sequences that feature varied physical and social event structures. Annotate both low-level boundaries (perceptual) and high-level events (intentional/semantic), plus world-state prediction targets after each segment (inspired by the “Video-of-Thought”/CoT paradigm and Yates et al., 2023).
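To make the annotation layers above concrete, here is a minimal sketch of what one annotated video record might look like, plus one possible way to score predicted boundaries against the human ones. Everything in it is an assumption for illustration only: the class and field names (`Segment`, `AnnotatedVideo`, `world_state`), the 0.5-second tolerance, and the greedy-matching F1 are not specified by this idea and stand in for whatever schema and metric the benchmark ends up using.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    """One annotated event segment (all field names are hypothetical)."""
    start_s: float   # segment start, in seconds
    end_s: float     # segment end, in seconds; doubles as the boundary time
    level: str       # "perceptual" (low-level) or "semantic" (high-level)
    label: str       # free-text description of the event
    # Hypothetical world-state prediction target after this segment.
    world_state: Dict[str, str] = field(default_factory=dict)


@dataclass
class AnnotatedVideo:
    video_id: str
    duration_s: float
    segments: List[Segment]

    def boundaries(self, level: str) -> List[float]:
        """Sorted interior boundary times for one annotation level."""
        return sorted(
            s.end_s for s in self.segments
            if s.level == level and s.end_s < self.duration_s
        )


def boundary_f1(predicted: List[float], annotated: List[float],
                tol_s: float = 0.5) -> float:
    """F1 using greedy one-to-one matching within +/- tol_s seconds.

    Greedy matching is a simplification; it is not guaranteed to find
    the best possible assignment when boundaries are densely packed.
    """
    if not predicted or not annotated:
        return 0.0
    unmatched = sorted(annotated)
    hits = 0
    for p in sorted(predicted):
        match = next((a for a in unmatched if abs(a - p) <= tol_s), None)
        if match is not None:
            unmatched.remove(match)  # each annotation is matched at most once
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(annotated)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up annotations.
video = AnnotatedVideo("demo", 10.0, [
    Segment(0.0, 2.0, "perceptual", "ball rolls", {"ball": "moving"}),
    Segment(2.0, 5.0, "perceptual", "ball hits block", {"block": "fallen"}),
    Segment(5.0, 10.0, "perceptual", "ball stops", {"ball": "at rest"}),
    Segment(0.0, 10.0, "semantic", "person knocks over block", {"block": "fallen"}),
])
perceptual = video.boundaries("perceptual")  # [2.0, 5.0]
score = boundary_f1([2.2, 7.0], perceptual)  # one of two predictions matches
```

Keeping the perceptual and semantic levels in the same segment list, distinguished by a `level` field, is just one design choice; separate tracks per level would work equally well and might be easier for annotators to review.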
References:
1. Yang, S., Yang, J., Huang, P., Brown, E., Yang, Z., Yu, Y., Tong, S., Zheng, Z., Xu, Y., Wang, M., Lu, D., Fergus, R., LeCun, Y., Li, F., & Xie, S. (2025). Cambrian-S: Towards Spatial Supersensing in Video.
2. Yates, T., Yasuda, S., & Yildirim, I. (2023). Temporal segmentation and ‘look ahead’ simulation: Physical events structure visual perception of intuitive physics. bioRxiv.
3. Fei, H., Wu, S., Ji, W., Zhang, H., Zhang, M., Lee, M. L., & Hsu, W. (2024). Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition. International Conference on Machine Learning.
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{gpt-4.1-the-eventofthought-dataset-2025,
author = {GPT-4.1},
title = {The Event-of-Thought Dataset: A Benchmark for Temporal Event Segmentation and World Modeling in Video},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/S6psyB05hTaqnbhNAZ4X}
}