2025
arXiv
Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs
We introduce an integrated framework for generating narratives across multiple modalities simultaneously. Rather than relying on sequential pipelines, the system concurrently produces textual narratives, scene graphs, visual content, and emotional soundscapes. The framework comprises a language-model Narrator, a Director managing dynamic scene graphs for consistency, a Narrative Arc Controller for overall story structure, and an Affective Tone Mapper for emotional coherence. Qualitative evaluation on diverse narrative prompts demonstrates significantly enhanced narrative depth, visual fidelity, and emotional resonance compared to cascaded baseline methods.
@misc{ghorbani2025aether,title={Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs},author={Ghorbani, Saeed},year={2025},eprint={2507.21893},archiveprefix={arXiv},}
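To make the agent layout concrete, here is a minimal, self-contained Python sketch of the four-component loop described above. Every class, function, and field name (SceneGraph, narrative_arc, affective_tone, co_generate) is hypothetical, and a real Narrator would be a language-model call rather than a string template.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Entities and relations the Director keeps consistent across beats."""
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)  # (subject, predicate, object)

    def update(self, entities, relations):
        self.entities |= set(entities)
        self.relations |= set(relations)

def narrative_arc(beat_idx, n_beats):
    """Toy arc controller: tension rises to a climax, then resolves."""
    climax = int(0.7 * n_beats)
    return beat_idx / climax if beat_idx <= climax else (n_beats - beat_idx) / (n_beats - climax)

def affective_tone(tension):
    """Toy tone mapper: bucket arc tension into an emotional label."""
    return "calm" if tension < 0.33 else "tense" if tension < 0.75 else "climactic"

def co_generate(prompt, n_beats=5):
    graph = SceneGraph()
    story = []
    for i in range(n_beats):
        tension = narrative_arc(i, n_beats)
        tone = affective_tone(tension)
        # A real Narrator would be an LLM conditioned on prompt, graph, and tone.
        text = f"[beat {i}, tone={tone}] {prompt} ..."
        graph.update(entities=[f"entity_{i}"],
                     relations=[(f"entity_{i}", "appears_in", f"beat_{i}")])
        story.append((text, tone, tension))
    return story, graph

if __name__ == "__main__":
    story, graph = co_generate("A lighthouse keeper hears a voice in the fog")
    for text, tone, tension in story:
        print(f"{tension:.2f} {tone:>10}  {text}")
```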
2024
ECCV
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues
We introduce UPose3D, a novel approach for multi-view 3D human pose estimation, addressing challenges in accuracy and scalability. Our method advances existing pose estimation frameworks by improving robustness and flexibility without requiring direct 3D annotations. At the core of our method, a pose compiler module refines predictions from a 2D keypoints estimator by leveraging temporal and cross-view information. Our novel cross-view fusion strategy is scalable to any number of cameras, while our synthetic data generation strategy ensures generalization across diverse actors, scenes, and viewpoints. UPose3D leverages prediction uncertainty of both the 2D keypoint estimator and the pose compiler module, providing robustness to outliers and noisy data, resulting in state-of-the-art performance in out-of-distribution settings.
@inproceedings{davoodnia2024upose3d,title={UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues},author={Davoodnia, Vandad and Ghorbani, Saeed and Carbonneau, Marc-Andr{\'e} and Messier, Alexandre and Etemad, Ali},booktitle={European Conference on Computer Vision (ECCV)},year={2024},}
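The pose compiler itself is a learned module, but the core idea of fusing per-view 2D estimates weighted by their uncertainty can be pictured with classical confidence-weighted DLT triangulation. The sketch below is an assumption-laden stand-in, not the paper's method: it down-weights rows of the linear system by keypoint confidence before taking the SVD null-space solution.

```python
import numpy as np

def triangulate_weighted(points_2d, confidences, proj_mats):
    """
    Confidence-weighted linear (DLT) triangulation of one joint.

    points_2d:   (V, 2) pixel coordinates, one row per camera view
    confidences: (V,)   per-view keypoint confidence in [0, 1]
    proj_mats:   (V, 3, 4) camera projection matrices
    Returns the 3D point minimising the weighted algebraic error.
    """
    rows = []
    for (u, v), w, P in zip(points_2d, confidences, proj_mats):
        # Each view contributes two DLT equations, scaled by its confidence.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    # Null-space solution: last right-singular vector of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```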
ECCVW
SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers
Vandad Davoodnia, Saeed Ghorbani, Alexandre Messier, and Ali Etemad
We present a markerless motion capture pipeline for 3D human pose and shape estimation. Our two-stage approach first obtains 3D joint positions from the outputs of existing 2D keypoint detectors, then applies a regression-based inverse-kinematic skeletal transformer that recovers pose and shape representations from these heavily noisy observations. By integrating pose space priors and separating 3D keypoint detection from inverse kinematics, our model achieves strong generalization to unseen noisy data. We evaluate our approach across three public datasets in both in-distribution and out-of-distribution settings, demonstrating robust performance against noise and occlusions.
@inproceedings{davoodnia2024skelformer,title={SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers},author={Davoodnia, Vandad and Ghorbani, Saeed and Messier, Alexandre and Etemad, Ali},booktitle={ECCV Workshop on Computer Vision for Metaverse},year={2024},}
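As a rough illustration of the stage-2 regressor, the sketch below maps noisy 3D joint positions to pose and shape vectors with a small transformer encoder. All dimensions (24 joints, 6D rotations, 10 shape coefficients, echoing SMPL-style conventions) and the head layout are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SkeletalTransformer(nn.Module):
    """Toy stand-in for the stage-2 regressor: noisy 3D joints -> pose/shape."""
    def __init__(self, n_joints=24, d_model=128, n_pose=24 * 6, n_shape=10):
        super().__init__()
        self.embed = nn.Linear(3, d_model)              # per-joint xyz -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pose_head = nn.Linear(d_model * n_joints, n_pose)    # e.g. 6D rotations
        self.shape_head = nn.Linear(d_model * n_joints, n_shape)  # e.g. shape betas

    def forward(self, joints_3d):                       # (B, J, 3)
        tokens = self.encoder(self.embed(joints_3d))    # (B, J, d_model)
        flat = tokens.flatten(1)
        return self.pose_head(flat), self.shape_head(flat)

pose, shape = SkeletalTransformer()(torch.randn(2, 24, 3))
```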
ECCV
Real-Time Neural Cloth Deformation using a Compact Latent Space and a Latent Vector Predictor
Chanhaeng Lee, Mykhailo Perepichka, Saeed Ghorbani, Sudhir Mudur, Eric Paquette, and Tiberiu Popa
In European Conference on Computer Vision (ECCV) 2024
We present a neural network approach for real-time garment simulation on human figures. A two-stage training framework first compresses cloth vertex data into a compact latent representation, then trains a latent vector predictor that efficiently decodes it into blend shape weights for realistic deformation. The method prioritizes the computational efficiency needed for interactive gaming applications while maintaining visual fidelity in cloth behaviour.
@inproceedings{lee2024cloth,title={Real-Time Neural Cloth Deformation using a Compact Latent Space and a Latent Vector Predictor},author={Lee, Chanhaeng and Perepichka, Mykhailo and Ghorbani, Saeed and Mudur, Sudhir and Paquette, Eric and Popa, Tiberiu},booktitle={European Conference on Computer Vision (ECCV)},year={2024},doi={10.1007/978-3-031-92387-6_25}}
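A minimal sketch of the two-stage idea, with hypothetical dimensions: stage 1 learns a compact latent space over cloth vertices, and stage 2 trains a light predictor from body pose to that latent code, so the runtime path is just predictor plus decoder. The paper decodes into blend shape weights; here the decoder emits vertex data directly for brevity.

```python
import torch
import torch.nn as nn

class ClothAutoencoder(nn.Module):
    """Stage 1: compress flattened cloth vertex data into a compact latent."""
    def __init__(self, n_verts=4000, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_verts * 3, 512), nn.ReLU(),
                                 nn.Linear(512, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(),
                                 nn.Linear(512, n_verts * 3))

    def forward(self, verts):                 # (B, n_verts*3)
        z = self.enc(verts)
        return self.dec(z), z

class LatentPredictor(nn.Module):
    """Stage 2: predict the latent code from body pose; the heavy encoder is
    never run at game time."""
    def __init__(self, pose_dim=72, latent=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent))

    def forward(self, pose):
        return self.net(pose)

ae, pred = ClothAutoencoder(), LatentPredictor()
recon, z = ae(torch.randn(2, 4000 * 3))       # stage-1 training pair
cloth = ae.dec(pred(torch.randn(2, 72)))      # runtime: pose -> latent -> cloth
```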
2023
CGF
ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech
We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. Style can be controlled via only a short example motion clip, even for motion styles unseen during training. Our model uses a variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or the blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. In a user study, we show that our model outperforms previous state-of-the-art techniques in naturalness of motion, appropriateness for speech, and style portrayal. We also release a high-quality dataset of full-body gesture motion including fingers, with speech, spanning 19 different styles.
@article{ghorbani2023zeroeggs,author={Ghorbani, Saeed and Ferstl, Ylva and Holden, Daniel and Troje, Nikolaus F. and Carbonneau, Marc-Andr{\'e}},title={ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech},journal={Computer Graphics Forum},year={2023},keywords={animation, gestures, character control, motion capture},doi={10.1111/cgf.14734},}
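Because style lives in a learned latent space, blending and scaling are just vector arithmetic on embeddings. The snippet below sketches that manipulation; the embedding size and the z_happy / z_old stand-ins are hypothetical placeholders for encoder outputs.

```python
import numpy as np

def blend_styles(embeddings, weights):
    """Convex blend of style embeddings extracted from example clips."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (w[:, None] * np.asarray(embeddings)).sum(axis=0)

def scale_style(embedding, alpha):
    """Exaggerate (alpha > 1) or attenuate (alpha < 1) a style."""
    return alpha * np.asarray(embedding)

rng = np.random.default_rng(0)
z_happy, z_old = rng.normal(size=(2, 64))   # stand-ins for encoder outputs
# 70% "happy" + 30% "old", exaggerated 1.5x, then fed to the gesture decoder
z_style = scale_style(blend_styles([z_happy, z_old], [0.7, 0.3]), 1.5)
```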
2022
ICMI
Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022
We present our entry to the GENEA Challenge 2022 on data-driven co-speech gesture generation. Our system is a neural network that generates gesture animation from an input audio file. The motion style generated by the model is extracted from an exemplar motion clip and embedded in a latent space using a variational framework, allowing generalization to styles unseen during training. The GENEA challenge evaluation showed that our model produces full-body motion with highly competitive levels of human-likeness.
@inproceedings{ghorbani2022exemplar,title={Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022},author={Ghorbani, Saeed and Ferstl, Ylva and Carbonneau, Marc-Andr{\'e}},booktitle={International Conference on Multimodal Interaction},pages={778--783},year={2022},}
APIN
Estimating Pose from Pressure Data for Smart Beds with Deep Image-based Pose Estimators
In-bed pose estimation has shown value in fields such as hospital patient monitoring, sleep studies, and smart homes. We explore different strategies for detecting body pose from highly ambiguous pressure data, with the aid of pre-existing pose estimators. We examine pre-trained pose estimators used either directly or retrained on two pressure datasets, and explore a learnable pre-processing domain adaptation step that transforms vague pressure maps into a representation closer to the expected input space of common pose estimation modules. Our analysis shows that combining the learnable pre-processing module with retraining pre-existing image-based pose estimators on pressure data yields the highest pose estimation accuracy among the strategies studied.
@article{davoodnia2022beds,journal={Applied Intelligence},title={Estimating Pose from Pressure Data for Smart Beds with Deep Image-based Pose Estimators},author={Davoodnia, Vandad and Ghorbani, Saeed and Etemad, Ali},publisher={Springer},year={2022}}
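The learnable pre-processing idea, shared by this paper and the ICASSP 2021 work below, can be sketched as a small convolutional adapter in front of an off-the-shelf pose estimator. Layer sizes here are illustrative, and the training comment describes the general recipe rather than the papers' exact losses.

```python
import torch
import torch.nn as nn

class PressureToImage(nn.Module):
    """Learnable pre-processing: map a 1-channel pressure map to a 3-channel,
    image-like tensor matched to a pose estimator's expected input domain."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),  # image-range output
        )

    def forward(self, pressure):          # (B, 1, H, W)
        return self.net(pressure)         # (B, 3, H, W)

# Training idea: place a pre-trained pose estimator (frozen or fine-tuned)
# behind this module and backpropagate its keypoint loss into the adapter.
adapter = PressureToImage()
image_like = adapter(torch.rand(1, 1, 64, 32))
```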
2021
ICASSP
In-bed Pressure-based Pose Estimation using Image Space Representation Learning
We address the challenge of pose estimation from in-bed pressure sensing systems by presenting a novel end-to-end framework that accurately locates body parts from vague pressure data. Our method equips an off-the-shelf pose estimator with a deep trainable neural network that pre-processes and prepares the pressure data for subsequent pose estimation, transforming ambiguous pressure maps into images with shapes and structures similar to the expected input domain of the pose estimator. Results confirm high visual quality in the generated images and high pose estimation accuracy.
@inproceedings{davoodnia2021,title={In-bed Pressure-based Pose Estimation using Image Space Representation Learning},author={Davoodnia, Vandad and Ghorbani, Saeed and Etemad, Ali},booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},year={2021},}
PLOS ONE
MoVi: A Large Multi-Purpose Human Motion and Video Dataset
We introduce MoVi, a large human Motion and Video dataset made publicly available. It contains 60 female and 30 male actors performing a collection of 20 predefined everyday actions and sports movements, and one self-chosen movement. In five capture rounds, the same actors and movements were recorded using different hardware systems, including an optical motion capture system, video cameras, and inertial measurement units (IMU). In total, the dataset contains 9 hours of motion capture data, 17 hours of video data from 4 viewpoints (including one hand-held camera), and 6.6 hours of IMU data. We present state-of-the-art estimates of skeletal motions and full-body shape deformations associated with skeletal motion.
@article{ghorbani2021movi,title={MoVi: A Large Multi-Purpose Human Motion and Video Dataset},author={Ghorbani, Saeed and Mahdaviani, Kimia and Thaler, Anne and Kording, Konrad and Cook, Douglas James and Blohm, Gunnar and Troje, Nikolaus F.},year={2021},journal={PLOS ONE},}
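For readers who want to explore the dataset, a hedged loading sketch: MoVi distributes motion data as MATLAB files, which scipy can read. The file name below is a hypothetical example; check the dataset documentation for the actual naming and field layout.

```python
import scipy.io as sio

# Hypothetical file name; see the MoVi documentation for actual naming.
data = sio.loadmat("F_v3d_Subject_1.mat", simplify_cells=True)
print(sorted(data.keys()))  # inspect available fields before indexing further
```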
ICPR
Gait Recognition using Multi-Scale Partial Representation Transformation with Capsules
We propose a novel deep network for gait recognition that learns to transfer multi-scale partial gait representations using capsules to obtain more discriminative gait features. Our network obtains multi-scale partial representations using a deep partial feature extractor, recurrently learns correlations among partial features using Bi-directional Gated Recurrent Units, and uses a capsule network to learn deeper part-whole relationships. The method achieves superior performance on CASIA-B and OU-MVLP datasets across four challenging test protocols, notably under challenging viewing and carrying conditions.
@inproceedings{sepas2021gait,title={Gait Recognition using Multi-Scale Partial Representation Transformation with Capsules},author={Sepas-Moghaddam, Alireza and Ghorbani, Saeed and Troje, Nikolaus F. and Etemad, Ali},booktitle={International Conference on Pattern Recognition (ICPR)},year={2021},}
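The multi-scale partial representation step can be pictured as horizontal pyramid pooling over a convolutional feature map, followed by a recurrent pass over the resulting parts. The sketch below shows just those two stages with made-up sizes; the capsule routing that follows in the paper is omitted.

```python
import torch
import torch.nn as nn

def multiscale_parts(feat, scales=(1, 2, 4)):
    """Horizontally split a (B, C, H, W) feature map into parts at several
    scales and pool each part to a vector -> (B, n_parts, C)."""
    parts = []
    for s in scales:
        for chunk in feat.chunk(s, dim=2):             # split along height
            parts.append(chunk.mean(dim=(2, 3)))       # average-pool each part
    return torch.stack(parts, dim=1)

class PartRelationEncoder(nn.Module):
    """Toy stand-in: a BiGRU learns correlations among partial features; the
    paper routes these into a capsule network afterwards."""
    def __init__(self, channels=64, hidden=64):
        super().__init__()
        self.gru = nn.GRU(channels, hidden, bidirectional=True, batch_first=True)

    def forward(self, parts):                          # (B, n_parts, C)
        out, _ = self.gru(parts)
        return out                                     # (B, n_parts, 2*hidden)

feats = torch.randn(2, 64, 16, 8)
encoded = PartRelationEncoder()(multiscale_parts(feats))
```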
2020
CGF
Probabilistic Character Motion Synthesis using a Hierarchical Deep Latent Variable Model
We present a probabilistic framework to generate character animations based on weak control signals, such that the synthesized motions are realistic while retaining the stochastic nature of human movement. The proposed architecture, a hierarchical recurrent model, maps each sub-sequence of motions into a stochastic latent code using a variational autoencoder extended over the temporal domain. We also propose an objective function which respects the impact of each joint on the pose and compares joint angles based on angular distance. We demonstrate the ability of our model to generate convincing and diverse periodic and non-periodic motion sequences without the need for strong control signals.
@article{ghorbani2020motion,journal={Computer Graphics Forum (Symposium on Computer Animation)},title={Probabilistic Character Motion Synthesis using a Hierarchical Deep Latent Variable Model},author={Ghorbani, Saeed and Wloka, Calden and Etemad, Ali and Brubaker, Marcus A. and Troje, Nikolaus F.},year={2020},publisher={The Eurographics Association and John Wiley \& Sons Ltd.},issn={1467-8659},doi={10.1111/cgf.14116},talk={https://www.youtube.com/watch?v=r9F74LcGC0A}}
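The angular-distance objective mentioned above can be written down directly for quaternion-parameterised joints: a per-joint geodesic angle, weighted by each joint's influence on the pose. The weighting scheme here is an assumption for illustration.

```python
import numpy as np

def quat_geodesic(q_pred, q_true):
    """Angular distance between unit quaternions; abs() handles the q == -q
    double cover, clip() guards arccos against rounding error."""
    dot = np.abs(np.sum(q_pred * q_true, axis=-1)).clip(0.0, 1.0)
    return 2.0 * np.arccos(dot)

def weighted_pose_loss(q_pred, q_true, joint_weights):
    """Sum of per-joint angular errors, weighted by each joint's impact on
    the overall pose (e.g. hips above fingers). Inputs: (J, 4) quaternions."""
    return np.sum(joint_weights * quat_geodesic(q_pred, q_true))
```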
2019
CGI
Auto-labelling of Markers in Optical Motion Capture by Permutation Learning
Optical marker-based motion capture is a vital tool in applications such as motion and behavioural analysis, animation, and biomechanics. Labelling, i.e., assigning optical markers to pre-defined positions on the body, is a time-consuming and labour-intensive postprocessing step. We present a framework for automatic marker labelling which estimates a permutation matrix for each individual frame using a differentiable permutation learning model, then utilizes temporal consistency to identify and correct remaining labelling errors.
@inproceedings{ghorbani2019auto,title={Auto-labelling of Markers in Optical Motion Capture by Permutation Learning},author={Ghorbani, Saeed and Etemad, Ali and Troje, Nikolaus F.},booktitle={Computer Graphics International (CGI)},pages={167--178},year={2019},organization={Springer},award={Best Paper Award}}
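A standard way to make permutation estimation differentiable, and a plausible reading of the model above, is Sinkhorn normalisation: alternately normalising the rows and columns of a score matrix so it approaches a doubly-stochastic (relaxed permutation) matrix. The sketch below shows the relaxation only; iteration counts and the rounding step are illustrative.

```python
import numpy as np

def sinkhorn(log_scores, n_iters=20):
    """Sinkhorn normalisation in log space: a differentiable relaxation of a
    permutation matrix built from raw marker-to-label scores."""
    log_p = log_scores.copy()
    for _ in range(n_iters):
        log_p -= np.log(np.sum(np.exp(log_p), axis=1, keepdims=True))  # rows
        log_p -= np.log(np.sum(np.exp(log_p), axis=0, keepdims=True))  # cols
    return np.exp(log_p)

# At inference, round the relaxed matrix to a hard assignment (e.g. the
# Hungarian algorithm on the negated scores) so each marker gets one label.
P = sinkhorn(np.random.randn(5, 5))
print(P.sum(axis=0), P.sum(axis=1))  # both close to all-ones
```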
CVR
Automatic Initialization and Tracking of Markers in Optical Motion Capture by Learning to Rank
@inproceedings{ghorbani2019cvr,title={Automatic Initialization and Tracking of Markers in Optical Motion Capture by Learning to Rank},author={Ghorbani, Saeed and Etemad, Ali and Troje, Nikolaus F.},booktitle={CVR Vision Conference},year={2019},organization={CVR},award={Best Poster Award}}
2010
WCSP
Sub-pixel Image Registration based on Physical Forces
Ali Ghayoor, Saeed Ghorbani, and Ali Asghar Beheshti Shirazi
In International Conference on Wireless Communications & Signal Processing (WCSP) 2010
We present a method for sub-pixel image registration based on physical forces, treating images as charged materials that attract each other. Registration parameters (translation and rotation) are estimated simultaneously as one image moves in the direction of the applied force until the resultant force reaches zero. Sub-pixel accuracy is achieved by applying the Canny edge detector and using interpolation techniques, reducing registration error to below one pixel.
@inproceedings{ghayoor2010sub,title={Sub-pixel Image Registration based on Physical Forces},author={Ghayoor, Ali and Ghorbani, Saeed and Shirazi, Ali Asghar Beheshti},booktitle={International Conference on Wireless Communications \& Signal Processing (WCSP)},pages={1--5},year={2010},organization={IEEE},}
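The physical analogy translates into a short iteration: treat edge pixels as unit charges, compute the Coulomb-style resultant force on the moving image, and step its translation along that force until it (near-)vanishes. The sketch below covers translation only, with made-up step sizes; the rotation estimate, the Canny step, and the interpolation that yield sub-pixel accuracy are omitted.

```python
import numpy as np

def net_force(src_pts, dst_pts, eps=1e-6):
    """Coulomb-like resultant force pulling source edge points toward target
    edge points; each edge pixel acts as a unit charge."""
    f = np.zeros(2)
    for p in src_pts:
        d = dst_pts - p                       # vectors to every target charge
        r2 = np.sum(d * d, axis=1) + eps
        f += np.sum(d / (r2 * np.sqrt(r2))[:, None], axis=0)   # 1/r^2 falloff
    return f / len(src_pts)

def register_translation(src_pts, dst_pts, step=0.5, n_iters=200, tol=1e-4):
    """Translate the source point set along the resultant force until the
    force magnitude (near-)vanishes."""
    t = np.zeros(2)
    for _ in range(n_iters):
        f = net_force(src_pts + t, dst_pts)
        if np.linalg.norm(f) < tol:
            break
        t += step * f
    return t
```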