Tencent’s Voyager Turns Images Into 3D Video Scenes

Tencent’s new artificial intelligence (AI) model, which creates 3D-style videos from single photos that users can explore like virtual worlds, scored higher than OpenAI’s Sora on Stanford’s WorldScore benchmark.

The Chinese tech giant released HunyuanWorld-Voyager on September 2, achieving a score of 77.62 on Stanford’s WorldScore benchmark. OpenAI’s Sora managed only 62.15, while another competitor, WonderWorld, scored 72.69.

Voyager transforms any photo into short video clips that simulate camera movement through 3D space. Users type simple commands like “move forward” or “pan left” to navigate the scene.

The system generates 49-frame videos lasting about two seconds. Multiple clips can be linked together for several minutes of footage. Each frame includes both color video and depth information, allowing the footage to be converted into 3D point clouds.
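To illustrate how a frame with per-pixel depth can become a point cloud, here is a minimal sketch of standard pinhole back-projection. It is not Tencent's code: the function name `depth_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for the example, and the tiny depth map is made-up data.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_points(
    depth: List[List[float]], fx: float, fy: float, cx: float, cy: float
) -> List[Point]:
    """Back-project a depth map into camera-space 3D points (pinhole model).

    fx, fy are focal lengths in pixels and cx, cy the principal point;
    all four are illustrative values, not Voyager's actual camera model.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v = pixel row
        for u, z in enumerate(row):        # u = pixel column, z = depth
            if z <= 0:                     # skip pixels with no valid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map; the zero marks a pixel with missing depth.
depth = [[2.0, 2.0],
         [0.0, 4.0]]
points = depth_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(len(points))
```

Each valid pixel yields one 3D point, so a full frame produces a dense cloud that can be colored with the matching pixels from the video.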

Tencent’s model excelled in style consistency, scoring 84.89, and earned high marks for subjective visual quality at 71.09. However, WonderWorld still leads in camera control, with a score of 92.98 versus Voyager’s 85.95.

The technology relies on a “world cache” system that remembers previously generated parts of a scene. When the virtual camera moves, the system projects stored 3D points back into 2D to maintain consistency. This prevents the visual drift that plagues many AI video generators.
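The reprojection step can be sketched roughly as follows. This is a simplified illustration, not Tencent's implementation: it assumes the camera only translates (no rotation), uses a pinhole model, and keeps the nearest point per pixel. The name `project_points` and all parameters are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def project_points(
    points: List[Point],
    cam_pos: Point,
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[List[Optional[float]]]:
    """Project cached 3D points into a sparse depth image for a moved camera.

    Assumption: the camera is translated to cam_pos but not rotated.
    Pixels that no cached point lands on stay None.
    """
    image: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for px, py, pz in points:
        # Express the point relative to the new camera position.
        x, y, z = px - cam_pos[0], py - cam_pos[1], pz - cam_pos[2]
        if z <= 0:                         # point is behind the camera
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            current = image[v][u]
            if current is None or z < current:   # nearest point wins
                image[v][u] = z
    return image


# Made-up cache: two points on the same ray, one to the side, one behind.
cache = [(0.0, 0.0, 4.0), (0.0, 0.0, 6.0), (1.0, 0.0, 4.0), (0.0, 0.0, 1.0)]
view = project_points(cache, cam_pos=(0.0, 0.0, 2.0),
                      fx=2.0, fy=2.0, cx=1.0, cy=1.0, width=3, height=3)
```

In a scheme like this, the pixels left as `None` are the regions the camera has not seen before, which is where a generator would need to synthesize new content while the filled pixels constrain it to match what was already shown.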

Tencent trained Voyager using more than 100,000 video clips. The training data included real-world footage and computer-generated scenes from Unreal Engine. An automated pipeline calculated depth information for each frame, eliminating the need for manual labeling.

Despite its technical success, Voyager faces practical hurdles. The system demands at least 60GB of GPU memory for basic 540p resolution. Tencent recommends 80GB for optimal results, putting it beyond the reach of most consumers.

Legal restrictions also limit access. The model cannot be used commercially in the European Union, the United Kingdom, or South Korea. Commercial deployments serving more than 100 million users need separate licensing from Tencent.

“Chinese companies are generally prioritizing efficiency and utilization—efficient utilization of GPU servers. And that doesn’t necessarily impair the ultimate effectiveness of the technology,” Tencent Chief Strategy Officer James Mitchell previously said.

The code is available on GitHub and Hugging Face for researchers who meet the hardware requirements. Multi-GPU setups can speed processing by 6.69 times using eight cards.
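For context, the reported multi-GPU figure can be turned into a parallel efficiency with one line of arithmetic. The helper name below is illustrative; the inputs are the numbers quoted above.

```python
def parallel_efficiency(speedup: float, n_devices: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup / n_devices


eff = parallel_efficiency(6.69, 8)
print(f"{eff:.1%}")  # prints 83.6%
```

In other words, eight cards deliver roughly five-sixths of the ideal eightfold speedup.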

Google is also developing a competing interactive world generator, Genie 3, which works from text prompts. “We think world models are key on the path to AGI, specifically for embodied agents, where simulating real-world scenarios is particularly challenging,” said a DeepMind scientist.

Voyager represents another step in the global AI competition, where Chinese companies increasingly challenge Western firms in cutting-edge technology development.
