Market Research Report
Product code: 1872684
China Autonomous Driving Data Closed Loop Research Report, 2025
This report investigates and analyzes China's automotive industry and summarizes the trends related to the autonomous driving data closed loop.
Data Closed-Loop Research: Synthetic Data Accounts for Over 50%, Full-process Automated Toolchain Gradually Implemented
Key Points:
From 2023 to 2025, the share of synthetic data rose from 20%-30% to 50%-60%, making it a core resource for covering long-tail scenarios.
A full-process automated toolchain, from data collection to deployment, is being gradually implemented, helping to reduce costs and improve efficiency.
Efficient collaboration within the vehicle-cloud integrated data closed loop is a key factor in achieving faster iteration.
The essence of the autonomous driving data closed loop is a cyclic optimization system spanning collection, transmission, processing, training, and deployment. In 2025, the industry is accelerating from the "0 to 1" stage into the "high-quality, high-efficiency" era, with the main challenges centering on long-tail scenario coverage and cost control. OEMs and Tier 1 suppliers are actively building their own data closed-loop solutions. Efficient data collection, processing, and analysis pipelines allow them to continuously improve autonomous driving algorithms, significantly enhancing the accuracy and stability of intelligent driving systems.
The efficiency of acquiring high-quality data determines the pace at which intelligent driving evolves. Currently, data sources in the automotive field include trigger-based data transmission from mass-produced vehicles, collection of high-value scenario-specific data by dedicated collection vehicles, engineering practices that reconstruct the physical world from real roadside data, and data synthesis based on world models. The core path to large-scale application of autonomous driving technology is to anchor basic capabilities with real data and then break through capability boundaries with synthetic data. From 2023 to 2025, the ratio of real to synthetic data in autonomous driving training sets has changed significantly, gradually shifting from an early real-data-dominated model to a hybrid model with a steadily rising proportion of synthetic data.
2023: Real data dominates, synthetic data emerges (synthetic data accounts for 20%-30%): Real data is still the mainstay, used primarily for basic scenario training, but it suffers from insufficient coverage of long-tail scenarios. For example, Tesla relied in its early stage on real road-test data from over one million vehicles, yet collecting extreme scenarios (such as pedestrians suddenly crossing in heavy rain) remained inefficient. Synthetic data accounts for about 20%-30% and is mainly used to supplement long-tail scenarios. Experiments by Applied Intuition show that after adding 30% synthetic data featuring frequent cyclist appearances to the real dataset, the perception model's recognition accuracy for cyclists (mAP score) improved significantly.
2024: Synthetic data accelerates its penetration (share rises to 40%-50%): Synthetic data has been upgraded from an "auxiliary tool" to a "core production material"; its rise to 40%-50% marks the entry of intelligent driving into a new data-driven paradigm. At the end of 2024, the Shanghai High-level Autonomous Driving Demonstration Zone launched a plan involving 100 data collection vehicles; through a hybrid model of "real data collection + world-model-generated virtual data", the proportion of synthetic data approached 50%. For example, Nvidia DRIVE Sim generates synthetic data for distant objects (100-350 meters) to address the sparsity of real annotations: after adding 92,000 synthetic images, detection accuracy (F1 score) for vehicles 200 meters away improved by 33%.
2025: Synthetic data takes the lead (accounts for over 50%): The ratio of synthetic to real data moves toward "5:5" or even higher. Academician Wu Hequan has pointed out that 90% of training for L4/L5 uses simulation data, with only 10%-20% real data retained as a "gene pool" to prevent model deviation. As an example of innovative applications of synthetic data, Li Auto uses world models to reconstruct historical scenarios and expand them into variants (for instance, virtualizing an ordinary intersection into rainy-night or foggy conditions), automatically generating extreme cases for cyclic training. At Li Auto, synthetic data now accounts for over 90%, replacing real-vehicle testing, and its reliability has been verified.
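In practice, the shifting real-to-synthetic ratio described above often shows up as a weighted sampling policy over the training pool. The following is a minimal Python sketch of such a mixing step, assuming placeholder scene IDs and a configurable target ratio; it is an illustrative simplification, not any vendor's actual pipeline.

```python
import random

def mix_training_batch(real_samples, synthetic_samples,
                       synthetic_ratio=0.5, batch_size=32, seed=None):
    """Draw a training batch with a target share of synthetic samples.

    synthetic_ratio=0.5 mirrors the roughly 5:5 real/synthetic split the
    report describes for 2025; earlier stages (2023/2024) would use 0.2-0.5.
    """
    rng = random.Random(seed)
    n_synth = round(batch_size * synthetic_ratio)
    n_real = batch_size - n_synth
    batch = rng.sample(synthetic_samples, n_synth) + rng.sample(real_samples, n_real)
    rng.shuffle(batch)
    return batch

# Illustrative usage with hypothetical scene identifiers.
real = [f"real_scene_{i}" for i in range(1000)]
synthetic = [f"synthetic_scene_{i}" for i in range(1000)]

batch_2023 = mix_training_batch(real, synthetic, synthetic_ratio=0.3, seed=0)  # ~20-30% synthetic
batch_2025 = mix_training_batch(real, synthetic, synthetic_ratio=0.5, seed=0)  # ~50% synthetic
print(sum(s.startswith("synthetic") for s in batch_2025), "synthetic samples in a 32-sample batch")
```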
According to Lang Xianpeng of Li Auto, the company's effective real-vehicle test mileage in 2023 was about 1.57 million kilometers, at a cost of 18 yuan per kilometer. By the first half of 2025, a cumulative 40 million kilometers had been tested, of which only 20,000 kilometers were real-vehicle testing and 38 million kilometers were covered by synthetic data, and the average test cost dropped to 0.5 yuan per kilometer. Test quality is also high: a single case can be extended into many scenario variants, and complete retesting is possible.
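The scale of this cost shift can be checked with simple arithmetic; the short sketch below only multiplies the mileage and per-kilometer figures quoted above (the variable names are ours, the numbers are those reported).

```python
# Figures as quoted in the report (Lang Xianpeng, Li Auto).
mileage_2023_km = 1_570_000      # effective real-vehicle test mileage in 2023
cost_2023_per_km = 18.0          # yuan per kilometer, real-vehicle testing

mileage_2025_km = 40_000_000     # cumulative test mileage by H1 2025, mostly synthetic
cost_2025_per_km = 0.5           # average yuan per kilometer

total_2023 = mileage_2023_km * cost_2023_per_km   # ~28.3 million yuan
total_2025 = mileage_2025_km * cost_2025_per_km   # ~20.0 million yuan

print(f"2023: {total_2023/1e6:.1f}M yuan for {mileage_2023_km/1e6:.2f}M km")
print(f"H1 2025: {total_2025/1e6:.1f}M yuan for {mileage_2025_km/1e6:.0f}M km "
      f"({cost_2023_per_km/cost_2025_per_km:.0f}x cheaper per km)")
```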
The advantages of synthetic data lie not only in cost and efficiency but also in a value density that goes beyond human experience. Synthetic data can be generated in batches by technical means at extremely low cost, matching the high-frequency training needs of AI, and it can independently generate extreme corner-case scenarios that humans have never experienced but that comply with physical laws.
The autonomous driving data closed loop has shifted from an early focus on a single link (such as improving annotation efficiency) to an end-to-end automated architecture covering collection, annotation, training, simulation, and deployment. The key breakthrough is using large AI models and cloud-edge collaboration to remove data-flow barriers, enabling the closed loop to evolve on its own.
LiangDao Intelligence's LD Data Factory is a full-link 4D ground truth solution covering everything from collection to delivery. The LD Data Factory toolchain has been delivered to more than a dozen automotive OEMs and Tier 1 suppliers in China, Germany, and Japan. Its automated 4D annotation software has annotated more than 3,300 hours of road-collected data for customers, producing high-quality 4D continuous-frame ground truth; by mid-2025, LiangDao Intelligence had delivered more than 55 million frames of data to a well-known German luxury car brand.
LD Data Factory integrates data collection, automated annotation, manual annotation, quality control, and performance evaluation. The toolchain includes AI preprocessing and VLM-assisted collection, an automated annotation module for object detection, a fully automated quality-inspection closed loop, and both hybrid-cloud and private deployment. It covers several core modules and realizes data management and task collaboration through a unified data management platform: time synchronization and spatial calibration, distributed storage and indexing services, the visual annotation platform LDEditor (full-stack annotation), the automated quality control module LD Validator, and the perception performance evaluation module LD KPI.
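As a rough picture of what a "full-process" toolchain chains together, the sketch below wires collection output, AI pre-annotation, automated quality inspection, and a simple evaluation metric into one pass. It is a generic, hypothetical simplification for orientation only and does not represent LD Data Factory's actual APIs or module interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Frame:
    """One collected sensor frame with its (possibly machine-generated) labels."""
    frame_id: int
    labels: List[str] = field(default_factory=list)
    qc_passed: bool = False

def pre_annotate(frame: Frame) -> Frame:
    # Stand-in for AI pre-annotation (e.g. object detection proposals).
    frame.labels.append("auto:vehicle")
    return frame

def quality_check(frame: Frame) -> Frame:
    # Stand-in for automated QC rules; here simply "must have at least one label".
    frame.qc_passed = len(frame.labels) > 0
    return frame

def run_pipeline(frames: List[Frame], stages: List[Callable[[Frame], Frame]]) -> List[Frame]:
    """Push every frame through each stage in order; keep only QC-passed frames."""
    for stage in stages:
        frames = [stage(f) for f in frames]
    return [f for f in frames if f.qc_passed]

collected = [Frame(frame_id=i) for i in range(5)]        # stand-in for road-collected data
ground_truth = run_pipeline(collected, [pre_annotate, quality_check])
print(f"{len(ground_truth)} of {len(collected)} frames passed QC")   # a minimal KPI: pass rate
```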
MindFlow's main products currently include an integrated data annotation platform, a data management platform (including a vector database), and a model training platform, covering the entire value chain from raw data to model deployment. Users can complete the whole algorithm development process in one place without switching between multiple tools or platforms, redefining the paradigm of AI data services. Technical highlights of its third-generation MindFlow SEED platform include support for 4D point cloud annotation (lane lines, segmentation), RPA-automated workflows, and AI pre-annotation covering more than 4,000 functional modules.
Currently, MindFlow empowers customers including SAIC Group, Changan Automobile, Great Wall Motors, Geely Automobile, FAW Group, Li Auto, Huawei, Bosch, ECARX, MAXIEYE, NavInfo and RoboSense.
The essence of the vehicle-cloud integrated data closed loop is to build a collaborative system of "lightweight vehicle side + intelligent cloud side", break down data-flow barriers, and enable the continuous evolution of intelligent vehicles. The vehicle side collects environmental perception data in real time (such as road conditions and vehicle operation data) and uploads it to the cloud after desensitization, encryption, and compression. The cloud processes massive volumes of data (PB/EB level), performs annotation, model training, and algorithm optimization, generates new capabilities, and distributes them to vehicles via OTA upgrades.
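The vehicle-side steps described here can be pictured as a small preprocessing chain before upload. The sketch below is a minimal illustration, assuming the Python standard library plus the third-party `cryptography` package; the field names, the VIN-hashing scheme, and the upload stub are our assumptions, not any OEM's actual implementation (it compresses before encrypting, since encrypted bytes do not compress well).

```python
import hashlib
import json
import zlib

from cryptography.fernet import Fernet  # pip install cryptography

def desensitize(record: dict, salt: str = "fleet-salt") -> dict:
    """Replace directly identifying fields (here: the VIN) with a salted hash."""
    out = dict(record)
    out["vin"] = hashlib.sha256((salt + record["vin"]).encode()).hexdigest()[:16]
    return out

def pack_for_upload(record: dict, key: bytes) -> bytes:
    """Desensitize -> serialize -> compress -> encrypt, ready for transmission."""
    clean = desensitize(record)
    compressed = zlib.compress(json.dumps(clean).encode())
    return Fernet(key).encrypt(compressed)

def upload_to_cloud(payload: bytes) -> None:
    # Stub: a real system would send this to the cloud ingest service over MQTT/HTTPS.
    print(f"uploading {len(payload)} encrypted bytes")

# Hypothetical perception record captured on the vehicle side.
record = {"vin": "LSVXP1234567890AB", "speed_kmh": 62.5, "event": "hard_brake"}
key = Fernet.generate_key()   # in practice the key would be provisioned securely
upload_to_cloud(pack_for_upload(record, key))
```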
The ExceedData data closed-loop solution is a vehicle-cloud integrated solution that has won the trust of more than 15 automotive OEMs, reached mass-production application with them, and is deployed in more than 30 mainstream models.
The ExceedData data closed-loop solution comprises the vehicle-side edge computing engine (vCompute), edge data engine (vADS), and edge database (vData), as well as the cloud-side algorithm development tool (vStudio), cloud computing engine (vAnalyze), and cloud management platform (vCloud). The solution can cut data transmission costs by 75%, cloud storage costs by 90%, and cloud computing costs by 33%; according to a cost calculation for an OEM cooperating with ExceedData, total cost can be reduced by about 85%.
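The overall saving implied by these component-level reductions depends on how the baseline cost splits between transmission, storage, and compute, which is not broken down here. The sketch below only shows the weighted-average calculation; the example weights are purely illustrative assumptions (the reported ~85% figure would correspond to a split weighted even more heavily toward transmission and storage).

```python
def total_cost_reduction(weights: dict, reductions: dict) -> float:
    """Overall saving = sum over cost items of (share of baseline cost) * (reduction)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "baseline shares must sum to 1"
    return sum(weights[item] * reductions[item] for item in weights)

# Component reductions quoted for the ExceedData solution.
reductions = {"transmission": 0.75, "storage": 0.90, "compute": 0.33}

# Hypothetical baseline cost split (NOT from the report), for illustration only.
weights = {"transmission": 0.40, "storage": 0.50, "compute": 0.10}

print(f"overall reduction: {total_cost_reduction(weights, reductions):.0%}")  # ~78% with these weights
```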
Among OEMs, take Xpeng Motors as an example: its self-built "cloud-side model factory" has a computing power reserve of 10 EFLOPS in 2025, and its end-to-end iteration cycle has been shortened to an average of 5 days, supporting a rapid closed loop from cloud-side pre-training to vehicle-side model deployment.
Xpeng launched China's first 72-billion-parameter multimodal world base model for L4 highly autonomous driving, which has chain-of-thought (CoT) reasoning capabilities and can simulate human common-sense reasoning and generate control signals. Through model distillation, the base model's capabilities are transferred to a small vehicle-side model, realizing personalized deployment that is "small in size and high in intelligence".
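Model distillation of the kind described here (migrating a large cloud model's behavior into a small vehicle-side model) is commonly trained with a temperature-softened KL-divergence loss between teacher and student outputs. The PyTorch sketch below shows only that loss step on dummy tensors; the layer sizes and temperature are arbitrary assumptions and are unrelated to Xpeng's actual 72B base model.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Dummy "teacher" (large cloud model) and "student" (small vehicle-side model).
teacher = torch.nn.Linear(128, 10)
student = torch.nn.Linear(128, 10)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(32, 128)                 # a batch of hypothetical features
with torch.no_grad():
    teacher_logits = teacher(x)          # teacher outputs are treated as fixed targets

loss = distillation_loss(student(x), teacher_logits)
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```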
High-value data (such as corner cases) is first screened by the vehicle-side rule engine, and the cloud then applies synthetic data generation technologies (such as GANs and diffusion models) to fill data gaps and improve model generalization. Meanwhile, end-to-end (E2E) and VLA models fuse multimodal inputs to directly output control commands, relying on cloud-side large-model training (such as Xpeng's 72-billion-parameter base model) to achieve lightweight deployment on the vehicle side.
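A vehicle-side rule engine of the kind mentioned above is, at its simplest, a set of trigger conditions deciding which frames are worth uploading. The following is a minimal sketch of such trigger rules; the signal names and thresholds are invented for illustration and do not correspond to any specific OEM's trigger set.

```python
from typing import Callable, Dict, List

# Each rule maps a frame of vehicle signals to True (upload-worthy) or False.
TriggerRule = Callable[[Dict[str, float]], bool]

RULES: Dict[str, TriggerRule] = {
    "hard_brake": lambda f: f.get("decel_mps2", 0.0) > 6.0,        # emergency braking
    "driver_takeover": lambda f: f.get("takeover", 0.0) == 1.0,    # disengagement event
    "low_perception_conf": lambda f: f.get("det_confidence", 1.0) < 0.3,
}

def matched_triggers(frame: Dict[str, float]) -> List[str]:
    """Return the names of all rules this frame satisfies (empty list = do not upload)."""
    return [name for name, rule in RULES.items() if rule(frame)]

# Hypothetical frames streaming from the vehicle bus.
frames = [
    {"decel_mps2": 7.2, "takeover": 0.0, "det_confidence": 0.9},   # hard braking -> upload
    {"decel_mps2": 1.1, "takeover": 0.0, "det_confidence": 0.8},   # routine driving -> skip
]
for frame in frames:
    hits = matched_triggers(frame)
    print("upload" if hits else "skip", hits)
```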
With the entire intelligent driving system becoming fully model-driven, automakers are pursuing "lower costs, higher efficiency, and more stable services" in the data closed loop. The delivery model of intelligent driving is rapidly shifting from shipping code for single-vehicle deployment to subscription-based cloud services at its core. An efficiently collaborative, vehicle-cloud integrated data closed loop is the key for intelligent vehicles to achieve faster, AI-driven iteration.
Glossary