Moonshot AI Kimi ‘Context Caching’ Feature Starts Public Beta
On July 1st, Moonshot AI announced that the public beta of the Kimi open platform's Context Caching feature has officially launched. According to the announcement, the technology can cut developers' costs on the long-text flagship model by up to 90% while keeping API prices unchanged, and it also speeds up model responses.
According to the introduction, Context Caching is a data-management technique that lets the system pre-store large amounts of data or information that will be requested frequently. When a user asks for the same information again, the system serves it directly from the cache rather than recomputing it or fetching it from the original data source.
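As a rough illustration of that idea (a minimal sketch, not Moonshot's actual implementation), the snippet below stores a precomputed result keyed by a hash of the shared context prefix, so repeated requests with the same prefix skip the expensive recomputation. All names in it are hypothetical.

```python
import hashlib

# Minimal sketch of context caching: process the long shared prefix once,
# store the result under a hash key, and let repeat requests reuse it.
class ContextCache:
    def __init__(self):
        self._store = {}  # prefix hash -> precomputed result

    def get_or_create(self, context: str, compute):
        key = hashlib.sha256(context.encode("utf-8")).hexdigest()
        if key not in self._store:            # miss: pay the full cost once
            self._store[key] = compute(context)
        return self._store[key]               # hit: skip recomputation

cache = ContextCache()
prompt = "very long system prompt + reference documents ..."
# `compute` stands in for the expensive processing of the shared context.
result = cache.get_or_create(prompt, lambda ctx: f"processed {len(ctx)} chars")
```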
Context Caching suits scenarios with frequent requests that repeatedly reference a large amount of initial context, where it can lower the cost of long-text models and improve efficiency. Officially, costs can be reduced by up to 90% and first-token latency by up to 83%. The applicable business scenarios are as follows:
A QA bot with a large amount of preset content, such as the Kimi API assistant.
Frequent queries against a fixed set of documents, such as an information-disclosure Q&A tool for listed companies.
Periodic analysis of static code repositories or knowledge bases, such as various Copilot Agents.
Viral AI applications that suddenly attract huge traffic, such as LLM Riddles.
Agent-style applications with complex interaction rules.
The Context Caching billing model consists of three parts: cache creation, cache storage, and cache invocation, itemized below (a worked cost example follows the breakdown).
Cache creation charges: after the cache-creation interface is called and the cache is successfully created, a one-time fee is charged based on the actual number of tokens in the cache, at 24 yuan per million tokens (M tokens).
Cache storage charges: for as long as the cache is alive, a storage fee is charged per minute, at 10 yuan per million tokens per minute.
Cache invocation charges: within the cache's lifetime, when a user's request through the chat interface successfully hits a created cache and the chat message content matches the active cache, an invocation fee is charged per call, at 0.02 yuan per call.
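As a quick sanity check of how the three parts combine, here is a small calculation using the beta prices quoted above (taking the per-million-token reading of the creation and storage prices); the workload figures are made up for illustration.

```python
# Combine the three billing parts using the public beta prices quoted above.
CREATE_YUAN_PER_MTOKEN = 24.0      # cache creation, per million tokens
STORE_YUAN_PER_MTOKEN_MIN = 10.0   # cache storage, per million tokens per minute
CALL_YUAN = 0.02                   # cache invocation, per matched call

def cache_cost(cached_tokens: int, minutes_alive: float, calls: int) -> float:
    mtokens = cached_tokens / 1_000_000
    creation = CREATE_YUAN_PER_MTOKEN * mtokens
    storage = STORE_YUAN_PER_MTOKEN_MIN * mtokens * minutes_alive
    invocation = CALL_YUAN * calls
    return creation + storage + invocation

# Hypothetical workload: a 100,000-token cache, alive 5 minutes, 200 calls.
# Creation 2.4 + storage 5.0 + invocation 4.0 = 11.4 yuan in total.
print(cache_cost(100_000, 5, 200))
```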
After launch, the feature will run a three-month public beta, and prices may be adjusted at any time during this period. Access during the public beta will be granted first to Tier 5 users; the release schedule for other user tiers is to be determined.