# LMDeploy Source Code Analysis 01

Based on my own understanding and an actual production deployment, this post walks through the inference flow of the lmdeploy model serving framework, to make troubleshooting and optimization easier.

  1. Startup command and entry point: lmdeploy.serve.openai.api_server

The api server exposes an OpenAI-compatible API built on the FastAPI framework and is accessed over HTTP (a minimal client sketch follows the list). The main endpoints are:

  • /v1/models
  • /health
  • /v1/chat/completions
  • /v1/chat/completions_qos
  • /v1/completions
  • /v1/completions_qos
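
Since the endpoints are OpenAI-compatible, any OpenAI client can talk to them. Below is a minimal client sketch, assuming the server runs locally on the default port 23333; the api_key value is a placeholder (it is only checked when api_keys is configured):

```python
from openai import OpenAI

# Placeholder base_url/api_key; adjust to the actual deployment.
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')

# Query /v1/models for the served model name.
model_name = client.models.list().data[0].id

# Call /v1/chat/completions with streaming enabled.
stream = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Hello!'}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or '', end='')
```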
  2. serve() startup parameters
```python
from typing import List, Literal, Optional, Union

from lmdeploy.messages import PytorchEngineConfig, TurbomindEngineConfig
from lmdeploy.model import ChatTemplateConfig


def serve(model_path: str,
          model_name: Optional[str] = None,
          backend: Literal['turbomind', 'pytorch'] = 'turbomind',
          backend_config: Optional[Union[PytorchEngineConfig,
                                         TurbomindEngineConfig]] = None,
          chat_template_config: Optional[ChatTemplateConfig] = None,
          server_name: str = '0.0.0.0',
          server_port: int = 23333,
          tp: int = 1,  # tensor parallelism degree
          allow_origins: List[str] = ['*'],  # CORS settings
          allow_credentials: bool = True,
          allow_methods: List[str] = ['*'],
          allow_headers: List[str] = ['*'],
          log_level: str = 'ERROR',
          api_keys: Optional[Union[List[str], str]] = None,
          ssl: bool = False,
          qos_config_path: str = '',
          **kwargs):
    pass  # body omitted; see lmdeploy/serve/openai/api_server.py
```
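
For reference, the server can be launched programmatically through this function; a minimal call, with the model path as a placeholder:

```python
# Minimal programmatic launch (equivalent to the `lmdeploy serve api_server`
# CLI); '/path/to/model' is a placeholder.
from lmdeploy.serve.openai.api_server import serve

serve('/path/to/model',
      backend='turbomind',
      server_name='0.0.0.0',
      server_port=23333,
      tp=1)
```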

Based on the startup parameters, an AsyncEngine is initialized; depending on the backend argument it constructs either the TurboMind engine or the PyTorch engine, as sketched below.
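
A simplified sketch of that dispatch, assuming the public engine entry points; the branching is paraphrased rather than copied from lmdeploy's source:

```python
# Paraphrased sketch of AsyncEngine's backend dispatch -- not the actual
# lmdeploy source; the from_pretrained entry points follow the public API.
def build_engine(model_path, backend, backend_config):
    if backend == 'turbomind':
        from lmdeploy.turbomind import TurboMind
        return TurboMind.from_pretrained(model_path,
                                         engine_config=backend_config)
    if backend == 'pytorch':
        from lmdeploy.pytorch.engine import Engine
        return Engine.from_pretrained(model_path,
                                      engine_config=backend_config)
    raise ValueError(f'unsupported backend: {backend}')
```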

  3. The /v1/chat/completions endpoint

This endpoint follows the OpenAI chat completions API.


```python
# Create the response generator. AsyncEngine.generate drives the whole
# request: preprocessing, tokenization and engine-side inference.
result_generator = VariableInterface.async_engine.generate(
    request.messages,
    request.session_id,
    gen_config=gen_config,
    stream_response=True,  # always use stream to enable batching
    sequence_start=True,
    sequence_end=True,
    do_preprocess=not isinstance(request.messages,
                                 str),  # text completion for string input
    adapter_name=adapter_name,
)
```

```python
# Generator logic: first compute the prompt input for the messages
# (chat-template preprocessing plus tokenization).
prompt_input = await self._get_prompt_input(prompt, do_preprocess,
                                            sequence_start, adapter_name)
```
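
The route handler then drains this generator asynchronously. A hedged sketch of the streaming loop follows; the `response` and `finish_reason` field names match lmdeploy's output object, the rest is illustrative:

```python
# Illustrative streaming loop over the generator; the actual handler wraps
# each chunk into an OpenAI-style SSE event before sending it to the client.
async def stream_results():
    async for res in result_generator:
        # res.response carries the newly decoded text fragment;
        # res.finish_reason is set on the final chunk.
        yield res.response
```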

3.1 PyTorch Engine

```python
from typing import List

# Assumption: GenerationConfig is lmdeploy's generation config class;
# InputEmbeddingType / InputEmbeddingRangeType are type aliases defined in
# the pytorch engine module.
from lmdeploy.messages import GenerationConfig


async def async_stream_infer(
        self,
        session_id: int,
        input_ids: List[int],
        gen_config: GenerationConfig = None,
        adapter_name: str = None,
        input_embeddings: InputEmbeddingType = None,
        input_embedding_ranges: InputEmbeddingRangeType = None,
        **kwargs):
    """Stream step-wise inference results for one session."""
    pass  # body omitted
```
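
A hedged usage sketch, assuming an already constructed pytorch-backend engine; `create_instance()` and the iteration pattern follow the engine's public interface, but treat the details as illustrative:

```python
import asyncio

from lmdeploy.messages import GenerationConfig


async def run(engine, input_ids):
    # create_instance() returns a per-request wrapper exposing
    # async_stream_infer; treat the details here as illustrative.
    generator = engine.create_instance()
    async for output in generator.async_stream_infer(
            session_id=1,
            input_ids=input_ids,
            gen_config=GenerationConfig(max_new_tokens=128)):
        print(output)  # step-wise decoding output for this session

# asyncio.run(run(engine, token_ids))  # engine / token_ids built elsewhere
```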