Regimens combining tyrosine kinase inhibitors (TKIs) with immunotherapy are now widely used to treat advanced hepatocellular carcinoma (HCC), but their clinical efficacy is limited to a subset of patients. Because the vast majority of patients with advanced HCC are no longer candidates for liver resection and therefore cannot provide tumor tissue samples, we leveraged clinical and imaging data to construct a multimodal convolutional neural network (CNN)–Transformer model for predicting and analyzing tumor response to TKI–immunotherapy. We employed an automatic liver tumor segmentation system, based on a two-stage 3D U-Net framework, that delineates lesions by first segmenting the liver parenchyma and then precisely localizing the tumor within it. This approach accommodates the variability in clinical data and reduces the bias introduced by manual intervention. Next, using a training cohort (n = 181), we developed three models: a clinical model using only pretreatment clinical information, a CNN model using only pretreatment magnetic resonance imaging data, and a multimodal CNN–Transformer model fusing imaging and clinical parameters. We then compared their predictive performance in an independent validation cohort (n = 30). In the validation cohort, the area under the curve (95% confidence interval) values were 0.720 (0.710–0.731), 0.695 (0.683–0.707), and 0.785 (0.760–0.810), respectively, indicating that the multimodal model significantly outperformed both single-modality baseline models. Finally, single-nucleus sequencing of surgical tumor specimens revealed tumor ecosystem diversity associated with treatment response, providing preliminary biological validation for the prediction model. In summary, this multimodal model effectively integrates imaging and clinical features of HCC patients, shows superior performance in predicting tumor response to TKI–immunotherapy, and provides a reliable tool for optimizing personalized treatment strategies.
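To make the reported metric concrete, the following is a minimal sketch of how an area under the curve and a 95% confidence interval can be estimated for a binary responder/non-responder prediction. It is an illustration only: the abstract does not state how the intervals were obtained, so the percentile bootstrap used here is an assumption, and the function names (`auc`, `bootstrap_auc_ci`), the resampling settings, and the toy labels and scores are all hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random responder (label 1) receives a
    higher score than a random non-responder (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Point AUC plus a percentile-bootstrap confidence interval.

    Assumption: patients are resampled with replacement; resamples that
    contain only one class are discarded and redrawn.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined without both classes
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return auc(labels, scores), lower, upper


if __name__ == "__main__":
    # Hypothetical toy data: 1 = responder, 0 = non-responder.
    toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
    toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.60]
    point, lower, upper = bootstrap_auc_ci(toy_labels, toy_scores)
    print(f"AUC = {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

The same routine could be applied separately to the output scores of the clinical, CNN, and multimodal models to place their AUC values side by side. With a validation cohort of only 30 patients, a patient-level bootstrap would typically give fairly wide intervals.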