Interpretable discovery of patterns in tabular data via spatially semantic topographic maps

Abstract

Tabular data—rows of samples and columns of sample features—are ubiquitously used across disciplines. Yet the tabular representation makes it difficult to discover underlying associations in the data and thus hinders their analysis and the discovery of useful patterns. Here we report a broadly applicable strategy for unravelling intertwined relationships in tabular data by reconfiguring each data sample into a spatially semantic 2D topographic map, which we refer to as TabMap. A TabMap preserves the original feature values as pixel intensities, with the relationships among the features spatially encoded in the map (the strength of the relationship between two features is reflected in their distance on the map). TabMap makes it possible to apply 2D convolutional neural networks to extract association patterns in the data to aid data analysis, and offers interpretability by ranking features according to importance. We show the superior predictive performance of TabMap by applying it to 12 datasets across a wide range of biomedical applications, including disease diagnosis, human activity recognition, microbial identification and the analysis of quantitative structure–activity relationships.
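
To make the construction concrete, the following is a minimal, hypothetical sketch of the idea described above. It is not the authors' implementation: TabMap assigns features to map positions by solving a Gromov–Wasserstein optimal-transport problem (refs. 44, 46–48), whereas this sketch substitutes a plain MDS embedding of the feature correlation structure followed by a linear-assignment step, and the function name `tabular_to_maps` is invented for illustration.

```python
# Illustrative sketch only -- not the TabMap implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.manifold import MDS

def tabular_to_maps(X, side=None, random_state=0):
    """Turn an (n_samples, n_features) array into (n_samples, side, side) maps.

    Correlated features are placed in nearby grid cells; each pixel keeps the
    original (min-max scaled) feature value as its intensity.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    side = side or int(np.ceil(np.sqrt(d)))

    # 1. Feature-feature dissimilarity from absolute correlation across samples.
    corr = np.corrcoef(X, rowvar=False)
    dissim = 1.0 - np.abs(np.nan_to_num(corr))

    # 2. Embed the features in 2D so that related features land close together.
    emb = MDS(n_components=2, dissimilarity="precomputed",
              random_state=random_state).fit_transform(dissim)

    # 3. Assign each feature to a unique grid cell via linear assignment
    #    (a simplification of TabMap's optimal-transport formulation).
    grid = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
    span = emb.max(axis=0) - emb.min(axis=0) + 1e-12
    emb_scaled = (emb - emb.min(axis=0)) / span * (side - 1)
    cost = ((emb_scaled[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    feat_idx, cell_idx = linear_sum_assignment(cost)

    # 4. Paint each sample's scaled feature values onto its assigned pixels.
    Xs = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    maps = np.zeros((n, side, side))
    rows = grid[cell_idx, 0].astype(int)
    cols = grid[cell_idx, 1].astype(int)
    maps[:, rows, cols] = Xs[:, feat_idx]
    return maps  # stack as (n, 1, side, side) to feed a standard 2D CNN
```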

Fig. 1: Workflow of TabMap-based tabular data analysis.
Fig. 2: TabMap visualization of tabular data and spatial correlation properties of TabMap.
Fig. 3: Performance of different methods on biomedical tabular datasets.
Fig. 4: Feature attributions of the proposed method provide useful insights that help identify important features for model predictions.

Data availability

The BCTIL dataset is available from the Single Cell Portal (https://singlecell.broadinstitute.org/single_cell). The TOX-171 and LUNG datasets are available from the scikit-feature repository64. The OncoNPC dataset can be requested from its authors. Additional datasets used in this study are available from the UCI Machine Learning Repository65. The main data supporting the results in this study are available within the paper and its Supplementary Information. Source data are provided with this paper.

Code availability

The source code for TabMap is available via GitHub at https://github.com/rui-yan/TabMap. All methods are implemented in Python, using PyTorch as the primary package for model training. The code base is made available for non-commercial and academic purposes.

References

  1. Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020).

  2. Obermeyer, Z. & Emanuel, E. J. Predicting the future—big data, machine learning, and clinical medicine. N. Engl. J. Med. 375, 1216–1219 (2016).

  3. Marx, V. The big challenges of big data. Nature 498, 255–260 (2013).

  4. Wu, X., Zhu, X., Wu, G.-Q. & Ding, W. Data mining with big data. IEEE Trans. Knowl. Data Eng. 26, 97–107 (2013).

  5. LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S. & Kruschwitz, N. Big data, analytics and the path from insights to value. MIT Sloan Manage. Rev. 52, 21–32 (2011).

  6. Xing, L., Giger, M. L. & Min, J. K. Artificial Intelligence in Medicine: Technical Basis and Clinical Applications (Academic Press, 2020).

  7. Wee-Chung Liew, A., Yan, H. & Yang, M. Pattern recognition techniques for the emerging field of bioinformatics: a review. Pattern Recognit. 38, 2055–2073 (2005).

  8. Tang, B., Pan, Z., Yin, K. & Khateeb, A. Recent advances of deep learning in bioinformatics and computational biology. Front. Genet. 10, 214 (2019).

  9. Karim, M. R. et al. Deep learning-based clustering approaches for bioinformatics. Brief. Bioinform. 22, 393–415 (2021).

  10. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).

  11. Nelder, J. A. & Wedderburn, R. W. M. Generalized linear models. J. R. Stat. Soc. A 135, 370–384 (1972).

  12. Tolles, J. & Meurer, W. J. Logistic regression: relating patient characteristics to outcomes. JAMA 316, 533–534 (2016).

  13. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

  14. Chen, T. & Guestrin, C. Xgboost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).

  15. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  16. Ronao, C. A. & Cho, S.-B. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 59, 235–244 (2016).

  17. Arik, S. Ö. & Pfister, T. Tabnet: attentive interpretable tabular learning. Proc. AAAI Conf. Artif. Intell. 35, 6679–6687 (2021).

  18. Huang, X., Khetan, A., Cvitkovic, M. & Karnin, Z. Tabtransformer: tabular data modeling using contextual embeddings. Preprint at https://arxiv.org/abs/2012.06678 (2020).

  19. Kadra, A., Lindauer, M., Hutter, F. & Grabocka, J. Well-tuned simple nets excel on tabular datasets. Adv. Neural Inf. Process. Syst. 34, 23928–23941 (2021).

  20. Borisov, V. et al. Deep neural networks and tabular data: a survey. IEEE Trans. Neural Netw. Learn. Syst. 35, 7499–7519 (2022).

  21. Gorishniy, Y., Rubachev, I., Khrulkov, V. & Babenko, A. Revisiting deep learning models for tabular data. Adv. Neural Inf. Process. Syst. 34, 18932–18943 (2021).

  22. Shwartz-Ziv, R. & Armon, A. Tabular data: deep learning is not all you need. Inf. Fusion 81, 84–90 (2022).

  23. Zhu, Y. et al. Converting tabular data into images for deep learning with convolutional neural networks. Sci. Rep. 11, 11325 (2021).

  24. Anguita, D., Ghio, A., Oneto, L., Parra, X. & Reyes-Ortiz, J. L. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In Ambient Assisted Living and Home Care. 4th International Workshop IWAAL 2012 (eds Bravo, J. et al.) 216–223 (Springer, 2012).

  25. Jayaram, N. & Baker, J. W. Correlation model for spatially distributed ground-motion intensities. Earthq. Eng. Struct. Dyn. 38, 1687–1708 (2009).

  26. ElShawi, R., Sherif, Y., Al-Mallah, M. & Sakr, S. Interpretability in healthcare: a comparative study of local machine learning interpretability techniques. Comput. Intell. 37, 1633–1650 (2021).

  27. Tjoa, E. & Guan, C. A survey on explainable artificial intelligence (xai): toward medical xai. IEEE Trans. Neural Netw. Learn. Syst. 32, 4793–4813 (2020).

  28. Yu, K.-H., Beam, A. L. & Kohane, I. S. Artificial intelligence in healthcare. Nat. Biomed. Eng. 2, 719–731 (2018).

  29. Shortliffe, E. H. & Sepúlveda, M. J. Clinical decision support in the era of artificial intelligence. JAMA 320, 2199–2200 (2018).

  30. Sharma, A., Vans, E., Shigemizu, D., Boroevich, K. A. & Tsunoda, T. Deepinsight: a methodology to transform a non-image data to an image for convolution neural network architecture. Sci. Rep. 9, 11399 (2019).

  31. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 31, 4768–4777 (2017).

  32. Savas, P. et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat. Med. 24, 986–993 (2018).

  33. Jia, J., Li, H., Huang, Z., Yu, J. & Cao, B. Comprehensive immune landscape of lung-resident memory CD8+ T cells after influenza infection and reinfection in a mouse model. Front. Microbiol. 14, 1184884 (2023).

  34. Lelliott, E. J. et al. NKG7 enhances CD8+ T cell synapse efficiency to limit inflammation. Front. Immunol. 13, 931630 (2022).

  35. Wen, T. et al. NKG7 is a T-cell–intrinsic therapeutic target for improving antitumor cytotoxicity and cancer immunotherapy. Cancer Immunol. Res. 10, 162–181 (2022).

  36. Ting, D. S. W., Carin, L., Dzau, V. & Wong, T. Y. Digital technology and COVID-19. Nat. Med. 26, 459–461 (2020).

  37. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  38. Bazgir, O. et al. Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks. Nat. Commun. 11, 4391 (2020).

  39. Shavitt, I. & Segal, E. Regularization learning networks: deep learning for tabular datasets. Adv. Neural Inf. Process. Syst. 31, 1386–1396 (2018).

  40. Kossen, J. et al. Self-attention between datapoints: going beyond individual input–output pairs in deep learning. Adv. Neural Inf. Process. Syst. 34, 28742–28756 (2021).

  41. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2921–2929 (IEEE, 2016).

  42. Selvaraju, R. R. et al. Grad-cam: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision 618–626 (IEEE, 2017).

  43. Ribeiro, M. T., Singh, S. & Guestrin, C. “Why should i trust you?”: explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1135–1144 (Association for Computing Machinery, 2016).

  44. Peyré, G. et al. Computational optimal transport: with applications to data science. Found. Trends Mach. Learn. 11, 355–607 (2019).

  45. Moon, I. et al. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nat. Med. 29, 2057–2067 (2023).

  46. Peyré, G., Cuturi, M. & Solomon, J. Gromov–Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning 2664–2672 (PMLR, 2016).

  47. Cuturi, M. Sinkhorn distances: lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 26, 2292–2300 (2013).

  48. Crouse, D. F. On implementing 2D rectangular assignment algorithms. IEEE Trans. Aerosp. Electron. Syst. 52, 1679–1696 (2016).

  49. Shapley, L. S. in Contributions to the Theory of Games II (eds Kuhn, H. W. & Tucker, A. W.) 307–317 (Princeton Univ. Press, 1953).

  50. Deng, X. & Papadimitriou, C. H. On the complexity of cooperative solution concepts. Math. Oper. Res. 19, 257–266 (1994).

  51. Datta, A., Sen, S. & Zick, Y. Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP) 598–617 (IEEE, 2016).

  52. Štrumbelj, E. & Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst. 41, 647–665 (2014).

  53. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning 3145–3153 (PMLR, 2017).

  54. Sakar, C., Serbes, G., Gunduz, A., Nizam, H. & Sakar, B. Parkinson’s disease classification. UCI Machine Learning Repository https://doi.org/10.24432/C5MS4X (2018).

  55. Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R. & Consonni, V. QSAR biodegradation. UCI Machine Learning Repository https://doi.org/10.24432/C5H60M (2013).

  56. Reyes-Ortiz, J., Anguita, D., Ghio, A., Oneto, L. & Parra, X. Human activity recognition using smartphones. UCI Machine Learning Repository https://doi.org/10.24432/C54S4K (2012).

  57. Mah, P. & Veyrieras, J.-B. MicroMass. UCI Machine Learning Repository https://doi.org/10.24432/C5T61S (2013).

  58. Guyon, I., Gunn, S., Ben-Hur, A. & Dror, G. Arcene. UCI Machine Learning Repository https://doi.org/10.24432/C58P55 (2008).

  59. Cole, R. & Fanty, M. ISOLET. UCI Machine Learning Repository https://doi.org/10.24432/C51G69 (1994).

  60. Lathrop, R. p53 Mutants. UCI Machine Learning Repository https://doi.org/10.24432/C5T89H (2010).

  61. Wolberg, W., Mangasarian, O., Street, N. & Street, W. Breast cancer Wisconsin (diagnostic). UCI Machine Learning Repository https://doi.org/10.24432/C5DW2B (1995).

  62. Bhattacharjee, A. et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA 98, 13790–13795 (2001).

  63. Li, J. et al. Feature selection: a data perspective. ACM Comput. Surv. 50, 1–45 (2017).

  64. Li, J. et al. scikit-feature feature selection repository. GitHub https://jundongl.github.io/scikit-feature (2018).

  65. UCI Machine Learning Repository; https://archive.ics.uci.edu

Acknowledgements

We acknowledge the support from the Stanford Cancer Institute and the National Institutes of Health (1K99LM014309, 1R01CA223667 and 1R01CA275772).

Author information

Contributions

L.X. and M.T.I. conceived the experiments. R.Y. conducted the experiments and analysed the results. All authors contributed to writing the paper.

Corresponding author

Correspondence to Lei Xing.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Yitan Zhu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Probability distributions of model predictions and ROC curves for TabMap and five other prediction models.

Probability distributions of model predictions (left) and ROC curves (right) from 5-fold cross-validation on the PD dataset for (a) TabMap, (b) 1DCNN, (c) LR, (d) RF, (e) GB, and (f) XGB. In the ROC curves, the blue curve illustrates the mean performance across the hold-out test set, where each fold represents 20% of the tested data. The gray shaded area shows the standard deviation of the performance.
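
For readers who want to reproduce this style of evaluation, the sketch below shows one standard way to pool ROC curves over stratified 5-fold cross-validation: each fold's true-positive rates are interpolated onto a common false-positive-rate grid, and the mean and standard deviation are reported across folds. The helper `fit_and_score` and the function name are assumptions for illustration, not part of the TabMap code base.

```python
# Hedged sketch: mean +/- s.d. ROC over k-fold cross-validation.
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

def cv_mean_roc(fit_and_score, X, y, n_splits=5, seed=0):
    """fit_and_score(X_train, y_train, X_test) -> positive-class scores for X_test."""
    grid = np.linspace(0.0, 1.0, 101)            # common FPR grid for interpolation
    tprs, aucs = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        scores = fit_and_score(X[train_idx], y[train_idx], X[test_idx])
        fpr, tpr, _ = roc_curve(y[test_idx], scores)
        interp_tpr = np.interp(grid, fpr, tpr)   # put every fold on the same grid
        interp_tpr[0] = 0.0
        tprs.append(interp_tpr)
        aucs.append(auc(fpr, tpr))
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    return grid, mean_tpr, np.std(tprs, axis=0), np.mean(aucs), np.std(aucs)
```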

Source data

Extended Data Fig. 2 Confusion matrices for TabMap and five other prediction models.

Average confusion matrices from 5-fold cross-validation on the HAR dataset for (a) TabMap, (b) 1DCNN, (c) LR, (d) RF, (e) GB, and (f) XGB.

Source data

Extended Data Fig. 3 Cell-type annotation and canonical biomarker identification using TabMap.

(a) 2D t-SNE visualization of T cells using embeddings extracted from the fully connected layer of the trained TabMap model, with ten distinct clusters represented by different colors. (b) Top 20 genes with the highest SHAP values crucial for identifying T cell subtypes CD8+TRM, CD4+FOXP3+, and CD4+RGCC+. Key genes previously identified in literature are marked in red on the y-axis. (c) Heat map illustrating local attributions of key genes based on SHAP values, with cells grouped into clusters as indicated by color bars at the bottom. Key genes for each cluster are annotated on the y-axis. Attribution values are color-coded, with positive attributions shown in red and negative attributions in blue.
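
As a rough illustration of how such per-feature attributions can be obtained from a trained TabMap-style CNN, the sketch below applies SHAP's DeepExplainer to a PyTorch model and maps pixel attributions back to feature names. It assumes the classic list-per-class return format of `shap_values`; the function name, the `feature_of_pixel` mapping and all variable names are hypothetical and do not come from the released code.

```python
# Hedged sketch: ranking map pixels (features) by mean absolute SHAP value.
import numpy as np
import shap

def rank_features_by_shap(model, background, samples, feature_of_pixel, top_k=20):
    """Return the top_k feature names with the largest mean |SHAP| attribution.

    background, samples : torch float tensors of shape (n, 1, H, W)
    feature_of_pixel    : dict {(row, col): feature_name} from the map layout
    """
    model.eval()
    explainer = shap.DeepExplainer(model, background)
    shap_values = explainer.shap_values(samples)   # assumed: list with one array per class
    # Average |attribution| over classes and samples -> one (H, W) importance map.
    attr = np.mean([np.abs(sv) for sv in shap_values], axis=(0, 1)).squeeze()
    ranked = sorted(feature_of_pixel.items(), key=lambda kv: attr[kv[0]], reverse=True)
    return [name for _, name in ranked[:top_k]]
```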

Source data

Supplementary information

Supplementary Information

Supplementary figures and tables.

Reporting Summary

Source data

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yan, R., Islam, M.T. & Xing, L. Interpretable discovery of patterns in tabular data via spatially semantic topographic maps. Nat. Biomed. Eng. 9, 471–482 (2025). https://doi.org/10.1038/s41551-024-01268-6
