Interpretable discovery of patterns in tabular data via spatially semantic topographic maps

Abstract

Tabular data—rows of samples and columns of sample features—are ubiquitously used across disciplines. Yet the tabular representation makes it difficult to discover underlying associations in the data and thus hinders their analysis and the discovery of useful patterns. Here we report a broadly applicable strategy for unravelling intertwined relationships in tabular data by reconfiguring each data sample into a spatially semantic 2D topographic map, which we refer to as TabMap. A TabMap preserves the original feature values as pixel intensities, with the relationships among the features spatially encoded in the map (the strength of the relationship between two features is reflected in their distance on the map). TabMap makes it possible to apply 2D convolutional neural networks to extract association patterns in the data to aid data analysis, and offers interpretability by ranking features according to importance. We show the superior predictive performance of TabMap by applying it to 12 datasets across a wide range of biomedical applications, including disease diagnosis, human activity recognition, microbial identification and the analysis of quantitative structure–activity relationships.
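
To make the construction concrete, the following is a minimal, hypothetical sketch of the idea described above. It is not the authors' implementation: TabMap assigns features to map positions by solving a Gromov–Wasserstein optimal-transport problem (refs. 44, 46–48), whereas this sketch substitutes a plain MDS embedding of the feature correlation structure followed by a linear-assignment step, and the function name `tabular_to_maps` is invented for illustration.

```python
# Illustrative sketch only -- not the TabMap implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.manifold import MDS

def tabular_to_maps(X, side=None, random_state=0):
    """Turn an (n_samples, n_features) array into (n_samples, side, side) maps.

    Correlated features are placed in nearby grid cells; each pixel keeps the
    original (min-max scaled) feature value as its intensity.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    side = side or int(np.ceil(np.sqrt(d)))

    # 1. Feature-feature dissimilarity from absolute correlation across samples.
    corr = np.corrcoef(X, rowvar=False)
    dissim = 1.0 - np.abs(np.nan_to_num(corr))

    # 2. Embed the features in 2D so that related features land close together.
    emb = MDS(n_components=2, dissimilarity="precomputed",
              random_state=random_state).fit_transform(dissim)

    # 3. Assign each feature to a unique grid cell via linear assignment
    #    (a simplification of TabMap's optimal-transport formulation).
    grid = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
    span = emb.max(axis=0) - emb.min(axis=0) + 1e-12
    emb_scaled = (emb - emb.min(axis=0)) / span * (side - 1)
    cost = ((emb_scaled[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    feat_idx, cell_idx = linear_sum_assignment(cost)

    # 4. Paint each sample's scaled feature values onto its assigned pixels.
    Xs = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    maps = np.zeros((n, side, side))
    rows = grid[cell_idx, 0].astype(int)
    cols = grid[cell_idx, 1].astype(int)
    maps[:, rows, cols] = Xs[:, feat_idx]
    return maps  # stack as (n, 1, side, side) to feed a standard 2D CNN
```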

Fig. 1: Workflow of TabMap-based tabular data analysis.
Fig. 2: TabMap visualization of tabular data and spatial correlation properties of TabMap.
Fig. 3: Performance of different methods on biomedical tabular datasets.
Fig. 4: Feature attributions of the proposed method provide useful insights that help identify important features for model predictions.

Data availability

The BCTIL dataset is available from the Single Cell Portal (https://singlecell.broadinstitute.org/single_cell). The TOX-171 and LUNG datasets are available from the scikit-feature repository64. The OncoNPC dataset can be requested from its authors. Additional datasets used in this study are available from the UCI Machine Learning Repository65. The main data supporting the results in this study are available within the paper and its Supplementary Information. Source data are provided with this paper.

Code availability

The source code for TabMap is available via GitHub at https://github.com/rui-yan/TabMap. All methods are implemented in Python, using PyTorch as the primary package for model training. The code base is made available for non-commercial and academic purposes.

References

  1. Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020).

  2. Obermeyer, Z. & Emanuel, E. J. Predicting the future—big data, machine learning, and clinical medicine. N. Engl. J. Med. 375, 1216–1219 (2016).

  3. Marx, V. The big challenges of big data. Nature 498, 255–260 (2013).

  4. Wu, X., Zhu, X., Wu, G.-Q. & Ding, W. Data mining with big data. IEEE Trans. Knowl. Data Eng. 26, 97–107 (2013).

  5. LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S. & Kruschwitz, N. Big data, analytics and the path from insights to value. MIT Sloan Manage. Rev. 52, 21–32 (2011).

  6. Xing, L., Giger, M. L. & Min, J. K. Artificial Intelligence in Medicine: Technical Basis and Clinical Applications (Academic Press, 2020).

  7. Wee-Chung Liew, A., Yan, H. & Yang, M. Pattern recognition techniques for the emerging field of bioinformatics: a review. Pattern Recognit. 38, 2055–2073 (2005).

  8. Tang, B., Pan, Z., Yin, K. & Khateeb, A. Recent advances of deep learning in bioinformatics and computational biology. Front. Genet. 10, 214 (2019).

  9. Karim, M. R. et al. Deep learning-based clustering approaches for bioinformatics. Brief. Bioinform. 22, 393–415 (2021).

  10. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).

  11. Nelder, J. A. & Wedderburn, R. W. M. Generalized linear models. J. R. Stat. Soc. A 135, 370–384 (1972).

  12. Tolles, J. & Meurer, W. J. Logistic regression: relating patient characteristics to outcomes. JAMA 316, 533–534 (2016).

  13. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

  14. Chen, T. & Guestrin, C. Xgboost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).

  15. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  16. Ronao, C. A. & Cho, S.-B. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 59, 235–244 (2016).

  17. Arik, S. Ö. & Pfister, T. Tabnet: attentive interpretable tabular learning. Proc. AAAI Conf. Artif. Intell. 35, 6679–6687 (2021).

  18. Huang, X., Khetan, A., Cvitkovic, M. & Karnin, Z. Tabtransformer: tabular data modeling using contextual embeddings. Preprint at https://arxiv.org/abs/2012.06678 (2020).

  19. Kadra, A., Lindauer, M., Hutter, F. & Grabocka, J. Well-tuned simple nets excel on tabular datasets. Adv. Neural Inf. Process. Syst. 34, 23928–23941 (2021).

  20. Borisov, V. et al. Deep neural networks and tabular data: a survey. IEEE Trans. Neural Netw. Learn. Syst. 35, 7499–7519 (2022).

  21. Gorishniy, Y., Rubachev, I., Khrulkov, V. & Babenko, A. Revisiting deep learning models for tabular data. Adv. Neural Inf. Process. Syst. 34, 18932–18943 (2021).

  22. Shwartz-Ziv, R. & Armon, A. Tabular data: deep learning is not all you need. Inf. Fusion 81, 84–90 (2022).

  23. Zhu, Y. et al. Converting tabular data into images for deep learning with convolutional neural networks. Sci. Rep. 11, 11325 (2021).

  24. Anguita, D., Ghio, A., Oneto, L., Parra, X. & Reyes-Ortiz, J. L. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In Ambient Assisted Living and Home Care. 4th International Workshop IWAAL 2012 (eds Bravo, J. et al.) 216–223 (Springer, 2012).

  25. Jayaram, N. & Baker, J. W. Correlation model for spatially distributed ground-motion intensities. Earthq. Eng. Struct. Dyn. 38, 1687–1708 (2009).

  26. ElShawi, R., Sherif, Y., Al-Mallah, M. & Sakr, S. Interpretability in healthcare: a comparative study of local machine learning interpretability techniques. Comput. Intell. 37, 1633–1650 (2021).

  27. Tjoa, E. & Guan, C. A survey on explainable artificial intelligence (xai): toward medical xai. IEEE Trans. Neural Netw. Learn. Syst. 32, 4793–4813 (2020).

  28. Yu, K.-H., Beam, A. L. & Kohane, I. S. Artificial intelligence in healthcare. Nat. Biomed. Eng. 2, 719–731 (2018).

  29. Shortliffe, E. H. & Sepúlveda, M. J. Clinical decision support in the era of artificial intelligence. JAMA 320, 2199–2200 (2018).

  30. Sharma, A., Vans, E., Shigemizu, D., Boroevich, K. A. & Tsunoda, T. Deepinsight: a methodology to transform a non-image data to an image for convolution neural network architecture. Sci. Rep. 9, 11399 (2019).

  31. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 31, 4768–4777 (2017).

  32. Savas, P. et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat. Med. 24, 986–993 (2018).

  33. Jia, J., Li, H., Huang, Z., Yu, J. & Cao, B. Comprehensive immune landscape of lung-resident memory CD8+ T cells after influenza infection and reinfection in a mouse model. Front. Microbiol. 14, 1184884 (2023).

  34. Lelliott, E. J. et al. NKG7 enhances CD8+ T cell synapse efficiency to limit inflammation. Front. Immunol. 13, 931630 (2022).

  35. Wen, T. et al. NKG7 is a T-cell–intrinsic therapeutic target for improving antitumor cytotoxicity and cancer immunotherapy. Cancer Immunol. Res. 10, 162–181 (2022).

  36. Ting, D. S. W., Carin, L., Dzau, V. & Wong, T. Y. Digital technology and COVID-19. Nat. Med. 26, 459–461 (2020).

  37. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  38. Bazgir, O. et al. Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks. Nat. Commun. 11, 4391 (2020).

  39. Shavitt, I. & Segal, E. Regularization learning networks: deep learning for tabular datasets. Adv. Neural Inf. Process. Syst. 31, 1386–1396 (2018).

  40. Kossen, J. et al. Self-attention between datapoints: going beyond individual input–output pairs in deep learning. Adv. Neural Inf. Process. Syst. 34, 28742–28756 (2021).

  41. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2921–2929 (IEEE, 2016).

  42. Selvaraju, R. R. et al. Grad-cam: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision 618–626 (IEEE, 2017).

  43. Ribeiro, M. T., Singh, S. & Guestrin, C. “Why should i trust you?”: explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1135–1144 (Association for Computing Machinery, 2016).

  44. Peyré, G. et al. Computational optimal transport: with applications to data science. Found. Trends Mach. Learn. 11, 355–607 (2019).

  45. Moon, I. et al. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nat. Med. 29, 2057–2067 (2023).

  46. Peyré, G., Cuturi, M. & Solomon, J. Gromov–Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning 2664–2672 (PMLR, 2016).

  47. Cuturi, M. Sinkhorn distances: lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 26, 2292–2300 (2013).

  48. Crouse, D. F. On implementing 2D rectangular assignment algorithms. IEEE Trans. Aerosp. Electron. Syst. 52, 1679–1696 (2016).

  49. Shapley, L. S. in Contributions to the Theory of Games II (eds Kuhn, H. W. & Tucker, A. W.) 307–317 (Princeton Univ. Press, 1953).

  50. Deng, X. & Papadimitriou, C. H. On the complexity of cooperative solution concepts. Math. Oper. Res. 19, 257–266 (1994).

  51. Datta, A., Sen, S. & Zick, Y. Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP) 598–617 (IEEE, 2016).

  52. Štrumbelj, E. & Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst. 41, 647–665 (2014).

  53. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning 3145–3153 (PMLR, 2017).

  54. Sakar, C., Serbes, G., Gunduz, A., Nizam, H. & Sakar, B. Parkinson’s disease classification. UCI Machine Learning Repository https://doi.org/10.24432/C5MS4X (2018).

  55. Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R. & Consonni, V. QSAR biodegradation. UCI Machine Learning Repository https://doi.org/10.24432/C5H60M (2013).

  56. Reyes-Ortiz, J., Anguita, D., Ghio, A., Oneto, L. & Parra, X. Human activity recognition using smartphones. UCI Machine Learning Repository https://doi.org/10.24432/C54S4K (2012).

  57. Mah, P. & Veyrieras, J.-B. MicroMass. UCI Machine Learning Repository https://doi.org/10.24432/C5T61S (2013).

  58. Guyon, I., Gunn, S., Ben-Hur, A. & Dror, G. Arcene. UCI Machine Learning Repository https://doi.org/10.24432/C58P55 (2008).

  59. Cole, R. & Fanty, M. ISOLET. UCI Machine Learning Repository https://doi.org/10.24432/C51G69 (1994).

  60. Lathrop, R. p53 Mutants. UCI Machine Learning Repository https://doi.org/10.24432/C5T89H (2010).

  61. Wolberg, W., Mangasarian, O., Street, N. & Street, W. Breast cancer Wisconsin (diagnostic). UCI Machine Learning Repository https://doi.org/10.24432/C5DW2B (1995).

  62. Bhattacharjee, A. et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA 98, 13790–13795 (2001).

  63. Li, J. et al. Feature selection: a data perspective. ACM Comput. Surv. 50, 1–45 (2017).

  64. Li, J. et al. scikit-feature feature selection repository. GitHub https://jundongl.github.io/scikit-feature (2018).

  65. UCI Machine Learning Repository; https://archive.ics.uci.edu

Acknowledgements

We acknowledge the support from the Stanford Cancer Institute and the National Institutes of Health (1K99LM014309, 1R01CA223667 and 1R01CA275772).

Author information

Contributions

L.X. and M.T.I. conceived the experiments. R.Y. conducted the experiments and analysed the results. All authors contributed to writing the paper.

Corresponding author

Correspondence to Lei Xing.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Yitan Zhu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Probability distributions of model predictions and ROC curves for TabMap and five other prediction models.

Probability distributions of model predictions (left) and ROC curves (right) from 5-fold cross-validation on the PD dataset for (a) TabMap, (b) 1DCNN, (c) LR, (d) RF, (e) GB, and (f) XGB. In the ROC curves, the blue curve illustrates the mean performance across the hold-out test set, where each fold represents 20% of the tested data. The gray shaded area shows the standard deviation of the performance.
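
For readers who want to reproduce this style of evaluation, the sketch below shows one standard way to pool ROC curves over stratified 5-fold cross-validation: each fold's true-positive rates are interpolated onto a common false-positive-rate grid, and the mean and standard deviation are reported across folds. The helper `fit_and_score` and the function name are assumptions for illustration, not part of the TabMap code base.

```python
# Hedged sketch: mean +/- s.d. ROC over k-fold cross-validation.
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

def cv_mean_roc(fit_and_score, X, y, n_splits=5, seed=0):
    """fit_and_score(X_train, y_train, X_test) -> positive-class scores for X_test."""
    grid = np.linspace(0.0, 1.0, 101)            # common FPR grid for interpolation
    tprs, aucs = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        scores = fit_and_score(X[train_idx], y[train_idx], X[test_idx])
        fpr, tpr, _ = roc_curve(y[test_idx], scores)
        interp_tpr = np.interp(grid, fpr, tpr)   # put every fold on the same grid
        interp_tpr[0] = 0.0
        tprs.append(interp_tpr)
        aucs.append(auc(fpr, tpr))
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    return grid, mean_tpr, np.std(tprs, axis=0), np.mean(aucs), np.std(aucs)
```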

Source data

Extended Data Fig. 2 Confusion matrices for TabMap and five other prediction models.

Average confusion matrices from 5-fold cross-validation on the HAR dataset for (a) TabMap, (b) 1DCNN, (c) LR, (d) RF, (e) GB, and (f) XGB.

Source data

Extended Data Fig. 3 Cell-type annotation and canonical biomarker identification using TabMap.

(a) 2D t-SNE visualization of T cells using embeddings extracted from the fully connected layer of the trained TabMap model, with ten distinct clusters represented by different colors. (b) Top 20 genes with the highest SHAP values crucial for identifying T cell subtypes CD8+TRM, CD4+FOXP3+, and CD4+RGCC+. Key genes previously identified in literature are marked in red on the y-axis. (c) Heat map illustrating local attributions of key genes based on SHAP values, with cells grouped into clusters as indicated by color bars at the bottom. Key genes for each cluster are annotated on the y-axis. Attribution values are color-coded, with positive attributions shown in red and negative attributions in blue.
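
As a rough illustration of how such per-feature attributions can be obtained from a trained TabMap-style CNN, the sketch below applies SHAP's DeepExplainer to a PyTorch model and maps pixel attributions back to feature names. It assumes the classic list-per-class return format of `shap_values`; the function name, the `feature_of_pixel` mapping and all variable names are hypothetical and do not come from the released code.

```python
# Hedged sketch: ranking map pixels (features) by mean absolute SHAP value.
import numpy as np
import shap

def rank_features_by_shap(model, background, samples, feature_of_pixel, top_k=20):
    """Return the top_k feature names with the largest mean |SHAP| attribution.

    background, samples : torch float tensors of shape (n, 1, H, W)
    feature_of_pixel    : dict {(row, col): feature_name} from the map layout
    """
    model.eval()
    explainer = shap.DeepExplainer(model, background)
    shap_values = explainer.shap_values(samples)   # assumed: list with one array per class
    # Average |attribution| over classes and samples -> one (H, W) importance map.
    attr = np.mean([np.abs(sv) for sv in shap_values], axis=(0, 1)).squeeze()
    ranked = sorted(feature_of_pixel.items(), key=lambda kv: attr[kv[0]], reverse=True)
    return [name for _, name in ranked[:top_k]]
```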

Source data

Supplementary information

Supplementary Information

Supplementary figures and tables.

Reporting Summary

Source data

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yan, R., Islam, M.T. & Xing, L. Interpretable discovery of patterns in tabular data via spatially semantic topographic maps. Nat. Biomed. Eng. 9, 471–482 (2025). https://doi.org/10.1038/s41551-024-01268-6
