
Overview of Sankey flow diagrams: Focusing on symptom trajectories in older adults with advanced cancer

Open Access. Published: January 07, 2022. DOI: https://doi.org/10.1016/j.jgo.2021.12.017

      Abstract

      This perspectives paper provides an overview of how to read and interpret a Sankey, examples using symptom data from older adults with advanced cancer, a synopsis of medical literature, and comments on creating and using the diagram for presentation of data. From prior reports and our own, we conclude Sankeys are an excellent tool for visualizing the changing status of older patients with cancer. Symptom data from older adults are used as an example and are displayed in a range of Sankey flow diagrams. Because there is large heterogeneity in aging, different subgroups can be examined. In a single diagram, Sankey can show both the likelihood and variability of patients' future status from their current status.

      1. Introduction

      Data visualization provides summative snapshots of research insights and highlights the importance and generalizability of the work [Brundage et al., 2018; Wilke, 2019]. The Sankey flow diagram (Sankey) is a data visualization technique that emphasizes flow, movement, or change from one state to another or from one time to another. Dating from 1898 and named for Captain Matthew Sankey, the technique is commonly used in physics and engineering to display energy flow and is also popular in economics and business for examining complex multi-step processes [Yu and Silva, 2017]. Nevertheless, Sankeys have a limited presence in the healthcare literature. Our interest in this method evolved from a need to improve data visualization of patient tolerability from clinical trials [Flannery et al., 2021]. Herein, we provide an overview of how to read and interpret a Sankey, examples using symptom data from older adults with advanced cancer, a synopsis of medical literature, and comments on creating and using the diagram for presentation of data.

      2. Brief overview of Sankey with sample diagram

      Flow diagrams are graphs that visualize paths between related events (i.e., states, positions, or steps) [Harris, 1999]. A Sankey diagram consists of nodes and arcs. As transitions occur, each arc flows from its source node to its target node(s). Fig. 1 is an example with terms labelled and defined. In a Sankey diagram, the size of each node and the width of each arc represent the number of objects/members, thus indicating the magnitude of flow [Lamer et al., 2020]. For example, a node with five members is half as tall as a node with ten members, and an arc carrying 20 objects is twice as wide as an arc carrying ten. In Fig. 1 the nodes are stacked vertically and organized by steps, qualifying the figure as a subcategory of Sankey called an alluvial diagram [Holtz, n.d.]. Alluvial diagrams allow the reader to easily compare the sample at different times.
      Fig. 1. Example of a Sankey flow diagram with components labelled, defined, and explained.
      Legend: The x-axis represents 3 consecutive steps or time points. The total height of the y-axis represents the full sample (100%). This Sankey depicts flow among three different states represented by colors. Eight nodes are depicted representing combinations of states and steps occurring in this example. The connections between nodes are referred to as links, flows, or arcs. Arcs represent the proportion of the sample transitioning from one node to the next.
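      The proportionality rule described above can be checked with a few lines of Python. This is only a minimal sketch: the counts, the state names, and the helper `scaled_sizes` are hypothetical and are not taken from Fig. 1.

```python
def scaled_sizes(counts, total_height=100.0):
    """Scale member counts to sizes that sum to total_height (the full sample)."""
    total = sum(counts.values())
    return {name: total_height * n / total for name, n in counts.items()}


# Hypothetical node counts at one step: 5, 10 and 35 members.
node_heights = scaled_sizes({"state A": 5, "state B": 10, "state C": 35})

# Hypothetical arcs leaving one 30-member node: 20 and 10 members.
arc_widths = scaled_sizes({"A to B": 20, "A to C": 10}, total_height=30.0)
```

      With these counts the ten-member node is drawn twice as tall as the five-member node, and the 20-member arc twice as wide as the ten-member arc, which is the visual encoding the section describes.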

      3. Sankey example: Patient-reported symptomatic toxicities

      Geriatric Assessment for Patients 70+ (GAP70+, NCT02054741) was a nationwide cluster randomized trial evaluating the effect of a geriatric assessment intervention on treatment-related toxicity reported by clinicians via the Common Terminology Criteria for Adverse Events (CTCAE) [Mohamed et al., 2021]. Patients' perspectives on symptoms were assessed with the Patient-Reported Outcomes version of the CTCAE (PRO-CTCAE). The National Cancer Institute developed the PRO-CTCAE to evaluate symptomatic toxicity in patients participating in cancer clinical trials [Basch et al., 2014]. A library of 78 symptom terms is available; investigators select items that are relevant to the population and treatment under investigation. Questions cover the symptom attributes of presence/absence, frequency, severity, and interference with usual/daily activities, scored 0–4.
      In GAP70+, Sankey diagrams were used to visualize the trajectories of symptoms during six months of treatment (Fig. 2) in 692 patients at four time points. In these diagrams, responses of three and four represent the most severe, most frequent, or highest interference and are collapsed into one category to simplify interpretation by decreasing the number of nodes. Three important symptomatic toxicities (diarrhea frequency, fatigue interference with usual/daily activities, and pain severity) were selected, and two graphs are provided for each symptom to illustrate specific aspects.
      Fig. 2. PRO-CTCAE Symptom Data Display with Sankey Flow Diagrams.
      Panel Ai: Both prevalence and frequency appear to be quite stable over time. Panel Aii: Focus on the arcs as they flow from left to right. For example, following the yellow color, about a fifth (22%) of patients reported that they had diarrhea rarely at baseline (yellow node); following the arcs to 4–6 weeks, 33% of these individuals improved and reported no diarrhea, 28% stayed the same at rarely, 27% reported that the frequency had increased, and 12% discontinued. Observing panel Bi, the arcs demonstrate frequent movement for patients at each time point to both lower and higher levels of interference. The focused Sankey (Bii) highlights the proportions of patients for whom interference got worse. For example, of the patients with no interference in usual/daily activities at the initiation of treatment, 56% reported some level of interference with daily activities by 4–6 weeks (6% had discontinued). Observing panel Ci, the arcs demonstrate frequent movement for patients at each time point to both lower and higher levels of interference. The focused Sankey (Cii) highlights the proportions of patients for whom interference got worse. For example, of the patients with no interference in usual/daily activities at the initiation of treatment, 56% reported some level of interference with daily activities by 4–6 weeks (6% had discontinued).
      Using Sankeys to display symptom trajectories in older adults with cancer during toxic treatment enhanced visualization of variability and of the association between symptom trajectories and discontinuation. These three symptoms were selected because of their clinical relevance in caring for older adults with advanced cancer receiving treatment. Clinicians caring for these individuals may be aware of the percentage who report these side effects in clinical trials but do not know how a particular individual will fare. In contrast, the Sankey contributes additional granular information based on the interpretation of nodes and arcs. Knowledge of diarrhea frequency is critically important in older adults, as diarrhea can lead to dehydration. The Sankey (Fig. 2Aii) provides evidence of both decreased and increased frequency, unlike the bar graph (Fig. 2Ai). Interference in usual/daily activities from fatigue indicates a decrease in functional status in older adults, an important tolerability metric. The increase in interference is highlighted in Fig. 2B, which also shows the proportion of patients reporting high interference who subsequently withdrew. For older adults with advanced cancer receiving treatment, an increase in pain severity warrants reevaluation. Fig. 2C indicates that patients reporting mild pain have wide variability in their pain levels at the next assessment, reinforcing the need for routine symptom monitoring.
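      The percentages read off the arcs leaving a single node can be reproduced from paired responses. The sketch below is illustrative only: the ten (baseline, follow-up) pairs are invented, and the names `LABELS`, `collapse`, and `transition_percentages` are hypothetical. It collapses responses of three and four into one category, as done for Fig. 2, and treats a missing follow-up (`None`) as discontinuation.

```python
from collections import Counter

# Responses of 3 and 4 share one node label, mirroring the collapsed category.
LABELS = {0: "0", 1: "1", 2: "2", 3: "3-4", 4: "3-4"}


def collapse(score):
    """Map a 0-4 response to a node label; None means the patient discontinued."""
    return "discontinued" if score is None else LABELS[score]


def transition_percentages(pairs, source):
    """Percentage of patients in `source` at the first time point reaching each node."""
    targets = Counter(collapse(b) for a, b in pairs if collapse(a) == source)
    total = sum(targets.values())
    return {node: 100 * n / total for node, n in targets.items()}


# Hypothetical (baseline, follow-up) responses for ten patients who started at 1.
pairs = [(1, 0), (1, 0), (1, 0), (1, 1), (1, 1), (1, 1),
         (1, 2), (1, 3), (1, 4), (1, None)]
shares = transition_percentages(pairs, source="1")
```

      The resulting shares sum to 100% of the source node, which is why the arcs leaving a node in a Sankey together span its full height.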

      4. Medical literature: Sankey flow diagrams

      A focused review of the medical literature was conducted to identify references that include Sankey diagrams, using PubMed and Google Scholar with the search term “Sankey diagram” for title and abstract. The search was restricted to English-language research articles. From the identified abstracts, full-text articles were scanned, and only articles that included actual Sankey diagrams were reviewed. Articles that included cancer and/or older adults were specifically searched. A total of 13 articles were selected, reviewed, and categorized as displayed in Supplemental Table 1. Three broad categories of use for Sankey were found in the medical literature: 1) to visualize flow/transitions over time, 2) to visualize flow/transitions to specific events, and 3) to demonstrate associations. An example that illustrates changes over time shows transitions in frailty status over one year in a cohort of older adults hospitalized with a critical illness [Brummel et al., 2020]. Changes in frailty status are visualized with an array of Sankeys to reflect transitions to improved or deteriorated status [Brummel et al., 2020]. A common example of using a Sankey to visualize the flow of events is to show the sequence of treatments, as for a cohort of older women with early-stage node-negative breast cancer [Carleton et al., 2021]. Sankey was also used to visualize associations between two different assay methods and hormone receptor status of breast tumors [Kenn et al., 2018].

      5. Creating Sankey flow diagrams

      A Sankey diagram can be drawn for discrete variables with interrelated values in any dataset. Two main steps are required: first, converting the data into the correct format for Sankey nodes and arcs, and second, programming the actual diagram display. These steps can be performed with a variety of software programs (e.g., R, Excel, Python, Plotly, Google Charts). For additional details refer to Supplemental Table 2. In Fig. 2, PRO-CTCAE data were organized into the required data matrix format to create nodes and arcs with Python software; the visualization was generated with the Python Plotly application programming interface. Plotly was selected because it provides precise control over the design of the graphic, allows multiple steps, and supports creating interactive graphics.
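      The two steps can be sketched in Python as follows. This is not the code used for Fig. 2, which is not reproduced in this paper; it is a minimal illustration under stated assumptions. The four-patient data set and the helpers `build_sankey_spec` and `make_figure` are hypothetical. The first function performs the data conversion with the standard library only; the second passes the result to Plotly's `go.Sankey` trace and requires the plotly package to be installed.

```python
from collections import Counter


def build_sankey_spec(rows, steps):
    """Step 1: turn per-patient state sequences into node labels and link lists.

    rows  : list of state sequences, one state per step for each patient
    steps : list of step names, same length as each sequence
    """
    # One node per (step, state) combination that actually occurs.
    node_keys = sorted({(i, row[i]) for row in rows for i in range(len(steps))})
    index = {key: n for n, key in enumerate(node_keys)}

    # Count patients moving between consecutive steps.
    flows = Counter()
    for row in rows:
        for i in range(len(steps) - 1):
            flows[(index[(i, row[i])], index[(i + 1, row[i + 1])])] += 1

    pairs = sorted(flows)
    return {
        "node": {"label": [f"{steps[i]}: {state}" for i, state in node_keys]},
        "link": {
            "source": [s for s, _ in pairs],
            "target": [t for _, t in pairs],
            "value": [flows[p] for p in pairs],
        },
    }


def make_figure(spec):
    """Step 2: render the spec with Plotly (requires the plotly package)."""
    import plotly.graph_objects as go

    return go.Figure(go.Sankey(node=spec["node"], link=spec["link"]))


# Hypothetical data: four patients, two time points.
rows = [("none", "none"), ("none", "mild"), ("mild", "mild"), ("mild", "none")]
spec = build_sankey_spec(rows, steps=["baseline", "4-6 weeks"])
```

      Calling `make_figure(spec).show()` would then display the diagram. Keeping the conversion separate from the rendering makes it easy to inspect the node and link lists before any drawing takes place.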
      A well-made Sankey requires a few careful design decisions that depend on the nature of the data and on the phenomenon the researcher wants to feature. The main decisions center on clear identification of the x- and y-axes, the number of nodes per step, handling of missing data, and the use of color and transparency. The x- and y-axes should be selected so that the placement of nodes allows information to flow intuitively through the diagram. In the symptom examples, the y-axis orders the nodes by symptom severity. Simultaneously, node heights indicate size (i.e., number or proportion of patients). The x-axis places nodes on a time scale. The number of nodes per step is critical to ease of interpretation. Collapsing nodes together in a meaningful way will improve the clarity of the visualization. If the data are numeric, values may need to be grouped or collapsed into a limited number of categories. Similarly, if the data are categorical, a decision needs to be made on the appropriate number of categories to display. The decision to collapse can be based on features of the data (which categories had a large number of respondents) or on clinical utility/meaningfulness (e.g., collapsing severe and very severe into one group, or collapsing treatment by category such as hormonal or first line).
      Missing data affect the overall Sankey display because they severely disturb the arc presentation. For withdrawal, we recommend adding dropout nodes, stacked vertically with the other nodes, at all time points after baseline. This allows visualizing the status of the entire cohort at each display point and adds clinically meaningful information (Supplemental Fig. 1). A second issue is intermittent missing values (an assessment time point that was missed while the patient remained on study). Researchers need to review their data set for the scope of this issue. Be aware that arcs may then skip intermediate nodes, making interpretation more challenging. Participants with missing data at intermediate times could be removed; however, this is recommended only if it affects a very small (i.e., <5%) proportion of the data, as deleting participants could be misleading.
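      One way to separate the two kinds of missing data before plotting is sketched below. The cohort, the label "discontinued", and the helpers `fill_dropout` and `intermittent_share` are hypothetical, and the sketch assumes that every participant has a baseline value. Trailing missing values are converted to a dropout state, while gaps followed by a later observation are left untouched so that their proportion can be compared with the small threshold mentioned above.

```python
def fill_dropout(sequence, label="discontinued"):
    """Replace trailing missing values (None) with a dropout state.

    Missing values followed by a later observation are intermittent
    and are left as None so they can be reviewed separately.
    """
    last_seen = max((i for i, v in enumerate(sequence) if v is not None), default=-1)
    return [label if v is None and i > last_seen else v
            for i, v in enumerate(sequence)]


def intermittent_share(sequences):
    """Proportion of participants with at least one intermittent gap."""
    filled = [fill_dropout(seq) for seq in sequences]
    return sum(None in seq for seq in filled) / len(filled)


# Hypothetical scores at four time points; None marks a missing assessment.
cohort = [
    [0, 1, 1, 2],
    [1, 2, None, None],  # withdrew after the second assessment
    [2, None, 1, 1],     # missed one visit but stayed on study
    [0, 0, 0, None],     # withdrew before the last assessment
]
filled = [fill_dropout(seq) for seq in cohort]
share = intermittent_share(cohort)
```

      In this invented cohort one of four participants has an intermittent gap, far above a 5% threshold, so removing such participants would not be advisable here.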
      The transparency and color of any Sankey can be set manually, which creates a range of options for highlighting specific patterns, as shown in Fig. 2. In our experience, using different shades of the same color palette creates challenges and confusion in interpreting data, as the arcs overlap. A diverging color palette for the nodes on the y-axis and distinct non-gradient colors may ease visualization [Vosough et al., 2019]. For specific transitions, adjusting transparency allows some arcs to fade into the background while others are highlighted. The amount of color saturation in the arcs can also be set; if the saturation is very weak, the arc will be more transparent and therefore more difficult to see, so saturation should be adjusted for optimal visualization. Among possible sources, the ColorBrewer website [Brewer and Harrower, n.d.] offers a variety of diverging palettes with instant demonstrations. If color is not an option, grayscale can be used; however, optimal contrast may be compromised, and this option requires labeling the nodes rather than identifying them by color.
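      Fading and highlighting can be expressed by giving each arc a color with its own opacity. The sketch below is an illustration under assumptions, not the settings used for Fig. 2: the hex values, the four-node layout, and the helpers `rgba` and `link_colors` are hypothetical. The output is a list of CSS-style `rgba(...)` strings, a format that Plotly accepts for arc (link) colors.

```python
def rgba(hex_color, alpha):
    """Convert '#rrggbb' to a CSS 'rgba(r, g, b, a)' string."""
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return f"rgba({r}, {g}, {b}, {alpha})"


def link_colors(sources, targets, node_colors, highlight, strong=0.8, faint=0.15):
    """Color each arc by its source node; fade arcs that are not in `highlight`."""
    return [
        rgba(node_colors[s], strong if (s, t) in highlight else faint)
        for s, t in zip(sources, targets)
    ]


# Hypothetical four-node diagram with two distinct, non-gradient colors.
node_colors = ["#2c7bb6", "#fdae61", "#2c7bb6", "#fdae61"]
sources, targets = [0, 0, 1, 1], [2, 3, 2, 3]

# Highlight only the arc from node 0 to node 3 (e.g., a worsening transition).
colors = link_colors(sources, targets, node_colors, highlight={(0, 3)})
```

      The opacity values 0.8 and 0.15 are arbitrary starting points; in practice they would be tuned until the highlighted arcs stand out while the faded arcs remain visible.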

      6. Pros and Cons

      Sankey diagrams represent an insightful but complex and specialized graphic method. The primary value of a Sankey is visualization of the evolution of a population's status over time through transitions. In other words, the Sankey diagram allows clinicians and patients to view the relative frequency of all patient status paths, illustrating the likelihood and variability of an individual's future status from their current one. The option of highlighting specific transitions (e.g., the cohort showing improvement) provides an additional advantage for illustrating patterns. Another benefit of Sankey is that multiple variables can be viewed in one graph. The Sankeys displaying symptom status (Fig. 2) include three variables (sample proportion, time, and symptom attribute), but more complex Sankeys include many more variables [Kunzig and Locatelli, 2020]. Transitions can also be colored by a subgroup property, such as sex, which allows a view of differences in flow between subgroups. Sankey offers two essential benefits compared with bar graphs. First, Sankey makes it possible to visualize transitions from one state to another. Second, multiple changes (steps/times) can be shown with relatively little loss of interpretability. Bar graphs can show the proportion of the population in each state but do not visualize transitions [Lamer et al., 2020].
      One downside of Sankey is complexity. An explanation is often required for users to fully grasp the benefits. Even with this complexity, many Sankey diagrams track only immediate transitions and do not illustrate the entire trajectory. For example, they will not fully track the portion of the cohort that has no symptoms, then some symptoms, and then no symptoms again. A special Sankey called a parallel set diagram has this functionality [Wilke, 2019], but it further increases the complexity of displaying transitions. If transitions are not the focus of the research, a stacked bar graph would likely be sufficient. A potential limitation is that when few members experience a change, the transition arc is very narrow and more difficult to see. As with all graphs, it is important to display not only necessary but also sufficient information related to the studied question. An additional drawback is that creating a Sankey may require specific software or advanced programming skills. However, current advances in analytical sciences continue to increase the availability of many visualization techniques [Wilke, 2019].
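      The difference between immediate transitions and complete trajectories can be made concrete with a small count. The six three-point paths below are invented for illustration, and the variable names are hypothetical. The first counter holds what the arcs of a standard Sankey encode; the second holds the complete paths that a parallel-set style display would need.

```python
from collections import Counter

# Hypothetical symptom states for six patients at three time points.
paths = [
    ("none", "some", "none"),
    ("none", "some", "some"),
    ("some", "some", "none"),
    ("some", "some", "none"),
    ("none", "none", "none"),
    ("some", "none", "none"),
]

# What a standard Sankey encodes: counts for each step-to-step transition.
step_pairs = Counter(
    (i, p[i], p[i + 1]) for p in paths for i in range(len(p) - 1)
)

# What a complete-trajectory display needs: counts for each full path.
full_paths = Counter(paths)

recovered = full_paths[("none", "some", "none")]
into_some = step_pairs[(0, "none", "some")]
out_of_some = step_pairs[(1, "some", "none")]
```

      Here two patients move from none to some and three later move from some to none, but only one patient follows the complete none, some, none path. The step-wise counts alone cannot reveal that number, which is the limitation described above.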

      7. Conclusion

      In conclusion, despite their development over 100 years ago, Sankey flow diagrams have only recently garnered attention in the medical literature. From prior reports and our own, we conclude Sankeys are an excellent tool for visualizing the changing status of older patients with cancer. Because there is large heterogeneity in aging, different subgroups can be examined. In a single diagram, Sankey can show both the likelihood and variability of patients' future status from their current status. They can be designed for any longitudinal patient status all while being highly customizable so that transitions and subgroups of interest can be highlighted. We recommend that clinicians and researchers consider integrating Sankey diagrams as appropriate for optimal display of longitudinal changes.

      Author contributions

      Conceptualization: Otto, Flannery, Culakova, Mohile
      Data curation: Mohile, Flannery, Culakova
      Funding acquisition: Mohile
      Methodology: Mohile, Flannery
      Formal Analysis: Flannery, Culakova, Otto
      Visualization: Otto, Zhang
      Writing - original draft: Otto, Flannery, Culakova
      Writing - review and editing: all authors

      Declaration of Competing Interest

      None.

      Acknowledgements

      We would like to thank and acknowledge S. Rosenthal MD for editorial review.
      The funding sources had no involvement in study design, analysis and interpretation of data, writing, or the decision to submit for publication. This work was supported by the National Institutes of Health (NIH NCI UG CA 189961, NIH U01 CA 233167-01, NIH NIA K24 AG056589, NIH NCI R01CA177592, and The Claude D. Pepper Older Americans Independence Center Award # P30-AG024832).

      Appendix A. Supplementary data

      The following are the supplementary data related to this article.

      References

        • Brundage M., Blackford A., Tolbert E., et al. Presenting comparative study PRO results to clinicians and researchers: beyond the eye of the beholder. Qual Life Res. 2018; 27: 75-90
        • Wilke C.O. Fundamentals of data visualization: A primer on making informative and compelling figures. O’Reilly Media, 2019
        • Yu B., Silva C.T. VisFlow - web-based visualization framework for tabular data with a subset flow model. IEEE Trans Vis Comput Graph. 2017; 23: 251-260
        • Flannery M.A., Culakova E., Canin B.E., Peppone L., Ramsdale E., Mohile S.G. Understanding treatment tolerability in older adults with cancer. J Clin Oncol. 2021; 39: 2150-2163
        • Harris R.L. Information graphics: A comprehensive illustrated reference. Oxford University Press, New York, New York, 1999
        • Lamer A., Laurent G., Pelayo S., El Amrani M., Chazard E., Marcilly R. Exploring patient path through Sankey diagram: a proof of concept. Stud Health Technol Inform. 2020; 270: 218-222
        • Holtz Y. Sankey Diagram. data-to-viz. (Accessed June 25, 2021)
        • Mohamed M.R., Kyi K., Mohile S.G., et al. Prevalence of and factors associated with treatment modification at first cycle in older adults with advanced cancer receiving palliative treatment. J Geriatr Oncol. 2021; 12: 1208-1213. https://doi.org/10.1016/j.jgo.2021.06.007
        • Basch E., Reeve B.B., Mitchell S.A., et al. Development of the National Cancer Institute’s patient-reported outcomes version of the common terminology criteria for adverse events (PRO-CTCAE). J Natl Cancer Inst. 2014; 106
        • Brummel N.E., Girard T.D., Pandharipande P.P., et al. Prevalence and course of frailty in survivors of critical illness. Crit Care Med. 2020; 48: 1419-1426
        • Carleton N., Zou J., Fang Y., et al. Outcomes after sentinel lymph node biopsy and radiotherapy in older women with early-stage, estrogen receptor-positive breast cancer. JAMA Netw Open. 2021; 4: e216322
        • Kenn M., Cacsire Castillo-Tong D., Singer C.F., Cibena M., Kolbl H., Schreiner W. Co-expressed genes enhance precision of receptor status identification in breast cancer patients. Breast Cancer Res Treat. 2018; 172: 313-326
        • Vosough Z., Kammer D., Keck M., Groh R. Visualization approaches for understanding uncertainty in flow diagrams. J Comput Lang. 2019; 52: 44-54
        • Brewer C., Harrower M. The Pennsylvania State University. Color Brewer 2.0.
        • Kunzig R., Locatelli L. Here’s how a ‘circular economy’ could save the world. National Geographic. March 2020