Department of Preventive Medicine and Population Health, Sealy Center on Aging, University of Texas Medical Branch Galveston, TX, United States of America
This perspectives paper provides an overview of how to read and interpret a Sankey, examples using symptom data from older adults with advanced cancer, a synopsis of medical literature, and comments on creating and using the diagram for presentation of data. From prior reports and our own, we conclude Sankeys are an excellent tool for visualizing the changing status of older patients with cancer. Symptom data from older adults are used as an example, with data displayed in a range of Sankey flow diagrams. Because there is large heterogeneity in aging, different subgroups can be examined. In a single diagram, Sankey can show both the likelihood and variability of patients' future status from their current status.
]. The Sankey flow diagram (Sankey) is a data visualization technique that emphasizes flow/movement/change from one state to another or one time to another. Dating from 1898 and named for Captain Matthew Sankey, the technique, commonly used in physics and engineering to display energy flow, is also popular in economics and business to examine complex multi-step processes [
]. Nevertheless, Sankeys have a limited presence in healthcare literature. Our interest in this method evolved from a need to improve data visualization of patient tolerability from clinical trials [
]. Herein, we provide an overview of how to read and interpret a Sankey, examples using symptom data from older adults with advanced cancer, a synopsis of medical literature, and comments on creating and using the diagram for presentation of data.
2. Brief overview of Sankey with sample diagram
Flow diagrams are graphs that visualize paths between related events (i.e. states, positions, or steps) [
]. A Sankey diagram includes nodes and arcs. As transitions occur, each arc flows from its source node to target node(s). Fig. 1 is an example with terms labelled and defined. In a Sankey diagram the size of each node and width of each arc represent the number of objects/members, thus indicating the magnitude of flow [
]. For example, a node with five members would be half as tall as a ten-member node. An arc transitioning 20 objects is twice as wide/tall as an arc transitioning ten objects. In Fig. 1 the nodes are stacked vertically and organized by steps, qualifying the figure as a subcategory of Sankey called an alluvial diagram [
Legend: The x-axis represents 3 consecutive steps or time points. The total height of the y-axis represents the full sample (100%). This Sankey depicts flow among three different states represented by colors. Eight nodes are depicted representing combinations of states and steps occurring in this example. The connections between nodes are referred to as links, flows, or arcs. Arcs represent the proportion of the sample transitioning from one node to the next.
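The proportional sizing rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name `proportional_sizes` and all counts are hypothetical.

```python
def proportional_sizes(counts, total_size=1.0):
    """Scale each count to its share of total_size (e.g. the full y-axis)."""
    total = sum(counts.values())
    return {name: total_size * n / total for name, n in counts.items()}

# Hypothetical node memberships at a single step (not data from the paper).
node_counts = {"state A": 5, "state B": 10, "state C": 5}
node_heights = proportional_sizes(node_counts)

# Hypothetical arcs leaving one node, keyed by (source, target).
arc_counts = {("A", "B"): 20, ("A", "C"): 10}
arc_widths = proportional_sizes(arc_counts)
```

Here the five-member node receives half the height of the ten-member node, and the 20-object arc is twice as wide as the ten-object arc, matching the rule described above.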
Geriatric Assessment for Patients 70+ (GAP 70+, NCT:02054741) was a nationwide cluster randomized trial evaluating the effect of a geriatric assessment intervention on treatment-related toxicity reported by clinicians via the Common Terminology Criteria for Adverse Events (CTCAEs) [
]. Patients' perspectives about symptoms were assessed with Patient-Reported Outcomes (PRO)-CTCAEs. The National Cancer Institute developed PRO-CTCAEs to evaluate symptomatic toxicity in patients participating in cancer clinical trials [
]. A library of 78 symptom terms is available; investigators select items that are relevant to the population and treatment under investigation. Questions include the symptom attributes of presence/absence, frequency, severity, and interference from the symptom with usual/daily activities, scored 0–4.
In GAP70+, Sankey diagrams were used to visualize the trajectories of symptoms during six months of treatment (Fig. 2) in 692 patients at four time points. In these diagrams, responses of three and four represent the most severe, most frequent, or highest interference and are collapsed into one category to simplify interpretation by decreasing the number of nodes. Three important symptomatic toxicities (diarrhea frequency, fatigue interference with usual/daily activities, and pain severity) were selected, and two graphs are provided for each symptom to illustrate specific aspects.
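Collapsing the two highest response levels into one category, as described above, might look like the following sketch. The function name, the label text, and the sample responses are assumptions made for illustration only.

```python
def collapse_score(score):
    """Map a 0-4 item score to 0-3 by merging responses 3 and 4."""
    if score not in range(5):
        raise ValueError(f"expected a score from 0 to 4, got {score!r}")
    return min(score, 3)

# Hypothetical display labels for the four remaining categories.
COLLAPSED_LABELS = ["0", "1", "2", "3-4"]

# Hypothetical raw responses from seven patients at one time point.
raw_scores = [0, 4, 2, 3, 1, 4, 0]
collapsed = [collapse_score(s) for s in raw_scores]
labels = [COLLAPSED_LABELS[c] for c in collapsed]
```

With this mapping each step of the diagram carries at most four symptom nodes instead of five, before any dropout node is added.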
Fig. 2. PRO-CTCAE Symptom Data Display with Sankey Flow Diagrams.
Figure Ai: Both prevalence and frequency appear to be quite stable over time. Figure Aii: Focus on the arcs as they flow from left to right. For example, following the yellow color, about a fifth (22%) of patients reported that they had diarrhea rarely at baseline (yellow node); following the arcs to 4–6 weeks, 33% of these individuals improved and reported no diarrhea, 28% stayed the same at rarely, and 27% reported the frequency had increased, while 12% discontinued. Observing panel Bi, the arcs demonstrate frequent movement for patients at each time point to both lower and higher levels of interference. The focused Sankey (Bii) highlights the proportions of patients for whom interference got worse. For example, of the patients with no interference in usual/daily activities at the initiation of treatment, 56% reported some level of interference with daily activities by 4–6 weeks (6% had discontinued). Observing panel Ci, the arcs demonstrate frequent movement for patients at each time point to both lower and higher levels of interference. The focused Sankey (Cii) highlights the proportions of patients for whom interference got worse. For example, of the patients with no interference in usual/daily activities at the initiation of treatment, 56% reported some level of interference with daily activities by 4–6 weeks (6% had discontinued).
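Reading the arcs that leave a single node amounts to computing conditional percentages: of the patients in one state at one time point, what share moves to each state at the next. The sketch below shows that calculation under stated assumptions; the function name and the ten patient records are hypothetical and are not the trial data.

```python
from collections import Counter

def transition_percentages(pairs, source_state):
    """Percentage of patients in source_state who move to each next state."""
    targets = Counter(nxt for cur, nxt in pairs if cur == source_state)
    n = sum(targets.values())
    if n == 0:
        return {}  # nobody started in source_state
    return {state: 100 * count / n for state, count in targets.items()}

# Hypothetical (baseline, 4-6 week) states for ten patients.
pairs = (
    [("rarely", "none")] * 2
    + [("rarely", "rarely")] * 1
    + [("rarely", "occasionally")] * 1
    + [("rarely", "discontinued")] * 1
    + [("none", "none")] * 4
    + [("none", "rarely")] * 1
)

from_rarely = transition_percentages(pairs, "rarely")
```

In this made-up cohort, `from_rarely` describes the widths of the arcs leaving the baseline "rarely" node relative to the height of that node.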
The use of Sankey to display symptom trajectories in older adults with cancer during toxic treatment enhanced visualization of variability and the association of symptom trajectories with discontinuation. These three symptoms were selected because of their clinical relevance in caring for older adults with advanced cancer receiving treatment. Clinicians caring for these individuals may be aware of the percentage who report these side effects in clinical trials but do not know how a particular individual will fare. In contrast, the Sankey contributes additional granular information based on the interpretation of nodes and arcs. Knowledge of diarrhea frequency is critically important in older adults, as it can lead to dehydration. The Sankey (Fig. 2Aii) provides evidence of both decreased and increased frequency, unlike the bar graph (Fig. 2Ai). Interference in usual/daily activities from fatigue indicates a decrease in functional status in older adults, an important tolerability metric. The increase in interference is highlighted in Fig. 2B, which also shows the proportion of patients reporting high interference who subsequently withdrew. For older adults with advanced cancer receiving treatment, an increase in pain severity warrants reevaluation. Fig. 2C indicates that patients reporting mild pain have wide variability in their pain levels at the next assessment, reinforcing the need for routine symptom monitoring.
4. Medical literature: Sankey flow diagrams
A focused review of the medical literature was conducted to identify references including Sankey diagrams using PubMed and Google Scholar with the search term “Sankey diagram” for title and abstract. The search was restricted to English language research articles. From the identified abstracts, full text articles were scanned, and only articles that included actual Sankey diagrams were reviewed. Articles that included cancer and/or older adults were specifically searched. A total of 13 articles were selected, reviewed, and categorized as displayed in Supplemental Table 1. Three broad categories of use for Sankey were found in the medical literature: 1) to visualize flow/transitions over time, 2) to visualize flow/transitions to specific events, and 3) to demonstrate associations. An example that illustrates changes over time shows transitions in frailty status over one year in a cohort of older adults hospitalized with a critical illness [
]. A common example of using a Sankey to visualize the flow of events is to show the sequence of treatments, as for a cohort of older women with early-stage node-negative breast cancer [
A Sankey diagram can be drawn for discrete variables with interrelated values in any dataset. Two main steps are required: first, converting the data into the correct format for Sankey nodes and arcs, and second, programming the actual diagram display. These steps can be performed with a variety of software programs (e.g., R, Excel, Python, Plotly, Google Charts). For additional details refer to Supplemental Table 2. In Fig. 2, PRO-CTCAE data were organized into the required data matrix format to create nodes and arcs with Python; the visualization was generated with the Python Plotly application programming interface. Plotly was selected because it provides precise control over the design of the graphic, allows multiple steps, and supports creating interactive graphics.
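The two steps can be sketched as follows. This is an illustrative outline, not the code used for Fig. 2: the record layout, the function names `build_sankey_inputs` and `draw_sankey`, and the sample records are all assumptions. The first function uses only the standard library; the second assumes the third-party `plotly` package is installed and passes the prepared lists to `plotly.graph_objects.Sankey`.

```python
from collections import Counter

def build_sankey_inputs(records, steps):
    """Step 1: turn per-patient state sequences into nodes and arcs.

    records: one dict per patient mapping step name -> state
    steps:   ordered list of step names (the x-axis)
    """
    # One node per (step, state) combination that occurs in the data.
    node_index = {}
    for step in steps:
        for state in sorted({r[step] for r in records}):
            node_index[(step, state)] = len(node_index)

    # Count patients moving between nodes of adjacent steps.
    flows = Counter()
    for r in records:
        for a, b in zip(steps, steps[1:]):
            flows[(node_index[(a, r[a])], node_index[(b, r[b])])] += 1

    labels = [f"{state} ({step})" for (step, state) in node_index]
    source = [s for s, _ in flows]
    target = [t for _, t in flows]
    value = list(flows.values())
    return labels, source, target, value

def draw_sankey(labels, source, target, value):
    """Step 2: build the figure (requires the plotly package)."""
    import plotly.graph_objects as go  # imported lazily; third-party
    return go.Figure(go.Sankey(
        node=dict(label=labels),
        link=dict(source=source, target=target, value=value),
    ))

# Hypothetical records for five patients at two time points.
steps = ["baseline", "4-6 weeks"]
records = [
    {"baseline": "none", "4-6 weeks": "none"},
    {"baseline": "none", "4-6 weeks": "mild"},
    {"baseline": "mild", "4-6 weeks": "none"},
    {"baseline": "mild", "4-6 weeks": "mild"},
    {"baseline": "mild", "4-6 weeks": "mild"},
]
labels, source, target, value = build_sankey_inputs(records, steps)
```

Calling `draw_sankey(labels, source, target, value).show()` would then render an interactive diagram in an environment where Plotly is available.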
A well-made Sankey requires a few careful design decisions that depend on the nature of the data and what phenomenon the researcher wants to feature. The main decisions center on clear identification of the x- and y-axes, the number of nodes per step, handling missing data, and the use of color and transparency. The x- and y-axes should be selected so the placement of nodes allows information to intuitively flow through the diagram. In the symptom examples, the y-axis orders the nodes by symptom severity. Simultaneously, node heights indicate size (i.e. number or proportion of patients). The x-axis places nodes on a time scale. The number of nodes per step is critical to ease of interpretation. Collapsing nodes together in a meaningful way will improve clarity of the visualization. If the data are numeric, values may need to be grouped or collapsed into a limited number. Similarly, if the data are categorical, a decision needs to be made on the appropriate number of categories to display. The decision to collapse can be based upon features of the data (which categories had a large number of respondents) or on a clinical utility/meaningfulness dimension (e.g. collapsing severe and very severe into one group, or collapsing treatment by category such as hormonal or first line).
Missing data affect the overall Sankey display because they severely disturb the arc presentation. For withdrawal, we recommend adding dropout nodes, stacked in the vertical direction, for all time points after baseline. This allows visualizing the status of the entire cohort at each display point and adds clinically meaningful information (Supplemental Fig. 1). A second issue is intermittent missing values (an assessment time point that was missed while the patient remained on study). Researchers need to review their data set for the scope of this issue. Be aware that arcs may skip intermediate nodes, making interpretation more challenging. Participants with missing data at intermediate times could be removed; however, this is recommended only if it affects a very small proportion of the data (i.e. < 5%), as deleting participants could be misleading.
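One way to prepare data along these lines is sketched below. It is an illustration under assumptions, not the authors' procedure: missing assessments are encoded as `None`, trailing missing values are treated as withdrawal, and the names `DROPOUT`, `fill_dropout`, and `intermittent_fraction` are hypothetical.

```python
DROPOUT = "discontinued"  # hypothetical label for the dropout node

def fill_dropout(sequence):
    """Replace trailing missing values (None) with an explicit dropout state.

    A missing value that is followed by a later observation is left as None
    so it can be reviewed separately as intermittent missingness.
    """
    filled = list(sequence)
    i = len(filled)
    while i > 1 and filled[i - 1] is None:  # never overwrite baseline
        i -= 1
        filled[i] = DROPOUT
    return filled

def intermittent_fraction(sequences):
    """Share of patients with a missing value that is not trailing dropout
    (this also counts a missing baseline)."""
    flagged = sum(1 for seq in sequences if None in fill_dropout(seq))
    return flagged / len(sequences)

# Hypothetical symptom states for four patients at four time points.
sequences = [
    ["mild", "none", None, None],      # withdrew after the second visit
    ["none", None, "mild", "mild"],    # missed one visit, stayed on study
    ["none", "none", "none", "none"],
    ["mild", "mild", "severe", None],  # withdrew before the last visit
]
filled = [fill_dropout(seq) for seq in sequences]
share = intermittent_fraction(sequences)
small_enough_to_drop = share < 0.05   # the < 5% guideline from the text
```

In this made-up cohort one patient in four has an intermittent gap, which is well above the 5% guideline, so removing those participants would not be advisable.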
The transparency and color of any Sankey can be set manually, which creates a range of options for highlighting specific patterns as shown in Fig. 2. In our experience, using different shades of the same color palette creates challenges and confusion in interpreting data, as the arcs overlap. A diverging color palette for the nodes on the y-axis and distinct non-gradient colors may ease visualization [
]. For specific transitions, adjusting transparency allows arcs to fade into the background while others are highlighted. The amount of color saturation in the arcs can be set; if the saturation is very weak the arc will be more transparent and therefore more difficult to see. Therefore, the saturation can be adjusted for optimal visualization. Among possible sources, the ColorBrewer [
] website offers a variety of diverging palettes with instant demonstrations. This option requires labeling the nodes rather than identifying them by color. If color is not an option, grayscale can be used; however, optimal contrast may be compromised.
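Highlighting selected transitions through transparency can be sketched as below. The hex values, the opacity levels, and the function names are illustrative assumptions rather than the settings used in Fig. 2. The output is a list of CSS-style `rgba(...)` strings, a color format that Plotly accepts for Sankey link colors.

```python
def rgba(hex_color, alpha):
    """Convert '#rrggbb' plus an opacity into an 'rgba(r, g, b, a)' string."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({r}, {g}, {b}, {alpha})"

def link_colors(links, node_colors, highlight, strong=0.8, faint=0.15):
    """Color each arc by its source node; fade arcs not in `highlight`."""
    return [
        rgba(node_colors[src], strong if (src, tgt) in highlight else faint)
        for src, tgt in links
    ]

# Illustrative diverging colors for three severity levels (node index -> hex).
node_colors = {0: "#2166ac", 1: "#f7f7f7", 2: "#b2182b"}

# Hypothetical arcs as (source node, target node) pairs.
links = [(0, 1), (0, 2), (1, 2), (2, 0)]

# Emphasize only the arcs that represent worsening (lower -> higher level).
worsening = {(s, t) for s, t in links if t > s}
colors = link_colors(links, node_colors, worsening)
```

Raising `faint` makes the background arcs easier to see at the cost of weaker emphasis, so the two opacity values are worth tuning for each figure.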
6. Pros and Cons
Sankey diagrams represent an insightful but complex and specialized graphic method. The primary value of a Sankey is visualization of the evolution of a population's status over time through transitions. In other words, the Sankey diagram allows clinicians and patients to view the relative frequency of all patient status paths, illustrating the likelihood and variability of an individual's future status from their current one. The option of highlighting specific transitions (e.g., cohort showing improvement) provides an additional advantage for illustrating patterns. Another benefit of Sankey is that multiple variables can be viewed with one graph. The Sankeys displaying symptom status (Fig. 2) include three variables (sample proportion, time, and symptom attribute), but more complex Sankeys include many more variables [
]. Transitions can also be colored by a subgroup property, such as sex, which allows a view of differences in flow between subgroups. Sankey embodies two essential benefits compared to bar graphs. First, Sankey provides the ability to visualize transitions from one state to another. The second is that multiple changes (steps/times) can be shown with relatively little loss of interpretability. Bar graphs can show the proportion of the population in each state, but do not visualize transitions [
One downside of Sankey is complexity. An explanation is often required for users to fully grasp the benefits. Even with this complexity, many Sankey diagrams track only immediate transitions and do not illustrate the entire trajectory. For example, they will not fully track the portion of the cohort with no symptoms, then some symptoms, and then transitioning to no symptoms again. A special Sankey called a parallel set diagram has this functionality [
], but it further increases the complexity of displaying transitions. If transitions are not the focus of the research, a stacked bar graph would likely be sufficient. A potential limitation is that when few members experience a change, the transition arc will be very narrow and more difficult to see. As with all graphs, it is important to display not only necessary but also sufficient information related to the studied question. An additional con is that creating a Sankey may require specific software or advanced programming skills. However, current advances in analytical sciences continue to increase the availability of many visualization techniques [
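The difference between tracking immediate transitions and tracking entire trajectories, noted earlier in this section, can be made concrete with a small sketch. The two cohorts and the function names below are hypothetical; the point is that two cohorts can produce identical step-to-step arcs while containing different individual paths.

```python
from collections import Counter

def pairwise_transitions(sequences):
    """Counts of moves between adjacent steps (what a basic Sankey draws)."""
    flows = Counter()
    for seq in sequences:
        for step, (a, b) in enumerate(zip(seq, seq[1:])):
            flows[(step, a, b)] += 1
    return flows

def full_trajectories(sequences):
    """Counts of complete paths across all steps."""
    return Counter(tuple(seq) for seq in sequences)

# Two hypothetical cohorts of two patients each, observed at three steps.
cohort_1 = [("none", "some", "none"), ("some", "some", "some")]
cohort_2 = [("none", "some", "some"), ("some", "some", "none")]

same_arcs = pairwise_transitions(cohort_1) == pairwise_transitions(cohort_2)
same_paths = full_trajectories(cohort_1) == full_trajectories(cohort_2)
```

Both cohorts yield the same arcs, yet only the first contains a patient who goes from no symptoms to some symptoms and back to none, which is the information a trajectory-level display would preserve.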
In conclusion, despite their development over 100 years ago, Sankey flow diagrams have only recently garnered attention in the medical literature. From prior reports and our own, we conclude Sankeys are an excellent tool for visualizing the changing status of older patients with cancer. Because there is large heterogeneity in aging, different subgroups can be examined. In a single diagram, Sankey can show both the likelihood and variability of patients' future status from their current status. They can be designed for any longitudinal patient status all while being highly customizable so that transitions and subgroups of interest can be highlighted. We recommend that clinicians and researchers consider integrating Sankey diagrams as appropriate for optimal display of longitudinal changes.
We would like to thank and acknowledge S. Rosenthal MD for editorial review.
The funding sources had no involvement in study design, analysis, and interpretation of data, writing, or decision to submit for publication. This work was supported by the National Institutes of Health (NIH NCI UG CA 189961, NIH U01 CA 233167-01, NIH NIA K24 AG056589, NIH NCI R01CA177592 and The Claude D. Pepper Older Americans Independence Center Award #P30-AG024832).
Appendix A. Supplementary data
The following are the supplementary data related to this article.