Combining longitudinal real-world patient data with AI methodologies can reveal novel disease phenotypes and uncover new insights into patient progression. However, working with real-world patient data is challenging, because various sources of unwanted bias can dominate the clustering outcome and lead to findings that are not clinically relevant. This problem is amplified when the data is longitudinal and spans multiple years. In our recent paper, which is currently under review (preprint available here), we show how adversarial training techniques can overcome these challenges and help find clinically relevant patient phenotypes.
In this work, we show that previously proposed approaches suffer from a common data bias, which we refer to as “trajectory bias”. Trajectory bias occurs when algorithms learn to rely heavily on the amount of data present for each patient, which essentially represents the number of medical visits that patient has had within the timespan of the dataset. The amount of data can be correlated with a patient’s health status, as patients with greater health complications are likely to visit the hospital more often and therefore have more data, but this is not always the case. For example, a patient is sometimes transferred between different NHS trusts, so the amount of data that one trust holds on that patient may not reflect the true number of hospital visits that patient has made. There are various further reasons why a patient could have missing data, such as patient relocation or problems with the IT infrastructure.
Ignoring the trajectory bias problem results in patient clusters that are dominated by the amount of available data rather than by the specific clinical observations, as illustrated in Figure 1 (left side). Our novel algorithm overcomes this problem and can identify patient groups based on similar clinical observations (right side).
Our algorithm, developed in collaboration between one of our research teams at Sensyne Health and Bayer, combines two core methods to reduce the extent to which clustering analyses are influenced directly by the amount of data present for a given patient. Firstly, the algorithm uses a recurrent neural network autoencoder, which learns to take the complex, temporal data in each patient’s medical history and compress it into a fixed-size representation to which further clustering methods can be applied. Secondly, the algorithm uses an adversarial training scheme while training the recurrent neural network autoencoder. This allows the learnt embeddings to be invariant to certain types of bias in the data. In the case of our algorithm, the adversarial training scheme encourages the model to be insensitive to changes in a patient’s trajectory length. A diagram of the algorithm’s architecture can be seen in Figure 2.
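To make the idea of adversarial training more concrete, one common way to write such an objective is shown below. This is a generic sketch and not necessarily the formulation used in the paper. All symbols are assumptions introduced here for illustration: an encoder $E_\theta$, a decoder $D_\phi$, an adversary $A_\psi$ that tries to predict the trajectory length $\ell(x)$ from the embedding, and a weight $\lambda \ge 0$ that sets how strongly the adversary influences the encoder.

```latex
\min_{\theta,\,\phi}\;\max_{\psi}\;
\mathbb{E}_{x}\Big[
  \mathcal{L}_{\mathrm{rec}}\big(x,\; D_\phi(E_\theta(x))\big)
  \;-\;
  \lambda\,\mathcal{L}_{\mathrm{adv}}\big(\ell(x),\; A_\psi(E_\theta(x))\big)
\Big]
```

Under these assumptions, the adversary is trained to make its length-prediction loss $\mathcal{L}_{\mathrm{adv}}$ small, while the encoder is rewarded when that loss is large, so the embedding is pushed to carry as little information about trajectory length as possible while still allowing the trajectory to be reconstructed. Setting $\lambda = 0$ recovers a plain autoencoder.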
We validated this algorithm using two cohorts: a wider cardiovascular cohort of approximately 500,000 patients and a smaller cohort of 1,430 patients with heart failure. Using several evaluation metrics, we found that our algorithm was less biased by variations in the amount of data a patient has, and therefore reduced the effect of trajectory bias. As a result, our model produced patient representation spaces in which the trajectory lengths of patients were more evenly distributed. This can be seen in Figure 3, which shows the relative positions of all patients in the patient representation spaces, compressed down to two dimensions using principal component analysis (PCA). The colour of each point on the plots indicates the trajectory length, with darker points representing longer trajectories. The plots show a clear difference in how patients with different trajectory lengths are distributed between a state-of-the-art autoencoder model, considered here as the baseline, and our adversarial autoencoder model. The adversarial effect can be adjusted depending on the specific application.
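As a rough illustration of how trajectory bias in a representation space could be checked, the sketch below computes the largest absolute correlation between any embedding dimension and the trajectory length. This is a simple stand-in diagnostic, not one of the evaluation metrics from the paper. The function names (`pearson`, `length_leakage`) and the synthetic data are hypothetical and exist only for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


def length_leakage(embeddings, lengths):
    """Largest absolute correlation between any embedding dimension
    and the trajectory length (hypothetical diagnostic)."""
    dims = len(embeddings[0])
    return max(
        abs(pearson([e[d] for e in embeddings], lengths))
        for d in range(dims)
    )


# Synthetic stand-in data: no real patient data is used here.
rng = random.Random(0)
lengths = [rng.randint(1, 50) for _ in range(500)]

# "Biased" embeddings: the first dimension grows with trajectory length.
biased = [[0.1 * n + rng.gauss(0, 1), rng.gauss(0, 1)] for n in lengths]

# "Unbiased" embeddings: neither dimension depends on trajectory length.
unbiased = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in lengths]

print(f"biased leakage:   {length_leakage(biased, lengths):.2f}")
print(f"unbiased leakage: {length_leakage(unbiased, lengths):.2f}")
```

A value close to 1 suggests that at least one embedding dimension largely encodes how much data a patient has, whereas a value close to 0 suggests that trajectory length is not linearly recoverable from any single dimension. A linear check like this can miss non-linear dependence, so it should be read as a quick sanity check rather than a full evaluation.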
With this novel approach, patient clusters found through unsupervised patient stratification are more likely to be determined by clinical observations than by the amount of data present for a given patient. This allows much more meaningful clusters to be derived from patient cohorts, which can lead to deeper insights into conditions and help to guide recruitment for clinical trials.