
Reducing Data Bias in Patient Stratification Analysis using longitudinal Real-World Data

Sensyne Health and Bayer collaborated on a project that used anonymised patient data from Oxford University Hospitals NHS Foundation Trust to develop an algorithm that identifies patient subgroups based on clinical factors rather than unwanted data biases.

February 4, 2022

Combining longitudinal real-world patient data with AI methodologies can reveal novel disease phenotypes and new insights into patient progression. However, working with real-world patient data is challenging because various sources of unwanted data bias can dominate the clustering outcome and lead to findings that are not clinically relevant. The problem is amplified for longitudinal real-world data that spans multiple years. In our recent paper, which is currently under review (preprint available here), we show how adversarial training techniques can overcome these challenges and find clinically relevant patient phenotypes.

In this work, we show that previously proposed approaches suffer from a common data bias, which we refer to as “trajectory bias”. Trajectory bias occurs when algorithms learn to rely heavily on the amount of data present for each patient, which essentially represents the number of medical visits that patient has had within the timespan of the dataset. The amount of data can be correlated with a patient’s health status, since patients with greater health complications are likely to visit the hospital more often and therefore have more data, but this is not always the case. For example, a patient is sometimes transferred between different NHS trusts, so the amount of data that one trust holds on that patient may not reflect the true number of hospital visits the patient has made. There are various other reasons why a patient could have missing data, such as patient relocation or problems with the IT infrastructure.

Ignoring the “trajectory bias” problem results in patient clusters that are dominated by the amount of available data rather than by the specific clinical observations, as illustrated in Figure 1 (left side). Our novel algorithm overcomes this problem and can identify patient groups based on similar clinical observations (right side).

Figure 1: An illustration of how trajectory bias can occur and lead to patient clustering that is predominantly driven by the amount of data a patient has (trajectory length). In the figure, patient A and patient B are examples of patients with similar clinical recordings but different trajectory lengths. Patient C is an example of a patient whose trajectory length does not reflect the patient’s true number of hospital visits. Current state-of-the-art methods, which do not compensate for trajectory bias, would incorrectly cluster patient A and patient C together rather than patient A and patient B. Our algorithm compensates for the trajectory bias and instead clusters patient A and patient B together. The two grey regions of the learned representation space from our method (bottom right) indicate possible clinically relevant subgroups detected by the algorithm, which comprise patients with heart-related conditions and patients with blood-related conditions respectively (conditions are examples).
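The effect described in Figure 1 can be reproduced with a toy example. The sketch below is only an illustration of trajectory bias, not the algorithm from the paper: the three trajectories, their values and both representations (zero-padding and averaging) are hypothetical choices made here to keep the example small.

```python
import math

# Hypothetical toy data: one measurement per time window that contains data.
trajectories = {
    "A": [5.0, 5.2, 4.9, 5.1, 5.0, 5.1],  # long trajectory, values near 5
    "B": [5.1, 4.9],                      # short trajectory, values near 5
    "C": [3.0, 3.2, 2.9, 3.1, 3.0, 2.8],  # long trajectory, values near 3
}


def zero_padded(values, length):
    """Length-sensitive representation: pad missing windows with zeros."""
    return values + [0.0] * (length - len(values))


def mean_value(values):
    """Length-insensitive representation: average of the recorded values."""
    return [sum(values) / len(values)]


def nearest(name, representations):
    """Return the other patient closest to `name` in Euclidean distance."""
    others = [other for other in representations if other != name]
    return min(
        others,
        key=lambda other: math.dist(representations[name], representations[other]),
    )


max_len = max(len(values) for values in trajectories.values())
padded = {name: zero_padded(values, max_len) for name, values in trajectories.items()}
averaged = {name: mean_value(values) for name, values in trajectories.items()}

print(nearest("A", padded))    # neighbour under the length-sensitive representation
print(nearest("A", averaged))  # neighbour under the length-insensitive representation
```

With these made-up values, the zero-padded representation places patient A closest to patient C (same length, different values), while the averaged representation places patient A closest to patient B (different length, similar values).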


Our algorithm, developed in collaboration between one of our research teams at Sensyne Health and Bayer, combines two core methods to reduce the extent to which clustering analyses are directly influenced by the amount of data present for a given patient. Firstly, the algorithm uses a recurrent neural network autoencoder, which learns to take the complex, temporal data of each patient’s medical history and compress it to a fixed-size representation to which further clustering methods can be applied. Secondly, we use an adversarial training scheme while training the recurrent neural network autoencoder. This encourages the learnt embeddings to be invariant to certain types of bias in the data. In the case of our algorithm, the adversarial training scheme encourages the model to be insensitive to changes in a patient’s trajectory length. A diagram of the algorithm’s architecture can be seen in Figure 2.
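To make the first step more concrete, the sketch below shows how a recurrent encoder turns trajectories of different lengths into vectors of the same size. It is a minimal illustration under several assumptions: the simple tanh recurrence, the random untrained weights and the names `make_encoder` and `encode` are all hypothetical, and the decoder and the adversarial discriminator are left out entirely.

```python
import math
import random


def make_encoder(input_size, hidden_size, seed=0):
    """Build a tiny recurrent encoder with random, untrained weights."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
            for _ in range(hidden_size)]
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                for _ in range(hidden_size)]

    def encode(trajectory):
        """Fold a list of per-window feature vectors into one hidden state."""
        hidden = [0.0] * hidden_size
        for features in trajectory:
            hidden = [
                math.tanh(
                    sum(w_in[i][j] * features[j] for j in range(input_size))
                    + sum(w_hidden[i][k] * hidden[k] for k in range(hidden_size))
                )
                for i in range(hidden_size)
            ]
        return hidden

    return encode


encode = make_encoder(input_size=2, hidden_size=4)
short_trajectory = [[0.1, 0.2], [0.0, 0.3]]  # 2 time windows
long_trajectory = [[0.1, 0.2]] * 10          # 10 time windows

# Both embeddings have hidden_size entries, regardless of trajectory length.
print(len(encode(short_trajectory)), len(encode(long_trajectory)))
```

Because the final hidden state always has `hidden_size` entries, standard clustering methods can be applied to the embeddings of all patients, however long their trajectories are.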

Figure 2: A diagram showing the architecture of our adversarial autoencoder. It extends a standard autoencoder model with an additional discriminator that compensates for potential trajectory bias. The algorithm aims to minimise both the reconstruction loss and the trajectory bias loss.
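The exact form of the training objective is not given in this post. As a rough sketch, one common way to set up adversarial training (assumed here purely for illustration) is to let the autoencoder minimise its reconstruction loss minus a weighted discriminator loss, so that embeddings from which trajectory length is hard to predict are rewarded. The function name and the sign convention below are assumptions; the weight `a` corresponds to the adjustable adversarial effect mentioned in the text.

```python
def autoencoder_objective(reconstruction_loss, discriminator_loss, a):
    """Combine both terms into the value the autoencoder would minimise.

    Assumption: the discriminator tries to predict trajectory length from the
    embedding, and a high discriminator loss means it fails to do so.
    """
    if a < 0:
        raise ValueError("the adversarial weight a must be non-negative")
    return reconstruction_loss - a * discriminator_loss


# With a = 0 the objective reduces to that of a standard autoencoder.
print(autoencoder_objective(1.0, 0.5, a=0.0))
# A larger a puts more weight on hiding trajectory length from the discriminator.
print(autoencoder_objective(1.0, 0.5, a=2.0))
```

Under this assumed formulation, choosing `a` is a trade-off: a small value favours faithful reconstruction, while a large value favours embeddings that carry little information about trajectory length.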

We validated this algorithm using two cohorts: a wider cardiovascular cohort of approximately 500,000 patients and a smaller cohort of 1,430 patients with heart failure. Using several evaluation metrics, we found that our algorithm reduced the bias arising from variations in the amount of data a patient has, and therefore reduced the effect of trajectory bias. As a result, our model produced patient representation spaces in which the trajectory length of patients was more evenly distributed. This can be seen in Figure 3, which shows the relative positions of all patients in the patient representation spaces, compressed down to two dimensions using principal component analysis (PCA). The colour of each point on the plots indicates the trajectory length, with darker points representing longer trajectories. The plots show a clear difference in how patients with different trajectory lengths are distributed between a state-of-the-art autoencoder model, used here as the baseline, and our adversarial autoencoder model. The adversarial effect can be adjusted depending on the specific application.

Figure 3: A comparison of the learned patient representation space between a baseline and our proposed adversarial autoencoder model. Each dot represents a single patient, with the colour indicating the length of the trajectory for which that patient has data. Here, the trajectory was divided into equally spaced windows of 3 months and patients could have a trajectory up to 52 months (18 “time windows” x 3 months). The adversarial effect can be weighted by the parameter a depending on the specific application.
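The windowing described in the caption can be sketched as follows. This is a simplified illustration: the function name, the use of months since a patient’s first record as input, and the decision to drop visits that fall outside the 18 windows are all assumptions made here, not details taken from the paper.

```python
def occupied_windows(visit_months, window_months=3, max_windows=18):
    """Map visit times (months since the first record) to time-window indices.

    Returns the sorted indices of windows that contain at least one visit.
    Visits outside the first `max_windows` windows are dropped (assumption).
    """
    windows = {int(month // window_months) for month in visit_months}
    return sorted(window for window in windows if 0 <= window < max_windows)


# Hypothetical patient with visits 0, 1, 4, 10 and 11 months after the first record.
windows = occupied_windows([0, 1, 4, 10, 11])
print(windows)       # indices of the 3-month windows that contain data
print(len(windows))  # number of windows with data for this patient
```

For this hypothetical patient, the visits fall into windows 0, 1 and 3, so three of the 18 windows contain data.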


With this novel approach, patient clusters found through unsupervised patient stratification are more likely to be determined by clinical observations than by the amount of data present for a given patient. This can allow much more meaningful clusters to be derived from patient cohorts, which can lead to deeper insights into conditions and help to guide recruitment for clinical trials.


Robert Dürichen, Head of Machine Learning Research, Sensyne Health



