An introduction to mathematical modeling in systems biology

Gunnar Cedersund
Rikard Johansson
William Lövfors
Elin Nyman

January 2019
v19.01

Introduction

First, read the introduction below.

This is a new version of the lab, introduced in 2019, more oriented towards writing your own scripts, both to further your understanding of the different parts and to give you valuable programming skills for your future career. The current version of the lab is implemented in MATLAB, but you are free to pick another language if you can find the proper tools. Please note that the supervisors will only help with code issues in MATLAB.

If you find something that looks like a spelling error, it likely is. This lab is currently under development; if you find anything that is unclear, or if you have any other feedback, please send an email to william.lovfors@liu.se.

Purpose of the lab

This lab will, on a conceptual level, guide you through all the main sub-tasks in a systems biology project: model formulation, optimization of the model parameters with respect to experimental data, model evaluation, statistical testing, uniqueness analysis, identification of interesting predictions, and testing of those predictions.

How to pass

Along the way, there are checkpoints that you need to pass. These are marked with Checkpoint X. For each checkpoint (unless noted otherwise), discuss your findings with the lab supervisors and make sure to get a signature confirming that the checkpoint is passed. To pass the entire lab, you need to have passed all the checkpoints and had a final discussion of the entire lab with one of the supervisors.

The deadline for getting approved is the final computer lab session.

Preparations

Get familiar with MATLAB by doing Mathworks intro tutorial. You can find the tutorial at this link.

Most of the tasks below start with a "read about ..." section. If you want to prepare, these sections can be read in advance of the lab. They are written as background about the different topics and can be read without solving the preceding tasks.

Getting MATLAB running

To do the lab, you need access to MATLAB. Depending on whether you are using a private computer or one in the computer hall, follow one of the alternatives below.

At your own computer

If you have done the introduction tutorial, you should already have a Mathworks account; otherwise, start by creating one.
Go to the Mathworks homepage, download and install MATLAB.

In the computer halls

If you are working on a Linux based computer in the computer hall, use the following steps to get MATLAB going.
First, open a terminal. Then run module add prog/matlab/9.5.
Note that the version sometimes changes, if 9.5 can't be found, check available versions with module avail.

Questions?

If you have questions when going through the lab, you can either bring them to the computer lab sessions or raise them on the Q&A page on LISAM. In many situations, your fellow students probably know the answer to your question. Please help each other if possible!

 


Contents

  1. Getting familiar with the data
  2. Implementing and simulating a model
  3. Implementing the hypotheses
  4. Quantitatively testing the hypotheses
  5. Testing the model with statistical tests
  6. Improving the agreement to data
  7. Estimating the model uncertainty
  8. Making predictions
  9. Making a profile-likelihood plot (optional)

Main computer lab tasks

Have you read the introduction above?

Getting familiar with the data

In this task, you will start to get familiar with the problem at hand, and will create functions that can plot the experimental data.

Read the background on the problem at hand

The Biological system and experimental data

During the following exercises, you will take on the role of a systems biologist working in collaboration with an experimental group.

The experimental group focuses on the cellular response to growth factors, like the epidermal growth factor (EGF), and their role in cancer. The group has a particular interest in the phosphorylation of the receptor tyrosine kinase EGFR in response to EGF, and has therefore performed an experiment to elucidate the role of EGFR. In the experiment, they stimulate a cancer cell line with EGF and use an antibody to measure the tyrosine phosphorylation at site 992, which is known to be involved in downstream growth signaling. The data from these experiments show an interesting dynamic behavior: a fast increase in phosphorylation followed by a decrease down to a steady-state level which is higher than before stimulation (Figure 1).

 

[Figure 1]

Figure 1: The experimental data. A cancer cell line is stimulated with EGF at time point 0, and the response at EGFR-pY992 is measured.

The experimental group does not know the underlying mechanisms that give rise to this dynamic behavior. However, they have two different ideas of possible mechanisms, which we will formulate as hypotheses below. The first idea is that there is a negative feedback transmitted from EGFR to a downstream signaling molecule that is involved in dephosphorylation of EGFR. The second idea is that a downstream signaling molecule is secreted to the surrounding medium and there catalyzes breakdown of EGF, which also would reduce the phosphorylation of EGFR. The experimental group would like to know if any of these ideas agree with their data, and if so, how to design experiments to find out which is the more likely idea.

To generalize this problem, we use the following notation:
EGF – Substrate (S)
EGFR – Receptor (R)

Create a “main script”.

This is the script we will use to call the other functions we create. In the rest of the instructions, the name of this script is assumed to be Main.
To create a function/script, press “new” in the toolbar and choose function/script.

Download and load the experimental data from the expData file into MATLAB, then have a look at the structure.

Download the experimental data from here or on LISAM.

Load the experimental data into MATLAB by using the load function. Do this in your main script. Remember, you can read more about a function with the command help or doc. Run your main script and note that the variable is now in the workspace. Click on it and have a look at the subfields. Remember, the subfields can be accessed as e.g. expData.time.

Create a function that takes data as input and plots all the points, with error bars.

Use the following structure:

sequenceDiagram
    Main ->> PlotStuff: expData

In the above figure, the blue boxes represent a function/script with its name written in it. The black arrows represent the direction in which a set of variables is sent between two functions. In the above example, the script Main passes the variable expData to the function PlotStuff.

Make sure your plots look nice. Give the axes labels.

Useful MATLAB commands:
errorbar, xlabel, ylabel, title, plot
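
A minimal sketch of what PlotStuff could look like is shown below. The subfield names (time, mean, SEM) and the axis labels are assumptions; use the subfields you found when inspecting expData.

function PlotStuff(expData)
% Plot the experimental data points with error bars
% (field names time/mean/SEM are assumptions - check your expData)
errorbar(expData.time, expData.mean, expData.SEM, 'ko')
xlabel('Time')
ylabel('EGFR-pY992 (a.u.)')
title('Experimental data')
end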

 
Checkpoint 1: Show a supervisor your plot of the data.

Implementing and simulating a model

Now, let us implement a simple model! Here we will learn the basics of how to create models and simulate them. Implement a model with a single state $A$, with a constant increase ($k_1 = 2$) and an initial amount of 4 ($A(0) = 4$). Let the amount of $A$ be the measurement equation ($\hat{y}$).

Make sure you have all the software/toolboxes/compilers you need installed and working.

First, make sure that you have a MEX compiler installed: run mex -setup and see if MATLAB finds any compilers. If not, install one, e.g. MinGW.
Then, download the IQM toolbox from LISAM or here, extract the files, move into the folder in MATLAB, and run the file called installIQMtoolsInitial.m. Finally, move back to the folder where you have your Main script.

Implement the simple model.

Create a text file and use the following model formulation.
Note that the text file must contain all headers (beginning with “********** “)

********** MODEL NAME
SimpleModel
********** MODEL NOTES
********** MODEL STATES
d/dt(A)=v1
A(0)=4
********** MODEL PARAMETERS
k1=2
********** MODEL VARIABLES
ySim=A
********** MODEL REACTIONS
v1=k1
********** MODEL FUNCTIONS
********** MODEL EVENTS
********** MODEL MATLAB FUNCTIONS
Compile the model in your Main script

Now, we will first convert your equations into a model object, and then compile the model into a so-called MEX file. The MEX file is essentially a way to run C code inside MATLAB, which is much faster than ordinary MATLAB code.

modelName='SimpleModel'
model=IQMmodel([modelName '.txt']);
IQMmakeMEXmodel(model) %this creates a MEX file which is ~100 times faster

Note: The compiled file will have a filename based on what is written under
********** MODEL NAME, and an ending based on which OS it is compiled on. In this case, SimpleModel.mex[w64/maci64/a64].

Create a function that takes (modelName, params, and time) as input and returns a vector ySim as output

Create your new function and give it the name SimulateModel. To do this, click on New in the MATLAB toolbar and select Function. Edit the inputs and outputs to use the following structure:

sequenceDiagram
    Main ->> SimulateModel: modelName, params, time
    SimulateModel ->> Main: ySim

The next step will be to implement the actual simulation, but for now, just set ySim to some dummy values and leave the rest of the script empty.
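
A minimal stub could look like the sketch below; the dummy values are placeholders until the real simulation is implemented.

function ySim = SimulateModel(modelName, params, time)
% Stub version: return dummy values of the right size.
% The real simulation is implemented in the next task.
ySim = zeros(size(time));
end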

Call your new SimulateModel function from your Main script with relevant inputs. For now, set params = 2 (same values as in the model file).

Implement the simulation in SimulateModel

We will now implement the real simulation.

Alternatives for simulating the model

There exists multiple ways of simulating the model. Some examples are given below:
(Note, IC is the initial conditions)

  1. sim = IQMPsimulate(model, time, IC, paramNames, params)
  2. sim = IQMPsimulate(modelName, time, IC, paramNames, params)
  3. sim = SimpleModel(time, IC, params)

Note that alternative 1) takes an IQM model as first input and will automatically compile the model. This might sound good, but it results in the model being compiled every time a simulation is run. Running a compiled model is much faster, but the compilation takes some time; therefore, this is a bad idea. Alternatives 2) and 3) both use the MEX-compiled model, but 3) uses the file directly and is almost always the best alternative (5x faster and more convenient).

Using option 3), it is possible to skip IC and/or params as inputs. The model will then instead use the values written in the model file (when the MEX-file was compiled). In practice, this is done by passing along [] instead of e.g. IC.

Hint: Use alternative 3.

Using the model name to simulate

Instead of using the actual name (e.g. SimpleModel above), you can use any valid string. See below for an example.

model = str2func(modelName)
sim = model(time, IC, params)
Using the output of the simulation:

The simulation returns a struct, with many relevant fields. In our case, we are interested in the variable values. You can access these values in the following way:
ySim = sim.variablevalues

Dealing with problematic simulations

Sometimes the simulation crashes for certain sets of parameters. To handle this, add a try-catch block in your simulation script and return a vector of the same size as ySim should be (the size of time), but with a very large value for each element.

try
   sim = model(time, [], params); % Replace [] with IC to use custom initial conditions
   ySim = sim.variablevalues;
catch err
   %disp(getReport(err)) % Print the error message
   ySim = 1e50*ones(size(time));
end
Run your Main script to simulate the model for the time points given in the data

Make sure you are sending in the right values and that the right values are in the model file. As a time-vector, use expData.time and for params use the default values. Does your SimulateModel return ySim=[4, 4.4, 4.8, 5.2, 5.6, 6, 6.4, 6.8, 7.2, 7.6, 8]?

 
Checkpoint 2: There is no checkpoint here. You are free to discuss your solution with the supervisors if you like. Does the model behave the way you expect it to? Does it change if you vary the initial conditions (IC) and/or the parameter values (params)?

Implementing the hypotheses

Now that we know how to simulate a simple model, let us implement the hypotheses your supervisors have as models, and simulate them.

Read about the first hypothesis that your supervisor has

To go from the first idea to a hypothesis, we include what we know about the biological system:

  • A receptor (R) is phosphorylated (Rp) in response to a substrate (S)
  • A downstream signaling molecule (RS) is activated as an effect of the phosphorylation of the receptor
  • The active downstream signaling molecule (RSp) is involved in the dephosphorylation of the receptor
  • Phosphorylated/active molecules can return to inactive states without stimuli

From this knowledge, we can draw an interaction graph as depicted in Box 1, along with a description of the reactions.
[Box 1]

Implement the first hypothesis
  1. Formulate your equations based on the interaction graph in Box 1
  2. Create a new model text file and enter your equations for model1.
    Use the following values:
| Initial values | Parameter values |
| -------------- | ---------------- |
| R(0) = 1       | k1 = 1           |
| Rp(0) = 0      | k2 = 0.0001      |
| RS(0) = 1      | kfeed = 1000000  |
| RSp(0) = 0     | k4 = 1           |
| S(0) = 1       | k5 = 0.01        |
  3. Use your simulation function to simulate your first model. Remember to update the parameter values!
  4. Update your plot function to also plot the model simulation. Call your simulation function inside the plot function. (See tips below)

sequenceDiagram
    Main ->> PlotStuff: expData, params, modelName
    PlotStuff ->> SimulateModel: modelName, params, time
    SimulateModel ->> PlotStuff: ySim
  5. Make sure that you now have ONE "main script" that:
    a. Loads the experimental data
    b. Compiles a model
    c. Plots data and simulation in the same graph
A sketch of this structure is given below.
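
A minimal sketch of what Main could look like at this point. The data file name and the parameter order are assumptions; the parameter order must match the order under ********** MODEL PARAMETERS in your model file.

% Main.m - sketch of the overall structure
load('expData.mat')                      % a. load the experimental data (file name assumed)
modelName = 'model1';                    % the name of your model file, without .txt
model = IQMmodel([modelName '.txt']);    % b. create the model object...
IQMmakeMEXmodel(model)                   %    ...and compile it to a MEX file
params = [1, 0.0001, 1000000, 1, 0.01];  % k1, k2, kfeed, k4, k5 (order assumed)
PlotStuff(expData, params, modelName)    % c. plot data and simulation together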
Some tips for plotting

MATLAB automatically overwrites your plot if you do not explicitly tell it to keep the information. To avoid this, use "hold on" to keep everything plotted in the figure:

errorbar(…)  
hold on  
plot(…)
...
hold off % This is not mandatory.

Tip!
When simulating for a plot, it is generally a good idea to simulate with a high time resolution. Instead of passing time (experimental time-points) to model(time,…), create timeHighres and pass that along.

One way of doing this is:

timeHighres = time(1):step:time(end)
sim = model(timeHighres,…)

Where step is the desired step-length (in this case, step=0.001 is probably good)

Read about the second hypothesis that your supervisor has

To go from the second idea to a hypothesis, we again include what we know about the biological system:

  • A receptor (R) is phosphorylated (Rp) in response to a substrate (S)
  • A downstream signaling molecule (RS) is activated as an effect of the phosphorylation of the receptor
  • The active downstream signaling molecule (RSp) is secreted to the surrounding medium, where it catalyzes the breakdown of the substrate (S)
  • Phosphorylated/active molecules can return to inactive states without stimuli

From this knowledge, we can draw an interaction graph as depicted in Box 2, along with a description of the reactions.
[Box 2]

Implement the second hypothesis

Now, let us do the same thing again but for the second hypothesis. Re-use your main script! You should be able to run it again for the second hypothesis by only changing the model name.

  1. Create a copy of your previous model file.
  2. Implement the changes from the hypothesis in Box 2. Use the following values:
| Initial values | Parameter values |
| -------------- | ---------------- |
| R(0) = 1       | k1 = 5           |
| Rp(0) = 0      | k2 = 20          |
| RS(0) = 1      | kfeed = 10       |
| RSp(0) = 0     | k4 = 15          |
| S(0) = 1       | k5 = 30          |
  3. Rerun your main script, but with the second model. Make sure that the two models behave differently.

 
Checkpoint 3: Show your graphs with both data and simulation in the same figure to your supervisor. Is the agreement convincing? If not, what are the problems? Can we test quantitatively if it is good enough?

Quantitatively testing the hypotheses

We can (and should) test the hypotheses quantitatively. To do this, first we need a way to assess how good a model is. We will do this with what is sometimes referred to as the cost function.

Read about the cost function

Let us first establish a measure of how well the simulation can explain data, referred to from now on as the cost of the model. We use the following formulation of the cost:

$$v(\theta) = \sum_{\forall t}\frac{(y_t - \hat{y}_t(\theta))^2}{SEM_t^2}$$

where $v(\theta)$ is the cost, given a set of parameters $\theta$; $y_t$ and $SEM_t$ are the experimental data (mean and SEM) at time $t$, and $\hat{y}_t(\theta)$ is the model simulation at time $t$.

The differences $y_t - \hat{y}_t(\theta)$ are commonly referred to as residuals, and represent the distance between the simulation and the data. In the equation above, the squared residuals are summed. Note that each residual is also weighted with the SEM.

Create a function that simulates your model for a set of parameters and returns the cost.

Call your simulation function from inside the CostFunction to get the simulated values!

sequenceDiagram
    Main ->> CostFunction: expData, params, modelName
    CostFunction ->> SimulateModel: modelName, params, time
    SimulateModel ->> CostFunction: ySim
    CostFunction ->> Main: cost

Tip!
MATLAB interprets *, /, and ^ between vectors as matrix operations. You most likely want elementwise operations. This is solved by changing / to ./ (note the dot), and similarly for the other operators. Note that this is not an issue for + and -, as long as the vectors are of the same size.

Note: You probably have to transpose ySim. This can be done by adding a ' after the variable (e.g. ySim')
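
Putting the pieces together, a minimal sketch of the CostFunction could look as follows (the data subfield names mean and SEM are assumptions):

function cost = CostFunction(expData, modelName, params)
% Simulate the model and return the weighted sum of squared residuals
ySim = SimulateModel(modelName, params, expData.time);
res = expData.mean - ySim';            % transpose ySim so the sizes match
cost = sum(res.^2 ./ expData.SEM.^2);  % elementwise operations (note the dots)
end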

Update your single main script to also calculate the cost

Get the cost for both models! Do this by only changing modelName and params in your Main script.

 
Checkpoint 4: No checkpoint here. Make sure that the cost when simulating with the default parameters is 701 for model 1 and 905 for model 2.

Testing the model with statistical tests

Now that we have a way of quantitatively estimating the goodness of the model, it is time to determine if it is actually good enough.

Read about statistical tests

There exist many statistical tests that can be used to reject hypotheses. One of the most common, and the one we will focus on in this lab, is the $\chi^2$-test. A $\chi^2$-test is used to test the size of the residuals, i.e. the distance between the model simulation and the mean of the measured data (Figure 2).
[Figure 2]
Figure 2: Residuals are the difference between the model simulation and the mean of the measured data.

For normally distributed noise, the cost function, $v(\theta)$, can be used as a test variable in the $\chi^2$-test. Therefore, we test whether $v(\theta)$ is larger than a certain threshold. This threshold is determined by the inverse of the cumulative $\chi^2$ distribution function. In order to calculate the threshold, we need to decide on a significance level and the degrees of freedom. A significance level of 5% (p = 0.05) means a 5% risk of rejecting the hypothesis when it should not be rejected (i.e. a 5% risk of rejecting the true model). The degrees of freedom are most often the number of terms being summed in the cost (i.e. the number of data points).

If the calculated cost is larger than the threshold, we must reject the hypothesis given the set of parameters used to simulate the model. However, there might be another parameter set that gives a lower cost for the same model structure. Therefore, to reject the hypothesis, we must search for the best possible parameter set before we use the $\chi^2$-test. To find the parameter set with the lowest cost, we use optimization.

Use a statistical test to see if you must reject the model

Here, we will try to reject the hypotheses with a $\chi^2$-test. Calculate the limit above which the model must be rejected. You can get the limit by using the function chi2inv(0.95,dgf), where 0.95 corresponds to a significance level of 0.05 (1 - 0.05) and dgf is the degrees of freedom.

When selecting the degrees of freedom, there exist two major alternatives. The first is to set dgf = number of data points, and the second is dgf = number of data points - number of parameters.
In this case, using the number of data points is probably the best alternative.

A third alternative could be to set limit = optimal cost + chi2inv(0.95,1) (note, dgf = 1)

If cost>limit, reject the model!
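
In MATLAB, the whole test boils down to a few lines (here using the number of data points as the degrees of freedom):

dgf = numel(expData.time);   % degrees of freedom = number of data points
limit = chi2inv(0.95, dgf);  % threshold at significance level 0.05
cost = CostFunction(expData, modelName, params);
if cost > limit
    disp('The model must be rejected!')
end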

 

Checkpoint 5: Why do you think the residuals are squared? Furthermore, why do you think we weigh the residuals with the SEM? Do you reject any of the hypotheses? What is the null-hypothesis? Are there other statistical tests you can use?

Improving the agreement to data

We now know that the models were not good enough with the initial parameter sets. Before we can evaluate the hypotheses, we need to find better parameter values.

Read about optimization

To be able to test the hypotheses, we must find the best agreement between data and simulation. If the model is not good enough, assuming we have the best possible parameters, the hypothesis must be rejected. It would be possible to find this best combination of parameters by manually tuning the parameters, but that would almost certainly take far too much time. Instead, we find the best agreement using optimization methods.

An optimization problem generally consists of two parts, an objective function with which a current solution is evaluated and some constraints that define what is considered an acceptable solution. In our case, we can use our cost function as an objective function and set lower and upper bounds (lb/ub) on the parameter values as constraints. The problem formulation typically looks something like the equations below.

$$\min\ v = \text{CostFunction}(\theta) \\ \text{subject to: } lb < \theta < ub$$

An optimization problem can be hard to solve, and the solution we find is therefore not guaranteed to be the best possible solution. We can for example find a solution which is minimal in a limited surrounding, a so-called local minimum, instead of the best possible solution – the global minimum (Figure 3). There are methods to try to reach the global minimum, for example you can run the optimization several times from different starting points and look for the lowest solution among all runs. Please note that there might be multiple solutions with equal values.

[Figure 3]
Figure 3: The difference between a local and a global minimum.

Furthermore, there exist many different methods for solving an optimization problem. If a method is implemented as an algorithm, it is typically referred to as a solver.

Pick a solver

Some alternatives (read about them with doc solver):
MATLAB's solvers: particleswarm, simulannealbnd, fmincon
IQM toolbox solvers: simannealingIQM, simplexIQM

It is possible that not all solvers are equally good on all problems. Furthermore, it is possible to combine them, by running them sequentially.

If you have a reasonably good start guess, it is probably a good idea to use a simulated annealing based solver. If you don't, particle swarm (which doesn't need a start guess) is probably a good choice.

Note that most solvers only allow for the objective function to get one input variable (the parameter values). In our case, we want to pass along parameter values, model name, and experimental data.

Solve the issue with solvers only passing one input variable to the cost function

Most solvers can only call the cost function with one input variable (the parameter values). We must therefore update our scripts so that only one variable (the parameter values) is passed from the Main script to the cost function.

One way to bypass this limitation is to update the CostFunction to accept only one input, and instead get the data and model name by either loading them inside the cost function, or declaring them as global variables. However, both of these methods have issues. A third (and better) alternative is to use anonymous functions. Essentially, we create a new function which takes one input and then calls CostFunction with that input as well as the extra inputs.

sequenceDiagram
    Main ->> Intermediary: params
    Intermediary ->> CostFunction: expData, params, modelName
    CostFunction ->> SimulateModel: modelName, params, time
    SimulateModel ->> CostFunction: ySim
    CostFunction ->> Intermediary: cost
    Intermediary ->> Main: cost

How to implement anonymous functions:
Replace (in your Main script):
cost = CostFunction(expdata, modelName, params)
With:

func = @(p) CostFunction(expdata, modelName, p)
cost = func(params)

Here, func acts as an intermediary function, passing along more arguments to the real cost function. Note that if expdata is changed, the output from func will not change. To do that, func must be redefined (as above).

Add optimization to your main script

Implement the solver you picked. Remember, you can read more about it and see examples with doc solver or help solver (where solver is the name of the solver you want to know more about).
In the end, your scripts should have a structure similar to the figure below.

sequenceDiagram
    Main ->> OptimizationSolver: func, x0, lb, ub
    OptimizationSolver ->> Intermediary: params
    Intermediary ->> CostFunction: expData, params, modelName
    CostFunction ->> SimulateModel: modelName, params, time
    SimulateModel ->> CostFunction: ySim
    CostFunction ->> Intermediary: cost
    Intermediary ->> OptimizationSolver: cost
    OptimizationSolver ->> Main: cost

Not all optimization solvers require the same input, but in general they all require a function that can be used to evaluate a parameter set, a start guess of the parameter values (x0), as well as lower and upper bounds on which values the parameter can have.

An example with simulannealbnd

lb = [1e-5, 1e-5, 1e-5, 1e-5, 1e-5]
ub = [1e5, 1e5, 1e5, 1e5, 1e5]
[bestParam, minCost]= simulannealbnd(func, x0, lb, ub)
Tip 1: using multiple solvers

You can string multiple solvers together. E.g. since particle swarm does not require a start guess, it can be used to find one, which can then be passed on to simulannealbnd.

[bestParamPS, minCostPS] = particleswarm(func, length(lb), lb, ub)
[bestParam, minCost] = simulannealbnd(func, bestParamPS, lb, ub)
Tip 2: showing the progress while solving

If you want to see how the optimization progresses, you need to send an options variable to the optimization algorithm. Below is an example for particleswarm. All MATLAB solvers (the IQM toolbox ones are a bit different) handle options in the same way, but note that the solver name has to be passed when setting the options.

options=optimoptions(@particleswarm, 'display','iter')
particleswarm(func, length(lb), lb, ub, options);
Tip 3: searching in log-space (you want to do this)

It is often better to search for parameter values in log-space. To do this, simply take the logarithm of the start guess and lower/upper bounds. Then, go back to linear space before the simulation is done in SimulateModel.

sequenceDiagram Note over Main: Convert to log Main->> OptimizationSolver: func, x0, lb, ub OptimizationSolver ->> Intermediary: params Intermediary ->> CostFunction: expData, params, modelName CostFunction ->> SimulateModel: modelName, params, time Note over SimulateModel: Convert to linear SimulateModel ->> CostFunction: ySim CostFunction ->> Intermediary: cost Intermediary ->> OptimizationSolver: cost OptimizationSolver ->> Main: cost

In Main:

x0=log(x0)
lb=log(lb)
ub=log(ub)

in SimulateModel:

param=exp(param)

Note that the simulation always has to get the "normal" parameters; if you pass along logarithmized parameters, some will be negative and the ODEs will be hard to solve (resulting in a CV_ODE_ERROR).

SAVE YOUR RESULTS IN A GOOD WAY!!!

Nothing is worse than having found a good parameter set, only to overwrite it with a bad one and then be unable to find the good one again. To avoid this, you should always save your results in a way that does not overwrite any previous results. Use MATLAB's save function.

save('filename', 'variable')

Note: Change filename to the file name you want, and variable to the name of the variable you want to save (e.g. bestParam)

Tip: Using the current time as file name

A file name that will not overwrite previous results can be built from the current time and date. For this, you can use the MATLAB function datestr. The default layout of the time and date contains characters you can't use in a file name. To avoid this, we pass along a second argument with the desired format of the date string.

datestr(now, 'yymmdd-HHMMSS')

Or even better, with also the model name:

[modelName '_' datestr(now, 'yymmdd-HHMMSS')]
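
Putting the two together, a save call that will not overwrite previous results could look like this (assuming your best parameters are stored in bestParam):

save([modelName '_' datestr(now, 'yymmdd-HHMMSS')], 'bestParam')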
Experiment with the solvers. What is the best fit you can get? How does it look?

You should get costs for both models that are at least below 20 (likely even better, but let's not spoil that). Make sure to plot your best results, to check that nothing strange is happening in the simulations.

 
Checkpoint 6: What was the best cost for the two models? Show figures with your two best agreements. You should have values at least below 20. Do you have to reject any of them?

Estimating the model uncertainty

Now that we have checked if the models are good enough by testing the best solution, it is time to estimate the uncertainty of the model.

Read about uncertainty in models

After the optimization procedure, we decide if the hypothesis should be rejected or not. If the cost function has a lower value than the threshold, we move on and further study the behavior of the model. Unless the best solution is strictly at the boundary, there exist worse solutions than the best one which would still pass the statistical test. We will now try to find acceptable solutions that are as different as possible. In this analysis, we study all found parameter sets that give an agreement between model simulation and experimental data – not only the best parameter set. The parameter sets are used to simulate the uncertainty of the hypothesis given the available data. This uncertainty can tell us things like: Do we have enough experimental data in relation to the complexity of the hypothesis? Are there well-determined predictions (i.e. core predictions) to be found when we use the model to simulate new potential experiments?

Update your scripts to also save ALL sets of parameters below the limit.

The easiest way of doing this is to create a text file (filename.csv in this case) and write to that file every time the cost of a tested parameter set is below the limit (e.g. the $\chi^2$ limit).

In Main:

fid = fopen('filename.csv','a') % preferably pick a better name. a = append, change to w to replace file
----do the optimization ----
fclose(fid) % We need to close the file when we are done. If you forget this, use fclose('all')

In CostFunction:

----simulate and calculate cost ----
if cost< limit
 fprintf(fid, [sprintf('%.52f, ',params) sprintf('%.52f\n',cost)]); % Crashes if no fid is supplied. See tip below. Note that the cost is saved as the last column
end

Suggestion on function/variable names and structure:

[cost] = CostFunction(expdata, modelName, params, fid)
Tip: Avoiding crashes if fid is not supplied

To not have the function crash if fid is not supplied, we can use “nargin”, which is the number of variables sent to the function. If we have said that fid will be the 4th input argument (see above), we can add another statement to the if-statement in the CostFunction.

if cost< limit && nargin==4
 fprintf(fid, [sprintf('%.52f, ',params) sprintf('%.52f\n',cost)]);
end
Run multiple optimizations to gather many parameter sets.

In order to gather enough parameter sets to estimate the model uncertainty, re-run your optimization a few times to find multiple different parameter sets. Ideally, we want to find solutions that are as different as possible. To do this the easy way, just restart your optimization a few times and let it run to finish. Hopefully, it will find a reasonable spread of parameter values. More optimizations will give an uncertainty estimate closer to the truth. In the end, open your text file and check that it contains a good number of parameter sets.

Create a new “main” script for sampling your saved parameters and plotting

You can essentially copy your old Main script and use that as a base. We will not be optimizing anymore, so those lines can be removed.

Load your saved parameters

Make sure that you have saved results for both of your models.

If you have multiple files with parameter sets you want to combine (you probably don't):
Simply open them as text files and combine them, or use the terminal (assuming you are on a Linux computer) with the command cat:
cat file1.csv file2.csv > merged.csv
or
cat *.csv > merged.csv

Importing a .csv file in MATLAB
This can be done using e.g. load, readtable, or dlmread.
To convert from a table to an array, use the function table2array.
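
A minimal sketch, assuming the file name used earlier and that the file is purely numeric:

allParameters = dlmread('filename.csv');  % one row per saved set, the cost in the last column
% or, via readtable (note: the file has no header row):
% allParameters = table2array(readtable('filename.csv', 'ReadVariableNames', false));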

Create a function to sample your parameter sets

For computational reasons, it is infeasible to plot all parameter sets. Therefore, we need to sample the parameter sets. There are multiple ways of doing this, some better than others. One way is to pick sets at random.

It is likely reasonable to plot around 1,000 parameter sets in total.

Suggestion on function/variable names and structure:
[sampledParameters] = SampleParameters(allparameters)

Tip: A good way to select parameters

Use sortrows to sort the sets based on the values in one column, then select parameter sets with a fixed distance between them. Try picking ~100 sets. Then sort based on the second column and repeat. A sketch of this approach is given below.
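
A sketch of this approach, assuming allParameters has one row per saved parameter set with the cost in the last column:

nSets = size(allParameters, 1);
nParams = size(allParameters, 2) - 1;        % the last column is the cost
sampledParameters = [];
for i = 1:nParams
    sorted = sortrows(allParameters, i);     % sort all sets on parameter i
    ind = round(linspace(1, nSets, 100));    % pick ~100 sets with a fixed distance
    sampledParameters = [sampledParameters; sorted(ind, 1:end-1)]; % drop the cost column
end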

Tip: Some easy ways to select parameters

Pick indices at random
Use randi to generate ~1000 positions from 1 to the total number of parameter sets you have. E.g. ind=randi(size(allParameters,1),1,1000)

Pick indices with equal spacing
Use linspace to generate indices with equal spacing. Note that some values might not be integers; therefore, round the values using round. E.g. ind=round(linspace(1,size(allParameters,1),1000)).

Then, extract those parameter sets from allParameters.
sampledParameters=allParameters(ind,1:end-1) % last column is the cost

Simulate and plot your selected parameter sets, with data in the same figure.

Use your simulation script to simulate your selected parameters. Plot model 1 with its uncertainty in one color.
Then, do the same thing for model 2 and plot it in the same window, with another color.
Lastly, plot the data.
In the end, you should have both models and the data in one plot.

Suggestions on how to implement this:
Make a copy of PlotStuff and pass along the sampled parameters and model names for both models. Then do any of the following:

1) Plot multiple lines, with different colors for the different models.
Update the PlotStuff function to take an extra input argument for the color, and loop over your sampledParameters for one model, then loop over the next.
Or, create a new plot function and use your simulation script to simulate your selected parameters. Plot all the simulations as multiple lines in the same figure with the same color (one model = one color).
Then, do the same thing for model 2 and plot in the same window.
Lastly, plot the data.

2) Create a new function and plot an area using fill.
First, simulate all parameter sets and save the maximum and minimum values (ymax and ymin in the example below), then plot using fill. Note that fill creates a shape and always connects the last two points. Therefore, we must first plot one of the bounds, and then the other one in reverse (fliplr reverses a vector). fill also requires a color matrix ([1 0 0] below). See the example below.

fill([x fliplr(x)], [ymin fliplr(ymax)], [ 1 0 0])

 
Checkpoint 7: Show your plot with the uncertainty of the models to the supervisor. Are the uncertainties reliable? If not, are they too big or too small?

Making predictions

Now that we have a reasonable estimation of the model uncertainty, it is time to make predictions of new experiments. A good model can not only explain the estimation data; it can also be used to generate good new predictions.

Read about predictions

In order to see if the model behaves reasonably, we want to predict something that the model has not been trained on, and then test the prediction with a new experiment. Note that in order for the prediction to be testable, it must be limited in at least one point, often referred to as a core prediction. A core prediction could, for example, be found at one of the model states where we do not have experimental data, or at the model state we can measure, under new experimental conditions (extended time, a change in input strength, etc.). The parameter sets are also studied with profile-likelihood to see which parameters are well determined.

Core predictions

Core predictions are well determined predictions that can be used to differentiate between two models. Such a differentiation is possible when model simulations behave differently and have low uncertainty for both models. Core predictions are used to find experiments that would differentiate between the suggested hypotheses and therefore, if measured, would reject at least one of the suggested hypotheses. The outcome could also be that all hypotheses are rejected, if none of the simulations agree with the new experiment.

Experimental validation

The last step of the hypothesis testing is to hand over your analysis to the experimental group. The core predictions that discriminate between Model 1 and Model 2 should be discussed to see if the corresponding experiment is possible to perform. Now it is important to go back to think about the underlying biology again. What is the biological meaning of your core predictions? Do you think the experiment you suggest can be performed?

The outcome of the new experiment can be that you reject either Model 1 or Model 2, or that you reject both Model 1 and Model 2. What biological conclusions can we draw in each of these scenarios? What is the next step for you, the systems biologist, in each of these scenarios?

Try to find a new experiment where the models behave differently

Note: you must do the same thing in both models; otherwise, the comparison does not mean anything.
There are many things you can test in order to find a prediction where the models are behaving differently. Some alternatives are listed below.

Changing the time
It is possible that the models behave differently if we change the time points being observed. Since we can't find a good prediction right now, a shorter end time would probably not be relevant. Instead, try increasing the time being simulated.

Observing a different state
Even though the state we have been observing so far behaves in the same way in the two models, that might not be true for other states. Try plotting some different states and see if you can find one that behaves differently.

Manipulating the input signal
If we perturb the system a bit, it is possible that the models will behave differently. One way could be to change the strength of the input signal (essentially changing the concentration of the substrate), or to manipulate it in a more complex way, for example by changing the input at certain time points. In the model, this can be done by adding a so-called event; see below for an example.

To create an event called event1 (a name is necessary, but not important as long as there are no duplicates), where you set S = 1 when time > 4, add the following to the model file:

...
********** MODEL EVENTS
event1 = gt(time,4), S, 1

You can change the inequality by changing gt (greater than) to ge (greater or equal), lt (less than), or le (less or equal).

Note: the simulation time must be longer than the time at which the event is meant to occur for the event to happen.

 
Checkpoint 8: Show your plot of your prediction. Could you find a core prediction? What is a core prediction and why is it important? Do you think your prediction can be tested experimentally?

Summary and final discussion

You have now gone through the computer lab, and it is time to look back upon what you have actually done, as well as to discuss some points that might be easy to misunderstand. Below are some questions to discuss with the supervisor.

Some summarizing questions to discuss

You should know the answers to these questions, but if you do not remember an answer, see this part as a good repetition for the coming dugga.

  1. Look back at the lab: can you give a quick recap of what you did?
  2. Why did we square the residuals? And why did we normalize with the SEM?
  3. Why is it important that we find the "best" parameter sets when testing the hypothesis? Do we always have to find the best?
  4. Can we expect to always find the best solution for our type of problems?
  5. What are the problems with underestimating the model uncertainty?
  6. If the uncertainty is very big, how could you go about reducing it?
  7. Why do we want to find core predictions when comparing two or more models? What would be the outcome if one did the experiment?

Making a profile-likelihood plot (optional)

This is optional to do, but might be relevant in your future projects. Even if you do not do it now, remember that you can come back and try this out later.

Read about profile-likelihood

In a profile-likelihood analysis, we plot one parameter's value on the x-axis and the cost of the whole parameter set on the y-axis. Such a curve is typically referred to as a profile-likelihood curve. To get the full range of values for the parameter, it is common practice to fixate the parameter being investigated, by removing it from the list of parameters being optimized and setting it explicitly in the cost function, and to re-optimize the rest of the parameters. This is done several times, and each time the parameter is fixed to a new value, spanning a range, to see how the cost is affected by changes in the value of this specific parameter (Figure 4). If the resulting curve is defined and passes a threshold for rejection in both the positive and negative direction (Figure 4A), we say that the parameter is identifiable (that is, we can define a limited interval of possible values), and if the cost instead stays low in the positive and/or the negative direction, we say that the parameter is unidentifiable (Figure 4B-D). This whole procedure is repeated for each model parameter.
[Figure 4]
Figure 4: Examples of identifiable (A) and unidentifiable (B-D) parameters. The blue line is the threshold for rejection; the red line is the value of the cost function for different values of the parameter.

In a prediction profile-likelihood analysis, predictions instead of parameter values are investigated over the whole range of possible values. Predictions cannot be fixated the way parameters can. Therefore, a prediction profile-likelihood analysis is often more computationally heavy to perform than a profile-likelihood analysis. Instead of fixation, a term is added to the cost to force the optimization to find a certain value of the prediction, while at the same time minimizing the residuals. This results in an objective function like the following:
$$v(\theta) = \sum_{\forall t}\frac{(y_t - \hat{y}_t(\theta))^2}{SEM_t^2} + \lambda(p - \hat{p})$$
Again, the whole range of values of the prediction is iterated through. A prediction profile-likelihood analysis can be used in the search for core predictions.

Do a profile-likelihood analysis of the parameters

Here, we will not fixate one parameter at a time and re-optimize; instead, we will use the already collected parameter sets. This is worse than fixating the parameters, but we do it for the sake of time. Furthermore, as opposed to the uncertainty plots before, we do not have to select a subset of parameter sets; we can instead use all of our collected sets. The easiest way to do a profile-likelihood plot is to simply plot the cost against the values of one parameter, and then repeat for all parameters.

To make the plot slightly prettier, one can filter the parameter sets to only keep the lowest cost for each parameter value. This can be done using the function unique. It is likely a good idea to first round your parameter values to fewer decimals, which can be done with round. Note that round only rounds to integers; to get a specific decimal precision, multiply and then divide the parameter values by a suitable factor, e.g. round(p*100)/100. A sketch of the whole procedure is given below.
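
A sketch of the whole procedure, again assuming that allParameters has one row per saved parameter set with the cost in the last column:

nParams = size(allParameters, 2) - 1;          % the last column is the cost
sorted = sortrows(allParameters, nParams + 1); % sort by cost, lowest first
for i = 1:nParams
    p = round(sorted(:, i)*100)/100;           % round to two decimals
    [pVals, ia] = unique(p);                   % first occurrence = lowest cost per value
    subplot(ceil(nParams/2), 2, i)
    plot(pVals, sorted(ia, end), 'o-')
    xlabel(sprintf('Parameter %d', i))
    ylabel('Cost')
end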

 
Checkpoint O-1: Show your profile-likelihood plot. Do the parameters appear to be identifiable?