Fitting a Straight Line¶

For a first tutorial we're going to tackle one of the best-known problems: fitting a straight line to data. Fitting a straight line in this way is not particularly efficient, but it will help us get acquainted with chromatic_fitting in a way that we're probably already familiar with.

We will first import chromatic_fitting, chromatic (which our package is based on), a distribution from pymc3, and a few other handy packages.

In [1]:

            
                Copied!
                
from chromatic_fitting import *
from pymc3 import Uniform
import numpy as np
import matplotlib.pyplot as plt

plt.matplotlib.style.use('default')
from chromatic_fitting import *
from pymc3 import Uniform
import numpy as np
import matplotlib.pyplot as plt

plt.matplotlib.style.use('default')

Running chromatic_fitting v0.0.2!

This program is running on:
Python v3.9.12 (main, Jun  1 2022, 06:34:44) 
[Clang 12.0.0 ]
numpy v1.22.1
chromatic v0.3.14
pymc3 v3.11.5
pymc3_ext v0.1.1
exoplanet v0.5.2

Make some data¶

First we need to make some data. We can generate a straight line and add some random noise:

In [56]:

            
                Copied!
                
                    
                    
                
                

        
size = 100
true_intercept = 0.9
true_slope = 0.4
true_noise = 0.07

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + (true_slope * x)
# add noise
y = true_regression_line + np.random.normal(scale=true_noise, size=size)
yerr = [true_noise] * len(x)
size = 100
true_intercept = 0.9
true_slope = 0.4
true_noise = 0.07

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + (true_slope * x)
# add noise
y = true_regression_line + np.random.normal(scale=true_noise, size=size)
yerr = [true_noise] * len(x)

Let's have a look at our noisy line:

In [57]:

            
                Copied!
                
plt.plot(x,y,'k.', label='Noisy Data')
plt.errorbar(x,y,yerr,linestyle='None',color='k',alpha=0.3)
plt.plot(x,true_regression_line,color='g',label='True Regression Line')
plt.legend();
plt.plot(x,y,'k.', label='Noisy Data')
plt.errorbar(x,y,yerr,linestyle='None',color='k',alpha=0.3)
plt.plot(x,true_regression_line,color='g',label='True Regression Line')
plt.legend();

Looks good! Now we need to turn this noisy line data into something that chromatic_fitting will understand. chromatic_fitting is built on top of chromatic and uses objects called Rainbows 🌈. For more information about Rainbows please see the great chromatic docs here. In this case we will create a Rainbow object with x as our time axis and one (arbitrary) wavelength, then plug in our flux (y) and uncertainty (yerr) values:

In [58]:

            
                Copied!
                
# create simulated rainbow
r = SimulatedRainbow(time=x * u.day, wavelength=[1] * u.micron)
r.fluxlike['flux'] = np.array([y] * r.nwave)
r.fluxlike['uncertainty'] = np.array([yerr] * r.nwave)

r
# create simulated rainbow
r = SimulatedRainbow(time=x * u.day, wavelength=[1] * u.micron)
r.fluxlike['flux'] = np.array([y] * r.nwave)
r.fluxlike['uncertainty'] = np.array([yerr] * r.nwave)

r

Out[58]:

<Simulated🌈(1w, 100t)>

Let's see our Rainbow and check it looks like our noisy line above! By default Rainbow objects are 3-D representing flux as a function of wavelength and time, so the visualizations are mostly in 3-D (as below):

In [59]:

            
                Copied!
                
# plot our Rainbow:
r.imshow_quantities();
# plot our Rainbow:
r.imshow_quantities();

🌈🤖 The wavelength scale for this rainbow is 'linear',
and there are 1 wavelength centers and
0 wavelength edges defined.

It's hard to imshow something with a wavelength axis
that isn't linearly or logarithmically uniform, or doesn't
at least have its wavelength edges defined. We're giving up
and just using the wavelength index as the wavelength axis.

If you want a real wavelength axis, one solution would
be to bin your wavelengths to a more uniform grid with
`rainbow.bin(R=...)` (for logarithmic wavelengths) or
`rainbow.bin(dw=...)` (for linear wavelengths)

But if we want to see our data in 2-D again then we just have to say .plot_lightcurves() and we're back where we started...

In [61]:

            
                Copied!
                
fig, ax = plt.subplots(figsize=(8,6))
r.plot_lightcurves(ax=ax)
plt.ylim(0.6,1.6);
fig, ax = plt.subplots(figsize=(8,6))
r.plot_lightcurves(ax=ax)
plt.ylim(0.6,1.6);

Create a PolynomialModel¶

Now onto chromatic_fitting! chromatic_fitting works by using a class of model that defines the parameters we need and a function we would like to fit. For a straight line we want to use PolynomialModel. This model can be used to model any type of polynomial function (e.g. $y=x^2$ or $y=1+2x+5x^7$), but at its simplest it can model a constant value or a straight line. To set up a PolynomialModel we just need to tell it the maximum degree we would like to model. For a straight line that would be 1.

In [62]:

            
                Copied!
                
# create polynomial model:
l = PolynomialModel(degree=1, name='linear_tutorial')
l
# create polynomial model:
l = PolynomialModel(degree=1, name='linear_tutorial')
l

Out[62]:

<chromatic polynomial model 'linear_tutorial' 🌈>

Each model will have some necessary parameters to run. For a PolynomialModel the only parameter you need is a constant term. Otherwise we'd have nothing to fit!

In [63]:

            
                Copied!
                
l.required_parameters
l.required_parameters

Out[63]:

['p_0']

Every coefficient term will default to 0.0 unless we assign it a value. For example if we defined a PolynomialModel of degree = 2 we could model the equation $y=4+3x^2$ by defining $p_0=4$ and $p_2=3$.

In [64]:

            
                Copied!
                
l.defaults
l.defaults

Out[64]:

{'p_0': 0.0, 'p_1': 0.0}

Now we want to tell the model what function we want to model and what our prior distributions are for each parameter. Since we only have one wavelength in this example we don't need to worry about WavelikeFitted parameters here, we can just use Fitted. See this tutorial for an intro to Parameters:

In [65]:

            
                Copied!
                
                    
                    
                
                

        
# set up polynomial model parameters:
l.setup_parameters(
    p_0 = Fitted(Uniform, testval=0.8, upper=2, lower=0),
    p_1 = Fitted(Uniform,testval=0.1,upper=1,lower=0)
)

# print a summary of all params:
l.summarize_parameters()
# set up polynomial model parameters:
l.setup_parameters(
    p_0 = Fitted(Uniform, testval=0.8, upper=2, lower=0),
    p_1 = Fitted(Uniform,testval=0.1,upper=1,lower=0)
)

# print a summary of all params:
l.summarize_parameters()

linear_tutorial_p_0 =
  <🧮 Fitted Uniform(testval=0.8, upper=2, lower=0, name='linear_tutorial_p_0') 🧮>

linear_tutorial_p_1 =
  <🧮 Fitted Uniform(testval=0.1, upper=1, lower=0, name='linear_tutorial_p_1') 🧮>

All we've done here is to tell the model we want have Uniform distributions for $p_0$ and $p_1$ (or y-intercept and gradient) of $\mathcal{U}(0,2)$ and $\mathcal{U}(0,1)$ with initial test values of 0.8 and 0.1 respectively.

Now we can attach the Rainbow to our model and set it up before we perform our fit!

In [66]:

            
                Copied!
                
# setup model the same way as for the transit model!:
l.attach_data(r)
l.setup_lightcurves()
l.setup_likelihood()
# setup model the same way as for the transit model!:
l.attach_data(r)
l.setup_lightcurves()
l.setup_likelihood()

NOTE: The polynomial model can fit to a normalized array (using normalize=True). This normalization step is $x_n=\frac{x-mean(x)}{std(x)}$. This won't affect our final model but this means the parameters $p_1, p_2, p_3 ...$ will have to be converted back into the correct form to compare to the original array! E.g. in our case ($y=mx+c$) if we normalize the time axis then we will need to do the transforms $m$ -> $m/std(x)$ and $c$ -> $c-(m * mean(x)/std(x))$ at the end. This will get significanlty more complicated the higher in order we go so it's best to not normalize by default.

Let's see what our PyMC3 model looks like at this stage:

In [67]:

            
                Copied!
                
l._pymc3_model
l._pymc3_model

Out[67]:

$$ \begin{array}{rcl} \text{linear_tutorial_p_0_interval__} &\sim & \text{TransformedDistribution}\\\text{linear_tutorial_p_1_interval__} &\sim & \text{TransformedDistribution}\\\text{linear_tutorial_p_0} &\sim & \text{Uniform}\\\text{linear_tutorial_p_1} &\sim & \text{Uniform}\\\text{wavelength_0_data} &\sim & \text{Normal} \end{array} $$

We should have 2 parameter distributions and one wavelength_0_data distribution here. This distribution tells us that we are fitting our data (r) to the model assuming Gaussian errors (r.uncertainty). We can ignore the _interval__ parameters - these are internal to PyMC3 and we should never have to worry about these.

Sampling our Model¶

Now we can try to fit our model! Here we will first perform an optimization step (to give our sampling a good first guess) and then the actual NUTS sampling with a number of tuning and draw steps and chains that we define. We can also choose how many cores to assign to this sampling!

In [68]:

            
                Copied!
                
# optimize for initial values!
opt = l.optimize(plot=False)

# put those initial values into the sampling and define the number of tuning and draw steps, 
# as well as the number of chains.
l.sample(start=opt, tune=300, draws=300, chains=4, cores=4)
# optimize for initial values!
opt = l.optimize(plot=False)

# put those initial values into the sampling and define the number of tuning and draw steps, 
# as well as the number of chains.
l.sample(start=opt, tune=300, draws=300, chains=4, cores=4)

optimizing logp for variables: [linear_tutorial_p_1, linear_tutorial_p_0]

100.00% [18/18 00:00<00:00 logp = 1.063e+02]

message: Optimization terminated successfully.
logp: -692.9207246848301 -> 106.29809990731582

/Users/camu5866/opt/anaconda3/envs/chromaticfitting/lib/python3.9/site-packages/deprecat/classic.py:215: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning.
  return wrapped_(*args_, **kwargs_)
Only 300 samples in chain.
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [linear_tutorial_p_1, linear_tutorial_p_0]

100.00% [2400/2400 00:00<00:00 Sampling 4 chains, 0 divergences]

/Users/camu5866/opt/anaconda3/envs/chromaticfitting/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf
  return _boost._beta_ppf(q, a, b)
/Users/camu5866/opt/anaconda3/envs/chromaticfitting/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf
  return _boost._beta_ppf(q, a, b)
/Users/camu5866/opt/anaconda3/envs/chromaticfitting/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf
  return _boost._beta_ppf(q, a, b)
/Users/camu5866/opt/anaconda3/envs/chromaticfitting/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf
  return _boost._beta_ppf(q, a, b)
Sampling 4 chains for 300 tune and 300 draw iterations (1_200 + 1_200 draws total) took 0 seconds.

In this example it should be very quick as we are only fitting 2 parameters! Now we can look at our results:

In [69]:

            
                Copied!
                
l.summarize(round_to=7, hdi_prob=0.68, fmt='wide')
l.summarize(round_to=7, hdi_prob=0.68, fmt='wide')

                         mean        sd   hdi_16%   hdi_84%  mcse_mean  \
linear_tutorial_p_0  0.897020  0.014703  0.884698  0.913009   0.000544   
linear_tutorial_p_1  0.430649  0.025065  0.409340  0.455740   0.000984   

                      mcse_sd    ess_bulk    ess_tail     r_hat  
linear_tutorial_p_0  0.000385  738.938173  615.699470  1.002784  
linear_tutorial_p_1  0.000696  656.889711  584.842567  1.002266

An important parameter to look out for here to check whether your samplings have converged is r_hat. This rank normalized R-hat diagnostic tests for lack of convergence by comparing the variance between multiple chains to the variance within each chain. If convergence has been achieved, the between-chain and within-chain variances should be identical, therefore, the closer it is to 1 the better the chance that your sampling successfully converged. If you're interested in the sampling see the PyMC3 docs for much more detail!

We can turn this summary table into something more useful by calling .get_results(). For one wavelength it won't look much better but the more wavelengths we fit the more confusing this summary table will get.

In [70]:

            
                Copied!
                
l.get_results(uncertainty=['hdi_16%','hdi_84%'])
l.get_results(uncertainty=['hdi_16%','hdi_84%'])

Out[70]:

	linear_tutorial_p_0	linear_tutorial_p_0_hdi_16%	linear_tutorial_p_0_hdi_84%	linear_tutorial_p_1	linear_tutorial_p_1_hdi_16%	linear_tutorial_p_1_hdi_84%	wavelength
w0	0.89702	0.884698	0.913009	0.430649	0.40934	0.45574	1.0 micron

Say we also want to see the "best-fit" model directly, we can use the .get_model() function:

In [71]:

            
                Copied!
                
model = l.get_model()
model
model = l.get_model()
model

Out[71]:

{'w0': array([0.8970199 , 0.90136989, 0.90571988, 0.91006987, 0.91441986,
        0.91876985, 0.92311984, 0.92746983, 0.93181982, 0.93616981,
        0.9405198 , 0.94486979, 0.94921978, 0.95356977, 0.95791976,
        0.96226975, 0.96661974, 0.97096973, 0.97531972, 0.97966971,
        0.9840197 , 0.98836969, 0.99271968, 0.99706967, 1.00141966,
        1.00576965, 1.01011964, 1.01446963, 1.01881962, 1.02316961,
        1.0275196 , 1.03186959, 1.03621958, 1.04056957, 1.04491956,
        1.04926955, 1.05361954, 1.05796953, 1.06231952, 1.06666951,
        1.0710195 , 1.07536949, 1.07971948, 1.08406947, 1.08841946,
        1.09276945, 1.09711944, 1.10146943, 1.10581942, 1.11016941,
        1.11451939, 1.11886938, 1.12321937, 1.12756936, 1.13191935,
        1.13626934, 1.14061933, 1.14496932, 1.14931931, 1.1536693 ,
        1.15801929, 1.16236928, 1.16671927, 1.17106926, 1.17541925,
        1.17976924, 1.18411923, 1.18846922, 1.19281921, 1.1971692 ,
        1.20151919, 1.20586918, 1.21021917, 1.21456916, 1.21891915,
        1.22326914, 1.22761913, 1.23196912, 1.23631911, 1.2406691 ,
        1.24501909, 1.24936908, 1.25371907, 1.25806906, 1.26241905,
        1.26676904, 1.27111903, 1.27546902, 1.27981901, 1.284169  ,
        1.28851899, 1.29286898, 1.29721897, 1.30156896, 1.30591895,
        1.31026894, 1.31461893, 1.31896892, 1.32331891, 1.3276689 ])}

Remember that lightcurve we plotted earlier? Once we have fit a model the "best-fit" should show up on there automatically:

In [72]:

            
                Copied!
                
fig, ax = plt.subplots(figsize=(8,6))
l.plot_model(ax=[ax])
fig, ax = plt.subplots(figsize=(8,6))
l.plot_model(ax=[ax])

In [74]:

            
                Copied!
                
l.plot_with_model_and_residuals(spacing=-.6)
plt.ylim(0.6,1.6);
l.plot_with_model_and_residuals(spacing=-.6)
plt.ylim(0.6,1.6);

In [75]:

            
                Copied!
                
l.imshow_with_models(vlimits_data=[0.96, 1.35], vspan_residuals=0.1)
l.imshow_with_models(vlimits_data=[0.96, 1.35], vspan_residuals=0.1)

🌈🤖 'planet_model' doesn't exist and will be skipped.

🌈🤖 The wavelength scale for this rainbow is 'linear',
and there are 1 wavelength centers and
0 wavelength edges defined.

It's hard to imshow something with a wavelength axis
that isn't linearly or logarithmically uniform, or doesn't
at least have its wavelength edges defined. We're giving up
and just using the wavelength index as the wavelength axis.

If you want a real wavelength axis, one solution would
be to bin your wavelengths to a more uniform grid with
`rainbow.bin(R=...)` (for logarithmic wavelengths) or
`rainbow.bin(dw=...)` (for linear wavelengths)

Compare Results to True Line¶

Since we know exactly what we injected into the data set we can compare our results to the "true" value:

In [76]:

            
                Copied!
                
fitted_intercept = l.summary['mean'][f'{l.name}_p_0']
fitted_intercept_err = l.summary['sd'][f'{l.name}_p_0']
fitted_gradient = l.summary['mean'][f'{l.name}_p_1']
fitted_gradient_err = l.summary['sd'][f'{l.name}_p_0']

print(f"Fitted y-intercept = {fitted_intercept} +/- {fitted_intercept_err}, true value = {true_intercept}")
print(f"Fitted gradient = {fitted_gradient} +/- {fitted_gradient_err}, true value = {true_slope}")
fitted_intercept = l.summary['mean'][f'{l.name}_p_0']
fitted_intercept_err = l.summary['sd'][f'{l.name}_p_0']
fitted_gradient = l.summary['mean'][f'{l.name}_p_1']
fitted_gradient_err = l.summary['sd'][f'{l.name}_p_0']

print(f"Fitted y-intercept = {fitted_intercept} +/- {fitted_intercept_err}, true value = {true_intercept}")
print(f"Fitted gradient = {fitted_gradient} +/- {fitted_gradient_err}, true value = {true_slope}")

Fitted y-intercept = 0.8970199 +/- 0.0147035, true value = 0.9
Fitted gradient = 0.430649 +/- 0.0147035, true value = 0.4

Hopefully, if everything's worked then these values should agree within error!

Let's plot the data, the true regression line and our fit and see how they compare (If we used store_model=True at the .setup_lightcurves() stage then we could easily generate a 1-sigma region for the model using the errors stored in the summary table!):

In [77]:

            
                Copied!
                
                    
                    
                
                

        
# plot the fitted model
plt.plot(l.data.time, l.data_with_model.systematics_model[0], label='Fitted Systematics Model')

# ONLY UNCOMMENT IF YOU STORED MODELS EARLIER:
# lower_hdi=[l.summary['hdi_16%'][f'linear_tutorial_model_w0[{i}]'] for i in range(l.data.ntime)]
# upper_hdi=[l.summary['hdi_84%'][f'linear_tutorial_model_w0[{i}]'] for i in range(l.data.ntime)]
# plt.fill_between(l.data.time.to_value('d'), lower_hdi, upper_hdi, alpha=0.3, label='1-sigma Region')

# plot the injected line
plt.plot(x, true_regression_line, label='True Line')

# plot the data we fit
plt.plot(l.data.time, l.data.flux[0,:], "k.", alpha=0.2, label='Data (with noise)')
plt.errorbar(l.data.time, l.data.flux[0,:], l.data.uncertainty[0,:], color='k', linestyle="None", alpha=0.2)

plt.legend();
# plot the fitted model
plt.plot(l.data.time, l.data_with_model.systematics_model[0], label='Fitted Systematics Model')

# ONLY UNCOMMENT IF YOU STORED MODELS EARLIER:
# lower_hdi=[l.summary['hdi_16%'][f'linear_tutorial_model_w0[{i}]'] for i in range(l.data.ntime)]
# upper_hdi=[l.summary['hdi_84%'][f'linear_tutorial_model_w0[{i}]'] for i in range(l.data.ntime)]
# plt.fill_between(l.data.time.to_value('d'), lower_hdi, upper_hdi, alpha=0.3, label='1-sigma Region')

# plot the injected line
plt.plot(x, true_regression_line, label='True Line')

# plot the data we fit
plt.plot(l.data.time, l.data.flux[0,:], "k.", alpha=0.2, label='Data (with noise)')
plt.errorbar(l.data.time, l.data.flux[0,:], l.data.uncertainty[0,:], color='k', linestyle="None", alpha=0.2)

plt.legend();

Hopefully our results should look pretty close to the real values! Now we can move onto something a little more challenging...