Programming Questions
In general, we are looking for your ability to implement efficient algorithmic ideas cleanly. There is no need to make the code look sophisticated; in the end, neat, efficient, and functional code is what we expect.
PART 1
(Expected time to complete: 25 minutes)
Survival analysis is a branch of statistics used for analyzing the expected time duration until an event of interest occurs. In this notebook, we will be working through some concepts in survival analysis in the context of modeling failure times for a hypothetical machine. We denote the time at which this machine fails by $T$, where $T \geq 0$. (You can think of time 0 as the time at which the machine was installed.) The Weibull distribution is particularly popular in survival analysis, as it can accurately model the time-to-failure of real-world events and is flexible despite having only two parameters. The distribution of failure times is given by the following formula:

$$f(t) = \frac{k}{\lambda} \left( \frac{t}{\lambda} \right)^{k-1} e^{-(t/\lambda)^{k}}, \qquad t \geq 0$$
where $k$ is known as the shape parameter and $\lambda$ is known as the scale parameter. Several examples of Weibull distributions are plotted in the cell below.
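The density above is straightforward to evaluate directly with numpy. The sketch below is illustrative (the shape/scale values are made up, not the notebook's), but the formula matches the definition above:

```python
import numpy as np

def weibull_pdf(t, k, lam):
    """Weibull failure-time density f(t) = (k/lam) * (t/lam)**(k-1) * exp(-(t/lam)**k)."""
    t = np.asarray(t, dtype=float)
    return (k / lam) * (t / lam) ** (k - 1) * np.exp(-((t / lam) ** k))

# A few shapes at the same scale, mirroring the plotted examples
t = np.linspace(0.01, 5, 500)
curves = {k: weibull_pdf(t, k, lam=1.0) for k in (0.5, 1.0, 1.5, 5.0)}
```

Note that $k = 1$ reduces to the exponential distribution with rate $1/\lambda$, which is one reason the Weibull family is so flexible.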
A derived statistic of the Weibull distribution that turns out to be very useful for the purposes of survival analysis is the hazard function $h(t)$. The hazard function represents the probability of failure in the next time period given that the asset has survived up until time $t$. For a Weibull model, the hazard function can be written as:

$$h(t) = \frac{k}{\lambda} \left( \frac{t}{\lambda} \right)^{k-1}$$
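As a quick sketch of the formula above (parameter values here are illustrative only):

```python
import numpy as np

def weibull_hazard(t, k, lam):
    """Weibull hazard h(t) = (k/lam) * (t/lam)**(k-1): increasing for k > 1
    (wear-out), constant for k = 1, decreasing for k < 1 (infant mortality)."""
    t = np.asarray(t, dtype=float)
    return (k / lam) * (t / lam) ** (k - 1)
```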
Many statistics of interest can be derived from the hazard function, including:
Probability of survival until time $T$
Probability of failure in a given time horizon
We will be writing functions to implement both of these statistics and exploring how they might be used to predict machine failure.
Perhaps the most intuitive derived statistic of the hazard function is the survival function $S(t)$, which tells us the probability that the machine is still in service (i.e. it has not yet failed) at time $t$. Naturally, $S(0) = 1$. The function can be written as

$$S(t) = e^{-(t/\lambda)^{k}}$$

and can be derived from the hazard using the formula

$$S(t) = e^{-H(t)}$$

where the cumulative hazard function is defined as

$$H(t) = \int_0^t h(u)\,du$$
If you noticed how the failure density and the hazard function were created in the cells above, we are not actually generating a continuous function but instead a discrete-time approximation represented as a numpy array. You can think of each time step as a day in the machine's lifetime. (So the 20th entry of the hazard array is the probability that the machine will fail on day 20 given that it was alive on day 19.) Therefore, computing the cumulative hazard is as simple as computing the sum of the hazard up until time $t$ (there's a nice function in the numpy library for doing this!)
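Under that discrete-time view, the whole chain from hazard to survival is a couple of array operations. A minimal sketch, with made-up shape and scale values standing in for the notebook's:

```python
import numpy as np

# Illustrative discrete daily hazard: one entry per day of the machine's life
k, lam = 1.5, 2000.0
days = np.arange(1, 5001, dtype=float)
hazard = (k / lam) * (days / lam) ** (k - 1)

cum_hazard = np.cumsum(hazard)     # H(t): sum of the hazard up to day t
survival = np.exp(-cum_hazard)     # S(t) = exp(-H(t))
```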
As you might expect, the probability of the machine surviving beyond time $t$ decreases as time goes on.
Another statistic of interest is the horizon failure probability: given that the machine is in service at time $t$, what is the probability that it fails in the next $s$ timesteps? This can be expressed as

$$P(t < T \leq t + s \mid T > t) = \frac{S(t) - S(t+s)}{S(t)} = 1 - \frac{S(t+s)}{S(t)}$$
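With a discrete survival array, this ratio is a one-liner. A sketch (the boundary convention at the end of the array is an assumption and may differ from the notebook's):

```python
import numpy as np

def horizon_failure_prob(survival, t, s):
    """P(fail in (t, t+s] | alive at t) = 1 - S(t+s)/S(t),
    with `survival` a discrete array indexed by day."""
    end = min(t + s, len(survival) - 1)   # clamp to the end of the array
    return 1.0 - survival[end] / survival[t]
```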
At this point, you might be wondering what exactly the point of all this is. It turns out that using machine learning, we can use data collected during the lifetimes of many machines to train a model that takes in information about a new machine of the same type and returns an estimate of the hazard function for that machine specifically. We can then use this estimated hazard function to compute almost any statistic we'd like, including the horizon failure probability. One thing that might be particularly interesting is to see whether we can predict failure at a certain horizon. That is: given data up until a certain time $t$, how accurately can we predict whether the machine will fail in the next $s$ timesteps? For the purposes of this example, we'll choose $s = 365$ days (one year).
One approach to implementing this is the following. We start at the beginning of the machine's lifetime, and compute the probability of failure between times $0$ and $s$. Then we compute the probability of failure between $s$ and $2s$. And so on, until we reach the end of the machine's lifetime. Each time, we record whether the actual failure time fell within the interval (this will only be true the last time we do the computation).
Let's make things a bit more specific. Assume the hazard function we've been working with represents the hazard for each day of a machine's lifetime, and again that the horizon we're interested in is 1 year (= 365 days). Also assume that based on the manufacturer's records, this machine failed on day 4000. The plot below will hopefully make it clear what we'd like to implement:
The last interval, highlighted in red (3650, 4015), contains the actual failure (which occurred on day 4000 of its life, represented by the thick red vertical line), while the spaces between the vertical lines represent each of the intervals over which we're computing the failure probability. Imagine we have:
a trained model, which given data about a machine outputs a hazard function that accurately models the probability of failure across the machine's lifetime
a large set of training examples
a set of test (validation) examples
If we were to compute the failure probability for each year-long interval for each machine in the training set, we might be able to find a threshold that would allow us to build a binary classifier for failure in the next year! We'll come back to this in a second. Right now, we would like to construct a pandas DataFrame that contains failure probabilities and labels of whether a failure actually occurred in each interval (for just one machine).
|   | failure_prob | label |
|---|--------------|-------|
| 0 | 0.0          | 0     |
| 1 | 0.1          | 0     |
| 2 | 0.2          | 0     |
| 3 | 0.5          | 1     |
|    | failure_prob | label |
|----|--------------|-------|
| 0  | 0.021148     | 0     |
| 1  | 0.062001     | 0     |
| 2  | 0.101149     | 0     |
| 3  | 0.138664     | 0     |
| 4  | 0.174613     | 0     |
| 5  | 0.209061     | 0     |
| 6  | 0.242072     | 0     |
| 7  | 0.273705     | 0     |
| 8  | 0.304017     | 0     |
| 9  | 0.333065     | 0     |
| 10 | 0.360900     | 0     |
| 11 | 0.387574     | 1     |
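A table like this could be built along the following lines. This is a sketch under assumed conventions (the exact number of rows depends on where you stop iterating relative to the failure day, and `interval_failure_table` is an illustrative name, not the notebook's):

```python
import numpy as np
import pandas as pd

def interval_failure_table(survival, horizon, failure_day):
    """One row per horizon-long interval: the failure probability for that
    interval and a 0/1 label for whether the machine actually failed in it."""
    rows = []
    for start in range(0, failure_day, horizon):
        end = min(start + horizon, len(survival) - 1)
        prob = 1.0 - survival[end] / survival[start]   # 1 - S(end)/S(start)
        rows.append({"failure_prob": prob,
                     "label": int(start < failure_day <= end)})
    return pd.DataFrame(rows)

# Toy example: exponentially decaying survival, failure on day 4000
surv = np.exp(-0.0005 * np.arange(5000))
df = interval_failure_table(surv, horizon=365, failure_day=4000)
```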
Nice work so far!
How might you build a failure prediction classifier at the 1-year horizon, given the setup we've outlined? Remember, you have at your disposal:
a trained model, which given data about a machine outputs hazard functions that accurately model the probability of failure across the machine's lifetime
a large set of training examples
a set of test (validation) examples
and now,
a function that takes a survival function for a machine (remember, survival can be easily derived from hazard) and returns a list of failure probabilities at the 1-year horizon, along with the "ground truth" of whether the machine actually failed in that period.
Hint 1: For this problem, don't worry too much about the interpretation of the failure probability values; all you should be assuming is that higher probabilities indicate it's more likely that a machine will fail in that period.
Hint 2: Remember that threshold we referred to a few cells ago? How could we learn a good value for it from all of the machine records in the training data, and use it to see how well we can predict failure for all of the machines in the test data?
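One simple shape such a classifier could take is sketched below. This is only one option: the names are illustrative, and in practice you might select the threshold by F1 score or ROC analysis rather than raw accuracy:

```python
import numpy as np

def learn_threshold(probs, labels):
    """Scan the observed failure probabilities and keep the one that, used as a
    cut-off (predict failure when prob >= threshold), maximizes accuracy."""
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels)
    best_tau, best_acc = 0.0, -1.0
    for tau in np.unique(probs):
        acc = ((probs >= tau).astype(int) == labels).mean()
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau

# Pool (failure_prob, label) rows from every training machine, learn the
# threshold once, then evaluate (prob >= tau) on the test machines' intervals.
train_probs = [0.1, 0.2, 0.6, 0.8]
train_labels = [0, 0, 1, 1]
tau = learn_threshold(train_probs, train_labels)
```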
PART 2
(Expected time to complete: 20 minutes)
Write a Python function that takes an integer n and returns the integer closest to the square root of n, without importing any modules.
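One possible solution sketch uses integer Newton iteration, so no modules are needed; the function name here is illustrative:

```python
def nearest_sqrt(n):
    """Integer closest to sqrt(n), for non-negative integer n, without imports."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 2:
        return n
    x = n
    while x * x > n:              # integer Newton iteration -> floor(sqrt(n))
        x = (x + n // x) // 2
    # Round: pick x or x + 1, whichever square is closer to n
    return x + 1 if (x + 1) ** 2 - n < n - x * x else x
```

Integer-only arithmetic avoids floating-point rounding issues for large n, which a `n ** 0.5`-based answer can suffer from.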
PART 3
(Expected time to complete: 15 minutes)
Problem statement: You have a steady stream of data points coming in, and your objective is to compute the mean of all data samples observed so far, whenever a new data point comes in. Write an "efficient" algorithm to do this.
Note that your function will be called whenever a new data point comes in, and you may choose to record historical information however you see fit. But please keep in mind that the algorithm must be efficient. You can assume this stream is a consistent, never-ending inflow of points.
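A sketch of the standard approach: keep only the running count and mean, so each update is O(1) in both time and memory no matter how long the stream runs (the class name is illustrative):

```python
class StreamingMean:
    """Online mean of a data stream: O(1) time and memory per incoming point."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # mean_k = mean_{k-1} + (x - mean_{k-1}) / k  -- avoids storing history
        # and is numerically stabler than accumulating a raw, unbounded sum
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean
```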