How would I create a dataframe to calculate conditional probability in Python?

I have a data frame like this source data frame, and from it I need to build a data frame of conditional probabilities over time, given the images present in specific years, like the target data frame.
I have used melt and then merge to get the probability, but I am not getting the correct result. For example, if I run the code below on my source data frame, the probability of 'At least one mountain' given 'At least one mountain' comes out as 0.32301, whereas it should be 1 for every year. Am I calculating the conditional probability incorrectly?
I took some inspiration from the Conditional Probability in Python solution, but it still gives me the wrong answer.
Here is how I am populating my final data frame, which is wrong compared with the target data frame above.
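For comparison, here is a minimal sketch of how a per-year conditional probability could be computed directly with groupby. The column names are hypothetical stand-ins (the real source frame is only shown as an image); note that conditioning a feature on itself comes out as 1 for every year, which is what the target expects.

import pandas as pd

# Hypothetical boolean feature frame, one row per image
df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001],
    "at_least_one_mountain": [1, 1, 0, 1, 1],
    "at_least_one_lake":     [1, 0, 0, 1, 0],
})

def conditional_prob(group, given, target):
    # P(target | given) within one year: rows where both features hold,
    # divided by rows where the conditioning feature holds
    cond = group[group[given] == 1]
    return cond[target].mean() if len(cond) else float("nan")

# P(at least one lake | at least one mountain), per year
probs = (df.groupby("year")[["at_least_one_mountain", "at_least_one_lake"]]
           .apply(conditional_prob,
                  given="at_least_one_mountain",
                  target="at_least_one_lake"))
print(probs)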

Related

How to correlate partial signal with dataset

I have some large datasets of sensor values, each consisting of a single sensor value sampled at one-minute intervals, like a waveform. The total dataset spans a few years.
I wish to (using Python) select an arbitrary set of sensor data (for instance 600 values, so 10 hours' worth of data) and find all timestamps where roughly the same shape occurs in these datasets.
The matches should be made by shape (relative differences), not by actual values, as there are different sensors used with different biases and environments. Also, I wish to retrieve multiple matches within a single dataset, to further analyse.
I’ve been looking into pandas, but I’m stuck at the moment... any guru here?
I don't know much about the functionality available in pandas. I think you first need to decide on the typical time span T over which the correlation is supposed to occur. What I would do is split all your time series into (possibly overlapping) segments of duration T using NumPy (see here for instance). This gives a long list of segments. I would then compute the correlation between all pairs of segments, e.g. with corrcoef. You get a large correlation matrix in which you can spot pairs of similar segments by applying a threshold on the absolute value of the correlation. You can estimate a suitable threshold by applying this algorithm to a dataset where you don't expect any correlation, or by randomizing your data.
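A rough sketch of that segment-and-correlate idea, assuming the signal is a plain 1-D NumPy array sampled once per minute (the variable names are illustrative, not from the question):

import numpy as np

def sliding_segments(signal, length, step):
    # Split a 1-D array into (possibly overlapping) segments of `length`
    # samples, one starting every `step` samples
    starts = list(range(0, len(signal) - length + 1, step))
    return np.array([signal[s:s + length] for s in starts]), starts

rng = np.random.default_rng(0)
signal = rng.normal(size=10_000)              # stand-in for a real sensor trace
segments, starts = sliding_segments(signal, length=600, step=60)

# Correlation between every pair of segments: shape (n_segments, n_segments)
corr = np.corrcoef(segments)

# Keep pairs whose absolute correlation exceeds a threshold; estimate the
# threshold from data you expect to be uncorrelated, as suggested above
threshold = 0.8
i, j = np.where(np.triu(np.abs(corr), k=1) > threshold)
matches = [(starts[a], starts[b]) for a, b in zip(i, j)]   # start offsets of similar pairs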

Pandas: does the quantile function need sorted data to calculate percentiles?

I'm using Pandas to clean up some data and do basic statistics. I am wondering whether quantile() sorts the values before the calculation or whether I must sort them beforehand.
For example, here I'm trying to get the 50th percentile of the number of workers in each company:
Percentile50th = Y2015_df.groupby(["company"])["worker"].quantile(0.50)
I'm asking because when I was verifying my values against the results in MS Excel, I found that the Median function required the data to be sorted to give the right median. I'm not sure whether that is the case in Pandas.
You do not need to sort. See the link in my previous comment; a quick example is below.
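An illustrative check on made-up data that the grouped quantile comes out the same whether or not the values are pre-sorted:

import pandas as pd

df = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B"],
    "worker":  [30, 10, 20, 5, 50],
})

unsorted_median = df.groupby("company")["worker"].quantile(0.50)
sorted_median = df.sort_values("worker").groupby("company")["worker"].quantile(0.50)

assert unsorted_median.equals(sorted_median)   # identical either way
print(unsorted_median)                         # A: 20.0, B: 27.5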

Find probability of failure using logistic regression in Python

I am trying to get some insights from a dataset using logistic regression. The starting dataframe records whether something was removed unscheduled (1 = yes, 0 = no) along with some data that was provided with this removal.
This looks like this:
This data is then 'dummyfied' using pandas.get_dummies, where the result is as expected.
Then I normalized the found coefficients (using coef_) to scale everything the same. I put this in a dataframe with the column 'Parameter' (which is the column name of the dummy dataframe) and the column 'Value' (which is the value obtained using the coefficients).
I then get the following result.
This result shows that the On-wing Time is the biggest contributor to the unscheduled removals.
Now the question: how can I predict what the chance is that there will be an unscheduled removal for this reason (this column)? So, what is the chance that I will get another unscheduled removal which is caused by the On-wing Time?
Note that these parameters can change since this data is fake data and data may be added later on. So when the biggest contributor changes, the prediction should also focus on the new biggest contributor.
I hope you understand the question.
EDIT
The complete code and dataset (fake one) can be found here: 1drv.ms/u/s!AjkQWQ6EO_fMiSEfu3vYgSTBR0PZ
Ganesh
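Since the real frame is only in the linked file, here is a minimal sketch (on made-up data, with placeholder column names) of how predict_proba on the fitted model could give the chance of an unscheduled removal for a given combination of parameters:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up stand-in for the real dataset
raw = pd.DataFrame({
    "on_wing_time": [100, 2500, 300, 4000, 150, 3500],
    "location": ["A", "B", "A", "B", "A", "B"],
    "unscheduled": [0, 1, 0, 1, 0, 1],
})

X = pd.get_dummies(raw.drop(columns="unscheduled"))   # same 'dummyfied' step
y = raw["unscheduled"]
model = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns [P(no removal), P(removal)] per row; column 1 is the
# chance of an unscheduled removal for that combination of parameters
new_case = pd.DataFrame([{"on_wing_time": 3000, "location": "B"}])
new_case = pd.get_dummies(new_case).reindex(columns=X.columns, fill_value=0)
print(model.predict_proba(new_case)[:, 1])

If the biggest contributor changes later, the same call still applies, because predict_proba uses all fitted coefficients rather than a single column.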

Pandas and the best method for representing variable-length time-series

Here's the scenario. Let's say I have data from a visual psychophysics experiment, in which a subject indicates whether the net direction of motion in a noisy visual stimulus is to the left or to the right. The atomic unit here is a single trial and a typical daily session might have between 1000 and 2000 trials. With each trial are associated various parameters: the difficulty of that trial, where stimuli were positioned on the computer monitor, the speed of motion, the distance of the subject from the display, whether the subject answered correctly, etc. For now, let's assume that each trial has only one value for each parameter (e.g., each trial has only one speed of motion, etc.). So far, so easy: trial ids are the Index and the different parameters correspond to columns.
Here's the wrinkle. With each trial are also associated variable length time series. For instance, each trial will have eye movement data that's sampled at 1 kHz (so we get time of acquisition, the x data at that time point, and y data at that time point). Because each trial has a different total duration, the length of these time series will differ across trials.
So... what's the best means for representing this type of data in a pandas DataFrame? Is this something that pandas can even be expected to deal with? Should I go to multiple DataFrames, one for the single valued parameters and one for the time series like parameters?
I've considered adopting a MultiIndex approach where level 0 corresponds to trial number and level 1 corresponds to time of continuous data acquisition. Then all I'd need to do is repeat the single-valued columns to match the length of the time series on that trial. But I immediately foresee two problems. First, the number of single-valued columns is large enough that extending each one to match the length of the time series seems very wasteful if not impractical. Second, and more importantly, if I want to do basic groupby-type analyses (e.g. getting the proportion of correct responses at a given difficulty level), this will give biased (incorrect) results, because whether each trial was correct or wrong will be repeated as many times as necessary for its length to match the length of the time series on that trial (which is irrelevant to the computation of the mean across trials).
I hope my question makes sense and thanks for suggestions.
I've also just been dealing with this type of issue. I have a bunch of motion-capture data that I've recorded, containing x-, y-, and z-locations of several motion-capture markers at 10 ms intervals, but there are also a couple of single-valued fields per trial (e.g., which task the subject is doing).
I've been using this project as a motivation for learning about pandas so I'm certainly not "fluent" yet with it. But I have found it incredibly convenient to be able to concatenate data frames for each trial into a single larger frame for, e.g., one subject:
import pandas as pd

# subject_trials is a list of per-trial CSV paths
subject_df = pd.concat(
    [pd.read_csv(t) for t in subject_trials],
    keys=[i for i, _ in enumerate(subject_trials)])
Anyway, my suggestion for how to combine single-valued trial data with continuous time recordings is to duplicate the single-valued columns down the entire index of your time recordings, like you mention toward the end of your question.
The only thing you lose by denormalizing your data in this way is that your data will consume more memory; however, provided you have sufficient memory, I think the benefits are worth it, because then you can do things like group individual time frames of data by the per-trial values. This can be especially useful with a stacked data frame!
As for removing the duplicates for doing, e.g., trial outcome analysis, it's really straightforward to do this:
df.outcome.unique()
assuming your data frame has an "outcome" column.
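To make that duplicate-then-de-duplicate pattern concrete, here is a small sketch on fabricated trials (all column names are hypothetical); grouping on the trial level of the index is what keeps trial-level averages unbiased despite the repetition:

import numpy as np
import pandas as pd

def fake_trial(n_samples, outcome, difficulty):
    # Continuous 1 kHz samples plus single-valued fields broadcast down the trial
    return pd.DataFrame({
        "t_ms": np.arange(n_samples),
        "eye_x": np.random.randn(n_samples),
        "eye_y": np.random.randn(n_samples),
        "outcome": outcome,
        "difficulty": difficulty,
    })

trials = [fake_trial(1200, 1, 0.1), fake_trial(800, 0, 0.4), fake_trial(1500, 1, 0.4)]
subject_df = pd.concat(trials, keys=range(len(trials)), names=["trial", "sample"])

# One row per trial for unbiased trial-level statistics
per_trial = subject_df.groupby(level="trial")[["outcome", "difficulty"]].first()
print(per_trial.groupby("difficulty")["outcome"].mean())   # proportion correct per difficulty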

Splitting time data into "runs" in order to plot and examine differences

I am trying to investigate differences between runs/experiments in a continuously logged data set. I am taking a fixed subset of a few months from this data set and then analysing it to estimate when each run was started. I have this sorted in a series of times.
With this I then chop the data up into 30-hour chunks (the approximate time between runs) and put them into a dictionary:
import numpy as np

# df is the full logged DataFrame with a DatetimeIndex; times holds the estimated run starts
data = {}
for time in times:
    timeNow = np.datetime64(time.to_datetime())
    time30hr = np.datetime64(time.to_datetime()) + np.timedelta64(30 * 60 * 60, 's')
    data[time] = df[timeNow:time30hr]
So now I have a dictionary of dataframes, indexed by StartTime, and each one contains all of my data for a run, plus some extra to ensure I have everything for every run. But to compare two runs I need a common X value to stack them on top of each other. Every run is different, and the point I want to consider "the same" varies depending on what I'm looking at. For the example below I have used the largest value in the dataset to "pivot" on.
import datetime
from matplotlib.pyplot import plot

for time in data:
    A = data[time]
    # Find the time of the maximum value, taking the first if there is more than one
    maxTtime = A[A['Value'] == A['Value'].max()]['DateTime'].iloc[0]
    # Keep 12 hours before and 12 hours after that point
    new = A[maxTtime - datetime.timedelta(0.5):maxTtime + datetime.timedelta(0.5)]
    # Add a column with time relative to the pivot point
    new['RTime'] = new['DateTime'] - maxTtime
    # Plot values against this relative time
    plot(new['RTime'], new['Value'])
This yields a graph like the one shown, which is great except that I can't get a decent legend to tell which run was which and work out how much variation there is. I believe half my problem is that I'm iterating over a dictionary of dataframes, which is causing issues.
Could someone recommend how to better organise this (a dictionary of dataframes is all I could do to get it to work)? I've thought of using a hierarchical dataframe and, instead of indexing it by run time, assigning a set of identifiers to the runs (the actual time is contained within the dataframes themselves, so I have no problem losing the assumed start time) and then plotting it with a legend.
My final aim is to have a dataset and methodology that let me investigate the similarity and differences between runs using different "pivot points" and produce a graph of each one that I can then interrogate (or at least tell which data set is which so I can interrogate the data directly), but I couldn't get past various errors while creating it.
I can upload a sample of the data as a CSV if required, but am not sure of the best place to upload it. Thanks
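One possible way to organise this, sketched on fabricated data: collapse the dictionary of dataframes into a single hierarchical frame keyed by a run identifier and the relative time, then group by the run level when plotting so each run gets its own legend entry. Here `df` (a 'Value' column on a DatetimeIndex) and `times` (the estimated run starts) are stand-ins for the real data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = pd.date_range("2015-01-01", periods=30 * 60, freq="min")
df = pd.DataFrame({"Value": np.random.randn(len(idx)).cumsum()}, index=idx)
times = [idx[0], idx[600], idx[1200]]

runs = {}
for run_id, start in enumerate(times):
    chunk = df.loc[start:start + pd.Timedelta(hours=30)]
    pivot = chunk["Value"].idxmax()                  # the chosen "pivot" point
    window = df.loc[pivot - pd.Timedelta(hours=12):pivot + pd.Timedelta(hours=12)].copy()
    window["RTime"] = window.index - pivot           # time relative to the pivot
    runs[run_id] = window.set_index("RTime")

all_runs = pd.concat(runs, names=["run", "RTime"])   # hierarchical frame: (run, RTime)

# Plot each run against its relative time, with a usable legend
for run_id, run_df in all_runs.groupby(level="run"):
    rtime_hours = run_df.index.get_level_values("RTime").total_seconds() / 3600
    plt.plot(rtime_hours, run_df["Value"], label=f"run {run_id}")
plt.xlabel("hours from pivot")
plt.legend()
plt.show()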
