Python Pandas Regression - python

I am struggling to figure out if regression is the route I need to go in order to solve my current challenge with Python. Here is my scenario:
I have a Pandas Dataframe that is 195 rows x 25 columns
All data (except for index and headers) are integers
I have one specific column (Column B) that I would like compared to all other columns
Attempting to determine if there is a range of numbers in any of the columns that influences or impacts column B
An example of the results I would like to calculate in Python is something similar to: Column B is above 3.5 when data in Column D is between 10.20 - 16.4
The examples of regression in Python I've been reading online appear to produce charts and statistics that I don't need (or maybe I am interpreting them incorrectly). I believe the proper wording for what I am asking is: identify specific values, or a range of values, in one column that relate linearly to the values in another column of a Pandas DataFrame.
Can anyone help point me in the right direction?
Thank you all in advance!

Your goals sound very much like exploratory data analysis at this point. You should probably first calculate the correlation between your target column B and any other column using pandas.Series.corr (which really is the same as bivariate regression), which you could list:
other_cols = [col for col in df.columns if col != 'B']
corr_B = [{other: df['B'].corr(df[other])} for other in other_cols]
To get a handle on specific ranges, I would recommend looking at:
the cut and qcut functionality to bin your data as you like and either plot or correlate subsets accordingly: see docs here and here.
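For instance, a rough sketch of the binning idea, assuming your DataFrame is called df and using the column names 'B' and 'D' from your example:
import pandas as pd

# Bin column D into, say, 5 equal-width intervals and look at the average
# of column B within each bin; bins of D where the mean of B is noticeably
# higher are candidates for the kind of range rule you describe.
df['D_bin'] = pd.cut(df['D'], bins=5)
print(df.groupby('D_bin')['B'].agg(['mean', 'count']))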
To visualize bivariate and simple multivariate relationships, I would recommend
the seaborn package because it includes various types of plots designed to help you get a quick grasp of covariation among variables. See for instance the examples for univariate and bivariate distributions here, linear relationship plots here, and categorical data plots here.
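As an illustration only (again assuming a DataFrame df with target column 'B'; the column names 'D' and 'E' are just placeholders for your own columns):
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of one candidate column against B with a fitted regression line.
sns.regplot(x='D', y='B', data=df)
plt.show()

# Pairwise relationships for a handful of columns, including the target.
sns.pairplot(df[['B', 'D', 'E']])
plt.show()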
The above should help you understand bivariate relationships. Once you want to progress to multivariate relationships, you could move on to the scikit-learn or statsmodels packages, which are best suited for this in Python IMHO. Hope this helps to get you started.

Related

Aggregate raster data based on geopandas DataFrame geometries

Situation
I have two datasets:
Raster data loaded using rioxarray, an xarray.DataArray
A geopandas.GeoDataFrame with geometries indicating areas in the first dataset
The geo data in both datasets are in the same CRS (EPSG:4326).
Problem
For each entry in the second dataset I want to aggregate all values from the first dataset which overlap with the specific geometry, kind of like a .groupby() over the geometries followed by a .sum().
Current WIP approach
The package xagg already does this, but it is unfortunately slow on a subset of my dataset and scales even worse when I try to use it on my full dataset.
Question
Is there a simple and efficient way to do this in Python?
(The solution wouldn't need to replicate the results from xagg accurately.)
WRT your comment, here is some pseudo code I've used to do what you're after. The function being executed outputs files in this case. If it's not obvious, this strategy isn't going to help if you just have one big raster and one big poly file. This method assumes tiled data and uses an indexing system to match the right rasters with the overlapping polys. The full example is kind of beyond the scope of a single answer, but if you run into issues, ask specifics and I can try to assist. In addition to dask's good documentation, there are lots of other posts on here with dask.delayed examples.
import dask

results_list = []
for f in raster_file_list:
    # Build a lazy task per raster tile; nothing is executed yet.
    temp_result = dask.delayed(your_custom_function)(f, your_custom_function_arg_1, your_custom_function_arg_2)
    results_list.append(temp_result)
# Run all tasks in parallel with the multiprocessing scheduler.
results = dask.compute(*results_list, scheduler='processes')

Identification of redundant columns/variables in a classification case study

I have a Database with 13 columns (both categorical and numerical). The 13th column is a categorical variable SalStat which classifies whether the person earns below 50k or above 50k. I am using Logistic Regression for this case and want to know which columns (numerical and categorical) are redundant, that is, don't affect SalStat, so that I can remove them. What function should I use for this purpose?
In my opinion you can study the correlation between your variables and remove the ones that have high correlation with each other, since they in a way give the same amount of information to your model.
You can start with something like DataFrame.corr(), then draw a heatmap using seaborn for better visualization with seaborn.heatmap(), or a simpler one with plt.imshow(data.corr()) and plt.colorbar().
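A minimal sketch of that, assuming your data is loaded in a DataFrame called data and only the numeric columns go into the correlation matrix:
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns only.
corr = data.select_dtypes(include='number').corr()

# Heatmap with the correlation values annotated in each cell.
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()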

Data analysis : compare two datasets for devising useful features for population segmentation

Say I have two pandas dataframes, one containing data for general population and one containing the same data for a target group.
I assume this is a very common use case of population segmentation. My first idea to explore the data would be to perform some visualization using e.g. seaborn FacetGrid, barplot and scatterplot, or something like that, to get a general idea of the trends and differences.
However, I found out that this operation is not as straightforward as I thought as seaborn is made to analyze one dataset and not compare two datasets.
I found this SO answer which provides a solution, but I am wondering how people would go about it if the dataframes were huge and a concat operation were not possible.
Datashader does not seem to provide such features, as far as I have seen.
Thanks for any ideas on how to go about such a task.
I would use the library Dask when data is too big for pandas. Dask comes from the same PyData ecosystem as pandas and is a little bit more advanced, because it is a big data tool, but it mirrors much of the pandas API, including concat. I found Dask easy enough to use and am using it for a couple of projects where I have dozens of columns and tens of millions of rows.
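A rough sketch of how that could look with Dask, assuming the two populations live in CSV files on disk (the file names and the 'age' column are placeholders, not from your setup):
import dask.dataframe as dd

# Read each population lazily from disk and tag its rows with a group label.
general = dd.read_csv('general_population.csv').assign(group='general')
target = dd.read_csv('target_group.csv').assign(group='target')

# Concatenate without materializing everything in memory at once.
combined = dd.concat([general, target])

# Example comparison: the per-group mean of one numeric feature.
print(combined.groupby('group')['age'].mean().compute())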

find anomalies in records of categorical data

I have a dataset with m observations and p categorical (nominal) variables; each variable X1, X2, ..., Xp has several different classes (possible values). Ultimately I am looking for a way to find anomalies, i.e. to identify rows for which the combination of values seems incorrect with respect to the data I have seen so far. So far I was thinking about building a model to predict the value of each column and then building some metric to evaluate how different the actual row is from the predicted row. I would greatly appreciate any help!
Take a look at the nearest neighbors method and cluster analysis. The metric can be simple (like squared error) or even custom (with predefined weights for each category).
Nearest neighbors will answer the question 'how different is the current row from the other rows', and cluster analysis will answer the question 'is it an outlier or not'. Some visualization may also help (t-SNE).
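A minimal sketch of the nearest-neighbors idea, assuming the categorical data sits in a DataFrame called df; one-hot encoding is used here as a simple stand-in for a custom metric:
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# One-hot encode the nominal variables so distances can be computed.
X = pd.get_dummies(df)

# Average distance to the k nearest other rows; large values flag rows
# whose combination of category values is unusual.
nn = NearestNeighbors(n_neighbors=6).fit(X)
distances, _ = nn.kneighbors(X)

# Column 0 is the row itself (distance 0), so average over the rest.
scores = distances[:, 1:].mean(axis=1)
print(df.assign(anomaly_score=scores).sort_values('anomaly_score', ascending=False).head())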

Python: How to properly deal with NaN's in a pandas DataFrame for feature selection in scikit-learn

This is related to a question I posted here but this one is more specific and simpler.
I have a pandas DataFrame whose index is unique user identifiers, columns correspond to unique events, and values 1 (attended), 0 (did not attend), or NaN (wasn't invited/not relevant). The matrix is pretty sparse with respect to NaNs: there are several hundred events and most users were only invited to several tens at most.
I created some extra columns to measure the "success" which I define as just % attended relative to invites:
# Non-NaN entries per row = number of events the user was invited to.
my_data['invited'] = my_data.count(axis=1)
# The row sum now includes the new 'invited' column, so subtract it back
# out to get the number of events actually attended (the 1's).
my_data['attended'] = my_data.sum(axis=1) - my_data['invited']
my_data['success'] = my_data['attended']/my_data['invited']
My goal right now is to do feature selection on the events/columns, starting with the most basic variance-based method: remove those with low variance. Then I would look at a linear regression on the events and keep only those with large coefficients and small p-values.
But my problem is I have so many NaN's and I'm not sure what the correct way to deal with them is as most scikit-learn methods give me errors because of them. One idea is to replace 'didn't attend' with -1 and 'not invited' with 0 but I'm worried this will alter the significance of events.
Can anyone suggest the proper way to deal with all these NaN's without altering the statistical significance of each feature/event?
Edit: I'd like to add that I'm happy to change my metric for "success" from the above if there is a reasonable one which will allow me to move forward with feature selection. I am just trying to determine which events are effective in capturing user interest. It's pretty open-ended and this is mostly an exercise to practice feature selection.
Thank you!
If I understand correctly, you would like to clean your data of NaN's without significantly altering its statistical properties, so that you can run some analysis afterwards.
I actually came across something similar recently; one simple approach you might be interested in is sklearn's Imputer. As EdChum mentioned earlier, one idea is to replace missing values with the mean along an axis. Other options include replacing with the median, for example.
Something like:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
cleaned_data = imp.fit_transform(original_data)
In this case, this will replace the NaN's with the mean across each axis (for example let's impute by event so axis=1). You then could round the cleaned data to make sure you get 0's and 1's.
I would plot a few histograms for the data, by event, to sanity check whether this preprocessing significantly changes your distribution - as we may be introducing too much of a bias by swapping so many values to the mean / mode / median along each axis.
Link for reference: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
Taking things one step further (assuming the above is not sufficient), you could alternatively do the following:
Take each event column in your data and calculate the probability of attending ('p') vs not attending ('1 - p'), after dropping all NaN values [i.e. p = attended / (attended + not attended)].
Then replace the NaN values in each event column using random numbers generated from a Bernoulli distribution fitted with the 'p' you estimated, roughly something like:
import numpy as np

n = 1     # number of trials
p = 0.25  # estimated probability of each trial (i.e. replace with attended / total for the column)
s = np.random.binomial(n, p, 1000)
# s now contains a bunch of random 1's and 0's you can use to replace the NaN values in each column
Again, this in itself is not perfect, as you are still going to end up slightly biasing your data (e.g. a more accurate approach would be to account for dependencies across events for each user), but by sampling from a roughly matching distribution this should at least be more robust than replacing arbitrarily with mean values etc.
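A rough sketch of that per-column procedure, assuming your attendance DataFrame is called my_data and the event columns are everything except the derived 'invited'/'attended'/'success' columns:
import numpy as np

rng = np.random.default_rng(0)
event_cols = my_data.columns.difference(['invited', 'attended', 'success'])

for col in event_cols:
    observed = my_data[col].dropna()
    if observed.empty:
        continue
    # Estimated probability of attending, given an invite to this event.
    p = observed.mean()
    # Draw Bernoulli samples only for the missing entries of this column.
    mask = my_data[col].isna()
    my_data.loc[mask, col] = rng.binomial(1, p, size=mask.sum())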
Hope this helps!
The Imputer example above is deprecated; please use:
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
cleaned_data = imputer.fit_transform(original_data)
