Add data to Dataframe with pandas - python

I want to add a result from a simulation model to an existing dataframe at a specific position within the dataframe.
Based on a dataframe and a model for linear regression I am calculating a value. This value must be added to the input dataframe used for the linear regression. I am using pandas' insert function, which raises an error (posted as a screenshot, not reproduced here).

Thank you very much for your help. It worked with:
df2.insert(4, 'Stunde_10', y_heute_10[0, 0])  # note: insert works in place and returns None
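For reference, a minimal sketch of how DataFrame.insert behaves (the column names and values below are illustrative):

import pandas as pd

df2 = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})
df2.insert(4, 'Stunde_10', 0.5)     # loc, column name, value; modifies df2 in place
print(df2.columns.tolist())         # ['a', 'b', 'c', 'd', 'Stunde_10']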

Related

Vaex .apply() method on Vaex data frame is giving incorrect output

I am trying to use the .apply() method on a Vaex data frame, but it gives an error. The code executed and the error message were posted as screenshots and are not reproduced here.
This dataset has 10+ million rows.
I am trying to create one-hot encoded features wherein the values in each feature represent the count of its repetition by the same user before the current transaction date.
Kindly help ;(
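As a minimal sketch in pandas rather than Vaex (column names are illustrative), the feature described is a per-user cumulative count of earlier transactions:

import pandas as pd

df = df.sort_values('transaction_date')                 # hypothetical column names
df['prior_count'] = df.groupby('user_id').cumcount()    # number of earlier rows by the same user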

Interpolation in data frame without making a new data frame

I have a pandas dataframe with an index and just one column. The index holds dates; the column holds values. I would like to find the NewValue for a NewDate that is not in the index. To do that, I suppose I may use an interpolation function such as NewValue = InterpolationFunction(NewDate, Index, Column, method, etc.). So what is the InterpolationFunction? It seems that most interpolation functions are used for padding, i.e. finding the missing values, etc. This is not what I want. I just want the NewValue, not to build a new dataframe, etc.
Thank you very very much for taking the time to read this.
I am afraid that you cannot find the missing values without constructing a base for your data. Here is the answer to your question if you make a base dataframe:
You need to construct a panel in order to set up your data for proper interpolation. For instance, in the case of the date, let's say that you have yearly data and you want to add information for a missing year in between or generate data for quarterly intervals.
You need to construct a base for the time series i.e.:
import pandas as pd

dates = pd.date_range(start="1987-01-01", end="2021-01-01", freq="Q").values
panel = pd.DataFrame({'date_Q': dates})
Now you can join your data to this structure, and the dates without information will show up as missing. You then need a proper interpolation algorithm to fill the missing values. Pandas' .interpolate() method offers some basic interpolation methods, such as polynomial and linear, which you can find here.
However, many more interpolation methods are offered by SciPy, which you can find in the tutorials here.
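A minimal sketch of the join-then-interpolate step (the value column and the sample data are illustrative):

import pandas as pd

data = pd.DataFrame({
    'date_Q': pd.to_datetime(['1990-03-31', '1991-03-31']),
    'value': [10.0, 14.0],
})

merged = panel.merge(data, on='date_Q', how='left')     # unmatched quarters become NaN
merged['value'] = merged['value'].interpolate(method='linear')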

pyspark oversample classes by every target variable

I wanted to know if there is any way to oversample the data using pyspark.
I have a dataset with a target variable of 10 classes. As of now, I am taking each class and oversampling it like below to match the majority class:
from pyspark.sql import functions as F

transformed_04 = transformed.where(F.col('nps_score') == 4)
transformed_03 = transformed.where(F.col('nps_score') == 3)
transformed_02 = transformed.where(F.col('nps_score') == 2)
transformed_01 = transformed.where(F.col('nps_score') == 1)
transformed_00 = transformed.where(F.col('nps_score') == 0)
# sample(withReplacement, fraction, seed); fractions above 1 oversample
transformed_04_more_rows = transformed_04.sample(True, 11.3, 9)
transformed_03_more_rows = transformed_03.sample(True, 16.3, 9)
transformed_02_more_rows = transformed_02.sample(True, 12.0, 9)
And finally I join all the dataframes with unionAll:
transformed_04_more_rows.unionAll(transformed_03_more_rows).unionAll(transformed_02_more_rows)
I am choosing the sampling values manually. For example, if the 4th class has 2000 rows and the 2nd class has 10 rows, I check the counts by hand and provide fractions like 16 and 12 accordingly, as in the code above.
Forgive me that the code above is not complete; I included it just to give a view of the approach. I wanted to know if there is an automated way, like SMOTE, in pyspark.
I have seen the link below:
Oversampling or SMOTE in Pyspark
It says my target must have only two classes; if I remove that condition, it throws some datatype issues.
Can anyone help me with this implementation in pyspark? Checking every class and providing sampling values manually is very painful. Please help.
Check out the sampleBy function of Spark, which enables stratified sampling: https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html?highlight=sampleby#pyspark.sql.DataFrame.sampleBy
In your case, you can provide the fraction you want for each class in a dictionary and use it in sampleBy. Try it out.
To decide the fractions, you can do an aggregation count based on your target column, normalize to (0, 1), and tune it.
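A minimal sketch of that idea, assuming the DataFrame and target column from the question (transformed, nps_score). Note that sampleBy samples without replacement, so each fraction must lie in [0, 1]; the sketch therefore balances classes toward the smallest one, while true oversampling (fraction > 1) still needs the per-class sample(True, fraction) calls shown in the question:

counts = {row['nps_score']: row['count']
          for row in transformed.groupBy('nps_score').count().collect()}
target = min(counts.values())                  # size of the smallest class
fractions = {cls: target / n for cls, n in counts.items()}

balanced = transformed.sampleBy('nps_score', fractions, seed=9)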

Find probability of failure using logistic regression in python

I am trying to get some insights from a dataset using logistic regression. The starting dataframe has information on whether something was removed unscheduled (1 = yes and 0 = no), together with some data that was provided with each removal.
This data is then 'dummified' using pandas.get_dummies, and the result is as expected.
Then I normalized the coefficients that were found (using coef_) to put everything on the same scale. I put these in a dataframe with a column 'Parameter' (the column names of the dummy dataframe) and a column 'Value' (the values obtained from the coefficients).
The resulting ranking shows that the On-wing Time is the biggest contributor to the unscheduled removals.
Now the question: how can I predict the chance that there will be an unscheduled removal for this reason (this column)? In other words, what is the chance that I will get another unscheduled removal caused by the On-wing Time?
Note that these parameters can change, since this data is fake and data may be added later on. So when the biggest contributor changes, the prediction should also focus on the new biggest contributor.
I hope you understand the question.
EDIT
The complete code and dataset (fake one) can be found here: 1drv.ms/u/s!AjkQWQ6EO_fMiSEfu3vYgSTBR0PZ
Ganesh
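A minimal sketch of getting the probability itself with scikit-learn's predict_proba (the dataframe and target column name 'unscheduled' are hypothetical):

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.get_dummies(df.drop(columns=['unscheduled']))   # 'unscheduled' is a hypothetical target name
y = df['unscheduled']

model = LogisticRegression(max_iter=1000).fit(X, y)
df['p_unscheduled'] = model.predict_proba(X)[:, 1]     # P(unscheduled removal) for each row

Rows with a high probability can then be inspected against the dominant feature, whichever it is at the time.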

Python Pandas Regression

I am struggling to figure out whether regression is the route I need to go in order to solve my current challenge with Python. Here is my scenario:
I have a Pandas Dataframe that is 195 rows x 25 columns
All data (except for index and headers) are integers
I have one specific column (Column B) that I would like compared to all other columns
Attempting to determine if there is a range of numbers in any of the columns that influences or impacts column B
An example of the result I would like to calculate in Python: Column B is above 3.5 when the data in Column D is between 10.20 and 16.4.
The examples I've been reading online about regression in Python appear to produce charts and statistics that I don't need (or maybe I am interpreting them incorrectly). I believe the proper wording for what I am asking is: identify specific values, or ranges of values, in the other columns that have a linear relationship with column B in a pandas dataframe.
Can anyone help point me in the right direction?
Thank you all in advance!
Your goals sound very much like exploratory data analysis at this point. You should probably first calculate the correlation between your target column B and each other column using pandas.Series.corr (which really is the same as bivariate regression), which you could list:
other_cols = [col for col in df.columns if col != 'B']
corr_B = [{other: df['B'].corr(df[other])} for other in other_cols]
To get a handle on specific ranges, I would recommend looking at:
the cut and qcut functionality to bin your data as you like and either plot or correlate subsets accordingly (see the docs here and here, and the short sketch at the end of this answer).
To visualize bivariate and simple multivariate relationships, I would recommend
the seaborn package because it includes various types of plots designed to help you get a quick grasp of covariation among variables. See for instance the examples for univariate and bivariate distributions here, linear relationship plots here, and categorical data plots here.
The above should help you understand bivariate relationships. Once you want to progress to multivariate relationships, you could turn to the scikit-learn or statsmodels packages, which are best suited for this in Python IMHO. Hope this helps to get you started.
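As a minimal sketch of the binning idea (column names are illustrative, assuming your data is in df):

import pandas as pd

bins = pd.cut(df['D'], bins=5)                 # or pd.qcut(df['D'], q=5) for quantile bins
summary = df.groupby(bins)['B'].agg(['mean', 'count'])
print(summary)                                 # reveals ranges of D where B tends to sit above 3.5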
