Calculate proportions of rows in dataframe - python

I have a problem that you hopefully can help with.
I have a dataframe with multiple columns that looks something like this:
education experience ExpenseA ExpenseB ExpenseC
uni yes 3 2 5
uni no 7 6 8
middle yes 2 0 8
high no 12 5 8
uni yes 3 7 5
The Expenses A, B and C should add up to 10 per row, but often they don't because the data was not gathered correctly. For the rows where this is not the case, I want to take proportions.
The formula for this should be (cell value) / ((sum of [ExpenseA] to [ExpenseC]) / 10)
example row two: total = 21 --> cells should be (value / 2.1)
How can I iterate this over all the rows for these specific columns?

I think you need to divide by the row-wise sum of the columns, excluding the first 2 columns, selected with DataFrame.iloc:
df.iloc[:, 2:] = df.iloc[:, 2:].div(df.iloc[:, 2:].sum(axis=1).div(10), axis=0)
print (df)
education experience ExpenseA ExpenseB ExpenseC
0 uni yes 3.000000 2.000000 5.000000
1 uni no 3.333333 2.857143 3.809524
2 middle yes 2.000000 0.000000 8.000000
3 high no 4.800000 2.000000 3.200000
4 uni yes 2.000000 4.666667 3.333333
Or sum only the columns whose names contain the Expense substring, selected with DataFrame.filter:
df1 = df.filter(like='Expense')
df[df1.columns] = df1.div(df1.sum(axis=1).div(10), axis=0)
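If you want to leave rows that already add up to 10 untouched (the division is a no-op for them anyway, since their total divided by 10 is 1), a minimal sketch with a boolean mask could look like this, assuming df is the frame from the question:
cols = df.filter(like='Expense').columns
row_sum = df[cols].sum(axis=1)
needs_fix = row_sum.ne(10)
# rescale only the flagged rows so each of them sums to 10 again
df.loc[needs_fix, cols] = df.loc[needs_fix, cols].div(row_sum[needs_fix].div(10), axis=0)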

Related

DataFrame Pandas: create a new column with the average of every possible group of 3 in a series

I'm trying to find the most elegant solution to calculate the average/mean of every possible consecutive group of three elements in a Series. I have a DataFrame like this:
Index Values
0 25
1 12
2 21
3 2
4 6
5 1
6 2
Starting from the last element, I would like to create a new column that holds the average of [2, 1, 6], [1, 6, 2], [6, 2, 21], ... up until index 0.
So the new column should look like this
AVG
NaN
NaN
19.333
11.666
9.666
3.0
3.0
My idea was to loop, take tail(3) of the dataframe and remove the last element on each iteration, but there is probably a more elegant way?
Try with rolling:
df['new'] = df.Values.rolling(3).mean()
0 NaN
1 NaN
2 19.333333
3 11.666667
4 9.666667
5 3.000000
6 3.000000
Name: Values, dtype: float64
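For comparison, an explicit loop gives the same numbers; this is only a sanity check, assuming the column is named Values as in the example:
avg = [float('nan')] * 2                             # the first two windows are incomplete
for i in range(2, len(df)):
    avg.append(df.Values.iloc[i - 2:i + 1].mean())   # mean of each 3-element window
df['new_loop'] = avg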

Calculate Mean on Multiple Groups

I have a Table
Sex Value1 Value2 City
M 2 1 Berlin
W 3 5 Paris
W 1 3 Paris
M 2 5 Berlin
M 4 2 Paris
I want to calculate the average of Value1 and Value2 for different groups. In my original dataset I have 10 grouping variables (with at most 5 categories each, e.g. 5 cities), which I have shortened to Sex and City (2 variables) in this example. The result should look like this:
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2.4 2.6 2 2 2.66
Value2 3.2 2.6 4 3 3.3
I am familiar with group by and tried
df.groupby('City').mean()
But here we have the problem that Sex also gets pulled into the calculation. Does anyone have an idea how to solve this? Thanks in advance!
You can group by the 2 columns separately to get 2 DataFrames and then use concat, together with the overall means of the numeric columns (non-numeric columns are excluded):
df1 = df.groupby('City').mean().T
df2 = df.groupby('Sex').mean().T
df3 = pd.concat([df.mean().rename('Overall'), df2, df1], axis=1).add_prefix('Avg')
print (df3)
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2.4 2.666667 2.0 2.0 2.666667
Value2 3.2 2.666667 4.0 3.0 3.333333
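One caveat if you are on a recent pandas (2.0 or later): mean no longer silently drops non-numeric columns, so the calls above may raise a TypeError because of the Sex and City columns. Passing numeric_only=True restores the behaviour this answer relies on; the same solution with that flag would look like:
df1 = df.groupby('City').mean(numeric_only=True).T
df2 = df.groupby('Sex').mean(numeric_only=True).T
df3 = pd.concat([df.mean(numeric_only=True).rename('Overall'), df2, df1], axis=1).add_prefix('Avg')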

Pearson correlation between adjacent columns in a DataFrame

Let's say I have a dataframe of 10 columns.
Now I want to quickly calculate the relation between each column and the column that follows it,
so the Pearson r of columns 1 and 2, of columns 2 and 3, of columns 3 and 4, and so on.
Is there a quick way for me to do that?
Thank you!
You can use pandas.DataFrame.corr for Pearson correlation and numpy.diag to extract the values of interest. Let me show you a toy example with 5 columns (for simplicity):
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(3,5)))
pcorr = df.corr()
np.diag(pcorr, 1)
and you get:
df:
0 1 2 3 4
0 7 9 0 0 9
1 9 2 9 9 0
2 2 8 5 9 2
pcorr:
0 1 2 3 4
0 1.000000 -0.622693 0.215274 -0.240192 0.029344
1 -0.622693 1.000000 -0.898170 -0.609994 0.763857
2 0.215274 -0.898170 1.000000 0.896258 -0.969816
3 -0.240192 -0.609994 0.896258 1.000000 -0.977356
4 0.029344 0.763857 -0.969816 -0.977356 1.000000
your values of interest:
array([-0.62269252, -0.89817029, 0.89625816, -0.97735555])
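If the frame is wide and you would rather not compute the full correlation matrix, the same adjacent-column values can be obtained pairwise with Series.corr (Pearson by default); a minimal sketch under the same toy setup:
adjacent_r = [df.iloc[:, i].corr(df.iloc[:, i + 1]) for i in range(df.shape[1] - 1)]
# adjacent_r[0] is r(col 0, col 1), adjacent_r[1] is r(col 1, col 2), and so on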

Create new columns that calculate percentages, repeated N times

I want to know how to calculate the percentage for each of these columns and save it in a new column next to it, for any number (N) of columns.
Example
d1 = [['0.00', '10','11','15'], ['2.99', '30','40','0'], ['4.99', '5','0','2']]
df1 = pd.DataFrame(d1, columns = ['Price', '1','2','3'])
I want the following operation to iterate through all the columns (besides Price, of course):
df1['1%'] = df1['1'] / df1['1'].sum() (I got an error when I tried this)
Result:
d2 = [['0.00', '10','0.22','11','0.2156','15','0.8823'], ['2.99', '30','0.66','40','0.7843','0','0'], ['4.99', '5','0.11','0','0','2','0.1176']]
df2 = pd.DataFrame(d2, columns = ['Price', '1','1%','2','2%','3','3%'])
(The number of columns can vary, so I need to iterate through all of them.)
IIUC, you need:
m = df1.set_index('Price').div(df1.set_index('Price').sum()).add_suffix('%')
df2 = pd.concat([df1.set_index('Price'), m], axis=1).sort_index(axis=1).reset_index()
Price 1 1% 2 2% 3 3%
0 0.00 10 0.222222 11 0.215686 15 0.882353
1 2.99 30 0.666667 40 0.784314 0 0.000000
2 4.99 5 0.111111 0 0.000000 2 0.117647
Note: this is assuming the dtypes are:
df1.dtypes
Price float64
1 int32
2 int32
3 int32
In order to get the output with the string data as defined, you need to convert the strings to numeric using pd.to_numeric:
pd.concat([df1, df1.drop(columns='Price').apply(lambda x: pd.to_numeric(x).div(pd.to_numeric(x).sum()))
           .rename(columns=lambda x: x + '%')], axis=1)
Output:
Price 1 2 3 1% 2% 3%
0 0.00 10 11 15 0.222222 0.215686 0.882353
1 2.99 30 40 0 0.666667 0.784314 0.000000
2 4.99 5 0 2 0.111111 0.000000 0.117647
a = df1.columns[1:]
df1[a + '%'] = df1[a].astype(float) / df1[a].astype(float).sum()
Output:
Price 1 2 3 1% 2% 3%
0.00 10 11 15 0.222222 0.215686 0.882353
2.99 30 40 0 0.666667 0.784314 0.000000
4.99 5 0 2 0.111111 0.000000 0.117647
Let's break down your question into 2 parts:
1) Why you get an error when you try to calculate the percentage for each column:
Basically, your columns are string types. You can either transform your column into a float type or change the type when defining your dataframe:
Changing the type of a column: df1['1'] = df1['1'].astype(float)
Changing the type when defining your dataframe:
d1 = [[0.00, 10, 11, 15], [ 2.99, 30, 40, 0], [ 4.99, 5, 0, 2]]
2) Iterate the formula through all the columns:
The following code iterates your formula over the columns and creates the corresponding percentage column in the original dataframe:
for column in df1.drop(['Price'], axis=1).columns:
    df1[column + '%'] = df1[column] / df1[column].sum()
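Putting the two parts together, here is a minimal sketch that first converts the value columns to numbers and then inserts each percentage column directly after its source column, giving the Price, 1, 1%, 2, 2%, 3, 3% layout of the expected result; the use of DataFrame.insert is my own choice, not taken from the answers above:
import pandas as pd
d1 = [['0.00', '10', '11', '15'], ['2.99', '30', '40', '0'], ['4.99', '5', '0', '2']]
df1 = pd.DataFrame(d1, columns=['Price', '1', '2', '3'])
value_cols = df1.columns[1:]
df1[value_cols] = df1[value_cols].astype(float)   # convert the strings to numbers first
for col in value_cols:
    # place each new percentage column right after the column it was computed from
    df1.insert(df1.columns.get_loc(col) + 1, col + '%', df1[col] / df1[col].sum())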

How to add a random value to many rows in a Pandas Dataframe iteratively?

Suppose I have a Pandas Dataframe named df, which has the following structure:-
Column 1 Column 2 ......... Column 104
Row 1 0.01 0.55 3
Row 2 0.03 0.14 1
...
Row 100 0.75 0.56 0
What I am trying to accomplish is that for all rows which match the condition given below, I need to generate 100 more rows with a random value between 0 and 0.05 added to each row:-
is_less = df.iloc[:,-1] > 1
df_try = df[is_less]
df = df.append([df_try]*100,ignore_index=True)
The problem is that I can simply duplicate the rows in df_try to generate 100 more rows for each case, but I want to add a random value to each row as well, such that each row is different from the others but very similar.
import random
df = df.append([df_try + random.uniform(0,0.05)]*100, ignore_index=True)
What this does is to simply add the fixed random value to df_try's 100 new rows, but not a unique random value to each row. I know that this is because the above syntax does not iterate over df_try, resulting in the fixed random value being added, but is there a suitable way to add the random values iteratively over the data frame in this case?
One idea is to create a 2d array with the same shape as the newly appended rows and add it to the repeated copies joined with concat:
import numpy as np
N = 10
is_less = df.iloc[:, -1] > 1
df_try = df[is_less]
# one independent random offset per cell of the N repeated copies
arr = np.random.uniform(0, 0.05, size=(N * len(df_try), len(df.columns)))
df = df.append(pd.concat([df_try] * N) + arr, ignore_index=True)
print (df)
Column 1 Column 2 Column 104
0 0.010000 0.550000 3.000000
1 0.030000 0.140000 1.000000
2 0.750000 0.560000 0.000000
3 0.024738 0.561647 3.045146
4 0.035315 0.584161 3.008656
5 0.022386 0.563025 3.033091
6 0.039175 0.588785 3.004649
7 0.049465 0.594903 3.003303
8 0.027366 0.580478 3.041745
9 0.044721 0.599853 3.001736
10 0.052849 0.589775 3.042434
11 0.033957 0.582610 3.045215
12 0.044349 0.582218 3.027665
Your solution should be changed to a list comprehension if you need to add a different scalar to each copy of df_try:
N = 10
is_less = df.iloc[:,-1] > 1
df_try = df[is_less]
df = df.append([df_try + random.uniform(0, 0.05) for _ in range(N)], ignore_index=True)
print (df)
Column 1 Column 2 Column 104
0 0.010000 0.550000 3.000000
1 0.030000 0.140000 1.000000
2 0.750000 0.560000 0.000000
3 0.036756 0.576756 3.026756
4 0.039357 0.579357 3.029357
5 0.048746 0.588746 3.038746
6 0.040197 0.580197 3.030197
7 0.011045 0.551045 3.001045
8 0.013942 0.553942 3.003942
9 0.054658 0.594658 3.044658
10 0.025909 0.565909 3.015909
11 0.012093 0.552093 3.002093
12 0.058463 0.598463 3.048463
You can combine the copies first and create a single array containing all the random values, add them together, and then append the result to the original:
import numpy as np
import pandas as pd
n_copies = 2
df = pd.DataFrame(np.c_[np.arange(6), np.random.randint(1, 3, size=6)])
subset = df[df.iloc[:, -1] > 1]
extra = pd.concat([subset] * n_copies).add(np.random.uniform(0, 0.05, len(subset) * n_copies), axis='rows')
result = df.append(extra, ignore_index=True)
print(result)
Output:
0 1
0 0.000000 2.000000
1 1.000000 2.000000
2 2.000000 1.000000
3 3.000000 2.000000
4 4.000000 1.000000
5 5.000000 2.000000
6 0.007723 2.007723
7 1.005718 2.005718
8 3.003063 2.003063
9 5.005238 2.005238
10 0.006509 2.006509
11 1.034742 2.034742
12 3.022345 2.022345
13 5.040911 2.040911
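A side note on all three answers: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install every df.append(...) call needs to become a pd.concat([...]) call, for example for the last answer:
result = pd.concat([df, extra], ignore_index=True)   # instead of df.append(extra, ignore_index=True)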
