Pandas create multiple dataframes by applying different weightings - python

I've got a dataframe with 3 columns and I want to add them together and test different weights.
I've written this code so far but I feel this might not be the best way:
weights = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
for i in weights:
    for j in weights:
        for k in weights:
            outname = 'outname' + str(i) + 'TV' + str(j) + 'BB' + str(k) + 'TP'
            df_media[outname] = (df_media['TV'].multiply(i)
                                 + df_media['BB'].multiply(j)
                                 + df_media['TP'].multiply(k))
Below are the input dataframe and the first output of the loops, where all of the columns have been multiplied by 0.5.
df_media:
   TV  BB  TP
0   1   2   6
1  11   4   5
2   4   4   3
Output DataFrame:
   outname0.5TV0.5BB0.5TP
0                     4.5
1                    10.0
2                     5.5

Dictionary
If you need a dataframe for each loop, you can use a dictionary. With this solution you also don't need to store your factor in your column name, since the weight can be your key. Here's one way via a dictionary comprehension:
weights = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
col_name = '-'.join(df_media.columns)
dfs = {w: (df_media.sum(axis=1) * w).to_frame(col_name) for w in weights}
print(dfs[0.5])
TV-BB-TP
0 4.5
1 10.0
2 5.5
Single dataframe
Much more efficient is to store your result in a single dataframe. This removes the need for a Python-level loop.
import numpy as np

res = pd.DataFrame(df_media.sum(axis=1).values[:, None] * np.array(weights),
                   columns=weights)
print(res)
0.5 0.6 0.7 0.8 0.9 1.0
0 4.5 5.4 6.3 7.2 8.1 9.0
1 10.0 12.0 14.0 16.0 18.0 20.0
2 5.5 6.6 7.7 8.8 9.9 11.0
Then, for example, access the first weight as a series via res[0.5].
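
If each column really needs its own weight, as in the original triple loop, the same vectorized idea extends: build every (i, j, k) combination with itertools.product and apply them all with a single matrix product. A minimal sketch, assuming df_media and weights as above (the column-name format is only illustrative):

from itertools import product
import numpy as np
import pandas as pd

# all (i, j, k) triples: 6**3 = 216 combinations
combos = list(product(weights, repeat=3))
res = pd.DataFrame(df_media[['TV', 'BB', 'TP']].values @ np.array(combos).T,
                   columns=[f'{i}TV{j}BB{k}TP' for i, j, k in combos])

Each column of res is then TV*i + BB*j + TP*k for one combination.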

Error: 3 columns passed, passed data had 1 columns (Creating DataFrame from lists)

I wrote a function that outputs 3 lists and want to make those lists each a column in a dataframe.
The function returns a tuple of 3 lists, containing text or lists of text.
Here is the function:
def function(pages=0):
    # schematic: 'title' and its attributes come from the asker's scraping code
    a = [title for title in range(pages)]
    b = [[summary] for summary in title.summary]
    c = [[summary2] for summary2 in title.summary2]
    return a, b, c

data = function(pages=2)
pd.DataFrame(data, columns=['A', 'B', 'C'])
and the error says that I passed data with 2 columns while I specified 3 column names. Can someone explain what is going on and how to fix it? Thank you!
One way to address this is to transpose the output and then create the dataframe.
A minimal example:
import pandas as pd
import numpy as np

def function(pages=0):
    # Replace this with your logic
    a = list(range(10))
    b = [i * 0.9 for i in a]
    c = [i * 0.5 for i in a]
    return [a, b, c]

data = np.array(function()).T.tolist()
df = pd.DataFrame(data=data, columns=['A', 'B', 'C'])
Output:
print(df)
A B C
0 0.0 0.0 0.0
1 1.0 0.9 0.5
2 2.0 1.8 1.0
3 3.0 2.7 1.5
4 4.0 3.6 2.0
5 5.0 4.5 2.5
6 6.0 5.4 3.0
7 7.0 6.3 3.5
8 8.0 7.2 4.0
9 9.0 8.1 4.5
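
If the three lists are all flat and of equal length, a dict also avoids the numpy round-trip entirely; a small sketch under that assumption:

a, b, c = function()
# each list becomes one column directly
df = pd.DataFrame({'A': a, 'B': b, 'C': c})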

Pandas - conditional row average

I have a dataframe:
import numpy as np
import pandas as pd

x = pd.DataFrame({'1': [1, 2, 3, 2, 5, 6, 7, 8, 9],
                  '2': [2, 5, 6, 8, 10, np.nan, 6, np.nan, np.nan],
                  '3': [10, 10, 10, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
I am trying to generate an average of a row, but only over the values greater than 5. For instance, if a row had values of 3, 6, 10, the average would be 8 ((6 + 10) / 2); the 3 is ignored as it is below 5.
The Excel equivalent would be =AVERAGEIF(B2:DX2, ">=5")
You can mask out the values that are not greater than 5 and then take the row-wise mean:
x.where(x > 5).mean(axis=1)
Or:
x.mask(x <= 5).mean(axis=1)
You can create a small custom function which, inside each row, filters out values smaller than or equal to a certain value, and apply it to each row of your dataframe:
def average_if(s, value=5):
    s = s.loc[s > value]
    return s.mean()

x.apply(average_if, axis=1)
0 10.0
1 10.0
2 8.0
3 8.0
4 10.0
5 6.0
6 6.5
7 8.0
8 9.0
dtype: float64
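
On large frames the vectorized where version should be considerably faster than the row-wise apply, since the latter calls a Python function per row. A quick sanity check (an illustrative sketch) that the two approaches agree, using a NaN-aware comparison:

import numpy as np

vectorized = x.where(x > 5).mean(axis=1)
row_wise = x.apply(average_if, axis=1)
assert np.allclose(vectorized, row_wise, equal_nan=True)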

Get weighted average summary data column in new pandas dataframe from existing dataframe based on other column-ID

Somewhat similar question to an earlier question I had here: Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID
However, instead of just taking the sum of datapoints, I wanted to have the weighted average in an extra column. I'll repeat and rephrase the question:
I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surfaces and U-values for each room in the apartment. What I want is a dataframe that summarizes this, giving me the total surface and the surface-weighted average U-value per apartment. There are three conditions for the original dataframe:
1. The dataframe can contain empty cells.
2. When the values of Surface or U-value are equal for all of the rows within an ID (all the same values for the same ID), the data (surface, volumes) is not summed; instead a single value/row is passed to the new summary column (example: ID 4), as this could be a mistake in the original dataframe where the total surface/volume was inserted for all the rooms by the government employee.
3. The average U-value should be the surface-weighted average U-value.
Initial dataframe 'data':
print(data)
ID Surface U-value
0 2 10.0 1.0
1 2 12.0 1.0
2 2 24.0 0.5
3 2 8.0 1.0
4 4 84.0 0.8
5 4 84.0 0.8
6 4 84.0 0.8
7 52 NaN 0.2
8 52 96.0 1.0
9 95 8.0 2.0
10 95 6.0 2.0
11 95 12.0 2.0
12 95 30.0 1.0
13 95 12.0 1.5
Desired output from 'df':
print(df)
   ID  Surface  U-value
0   2     54.0    0.777
1   4     84.0    0.8
2  52     96.0    1.0
3  95     68.0    1.47
Notes on the output: U-value is the surface-weighted U-value, and Surface is the sum of all surfaces, except when all surfaces within an ID are the same (ID 4), in which case a single row is passed through rather than summed (see the second condition). For ID 52 one of the two surfaces is empty, so the U-value belonging to the empty cell is ignored; the output is the weighted average over the rows that have both a Surface and a U-value (here 1.0).
The code by jezrael in the referenced question already works brilliantly for the sum(), but how do I add a weighted-average 'U-value' column to it? I really have no idea. A plain average could be computed with mean() instead of sum(), but the weighted average..?
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2, 4, 52, 95]})
data = pd.DataFrame({"ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
                     "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
                     "U-value": [1.0, 1.0, 0.5, 1.0, 0.8, 0.8, 0.8, 0.2, 1.0, 2.0, 2.0, 2.0, 1.0, 1.5]})
print(data)

cols = ['Surface']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
This should do the trick:
data.groupby('ID').apply(lambda g: (g['U-value'] * g['Surface']).sum() / g['Surface'].sum())
To add it to the summary dataframe, don't reset the index first:
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum()
df['U-value'] = data.groupby('ID').apply(
    lambda g: (g['U-value'] * g['Surface']).sum() / g['Surface'].sum())
df.reset_index(inplace=True)
The result:
ID Surface U-value
0 2 54.0 0.777778
1 4 84.0 0.800000
2 52 96.0 1.000000
3 95 68.0 1.470588
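
As an aside, the lambda inside groupby().apply runs as a Python function per group, which can be slow on large frames. The same surface-weighted average can be computed with two vectorized groupby sums; a sketch assuming, as above, that a NaN Surface should drop that row from both numerator and denominator (which pandas' default skipna summing does):

# numerator: sum of U-value * Surface per ID (NaN products are skipped)
num = (data['U-value'] * data['Surface']).groupby(data['ID']).sum()
# denominator: sum of Surface per ID (NaN surfaces are skipped)
wavg = num / data.groupby('ID')['Surface'].sum()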

pandas - partially updating DataFrame with derived calculations of a subset groupby

I have a DataFrame with some NaN records that I want to fill based on a combination of data from the NaN record itself (its index, in this example) and from the non-NaN records. The original DataFrame should be modified.
Details of input/output/code below:
I have an initial DataFrame that contains some pre-calculated data:
Initial Input
import numpy as np
import pandas as pd

raw_data = {'raw': [x for x in range(5)] + [np.nan for x in range(2)]}
source = pd.DataFrame(raw_data)
raw
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 NaN
6 NaN
I want to identify the NaN records and perform calculations to "update" them, where the calculations are based on the non-NaN data and some data from the NaN records themselves.
In this contrived example the calculation is:
1. Calculate the mean of the 'valid' (non-NaN) records.
2. Add this mean to the index number of each 'invalid' record.
3. Update the initial DataFrame with the result.
Desired Output
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
The current solution I have (below) makes a calculation on a copy then updates the original DataFrame.
# Setup grouping by NaN in 'raw'
source['valid'] = ~np.isnan(source['raw'])*1
subsets = source.groupby('valid')
# Mean of 'valid' is used later to fill 'invalid' records
valid_mean = subsets.get_group(1)['raw'].mean()
# Operate on a copy of group(0), then update the original DataFrame
invalid = subsets.get_group(0).copy()
invalid['raw'] = subsets.get_group(0).index + valid_mean
source.update(invalid)
Is there a less clunky or more efficient way to do this? The real application is on significantly larger DataFrames (and with a significantly longer process of processing NaN rows).
Thanks in advance.
You can use combine_first:
# mean by default omits NaNs
m = source['raw'].mean()
# same as
# m = source['raw'].dropna().mean()
print(m)
2.0

# create the valid column if necessary
source['valid'] = source['raw'].notnull().astype(int)
# update the NaNs from index + mean
source['raw'] = source['raw'].combine_first(source.index.to_series() + m)
print(source)
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
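
A close alternative, since only the NaN positions need filling and fillna aligns its argument on the index; a one-line sketch with the same result as combine_first here:

source['raw'] = source['raw'].fillna(source.index.to_series() + m)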

Conditional column arithmetic in pandas dataframe

I have a pandas dataframe with the following structure:
import numpy as np
import pandas as pd
myData = pd.DataFrame({'x': [1.2, 2.4, 5.3, 2.3, 4.1],
                       'y': [6.7, 7.5, 8.1, 5.3, 8.3],
                       'condition': [1, 1, np.nan, np.nan, 1],
                       'calculation': [np.nan] * 5})
print(myData)
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 NaN NaN 5.3 8.1
3 NaN NaN 2.3 5.3
4 NaN 1 4.1 8.3
I want to enter a value in the 'calculation' column based on the values in 'x' and 'y' (e.g. x/y), but only in those cells where the 'condition' column contains NaN (np.isnan(myData['condition'])). The final dataframe should look like this:
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 0.654 NaN 5.3 8.1
3 0.434 NaN 2.3 5.3
4 NaN 1 4.1 8.3
I'm happy with the idea of stepping through each row in turn with a for loop and using if statements to make the calculations, but the actual dataframe I have is very large, so I want to do the calculations in an array-based way. Is this possible? I guess I could calculate the value for all rows and then delete the ones I don't want, but this seems like a lot of wasted effort (the NaNs are quite rare in the dataframe), and in some cases where 'condition' equals 1 the calculation cannot be made due to division by zero.
Thanks in advance.
Use where and pass your condition to it; this keeps the result of your calculation only where the rows meet the condition:
myData['calculation'] = (myData['x']/myData['y']).where(myData['condition'].isnull())
print(myData)
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 0.654321 NaN 5.3 8.1
3 0.433962 NaN 2.3 5.3
4 NaN 1 4.1 8.3
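
Regarding the division-by-zero worry: the where approach above still evaluates x/y on every row and then masks the unwanted results. If some rows genuinely must not be evaluated, a boolean mask with .loc computes the ratio only for the rows that need it; a sketch under that assumption:

# compute x/y only where 'condition' is NaN; other rows are never touched
mask = myData['condition'].isnull()
myData.loc[mask, 'calculation'] = myData.loc[mask, 'x'] / myData.loc[mask, 'y']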
EdChum's answer worked well for me! Still, I wanted to extend this thread as I think it will be useful for other people.
Let's assume your dataframe is
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0 5.3 8.1
3 0 2.3 5.3
4 1 4.1 8.3
and you would like to update 0s in column c with associated x/y.
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0.65 5.3 8.1
3 0.43 2.3 5.3
4 1 4.1 8.3
You can do
myData['c'] = (myData['x']/myData['y']).where(cond=myData['c']==0, other=myData['c'])
or
myData['c'].where(cond=myData['c'] != 0, other=myData['x']/myData['y'], inplace=True)
In both cases, where 'cond' is not satisfied, the value is taken from 'other'. In the second snippet the inplace flag also works nicely (as it would in the first snippet too).
I found these solutions in the pandas official documentation for "where" and for "indexing".
These kinds of operations are exactly what I need most of the time. I am new to pandas and it took me a while to find this useful thread. Could anyone recommend some comprehensive tutorials to practice these types of operations? I need to filter/groupby/slice a dataframe, then apply different functions/operations to each group/slice, separately or all at once, and keep it all in place. Cheers!
