Pandas groupby expanding optimization of syntax - python

I am using the data from the example shown here:
http://pandas.pydata.org/pandas-docs/stable/groupby.html, under the subheading "New syntax to window and resample operations".
At the command prompt, the new syntax works as shown in the pandas documentation. But I want to add a new column with the expanded data to the existing dataframe, as would be done in a saved program.
Before a syntax upgrade to the groupby expanding code, I was able to use the following single line code:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1] * 10 + [5] * 10, 'B': np.arange(20)})
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
This gives the expected results, but also gives an 'expanding_sum is deprecated' message. Expected results are:
    A   B  Sum of B
0   1   0         0
1   1   1         1
2   1   2         3
3   1   3         6
4   1   4        10
5   1   5        15
6   1   6        21
7   1   7        28
8   1   8        36
9   1   9        45
10  5  10        10
11  5  11        21
12  5  12        33
13  5  13        46
14  5  14        60
15  5  15        75
16  5  16        91
17  5  17       108
18  5  18       126
19  5  19       145
I want to use the new syntax to replace the deprecated syntax. If I try the new syntax, I get the error message:
df['Sum of B'] = df.groupby('A').expanding().B.sum()
TypeError: incompatible index of inserted column with frame index
I did some searching on here, and saw something that might have helped, but it gave me a different message:
df['Sum of B'] = df.groupby('A').expanding().B.sum().reset_index(level = 0)
ValueError: Wrong number of items passed 2, placement implies 1
The only way I can get it to work is to assign the result to a temporary df, then merge the temporary df into the original df:
temp_df = df.groupby('A').expanding().B.sum().reset_index(level=0).rename(columns={'B': 'Sum of B'})
new_df = pd.merge(df, temp_df, on='A', left_index=True, right_index=True)
print(new_df)
This code gives the expected results as shown above.
I've tried different variations using transform as well, but have not been able to code this in one line as I did before the deprecation. Is there a single-line syntax that will work? Thanks.

It seems you need a cumsum:
df.groupby('A')['B'].cumsum()
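A minimal sketch (reusing the df built in the question): because a groupby cumsum preserves the original index, the result can be assigned directly as a new column, and a cumulative sum within each group is exactly an expanding sum per group:

df['Sum of B'] = df.groupby('A')['B'].cumsum()
print(df)  # matches the expected results above

By contrast, df.groupby('A').expanding().B.sum() comes back with a two-level ('A', original index) MultiIndex, which is what triggers the "incompatible index of inserted column" error.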

TL;DR
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: x.expanding().sum())
Explanation
We start from the offending line:
df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
Let's read the warning you mentioned carefully:
FutureWarning: pd.expanding_sum is deprecated for Series and will be
removed in a future version, replace with
Series.expanding(min_periods=1).sum()
After reading the Pandas 0.17.0 documentation for pandas.expanding_sum, it becomes clear that the Series the warning is talking about is the first parameter of pd.expanding_sum, i.e. in our case x.
Now we apply the code transformation suggested in the warning, so pd.expanding_sum(x) becomes x.expanding(min_periods=1).sum().
According to the Pandas 0.22.0 documentation for pandas.Series.expanding, min_periods has a default value of 1, so in your case it can be omitted altogether, hence the final result.
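As a quick sanity check (a sketch on the same toy frame), the replacement reproduces the deprecated result and agrees with the cumsum approach from the other answer:

new = df.groupby('A')['B'].transform(lambda x: x.expanding().sum())
# expanding().sum() yields floats, so cast the integer cumsum before comparing
assert new.equals(df.groupby('A')['B'].cumsum().astype(float))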

Related

Comparing values in two dataframes and generating a report if the difference is greater than a set point

I have 2 data frames (master and slave) that look like below.
# Master
   C   D    E    F   G
0  5  44  4.0   33  22
1  1   0  4.5  565  11
# Slave
   C   D    E      F   G
0  5  44  4.0   33.0  22
1  1   4  6.5  562.5  10
Expected results (highlight those cells where the difference is > 1):
   C   D    E      F   G
0  5  44  4.0   33.0  22
1  1   4  6.5  562.5  10
Where 4, 6.5, and 562.5 are highlighted.
I would like to compare the two data frames and highlight, in a newly created data frame, the cells where the difference exceeds the SET VALUE (> 1). The SET VALUE = 1 is constant for the entire data frame.
Please note the difference should be based on absolute value, i.e. ABS(master - slave).
I would like to use the numpy np.isclose function to achieve my goal.
This should also work for a bigger data frame with 200 rows and 300 columns; the data frame displayed here is small for better understanding.
Cell D2: highlight required since (D2_MASTER) - (D2_Slave) = 0 - 4 = -4
Cell E2: highlight required since (E2_MASTER) - (E2_Slave) = 4.5 - 6.5 = -2
Cell F2: highlight required since (F2_MASTER) - (F2_Slave) = 565 - 562.5 = 2.5
Cell G2: NO highlight since (G2_MASTER) - (G2_Slave) = 11 - 10 = 1 (the difference is within the limit)
I just started coding in Python and using pandas on my own, and I admit I am a bit lost.
Thanks for reading all this, and thanks in advance for any suggestions and feedback!
Code
for ind, row in dfmaster.iterrows():
    print(row)
(dfmaster.iloc()) = np.isclose((dfmaster.iloc()), (dfmaster.iloc()), atol=1)  # .any()
Let's try Styler.apply:
def highlight_error(df):
    # color cells red wherever |master - slave| exceeds the set value of 1
    return pd.DataFrame(np.where(df.sub(slave).abs() > 1, 'background-color:red', ''),
                        df.index, df.columns)

master.style.apply(highlight_error, axis=None)
In a Jupyter notebook this renders the table with the out-of-tolerance cells highlighted in red.
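Since the question specifically mentions np.isclose, the mask could equally be built with it; this is a sketch assuming master and slave as defined above:

import numpy as np
import pandas as pd

def highlight_error_isclose(df):
    # np.isclose(..., atol=1, rtol=0) is True where |master - slave| <= 1,
    # so the inverted mask flags cells whose absolute difference exceeds 1
    mask = ~np.isclose(df.to_numpy(dtype=float),
                       slave.to_numpy(dtype=float), atol=1, rtol=0)
    return pd.DataFrame(np.where(mask, 'background-color:red', ''),
                        df.index, df.columns)

master.style.apply(highlight_error_isclose, axis=None)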

How to merge an itertools generated dataframe and a normal dataframe in pandas?

I have generated a dataframe containing all the possible two-lead combinations of electrocardiogram (ECG) leads with itertools, using the code below:
import pandas as pd
from itertools import product

source = ['I-s', 'II-s', 'III-s', 'aVR-s', 'aVL-s', 'aVF-s', 'V1-s', 'V2-s', 'V3-s', 'V4-s', 'V5-s', 'V6-s', 'V1Long-s', 'IILong-s', 'V5Long-s', 'Information-s']
target = ['I-t', 'II-t', 'III-t', 'aVR-t', 'aVL-t', 'aVF-t', 'V1-t', 'V2-t', 'V3-t', 'V4-t', 'V5-t', 'V6-t', 'V1Long-t', 'IILong-t', 'V5Long-t', 'Information-t']
test = pd.DataFrame(list(product(source, target)), columns=['source', 'target'])
The test dataframe contains 256 rows, one for each possible two-lead combination.
The value for each combination is initially zero, set as follows:
test['value'] = 0
The test df looks like the source/target pairs above, each with value 0.
I have another dataframe called diagramDF that contains the combinations where the value column is non-zero. diagramDF is significantly smaller than the test dataframe:
      source    target  value
0        I-s      II-t    137
1       II-s       I-t      3
2       II-s     III-t     81
3       II-s  IILong-t     13
4       II-s      V1-t     21
5      III-s      II-t      3
6      III-s     aVF-t     19
7   IILong-s      II-t     13
8   IILong-s  V1Long-t    353
9       V1-s     aVL-t     11
10  V1Long-s  IILong-t    175
11  V1Long-s      V3-t      4
12  V1Long-s     aVF-t      4
13      V2-s      V3-t      8
14      V3-s      V2-t      6
15      V3-s      V6-t      2
16      V5-s     aVR-t      5
17      V6-s     III-t      4
18     aVF-s     III-t     79
19     aVF-s  V1Long-t    235
20     aVL-s       I-t      1
21     aVL-s     aVF-t     16
22     aVR-s     aVL-t      1
Note that the first two columns, source and target, use the same notation in both dataframes.
I have tried to replace the zero values of the test dataframe with the non-zero values of diagramDF using merge, like below:
df = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
However, I get an error informing me that:
ValueError: The column label 'source' is not unique. For a
multi-index, the label must be a tuple with elements corresponding to
each level
Is there something that I am getting wrong? Is there a more efficient and fast way to do this?
This might help:
pd.merge(test, diagramDF, how='left', on=['source', 'target'], right_index=True, left_index=True)
Check this:
test = test.reset_index()
diagramDF = diagramDF.reset_index()
new = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
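If the end goal is a single value column where the missing combinations stay at zero, a possible sketch (assuming test and diagramDF as built above, and leaving the all-zero value column out of the merge):

# left-join the sparse non-zero values onto the full 256-row grid,
# then turn the unmatched combinations back into zeros
merged = pd.merge(test[['source', 'target']], diagramDF,
                  how='left', on=['source', 'target'])
merged['value'] = merged['value'].fillna(0).astype(int)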

Pandas: row operations on a column, given one reference value on a different column

I am working with a database that looks like the one below. For each fruit (just apples and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales, and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
import pandas as pd

dataset = {'apple_yearly_avg': [57],
           'apple_sales': [100],
           'apple_monthly_avg': [80],
           'apple_st_dev': [12],
           'pears_monthly_avg': [33],
           'pears_yearly_avg': [35],
           'pears_sales': [40],
           'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear the index
df.columns = ['Description', 'Value']  # name the two columns
I would like to perform two sets of operations.
For the first set of operations, we isolate one fruit, say 'pears', and subtract its sales value from each average.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works by creating a temporary column holding pears_sales of 40, backfilling it, and then using it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I get the common warning saying I should use '.loc[row_indexer, col_indexer]', even though the output still works.
For the second set of operations, I need to add 5 rows ('new_purchases') to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again I'm worried about readability, since another temporary column is created and the indexing is rather ugly.
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, as the result of splitting Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract the (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # old and new rows
Apply this function to each group, grouped by Fruit, drop the Fruit column, and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
    .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
          Description  Value  some_op
0    apple_yearly_avg     57        0
1         apple_sales    100       43
2   apple_monthly_avg     80       23
3        apple_st_dev     12      -45
4        apple_st_dev     12     6100
5        apple_st_dev     12     4900
6        apple_st_dev     12     3700
7        apple_st_dev     12     2500
8        apple_st_dev     12     1300
9   pears_monthly_avg     33       -2
10        pears_sales     40        5
11   pears_yearly_avg     35        0
12       pears_st_dev      8      -27
13       pears_st_dev      8     1640
14       pears_st_dev      8     1320
15       pears_st_dev      8     1000
16       pears_st_dev      8      680
17       pears_st_dev      8      360
Edit
I'm in doubt whether Description should also be replicated to the new rows from the "st_dev" row. If you want some other content there, set it in the reformat function, after wrk2 is created.

How can I use sum() and count() (both) for a groupby in pandas

df = pandas.DataFrame(processed_data_format, columns=["file_name", "innings", "over", "ball", "individual ball", "runs", "batsman", "wicket_status", "bowler_name", "fielder_name"])
df.groupby(['batsman'])['runs', 'ball'].sum()
By using this I get a result like:
a 30 29
b 4 1
c 10 15
I would like to add the count of the file_name column to the result above. The final result should look like:
a 30 29 2
b 4 1 1
c 10 15 2
df = pandas.DataFrame(processed_data_format, columns=["file_name", "innings", "over", "ball", "individual ball", "runs", "batsman", "wicket_status", "bowler_name", "fielder_name"])
a = {'runs': ['sum'], 'ball': ['sum'], 'file_name': ['nunique']}
t = df.groupby('batsman').agg(a)
There is no need to use count() here; instead of count, use nunique to get the number of unique values.
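On newer pandas (0.25+), named aggregation gives the same result with flat column names; a sketch assuming the same df (the output column names are arbitrary):

t = df.groupby('batsman').agg(
    runs=('runs', 'sum'),                 # total runs per batsman
    balls=('ball', 'sum'),                # total balls per batsman
    file_count=('file_name', 'nunique'),  # number of distinct files
)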

Filling in missing data in Python

I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2, which are several millimeters apart. Then it will move to the next line and start over.
I am able to process the data in Python with pandas/numpy, but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however, a line or a few will be missing, making it impossible to process and plot. Currently I have not been able to fix this automatically and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate properties 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line)? For instance, by having an 'if' condition that outputs each line between X(start) and X(end) to a separate array? I've tried doing that but without success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X'] == 'New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r+1: rows[i+1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].ix[:, 2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions, but I haven't managed to make them work.
I would appreciate any suggestions.
Does this meet what you want?
Q1: fill X using reindex, and fill the others using fillna
Q2: passing separated StringIO objects to read_csv is easier (change this if you use Python 3)
# read file and split the input
f = open('temp.csv', 'r')
chunks = f.read().split('New')

# read csv as separated dataframes, using the first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]

def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y forward / backward, assuming y has a single value per block
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # pad the others with zeros
    df = df.fillna(0)
    # revert index to values
    return df.reset_index(drop=False)

dfs = [pad(df) for df in dfs]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
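For Python 3, the same reading step works with io.StringIO and without unicode(); a sketch, assuming the file layout shown in the question:

from io import StringIO
import pandas as pd

with open('temp.csv') as f:
    chunks = f.read().split('New')

# skip empty chunks that can appear around the 'New' separators
dfs = [pd.read_csv(StringIO(chunk), header=None, index_col=0)
       for chunk in chunks if chunk.strip()]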
First Question
I've included print statements inside the function to explain how it works:
In [89]:
def replace_missing(df, Ids):
    # check which values are missing
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df
In [91]:
Ids = np.arange(44, 48)
final_df = df1.groupby(df1[1], as_index=False).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Out[91]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
45 12 100 5
46 12 1500 6
47 12 2500 7
44 12 0 0
Second Question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
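A slightly more compact equivalent (just a sketch): iterating a GroupBy yields (key, frame) pairs directly, so the list of per-group frames can be built as

separate = [frame for key, frame in final_df.groupby(final_df[1])]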
