How do you add grouped dataframes where only some groups match? - python

I have two dataframes. I've already grouped each by three columns and then summed within each group. Two simplified versions look as follows:
Dataframe 1:
pitch_type  zone  description        0
CH          1.0   hit_by_pitch       4
            1.0   ball               3
            2.0   swinging_strike    1
CU          2.0   hit_by_pitch       2
Dataframe 2:
pitch_type  zone  description        0
CH          1.0   ball               3
            3.0   ball               1
CU          2.0   hit_by_pitch       4
I want to add them together so that the following is true:
- matching group values are added together: e.g. CH, 1.0, ball from the first dataframe is added to the value for CH, 1.0, ball in the second dataframe
- groups without a match are still included in the resulting dataframe: e.g. CH, 2.0, swinging_strike keeps its value of 1 in the resulting dataframe
The resulting dataframe would look like:
pitch_type  zone  description        0
CH          1.0   hit_by_pitch       4
            1.0   ball               6
            2.0   swinging_strike    1
            3.0   ball               1
CU          2.0   hit_by_pitch       6
I've had some trouble attempting this. My inclination is to use
df1.add(df2, fill_value=0)
but I get a weird resulting dataframe where only some values from each dataframe are copied over. Could it be that I'm using an incorrect axis or level, since my grouped dataframes have MultiIndexes?
I appreciate any and all help. This has been driving me a tad banana sandwiches.

You can use concat and then groupby:
df = pd.concat([df1, df2])
df = df.groupby(['pitch_type', 'zone', 'description']).sum()
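For reference, here is a minimal runnable sketch of this approach applied to the grouped frames from the question (the raw rows and the value column name "0" are reconstructed from the tables above, so treat them as assumptions):
import pandas as pd

# Rebuild the grouped/summed frames shown in the question
raw1 = pd.DataFrame(
    [["CH", 1.0, "hit_by_pitch", 4], ["CH", 1.0, "ball", 3],
     ["CH", 2.0, "swinging_strike", 1], ["CU", 2.0, "hit_by_pitch", 2]],
    columns=["pitch_type", "zone", "description", "0"])
raw2 = pd.DataFrame(
    [["CH", 1.0, "ball", 3], ["CH", 3.0, "ball", 1],
     ["CU", 2.0, "hit_by_pitch", 4]],
    columns=["pitch_type", "zone", "description", "0"])
g1 = raw1.groupby(["pitch_type", "zone", "description"]).sum()
g2 = raw2.groupby(["pitch_type", "zone", "description"]).sum()

# Concatenate the grouped frames, then sum again over the index levels
result = pd.concat([g1, g2]).groupby(level=["pitch_type", "zone", "description"]).sum()
print(result)

# On these aligned MultiIndexes, g1.add(g2, fill_value=0) should give the same result.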

From my understanding, you want to join the two dataframes, combine them, group the result, and sum it.
Here I assume your first DataFrame is
pitch_type  zone  description        0
CH          1.0   hit_by_pitch       4
CH          1.0   ball               3
CH          2.0   swinging_strike    1
CU          2.0   hit_by_pitch       2
and the second DataFrame is
pitch_type  zone  description     0
CH          1.0   ball            3
CH          3.0   ball            1
CU          2.0   hit_by_pitch    4
Building on Mattravel's answer, we can use concat to join the two DataFrames with the following code:
>>> import pandas as pd
>>> df1 = pd.DataFrame([["CH", 1.0, "hit_by_pitch", 4],
                        ["CH", 1.0, "ball", 3],
                        ["CH", 2.0, "swinging_strike", 1],
                        ["CU", 2.0, "hit_by_pitch", 2]],
                       columns=["pitch_type", "zone", "description", "0"])
>>> df2 = pd.DataFrame([["CH", 1.0, "ball", 3],
                        ["CH", 3.0, "ball", 1],
                        ["CU", 2.0, "hit_by_pitch", 4]],
                       columns=["pitch_type", "zone", "description", "0"])
>>> df3 = pd.concat([df1, df2], ignore_index=True).sort_values(by=["pitch_type", "description"])
>>> print(df3.head(10))
with the result:
pitch_type  zone  description        0
CH          1.0   ball               3
CH          1.0   ball               3
CH          3.0   ball               1
CH          1.0   hit_by_pitch       4
CH          2.0   swinging_strike    1
CU          2.0   hit_by_pitch       2
CU          2.0   hit_by_pitch       4
After concatenating the two DataFrames, we need to group by the key columns and sum:
>>> df3 = df3.groupby(["pitch_type", "zone", "description"]).sum().reset_index()
>>> print(df3)
As a result you get:
pitch_type  zone  description        0
CH          1.0   ball               6
CH          1.0   hit_by_pitch       4
CH          2.0   swinging_strike    1
CH          3.0   ball               1
CU          2.0   hit_by_pitch       6
This is the same as the result you want. I hope this answer helps.

Related

Create new column showing the occurrences of a column value in a range of others

I have a simple pandas DataFrame where I need to add a new column that shows the count of occurrences, across a range of other columns ('pricemonths'), that match the 'Current_Price' column:
import pandas as pd
import numpy as np
# my data
data = {'Item': ['Bananas', 'Apples', 'Pears', 'Avocados', 'Grapes', 'Melons'],
        'Jan': [1, 0.5, 1.1, 0.6, 2, 4],
        'Feb': [0.9, 0.5, 1, 0.6, 2, 5],
        'Mar': [1, 0.6, 1, 0.6, 2.1, 6],
        'Apr': [1, 0.6, 1, 0.6, 2, 5],
        'May': [1, 0.5, 1.1, 0.6, 2, 5],
        'Current_Price': [1, 0.6, 1, 0.6, 2, 4]
        }
# import my data
df = pd.DataFrame(data)
pricemonths=['Jan','Feb','Mar','Apr','May']
Thus, my final dataframe would contain another column ('times_found') with the values:
'times_found'
4
2
3
5
4
1
One way of doing it is to transpose the price columns of df, then use eq to compare with "Current_Price" across the index (which creates a boolean DataFrame with True for matching prices and False otherwise) and sum across rows:
df['times_found'] = df['Current_Price'].eq(df.loc[:,'Jan':'May'].T).sum(axis=0)
or use numpy broadcasting:
df['times_found'] = (df.loc[:,'Jan':'May'].to_numpy() == df[['Current_Price']].to_numpy()).sum(axis=1)
Excellent suggestion from @HenryEcker: DataFrame.eq along an axis may be faster than transposing for larger DataFrames:
df['times_found'] = df.loc[:, 'Jan':'May'].eq(df['Current_Price'], axis=0).sum(axis=1)
Output:
Item Jan Feb Mar Apr May Current_Price times_found
0 Bananas 1.0 0.9 1.0 1.0 1.0 1.0 4
1 Apples 0.5 0.5 0.6 0.6 0.5 0.6 2
2 Pears 1.1 1.0 1.0 1.0 1.1 1.0 3
3 Avocados 0.6 0.6 0.6 0.6 0.6 0.6 5
4 Grapes 2.0 2.0 2.1 2.0 2.0 2.0 4
5 Melons 4.0 5.0 6.0 5.0 5.0 4.0 1
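As a side note (not part of the original answer), the pricemonths list defined in the question can be used for the column selection, avoiding the hard-coded 'Jan':'May' slice:
df['times_found'] = df[pricemonths].eq(df['Current_Price'], axis=0).sum(axis=1)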

python delete row where most columns are nans

I'm importing data from Excel where some rows may have notes in a column and are not truly part of the dataframe. Dummy example below:
H1 H2 H3
*highlighted cols are PII
sam red 5
pam blue 3
rod green 11
* this is the end of the data
When the above file is imported into dfPA it looks like:
dfPA:
Index H1 H2 H3
1 *highlighted cols are PII
2 sam red 5
3 pam blue 3
4 rod green 11
5 * this is the end of the data
I want to delete the first and last row. This is what I've done.
#get count of cols in df
input: cntcols = dfPA.shape[1]
output: 3
#get count of cols with nan in df
input: a = dfPA.shape[1] - dfPA.count(axis=1)
output:
0 2
1 3
2 3
4 3
5 2
(where a is a series)
#convert a from series to df
dfa = a.to_frame()
#delete rows where no. of nan's are greater than 'n'
n = 1
for r, row in dfa.iterrows():
    if (cntcols - dfa.iloc[r][0]) > n:
        i = row.name
        dfPA = dfPA.drop(index=i)
This doesn't work. Is there a way to do this?
You should use the pandas.DataFrame.dropna method. Its thresh parameter defines the minimum number of non-NaN values required to keep a row/column.
Imagine the following dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([[1, np.nan, 1, np.nan], [1, 1, 1, 1], [1, np.nan, 1, 1], [np.nan, 1, 1, 1]], columns=list('ABCD'))
A B C D
0 1.0 NaN 1 NaN
1 1.0 1.0 1 1.0
2 1.0 NaN 1 1.0
3 NaN 1.0 1 1.0
You can drop columns with NaN using:
>>> df.dropna(axis=1)
C
0 1
1 1
2 1
3 1
The thresh parameter defines the minimum number of non-NaN values to keep the column:
>>> df.dropna(thresh=3, axis=1)
A C D
0 1.0 1 NaN
1 1.0 1 1.0
2 1.0 1 1.0
3 NaN 1 1.0
If you want to reason in terms of the number of NaN values instead (for dropping columns, each column holds len(df) values):
# example for a minimum of 2 NaN to drop the column
>>> df.dropna(thresh=len(df) - (2 - 1), axis=1)
If the rows rather than the columns need to be filtered, remove the axis parameter or use axis=0:
>>> df.dropna(thresh=3)
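Applied to the dfPA example from the original question, a minimal sketch (the frame is reconstructed by hand here, so treat it as an assumption; rows with more than n NaN values are dropped by requiring at least cntcols - n non-NaN values):
import numpy as np
import pandas as pd

# hypothetical reconstruction of the imported frame from the question
dfPA = pd.DataFrame({
    'H1': ['*highlighted cols are PII', 'sam', 'pam', 'rod', '* this is the end of the data'],
    'H2': [np.nan, 'red', 'blue', 'green', np.nan],
    'H3': [np.nan, 5, 3, 11, np.nan],
})

cntcols = dfPA.shape[1]
n = 1  # maximum number of NaN values allowed per row
dfPA = dfPA.dropna(thresh=cntcols - n)
print(dfPA)  # the two note rows are dropped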

What is the reverse operation of `.value_counts()` in pandas dataframe?

Starting from a non-unique pandas series, one can count the number of each unique value with .value_counts().
>> col = pd.Series([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
0 1.0
1 1.0
2 2.0
3 3.0
4 3.0
5 3.0
dtype: object
>> stat = col.value_counts()
>> stat
3.0 3
1.0 2
2.0 1
dtype: int64
But what if I start from a dataframe of two columns: one for the unique values and another for the number of occurrences (stat in the previous example)? How can I expand those back into a single column?
Because I would like to calculate the median, mean, etc. of the data in such a dataframe, I think describing a single column is much easier than describing two. Or is there a method to describe a 'value_counts' dataframe directly, without expanding the data?
# turn `stat` into col ???
>> col.describe()
count 6.000000
mean 2.166667
std 0.983192
min 1.000000
25% 1.250000
50% 2.500000
75% 3.000000
max 3.000000
Added test data:
>> df = pd.DataFrame({"Name": ["A", "B", "C"], "Value": [1,2,3], "Count": [2, 10, 2]})
>> df
Name Value Count
0 A 1 2
1 B 2 5
2 C 3 2
df2 = _reverse_count(df)
>> df2
Name Value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
5 B 2
6 B 2
7 B 2
8 C 3
9 C 3
You can use the repeat function from numpy
import pandas as pd
import numpy as np
col = pd.Series([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
stats=col.value_counts()
pd.Series(np.repeat(stats.index,stats))
# 0 3.0
# 1 3.0
# 2 3.0
# 3 1.0
# 4 1.0
# 5 2.0
# dtype: float64
Update: for multiple columns you can use
df.loc[df.index.repeat(df['Count'])]
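For example, applied to the sample df above, a sketch that repeats rows by Count, drops the helper column, and then describes the expanded values (it uses the Count values from the code sample, which differ from the printed table):
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "C"], "Value": [1, 2, 3], "Count": [2, 10, 2]})

# repeat each row according to its Count, drop the helper column, and renumber
expanded = (df.loc[df.index.repeat(df["Count"])]
              .drop(columns="Count")
              .reset_index(drop=True))

print(expanded)                      # one row per original occurrence
print(expanded["Value"].describe())  # count, mean, std, median (50%), ...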

Loop that counts unique values in a pandas df

I am trying to create a loop or a more efficient process that can compute the counts below for a pandas df. At the moment I'm selecting one value at a time and running the function on it.
So for the df below, I'm trying to determine two counts:
1) ['u'] returns the count of remaining occurrences of the same value pair in ['Code', 'Area'], i.e. how many more times the same values occur after that row.
2) ['On'] returns the number of values currently occurring in ['Area'] for that Code. It achieves this by parsing through the df to see if those values occur again, essentially looking into the future to see whether those values reoccur.
import pandas as pd

d = ({
    'Code': ['A', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'A'],
    'Area': ['Home', 'Work', 'Shops', 'Park', 'Cafe', 'Home', 'Cafe', 'Work', 'Home', 'Park'],
})
df = pd.DataFrame(data=d)

# Select value
df1 = df[df.Code == 'A'].copy()

df1['u'] = df1[::-1].groupby('Area').Area.cumcount()

ids = [1]
seen = set([df1.iloc[0].Area])
dec = False
for val, u in zip(df1.Area[1:], df1.u[1:]):
    ids.append(ids[-1] + (val not in seen) - dec)
    seen.add(val)
    dec = u == 0

df1['On'] = ids
df1 = df1.reindex(df.index).fillna(df1)
The problem is I want to run this script on all values in Code instead of selecting one at a time. For instance, if I want to do the same thing for Code 'B', I would have to change the selection to df1 = df[df.Code == 'B'].copy() and then run the script again.
If I have numerous values in Code this becomes very inefficient. I need a loop that finds all unique values in 'Code'. Ideally, the script would look like:
df1 = df[df.Code == 'All unique values'].copy()
Intended Output:
Code Area u On
0 A Home 2.0 1.0
1 A Work 1.0 2.0
2 A Shops 0.0 3.0
3 A Park 1.0 3.0
4 B Cafe 1.0 1.0
5 A Home 1.0 3.0
6 B Cafe 0.0 1.0
7 A Work 0.0 3.0
8 A Home 0.0 2.0
9 A Park 0.0 1.0
I find your "On" logic very confusing. That said, I think I can reproduce it:
df["u"] = df.groupby(["Code", "Area"]).cumcount(ascending=False)
df["nunique"] = pd.get_dummies(df.Area).groupby(df.Code).cummax().sum(axis=1)
df["On"] = (df["nunique"] -
(df["u"] == 0).groupby(df.Code).cumsum().groupby(df.Code).shift().fillna(0)
which gives me
In [212]: df
Out[212]:
Code Area u nunique On
0 A Home 2 1 1.0
1 A Work 1 2 2.0
2 A Shops 0 3 3.0
3 A Park 1 4 3.0
4 B Cafe 1 1 1.0
5 A Home 1 4 3.0
6 B Cafe 0 1 1.0
7 A Work 0 4 3.0
8 A Home 0 4 2.0
9 A Park 0 4 1.0
In this, u is the number of matching (Code, Area) pairs after that row. nunique is the number of unique Area values seen so far in that Code.
On is the number of unique Areas seen so far, except that once we "run out" of an Area -- once it's not used any more -- we start subtracting it from nunique.
Using GroupBy with size and cumcount, you can construct your u series.
Your logic for On isn't clear: this requires clarification.
g = df.groupby(['Code', 'Area'])
df['u'] = g['Code'].transform('size') - (g.cumcount() + 1)
print(df)
  Code   Area  u
0    A   Home  2
1    A   Work  1
2    A  Shops  0
3    A   Park  1
4    B   Cafe  1
5    A   Home  1
6    B   Cafe  0
7    A   Work  0
8    A   Home  0
9    A   Park  0
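For completeness, if you do want a literal loop, the per-Code logic from the question can be run for every unique Code and the pieces recombined; a sketch that reproduces the intended u and On values:
import pandas as pd

d = {'Code': ['A', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'A'],
     'Area': ['Home', 'Work', 'Shops', 'Park', 'Cafe', 'Home', 'Cafe', 'Work', 'Home', 'Park']}
df = pd.DataFrame(data=d)

pieces = []
for code in df['Code'].unique():
    sub = df[df['Code'] == code].copy()
    # remaining occurrences of the same Area within this Code
    sub['u'] = sub[::-1].groupby('Area').Area.cumcount()
    # the running "On" counter from the question, applied per Code
    ids = [1]
    seen = {sub.iloc[0].Area}
    dec = False
    for val, u in zip(sub.Area.iloc[1:], sub.u.iloc[1:]):
        ids.append(ids[-1] + (val not in seen) - dec)
        seen.add(val)
        dec = u == 0
    sub['On'] = ids
    pieces.append(sub)

result = pd.concat(pieces).sort_index()
print(result)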

Pandas manipulate dataframe

I am querying a database and populating a pandas dataframe. I am struggling to aggregate the data (via groupby) and then manipulate the dataframe index such that the dates in the table become the index.
Here is an example of what the data looks like before and after the groupby, and what I am ultimately looking for.
dataframe - populated data
firm | dates | received | Sent
-----------------------------------------
A 10/08/2016 2 8
A 12/08/2016 4 2
B 10/08/2016 1 0
B 11/08/2016 3 5
A 13/08/2016 5 1
C 14/08/2016 7 3
B 14/08/2016 2 5
First I want to group by "firm" and "dates" for "received"/"sent".
Then manipulate the DataFrame so that the dates become the column index rather than row values.
Finally, add a total for each day.
Some of the firms have no 'activity' on some days, or at least no activity in either received or sent. However, as I want a view of the past X days, empty values aren't acceptable; rather, I need to fill in a zero instead.
dates | 10/08/2016 | 11/08/2016| 12/08/2016| 13/08/2016| 14/08/2016
firm |
----------------------------------------------------------------------
A received 2 0 4 5 0
sent 8 0 2 1 0
B received 1 3 1 0 2
sent 0 5 0 0 5
C received 0 0 2 0 1
sent 0 0 1 2 0
Totals r. 3 3 7 5 3
Totals s. 8 0 3 3 5
I've tried the following code:
df = ...  # mysql query result
n_received = df.groupby(["firm", "dates"]).received.size()
n_sent = df.groupby(["firm", "dates"]).sent.size()
tables = pd.DataFrame({'received': n_received, 'sent': n_sent},
                      columns=['received', 'sent'])
this = pd.melt(tables,
               id_vars=['dates', 'firm', 'received', 'sent'])
this = this.set_index(['dates', 'firm', 'received', 'sent', 'var'])
this = this.unstack('dates').fillna(0)
this.columns = this.columns.droplevel()
this.columns.name = ''
this = this.transpose()
Basically, I am not getting the result I want with this code.
- How can I achieve this?
- Conceptually, is there a better way of achieving this result? Say, aggregating in the SQL statement instead, or does the aggregation in pandas make more sense from an optimisation and logical point of view?
You can use stack and unstack to reshape the data between long and wide formats:
import pandas as pd
# calculate the total received and sent grouped by dates
df1 = df.drop('firm', axis=1).groupby('dates').sum().reset_index()
# add the totals as an extra 'total' value of the firm column
df1['firm'] = 'total'
# concatenate the summary data frame and the original data frame, then use stack and
# unstack so that dates appear as columns while received and sent are stacked as rows
pd.concat([df, df1]).set_index(['firm', 'dates']).stack().unstack(level=1).fillna(0)
# dates 10/08/2016 11/08/2016 12/08/2016 13/08/2016 14/08/2016
# firm
# A Sent 8.0 0.0 2.0 1.0 0.0
# received 2.0 0.0 4.0 5.0 0.0
# B Sent 0.0 5.0 0.0 0.0 5.0
# received 1.0 3.0 0.0 0.0 2.0
# C Sent 0.0 0.0 0.0 0.0 3.0
# received 0.0 0.0 0.0 0.0 7.0
# total Sent 8.0 5.0 2.0 1.0 8.0
# received 3.0 3.0 4.0 5.0 9.0
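To try the snippet end to end, the sample data from the question can first be reconstructed as a DataFrame (a sketch; the column names follow the question's table):
import pandas as pd

df = pd.DataFrame({
    'firm':  ['A', 'A', 'B', 'B', 'A', 'C', 'B'],
    'dates': ['10/08/2016', '12/08/2016', '10/08/2016', '11/08/2016',
              '13/08/2016', '14/08/2016', '14/08/2016'],
    'received': [2, 4, 1, 3, 5, 7, 2],
    'Sent':     [8, 2, 0, 5, 1, 3, 5],
})
# after which the concat / set_index / stack / unstack line above runs as-is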
