python pandas merge_asof groupby

I have a merged dataframe as follows:
>>> merged_df.dtypes
Jurisdiction object
AdjustedVolume float64
EffectiveStartDate datetime64[ns]
VintageYear int64
ProductType object
Rate float32
Obligation float32
Demand float64
Cost float64
dtype: object
The below groupby statement returns the correct AdjustedVolume values by Jurisdiction/Year:
>>> merged_df.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume'].sum()
When including ProductType:
>>> merged_df.groupby(['Jurisdiction', 'VintageYear','ProductType'])['AdjustedVolume'].sum()
AdjustedVolume by Year is correct if the Jurisdiction contains only one ProductType, but for any Jurisdiction with two or more ProductTypes, the AdjustedVolumes are getting split up such that they sum to the correct value. I was expecting each row to have the total AdjustedVolume, and am unclear on why it's being split up.
example:
>>> merged_df.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume'].sum()
Jurisdiction  VintageYear    AdjustedVolume
CA            2017           3.529964e+05
>>> merged_df.groupby(['Jurisdiction', 'VintageYear','ProductType'])['AdjustedVolume'].sum()
Jurisdiction  VintageYear  ProductType    AdjustedVolume
CA            2017         Bucket1        7.584832e+04
CA            2017         Bucket2        1.308454e+05
CA            2017         Bucket3        1.463026e+05
I suspect the merge_asof is being done incorrectly:
>>> df1.dtypes
Jurisdiction object
ProductType object
VintageYear int64
EffectiveStartDate datetime64[ns]
Rate float32
Obligation float32
dtype: object
>>> df2.dtypes
Jurisdiction object
AdjustedVolume float64
EffectiveStartDate datetime64[ns]
VintageYear int64
dtype: object
Because df2 has no ProductType field, the below merge is breaking up the total volume into whatever ProductTypes are under each Jurisdiction. Can I modify the below merge so each ProductType has the total AdjustedVolume?
merged_df = pd.merge_asof(df2, df1, on='EffectiveStartDate', by=['Jurisdiction','VintageYear'])

You could use both versions of the groupby and merge the two tables.
The first table is a groupby that includes ProductType, which breaks out your AdjustedVolume by ProductType.
df = df.groupby(['Jurisdiction','VintageYear','ProductType']).agg({'AdjustedVolume':'sum'}).reset_index(drop = False)
Then create another table without including the ProductType (this is where the total amount will come from).
df1 = df.groupby(['Jurisdiction','VintageYear']).agg({'AdjustedVolume':'sum'}).reset_index(drop = False)
Now create an ID column in both tables so the merge keys line up.
df['ID'] = df['Jurisdiction'].astype(str)+'_' +df['VintageYear'].astype(str)
df1['ID'] = df1['Jurisdiction'].astype(str)+'_'+ df1['VintageYear'].astype(str)
Now merge on ID to get the total adjusted volume.
df = pd.merge(df, df1, left_on = ['ID'], right_on = ['ID'], how = 'inner')
Last step is to clean up your columns.
df = df.rename(columns = {'AdjustedVolume_x':'AdjustedVolume',
                          'AdjustedVolume_y':'TotalAdjustedVolume',
                          'Jurisdiction_x':'Jurisdiction',
                          'VintageYear_x':'VintageYear'})
del df['Jurisdiction_y']
del df['VintageYear_y']
Your output will then carry both the per-ProductType AdjustedVolume and the TotalAdjustedVolume on each row.
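As a side note, pd.merge accepts multiple key columns directly, so the helper ID column isn't strictly needed. A rough sketch of the same idea, starting from the question's merged_df (the names bytype, totals and out are only illustrative):
bytype = merged_df.groupby(['Jurisdiction', 'VintageYear', 'ProductType'], as_index=False)['AdjustedVolume'].sum()
totals = (merged_df.groupby(['Jurisdiction', 'VintageYear'], as_index=False)['AdjustedVolume'].sum()
                   .rename(columns={'AdjustedVolume': 'TotalAdjustedVolume'}))
# merge on the two key columns; every ProductType row picks up the Jurisdiction/Year total
out = bytype.merge(totals, on=['Jurisdiction', 'VintageYear'], how='left')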

Consider also transform to retrieve the group aggregate inline with the other records, akin to a subquery aggregate in SQL.
grpdf = merged_df.groupby(['Jurisdiction', 'VintageYear', 'ProductType'])['AdjustedVolume']\
                 .sum().reset_index()
# total per Jurisdiction/VintageYear, computed on grpdf so the result aligns with its rows
grpdf['TotalAdjVolume'] = grpdf.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume']\
                               .transform('sum')
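If you just need the total alongside every original row of merged_df, rather than a collapsed table, a one-line sketch with transform should also work:
merged_df['TotalAdjVolume'] = merged_df.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume']\
                                       .transform('sum')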

Related

Pandas convert partial column Index to Datetime

The DataFrame below contains a housing price dataset from 1996 to 2016.
Other than the first 6 columns, the remaining columns need to be converted to Datetime type.
I tried to run the following code:
HousingPrice.columns[6:] = pd.to_datetime(HousingPrice.columns[6:])
but I got the error:
TypeError: Index does not support mutable operations
I wish to convert some columns in the columns Index to Datetime type, but not all columns.
The pandas Index is immutable, so you can't assign to a slice of it directly.
However, you can access and modify the underlying data through Index.array (see the pandas docs).
HousingPrice.columns.array[6:] = pd.to_datetime(HousingPrice.columns[6:])
should work.
Note that this changes the column index only. To convert the column values themselves, you can do this:
date_cols = HousingPrice.columns[6:]
HousingPrice[date_cols] = HousingPrice[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
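If you prefer not to mutate the existing index through .array, a sketch that simply rebuilds the columns Index also works (assuming the first 6 columns should stay as they are):
date_index = pd.to_datetime(HousingPrice.columns[6:])
HousingPrice.columns = HousingPrice.columns[:6].tolist() + list(date_index)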
EDIT
Illustrated example:
import pandas as pd

data = {'0ther_col': [1,2,3], '1996-04': ['1996-04','1996-05','1996-06'], '1995-05':['1996-02','1996-08','1996-10']}
print('ORIGINAL DATAFRAME')
df = pd.DataFrame.from_records(data)
print(df)
print("\nDATE COLUMNS")
date_cols = df.columns[-2:]
print(df.dtypes)
print('\nCASTING DATE COLUMNS TO DATETIME')
df[date_cols] = df[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
print(df.dtypes)
print('\nCASTING DATE COLUMN INDEXES TO DATETIME')
print("OLD INDEX -", df.columns)
df.columns.array[-2:] = pd.to_datetime(df[date_cols].columns)
print("NEW INDEX -",df.columns)
print('\nFINAL DATAFRAME')
print(df)
yields:
ORIGINAL DATAFRAME
   0ther_col  1995-05  1996-04
0          1  1996-02  1996-04
1          2  1996-08  1996-05
2          3  1996-10  1996-06
DATE COLUMNS
0ther_col int64
1995-05 object
1996-04 object
dtype: object
CASTING DATE COLUMNS TO DATETIME
0ther_col int64
1995-05 datetime64[ns]
1996-04 datetime64[ns]
dtype: object
CASTING DATE COLUMN INDEXES TO DATETIME
OLD INDEX - Index(['0ther_col', '1995-05', '1996-04'], dtype='object')
NEW INDEX - Index(['0ther_col', 1995-05-01 00:00:00, 1996-04-01 00:00:00], dtype='object')
FINAL DATAFRAME
   0ther_col 1995-05-01 00:00:00 1996-04-01 00:00:00
0          1          1996-02-01          1996-04-01
1          2          1996-08-01          1996-05-01
2          3          1996-10-01          1996-06-01

In pandas, how to create dataframe indexed by id and with columns with separate content for each appearance?

In Python 3 and pandas I have the dataframe:
comps.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62679 entries, 0 to 62678
Data columns (total 39 columns):
cnpj 62679 non-null object
razao_social 62679 non-null object
nome_fantasia 36573 non-null object
nome_socio 62679 non-null object
cnpj_cpf_do_socio 62679 non-null object
The column (cnpj) has unique company identifier codes, the column (nome_socio) has the names of people related to the companies, and the column (cnpj_cpf_do_socio) has those people's identification codes.
So the code in (cnpj) can be repeated across many lines, depending on how many people are related to that company. For example:
cnpj            nome_socio   cnpj_cpf_do_socio
12345678901234  Paul JR.     987654321
12345678901234  Paul SR.     987665656
12345678901234  Mary Tree    987651213
12345678901234  Paula Sims   987652328
78889098898085  Vitor Moon   558900690
78889098898085  Sheila Kerr  546656588
The other columns (razao_social) and (nome_fantasia), which hold the company names, are also repeated.
So I would like to create a new dataframe that has each code (cnpj) on only one line, with the respective (razao_social) and (nome_fantasia), and with all the corresponding (nome_socio) and (cnpj_cpf_do_socio) values on that same line, separated by ";".
Something like:
cnpj            razao_social  nome_fantasia  all_names                               all_ids_names
12345678901234  Company 1     Zebra          Paul JR.;Paul SR.;Mary Tree;Paula Sims  987654321;987665656;987651213;987652328
78889098898085  Company 2     All Shops      Vitor Moon;Sheila Kerr                  558900690;546656588
Please, does anyone know how I can create this new dataframe?
You can use groupby, agg and do something like:
df1 = (df
       .groupby(['cnpj', 'razao_social', 'nome_fantasia'])
       .agg({'nome_socio': lambda x: ';'.join(list(x)),
             'cnpj_cpf_do_socio': lambda x: ';'.join(list(map(str, x)))})
       .reset_index())
You can do this with a pivot_table, something like this:
funcs = {"razao_social": lambda x: x.iloc[0],   # company names repeat within a cnpj, so keep one
         "nome_fantasia": lambda x: x.iloc[0],
         "nome_socio": lambda x: ";".join(x),
         "cnpj_cpf_do_socio": lambda x: ";".join(x)}
pivot = pd.pivot_table(df, index="cnpj", aggfunc=funcs)
Then rename the joined columns so they match the requested all_names and all_ids_names:
pivot = pivot.rename(columns={"nome_socio": "all_names", "cnpj_cpf_do_socio": "all_ids_names"})
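For reference, a minimal self-contained sketch of the groupby approach on the sample rows from the question (company names taken from the desired output), producing the all_names / all_ids_names columns directly via named aggregation:
import pandas as pd

df = pd.DataFrame({
    'cnpj': ['12345678901234'] * 4 + ['78889098898085'] * 2,
    'razao_social': ['Company 1'] * 4 + ['Company 2'] * 2,
    'nome_fantasia': ['Zebra'] * 4 + ['All Shops'] * 2,
    'nome_socio': ['Paul JR.', 'Paul SR.', 'Mary Tree', 'Paula Sims',
                   'Vitor Moon', 'Sheila Kerr'],
    'cnpj_cpf_do_socio': ['987654321', '987665656', '987651213', '987652328',
                          '558900690', '546656588'],
})

# one row per cnpj; the person columns are joined with ";"
out = (df
       .groupby(['cnpj', 'razao_social', 'nome_fantasia'], as_index=False)
       .agg(all_names=('nome_socio', ';'.join),
            all_ids_names=('cnpj_cpf_do_socio', ';'.join)))
print(out)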

pandas.groupby reacting differently on same data using lambda aggfunc with categorical type vs. object

I have encountered some strange behaviour of pandas.groupby. Depending on the dtype of my data columns, I get two completely different outcomes. One of them is as expected, the second seems strange.
Data set:
country  id    plan  consolidation_key
AT01     1000  100   A
AT01     1000  200   B
AT01     2000  300   J
AT01     2000  200   K
in an Excel file.
import numpy as np
import pandas as pd

def consolidate(d):
    columns = ['country', 'id', 'consolidation_key']
    # columns = ['id', 'consolidation_key']
    return d.groupby(by=columns).agg(
        plans=pd.NamedAgg(
            column="plan", aggfunc=lambda s: "-".join(sorted(set(s.astype(str))))
        )
    )
d = pd.read_excel(r"path\to\file\test_data.xlsx", sheet_name='data')
data = d
df = consolidate(data)
print(df)
print("-----------")
print("dtypes:")
print(data.dtypes)
print("--------------------")
data2 = d.assign(country=lambda x: pd.Categorical(x["country"]))
df2 = consolidate(data2)
print(df2)
print("-----------")
print("dtypes:")
print(data2.dtypes)
The lambda function in the consolidation does not fully come into play with this example data; it would join the unique plan values of a group into a string such as "100-200".
The result this gives is
                               plans
country id   consolidation_key
AT01    1000 A                   100
             B                   200
        2000 J                   300
             K                   200
-----------
dtypes:
country object
id int64
plan int64
consolidation_key object
dtype: object
--------------------
                               plans
country id   consolidation_key
AT01    1000 A                   100
             B                   200
             J                   NaN
             K                   NaN
        2000 A                   NaN
             B                   NaN
             J                   300
             K                   200
-----------
dtypes:
country category
id int64
plan int64
consolidation_key object
dtype: object
The first consolidation into df looks good. The second into df2 has extra items with NaN values. It looks like a cross join for both ids.
Interestingly, this only happens when columns=['country', 'id', 'consolidation_key']. With columns=['id', 'consolidation_key'], the consolidation works correctly in both cases.
Here is the big question: is this a bug in pandas, or am I missing something else?
Versions:
Python 3.7.3
IPython 7.8.0
Pandas 0.25.1 (and 0.25.2)
Reading through the posts in #jezrael's answer, I came to an important comment at https://github.com/pandas-dev/pandas/issues/17594#issuecomment-545238294: when grouping by a categorical column, pandas by default produces a group for every combination of categories, including unobserved ones, which is exactly what creates the extra NaN rows.
Adding observed=True to groupby solves my problem.
def consolidate(d):
    columns = ['country', 'id', 'consolidation_key']
    return d.groupby(by=columns, observed=True).agg(
        plans=pd.NamedAgg(
            column="plan", aggfunc=lambda s: "-".join(sorted(set(s.astype(str))))
        )
    )
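For completeness, a minimal sketch that reproduces the fix without the Excel file, building the four sample rows in code (values copied from the data set above):
import pandas as pd

data = pd.DataFrame({
    'country': ['AT01', 'AT01', 'AT01', 'AT01'],
    'id': [1000, 1000, 2000, 2000],
    'plan': [100, 200, 300, 200],
    'consolidation_key': ['A', 'B', 'J', 'K'],
})
data['country'] = pd.Categorical(data['country'])

# observed=True keeps only category combinations that actually occur in the data
df = data.groupby(['country', 'id', 'consolidation_key'], observed=True).agg(
    plans=pd.NamedAgg(column='plan',
                      aggfunc=lambda s: '-'.join(sorted(set(s.astype(str)))))
)
print(df)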

How do I combine dataframe columns

I've a dataframe df that looks like:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 810 entries, 0 to 809
Data columns (total 21 columns):
event_type 810 non-null object
datetime 810 non-null datetime64[ns]
person 810 non-null object
...
from_file 0 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2), object(16)
memory usage: 133.0+ KB
(There are 21 columns, but I'm only interested in the four above, so I've omitted the rest.)
I want to create a second dataframe df_b that has two columns: one is a combination of df's event_type, person, and from_file columns, and the other is df's datetime. Did I explain that well?... (so two columns in df_b from df's four, where three of the above are combined into one of df_b's)
I thought of creating a new dataframe df_b as:
df_b = pandas.DataFrame({'event_type+person+from_file': [], 'datetime': []})
Then selecting all rows with:
df.loc[:, ['event_type','person','from_file','datetime']]
But beyond that I don't know how to achieve the rest, and I keep thinking I'll end up with datetime values that don't correspond to the original row's datetime pulled from df.
So can you show me how to:
select: event_type, person, from_file, datetime from all rows in df
combine: event_type, person, from_file with '+' between the values
and then put (event_type+person+from_file), datetime into df_b
?
To drop NaN values use:
df_clean = df.dropna(subset=['event_type', 'person', 'from_file'])
Concatenating string columns in Pandas is as easy as:
df_clean['event_type+person+from_file'] = (df_clean['event_type'] + '+' +
                                           df_clean['person'] + '+' +
                                           df_clean['from_file'].astype(str))  # from_file is float64, so cast it to str
And then:
df_b = df_clean[['event_type+person+from_file', 'datetime']].copy()
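An alternative sketch that skips the intermediate column and builds df_b directly, using str.cat to join the three fields (from_file is float64 in the question, hence the astype(str)):
df_b = pd.DataFrame({
    'event_type+person+from_file': df['event_type'].str.cat(
        [df['person'], df['from_file'].astype(str)], sep='+'),
    'datetime': df['datetime'],
})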

Group by fields in pandas dataframe

I have a dataframe with the following fields. For each Id, I have two records that represent different latitudes and longitudes. I'm trying to build a resulting dataframe that groups the current dataframe by Id and puts each record's latitude and longitude into separate fields.
I tried the groupby function but did not get the intended results. Any help would be greatly appreciated.
Id   StartTime  StopTime  Latitude  Longitude
101  14:42:28   14:47:56  53.51     118.12
101  22:10:01   22:12:49  33.32     333.11
Result:
Id   StartLat  StartLong  DestLat  DestLong
101  53.51     118.12     33.32    333.11
You can use groupby with apply to flatten each group's values into a Series:
df = df.groupby('Id')[['Latitude','Longitude']].apply(lambda x: pd.Series(x.values.ravel()))
df.columns = ['StartLat', 'StartLong', 'DestLat', 'DestLong']
df = df.reset_index()
print (df)
    Id  StartLat  StartLong  DestLat  DestLong
0  101     53.51     118.12    33.32    333.11
If you run into:
TypeError: Series.name must be a hashable type
try changing Series to DataFrame, but then you need unstack plus droplevel:
df = df.groupby('Id')[['Latitude','Longitude']]\
       .apply(lambda x: pd.DataFrame(x.values.ravel()))\
       .unstack()
df.columns = df.columns.droplevel(0)
df.columns = ['StartLat', 'StartLong', 'DestLat', 'DestLong']
df = df.reset_index()
print (df)
    Id  StartLat  StartLong  DestLat  DestLong
0  101     53.51     118.12    33.32    333.11
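If each Id is guaranteed to have exactly two rows, ordered start then destination, a sketch using named aggregation with first/last reads a bit more explicitly (sorting by StartTime is an assumption about what defines the start point):
out = (df.sort_values('StartTime')
         .groupby('Id')
         .agg(StartLat=('Latitude', 'first'), StartLong=('Longitude', 'first'),
              DestLat=('Latitude', 'last'), DestLong=('Longitude', 'last'))
         .reset_index())
print(out)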
