Operations within DataFrameGroupBy - python

I am trying to understand how to apply a function within groupby, i.e. to each of the groups of a dataframe.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Stock': ['apple', 'ford', 'google', 'samsung', 'walmart', 'kroger'],
                   'Sector': ['tech', 'auto', 'tech', 'tech', 'retail', 'retail'],
                   'Price': np.random.randn(6),
                   'Signal': np.random.randn(6)}, columns=['Stock', 'Sector', 'Price', 'Signal'])
dfg = df.groupby(['Sector'], as_index=False)

type(dfg)
pandas.core.groupby.DataFrameGroupBy
I want to get sum( Price * (1/Signal) ) grouped by 'Sector', i.e. the resulting output should look like:

Sector | Value
auto   |  0.744944
retail | -0.572164053
tech   | -1.454632
I can get the results by creating separate data frames, but was looking for a way to operate within each of the grouped (sector) frames. I can find the mean or sum of Price:

dfg.agg({'Price': [np.mean, np.sum]}).head(2)

but not sum( Price * (1/Signal) ), which is what I need.
Thanks,

You provided random data, so there is no way we can reproduce the exact numbers you got. But based on what you described, I think the following will do:
In [121]:
(df.Price/df.Signal).groupby(df.Sector).sum()
Out[121]:
Sector
auto -1.693373
retail -5.137694
tech -0.984826
dtype: float64
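
If you prefer to stay on a GroupBy and get a DataFrame back (matching the as_index=False style in the question), a minimal equivalent sketch; the intermediate column name 'Value' here is made up for illustration:

out = (df.assign(Value=df['Price'] / df['Signal'])   # Price * (1/Signal) per row
         .groupby('Sector', as_index=False)['Value']
         .sum())
print(out)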

Related

Insert rows in Python dataframe with conditions

I have a large data file as shown below.
Edited to include an updated example:
I want to add two new columns (E and F) next to column D, and move the suite number (when applicable) and the City/State data in cells D3 and D4 to E2 and F2, respectively. The challenge is that not every entry has a suite number. I would need to insert a row first for the entries that don't have a suite number, only for them, not for those that already have the suite information.
I know how to write loops, but I'm having trouble defining the conditions. One way is to count the length of the string. How should I get started? I much appreciate your help!
This is how I would do it. I don't recommend looping when using pandas; it has a lot of tools, so looping is often not needed. One caution: your spreadsheet has NaN values, which I think are equivalent to numpy's np.nan, and it also has blanks, which I'm assuming are equivalent to "".
import pandas as pd
import numpy as np

# dictionary of your data
companies = {
    'Comp ID': ['C1', '', np.nan, 'C2', '', np.nan, 'C3', np.nan],
    'Address': ['10 foo', 'Suite A', 'foo city', '11 spam', 'STE 100', 'spam town', '12 ham', 'Myhammy'],
    'phone': ['888-321-4567', '', np.nan, '888-321-4567', '', np.nan, '888-321-4567', np.nan],
    'Type': ['W_sale', '', np.nan, 'W_sale', '', np.nan, 'W_sale', np.nan],
}

# make the frames needed
df = pd.DataFrame(companies)
df1 = pd.DataFrame()  # blank frame for the suite and town columns

# Edit here to TEST the data types
for r in range(0, 5):
    v = df['Comp ID'].values[r]
    print(f'this "{v}" is a ', type(v))
# This tells us the data types so we can construct our where(). Back to the prior answer....

# We need a where clause; it is similar to an IF() statement in Excel
df1['Suite'] = np.where(df['Comp ID'] == '', df['Address'], np.nan)
df1['City/State'] = np.where(df['Comp ID'].isna(), df['Address'], np.nan)

# copy values to the rows above
df1 = df1[['Suite', 'City/State']].backfill()

# join the frames together on the index
df = df.join(df1)
df.drop_duplicates(subset=['City/State'], keep='first', inplace=True)

# set the column order to what you want
df = df[['Comp ID', 'Type', 'Address', 'Suite', 'City/State', 'phone']]
output:

Comp ID | Type   | Address | Suite   | City/State | phone
C1      | W_sale | 10 foo  | Suite A | foo city   | 888-321-4567
C2      | W_sale | 11 spam | STE 100 | spam town  | 888-321-4567
C3      | W_sale | 12 ham  |         | Myhammy    | 888-321-4567
Edit: the numpy where statement:
numpy is brought in by the line import numpy as np at the top. We are creating a calculated column based on the 'Comp ID' column; numpy does this without loops. Think of where like an Excel IF() function:

new_column = np.where(condition, value_if_true, value_if_false)
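
A minimal self-contained sketch of that pattern (the frame and column names here are invented for illustration):

import numpy as np
import pandas as pd

demo = pd.DataFrame({'x': [1, 5, 10]})
# label each row depending on whether x exceeds 3
demo['big'] = np.where(demo['x'] > 3, 'yes', 'no')
print(demo)
#     x  big
# 0   1   no
# 1   5  yes
# 2  10  yes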
The pandas backfill:
Sometimes you have a value in a cell below and you want to duplicate it into the blank cell above it, so you backfill: df1 = df1[['Suite','City/State']].backfill().
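
A small sketch of backfilling on its own (note that recent pandas versions deprecate backfill() in favour of the equivalent bfill()):

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 'STE 100', np.nan, 'spam town'])
# each NaN is filled with the next valid value below it
print(s.bfill())
# 0      STE 100
# 1      STE 100
# 2    spam town
# 3    spam town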

Different questions about pandas pivot tables

Here's my df:
df = pd.DataFrame(
    {
        'Color': ['red', 'blue', 'red', 'red', 'green', 'red', 'yellow'],
        'Type': ['Oil', 'Aluminium', 'Oil', 'Oil', 'Cement Paint', 'Synthetic Rubber', 'Emulsion'],
        'Finish': ['Satin', 'Matte', 'Matte', 'Satin', 'Semi-gloss', 'Satin', 'Satin'],
        'Use': ['Interior', 'Exterior', 'Interior', 'Interior', 'Exterior', 'Exterior', 'Exterior'],
        'Price': [55, 75, 60, 60, 55, 75, 50]
    }
)
I want to create a pivot table that outputs 'Color', the color count, the percentage or weight of each color count, and finally a total row showing the total color count next to 100%. Additionally, I'd like to add a header with today's date in the following format (02 - Nov).
Here is my current pivot with the approximate inputs:
import datetime

today = datetime.date.today()
today_format = today.strftime("%d-%b")  # note: "%d-m%" is not a valid format; "%d-%b" gives e.g. 02-Nov

pivot_table = pd.pivot_table(
    data=df,
    index='Color',
    aggfunc={'Color': 'count'}
)

df['Color'].value_counts(
    normalize=True
).mul(100).round(1).astype(str) + '%'
Is there a way to add more information to the pivot as a header, total and extra column? Or should I just convert the pivot back to a DataFrame and edit it from there?
The main difficulty I'm finding is that since I'm handling string data, when I use aggfunc='sum' it actually concatenates the strings. And if I try to add margins=True, margins_name='Total count' I get the following error:
if isinstance(aggfunc[k], str):
KeyError: 'Type'
The desired table output would look something like this:
Updated Answer
Thanks to a great suggestion by Rabinzel, we can also have today's date as a column header:

from datetime import datetime
import numpy as np
import pandas as pd

# note: in pandas >= 2.0, value_counts().reset_index() names the columns ['Color', 'count'],
# so the index= and rename below would need adjusting there
df = (df['Color'].value_counts().reset_index()
        .pivot_table(index=['index'], aggfunc=np.sum, margins=True, margins_name='Total')
        .assign(perc=lambda x: x['Color'] / x.iloc[:-1]['Color'].sum() * 100)
        .rename(columns={'Color': 'Color Count',
                         'perc': '%'}))
new_cols = pd.MultiIndex.from_product([[datetime.today().strftime('%#d-%b')], df.columns])
df.columns = new_cols
df
               2-Nov
        Color Count           %
index
blue              1   14.285714
green             1   14.285714
red               4   57.142857
yellow            1   14.285714
Total             7  100.000000
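
One caveat on the date format: %#d (day without a leading zero) is Windows-specific; on Linux/macOS the equivalent flag is %-d. A small portable sketch:

from datetime import datetime
import platform

# '%#d' drops the leading zero on Windows; '%-d' does the same on Linux/macOS
fmt = '%#d-%b' if platform.system() == 'Windows' else '%-d-%b'
print(datetime.today().strftime(fmt))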

Loop through grouped dataframe columns to check for string values

I have two dataframes, one that contains a large amount of textual data scraped from PDF documents, and another that contains categories and subcategories.
For each subcategory, I need to calculate the percentage of documents that contain at least one mention of the subcategory (e.g. for the subcategory "apple", the percentage of documents that contain "apple"). I'm able to calculate the subcategory percentage correctly. However, when I attempt to populate the dataframe with the value, an incorrect value is displayed.
For each category, I need to calculate the percentage of documents that contain at least one mention of any of its subcategories (e.g. for the category "fruit", the percentage of documents that contain "apple" or "banana"). This value is harder to calculate, as it's not a subtotal. I'm trying to calculate it through a combination of GROUPBY and APPLY, but I've gotten stuck.
The document dataframe looks like this:
The categories dataframe looks like this:
This is what I'm aiming for:
This is what I have so far:
import pandas as pd

documents = {'Text': ['apple apple', 'banana apple', 'carrot carrot carrot', 'spinach', 'hammer']}
doc_df = pd.DataFrame(data=documents)
print(doc_df, '\n')

categories = {'Category': ['fruit', 'fruit', 'vegetable', 'vegetable'],
              'Subcategory': ['apple', 'banana', 'carrot', 'spinach']}
cat_df = pd.DataFrame(data=categories)
print(cat_df, '\n')

total_docs = doc_df.shape[0]
cat_df['Subcat_Percentage'] = 0
cat_df['Cat_Percentage'] = 0
cat_df = cat_df[['Category', 'Cat_Percentage', 'Subcategory', 'Subcat_Percentage']]

for idx, subcategory in enumerate(cat_df['Subcategory']):
    total_docs_with_subcat = doc_df[doc_df['Text'].str.contains(subcategory)].shape[0]
    subcat_percentage = total_docs_with_subcat / total_docs  # calculation is correct
    cat_df.at[idx, 'Subcat_Percentage'] = subcat_percentage  # wrong value is output (likely because the column was initialised with 0 and is integer-typed)
    cat_percentage = cat_df.groupby('Category').apply(lambda x: doc_df[doc_df['Text'].str.contains(subcategory)].shape[0])  # this doesn't work
    cat_df.at[idx, 'Cat_Percentage'] = cat_percentage

print('\n', cat_df, '\n')
It could be better optimized, but try this:
agg_category = cat_df.groupby('Category')['Subcategory'].agg('|'.join)

def percentage_cat(category):
    # the joined subcategories act as a regex alternation, e.g. 'apple|banana'
    return doc_df[doc_df['Text'].str.contains(agg_category[category])].size / doc_df.size

def percentage_subcat(subcategory):
    return doc_df[doc_df['Text'].str.contains(subcategory)].size / doc_df.size

cat_df['percentage_category'] = cat_df['Category'].apply(percentage_cat)
cat_df['sub_percentage'] = cat_df['Subcategory'].apply(percentage_subcat)
cat_df
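
With the sample data above, this should give percentage_category = 0.4 for both fruit rows (2 of the 5 documents mention "apple" or "banana") and 0.4 for both vegetable rows, with sub_percentage 0.4 for "apple" and 0.2 each for "banana", "carrot" and "spinach".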

Group By Distinct in Pandas

I have a script like this in pandas:
dfmi['Time'] = pd.to_datetime(dfmi['Time'], format='%H:%M:%S')
dfmi['hours'] = dfmi['Time'].dt.hour
sum_dh = dfmi.groupby(['Date','hours']).agg({'Amount': 'sum', 'Price':'sum'})
dfdhsum = pd.DataFrame(sum_dh)
dfdhsum.columns = ['Amount', 'Gas Sales']
dfdhsum
And the output:
I want a sum with a distinct group by, and the final result should look like this:
What is the pandas code solution for this?
I don't understand exactly what you want, but this instruction will sum hours, amount and gas sales for each date:

dfmi.groupby("Date").agg({'hours': 'sum', 'Amount': 'sum', 'Gas Sales': 'sum'})
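
If "distinct" here means counting unique values per group (like SQL's COUNT(DISTINCT ...)), a sketch under that assumption, using the original dfmi columns:

# one row per Date: distinct hours seen, plus summed Amount and Price
dfmi.groupby('Date').agg({'hours': 'nunique', 'Amount': 'sum', 'Price': 'sum'})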

Replace multiple columns with years for one column

I'm working with World Bank data and I'm trying to create some graphs representing time, but the data I have now looks like this:
As I don't think there's a way to change it to a datetime directly, I think the only way is to replace all these year columns with one column called 'Year', with my current column names as its values and the current values in a separate column.
Is there a nice function in Python that allows that, or would I have to iterate through the entire dataframe?
Edit to include some code:
df2 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
                    'Country Code': ['ABW', 'AFG', 'AGO'],
                    '1960': [65.66, 32.29, 33.25],
                    '1961': [66.07, 32.74, 33.57],
                    '1962': [66.44, 33.18, 33.91],
                    '1963': [66.79, 33.62, 34.27],
                    '1964': [66.11, 34.06, 34.65],
                    '1965': [67.44, 34.49, 35.03]}).set_index('Country Name')
You can try taking the transpose of the dataframe: the year values become rows, which you can then rename as Year and use in the plots. Something like this:
import pandas as pd
from matplotlib import pyplot as plt

df1 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
                    'Country Code': ['ABW', 'AFG', 'AGO'],
                    '1960': [65.66, 32.29, 33.25],
                    '1961': [66.07, 32.74, 33.57],
                    '1962': [66.44, 33.18, 33.91],
                    '1963': [66.79, 33.62, 34.27],
                    '1964': [66.11, 34.06, 34.65],
                    '1965': [67.44, 34.49, 35.03]})

df2 = df1.transpose()
df2.columns = df1['Country Name']
df2 = df2[2:]  # drop the 'Country Name' and 'Country Code' rows, keeping only the years
df2['Year'] = df2.index.values

# labels are needed for plt.legend() to have anything to show
plt.plot(df2['Year'], df2['Aruba'], label='Aruba')
plt.plot(df2['Year'], df2['Afghanistan'], label='Afghanistan')
plt.plot(df2['Year'], df2['Angola'], label='Angola')
plt.legend()
plt.show()
Output: [line plot of the three countries over 1960-1965]
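
As an alternative to transposing, pandas' melt is arguably the more direct tool for unpivoting year columns into rows; a short sketch (the 'Year'/'Value' column names are my own choices):

import pandas as pd

df1 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
                    'Country Code': ['ABW', 'AFG', 'AGO'],
                    '1960': [65.66, 32.29, 33.25],
                    '1961': [66.07, 32.74, 33.57]})

# unpivot the year columns into a single 'Year' column with matching values
long_df = df1.melt(id_vars=['Country Name', 'Country Code'],
                   var_name='Year', value_name='Value')
print(long_df.head())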
