Python Group by and Sum with a Blank space

Revenue by Segment and Country
I have a dataframe with Revenue by Segment and by Country. I want to get the Total Revenue by Country code. So I want the output to be:
Country Revenue
FR 26.38
AE 12.02
This is what the data frame looks like now:
Country  Segment              Revenue
FR
         Digital Games          $2.40
         Music                 $20.79
         Health and Fitness     $0.46
         Tech Enthusiasts       $2.73
AE
         Digital Games          $9.99
         Games and Toys         $2.03
AT
         Entertainment-Music    $0.09
AU
         Shopping              $52.45
         Auto Enthusiasts       $7.86
         Auto Owners           $25.92
         Culture and Arts       $8.04
         Higher Education      $25.81
         Digital Games          $2.60
         Games and Toys         $6.12

I'm assuming that your empty entries are NaN; if they are not, I advise you to make them NaN. The general idea is to forward-fill the country column, then drop null values, which places the country code next to each row that contains data while removing the header rows. The groupby + sum is a simple operation from that point.
ffill + dropna + groupby
d = dict(
    Country=df.Country.ffill(),                       # fill country codes down into their segment rows
    Revenue=df.Revenue.str.strip('$').astype(float)   # '$2.40' -> 2.40
)
df.assign(**d).dropna().groupby('Country')['Revenue'].sum()
Country
AE 12.02
AT 0.09
AU 128.80
FR 26.38
Name: Revenue, dtype: float64
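For reference, here is a minimal sketch of the input frame the snippet above expects, with the blanks read in as NaN (the construction is my assumption; the values come from the question, trimmed to the FR and AE rows for brevity):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['FR', np.nan, np.nan, np.nan, np.nan, 'AE', np.nan, np.nan],
    'Segment': [np.nan, 'Digital Games', 'Music', 'Health and Fitness',
                'Tech Enthusiasts', np.nan, 'Digital Games', 'Games and Toys'],
    'Revenue': [np.nan, '$2.40', '$20.79', '$0.46', '$2.73',
                np.nan, '$9.99', '$2.03'],
})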

Related

How to separate Numbers from string and move them to next column in Python?

I am working with share market data, and in some rows the market cap has shifted into the previous column. I am trying to move it into the next column, but the value it returns is completely different.
This is the code I am using -
data['Market Cap (Crores)'] = data['Sub-Sector'].astype('str').str.extractall(r'(\d+)').unstack().fillna('').sum(axis=1).astype(int)
data['Market Cap (Crores)']
But the output I am getting is
968 NaN
969 NaN
970 -2.147484e+09
971 -2.147484e+09
972 -2.147484e+09
How do I get the correct values?
You can do it step by step. First, pick out the rows that need fixing (where the market cap is NaN). Then create two functions: one to pull the market cap out of the string, and one to remove it. Use apply to fix up those rows, and substitute the values back into the original dataframe.
import pandas as pd
import numpy as np

data = [
    ['GNA Axles Ltd', 'Auto Parts', 1138.846797],
    ['Andhra Paper Ltd', 'Paper Products', 1135.434614],
    ['Tarc', 'Real Estate 1134.645409', np.nan],
    ['Udaipur Cement Works', 'Cement 1133.531734', np.nan],
    ['Pnb Gifts', 'Investment Banking 1130.463641', np.nan],
]

def getprice(row):
    # the market cap is the last whitespace-separated token
    return float(row['Sub-Sector'].split()[-1])

def removeprice(row):
    # everything but the last token is the real sub-sector name
    return ' '.join(row['Sub-Sector'].split()[:-1])

df = pd.DataFrame(data, columns=['Company', 'Sub-Sector', 'Market Cap (Crores)'])
print(df)

picks = df['Market Cap (Crores)'].isna()
rows = df[picks]
print(rows)

df.loc[picks, 'Sub-Sector'] = rows.apply(removeprice, axis=1)
df.loc[picks, 'Market Cap (Crores)'] = rows.apply(getprice, axis=1)
print(df)
Output:
Company Sub-Sector Market Cap (Crores)
0 GNA Axles Ltd Auto Parts 1138.846797
1 Andhra Paper Ltd Paper Products 1135.434614
2 Tarc Real Estate 1134.645409 NaN
3 Udaipur Cement Works Cement 1133.531734 NaN
4 Pnb Gifts Investment Banking 1130.463641 NaN
Company Sub-Sector Market Cap (Crores)
2 Tarc Real Estate 1134.645409 NaN
3 Udaipur Cement Works Cement 1133.531734 NaN
4 Pnb Gifts Investment Banking 1130.463641 NaN
Company Sub-Sector Market Cap (Crores)
0 GNA Axles Ltd Auto Parts 1138.846797
1 Andhra Paper Ltd Paper Products 1135.434614
2 Tarc Real Estate 1134.645409
3 Udaipur Cement Works Cement 1133.531734
4 Pnb Gifts Investment Banking 1130.463641
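A more compact variant of the same fix, sketched with a regular expression (an assumption on my part rather than code from either answer: it relies on the stray market cap always being a trailing decimal number):
# starting again from the original frame with the NaN caps
trailing = df['Sub-Sector'].str.extract(r'(\d+\.\d+)$')[0].astype(float)
df['Market Cap (Crores)'] = df['Market Cap (Crores)'].fillna(trailing)
df['Sub-Sector'] = df['Sub-Sector'].str.replace(r'\s*\d+\.\d+$', '', regex=True)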
Here is another method you can try: use your extraction code to create a numeric field, then take the non-missing value from Sub-Sector Number and Sub-Sector to build your final field, Sub-Sector final:
df['Sub-Sector Number'] = df['Sub-Sector'].astype('str').str.extractall(r'(\d+)').unstack().fillna('').sum(axis=1).astype(int)
df['Sub-Sector final'] = df[['Sub-Sector Number', 'Sub-Sector']].ffill(axis=1).iloc[:, -1]
df

Pandas: Conditional Row Split

I have a Pandas DataFrame which looks like so:
user_id item_timestamp item_cashtags item_sectors item_industries
0 406225 1483229353 SPY Financial Exchange Traded Fund
1 406225 1483229353 ERO Financial Exchange Traded Fund
2 406225 1483229350 CAKE|IWM|SDS|SPY|X|SPLK|QQQ Services|Financial|Financial|Financial|Basic M... Restaurants|Exchange Traded Fund|Exchange Trad...
3 619769 1483229422 AAPL Technology Personal Computers
4 692735 1483229891 IVOG Financial Exchange Traded Fund
I'd like to split the cashtags, sectors and industries columns by |. Each cashtag corresponds to a sector, which corresponds to an industry, so the three columns always contain the same number of |-separated values.
I'd like the output to be such that each cashtag, sector and industry has its own row, with the item_timestamp and user_id copied over, i.e.:
user_id item_timestamp item_cashtags item_sectors item_industries
2 406225 1483229350 CAKE|IWM|SDS Services|Financial|Financial Restaurants|Exchange Traded Fund|Exchange Traded Fund
would become:
user_id item_timestamp item_cashtags item_sectors item_industries
406225 1483229350 CAKE Services Restaurants
406225 1483229350 IWM Financial Exchange Traded Fund
406225 1483229350 SDS Financial Exchange Traded Fund
My problem is that this is a conditional split, which I'm not sure how to do in Pandas.
If the frame is not too large, one easy option is to just loop through the rows. But I agree it is not the most pandas-idiomatic way to do it, and definitely not the most performant one.
from copy import copy

result = []
for idx, row in df.iterrows():
    d = dict(row)
    # 'cat1' and 'cat2' stand in for any two of your '|'-separated columns
    for cat1, cat2 in zip(d['cat1'].split('|'), d['cat2'].split('|')):
        # here you can add an if to filter on certain categories
        dd = copy(d)
        dd['cat1'] = cat1
        dd['cat2'] = cat2
        result.append(dd)
pd.DataFrame(result)  # convert back
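Adapted to the question's actual column names (a direct substitution on my part, not the answerer's exact code), the loop would read:
result = []
for idx, row in df.iterrows():
    d = dict(row)
    for tag, sec, ind in zip(d['item_cashtags'].split('|'),
                             d['item_sectors'].split('|'),
                             d['item_industries'].split('|')):
        dd = copy(d)
        dd['item_cashtags'] = tag
        dd['item_sectors'] = sec
        dd['item_industries'] = ind
        result.append(dd)
pd.DataFrame(result)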
Okay, I don't know how performant this will be, but here's another approach:
# test data
import pandas as pd

df_dict = {
    "user_id": [406225, 406225],
    "item_timestamp": [1483229350, 1483229353],
    "item_cashtags": ["CAKE|IWM|SDS", "SPY"],
    "item_sectors": ["Services|Financial|Financial", "Financial"],
    "item_industries": [
        "Restaurants|Exchange Traded Fund|Exchange Traded Fund",
        "Exchange Traded Fund",
    ],
}
df = pd.DataFrame(df_dict)

# which columns to split; all others should be "copied" over
split_cols = ["item_cashtags", "item_sectors", "item_industries"]
copy_cols = [col for col in df.columns if col not in split_cols]

# for each column, split on |. This gives a list per row, so .values is an
# array of lists; summing that array concatenates the lists into one long list
new_df_dict = {col: df[col].str.split("|").values.sum() for col in split_cols}

# n_splits tells us how many times to replicate the values from the copied columns
# so that they'll match the new number of rows from splitting the other columns
n_splits = df.item_cashtags.str.count(r"\|") + 1

# we turn each value into a list so that we can easily replicate it the proper
# number of times, then concatenate these lists like with the split columns
for col in copy_cols:
    new_df_dict[col] = (df[col].map(lambda x: [x]) * n_splits).values.sum()

# now make a df back from the dict of columns
new_df = pd.DataFrame(new_df_dict)
# new_df
# item_cashtags item_sectors item_industries user_id item_timestamp
# 0 CAKE Services Restaurants 406225 1483229350
# 1 IWM Financial Exchange Traded Fund 406225 1483229350
# 2 SDS Financial Exchange Traded Fund 406225 1483229350
# 3 SPY Financial Exchange Traded Fund 406225 1483229353
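As a side note, on a reasonably recent pandas (1.3 or later, an assumption about your environment) the same lockstep split can be written with DataFrame.explode, which accepts a list of columns whose per-row lists have equal lengths:
split_cols = ["item_cashtags", "item_sectors", "item_industries"]
new_df = (df.assign(**{col: df[col].str.split("|") for col in split_cols})
            .explode(split_cols)
            .reset_index(drop=True))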

Add new column to dataframe based on an average

I have a dataframe that includes the category of a project, currency, number of investors, goal, etc., and I want to create a new column which will be "average success rate of their category":
state category main_category currency backers country \
0 0 Poetry Publishing GBP 0 GB
1 0 Narrative Film Film & Video USD 15 US
2 0 Narrative Film Film & Video USD 3 US
3 0 Music Music USD 1 US
4 1 Restaurants Food USD 224 US
usd_goal_real duration year hour
0 1533.95 59 2015 morning
1 30000.00 60 2017 morning
2 45000.00 45 2013 morning
3 5000.00 30 2012 morning
4 50000.00 35 2016 afternoon
I have the average success rates in series format:
Dance 65.435209
Theater 63.796134
Comics 59.141527
Music 52.660558
Art 44.889045
Games 43.890467
Film & Video 41.790649
Design 41.594386
Publishing 34.701650
Photography 34.110847
Fashion 28.283186
Technology 23.785582
And now I want to add a new column, where each row gets the success rate matching its category, i.e. wherever the row's main category is Technology, the new column will contain 23.78 for that row:
df['category_success_rate'] = # the success % that matches the category in the 'main_category' column
I think you need GroupBy.transform with a Boolean mask, df['state'].eq(1) or (df['state'] == 1):
df['category_success_rate'] = (df['state'].eq(1)
                               .groupby(df['main_category'])
                               .transform('mean') * 100)
Alternative:
df['category_success_rate'] = ((df['state'] == 1)
                               .groupby(df['main_category'])
                               .transform('mean') * 100)
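If you already have the per-category success rates as the Series shown in the question, you can also map it onto main_category directly (success_rates is my name for that Series, indexed by category name):
df['category_success_rate'] = df['main_category'].map(success_rates)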

How to filter out entries in a data frame with specific and different values?

I have this real estate data:
neighborhood type_property type_negotiation price
Smallville house rent 2000
Oakville apartment for sale 100000
King Bay house for sale 250000
...
I have this groupby that identifies which values in the data set are a house for sale, and then returns the 10th and 90th percentile and quantity of these houses for each neighborhood in a new data frame called df_breakdown. The result looks like this:
neighborhood tenthpercentile ninetiethpercentile Quantity
King Bay 250000.0 250000.0 1
Smallville 99000.0 120000.0 8
Oakville 45000.0 160000.0 6
...
I now want to take this information back to my original real estate data set and filter out every house-for-sale listing whose price is above the 90th percentile or below the 10th percentile calculated for its neighborhood. For example, I would want a house in the Oakville neighborhood priced at 350000 filtered out.
I have used this kind of filter before:
df1 = df[df.price < df.price.quantile(.90)]
But I don't know how to apply it with different values for each neighborhood, or even whether it is the right tool here. Thank you in advance for the help.
Probably not the most elegant option, but you could join the percentile aggregations onto the real estate data (the quantile result has a MultiIndex, hence the unstack before joining):
df.join(df.groupby('neighborhood')['price'].quantile([0.1, 0.9]).unstack(), on='neighborhood')
On mobile, so forgive me if the syntax isn't perfect.
You can set them to have the same index, broadcast the percentiles, and just use .between.
So first,
df2 = df2.set_index('neighborhood')
df = df.set_index('neighborhood')
Then, broadcast using loc
df.loc[:, 't'], df.loc[:, 'n'] = df2.tenthpercentile, df2.ninetiethpercentile
Finally,
df.price.between(df.t, df.n)
which yields
neighborhood
Smallville False
Oakville True
King Bay True
King Bay False
dtype: bool
So to filter, just slice
df[df.price.between(df.t, df.n)]
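Equivalently, without touching the indexes, you can merge the breakdown frame in and filter in one step (a sketch; df_breakdown and its column names are taken from the question):
merged = df.merge(df_breakdown, on='neighborhood', how='left')
filtered = merged[merged['price'].between(merged['tenthpercentile'],
                                          merged['ninetiethpercentile'])]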

(Python) How to group unique values in column with total of another column

This is a sample what my dataframe looks like:
company_name country_code state_code software finance commerce etc......
google USA CA 1 0 0
jimmy GBR unknown 0 0 1
I would like to be able to group the industry of a company with its state code. For example I would like to have the total number of software companies in a state etc. (e.g. 200 software companies in CA, 100 finance companies in NY).
I am currently just counting the number of total companies in each state using:
usa_df['state_code'].value_counts()
But I can't figure out how to group the number of each type of industry in each individual state.
If the 1s and 0s are boolean flags for each category, then you should just need sum:
df[df.country_code == 'USA'].groupby('state_code').sum(numeric_only=True).reset_index()
# state_code commerce finance software
#0 CA 0 0 1
df.groupby(['state_code']).agg({'software' : 'sum', 'finance' : 'sum', ...})
This will group by the state_code, and sum up the number of 'software', 'finance', etc in each grouping.
Could also do a pivot_table:
df.pivot_table(index='state_code', values=['software', 'finance', ...], aggfunc='sum')
This may help you:
result_dataframe = dataframe_name.groupby('state_code').sum()
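To get the counts in the long form the question describes ("200 software companies in CA"), here is a small sketch building on the answers above (industry_cols is my placeholder; list whichever 0/1 industry columns you actually have):
industry_cols = ['software', 'finance', 'commerce']
counts = df.groupby('state_code')[industry_cols].sum().stack().reset_index()
counts.columns = ['state_code', 'industry', 'n_companies']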
