Pandas: Conditional Row Split - python

I have a Pandas DataFrame which looks like so:
user_id item_timestamp item_cashtags item_sectors item_industries
0 406225 1483229353 SPY Financial Exchange Traded Fund
1 406225 1483229353 ERO Financial Exchange Traded Fund
2 406225 1483229350 CAKE|IWM|SDS|SPY|X|SPLK|QQQ Services|Financial|Financial|Financial|Basic M... Restaurants|Exchange Traded Fund|Exchange Trad...
3 619769 1483229422 AAPL Technology Personal Computers
4 692735 1483229891 IVOG Financial Exchange Traded Fund
I'd like to split the cashtags, sectors and industries columns on |. Each cashtag corresponds to a sector, which corresponds to an industry, so the three lists in a row are always the same length.
I'd like the output to be such that each cashtag, sector and industry gets its own row, with item_timestamp and user_id copied over, i.e.:
user_id item_timestamp item_cashtags item_sectors item_industries
2 406225 1483229350 CAKE|IWM|SDS Services|Financial|Financial Restaurants|Exchange Traded Fund|Exchange Traded Fund
would become:
user_id item_timestamp item_cashtags item_sectors item_industries
406225 1483229350 CAKE Services Restaurants
406225 1483229350 IWM Financial Exchange Traded Fund
406225 1483229350 SDS Financial Exchange Traded Fund
My problem is that this is a conditional split, which I'm not sure how to do in Pandas.

If the frame is not too large, one easy option is to just loop through the rows. But I agree it is not the most pandamic way to do it, and definitely not the most performant one.
from copy import copy
import pandas as pd

result = []
for idx, row in df.iterrows():
    d = dict(row)
    for cat1, cat2 in zip(d['cat1'].split('|'), d['cat2'].split('|')):
        # here you can add an if to filter on certain categories
        dd = copy(d)
        dd['cat1'] = cat1
        dd['cat2'] = cat2
        result.append(dd)
pd.DataFrame(result)  # convert back
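For reference, a self-contained sketch of this loop applied to the question's actual columns (sample data assumed from the question):

```python
from copy import copy
import pandas as pd

df = pd.DataFrame({
    "user_id": [406225],
    "item_timestamp": [1483229350],
    "item_cashtags": ["CAKE|IWM|SDS"],
    "item_sectors": ["Services|Financial|Financial"],
    "item_industries": ["Restaurants|Exchange Traded Fund|Exchange Traded Fund"],
})

result = []
for _, row in df.iterrows():
    d = dict(row)
    # zip the three parallel pipe-delimited lists together
    triples = zip(d["item_cashtags"].split("|"),
                  d["item_sectors"].split("|"),
                  d["item_industries"].split("|"))
    for tag, sector, industry in triples:
        dd = copy(d)
        dd["item_cashtags"] = tag
        dd["item_sectors"] = sector
        dd["item_industries"] = industry
        result.append(dd)

out = pd.DataFrame(result)
print(out)
```

Each of the three cashtags ends up on its own row, with user_id and item_timestamp repeated.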

Okay, I don't know how performant this will be, but here's another approach
import pandas as pd

# test data
df_dict = {
    "user_id": [406225, 406225],
    "item_timestamp": [1483229350, 1483229353],
    "item_cashtags": ["CAKE|IWM|SDS", "SPY"],
    "item_sectors": ["Services|Financial|Financial", "Financial"],
    "item_industries": [
        "Restaurants|Exchange Traded Fund|Exchange Traded Fund",
        "Exchange Traded Fund"
    ]
}
df = pd.DataFrame(df_dict)

# which columns to split; all others should be "copied" over
split_cols = ["item_cashtags", "item_sectors", "item_industries"]
copy_cols = [col for col in df.columns if col not in split_cols]

# for each column, split on |. This gives a list per row, so .values is an
# array of lists; summing the values concatenates them into one long list
new_df_dict = {col: df[col].str.split("|").values.sum() for col in split_cols}

# n_splits tells us how many times to replicate the values from the copied columns
# so that they'll match the new number of rows from splitting the other columns
n_splits = df.item_cashtags.str.count(r"\|") + 1

# we turn each value into a list so that we can easily replicate it the proper
# number of times, then concatenate these lists like with the split columns
for col in copy_cols:
    new_df_dict[col] = (df[col].map(lambda x: [x]) * n_splits).values.sum()

# now make a df back from the dict of columns
new_df = pd.DataFrame(new_df_dict)
# new_df
# item_cashtags item_sectors item_industries user_id item_timestamp
# 0 CAKE Services Restaurants 406225 1483229350
# 1 IWM Financial Exchange Traded Fund 406225 1483229350
# 2 SDS Financial Exchange Traded Fund 406225 1483229350
# 3 SPY Financial Exchange Traded Fund 406225 1483229353
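On newer pandas (1.3+, which can explode several columns at once), the same result can be had more directly with str.split plus DataFrame.explode; a sketch on the same test data:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [406225, 406225],
    "item_timestamp": [1483229350, 1483229353],
    "item_cashtags": ["CAKE|IWM|SDS", "SPY"],
    "item_sectors": ["Services|Financial|Financial", "Financial"],
    "item_industries": [
        "Restaurants|Exchange Traded Fund|Exchange Traded Fund",
        "Exchange Traded Fund",
    ],
})

split_cols = ["item_cashtags", "item_sectors", "item_industries"]
# turn each pipe-delimited string into a list
for col in split_cols:
    df[col] = df[col].str.split("|")

# explode all three list columns in lockstep (requires pandas >= 1.3)
out = df.explode(split_cols, ignore_index=True)
print(out)
```

The lists must be the same length within each row, which the question guarantees.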

Related

How to separate Numbers from string and move them to next column in Python?

I am working on a share market data and in some columns market cap has shifted to previous column. I am trying to fetch them in next column but the value it's returning is completely different.
This is the code I am using -
data['Market Cap (Crores)']=data['Sub-Sector'].astype('str').str.extractall('(\d+)').unstack().fillna('').sum(axis=1).astype(int)
data['Market Cap (Crores)']
But the output I am getting is
968 NaN
969 NaN
970 -2.147484e+09
971 -2.147484e+09
972 -2.147484e+09
How do I get the correct values?
You can do it step by step. First, pick out the rows that need fixing (where the market cap is NaN). Then create two functions: one to pull the market cap out of the string, and one to remove it. Use apply to fix up the rows, and substitute the values back into the original dataframe.
import pandas as pd
import numpy as np

data = [
    ['GNA Axles Ltd', 'Auto Parts', 1138.846797],
    ['Andhra Paper Ltd', 'Paper Products', 1135.434614],
    ['Tarc', 'Real Estate 1134.645409', np.NaN],
    ['Udaipur Cement Works', 'Cement 1133.531734', np.NaN],
    ['Pnb Gifts', 'Investment Banking 1130.463641', np.NaN],
]

def getprice(row):
    return float(row['Sub-Sector'].split()[-1])

def removeprice(row):
    return ' '.join(row['Sub-Sector'].split()[:-1])

df = pd.DataFrame(data, columns=['Company', 'Sub-Sector', 'Market Cap (Crores)'])
print(df)

picks = df['Market Cap (Crores)'].isna()
rows = df[picks]
print(rows)

df.loc[picks, 'Sub-Sector'] = rows.apply(removeprice, axis=1)
df.loc[picks, 'Market Cap (Crores)'] = rows.apply(getprice, axis=1)
print(df)
Output:
Company Sub-Sector Market Cap (Crores)
0 GNA Axles Ltd Auto Parts 1138.846797
1 Andhra Paper Ltd Paper Products 1135.434614
2 Tarc Real Estate 1134.645409 NaN
3 Udaipur Cement Works Cement 1133.531734 NaN
4 Pnb Gifts Investment Banking 1130.463641 NaN
Company Sub-Sector Market Cap (Crores)
2 Tarc Real Estate 1134.645409 NaN
3 Udaipur Cement Works Cement 1133.531734 NaN
4 Pnb Gifts Investment Banking 1130.463641 NaN
Company Sub-Sector Market Cap (Crores)
0 GNA Axles Ltd Auto Parts 1138.846797
1 Andhra Paper Ltd Paper Products 1135.434614
2 Tarc Real Estate 1134.645409
3 Udaipur Cement Works Cement 1133.531734
4 Pnb Gifts Investment Banking 1130.463641
Hi there,
Here is a method you can try: use your code to create a numeric field, then select the non-missing value from Sub-Sector Number and Sub-Sector to create your final field, Sub-Sector final.
df['Sub-Sector Number'] = df['Sub-Sector'].astype('str').str.extractall(r'(\d+)').unstack().fillna('').sum(axis=1).astype(int)
df['Sub-Sector final'] = df[['Sub-Sector Number', 'Sub-Sector']].ffill(axis=1).iloc[:, -1]
df
Please try it, and if it doesn't work please let me know.
Thanks, Leon
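For what it's worth, the huge negative values in the question (about -2.147e+09, i.e. INT32_MIN) look like int32 overflow from .astype(int) after concatenating all matched digit groups. Extracting the number as a single float avoids both problems; a sketch on made-up rows, with column names assumed from the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Sub-Sector": ["Auto Parts", "Real Estate 1134.645409"],
    "Market Cap (Crores)": [1138.846797, np.nan],
})

# pull a trailing decimal number out of Sub-Sector as a float (NaN if absent)
num = df["Sub-Sector"].str.extract(r"(\d+\.?\d*)\s*$", expand=False).astype(float)

# fill the missing market caps and strip the number back out of the text
df["Market Cap (Crores)"] = df["Market Cap (Crores)"].fillna(num)
df["Sub-Sector"] = df["Sub-Sector"].str.replace(r"\s*\d+\.?\d*$", "", regex=True)
print(df)
```

Rows whose Sub-Sector contains no number are left untouched.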

group two dataframes with different sizes in python pandas

I've got two data frames, one has historical prices of stocks in this format:
year  Company1  Company2
1980      4.66     12.32
1981      5.68     15.53
etc., with hundreds of columns. Then I have a dataframe specifying each company, its sector and its country:
company 1  industrials     Germany
company 2  consumer goods  US
company 3  industrials     France
I used the first dataframe to plot the prices of various companies over time. I'd now like to join the data from the first table with the second one and create a separate dataframe holding the total value per sector over time, i.e.:
year  industrials  consumer goods  healthcare
1980        50.65           42.23       25.65
1981        55.65           43.23       26.15
Thank you
You can do the following, assuming df_1 is your DataFrame with price of stock per year and company, and df_2 your DataFrame with information on the companies:
# turn company columns into rows
df_1 = df_1.melt(id_vars='year', var_name='company')
df_1 = df_1.merge(df_2)
# groupby and move industry to columns
output = df_1.groupby(['year', 'industry'])['value'].sum().unstack('industry')
Output:
industry consumer goods industrials
year
1980 12.32 4.66
1981 15.53 5.68
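A self-contained version of that pipeline, with column names for the second frame (company, industry, country) assumed, since the question doesn't name them:

```python
import pandas as pd

# hypothetical frames mirroring the question's layout
df_1 = pd.DataFrame({
    "year": [1980, 1981],
    "Company1": [4.66, 5.68],
    "Company2": [12.32, 15.53],
})
df_2 = pd.DataFrame({
    "company": ["Company1", "Company2"],
    "industry": ["industrials", "consumer goods"],
    "country": ["Germany", "US"],
})

# turn company columns into rows, attach industry, then pivot industries wide
long = df_1.melt(id_vars="year", var_name="company")
merged = long.merge(df_2, on="company")
out = merged.groupby(["year", "industry"])["value"].sum().unstack("industry")
print(out)
```

With more companies per industry, the groupby sum aggregates them into a single column per sector.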

Python Group by and Sum with a Blank space

Revenue by Segment and Country
I have a dataframe with Revenue by Segment and by Country. I want to get the Total Revenue by Country code. So I want the output to be:
Country Revenue
FR 26.38
AE 12.02
This is what the data frame looks like now:
Country Segment Revenue
FR
Digital Games $2.40
Music $20.79
Health and Fitness $0.46
Tech Enthusiasts $2.73
AE
Digital Games $9.99
Games and Toys $2.03
AT
Entertainment-Music $0.09
AU
Shopping $52.45
Auto Enthusiasts $7.86
Auto Owners $25.92
Culture and Arts $8.04
Higher Education $25.81
Digital Games $2.60
Games and Toys $6.12
I'm assuming that your empty entries are NaN; if they are not, I advise you to make them NaN. The general idea is to fill forward in your country column, then drop null values, which places the country code next to each row that contains data while removing the header rows. The groupby + sum is a simple operation from that point.
ffill + dropna + groupby
d = dict(
    Country=df.Country.ffill(),
    Revenue=df.Revenue.str.strip('$').astype(float)
)
df.assign(**d).dropna().groupby('Country')['Revenue'].sum()
Country
AE 12.02
AT 0.09
AU 128.80
FR 26.38
Name: Revenue, dtype: float64
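A runnable sketch of the same idea on a small frame shaped like the question's (country code on its own header row, NaN elsewhere; values invented):

```python
import pandas as pd
import numpy as np

# hypothetical frame: the country code sits on its own header row and the
# detail rows have NaN in the Country column
df = pd.DataFrame({
    "Country": ["FR", np.nan, np.nan, "AE", np.nan],
    "Segment": [np.nan, "Digital Games", "Music", np.nan, "Digital Games"],
    "Revenue": [np.nan, "$2.40", "$20.79", np.nan, "$9.99"],
})

out = (df.assign(Country=df.Country.ffill(),                       # propagate code down
                 Revenue=df.Revenue.str.strip("$").astype(float))  # "$2.40" -> 2.4
         .dropna(subset=["Revenue"])                               # drop header rows
         .groupby("Country")["Revenue"].sum())
print(out)
```

The header rows disappear because their Revenue is NaN, so only real data rows are summed.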

(Python) How to group unique values in column with total of another column

This is a sample what my dataframe looks like:
company_name country_code state_code software finance commerce etc......
google USA CA 1 0 0
jimmy GBR unknown 0 0 1
I would like to be able to group the industry of a company with its state code. For example I would like to have the total number of software companies in a state etc. (e.g. 200 software companies in CA, 100 finance companies in NY).
I am currently just counting the number of total companies in each state using:
usa_df['state_code'].value_counts()
But I can't figure out how to group the number of each type of industry in each individual state.
If the 1s and 0s are boolean flags for each category, then you should just need sum.
df[df.country_code == 'USA'].groupby('state_code').sum().reset_index()
# state_code commerce finance software
#0 CA 0 0 1
df.groupby(['state_code']).agg({'software' : 'sum', 'finance' : 'sum', ...})
This will group by the state_code, and sum up the number of 'software', 'finance', etc in each grouping.
Could also do a pivot_table:
df.pivot_table(index = 'state_code', values = ['software', 'finance', ...], aggfunc = 'sum')
This may help you:
result_dataframe = dataframe_name.groupby('state_code').sum()
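Putting it together on a toy frame (selecting the indicator columns explicitly keeps non-numeric columns like company_name out of the sum):

```python
import pandas as pd

df = pd.DataFrame({
    "company_name": ["google", "jimmy", "acme"],
    "country_code": ["USA", "GBR", "USA"],
    "state_code": ["CA", "unknown", "CA"],
    "software": [1, 0, 1],
    "finance": [0, 0, 0],
    "commerce": [0, 1, 0],
})

# count companies per industry per state, restricted to USA rows
counts = (df[df.country_code == "USA"]
          .groupby("state_code")[["software", "finance", "commerce"]]
          .sum())
print(counts)
```

Each cell is the number of companies of that industry in that state.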

How to fill dataframe content NA value in empty cell?

I have a dataframe df:
Open Volume Adj Close Ticker
Date
2006-11-22 140.750000 45505300 114.480649 SPY
I want to change df to another dataframe Open price like below:
SPY AGG
Date
2006-11-22 140.750000 NA
It only use open's data and two tickers, so how to change one dataframe to another?
I think you can use the DataFrame constructor with reindex by a list of tickers L:
L = ['SPY', 'AGG']
df1 = pd.DataFrame({'SPY': [df.Open.iloc[0]]},
                   index=[df.index[0]])
df1 = df1.reindex(columns=L)
print(df1)
SPY AGG
2006-11-22 140.75 NaN
You can use read_html to find the list of tickers:
df2 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies', header=0)[0]
#print (df2)
#filter only Ticker symbols starts with SP
df2 = df2[df2['Ticker symbol'].str.startswith('SP')]
print (df2)
Ticker symbol Security SEC filings \
407 SPG Simon Property Group Inc reports
415 SPGI S&P Global, Inc. reports
418 SPLS Staples Inc. reports
GICS Sector GICS Sub Industry \
407 Real Estate REITs
415 Financials Diversified Financial Services
418 Consumer Discretionary Specialty Stores
Address of Headquarters Date first added CIK
407 Indianapolis, Indiana NaN 1063761
415 New York, New York NaN 64040
418 Framingham, Massachusetts NaN 791519
# convert column to list, add SPY because it is missing
L = ['SPY'] + df2['Ticker symbol'].tolist()
print(L)
['SPY', 'SPG', 'SPGI', 'SPLS']
df1 = pd.DataFrame({'SPY': [df.Open.iloc[0]]},
                   index=[df.index[0]])
df1 = df1.reindex(columns=L)
print(df1)
SPY SPG SPGI SPLS
2006-11-22 140.75 NaN NaN NaN
Suppose you have a list of data frames df_list for different tickers, and every item of the list has the same shape as the df in your example.
You can first concatenate them into one frame with:
df1 = pd.concat(df_list)
Then with
df1[["Open", "Ticker"]].reset_index().set_index(["Date", "Ticker"]).unstack()
It should give you an output like
            Open
Ticker       AGG     SPY
Date
2006-11-22   NaN  140.75
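A runnable sketch of the concat-and-unstack route with two hypothetical one-row frames (the AGG price is invented):

```python
import pandas as pd

# hypothetical per-ticker frames shaped like the question's df
spy = pd.DataFrame({"Open": [140.75], "Ticker": ["SPY"]},
                   index=pd.to_datetime(["2006-11-22"]).rename("Date"))
agg = pd.DataFrame({"Open": [99.5], "Ticker": ["AGG"]},
                   index=pd.to_datetime(["2006-11-22"]).rename("Date"))

df1 = pd.concat([spy, agg])

# move Ticker into the index alongside Date, then pivot it out to columns
out = (df1[["Open", "Ticker"]]
       .reset_index()
       .set_index(["Date", "Ticker"])
       .unstack())
print(out)
```

Tickers missing on a given date come out as NaN in their column, which matches the desired output.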
