Group two dataframes with different sizes in python pandas

I've got two data frames, one has historical prices of stocks in this format:
year    Company1    Company2
1980        4.66       12.32
1981        5.68       15.53
etc., with hundreds of columns. Then I have a dataframe specifying each company, its sector and its country:
company 1    industrials       Germany
company 2    consumer goods    US
company 3    industrials       France
I used the first dataframe to plot the prices of various companies over time. However, I'd now like to somehow group the data from the first table with the second one and create a separate dataframe holding each sector's total value over time, i.e.:
year    industrials    consumer goods    healthcare
1980          50.65             42.23         25.65
1981          55.65             43.23         26.15
Thank you

You can do the following, assuming df_1 is your DataFrame with the price of each stock per year and company, and df_2 is your DataFrame with information on the companies:
# turn the company columns into rows (long format: year, company, value)
df_1 = df_1.melt(id_vars='year', var_name='company')
# attach each company's industry via the shared 'company' column
df_1 = df_1.merge(df_2)
# group by year and industry, then move industry to columns
output = df_1.groupby(['year', 'industry'])['value'].sum().unstack('industry')
Output:
industry  consumer goods  industrials
year
1980               12.32         4.66
1981               15.53         5.68
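For reference, here is a minimal end-to-end sketch built from the sample data in the question; the df_2 column names company, industry and country are assumptions, since the question doesn't show its headers:

import pandas as pd

# sample data reconstructed from the question
df_1 = pd.DataFrame({'year': [1980, 1981],
                     'Company1': [4.66, 5.68],
                     'Company2': [12.32, 15.53]})
df_2 = pd.DataFrame({'company': ['Company1', 'Company2'],
                     'industry': ['industrials', 'consumer goods'],
                     'country': ['Germany', 'US']})

df_long = df_1.melt(id_vars='year', var_name='company')  # wide -> long
merged = df_long.merge(df_2, on='company')               # attach industry
output = (merged.groupby(['year', 'industry'])['value']
                .sum()
                .unstack('industry'))
print(output)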

Related

For each unique index value in MultiIndex level 0, print index if values (strings) in another column are not unique

I'm working with panel data looking like this (only relevant columns included):
Ticker  Year  Account_number  Industry
AAA     2018  xxxx            Fossil
        2019  xxxx            Fossil
        2020  xxxx            Fossil
BBB     2018  yyyy            Materials
        2019  yyyy            Services
        2020  yyyy            Materials
CCC     2018  zzzz            Services
        2019  zzzz            Services
        2020  zzzz            Services
Tickers (level 0 of MultiIndex) are used to identify individual and unique units in the panel. Each unit is observed over 3 years (level 1 of MultiIndex).
When I groupby('Industry') I end up double-counting the units since the same ticker is associated with more than one industry (as with ticker 'BBB').
The goal is to identify and print the tickers having this issue, and to assign them to a single industry.
I'm thinking of some code that returns the ticker if the string in the industry column is not unique, so that I can manually change it later.
Thanks for your help!
PS This is my first question here so pls let me know if you want me to be more specific or show more details about the df
If all of the values for Industry should be the same for each Ticker, then you should do this the other way round.
Instead of using groupby() on Industry, use groupby() on Ticker and then loop through the groups, returning only those for which grouped_df.Industry.nunique() > 1 (within a single ticker's group, Ticker itself is always unique, so it is the number of distinct industries you need to test).
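A short sketch of that idea, assuming the frame is indexed by Ticker and Year as shown above:

import pandas as pd

df = pd.DataFrame({
    'Ticker': ['AAA'] * 3 + ['BBB'] * 3 + ['CCC'] * 3,
    'Year': [2018, 2019, 2020] * 3,
    'Account_number': ['xxxx'] * 3 + ['yyyy'] * 3 + ['zzzz'] * 3,
    'Industry': ['Fossil'] * 3
                + ['Materials', 'Services', 'Materials']
                + ['Services'] * 3,
}).set_index(['Ticker', 'Year'])

# number of distinct industries per ticker; > 1 flags the problem cases
n_industries = df.groupby(level='Ticker')['Industry'].nunique()
print(n_industries[n_industries > 1].index.tolist())  # ['BBB']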

When merging DataFrames on a common column like ID (primary key), how do you handle data that appears more than once for a single ID in the second df?

So I have two dfs.
DF1
Superhero ID  Superhero   City
212121        Spiderman   New york
364331        Ironman     New york
678523        Batman      Gotham
432432        Dr Strange  New york
665544        Thor        Asgard
123456        Superman    Metropolis
555555        Nightwing   Gotham
666666        Loki        Asgard
Df2
SID     Mission End date
665544  10/10/2020
665544  03/03/2021
212121  02/02/2021
665544  05/12/2020
212121  15/07/2021
123456  03/06/2021
666666  12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter will their missions be complete. I'll be able to match the superhero (and their city) in df1 to the mission end date via their Superhero ID or SID in Df2 ('Superhero Id'=='SID'). Superhero IDs appear only once in Df1 but can appear multiple times in DF2.
Ultimately I need a count for the total no. of heroes in the different cities (which I can do - see below) as well as how many heroes will be free per quarter.
These are the thresholds for the quarters
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
The following code tells me how many heroes are in each city:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which produces:
City Count
New york 3
Gotham 2
Asgard 2
Metropolis 1
I can also convert the dates into datetime format via the following operation:
# Convert to datetime series (the dates are day-first, e.g. 15/07/2021)
Df2['Mission End date'] = pd.to_datetime(Df2['Mission End date'], dayfirst=True)
Ultimately I need a new df that looks like this
City        Total Count  No. of heroes free in Q3  No. of heroes free in Q4  Free in Q1 2021+
New york    3            2                         0                         1
Gotham      2            2                         2                         0
Asgard      2            1                         2                         0
Metropolis  1            0                         0                         1
If anyone can help me create the appropriate quarters and sort the counts into the appropriate columns, I'd be extremely grateful. I'd also like a way to handle heroes with multiple mission end dates; I can't ignore them, I still need to count them. I suspect I'll need to create a custom function which I can then apply to each row via the apply() method and a lambda expression. This issue has been a pain for a while now, so I'd appreciate all the help I can get. Thank you very much :)
After merging your dataframes with
df = df1.merge(df2, left_on='Superhero ID', right_on='SID')
and converting your date column to datetime format
df = df.assign(mission_end_date=lambda x: pd.to_datetime(x['Mission End date'], dayfirst=True))
you can create two columns, one extracting the quarter and one extracting the year of the newly created datetime column:
df = (df.assign(quarter_end_date=lambda x: x.mission_end_date.dt.quarter)
        .assign(year_end_date=lambda x: x.mission_end_date.dt.year))
Then combine them into a column that shows the quarter in a "Qx, yyyy" format (this has to be done element-wise on the Series, not with a single f-string over the whole column):
df = df.assign(quarter_year_end=lambda x: 'Q' + x.quarter_end_date.astype(str)
                                          + ', ' + x.year_end_date.astype(str))
Finally, group by city and quarter, count the number of superheroes, and pivot the dataframe to get your desired result:
result = (df.groupby(['City', 'quarter_year_end'])
            .count()
            .reset_index()
            .pivot(index='City', columns='quarter_year_end', values='Superhero'))
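One caveat: dt.quarter returns calendar quarters (Jan-Mar = Q1), while the question defines quarters starting in April (Apr-Jun = Q1). A minimal sketch of one way to derive the question's quarter numbering instead; the mission_end_date column name follows the answer above:

# map the calendar month to the question's quarter scheme:
# Apr-Jun -> 1, Jul-Sep -> 2, Oct-Dec -> 3, Jan-Mar -> 4
df = df.assign(fiscal_quarter=lambda x: (x.mission_end_date.dt.month - 4) % 12 // 3 + 1)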

Selecting all values greater than a number in a pandas data frame

I have a dataframe like this with more than 50 columns (for years from 1963 to 2016). I want to select all countries with a population over a certain number (say 60 million). All the existing questions I found were about picking values from a single column, which is not the case here. I also tried
df[df.T[(df.T > 0.33)].any()], as was suggested in an answer, but it doesn't work. Any ideas?
The data frame looks like this:
Country  Country_Code  Year_1979  Year_1999  Year_2013
Aruba    ABW             59980.0      89005    103187.0
Angola   AGO           8641521.0   15949766  25998340.0
Albania  ALB           2617832.0    3108778   2895092.0
Andorra  AND             34818.0      64370     80788.0
First select only the columns with Year in their names using DataFrame.filter, compare all values, and then use DataFrame.any to test for at least one match per row:
df1 = df[(df.filter(like='Year') > 2000000).any(axis=1)]
print (df1)
Country Country_Code Year_1979 Year_1999 Year_2013
1 Angola AGO 8641521.0 15949766 25998340.0
2 Albania ALB 2617832.0 3108778 2895092.0
Or compare all columns except the first two, selected by position with DataFrame.iloc:
df1 = df[(df.iloc[:, 2:] > 2000000).any(axis=1)]
print (df1)
Country Country_Code Year_1979 Year_1999 Year_2013
1 Angola AGO 8641521.0 15949766 25998340.0
2 Albania ALB 2617832.0 3108778 2895092.0
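As a side note, if the goal were countries above the threshold in every year rather than in at least one, swapping any for all is enough; a small sketch under the same sample data and threshold:

# keep rows where every Year_* value exceeds the threshold
df_all = df[(df.filter(like='Year') > 2000000).all(axis=1)]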

Pandas: Conditional Row Split

I have a Pandas DataFrame which looks like so:
user_id item_timestamp item_cashtags item_sectors item_industries
0 406225 1483229353 SPY Financial Exchange Traded Fund
1 406225 1483229353 ERO Financial Exchange Traded Fund
2 406225 1483229350 CAKE|IWM|SDS|SPY|X|SPLK|QQQ Services|Financial|Financial|Financial|Basic M... Restaurants|Exchange Traded Fund|Exchange Trad...
3 619769 1483229422 AAPL Technology Personal Computers
4 692735 1483229891 IVOG Financial Exchange Traded Fund
I'd like to split the cashtags, sectors and industries columns by |. Each cashtag corresponds to a sector which corresponds to an industry, so they are of equal amounts.
I'd like the output to be such that each cashtag, sector and industry have their own row, with the item_timestamp and user_id copying over, ie:
user_id item_timestamp item_cashtags item_sectors item_industries
2 406225 1483229350 CAKE|IWM|SDS Services|Financial|Financial Restaurants|Exchange Traded Fund|Exchange Traded Fund
would become:
user_id  item_timestamp  item_cashtags  item_sectors  item_industries
406225   1483229350      CAKE           Services      Restaurants
406225   1483229350      IWM            Financial     Exchange Traded Fund
406225   1483229350      SDS            Financial     Exchange Traded Fund
My problem is that this is a conditional split, which I'm not sure how to do in pandas.
If the frame is not too large, one easy option is to just loop through the rows. But I agree it is not the most 'pandamic' way to do it, and definitely not the most performant one.
from copy import copy

result = []
for idx, row in df.iterrows():
    d = dict(row)
    # cat1/cat2 are generic stand-ins for the '|'-separated columns,
    # e.g. item_cashtags and item_sectors
    for cat1, cat2 in zip(d['cat1'].split('|'), d['cat2'].split('|')):
        # here you can add an if to filter on certain categories
        dd = copy(d)
        dd['cat1'] = cat1
        dd['cat2'] = cat2
        result.append(dd)

pd.DataFrame(result)  # convert back
Okay, I don't know how performant this will be, but here's another approach:
# test data
df_dict = {
    "user_id": [406225, 406225],
    "item_timestamp": [1483229350, 1483229353],
    "item_cashtags": ["CAKE|IWM|SDS", "SPY"],
    "item_sectors": ["Services|Financial|Financial", "Financial"],
    "item_industries": [
        "Restaurants|Exchange Traded Fund|Exchange Traded Fund",
        "Exchange Traded Fund"
    ]
}
df = pd.DataFrame(df_dict)

# which columns to split; all others should be "copied" over
split_cols = ["item_cashtags", "item_sectors", "item_industries"]
copy_cols = [col for col in df.columns if col not in split_cols]

# for each column, split on |. This gives a list per row, so .values is an
# array of lists; summing concatenates them into one long list
new_df_dict = {col: df[col].str.split("|").values.sum() for col in split_cols}

# n_splits tells us how many times to replicate the values from the copied
# columns so that they match the new number of rows after splitting
n_splits = df.item_cashtags.str.count(r"\|") + 1

# we turn each value into a list so that we can easily replicate it the proper
# number of times, then concatenate these lists like with the split columns
for col in copy_cols:
    new_df_dict[col] = (df[col].map(lambda x: [x]) * n_splits).values.sum()

# now make a df back from the dict of columns
new_df = pd.DataFrame(new_df_dict)

# new_df
#   item_cashtags item_sectors       item_industries  user_id  item_timestamp
# 0          CAKE     Services           Restaurants   406225      1483229350
# 1           IWM    Financial  Exchange Traded Fund   406225      1483229350
# 2           SDS    Financial  Exchange Traded Fund   406225      1483229350
# 3           SPY    Financial  Exchange Traded Fund   406225      1483229353
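For what it's worth, recent pandas versions make this much shorter: split each column into lists, then explode them together. A minimal sketch reusing the test df from above; exploding several columns in one call requires pandas 1.3+:

split_cols = ["item_cashtags", "item_sectors", "item_industries"]
exploded = (df.assign(**{col: df[col].str.split("|") for col in split_cols})
              .explode(split_cols)
              .reset_index(drop=True))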

Iterate over rows and save as csv

I am working with this indexed DataFrame, which looks like this:
         year  personal  economic  human  rank
country
Albania  2008      7.78      7.22   7.50    49
Albania  2009      7.86      7.31   7.59    46
Albania  2010      7.76      7.35   7.55    49
Germany  2011      7.76      7.24   7.50    53
Germany  2012      7.67      7.20   7.44    54
It has 162 countries over 9 years. What I would like to do is:
1. Create a for loop that returns a new dataframe with the data for each country, showing only the values for personal, economic, human, and rank.
2. Save each dataframe as a .csv with the name of the country the data belongs to.
Iterate through unique values of country and year. Get data related to that country and year in another dataframe. Save it.
df.reset_index(inplace=True)  # convert the 'country' index from the example into a column
unique_val = df[['country', 'year']].drop_duplicates()
for _, country, year in unique_val.itertuples():
    file_name = country + '_' + str(year) + '.csv'
    out_df = df[(df.country == country) & (df.year == year)]
    out_df = out_df.loc[:, ~out_df.columns.isin(['country', 'year'])]
    print(out_df)
    out_df.to_csv(file_name)
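That writes one file per (country, year) pair. If instead you want one file per country covering all of its years, as the question describes, a shorter sketch (assuming 'country' is still the index, as in the example):

# one CSV per country, keeping only the requested columns
for country, group in df.groupby(level='country'):
    group[['personal', 'economic', 'human', 'rank']].to_csv(country + '.csv')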
