Monthly mean of values in a DataFrame with dates as columns - python

Thanks for investing time to help me out :)
I have DataFrame (df_NSE_Price_) like below:
Company Name               ID      2000-01-03 00:00:00  2000-01-04 00:00:00  ....
Reliance Industries Ltd.   100325  50.810               54.
Tata Consultancy Service   123455  123                  125
..
I would want output like below :
Company Name               ID      March 00  April 00  .....
Reliance Industries Ltd    100325  52        55
Tata Consultancy Services  123455  124.3     124
..
The output has to contain the month-wise average of the data.
So far I have tried
df_NSE_Price_.resample('M', axis=1).mean()
but this gave me the error:
Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

Something like this should work:
df.transpose().resample('M').mean().transpose()
After the transpose the dates sit on the row index, so resample no longer needs axis=1 (and the date columns must already be a DatetimeIndex).
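Before the transpose, the non-date columns also have to move into the index and the remaining labels have to become real datetimes. A minimal sketch, assuming the frame is named df_NSE_Price_ and shaped like the sample above:
import pandas as pd

df = df_NSE_Price_.set_index(['Company Name', 'ID'])  # keep only the date columns
df.columns = pd.to_datetime(df.columns)               # string labels -> DatetimeIndex
monthly = df.transpose().resample('M').mean().transpose()  # month-end means per company
monthly.columns = monthly.columns.to_period('M')      # show columns as months, e.g. 2000-01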

First, I converted the data to a DataFrame (I added a column with February data, too).
import pandas as pd

columns = ('Company Name',
           'ID',
           '2000-01-03 00:00:00',
           '2000-01-04 00:00:00',
           '2000-02-04 00:00:00')
data = [('Reliance Industries Ltd.', 100325, 50.810, 54., 66.0),
        ('Tata Consultancy Service', 123455, 123, 125, 130.0)]
df = pd.DataFrame(data=data, columns=columns)
Second, I created a two-level index (MultiIndex) using Company Name and ID, so that all remaining column labels are dates. Then I converted the column labels to datetime format using pd.to_datetime():
df = df.set_index(['Company Name', 'ID'])
df.columns = pd.to_datetime(df.columns)
Third, I resampled in monthly intervals, using axis=1 to aggregate across columns. This creates one column per month. Convert from month-end dates to periods with to_period():
df = df.resample('M', axis=1).sum()
df.columns = df.columns.to_period('M')
(The question asks for monthly averages, so use .mean() instead of .sum() if you want means rather than totals; the sums below just illustrate the aggregation.)
                                 2000-01  2000-02
Company Name             ID
Reliance Industries Ltd. 100325   104.81     66.0
Tata Consultancy Service 123455   248.00    130.0

dataframe pivot based on 3 columns

I have a data frame like the one shown below:
customer  organization  currency       volume  revenue  Duration
Peter     XYZ Ltd       CNY, INR       20      3,000    01-Oct-2022
John      abc Ltd       INR            7       184      01-Oct-2022
Mary      aaa Ltd       USD            3       43       03-Oct-2022
John      bbb Ltd       THB            17      2,300    04-Oct-2022
Dany      ccc Ltd       CNY, INR, KRW  45      15,100   04-Oct-2022
If I pivot as shown below
df = pd.pivot_table(df, values=['runs', 'volume', 'revenue'],
                    index=['customer', 'organization', 'currency'],
                    columns=['Duration'],
                    aggfunc='sum',
                    fill_value=0)
the result has the value names at level 0 of the columns (volume for all Durations, revenue for all Durations, runs for all Durations) and Duration at level 1.
I would instead like Duration as level 0, with volume, revenue and runs under it at level 1.
How can I achieve this?
You can use swaplevel on your current pivot code; try this:
df1 = df.pivot_table(index=['customer', 'organization', 'currency'],
                     columns=['Duration'],
                     aggfunc='sum',
                     fill_value=0).swaplevel(0, 1, axis=1).sort_index(axis=1)
Hope this helps...
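A self-contained sketch of the swap on data shaped like the sample (the runs column is assumed here, since it appears in the pivot call but not in the sample table):
import pandas as pd

df = pd.DataFrame({'customer': ['Peter', 'John', 'Mary'],
                   'organization': ['XYZ Ltd', 'abc Ltd', 'aaa Ltd'],
                   'currency': ['CNY, INR', 'INR', 'USD'],
                   'volume': [20, 7, 3],
                   'revenue': [3000, 184, 43],
                   'runs': [1, 2, 3],  # hypothetical values
                   'Duration': ['01-Oct-2022', '01-Oct-2022', '03-Oct-2022']})
df1 = (df.pivot_table(index=['customer', 'organization', 'currency'],
                      columns=['Duration'],
                      aggfunc='sum',
                      fill_value=0)
         .swaplevel(0, 1, axis=1)  # Duration becomes level 0 of the columns
         .sort_index(axis=1))      # regroup the columns under each Duration
print(df1.columns.tolist())
# [('01-Oct-2022', 'revenue'), ('01-Oct-2022', 'runs'), ('01-Oct-2022', 'volume'), ...]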

group two dataframes with different sizes in python pandas

I've got two data frames, one has historical prices of stocks in this format:
year   Company1  Company2
1980   4.66      12.32
1981   5.68      15.53
etc., with hundreds of columns. Then I have a dataframe specifying each company, its sector and its country:
company 1  industrials     Germany
company 2  consumer goods  US
company 3  industrials     France
I used the first dataframe to plot the prices of various companies over time. I'd now like to combine the data from the first table with the second one and create a separate dataframe holding each sector's total value over time, i.e.:
year   industrials  consumer goods  healthcare
1980   50.65        42.23           25.65
1981   55.65        43.23           26.15
Thank you
You can do the following, assuming df_1 is your DataFrame with price of stock per year and company, and df_2 your DataFrame with information on the companies:
# turn company columns into rows
df_1 = df_1.melt(id_vars='year', var_name='company')
df_1 = df_1.merge(df_2)
# groupby and move industry to columns
output = df_1.groupby(['year', 'industry'])['value'].sum().unstack('industry')
Output:
industry  consumer goods  industrials
year
1980               12.32         4.66
1981               15.53         5.68
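For the merge to line up, df_2 needs a column matching melt's var_name ('company') plus the 'industry' column. A minimal end-to-end sketch with assumed column names, since the question's second table has no headers:
import pandas as pd

df_1 = pd.DataFrame({'year': [1980, 1981],
                     'company 1': [4.66, 5.68],
                     'company 2': [12.32, 15.53]})
df_2 = pd.DataFrame({'company': ['company 1', 'company 2'],
                     'industry': ['industrials', 'consumer goods'],
                     'country': ['Germany', 'US']})
long = df_1.melt(id_vars='year', var_name='company')  # one row per (year, company)
long = long.merge(df_2)                               # joins on the shared 'company' column
output = long.groupby(['year', 'industry'])['value'].sum().unstack('industry')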

When merging Dataframes on a common column like ID (primary key), how do you handle data that appears more than once for a single ID in the second df?

So I have two dfs.
DF1
Superhero ID Superhero City
212121 Spiderman New york
364331 Ironman New york
678523 Batman Gotham
432432 Dr Strange New york
665544 Thor Asgard
123456 Superman Metropolis
555555 Nightwing Gotham
666666 Loki Asgard
Df2
SID Mission End date
665544 10/10/2020
665544 03/03/2021
212121 02/02/2021
665544 05/12/2020
212121 15/07/2021
123456 03/06/2021
666666 12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter will their missions be complete. I'll be able to match the superhero (and their city) in df1 to the mission end date via their Superhero ID or SID in Df2 ('Superhero Id'=='SID'). Superhero IDs appear only once in Df1 but can appear multiple times in DF2.
Ultimately I need a count for the total no. of heroes in the different cities (which I can do - see below) as well as how many heroes will be free per quarter.
These are the thresholds for the quarters
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
The following code tells me how many heroes are in each city:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which produces:
City Count
New york 3
Gotham 2
Asgard 2
Metropolis 1
I can also convert the dates into datetime format via the following operation:
# Convert to a datetime series (dates are day-first, e.g. 15/07/2021)
Df2['Mission End date'] = pd.to_datetime(Df2['Mission End date'], dayfirst=True)
Ultimately I need a new df that looks like this
City Total Count No. of heroes free in Q3 No. of heroes free in Q4 Free in Q1 2021+
New york 3 2 0 1
Gotham 2 2 2 0
Asgard 2 1 2 0
Metropolis 1 0 0 1
If anyone can help me create the appropriate quarters and sort them into the appropriate columns I'd be extremely grateful. I'd also like a way to handle heroes having multiple mission end dates; I can't ignore them, I need to still count them. I suspect I'll need to create a custom function which I can then apply to each row via the apply() method and a lambda expression. This issue has been a pain for a while now, so I'd appreciate all the help I can get. Thank you very much :)
After merging your dataframes with
df = df1.merge(df2, left_on='Superhero ID', right_on='SID')
and converting your date column to datetime format
df = df.assign(mission_end_date=lambda x: pd.to_datetime(x['Mission End date'], dayfirst=True))
you can create two columns, one extracting the quarter and one the year of the newly created datetime column:
df = (df.assign(quarter_end_date=lambda x: x.mission_end_date.dt.quarter)
        .assign(year_end_date=lambda x: x.mission_end_date.dt.year))
and combine them into a column that shows the quarter in a "Qx, yyyy" format:
df = df.assign(quarter_year_end=lambda x: 'Q' + x.quarter_end_date.astype(str) + ', ' + x.year_end_date.astype(str))
Finally, group by city and quarter, count the number of superheroes, and pivot the dataframe to get your desired result:
(df.groupby(['City', 'quarter_year_end'])
   .count()
   .reset_index()
   .pivot(index='City', columns='quarter_year_end', values='Superhero'))
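One caveat: the question's quarters start in April, while dt.quarter numbers calendar quarters. A sketch that follows the stated thresholds uses a fiscal period anchored to a March year-end ('Q-MAR'), under which Apr-Jun maps to Q1 and Oct-Dec to Q3:
import pandas as pd

df = df1.merge(df2, left_on='Superhero ID', right_on='SID')
df['Mission End date'] = pd.to_datetime(df['Mission End date'], dayfirst=True)
# 'Q-MAR' = fiscal year ending in March, so April starts quarter 1 as required
df['fiscal_quarter'] = df['Mission End date'].dt.to_period('Q-MAR')
result = (df.groupby(['City', 'fiscal_quarter'])['Superhero']
            .count()                 # heroes freed per city and fiscal quarter
            .unstack(fill_value=0))  # one column per quarter, zeros where none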

Concat 2 Dataframes with 54 entries yields 1 row

I have created 2 dataframes with a common index based on Year and District. There are 58 rows in each dataframe and the Year and Districts are exact matches. Yet when I try to join them, I get a new dataframe with all of the columns combined (which is what I want) but only one single row - New York City. That row exists in both dataframes, as do all the rest, but only this one makes it to the merged DF. I have tried a few different methods of joining the dataframes but they all do the same thing. This example uses:
pd.concat([groupeddf, Popdf], axis=1)
This is the Popdf with (Year, District) as Index:
Population
Year District
2017 Albany 309612
Allegany 46894
Broome 193639
Cattaraugus 77348
Cayuga 77603
This is the groupeddf indexed on Year and District (some columns eliminated for clarity):
Total SNAP Households Total SNAP Persons \
Year District
2017 Albany 223057 416302
Allegany 36935 69802
Broome 201586 363504
Cattaraugus 75567 144572
Cayuga 64168 121988
This is the merged DF after executing pd.concat([groupeddf, Popdf], axis=1):
Population Total SNAP Households Total SNAP Persons
Year District
2017 New York City 8622698 11314598 19987958
This shows the merged dataframe has only 1 entry:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1 entries, (2017, New York City) to (2017, New York City)
Data columns (total 4 columns):
Population 1 non-null int64
Total SNAP Households 1 non-null int64
Total SNAP Persons 1 non-null int64
Total SNAP Benefits 1 non-null float64
dtypes: float64(1), int64(3)
memory usage: 170.0+ bytes
UPDATE: I tried another approach and it demonstrates that the indices which appear identical to me, are not being seen as identical.
When I execute this code, I get duplicates instead of a merge:
combined_df = groupeddf.merge(Popdf, how='outer', left_index=True, right_index=True)
The results look like this:
Year District
2017 Albany 223057.0 416302.0
Albany NaN NaN
Allegany 36935.0 69802.0
Allegany NaN NaN
Broome 201586.0 363504.0
Broome NaN NaN
Cattaraugus 75567.0 144572.0
Cattaraugus NaN NaN
The only exception is when you get down to New York City. That one does not duplicate, so it is actually seen as the same index. So there is something wrong with the data, but I am not sure what.
Did you try using merge, like this:
combined_df = pd.merge(groupeddf, Popdf, how='inner', on=['Year', 'District'])
I used inner if you want to combine only rows where the district and year exist in both dataframes. If you want to keep everything in the left dataframe but only matching rows from the right, do a left join, etc.
It took a while but I finally sorted it out. The District name in the population dataframe had a trailing space, where there was no space in the SNAP df:
"Albany " vs "Albany"

How to fill a dataframe's empty cells with NA values?

I have a dataframe df:
Open Volume Adj Close Ticker
Date
2006-11-22 140.750000 45505300 114.480649 SPY
I want to turn df into another dataframe of Open prices, like below:
SPY AGG
Date
2006-11-22 140.750000 NA
It only uses the Open column's data and two tickers, so how do I change one dataframe into the other?
I think you can use the DataFrame constructor with reindex by a list of tickers L:
L = ['SPY', 'AGG']
df1 = pd.DataFrame({'SPY': [df.Open.iloc[0]]},
                   index=[df.index[0]])
df1 = df1.reindex(columns=L)
print (df1)
SPY AGG
2006-11-22 140.75 NaN
You can use read_html to find the list of tickers:
df2 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies', header=0)[0]
#print (df2)
#filter only Ticker symbols starts with SP
df2 = df2[df2['Ticker symbol'].str.startswith('SP')]
print (df2)
Ticker symbol Security SEC filings \
407 SPG Simon Property Group Inc reports
415 SPGI S&P Global, Inc. reports
418 SPLS Staples Inc. reports
GICS Sector GICS Sub Industry \
407 Real Estate REITs
415 Financials Diversified Financial Services
418 Consumer Discretionary Specialty Stores
Address of Headquarters Date first added CIK
407 Indianapolis, Indiana NaN 1063761
415 New York, New York NaN 64040
418 Framingham, Massachusetts NaN 791519
# convert the column to a list, adding SPY because it is missing
L = ['SPY'] + df2['Ticker symbol'].tolist()
print (L)
['SPY', 'SPG', 'SPGI', 'SPLS']
df1 = pd.DataFrame({'SPY': [df.Open.iloc[0]]},
                   index=[df.index[0]])
df1 = df1.reindex(columns=L)
print (df1)
SPY SPG SPGI SPLS
2006-11-22 140.75 NaN NaN NaN
Suppose you have a list of data frames df_list for different tickers, and every item of the list has the same layout as the df in your example.
You can first concatenate them into one frame with
df1 = pd.concat(df_list)
Then with
df1[["Open", "Ticker"]].reset_index().set_index(["Date", "Ticker"]).unstack()
It should give you an output like
            Open
Ticker       AGG     SPY
Date
2006-11-22   NaN  140.75
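A self-contained sketch of that approach, with two made-up single-row frames standing in for df_list (the AGG numbers are invented purely for shape):
import pandas as pd

spy = pd.DataFrame({'Open': [140.75], 'Volume': [45505300],
                    'Adj Close': [114.480649], 'Ticker': ['SPY']},
                   index=pd.DatetimeIndex(['2006-11-22'], name='Date'))
agg = pd.DataFrame({'Open': [99.10], 'Volume': [120300],     # hypothetical AGG row
                    'Adj Close': [70.21], 'Ticker': ['AGG']},
                   index=pd.DatetimeIndex(['2006-11-23'], name='Date'))
df1 = pd.concat([spy, agg])
out = df1[['Open', 'Ticker']].reset_index().set_index(['Date', 'Ticker']).unstack()
# dates for which a ticker has no row come out as NaN automatically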
