pandas mean column 1 of each different instance in column 2 - python

I have a dataframe with list of houses and column 'GROSSAREA' for each house and column 'YEARBUILT' on when it was constructed.
I need to find the average house size for each year.
df[df['YEARBUILT'] == 1991].mean()
Would you just loop over the years from lowest to highest?

It's a little hard to parse your question, but I think what you are asking for is the mean GROSSAREA for each YEARBUILT. If that's not the correct understanding, please edit your question and add an example set of data with the desired output.
If I'm correct then you want to use groupby.
import pandas as pd
df = pd.DataFrame({'YEARBUILT': [1999, 1999, 2000, 2000], 'GROSSAREA': [10, 20, 50, 60]})
df.groupby(by='YEARBUILT').mean()
           GROSSAREA
YEARBUILT
1999              15
2000              55
That will give you the mean per each group of YEARBUILT.
I think of groupby like merging cells in a spreadsheet.
# Your original dataframe:
YEARBUILT  GROSSAREA
1999       10
1999       20
2000       50
2000       60
# Your dataframe after df.groupby(by='YEARBUILT')
YEARBUILT  GROSSAREA
1999       10
           20
2000       50
           60
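A minimal sketch of the same idea, reusing the toy data from the answer. Selecting the column before aggregating and passing as_index=False are choices made here for illustration, not something the question requires; the loop only shows which rows end up in each group.

```python
import pandas as pd

# Toy data copied from the answer above
df = pd.DataFrame({'YEARBUILT': [1999, 1999, 2000, 2000],
                   'GROSSAREA': [10, 20, 50, 60]})

# Inspect the "merged cells": each group holds the rows for one year
for year, group in df.groupby('YEARBUILT'):
    print(year, group['GROSSAREA'].tolist())

# Mean GROSSAREA per year, keeping YEARBUILT as a regular column
means = df.groupby('YEARBUILT', as_index=False)['GROSSAREA'].mean()
print(means)
```

Groups come back sorted by year by default, which covers the "lowest to highest year" part of the question without an explicit loop.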

Related

How do I count values in one dataframe based on the conditions in another dataframe

I have two dataframes. df1 shows annual rainfall over a certain area:
df1:
longitude  latitude  year
-13.0      8.0       1979    15.449341
                     1980    21.970507
                     1981    18.114307
                     1982    16.881737
                     1983    24.122467
                     1984    27.108953
                     1985    27.401234
                     1986    18.238272
                     1987    25.421076
                     1988    11.796293
                     1989    17.778618
                     1990    18.095036
                     1991    20.414757
and df2 shows the upper limits of each bin:
bin limits
0 16.655970
1 18.204842
2 19.526524
3 20.852657
4 22.336731
5 24.211905
6 27.143820
I'm trying to add a new column to df2 that shows the frequency of rainfall events from df1 in their corresponding bin. For example, in bin 1 I'd be looking for the values in df1 that fall between 16.65 and 18.2.
I've tried the following:
rain = df1['tp1']
for i in range(7):
    limit = df2.iloc[i]
    out4['count']=rain[rain>limit].count()
However, I get the following message:
ValueError: Can only compare identically-labeled Series objects
Which I think is referring to the fact that I'm comparing two df's that are different sizes? I'm also unsure if that loop is correct or not.
Any help is much appreciated, thanks!
Use pd.cut to assign your rainfall into bins:
# Define the limits for your bins
# Bin 0: (-np.inf , 16.655970]
# Bin 1: (16.655970, 18.204842]
# Bin 2: (18.204842, 19.526524]
# ...
# note that your bins only go up to 27.14 while max rainfall is 27.4 (row 6).
# You may need to add / adjust your limits.
limits = [-np.inf] + df2["limits"].to_list()
# Assign the rainfall to each bin
bins = pd.cut(df1["rainfall"], limits, labels=df2["bin"])
# Count how many values fall into each bin
bins.value_counts(sort=False).rename_axis("bin")
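For an end-to-end check, here is a sketch that rebuilds the two inputs from the numbers in the question and runs the same pd.cut approach. The flat Series named rain and the column names "bin" and "limits" are assumptions made for this example; the real df1 has a MultiIndex and a differently named value column.

```python
import numpy as np
import pandas as pd

# Rainfall values copied from the question (1979 to 1991)
rain = pd.Series([15.449341, 21.970507, 18.114307, 16.881737, 24.122467,
                  27.108953, 27.401234, 18.238272, 25.421076, 11.796293,
                  17.778618, 18.095036, 20.414757])

# Upper bin limits copied from the question
df2 = pd.DataFrame({"bin": range(7),
                    "limits": [16.655970, 18.204842, 19.526524, 20.852657,
                               22.336731, 24.211905, 27.143820]})

# Prepend -inf so bin 0 catches everything up to the first limit
edges = [-np.inf] + df2["limits"].to_list()
binned = pd.cut(rain, edges, labels=df2["bin"].to_list())

# Count values per bin; values above the last limit become NaN and are not counted
counts = binned.value_counts(sort=False)
print(counts)
print("unbinned values:", binned.isna().sum())
```

The one unbinned value is 27.401234, which lies above the last limit, so the limits may need an extra upper edge.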

Use list comprehension to calculate Net Present Value and Internal Rate of Return from DataFrame

I downloaded a CSV file whose dataframe looks like this:
Year 1                     Year 2                     Year 3
-500 (Initial Investment)  -500 (Initial Investment)  -500 (Initial Investment)
1000                       1000                       1000
1000                       1000                       1000
1000                       1000                       1000
I want to use a list comprehension to create a new dataframe that returns the Net Present Value (NPV) of the investment and the Internal Rate of Return (IRR). Both functions are available in NumPy and would simply take a risk-free rate "x" and the values of each column, for example np.npv(rf, values_year1). Ideally, I could insert any number of years and it would give me a dataframe with the corresponding values for each year, regardless of how many years are in the CSV file. My new dataframe would look like this:
Indicator  Year 1    Year 2     Year 3
NPV        5000 USD  25000 USD  25000 USD
IRR        12%       25%        25%
I know how to do this manually by just selecting each column and doing the calculations, but I really want to learn how to use list comprehension. Is it even possible to do this with list comprehension?
Maybe something like this gives you some help:
df = pd.DataFrame({"Year 1": [-500, 1000, 1000, 1000], "Year 2": [-500, 1000, 1000, 1000], "Year 3": [-500, 1000, 1000, 1000]})
def sumarize_ser(col):
    s = col.agg(sum)
    return s
df2 = df.apply(sumarize_ser).to_frame()
s_values = df2[0].sum()
def calc_irr(line):
    return line/s_values*100
df2["IRR"] = df2[0].apply(calc_irr)
df2.T
Out[]:
      Year 1   Year 2   Year 3
0    2500.00  2500.00  2500.00
IRR    33.33    33.33    33.33
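Note that the snippet above sums each column and reports each column's share of the total, which is not a discounted NPV or a true IRR. Also, np.npv and np.irr have been removed from recent NumPy releases and now live in the separate numpy-financial package. As a dependency-free sketch of what the question asks for, the helpers npv and irr below are hand-written for illustration (they are not library functions), the rate rf = 0.05 is an assumed value, and the bisection assumes a single sign change of NPV between the chosen bounds.

```python
import pandas as pd

def npv(rate, cashflows):
    # Discount each cash flow by (1 + rate) ** t, with t starting at 0
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-9):
    # Bisection for the rate where NPV is zero.
    # Assumes NPV is positive at lo, negative at hi, and crosses zero once.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, cashflows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df = pd.DataFrame({"Year 1": [-500, 1000, 1000, 1000],
                   "Year 2": [-500, 1000, 1000, 1000],
                   "Year 3": [-500, 1000, 1000, 1000]})

rf = 0.05  # assumed risk-free rate, for illustration only

# One list comprehension per indicator, looping over however many columns exist
npvs = [npv(rf, df[col]) for col in df.columns]
irrs = [irr(df[col]) for col in df.columns]

result = pd.DataFrame([npvs, irrs], index=["NPV", "IRR"], columns=df.columns)
print(result)
```

Because the comprehensions iterate over df.columns, adding more year columns to the CSV needs no code changes.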

How to access columns after creating multiIndex

I am making my DataFrame like this:
influenza_data = pd.DataFrame(data, columns = ['year', 'week', 'weekly_infections'])
and then I create MultiIndex from year and week columns:
influenza_data = influenza_data.set_index(['year', 'week'])
If I have MultiIndex my DataFrame looks like this:
           weekly_infections
year week
2009 40                 6600
     41                 7100
     42                 7700
     43                 8300
     44                 8600
...                      ...
2019 10                 8900
     11                 6200
     12                 5500
     13                 3900
     14                 3300
and influenza_data.columns:
Index(['weekly_infections'], dtype='object')
The problem is that I can't access the year and week columns now.
If I try influenza_data['week'] or influenza_data['year'], I get KeyError: 'week'. I can only do influenza_data.weekly_infections, and that returns a whole DataFrame.
I know that if I remove the MultiIndex I can easily access them, but why can't I use influenza_data.year or influenza_data.week with a MultiIndex? I specified the columns when I created the DataFrame.
As the pandas documentation says, you can access the levels of a MultiIndex with the get_level_values method:
influenza_data.index.get_level_values(0) # year
influenza_data.index.get_level_values(1) # week
The argument is the position of the level in the index, in the order the levels were passed to set_index; a level name such as 'year' also works.
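After set_index, year and week are index levels rather than columns, which is why column-style lookup raises KeyError. A short sketch of three ways to get at them, using a few rows copied from the question; the variable names years, weeks, flat and only_2009 are made up for this example.

```python
import pandas as pd

# A few rows taken from the question's data
data = [(2009, 40, 6600), (2009, 41, 7100), (2019, 10, 8900)]
influenza_data = pd.DataFrame(data, columns=['year', 'week', 'weekly_infections'])
influenza_data = influenza_data.set_index(['year', 'week'])

# 1. Read a level by name instead of by position
years = influenza_data.index.get_level_values('year')
weeks = influenza_data.index.get_level_values('week')

# 2. Turn the index levels back into ordinary columns
flat = influenza_data.reset_index()

# 3. Select all rows for one value of a level
only_2009 = influenza_data.xs(2009, level='year')

print(list(years), list(weeks))
print(flat.columns.tolist())
print(only_2009)
```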

How to apply a custom rolling function to pandas groupby?

I would like to calculate the daily sales from average sales using the following function:
def derive_daily_sales(avg_sales_series, period, first_day_sales):
    """
    derive the daily sales from previous_avg_sales start date to current_avg_sales end date
    for detail formula, please refer to README.md
    #avg_sales_series: an array of avg sales(e.g. 2020-08-04 to 2020-08-06)
    #period: the averaging period in days (e.g. 30 days, 90 days)
    #first_day_sales: the sales at the first day of previous_avg_sales
    """
    x_n1 = avg_sales_series[-1]*period - avg_sales_series[0]*period + first_day_sales
    return x_n1
The avg_sales_series is supposed to be a pandas series.
The dataframe looks like the following:
date, customer_id, avg_30_day_sales
12/08/2020, 1, 30
13/08/2020, 1, 40
14/08/2020, 1, 40
12/08/2020, 2, 20
13/08/2020, 2, 40
14/08/2020, 2, 30
I would like to first group by customer_id and sort by date, then take a rolling window of size 2 and apply the custom function derive_daily_sales, assuming that period=30 and first_day_sales is equal to the first avg_30_day_sales.
I tried:
df_sales_grouped = df_sales.sort_values('date').groupby(['customer_id','date'])
df_daily_sales['daily_sales'] = df_sales_grouped['avg_30_day_sales'].rolling(2).apply(derive_daily_sales, axis=1, period=30, first_day_sales= df_sales['avg_30_day_sales'][0])
You should not group by the date since you want to roll over that column, so the grouping should be:
df_sales_grouped = df_sales.sort_values('date').groupby('customer_id')
Next, what you actually want to do is apply a rolling window on each group in the dataframe. So you need to use apply twice, once on the grouped dataframe and once on each rolling window. This can be done as follows:
rolling_arguments = {'period': 30, 'first_day_sales': df_sales['avg_30_day_sales'][0]}
df_sales['daily_sales'] = df_sales_grouped['avg_30_day_sales'].apply(
    # raw=True passes each window as a NumPy array, so [0] and [-1] index by position
    lambda g: g.rolling(2).apply(derive_daily_sales, raw=True, kwargs=rolling_arguments))
For the given input data, the result is:
date        customer_id  avg_30_day_sales  daily_sales
12/08/2020  1            30                NaN
13/08/2020  1            40                330.0
14/08/2020  1            40                30.0
12/08/2020  2            20                NaN
13/08/2020  2            40                630.0
14/08/2020  2            30                -270.0
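A self-contained variant of the same computation, written as a sketch rather than a drop-in replacement. It swaps the outer groupby apply for transform so the result lines up with the original rows regardless of how the pandas version handles group keys, and it passes raw=True so each window reaches the function as a NumPy array. Parsing the dates as day/month/year is an assumption about the data.

```python
import pandas as pd

def derive_daily_sales(avg_sales, period, first_day_sales):
    # avg_sales is a NumPy array (raw=True below), so [0] and [-1] are positional
    return avg_sales[-1] * period - avg_sales[0] * period + first_day_sales

df_sales = pd.DataFrame({
    'date': ['12/08/2020', '13/08/2020', '14/08/2020'] * 2,
    'customer_id': [1, 1, 1, 2, 2, 2],
    'avg_30_day_sales': [30, 40, 40, 20, 40, 30],
})

# Assumed day/month/year format; real dates sort more safely than strings
df_sales['date'] = pd.to_datetime(df_sales['date'], format='%d/%m/%Y')
df_sales = df_sales.sort_values(['customer_id', 'date'])

rolling_arguments = {'period': 30,
                     'first_day_sales': df_sales['avg_30_day_sales'].iloc[0]}

# transform returns one value per input row, aligned with df_sales
df_sales['daily_sales'] = (
    df_sales.groupby('customer_id')['avg_30_day_sales']
    .transform(lambda g: g.rolling(2).apply(
        derive_daily_sales, raw=True, kwargs=rolling_arguments))
)
print(df_sales)
```

The first row of each customer is NaN because a window of size 2 has no previous value there.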

Plotting histogram for column by grouping two column in pandas

I am new to pandas and matplotlib. I have a CSV file with data for the years 2012 to 2018. For each month of each year, I have rain data. I want to use a histogram to analyze which month of the year has the maximum rainfall. Here is my dataset.
year month Temp Rain
2012 1 10 100
2012 2 20 200
2012 3 30 300
.. .. .. ..
2012 12 40 400
2013 1 50 300
2013 2 60 200
.. .. .. ..
2018 12 70 400
I was not able to plot a histogram. I tried a bar plot but did not get the desired result. Here is what I have tried:
import pandas as pd
import numpy as npy
import matplotlib.pyplot as plt
df2=pd.read_csv('Monthly.csv')
df2.groupby(['year','month'])['Rain'].count().plot(kind="bar",figsize=(20,10))
Here is the output I got: [image not preserved]
Please suggest an approach to plot a histogram showing which month has the maximum rainfall, grouped by year.
Probably you don't want to see the count per group but
df2.groupby(['year','month'])['Rain'].first().plot(kind="bar",figsize=(20,10))
or maybe
df2.groupby(['month'])['Rain'].sum().plot(kind="bar",figsize=(20,10))
You are close to the solution: use max() and not count().
df2.groupby(['year','month'])['Rain'].max().plot(kind="bar",figsize=(20,10))
First group by year and month as you already did, but only keep the maximum rainfall.
series_df2 = df2.groupby(['year','month'], sort=False)['Rain'].max()
Then unstack the series, transpose it and plot it.
series_df2.unstack().T.plot(kind='bar', subplots=False, layout=(2,2))
This gives a grouped bar chart for your sample data, with one cluster of bars per month and one bar per year. [Figure: output plot not preserved]
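If the goal is to read off the wettest month per year rather than eyeball it from a chart, the unstacked table can be queried directly. A sketch on a made-up subset of the data (three months for two years; the values are illustrative, not the real file), with the plotting call left as a comment so the example runs without matplotlib:

```python
import pandas as pd

# Hypothetical subset shaped like Monthly.csv
df2 = pd.DataFrame({
    'year':  [2012, 2012, 2012, 2013, 2013, 2013],
    'month': [1, 2, 3, 1, 2, 3],
    'Rain':  [100, 200, 300, 300, 200, 100],
})

# Rows are months, columns are years, cells are the maximum rainfall
table = df2.groupby(['year', 'month'])['Rain'].max().unstack().T

# For each year (column), the month (row label) with the highest rainfall
wettest_month = table.idxmax()
print(wettest_month)

# table.plot(kind='bar') would draw the grouped bar chart described above
```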
