Finding the Difference between Rows and Formatting Into New DF - python

This is my first time working with dataframes in Python. Here is the question:
"Calculate the difference in the number of Occupied Housing Units from year to year and print it. The difference must be calculated for the consecutive years such as 2008-2009, 2009-2010 etc. Finally, print the values in the ascending order."
Below is my code. It works, but I know there has to be a more efficient way to accomplish this than the brute-force, manual data entry method I used:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/unt-iialab/INFO5731_Spring2020/master/Assignments/Assignment1_denton_housing.csv'
data = pd.read_csv(url, index_col=0)
# keep only the Occupied Housing Units rows
A = data.loc[data["title_field"] == "Occupied Housing Units", ["title_field", 'value']]
# manually entered year-over-year differences
data_2 = [['2008-2009', 35916-36711], ['2009-2010', 41007-35916], ['2010-2011', 40704-41007], ['2011-2012', 42108-40704], ['2012-2013', 43673-42108], ['2013-2014', 46295-43673]]
B = pd.DataFrame(data_2, columns=['Years', 'Difference'])
sort_by_diff = B.sort_values('Difference')
print(sort_by_diff)
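For comparison, here is a vectorized sketch of the same computation using .diff(), so the differences are not typed in by hand. It assumes the CSV has columns named year, title_field and value (the year column name is an assumption, since the code above hides the first column behind index_col=0):
import pandas as pd
url = 'https://raw.githubusercontent.com/unt-iialab/INFO5731_Spring2020/master/Assignments/Assignment1_denton_housing.csv'
data = pd.read_csv(url)
# keep only the Occupied Housing Units rows, ordered by year
occupied = data[data['title_field'] == 'Occupied Housing Units'].sort_values('year')
# year-over-year difference: value for year N minus value for year N-1
diff = occupied['value'].diff().dropna().astype(int)
# build "2008-2009"-style labels from consecutive years
years = occupied['year'].astype(int)
labels = years.shift(1).dropna().astype(int).astype(str) + '-' + years.iloc[1:].astype(str)
B = pd.DataFrame({'Years': labels.values, 'Difference': diff.values})
print(B.sort_values('Difference'))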

Related

Fastest way to calculate and append rolling mean as columns for grouped dataframe

I have the following dataset. This is a small sample; the actual dataset is much larger.
What is the fastest way to:
iterate through days = (1, 2, 3, 4, 5, 6)
calculate [...rolling(day, min_periods=day).mean()]
add it as a column named df[f'sma_{day}']
The method I have casts the frame to a dict of {ticker: price_df} and loops through it, as shown below. I have thought about methods like groupby and stack/unstack, but got stuck appending the columns because the index is a MultiIndex. I am favouring the method with the fastest %%timeit.
import yfinance as yf
df = yf.download(['MSFT','AAPL','AMZN'], start="2022-09-13").loc[:,['Close']].stack().swaplevel().sort_index()
df.index.set_names(['Ticker','Date'], inplace=True)
df
Here is a sample dictionary method I have:
df = df.reset_index()
df = dict(tuple(df.groupby(['Ticker'])))
## Iterate through days and keys
days = (1, 2, 3, 4, 5, 6)
for key in df.keys():
    for day in days:
        df[key][f'sma_{day}'] = df[key].Close.sort_index(ascending=True).rolling(day, min_periods=day).mean()
## Flatten dictionary
pd.concat(df.values()).set_index(['Ticker','Date']).sort_index()
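One groupby-based sketch that avoids the dict round-trip, shown on a made-up frame with the same (Ticker, Date) MultiIndex layout; whether it actually beats the dict version is something to check with %%timeit:
import numpy as np
import pandas as pd
# made-up frame shaped like the question's: (Ticker, Date) MultiIndex with a Close column
idx = pd.MultiIndex.from_product(
    [['AAPL', 'AMZN', 'MSFT'], pd.date_range('2022-09-13', periods=20)],
    names=['Ticker', 'Date'])
df = pd.DataFrame({'Close': np.random.rand(len(idx)) * 100}, index=idx)
# one grouped rolling mean per window size; transform keeps the result aligned with df's index
for day in (1, 2, 3, 4, 5, 6):
    df[f'sma_{day}'] = (df.groupby(level='Ticker')['Close']
                          .transform(lambda s: s.rolling(day, min_periods=day).mean()))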

Calculate standard deviation between data in dataframes generated in a for loop

I have a loop that generates dataframes of the same size (96x96 or even larger), adds those dataframes on top of each other, and then divides the result by the number of iterations (count). This gives an average.
Now I need to calculate the standard deviation across those generated dataframes. A loop might generate as many as 365 DFs. I understand that I need to rework the logic a bit, since calculating the standard deviation needs all the numbers from all the dataframes.
What would be the best way to do this? I was thinking about using a MultiIndex, but since I am new to Python, I can't get my head around it.
So, here is a simple example code:
import pandas as pd
import numpy as np
zero_data = np.zeros(shape=(5,5))
df = pd.DataFrame(zero_data, columns=[0,1,2,3,4])
for i in range(1,5):
    df1 = pd.DataFrame(np.random.randint(0,100,size=(5, 5)), columns=[0,1,2,3,4])
    zero_data = zero_data + df1
print(zero_data)
In this code, 5 dataframes are created and added on top of each other. How can I calculate the std of those 5 dataframes?
OK, instead of doing zero_data = zero_data + df1, I used pd.concat([df, df1]), then reset_index() and groupby(by='index'), after which I can apply mean, std or whatever:
for i in range(0,5):
    df1 = pd.DataFrame(np.random.randint(0,100,size=(5, 5)), columns=[0,1,2,3,4])
    df = pd.concat([df, df1])
df.reset_index().groupby(by='index').sum()
Turned out to be easier than I thought it would be :)
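For reference, a small sketch of the element-wise std across such frames, either via the concat/groupby route above or by stacking the frames into a 3-D NumPy array (the 5 random frames here stand in for the real generated ones):
import numpy as np
import pandas as pd
# hypothetically, 5 generated frames with the same shape and columns
frames = [pd.DataFrame(np.random.randint(0, 100, size=(5, 5)), columns=[0, 1, 2, 3, 4]) for _ in range(5)]
# option 1: concatenate and group by row position (each frame shares the index 0..4)
stacked = pd.concat(frames)
mean_df = stacked.groupby(level=0).mean()  # element-wise mean across frames
std_df = stacked.groupby(level=0).std()    # element-wise sample std (ddof=1)
# option 2: stack into a 3-D array and reduce along the first axis
arr = np.stack([f.to_numpy() for f in frames])
std_df_2 = pd.DataFrame(arr.std(axis=0, ddof=1), columns=frames[0].columns)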

Vectorising pandas dataframe apply function for user defined function in python

I want to compute the week of the month for a specified date. For this, I currently use the user-defined function below.
(Input and output data frame samples omitted; the output is the input with an added week_of_month column.)
Here is what I have tried:
from math import ceil
def week_of_month(dt):
    """
    Returns the week of the month for the specified date.
    """
    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    return int(ceil(adjusted_dom/7.0))
After this,
import datetime
import pandas as pd
df = pd.read_csv("input_dataframe.csv")
df.date = pd.to_datetime(df.date)
df['year_of_date'] = df.date.dt.year
df['month_of_date'] = df.date.dt.month
df['day_of_date'] = df.date.dt.day
wom = pd.Series(dtype=float)
# worker function for creating week of month series
def convert_date(t):
    global wom
    wom = wom.append(pd.Series(week_of_month(datetime.datetime(t[0], t[1], t[2]))), ignore_index=True)
# calling worker function for each row of dataframe
_ = df[['year_of_date','month_of_date','day_of_date']].apply(convert_date, axis=1)
# adding new computed column to dataframe
df['week_of_month'] = wom
# this updated dataframe should now match the output data frame described above
What this does is compute the week of the month for each row of the data frame using the given function. The computation gets slower as the data frame grows, and I currently have more than 10M rows.
I am looking for a faster way of doing this. What changes can I make to this code to vectorize this operation across all rows?
Thanks in advance.
Edit: What worked for me after reading the answers is the code below:
first_day_of_month = pd.to_datetime(df.date.values.astype('datetime64[M]'))
df['week_of_month'] = np.ceil((df.date.dt.day + first_day_of_month.weekday) / 7.0).astype(int)
The week_of_month method can be vectorized. It could be beneficial to skip the conversion to datetime objects and instead use pandas-only methods:
first_day_of_month = df.date.dt.to_period("M").dt.to_timestamp()
df["week_of_month"] = np.ceil((df.date.dt.day + first_day_of_month.dt.weekday) / 7.0).astype(int)
Right off the bat, without even going into your code or mentioning X/Y problems: work from the unique dates only. In 10M rows, surely many of the dates are duplicates.
Steps (a sketch follows below):
create a 2nd df that contains only the columns you need and no duplicates (drop_duplicates)
run your function on the small dataframe
merge the large and small dfs
(optional) drop the small one
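A rough sketch of those steps, on a hypothetical frame with a date column and using the vectorised week-of-month computation from the other answer:
import numpy as np
import pandas as pd
# hypothetical frame with a date column, as in the question
df = pd.DataFrame({'date': pd.to_datetime(['2020-01-03', '2020-01-17', '2020-01-03', '2020-02-10'])})
# 1) small frame with only the unique dates
unique_dates = df[['date']].drop_duplicates()
# 2) run the computation on the small frame only
first_day = unique_dates['date'].dt.to_period('M').dt.to_timestamp()
unique_dates['week_of_month'] = np.ceil((unique_dates['date'].dt.day + first_day.dt.weekday) / 7.0).astype(int)
# 3) merge the small frame back onto the large one
df = df.merge(unique_dates, on='date', how='left')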

Need to calculate percentage increase in grouped data between 2 columns

I have a dataframe with years of medical data by country: one column gives the type of illness [column='Indicator Name'] and the remaining columns are the years from 1960 to 2015. I am trying to group and total the rows for each type of illness, determine the percentage increase from 1960 to 2015, and sort by the highest increase.
I tried to create a new DF (df1) with the sorted and totaled data and then determine the difference (see below). This approach is not working.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df1 = df.groupby['Indicator Name']
df1['ans4']=(df1['2015']-df1['1960'])/df1['1960']
df1.sort_values(by='ans4')
How can I do this?
EDIT
I'm adding a minimal dataframe:
import pandas as pd
import numpy as np
diseases = ["disease_{:02}".format(i+1) for i in range(4)]
years = list(range(1960, 1970))
df = pd.DataFrame({"disease":diseases})
df[years] = pd.DataFrame(np.random.randint(1000, 10000,(len(diseases),len(years))))
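With that minimal frame, a sketch of the grouped totals and percentage increase might look like the following; with the real data the grouping column is 'Indicator Name' and the year columns are probably the strings '1960' and '2015':
import numpy as np
import pandas as pd
# rebuild the minimal frame: one row per disease, one column per year
diseases = ["disease_{:02}".format(i + 1) for i in range(4)]
years = list(range(1960, 1970))
df = pd.DataFrame(np.random.randint(1000, 10000, (len(diseases), len(years))), columns=years)
df.insert(0, 'disease', diseases)
# total the rows for each illness, then the relative change between the
# first and last year, sorted with the largest increase first
df1 = df.groupby('disease')[years].sum()
df1['ans4'] = (df1[years[-1]] - df1[years[0]]) / df1[years[0]]
print(df1.sort_values(by='ans4', ascending=False))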

pandas resample dealing with missing data

I am using pandas to deal with monthly data that have some missing values. I would like to be able to use the resample method to compute annual statistics, but only for years with no missing data.
Here is some code and output to demonstrate :
import pandas as pd
import numpy as np
dates = pd.date_range(start = '1980-01', periods = 24,freq='M')
df = pd.DataFrame([np.nan] * 10 + list(range(14)), index=dates)
Here is what I obtain if I resample:
In [18]: df.resample('A')
Out[18]:
              0
1980-12-31  0.5
1981-12-31  7.5
I would like to have np.nan for the 1980-12-31 index, since that year does not have a monthly value for every month. I tried to play with the 'how' argument, but with no luck.
How can I accomplish this?
I'm sure there's a better way, but in this case you can use:
df.resample('A', how=[np.mean, pd.Series.count, len])
and then drop all rows where count != len.
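The how argument has since been removed from resample; a modern spelling of the same idea (my own sketch, not what the original answer ran) would be:
import numpy as np
import pandas as pd
dates = pd.date_range(start='1980-01', periods=24, freq='M')
s = pd.Series([np.nan] * 10 + list(range(14)), index=dates)
annual_mean = s.resample('A').mean()    # ignores NaN months
annual_count = s.resample('A').count()  # number of non-NaN months per year
annual_size = s.resample('A').size()    # total number of months per year
# keep the mean only for years where every month has a value
result = annual_mean.where(annual_count == annual_size)
print(result)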
