Working with some COVID-19 data, how should I calculate a 14-day rolling sum of case counts?
Here's my existing code:
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
all_counties = pd.read_csv(url, dtype={"fips": str})
all_counties.date = pd.to_datetime(all_counties.date)
oregon = all_counties.loc[all_counties['state'] == 'Oregon']
oregon.set_index('date', inplace=True)
oregon['delta']=oregon.groupby(['state','county'])['cases'].diff().fillna(0)
oregon.head()
This code calculates the daily incremental case count (thanks to an earlier question's answers).
The next step is calculating the rolling 14-day sum, for which I have attempted this:
oregon['rolling_14']=oregon.groupby(['state','county'])['delta'].rolling(min_periods=1, window=14).sum()
Unfortunately, it fails. If I have a single county's data, this works:
county['rolling_14']=county.rolling(min_periods=1, window=14).sum()
But this isn't viable when the data frame contains multiple counties' data.
The groupby().rolling() result has two extra index levels, namely state and county. Drop them and the assignment will work:
oregon['rolling_14'] = (oregon.groupby(['state','county'])['delta']
                        .rolling(min_periods=1, window=14).sum()
                        .reset_index(level=['state','county'], drop=True)
                        )
Also, since you call groupby several times with the same keys, reusing a single lazy groupby object trims the runtime and the code a bit:
groups = oregon.groupby(['state','county'])
oregon['delta'] = groups['cases'].diff().fillna(0)
oregon['rolling_14'] = (groups['delta']
                        .rolling(min_periods=1, window=14).sum()
                        .reset_index(level=['state','county'], drop=True)
                        )
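If that assignment ever complains about duplicate date labels (the rolling result comes back ordered by group rather than in the original row order), a hedged alternative is to do the rolling sum inside groupby().transform(), which keeps the original index so nothing needs to be dropped. A minimal sketch, assuming the same oregon frame and delta column as above:
# per-county 14-day rolling sum, aligned to the original rows
oregon['rolling_14'] = (oregon.groupby(['state', 'county'])['delta']
                        .transform(lambda s: s.rolling(window=14, min_periods=1).sum()))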
I'm trying to create a graph with Seaborn that shows all of the Windows events in the Domain Controller that have taken place in a given time range. The counts change every run: you might have, say, five events now, but when you run the program again in 10 minutes, you might get 25.
With that said, I've been able to parse these events (labeled Actions) from a mumbo-jumbo of other data in the log file and then create a DataFrame in Pandas. The script outputs the data as a dictionary. After creating the DataFrame, this is what the output looks like:
   logged-out  kerberos-authentication-ticket-requested  logged-in  created-process  exited-process
1           1                                          5          2                1               1
Note: The values you see above are the number of times the process took place within that time frame.
That would be good enough for me if a table were all I needed. But when I try to put this DataFrame into Seaborn, I get an error, because I don't know what to name the x and y axes; they are always changing. So my solution was to use df.melt() to convert those columns into rows and then label the only two columns needed ('Action', 'Count'). That's where I fumbled multiple times: I can't figure out how to use df.melt() correctly.
Here is my code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Ever-changing data
actions = {'logged-out': 1, 'kerberos-authentication-ticket-requested': 5, 'logged-in': 2,
           'created-process': 1, 'exited-process': 1}
#Create DataFrame
index = pd.RangeIndex(start=1, stop=2)  # add automatic index
df = pd.DataFrame(actions, index=index)
print(df)
#Convert Columns to Rows and Add
df.melt(id_vars=["Action", "Count"],
var_name="Action",
value_name="Count")
#Create graph
sns.barplot(data=df,x='Action',y='Count',
palette=['#476a6f','#e63946'],
dodge=False,saturation=0.65)
plt.savefig('fig.png')
plt.show()
Any help is appreciated.
You can use:
df.melt(var_name="Action", value_name="Count")
without using any id_vars!
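Roughly, a minimal end-to-end sketch of how that fits with the rest of the script (the counts are just the example numbers from the question, and the tick rotation is an optional readability tweak, not something the question asked for):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# example counts, mirroring the question's dictionary
actions = {'logged-out': 1, 'kerberos-authentication-ticket-requested': 5,
           'logged-in': 2, 'created-process': 1, 'exited-process': 1}

df = pd.DataFrame(actions, index=[1])                      # one row, one column per action
long_df = df.melt(var_name="Action", value_name="Count")   # wide -> long: exactly two columns

sns.barplot(data=long_df, x='Action', y='Count', dodge=False, saturation=0.65)
plt.xticks(rotation=45, ha='right')                        # keep the long action names readable
plt.tight_layout()
plt.savefig('fig.png')
plt.show()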
I have a dataframe with 1-minute timestamps of open, high, low, close and volume for a token.
Using the expanding or resample functions, one can get a new dataframe based on the time interval; in my case it's a 1-day interval.
I am looking to get that output inside the original dataframe. Please assist with the same.
Original dataframe: (not shown)
Desired dataframe: (not shown)
Here "date_1d" is the time interval for my use case. i used expanding function but as the value changes in "date_1d" column, expanding function works on the whole dataframe
df["high_1d"] = df["high"].expanding().max()
df["low_1d"] = df["low"].expanding().min()
df["volume_1d"] = df["volume"].expanding().min()
Then the next challenge was how to find the open and close based on the "date_1d" column.
Please assist, or ask more questions if my desired output is not clear.
FYI: the data is huge, 5 years of 1-minute data for 100 tokens.
Thanks in advance,
Sukhwant
I'm not sure if I understand it right, but it looks like you want to groupby each day and calculate first, last, min and max for each group.
Is the column date_1d already there?
If not:
df["date_1d"] = df["date"].dt.strftime('%Y%m%d')
For the calculations:
df["open_1d"] = df.groupby("date_1d")['open'].transform('first')
df["high_1d"] = df.groupby("date_1d")['high'].transform('max')
df["low_1d"] = df.groupby("date_1d")['low'].transform('min')
df["close_1d"] = df.groupby("date_1d")['close'].transform('last')
EDIT:
Have a look at whether this works in your code as you expect (until we have some of your data I can only guess, sorry :D):
df['high_1d'] = df.groupby('date_1d')['high'].expanding().max().values
It groups the data per "date_1d", but within each group it only considers the current row and the rows above it, i.e. a running per-day maximum.
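A tiny illustration with made-up numbers (assumed data, only to show how the running maximum restarts at each new date_1d):
import pandas as pd

tiny = pd.DataFrame({'date_1d': ['20230101'] * 3 + ['20230102'] * 3,
                     'high':    [10, 12, 11, 7, 9, 8]})
tiny['high_1d'] = tiny.groupby('date_1d')['high'].expanding().max().values
print(tiny)
# high_1d comes out as 10, 12, 12, 7, 9, 9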
EDIT: Found a neat solution using the transform method. It removes the need for a "Day" column, since df.groupby can work directly on the index's day attribute.
import datetime

import pandas as pd
import yfinance as yf

df = yf.download("AAPL", interval="1m",
                 start=datetime.date.today() - datetime.timedelta(days=6))

df['Open_1d'] = df["Open"].groupby(df.index.day).transform('first')
df['Close_1d'] = df["Close"].groupby(df.index.day).transform('last')
df['High_1d'] = df['High'].groupby(df.index.day).expanding().max().droplevel(level=0)
df['Low_1d'] = df['Low'].groupby(df.index.day).expanding().min().droplevel(level=0)
df['Volume_1d'] = df['Volume'].groupby(df.index.day).expanding().sum().droplevel(level=0)
df
Happy Coding!
I have a dataset (df) like this:
  Card Number   Amount
0  102******38    22.0
1  102******56    76.0
2  102******38    25.6
and it's loaded using
import pandas as pd
df = pd.read_csv("file.csv")
And I would like to calculate something like:
df["Zscore"] = df["Amount"] - AVERAGE(all x in df["Amount"] that have the same value in df["Card Number"])
My intuition is something like:
import numpy as np
import statistics as st
df["Zscore"] = df["Amount"] - st.mean(np.where(condition, df["Amount"], 0))
But I can't figure out how to express my condition.
After some research, I found a solution using Verticapy:
import verticapy.stats as st
df["Zscore"] = (df["Amount"] - st.mean(df["Amount"])._over(["CardNumber"]))
But that would mean converting my code to Verticapy, and I would like another way to do it, because I have never used Verticapy and don't really want to start at the moment.
So do I need to use np.where(), and if so, is it possible to formulate my condition?
Or do I need to change the way I'm attacking the problem?
First, you need to calculate the mean value per card number. Let's do that by grouping on 'Card Number', taking the average 'Amount', and calling the result 'card_mean':
mean_values = (df.groupby('Card Number')['Amount']
               .mean()
               .reset_index()
               .rename(columns={'Amount': 'card_mean'}))
Then, you want to merge that mean value back into the original dataframe as a new column, one value per 'Card Number' in your original df:
df = pd.merge(df, mean_values, how='left', on='Card Number')
This gives you a combined df containing both the 'Amount' column (which you loaded) and the 'card_mean' per card number (which we just calculated by averaging in step 1).
Now you can go and do your magic with both, i.e., subtract one from the other, average over that difference, etc. For example:
df['z_score'] = df['Amount'] - df['card_mean']
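For what it's worth, the same result can be obtained in a single step with groupby().transform('mean'), which broadcasts the per-card mean straight back onto the original rows; a sketch, assuming the 'Card Number' and 'Amount' columns from the question:
# per-card mean aligned to each row, no merge needed
df['z_score'] = df['Amount'] - df.groupby('Card Number')['Amount'].transform('mean')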
Problem statement
I had the following problem:
I have samples that ran independent tests. In my dataframe, tests of a sample with the same "test name" are also independent, so the pair (test, sample) is independent and unique.
Data are collected at irregular sampling rates, so we're speaking about unequally spaced indices. This "time series" index is called nonreg_idx in the example. For the sake of simplicity, it is a float between 0 and 1.
I want to figure out the value at a specific index, e.g. nonreg_idx=0.5. If the value is missing, I just want a linear interpolation that depends on the index. If that would mean extrapolating, because 0.5 lies outside the extrema of the sorted nonreg_idx of the group (test, sample), it can be left as NaN.
Note the following from pandas documentation:
Please note that only method='linear' is supported for
DataFrame/Series with a MultiIndex.
’linear’: Ignore the index and treat the values as equally spaced.
This is the only method supported on MultiIndexes.
The only solution I found is long, complex and slow. I am wondering if I am missing out on something, or on the contrary something is missing from the pandas library. I believe this is a typical issue in scientific and engineering fields to have independent tests on various samples with non regular indices.
What I tried
Sample data set preparation
This part is just for making an example:
import pandas as pd
import numpy as np
tests = (f'T{i}' for i in range(20))
samples = (chr(i) for i in range(97,120))
idx = pd.MultiIndex.from_product((tests,samples),names=('tests','samples'))
idx
dfs=list()
for ids in idx:
    group_idx = pd.MultiIndex.from_product(
        ((ids[0],), (ids[1],), tuple(np.random.random_sample(size=(90,))))
    ).sort_values()
    dfs.append(pd.DataFrame(1000 * np.random.random_sample(size=(90,)), index=group_idx))
df = pd.concat(dfs)
df = df.rename_axis(index=('test','sample','nonreg_idx')).rename({0:'value'},axis=1)
The (bad) solution
add_missing = df.index.droplevel('nonreg_idx').unique().to_frame().reset_index(drop=True)
add_missing['nonreg_idx'] = .5
add_missing = pd.MultiIndex.from_frame(add_missing)
added_missing = df.reindex(add_missing)
df_full = pd.concat([added_missing.loc[~added_missing.index.isin(df.index)], df])
df_full.sort_index(inplace=True)
def interp_fnc(group):
    try:
        return (group
                .reset_index(['test', 'sample'])
                .interpolate(method='slinear')
                .set_index(['test', 'sample'], append=True)
                .reorder_levels(['test', 'sample', 'nonreg_idx'])
                .sort_index())
    except:
        return group
grouped = df_full.groupby(level=['test','sample'])
df_filled = grouped.apply(interp_fnc)
Here, the wanted values are in df_filled, so I can do df_filled.loc[(slice(None), slice(None), .5), 'value'] to get what I need for each sample/test.
I would have expected to be able to do the same in one, or at most two, lines of code; I have 14 here. apply is quite a slow method, and I can't even use numba.
Question
Can someone propose a better solution?
If you think there is no better alternative, please comment and I'll open an issue...
I am trying to create a new dataframe based on a condition applied per group.
Suppose I have a dataframe with Name, Flag and Month.
import pandas as pd
import numpy as np
data = {'Name':['A', 'A', 'B', 'B'], 'Flag':[0, 1, 0, 1], 'Month':[1,2,1,2]}
df = pd.DataFrame(data)
need = df.loc[df['Flag'] == 0].groupby(['Name'], as_index = False)['Month'].min()
My condition is to find the minimum month where Flag equals 0, per name.
I have used .loc to define my condition; it works fine, but performance is quite poor when applied to 10 million rows.
Is there a more efficient way to do this?
Thank you!
Just had this same scenario yesterday, where I took a 90-second process down to about 3 seconds. Because speed is your concern (like mine was), rather than relying solely on Pandas itself, I would recommend using Numba and NumPy. The catch is that you're going to have to brush up on your data structures and types to get a good grasp of what Numba is really doing with JIT. Once you do, though, it rocks.
I would recommend finding a way to get every value in your DataFrame to an integer. For your Name column, try unique IDs. Flag and Month already look good.
# map each integer ID to its original name so you can translate back later
name_ids = {i: name for i, name in enumerate(np.unique(df["Name"]))}
Then, create a function and loop the old-fashioned way:
from numba import njit

@njit
def really_fast_numba_loop(data):
    for row in data:
        pass  # do stuff with each row here
    return data

new_df = really_fast_numba_loop(data)
The first time your function is called it pays the JIT compilation cost, so it will be about the same speed as it would be elsewhere, but every call after that will be lightning fast. So, the trick is finding the balance between what to put inside the function and what to leave in the loop outside it.
In either case, when you're done processing your values, use name_ids to map the integer IDs back to strings and wrap your data in pd.DataFrame.
Et voila. You just beat Pandas iterrows/itertuples.
Comment back if you have questions!
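To make that concrete, here is a rough sketch for the question's specific task (minimum Month with Flag equal to 0 per Name). The function name and the use of pd.factorize for the integer IDs are my own choices, not part of the answer above:
import numpy as np
import pandas as pd
from numba import njit

@njit
def min_month_flag0(name_codes, flags, months, n_names):
    # start every name at a sentinel larger than any real month
    out = np.full(n_names, months.max() + 1)
    for i in range(len(name_codes)):
        if flags[i] == 0 and months[i] < out[name_codes[i]]:
            out[name_codes[i]] = months[i]
    return out

data = {'Name': ['A', 'A', 'B', 'B'], 'Flag': [0, 1, 0, 1], 'Month': [1, 2, 1, 2]}
df = pd.DataFrame(data)

codes, uniques = pd.factorize(df['Name'])   # integer ID per name
mins = min_month_flag0(codes, df['Flag'].to_numpy(), df['Month'].to_numpy(), len(uniques))
need = pd.DataFrame({'Name': uniques, 'Month': mins})
Whether this actually beats the groupby approach at 10 million rows is worth benchmarking on the real data; the point is just the integer-ID plus njit loop pattern the answer describes.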