Moving average for grouped data in pandas - python

I am looking at coronavirus data from the NY Times, which can be found here (and is open for everyone to use): https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
The dataset is set up like this:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
df2 = df.copy()
df2 = df2.set_index('date')
df2['cases_lagged'] = df2.groupby(['county', 'state'])['cases'].shift()
df2[df2['fips']== 34041.0].head(10)
I was hoping I could create a moving-average column the same way, using a groupby statement along with pandas' .rolling() method to compile 7-day and 14-day moving averages, but it does not work.
I tried it two separate ways:
#way 1
df2['moving_avg'] = df2.groupby(['county', 'state']).iloc[:4].rolling(window = 7).mean()
#way 2
df2['moving_avg'] = df2.groupby(['county', 'state'])['cases'].rolling(window = 7).mean()
And neither seems to work here.
Any thoughts on how to compile the moving average for each county within each state without having to break out each and every county into its own df for it to work? Thanks

When I ran it with all the data I terminated it because it was running for a long time, so I limited the run to specific counties. I'm not sure this achieves the intended result:
los = df2[(df2['county'] == 'Los Angeles') & (df2['state'] == 'California')].copy()
los['moving_avg'] = (
    los.groupby(['county', 'state'])['cases']
       .rolling(window=7).mean()
       .droplevel(['county', 'state'])  # drop the group keys so the index aligns with los
)
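A way to get this for every county at once, without slicing out each county, is to let transform align the grouped rolling mean back to the original rows. A minimal sketch, assuming df2 is the frame built above (the column names moving_avg_7d/moving_avg_14d are mine):
# Grouped rolling means, aligned back to df2's rows via transform.
df2['moving_avg_7d'] = (
    df2.groupby(['county', 'state'])['cases']
       .transform(lambda s: s.rolling(window=7).mean())
)
df2['moving_avg_14d'] = (
    df2.groupby(['county', 'state'])['cases']
       .transform(lambda s: s.rolling(window=14).mean())
)
transform returns a result indexed like the input column, so the assignment aligns without any index juggling; the first 6 (or 13) rows of each county are NaN until the window fills.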

Related

Python: appending DataFrames in the right order and printing the DataFrame as a whole (timeseries forecasting using LSTM)

So I'm currently trying to do timeseries forecasting using LSTM, and I'm still at the early stage where I want to make sure my data is clean.
For the background:
I'm trying to make LSTM models for temperature, rainfall, and humidity for 3 stations, so if I'm correct there will be 9 models, 3 for each station. As of now I'm running an experiment using 1 year's worth of data.
The problem:
I named my files based on the index of the month: Jan as 1, Feb as 2, Mar as 3, and so on.
Using the os library I managed to loop through the folder, cleaning each file (dropping columns, filling missing values, etc.).
But when I append them, the order of the months is not correct: it starts from month 11, then goes to 8, etc. What am I doing wrong?
Also, how do I print a full dataframe? Currently I manage to print the full dataframe using the set_option method shown in the code below.
Here is the code:
Dir_data = '/content/DATA'
excel_clean = pd.DataFrame()
train_data = []
for i in os.listdir(Dir_data):
    excel_test = pd.read_excel(os.path.join(Dir_data, i))
    # drop columns
    excel_test.drop(columns=['ff_avg', 'ddd_x', 'ddd_car', 'ff_x', 'ss', 'Tn', 'Tx'], inplace=True)
    # start cleaning
    excel_test = excel_test.replace(8888, 'x').replace(9999, 'x').replace('', 'x')
    excel_test['RR'] = pd.to_numeric(excel_test['RR'], errors='coerce').astype('float64')
    # keep RH_avg as float until the NaNs are filled; int64 cannot hold NaN
    excel_test['RH_avg'] = pd.to_numeric(excel_test['RH_avg'], errors='coerce').astype('float64')
    excel_test['Tavg'] = pd.to_numeric(excel_test['Tavg'], errors='coerce').astype('float64')
    # excel_test.dtypes
    # fill missing values with the column mean
    excel_test['RR'] = excel_test['RR'].fillna(excel_test['RR'].mean())
    excel_test['RH_avg'] = excel_test['RH_avg'].fillna(excel_test['RH_avg'].mean())
    excel_test['Tavg'] = excel_test['Tavg'].fillna(excel_test['Tavg'].mean())
    excel_test['RR'] = excel_test['RR'].round(decimals=1)
    excel_test['Tavg'] = excel_test['Tavg'].round(decimals=1)
    # DataFrame.append is gone in pandas 2.x; use concat instead
    excel_clean = pd.concat([excel_clean, excel_test])
pd.set_option('display.max_rows', 99999)
pd.set_option('display.max_colwidth', 400)
pd.describe_option('display.max_colwidth')
excel_clean.reset_index(drop=True, inplace=True)
excel_clean
It's only for 1 station, as this is an experiment.
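A likely cause (an assumption on my part, since the question doesn't show the exact file names) is that os.listdir returns directory entries in arbitrary order, so nothing guarantees month order. Sorting the names numerically before looping restores it; a minimal sketch, assuming the files are named '1.xlsx' through '12.xlsx':
import os

# strip the extension, parse the month index, and sort on it: '11.xlsx' -> 11
files = sorted(os.listdir(Dir_data), key=lambda name: int(os.path.splitext(name)[0]))
for i in files:
    ...  # same cleaning loop as above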

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information, here is an example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of when the record was created.
deviceID  mileage  position_timestamp_measure
   54672       10                  1600696079
   43423       20                  1600696079
   42342        3                  1600701501
   54672        3                  1600702102
   43423        2                  1600702701
My goal is to validate the mileage by comparing it against the vehicle's maximum speed (which is 80 km/h): calculate the vehicle's speed from the timestamp difference and the mileage, and write the result back into the original dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to not valid
df_ori['validPosition'] = 0
maxSpeedKMH = 80  # max speed of the vehicles, per the description above
for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    # iterate through each record in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec / 60) / 60
        # calculate speed and compare against the max plausible speed
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'validPosition'] = 1
df_ori.validPosition.value_counts()
It definitely works the way I want it to, but performance is poor: the df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
# create new col and set all values to not valid
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# subtract the preceding value from the current value within each device
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# the operation above produces NaN for the first value in each group;
# fill 'valid' with 1 there, matching the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec'] / 3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) < maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for-loops (typically row-by-row iteration), which can be insanely slow.
Since I can't get the full context of your code, please double-check the logic and make sure it works as desired.
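As a self-contained illustration of the vectorized pattern, here is the same idea run on the sample rows from the question (a toy sketch, not the real 700k-row dataset):
import pandas as pd

toy = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079, 1600701501, 1600702102, 1600702701],
})
toy = toy.sort_values('position_timestamp_measure')
# per-device time delta in seconds; NaN marks the first record per device
toy['timeGoneSec'] = toy.groupby('device_id')['position_timestamp_measure'].diff()
toy['speed_kmh'] = toy['mileage'] / (toy['timeGoneSec'] / 3600)
# first record per device counts as valid, matching the original logic
toy['valid'] = ((toy['speed_kmh'] < 80) | toy['timeGoneSec'].isna()).astype(int)
print(toy)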

Faster loop in Pandas looking for ID and older date

So, I have a DataFrame that represents purchases with 4 columns:
date (date of purchase in format %Y-%m-%d)
customer_ID (string column)
claim (1-0 column: 1 means the customer complained about the purchase, 0 means they didn't)
claim_value (for claim = 1, how much the claim cost the company; NaN for claim = 0)
I need to build 3 new columns:
past_purchases (how many purchases the specific customer had before this purchase)
past_claims (how many claims the specific customer had before this purchase)
past_claims_value (how much did the customer's past claims cost)
This has been my approach until now:
past_purchases = []
past_claims = []
past_claims_value = []
for i in range(0, len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    # all strictly earlier purchases by the same customer
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]
    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())
df['past_purchases'] = past_purchases
df['past_claims'] = past_claims
df['past_claims_value'] = past_claims_value
The code works fine, but it's too slow. Can anyone make it faster? Thanks!
PS: It's important to check that the date is strictly older; if a customer had 2 purchases on the same date, they shouldn't count toward each other.
PPS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but I've never used them in a similar way before.
Maybe you can try using a cumsum over customers, if the dates are sorted in ascending order:
df.sort_values('date', inplace=True)
new_temp_columns = ['claim_s', 'claim_value_s']
# shift within each customer so the current purchase doesn't count itself;
# fillna(0) so the cumsum starts at 0 instead of propagating NaN
df[new_temp_columns] = df.groupby('customer_ID')[['claim', 'claim_value']].shift().fillna(0)
df['past_claims'] = df.groupby('customer_ID')['claim_s'].transform(pd.Series.cumsum)
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].transform(pd.Series.cumsum)
# take the min within each (customer, date) group, so purchases on
# the same date don't count toward each other
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform('min')
# remove temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]
Again, this will only work if the dates are sorted.
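The answer above only builds past_claims and past_claims_value; for past_purchases, here is a sketch of the same pattern using cumcount (my addition, reusing the same same-date rule):
# number each customer's purchases 0, 1, 2, ... in date order
df['past_purchases'] = df.groupby('customer_ID').cumcount()
# same-date purchases must not count toward each other, so again take
# the minimum within each (customer_ID, date) group
df['past_purchases'] = (
    df.groupby(['customer_ID', 'date'])['past_purchases'].transform('min')
)
Since cumcount numbers rows in the current (date-sorted) order, after the same-date minimum it equals the count of strictly earlier purchases.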

How to add entries in Pandas DataFrame?

Basically I have US census data that I have read into Pandas from a CSV file.
Now I have to write a function that finds counties in a specific manner (not going to explain that, because that's not what the question is about) in the table I got from the CSV file, and returns those counties.
MY TRY:
What I did is create lists named after the columns the function has to return, then applied the specific condition in a for loop, using an if-statement, to read the entries of all required columns into their respective lists. Then I created a new DataFrame and wanted to read the entries from the lists into it. I tried the same for loop to accomplish this, but all in vain; I tried making Series out of those lists and passing them as parameters to the DataFrame, still in vain; I made DataFrames out of those lists and tried to concatenate them with append(), still in vain. Any help would be appreciated.
CODE:
#idxl = list()
#st = list()
#cty = list()
idx2 = 0
cty_reg = pd.DataFrame(columns=('STNAME', 'CTYNAME'))
for idx in range(census_df['CTYNAME'].count()):
    if ((census_df.iloc[idx]['REGION'] == 1 or census_df.iloc[idx]['REGION'] == 2)
            and (census_df.iloc[idx]['POPESTIMATE2015'] > census_df.iloc[idx]['POPESTIMATE2014'])
            and census_df.loc[idx]['CTYNAME'].startswith('Washington')):
        #idxl.append(census_df.index[idx])
        #st.append(census_df.iloc[idx]['STNAME'])
        #cty.append(census_df.iloc[idx]['CTYNAME'])
        cty_reg.index[idx2] = census_df.index[idx]
        cty_reg.iloc[idx2]['STNAME'] = census_df.iloc[idx]['STNAME']
        cty_reg.iloc[idx2]['CTYNAME'] = census_df.iloc[idx]['CTYNAME']
        idx2 = idx2 + 1
cty_reg
SAMPLE TABLE:
   REGION  STNAME      CTYNAME
0  2       Wisconsin   Washington County
1  2       Alabama     Washington County
2  1       Texas       Atauga County
3  0       California  Washington County
SAMPLE OUTPUT:
   STNAME     CTYNAME
0  Wisconsin  Washington County
1  Alabama    Washington County
I apologize for my limited knowledge of US states and counties; I just put random state and county names in the sample table to show what I want to get out of it. Thanks for the help in advance.
There are some missing columns in the source DF posted in the OP, but reading the loop, I don't think a loop is required at all. Three filters are needed: on REGION, POPESTIMATE2015, and CTYNAME. If I have understood the logic in the OP, this should be feasible without the loop.
Option 1 - original answer
print(df.loc[
    (df.REGION.isin([1, 2])) &
    (df.POPESTIMATE2015 > df.POPESTIMATE2014) &
    (df.CTYNAME.str.startswith('Washington')),
    ['REGION', 'STNAME', 'CTYNAME']])
Option 2 - using and with pd.eval
q = pd.eval("(df.REGION.isin([1, 2])) and \
    (df.POPESTIMATE2015 > df.POPESTIMATE2014) and \
    (df.CTYNAME.str.startswith('Washington'))",
    engine='python')
print(df.loc[q, ['REGION', 'STNAME', 'CTYNAME']])
Option 3 - using and with df.query
regions_list = [1, 2]
dfq = df.query("REGION in @regions_list and \
    POPESTIMATE2015 > POPESTIMATE2014 and \
    CTYNAME.str.startswith('Washington')",
    engine='python')
print(dfq[['REGION', 'STNAME', 'CTYNAME']])
If I'm reading the logic in your code right, you want to select rows according to the following conditions:
REGION should be 1 or 2
POPESTIMATE2015 > POPESTIMATE2014
CTYNAME needs to start with "Washington"
In general, Pandas makes it easy to select rows based on conditions without having to iterate over the dataframe:
df = census_df[
    ((census_df.REGION == 1) | (census_df.REGION == 2)) &
    (census_df.POPESTIMATE2015 > census_df.POPESTIMATE2014) &
    census_df.CTYNAME.str.startswith('Washington')
]
Assuming you're selecting rows that satisfy some criterion, let's just say that select(row) returns True for selected rows and False otherwise. I won't infer what the criterion is, because you specifically said it's not important.
You then want the STNAME and CTYNAME of those rows.
So here's what you would do:
your_new_df = census_df[census_df.apply(select, axis=1)][['STNAME', 'CTYNAME']]
This one-liner will get you what you want, provided you write the select function that picks the rows.
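For concreteness, a hedged sketch of what such a select function could look like, using the three conditions from this question (select itself is just the placeholder name used above):
def select(row):
    # hypothetical row-level predicate; any condition works here
    return (
        row['REGION'] in (1, 2)
        and row['POPESTIMATE2015'] > row['POPESTIMATE2014']
        and row['CTYNAME'].startswith('Washington')
    )
Note that apply(..., axis=1) invokes this once per row in Python, so on large frames the boolean-mask options shown earlier will be considerably faster.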

Groupby in Pandas

My code is working, which is good lol, but the output needs to be presented differently.
UPDATED CODE SINCE RECEIVING ANSWER
import pandas as pd
# Import File
YMM = pd.read_excel('C:/Users/PCTR261010/Desktop/OMIX_YMM_2016.xlsx').groupby(['Make','Model']).agg({'StartYear':'min', 'EndYear':'max'})
print(YMM)
The output has the columns Make | Model | StartYear | EndYear, but the Makes are displayed like a pivot-table index: each Make appears only once, next to its first Model, rather than on every row.
I need American Motors next to every American Motors model, Buick next to every Buick model, and so on.
Here is the link to sample data:
http://jmp.sh/KLZKWVZ
Try this:
res = YMM.groupby(['Make','Model'], as_index=False).agg({'StartYear':'min', 'EndYear':'max'})
or
res = YMM.groupby(['Make','Model']).agg({'StartYear':'min', 'EndYear':'max'}).reset_index()
With your own code:
Min = YMM.groupby(['Make', 'Model']).StartYear.min()
Max = YMM.groupby(['Make', 'Model']).EndYear.max()
# Min and Max are Series; combine them into one frame and restore the keys as columns
res = pd.concat([Min, Max], axis=1).reset_index()
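As a usage note, a sketch combining the suggestions above: named aggregation (pandas 0.25+) with as_index=False keeps Make and Model as ordinary columns, so each Make repeats on every row instead of rendering like a pivot-table index. The name df here is my placeholder for the raw file:
df = pd.read_excel('C:/Users/PCTR261010/Desktop/OMIX_YMM_2016.xlsx')
# as_index=False keeps the group keys as regular columns in the result
YMM = df.groupby(['Make', 'Model'], as_index=False).agg(
    StartYear=('StartYear', 'min'),
    EndYear=('EndYear', 'max'),
)
print(YMM)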
