I have a Python program using Pandas, which reads two dataframes, obtained in the following links:
Casos-positivos-diarios-en-San-Nicolas-de-los-Garza-Promedio-movil-de-7-dias: https://datamexico.org/es/profile/geo/san-nicolas-de-los-garza#covid19-evolucion
Denuncias-segun-bien-afectado-en-San-Nicolas-de-los-GarzaClic-en-el-grafico-para-seleccionar: https://datamexico.org/es/profile/geo/san-nicolas-de-los-garza#seguridad-publica-denuncias
What I currently want to do is a groupby in the "covid" dataframe with the same dates, having a sum of these. Regardless, no method has worked out, which regularly prints an error indicating that I should be using a syntaxis for "PeriodIndex". Does anyone have a suggestion or solution? Thanks in advance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
#csv for the covid cases
covid = pd.read_csv('Casos-positivos-diarios-en-San-Nicolas-de-los-Garza-Promedio-movil-de-7-dias.csv')
#csv for complaints
comp = pd.read_csv('Denuncias-segun-bien-afectado-en-San-Nicolas-de-los-GarzaClic-en-el-grafico-para-seleccionar.csv')
#cleaning data in both dataframes
#keeping only the relevant columns
covid = covid[['Month','Daily Cases']]
comp = comp[['Month','Affected Legal Good', 'Value']]
#changing the labels from spanish to english
comp['Affected Legal Good'].replace({'Patrimonio': 'Heritage', 'Familia':'Family', 'Libertad y Seguridad Sexual':'Sexual Freedom and Safety', 'Sociedad':'Society', 'Vida e Integridad Corporal':'Life and Bodily Integrity', 'Libertad Personal':'Personal Freedom', 'Otros Bienes JurĂdicos Afectados (Del Fuero ComĂșn)':'Other Affected Legal Assets (Common Jurisdiction)'}, inplace=True, regex=True)
#changing the month types to dates
covid['Month'] = pd.to_datetime(covid['Month'])
covid['Month'] = covid['Month'].dt.to_period('M')
covid
You can simply usen group by statement.Timegrouper by default converts it to datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
#csv for the covid cases
covid = pd.read_csv('Casos-positivos-diarios-en-San-Nicolas-de-los-Garza-Promedio-movil-de-7-dias.csv')
covid = covid.groupby(['Month'])['Daily Cases'].sum()
covid = covid.reset_index()
# #changing the month types to dates
covid['Month'] = pd.to_datetime(covid['Month'])
covid['Month'] = covid['Month'].dt.to_period('M')
covid
Related
So I am trying to develop my Python skills and am stuck on a problem for a project which queries API data from collegefootballdata.com. I have successfully queried some datasets I need to merge into a final dataframe for analysis using the following code:
To set up the client/connection:
from __future__ import print_function
from cfbd.rest import ApiException
from pprint import pprint
from google.cloud import bigquery
import cfbd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
import plotly.graph_objects as go
import seaborn as sns
import math
import time
configuration = cfbd.Configuration()
configuration.api_key['Authorization'] = 'YOUR KEY HERE'
configuration.api_key_prefix['Authorization'] = 'Bearer'
api_config = cfbd.ApiClient(configuration)
recruit_api = cfbd.RecruitingApi(api_config)
Then, run something like this which goes through the years I want and selects the right variables:
recruit = []
for year in range(2006, 2022):
response = recruit_api.get_recruiting_players(year=year)
recruit = [*recruit, *response]
recruit_df = pd.DataFrame().from_records([
dict(ident = r.id,
rank = r.ranking,
name = r.name,
year = r.year,
rating=r.rating,
stars=r.stars,
position=r.position,
commit = r.committed_to)
for r in recruit
if r.ranking is not None
and r.name is not None
and r.year is not None
and r.rating is not None
])
recruit_df.tail()
This works in a decent amount of time. I then was able to query and merge this with some NFL draft outcomes data (see below).
BUT, when I want to bring in individual player statistics by year, that data is admittedly huge. Running similar code to above results in outrageous run times.
So my question is how do I bring in the yearly player stat data ONLY if it matches a name or ID from my dataframe I'm already working with?
Some dummy data once I have my merged df - the player yearly stats need to be brought in to match either ID OR name:
merged_data = {'id':[8263, 8264, 8265, 8266, 8267], 'name':['Bob', 'Jim', 'Al', 'Sean', 'Steve'], 'rating':[0.999, 0.998, 0.993, 0.876, 0.765], 'draft_grade':[95.0, 89.0, NaN, 92.0, 50.0], 'drafted':[1, 1, 0, 1, 0]}
df = pd.DataFrame(merged_data)
Thank you for your help!
so I started learning how to work with data in python. I wanted to load multiple securities. But I have an error that I can not fix for some reason. Could someone tell me what is the problem?
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
tickers = ['PG', 'MSFT', 'F', 'GE']
mydata = pd.DataFrame()
for t in tickers:
mydata[t] = wb.DataReader(t, data_source='yahoo', start = '1955-1-1')
you need 2 fixes here:
1) 1955 is too early for this data source, try 1971 or later.
2) your data from wb.DataReader(t, data_source='yahoo', start = '1971-1-1') comes as dataframe with multiple series, so you can not save it to mydata[t] as single series. Use a dictionary as in the other answer or save only closing prices:
mydata[t] = pdr.data.DataReader(t, data_source='yahoo', start = '2010-1-1')['Close']
First of all please do not share information as images unless absolutely necessary.
See: this link
Now here is a solution to your problem. You are using year '1955' but there is a possibility that data is not available for this year or there may be some other issues. But when you select the right year it will work. Another thing it returns data as dataframe so you can not assign it like a dictionary so instead of making a DataFram you should make a dictionary and store all dataframes into it.
Here is improved code choose year carefully
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
from datetime import datetime as dt
tickers = ['PG', 'MSFT', 'F', 'GE']
mydata = {}
for t in tickers:
mydata[t] = wb.DataReader(t, data_source='yahoo',start=dt(2019, 1, 1), end=dt.now())
Output
mydata['PG']
High Low Open Close Volume Adj Close
Date
2018-12-31 92.180000 91.150002 91.629997 91.919998 7239500.0 88.877655
2019-01-02 91.389999 89.930000 91.029999 91.279999 9843900.0 88.258835
2019-01-03 92.500000 90.379997 90.940002 90.639999 9820200.0 87.640022
2019-01-04 92.489998 90.370003 90.839996 92.489998 10565700.0 89.428787
This is my code:
import pandas as pd
cols= ['DD','MM','YYYY','HH'] #names
DD,MM,YYYY,HH=[1,2,None,4,5,5],[1,1,1,2,2,3],[2014,2014,2014,2014,2014,2014],[20,20,20,18,18,18] #data
df = pd.DataFrame(list(zip(DD,MM,YYYY,HH)), columns =cols )
print (df)
a=pd.crosstab(df .HH, df .MM,margins=True)
print (a)
I would like to view results in a table format. Table borders or at least the same number of digits would solve the problem.
I want to see the table on console without any graphical window.
If you want a nicely looking crosstab you can use seaborn.heatmap
An example
>>> import numpy as np; np.random.seed(0)
>>> import seaborn as sns; sns.set()
>>> uniform_data = np.random.rand(10, 12)
>>> ax = sns.heatmap(uniform_data)
result would look like this:
You can find many examples that show how to apply this, e.g.:
https://www.science-emergence.com/Codes/How-to-plot-a-confusion-matrix-with-matplotlib-and-seaborn/
https://gist.github.com/shaypal5/94c53d765083101efc0240d776a23823
Update
In order to simply display the crosstab in a formatted way you can skip the print and display like this
import pandas as pd
cols= ['DD','MM','YYYY','HH'] #names
DD,MM,YYYY,HH=[1,2,None,4,5,5],[1,1,1,2,2,3],[2014,2014,2014,2014,2014,2014],[20,20,20,18,18,18] #data
df = pd.DataFrame(list(zip(DD,MM,YYYY,HH)), columns =cols )
print (df)
a = pd.crosstab(df .HH, df .MM,margins=True)
display(a)
which will yield the same result as:
pd.crosstab(df .HH, df .MM,margins=True)
I am trying to select several years from a dataframe in monthly resolution.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import netCDF4 as nc
#-- open net-cdf and read in variables
data = nc.Dataset('test.nc')
time = nc.num2date(data.variables['Time'][:],
data.variables['Time'].units)
df = pd.DataFrame(data.variables['mgpp'][:,0,0], columns=['mgpp'])
df['dates'] = time
df = df.set_index('dates')
print(df.head())
This is what the head looks like:
mgpp
dates
1901-01-01 0.040735
1901-02-01 0.041172
1901-03-01 0.053889
1901-04-01 0.066906
Now I managed to extract one year:
df_cp = df[df.index.year == 2001]
but how would I extract several years, say 1997, 2001 and 2007 and have them stored in the same dataframe? Is there a one/ two line solution? My only idea for now is to iterate and then merge the dataframes but maybe there is a better solution!
My process is this:
Import csv of data containing dates, activations, and cancellations
subset the data by activated or cancelled
pivot the data with aggfunc 'sum'
convert back to data frames
Now, I need to merge the 2 data frames together but there are dates that exist in one data frame but not the other. Both data frames start Jan 1, 2017 and end Dec 31, 2017. Preferably, the output for any observation in which the index month needs to be filled with have a corresponding value of 0.
Here's the .head() from both data frames:
For reference, here's the code up to this point:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import datetime
%matplotlib inline
#import data
directory1 = "C:\python\Contracts"
directory_source = os.path.join(directory1, "Contract_Data.csv")
df_source = pd.read_csv(directory_source)
#format date ranges as times
#df_source["Activation_Month"] = pd.to_datetime(df_source["Activation_Month"])
#df_source["Cancellation_Month"] = pd.to_datetime(df_source["Cancellation_Month"])
df_source["Activation_Day"] = pd.to_datetime(df_source["Activation_Day"])
df_source["Cancellation_Day"] = pd.to_datetime(df_source["Cancellation_Day"])
#subset the data based on status
df_active = df_source[df_source["Order Status"]=="Active"]
df_active = pd.DataFrame(df_active[["Activation_Day", "Event_Value"]].copy())
df_cancelled = df_source[df_source["Order Status"]=="Cancelled"]
df_cancelled = pd.DataFrame(df_cancelled[["Cancellation_Day", "Event_Value"]].copy())
#remove activations outside 2017 and cancellations outside 2017
df_cancelled = df_cancelled[(df_cancelled['Cancellation_Day'] > '2016-12-31') &
(df_cancelled['Cancellation_Day'] <= '2017-12-31')]
df_active = df_active[(df_active['Activation_Day'] > '2016-12-31') &
(df_active['Activation_Day'] <= '2017-12-31')]
#pivot the data to aggregate by day
df_active_aggregated = df_active.pivot_table(index='Activation_Day',
values='Event_Value',
aggfunc='sum')
df_cancelled_aggregated = df_cancelled.pivot_table(index='Cancellation_Day',
values='Event_Value',
aggfunc='sum')
#convert pivot tables back to useable dataframes
activations_aggregated = pd.DataFrame(df_active_aggregated.to_records())
cancellations_aggregated = pd.DataFrame(df_cancelled_aggregated.to_records())
#rename the time columns so they can be referenced when merging into one DF
activations_aggregated.columns = ["index_month", "Activations"]
#activations_aggregated = activations_aggregated.set_index(pd.DatetimeIndex(activations_aggregated["index_month"]))
cancellations_aggregated.columns = ["index_month", "Cancellations"]
#cancellations_aggregated = cancellations_aggregated.set_index(pd.DatetimeIndex(cancellations_aggregated["index_month"]))
I'm aware there are many posts that address issues similar to this but I haven't been able to find anything that has helped. Thanks to anyone that can give me a hand with this!
You can try:
activations_aggregated.merge(cancellations_aggregated, how='outer', on='index_month').fillna(0)