Select several years pandas dataframe - python

I am trying to select several years from a dataframe in monthly resolution.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import netCDF4 as nc
#-- open net-cdf and read in variables
data = nc.Dataset('test.nc')
time = nc.num2date(data.variables['Time'][:],
                   data.variables['Time'].units)
df = pd.DataFrame(data.variables['mgpp'][:,0,0], columns=['mgpp'])
df['dates'] = time
df = df.set_index('dates')
print(df.head())
This is what the head looks like:
mgpp
dates
1901-01-01 0.040735
1901-02-01 0.041172
1901-03-01 0.053889
1901-04-01 0.066906
Now I managed to extract one year:
df_cp = df[df.index.year == 2001]
But how would I extract several years, say 1997, 2001 and 2007, and have them stored in the same dataframe? Is there a one- or two-line solution? My only idea for now is to iterate and then merge the dataframes, but maybe there is a better solution!
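One possible approach, sketched under the assumption that the index is a DatetimeIndex as in the question: `df.index.year` supports `isin`, so several years can be selected with a single boolean mask. The toy data below is made up and only stands in for the netCDF variable.

```python
import numpy as np
import pandas as pd

# Toy monthly series standing in for the netCDF data (values are made up).
dates = pd.date_range("1995-01-01", "2008-12-01", freq="MS")
df = pd.DataFrame({"mgpp": np.arange(len(dates), dtype=float)}, index=dates)
df.index.name = "dates"

# Keep only the rows whose year is in the wanted list.
years = [1997, 2001, 2007]
df_sel = df[df.index.year.isin(years)]
print(df_sel.index.year.unique().tolist())
```

The result stays in one dataframe, so no iteration or merging is needed.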

Related

Pandas groupby using only year and month

I have a Python program using Pandas, which reads two dataframes, obtained in the following links:
Casos-positivos-diarios-en-San-Nicolas-de-los-Garza-Promedio-movil-de-7-dias: https://datamexico.org/es/profile/geo/san-nicolas-de-los-garza#covid19-evolucion
Denuncias-segun-bien-afectado-en-San-Nicolas-de-los-GarzaClic-en-el-grafico-para-seleccionar: https://datamexico.org/es/profile/geo/san-nicolas-de-los-garza#seguridad-publica-denuncias
What I currently want to do is a groupby on the "covid" dataframe that sums the rows sharing the same date. However, no method has worked so far; each attempt prints an error indicating that I should be using the syntax for "PeriodIndex". Does anyone have a suggestion or solution? Thanks in advance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
#csv for the covid cases
covid = pd.read_csv('Casos-positivos-diarios-en-San-Nicolas-de-los-Garza-Promedio-movil-de-7-dias.csv')
#csv for complaints
comp = pd.read_csv('Denuncias-segun-bien-afectado-en-San-Nicolas-de-los-GarzaClic-en-el-grafico-para-seleccionar.csv')
#cleaning data in both dataframes
#keeping only the relevant columns
covid = covid[['Month','Daily Cases']]
comp = comp[['Month','Affected Legal Good', 'Value']]
#changing the labels from spanish to english
comp['Affected Legal Good'].replace({'Patrimonio': 'Heritage', 'Familia':'Family', 'Libertad y Seguridad Sexual':'Sexual Freedom and Safety', 'Sociedad':'Society', 'Vida e Integridad Corporal':'Life and Bodily Integrity', 'Libertad Personal':'Personal Freedom', 'Otros Bienes Jurídicos Afectados (Del Fuero Común)':'Other Affected Legal Assets (Common Jurisdiction)'}, inplace=True, regex=True)
#changing the month types to dates
covid['Month'] = pd.to_datetime(covid['Month'])
covid['Month'] = covid['Month'].dt.to_period('M')
covid
You can simply use a groupby statement. TimeGrouper converts it to datetime by default.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
#csv for the covid cases
covid = pd.read_csv('Casos-positivos-diarios-en-San-Nicolas-de-los-Garza-Promedio-movil-de-7-dias.csv')
covid = covid.groupby(['Month'])['Daily Cases'].sum()
covid = covid.reset_index()
# #changing the month types to dates
covid['Month'] = pd.to_datetime(covid['Month'])
covid['Month'] = covid['Month'].dt.to_period('M')
covid
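A variation on the code above, as a minimal sketch: it assumes the `Month` column holds daily date strings (the real CSV layout is not shown here, and the values below are made up). Converting to a monthly period first and grouping afterwards gives one summed row per year and month.

```python
import pandas as pd

# Toy daily data; the column names follow the question, the values are made up.
covid = pd.DataFrame({
    "Month": ["2020-03-30", "2020-03-31", "2020-04-01", "2020-04-02"],
    "Daily Cases": [1, 2, 3, 4],
})

# Convert to datetime, then to a monthly period, and sum within each month.
covid["Month"] = pd.to_datetime(covid["Month"]).dt.to_period("M")
monthly = covid.groupby("Month", as_index=False)["Daily Cases"].sum()
print(monthly)
```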

Work with data in python and numpy/pandas

I have started learning how to work with data in Python. I wanted to load multiple securities, but I get an error that I cannot fix. Could someone tell me what the problem is?
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
tickers = ['PG', 'MSFT', 'F', 'GE']
mydata = pd.DataFrame()
for t in tickers:
    mydata[t] = wb.DataReader(t, data_source='yahoo', start = '1955-1-1')
You need two fixes here:
1) 1955 is too early for this data source; try 1971 or later.
2) The data from wb.DataReader(t, data_source='yahoo', start = '1971-1-1') comes back as a dataframe with multiple series, so you cannot save it to mydata[t] as a single series. Use a dictionary as in the other answer, or save only the closing prices:
mydata[t] = wb.DataReader(t, data_source='yahoo', start = '2010-1-1')['Close']
First of all please do not share information as images unless absolutely necessary.
See: this link
Now here is a solution to your problem. You are using the year 1955, but data may not be available for that year, or there may be other issues; once you select a suitable year it will work. Also, DataReader returns a dataframe, so you cannot assign it to a single column of another DataFrame. Instead of creating a DataFrame, create a dictionary and store all the dataframes in it.
Here is the improved code; choose the year carefully:
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
from datetime import datetime as dt
tickers = ['PG', 'MSFT', 'F', 'GE']
mydata = {}
for t in tickers:
    mydata[t] = wb.DataReader(t, data_source='yahoo', start=dt(2019, 1, 1), end=dt.now())
Output
mydata['PG']
High Low Open Close Volume Adj Close
Date
2018-12-31 92.180000 91.150002 91.629997 91.919998 7239500.0 88.877655
2019-01-02 91.389999 89.930000 91.029999 91.279999 9843900.0 88.258835
2019-01-03 92.500000 90.379997 90.940002 90.639999 9820200.0 87.640022
2019-01-04 92.489998 90.370003 90.839996 92.489998 10565700.0 89.428787
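If a single table of closing prices is still wanted, the dictionary can be collapsed into one DataFrame afterwards. This is only a sketch: the per-ticker frames below are made-up stand-ins for the DataReader results, so nothing is downloaded.

```python
import numpy as np
import pandas as pd

tickers = ["PG", "MSFT"]
idx = pd.date_range("2019-01-02", periods=3, freq="B")

# Stand-in for the DataReader results: one DataFrame per ticker (made-up values).
mydata = {
    t: pd.DataFrame(
        {"Open": np.arange(3.0) + k, "Close": np.arange(3.0) + 10 * k},
        index=idx,
    )
    for k, t in enumerate(tickers, start=1)
}

# Collect one 'Close' column per ticker into a single DataFrame.
closes = pd.DataFrame({t: frame["Close"] for t, frame in mydata.items()})
print(closes)
```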

How to read columns from different files and plot?

I have concentration data for every day from 2005 to 2018. I want to read one column from each of three different files and combine them into one table, so I can plot them.
Data:file 1
time, mean_OMNO2d_003_ColumnAmountNO2CloudScreened
2005-01-01,-1.267651e+30
2005-01-02,4.90778397e+15
...
2018-12-31,-1.267651e+30
Data:file 2
time, OMNO2d_003_ColumnAmountNO2TropCloudScreened
2005-01-01,-1.267651e+30
2005-01-02,3.07444147e+15
...
Data:file 3
time, OMSO2e_003_ColumnAmountSO2_PBL
2005-01-01,-1.267651e+30
2005-01-02,-0.0144000314
...
I want to plot time and mean_OMNO2d_003_ColumnAmountNO2CloudScreened, OMNO2d_003_ColumnAmountNO2TropCloudScreened, OMSO2e_003_ColumnAmountSO2_PBL into one graph.
import glob
import pandas as pd
file_list = glob.glob('*.csv')
no= []
no2=[]
so2=[]
for f in file_list:
    df= pd.read_csv(f, skiprows=8, parse_dates =['time'], index_col ='time')
    df.columns=['no','no2','so2']
    no.append([df["no"]])
    no2.append([df["no2"]])
    so2.append([df["so2"]])
How do I solve the problem?
This is very doable. I had a similar problem with 3 files all in one plot. My understanding is that you want to compare levels of NO, NO2, and SO2, that each column is in comparable order, and that you want to compare across rows. If you are ok with importing matplotlib and numpy, something like this may work for you:
import numpy as np
import matplotlib.pyplot as plt
# column names follow the question; adjust them to your data frame
NO = np.asarray(df["no"])
NO2 = np.asarray(df["no2"])
SO2 = np.asarray(df["so2"])
timestamp = np.asarray(df["your_time_stamp"])
# draw all three series on the same axes
plt.plot(timestamp, NO)
plt.plot(timestamp, NO2)
plt.plot(timestamp, SO2)
plt.savefig(name_of_plot)
This will need some adjusting for your specific data frame, but I hope you see what I am getting at!
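A minimal sketch of the combining step, under two assumptions that are not confirmed above: each file has a `time` column plus exactly one data column, and -1.267651e+30 is a missing-value marker. The in-memory strings stand in for the three files, and the short names `no2`, `no2_trop` and `so2` are made up.

```python
import io
import pandas as pd

# In-memory stand-ins for the three CSV files (shortened; values from the question).
files = {
    "no2": "time,value\n2005-01-01,-1.267651e+30\n2005-01-02,4.90778397e+15\n",
    "no2_trop": "time,value\n2005-01-01,-1.267651e+30\n2005-01-02,3.07444147e+15\n",
    "so2": "time,value\n2005-01-01,-1.267651e+30\n2005-01-02,-0.0144000314\n",
}

series = {}
for name, text in files.items():
    df = pd.read_csv(io.StringIO(text), parse_dates=["time"], index_col="time")
    series[name] = df.iloc[:, 0]  # each file holds a single data column

# One column per file, aligned on the shared time index.
combined = pd.DataFrame(series)
# Assumption: -1.267651e+30 marks missing data, so replace it with NaN.
combined = combined.where(combined > -1e30)
print(combined)
```

With matplotlib installed, `combined.plot()` would then draw all three columns against time in one graph.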

Merge Data Frames By Date With Unequal Dates

My process is this:
Import csv of data containing dates, activations, and cancellations
subset the data by activated or cancelled
pivot the data with aggfunc 'sum'
convert back to data frames
Now, I need to merge the 2 data frames together, but there are dates that exist in one data frame but not the other. Both data frames start Jan 1, 2017 and end Dec 31, 2017. Preferably, any date that is missing from one data frame should get a value of 0 in the output.
Here's the .head() from both data frames:
For reference, here's the code up to this point:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import datetime
%matplotlib inline
#import data
directory1 = r"C:\python\Contracts"
directory_source = os.path.join(directory1, "Contract_Data.csv")
df_source = pd.read_csv(directory_source)
#format date ranges as times
#df_source["Activation_Month"] = pd.to_datetime(df_source["Activation_Month"])
#df_source["Cancellation_Month"] = pd.to_datetime(df_source["Cancellation_Month"])
df_source["Activation_Day"] = pd.to_datetime(df_source["Activation_Day"])
df_source["Cancellation_Day"] = pd.to_datetime(df_source["Cancellation_Day"])
#subset the data based on status
df_active = df_source[df_source["Order Status"]=="Active"]
df_active = pd.DataFrame(df_active[["Activation_Day", "Event_Value"]].copy())
df_cancelled = df_source[df_source["Order Status"]=="Cancelled"]
df_cancelled = pd.DataFrame(df_cancelled[["Cancellation_Day", "Event_Value"]].copy())
#remove activations outside 2017 and cancellations outside 2017
df_cancelled = df_cancelled[(df_cancelled['Cancellation_Day'] > '2016-12-31') &
                            (df_cancelled['Cancellation_Day'] <= '2017-12-31')]
df_active = df_active[(df_active['Activation_Day'] > '2016-12-31') &
                      (df_active['Activation_Day'] <= '2017-12-31')]
#pivot the data to aggregate by day
df_active_aggregated = df_active.pivot_table(index='Activation_Day',
                                             values='Event_Value',
                                             aggfunc='sum')
df_cancelled_aggregated = df_cancelled.pivot_table(index='Cancellation_Day',
                                                   values='Event_Value',
                                                   aggfunc='sum')
#convert pivot tables back to useable dataframes
activations_aggregated = pd.DataFrame(df_active_aggregated.to_records())
cancellations_aggregated = pd.DataFrame(df_cancelled_aggregated.to_records())
#rename the time columns so they can be referenced when merging into one DF
activations_aggregated.columns = ["index_month", "Activations"]
#activations_aggregated = activations_aggregated.set_index(pd.DatetimeIndex(activations_aggregated["index_month"]))
cancellations_aggregated.columns = ["index_month", "Cancellations"]
#cancellations_aggregated = cancellations_aggregated.set_index(pd.DatetimeIndex(cancellations_aggregated["index_month"]))
I'm aware there are many posts that address issues similar to this but I haven't been able to find anything that has helped. Thanks to anyone that can give me a hand with this!
You can try:
activations_aggregated.merge(cancellations_aggregated, how='outer', on='index_month').fillna(0)
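To illustrate what the outer merge does, here is a small sketch on made-up data that uses the same column names as the question. Dates present in only one frame produce NaN in the other frame's column, which `fillna(0)` then replaces; the sorting step is an addition for readability.

```python
import pandas as pd

# Toy frames with partly different dates (values are made up).
activations = pd.DataFrame({
    "index_month": pd.to_datetime(["2017-01-01", "2017-01-03"]),
    "Activations": [5, 2],
})
cancellations = pd.DataFrame({
    "index_month": pd.to_datetime(["2017-01-02", "2017-01-03"]),
    "Cancellations": [1, 4],
})

# The outer join keeps every date from both frames; missing values become 0.
merged = (
    activations.merge(cancellations, how="outer", on="index_month")
    .fillna(0)
    .sort_values("index_month")
    .reset_index(drop=True)
)
print(merged)
```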

Python pandas time series resampling

I am writing scripts in pandas, but I am not able to produce the output that I want. Here is the problem:
I can read this data from a CSV file. Here you can find the table structure:
http://postimg.org/image/ie0od7ejr/
I want this output from the table data above:
Month      Demo1  Demo2
June 2013  3      1
July 2013  2      2
In the Demo1 and Demo2 columns I want to count the regular entries and the entries that start with u, respectively. For June there are 3 regular entries in total and 1 entry that starts with u.
So far I have written this code:
import sqlite3
from pylab import *
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
import pandas as pd
conn = sqlite3.connect('Demo2.sqlite')
df = pd.read_sql("SELECT * FROM Data", conn)
df['DateTime'] = df['DATE'].apply(lambda x: dt.date.fromtimestamp(x))
df1 = df.set_index('DateTime', drop=False)
Thanks in advance for any help. The end result should be a bar graph; I can draw the graph from the output I mentioned above.
For resample, you can define two aggregation functions like this:
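def countU(x):
    # number of entries that start with 'u'
    return sum(i[0] == 'u' for i in x)
def countNotU(x):
    # number of entries that do not start with 'u'
    return sum(i[0] != 'u' for i in x)
# the how= argument was removed from resample; use .agg instead
print(df.resample('M').agg([countU, countNotU]))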
Alternatively, consider groupby.
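A minimal sketch of the groupby route, on made-up data. The real table is only available as an image, so the column name `entry` and the values are hypothetical; `DateTime` follows the question's code but is assumed here to be a proper datetime column.

```python
import pandas as pd

# Toy data; the "entry" column name and all values are hypothetical.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2013-06-01", "2013-06-10", "2013-06-15", "2013-06-20",
        "2013-07-01", "2013-07-05", "2013-07-09", "2013-07-20",
    ]),
    "entry": ["a1", "b2", "c3", "u4", "a5", "u6", "b7", "u8"],
})

starts_with_u = df["entry"].str.startswith("u")
month = df["DateTime"].dt.to_period("M")

# Summing a boolean Series counts its True values within each month.
counts = pd.DataFrame({
    "Demo1": (~starts_with_u).groupby(month).sum(),
    "Demo2": starts_with_u.groupby(month).sum(),
})
print(counts)
```

With matplotlib installed, `counts.plot(kind="bar")` would turn the table into the bar graph mentioned in the question.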
