In the following MWE, my year variable is shown on the x-axis as 0 to 6 instead of the actual years. Why is this?
import pandas as pd
from pandas_datareader import wb
from ggplot import *
dat = wb.download(
    indicator=['BX.KLT.DINV.CD.WD', 'BX.KLT.DINV.WD.GD.ZS'],
    country='CN', start=2005, end=2011)
dat.reset_index(inplace=True)
print(ggplot(aes(x='year', y='BX.KLT.DINV.CD.WD'), data=dat)
      + geom_line() + theme_bw())
The year column comes back from wb.download as strings (object dtype), so ggplot treats it as categorical and labels the seven points by position, 0 to 6. All you need to do is convert the year column from object dtype to datetime64:
dat['year'] = pd.to_datetime(dat['year'])
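If a plain numeric axis is enough, an integer conversion also works (a minimal sketch, assuming the column holds year strings like '2005'):
dat['year'] = dat['year'].astype(int)  # numeric axis instead of dates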
My initial dataframe is like this:
(screenshot of the original dataframe, which has a Date column of strings in YYYY-MM-DD form)
My code:
import pandas as pd
import numpy as np
def visualize_weather():
    df = pd.read_csv('weather.csv')

    def break_date(date):
        day = date[-2:]
        month = date[-5:-3]
        year = date[:4]
        return day, month, year

    df['Day'], df['Month'], df['Year'] = df['Date'].apply(break_date)
    return df[['Day', 'Month', 'Year', 'Date']]

visualize_weather()
I am trying to break the date into day, month and year and store them in separate columns.
But I am getting the error:
ValueError: too many values to unpack (expected 3)
Is there any way to achieve this without writing three separate functions for day, month and year?
You can use the following code to modify the dataframe in place. You should write to the dataframe object inside the function directly, otherwise your changes will be lost.
import pandas as pd

df = pd.DataFrame(data={'dates': pd.bdate_range('2020-07-01', '2020-07-31', freq='B')})

def func(row):
    # write each date component back to the outer dataframe by row label
    df.loc[row.name, 'Day'] = row['dates'].day
    df.loc[row.name, 'Month'] = row['dates'].month
    df.loc[row.name, 'Year'] = row['dates'].year

df.apply(func, axis=1)
print('Done')
Hmm, processing dates row by row like that is not good practice. Instead:
Use pd.to_datetime() if you need a datetime column.
Since you're working with dates, you can use datetime.year, datetime.month and datetime.day.
So:
import pandas as pd
import datetime

first_date = datetime.date(2020, 1, 3)
second_date = datetime.date(2019, 2, 10)
third_date = datetime.date(2018, 2, 20)
df = pd.DataFrame({"dates": [first_date, second_date, third_date]})

def new_col(df):
    size = df.size
    years = []
    months = []
    days = []
    for row in range(0, size):
        year = df.iloc[row, 0].year
        years.append(year)
        month = df.iloc[row, 0].month
        months.append(month)
        day = df.iloc[row, 0].day
        days.append(day)
    df['years'] = years
    df['months'] = months
    df['days'] = days
    df.drop(['dates'], axis='columns', inplace=True)
    return df

new_col(df)
PS. You can also pass in whatever column names you like.
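For reference, the same split can be done without an explicit loop, via pd.to_datetime and the .dt accessor (a minimal sketch with string dates; to_datetime also accepts datetime.date objects as above):

import pandas as pd

df2 = pd.DataFrame({'dates': ['2020-01-03', '2019-02-10', '2018-02-20']})
parsed = pd.to_datetime(df2['dates'])  # parse strings into datetime64
df2['years'] = parsed.dt.year
df2['months'] = parsed.dt.month
df2['days'] = parsed.dt.day
print(df2)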
Say I have an xarray DataArray, where one of the dimensions is a time dimension:
import numpy as np
import xarray as xr
import pandas as pd
time = pd.date_range('1980-01-01', '2017-12-01', freq='MS')
time = xr.DataArray(time, dims=('time',), coords={'time':time})
da = xr.DataArray(np.random.rand(len(time)), dims=('time',), coords={'time':time})
Now if I only want the years from 1990 to 2000, that is easy:
da.sel(time=slice('1990', '2000'))
But what if I want to drop these years? I want the data for all years except those.
da.drop_sel(time=slice('1990', '2000'))
fails with
TypeError: unhashable type: 'slice'
What's going on? What's the proper way to do that?
At the moment, I'm creating a new DataArray:
tdrop = da.time.sel(time=slice('1990', '2000'))
da.drop_sel(time=tdrop)
But that seems unnecessarily convoluted.
What about using where with the optional drop parameter set to True to filter on the year? With the example below, the data with 1990 <= year <= 2000 would be dropped.
da = da.where((da["time.year"] < 1990) | (da["time.year"] > 2000), drop=True)
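A quick check against the DataArray built above (a sketch; the printed years should skip 1990 through 2000):

filtered = da.where((da['time.year'] < 1990) | (da['time.year'] > 2000), drop=True)
print(sorted(set(filtered.time.dt.year.values)))  # 1980-1989 and 2001-2017 only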
I want to add a column that is the end-of-the-month date to a pandas dataframe. Based on this answer, I tried the following:
import numpy as np
import pandas as pd
dates = ['2014-06-02', '2014-06-03', '2014-06-04', '2014-06-05', '2014-06-06']
sp500_index = [1924.969971, 1924.23999, 1927.880005, 1940.459961, 1949.439941]
df_sp500 = pd.DataFrame({'Date' : dates, 'Close' : sp500_index})
df_sp500['Date'] = pd.to_datetime(df_sp500['Date'], format='%Y-%m-%d')
df_sp500['EOM'] = df_sp500['Date'].dt.ceil('M') # breaks on this line
#df_sp500 = df_sp500[df_sp500['Date'] == df_sp500['EOM']]
df_sp500
but I get this error message:
AttributeError: Can only use .dt accessor with datetimelike values
The reason I want to add this column is to use it to filter out all but the EOM dates as shown in the commented out line.
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
dates = ['2014-06-02', '2014-06-03', '2014-06-04', '2014-06-05', '2014-06-06']
sp500_index = [1924.969971, 1924.23999, 1927.880005, 1940.459961, 1949.439941]
df_sp500 = pd.DataFrame({'Date' : dates, 'Close' : sp500_index})
df_sp500['EOM'] = pd.to_datetime(df_sp500['Date'], format='%Y-%m-%d') + MonthEnd(0)
# df_sp500['EOM'] = df_sp500['EOM'].dt.day  # add this if you want only the day
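The choice of MonthEnd(0) over MonthEnd(1) matters for dates that already fall on a month end; a quick sketch:

from pandas.tseries.offsets import MonthEnd
import pandas as pd

d = pd.Timestamp('2014-06-30')
print(d + MonthEnd(0))  # 2014-06-30: rolls forward only if needed
print(d + MonthEnd(1))  # 2014-07-31: always moves to the next month end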
This is already built into pandas as Series.dt.is_month_end. Instead of calculating a new column, just subset with:
df_sp500[df_sp500.Date.dt.is_month_end]
Input Data
dates = ['2014-06-02', '2014-06-03', '2014-06-04', '2014-06-05', '2014-06-06']
sp500_index = [1924.969971, 1924.23999, 1927.880005, 1940.459961, 1949.439941]
df_sp500 = pd.DataFrame({'Date' : dates, 'Close' : sp500_index})
df_sp500['Date'] = pd.to_datetime(df_sp500['Date'], format='%Y-%m-%d')
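Note that none of the five sample dates falls on a month end, so the subset above comes back empty for this data; a quick sketch with a range that does include one (the Close values are made up for illustration):

df = pd.DataFrame({'Date': pd.to_datetime(['2014-06-27', '2014-06-30', '2014-07-01']),
                   'Close': [1.0, 2.0, 3.0]})
print(df[df.Date.dt.is_month_end])  # keeps only the 2014-06-30 row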
Based on the documentation:
The frequency level to ceil the index to. Must be a fixed frequency like 'S' (second) not 'ME' (month end)
So we can use MonthBegin (or MonthEnd) for your case:
df_sp500['Date'] - pd.offsets.MonthBegin(1)  # or df_sp500['Date'] + pd.offsets.MonthEnd(1)
0 2014-06-01
1 2014-06-01
2 2014-06-01
3 2014-06-01
4 2014-06-01
Name: Date, dtype: datetime64[ns]
I have a dataset like the one shown below.
Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000
I've used pandas to get the data into a DataFrame. The dataset covers multiple days, with one row per minute.
I want to plot a separate graph of the voltage against the time (shown in column 2) for each day (shown in column 1) using Python. How can I do that?
txt = '''Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000'''
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt

f = StringIO(txt)
df = pd.read_table(f, sep=';')
plt.plot(df['Time'], df['Voltage'])
plt.show()
gives this output (plot image omitted).
I believe this will do the trick (I edited the data so there are two distinct dates):
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline #If you use Jupyter Notebook
df = pd.read_csv('test.csv', sep=';', usecols=['Date','Time','Voltage'])
unique_dates = df.Date.unique()
for date in unique_dates:
    print('Date: ' + date)
    df.loc[df.Date == date].plot.line('Time', 'Voltage')
    plt.show()
You will get one plot per date (images omitted).
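An equivalent, slightly tidier loop uses groupby instead of filtering by hand (a sketch, assuming the same Date/Time/Voltage columns):

import matplotlib.pyplot as plt

for date, day_df in df.groupby('Date'):  # one group per calendar day
    day_df.plot(x='Time', y='Voltage', title=str(date))
    plt.show()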
X = df.Date.unique()
for i in X:                              # iterate over the unique days
    temp_df = df[df.Date == i]           # get the rows for one specific day
    temp_df.plot(x='Time', y='Voltage')  # plot that day
If you want to change the x values you can use
x = np.arange(1, len(temp_df.Time) + 1)  # 1..n instead of the time strings
Group by hour and minute after creating a DateTime column to handle multiple days; you can then filter the grouped result for a specific day.
txt = '''Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000'''
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt

f = StringIO(txt)
df = pd.read_table(f, sep=';')
# build a proper datetime index; the Date column is day-first (dd/mm/yyyy)
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')
df.set_index('DateTime', inplace=True)
day = df[df['Date'] == '16/12/2006']  # filter for a specific day
grouped = day.groupby([day.index.hour, day.index.minute])['Voltage'].mean()
grouped.plot()
plt.show()
I am writing scripts with pandas but I have not been able to extract the output I want. Here is the problem:
I can read this data from a CSV file, and you can find the table structure here:
http://postimg.org/image/ie0od7ejr/
I want this output from the above table data:
Month      Demo1  Demo2
June 2013      3      1
July 2013      2      2
In the Demo1 and Demo2 columns I want to count the regular entries and the entries that start with 'u'. For June there are 3 regular entries and 1 entry starting with 'u'.
So far I have written this code:
import sqlite3
from pylab import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
conn = sqlite3.connect('Demo2.sqlite')
df = pd.read_sql("SELECT * FROM Data", conn)
df['DateTime'] = df['DATE'].apply(lambda x: dt.date.fromtimestamp(x))
df1 = df.set_index('DateTime', drop=False)
Thanks in advance for the help. The end result will be a bar graph; I can draw the graph from the output mentioned above.
For resample, you can define two aggregation functions like this:
def countU(x):
    return sum(i[0] == 'u' for i in x)  # entries starting with 'u'

def countNotU(x):
    return sum(i[0] != 'u' for i in x)  # regular entries

print(df.resample('M', how=[countU, countNotU]))
Alternatively, consider groupby.
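Note that the how= argument was removed from resample in later pandas versions; a minimal sketch of the modern equivalent with .agg (the entry column name and values are made up here, since the real table structure is only shown in the linked image):

import pandas as pd

df = pd.DataFrame(
    {'entry': ['a1', 'b2', 'c3', 'u1', 'd4', 'e5', 'u2', 'u3']},
    index=pd.to_datetime(['2013-06-03', '2013-06-10', '2013-06-17', '2013-06-24',
                          '2013-07-01', '2013-07-08', '2013-07-15', '2013-07-22']))

def countU(x):
    return sum(i[0] == 'u' for i in x)

def countNotU(x):
    return sum(i[0] != 'u' for i in x)

print(df.resample('M')['entry'].agg([countU, countNotU]))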