Python pandas time series resampling - python

I am writing a script in pandas, but I cannot get the output I want. Here is the problem:
I read this data from a CSV file. The table structure is shown here:
http://postimg.org/image/ie0od7ejr/
I want this output from the table data above:
Month      Demo1  Demo2
June 2013  3      1
July 2013  2      2
In the Demo1 and Demo2 columns I want to count the regular entries and the entries that start with u, respectively. For June there are 3 regular entries and 1 entry that starts with u.
So far I have written this code:
import sqlite3
import pandas as pd
from pylab import *
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
conn = sqlite3.connect('Demo2.sqlite')
df = pd.read_sql("SELECT * FROM Data", conn)
df['DateTime'] = df['DATE'].apply(lambda x: dt.date.fromtimestamp(x))
df1 = df.set_index('DateTime', drop=False)
Thanks in advance for your help. The end result will be a bar graph; I can draw the graph from the output I described above.

For resample, you can define two aggregation functions like this:
def countU(x):
    return sum(i[0] == 'u' for i in x)
def countNotU(x):
    return sum(i[0] != 'u' for i in x)
# the how= argument was removed from resample in later pandas versions; use .agg instead
print(df.resample('M').agg([countU, countNotU]))
Alternatively, consider groupby.
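A minimal sketch of the groupby alternative follows. The original table is only available as an image, so the column names 'DateTime' and 'Entry' and the sample values are assumptions made up for illustration; they are chosen so that the result matches the desired output in the question.

```python
import pandas as pd

# Hypothetical sample data: 'DateTime' and 'Entry' are assumed column names.
df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2013-06-01', '2013-06-10', '2013-06-15', '2013-06-20',
                                '2013-07-02', '2013-07-05', '2013-07-11', '2013-07-30']),
    'Entry': ['a1', 'b2', 'u3', 'c4', 'u5', 'd6', 'u7', 'e8'],
})

# Boolean flag per row: does the entry start with 'u'?
starts_with_u = df['Entry'].str.startswith('u')
# Monthly period of each row, used as the grouping key.
month = df['DateTime'].dt.to_period('M')

# Summing booleans counts the True values in each month.
counts = pd.DataFrame({
    'Demo1': (~starts_with_u).groupby(month).sum(),  # regular entries
    'Demo2': starts_with_u.groupby(month).sum(),     # entries starting with 'u'
})
print(counts)
```

The resulting frame has one row per month, so it could be passed to counts.plot.bar() for the bar graph mentioned in the question.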

Related

Problems Sorting Data out of a text-file

I have a csv file imported into a dataframe and have trouble sorting the data.
df looks like this:
Data
0 <WindSpeed>0.69</WindSpeed>
1 <PowerOutput>0</PowerOutput>
2 <ThrustCoEfficient>0</ThrustCoEffici...
3 <RotorSpeed>8.17</RotorSpeed>
4 <ReactivePower>0</ReactivePower>
5 </DataPoint>
6 <DataPoint>
7 <WindSpeed>0.87</WindSpeed>
8 <PowerOutput>0</PowerOutput
I want it to look like this:
0 Windspeed Poweroutput
1 0.69 0.0
Here's the code that I wrote so far:
import pandas as pd
from pandas.compat import StringIO
import re
import numpy as np
df= pd.read_csv('powercurve.csv', encoding='utf-8',skiprows=42)
df.columns=['Data']
no_of_rows=df.Data.str.count("WindSpeed").sum()/2
rows=no_of_rows.astype(np.uint32)
TRBX=pd.DataFrame(index=range(0,abs(rows)),columns=['WSpd[m/s]','Power[kW]'],dtype='float')
i=0
for i in range(len(df)):
    if 'WindSpeed' in df['Data']:
        TRBX['WSpd[m/s]', i]= re.findall ("'(\d+)'",'Data')
    elif 'Rotorspeed' in df['Data']:
        TRBX['WSpd[m/s]', i]= re.findall ("'(\d+)'",'Data')
Is this a suitable approach? If yes, so far there are no values written into the TRBX dataframe. Where is my mistake?
The code below should help if your df is indeed in the format you showed:
import re
# split each line on the angle brackets: '<Tag>value</Tag>' -> ['', 'Tag', 'value', '/Tag', '']
split_func = lambda x: re.split('<|>', str(x))
split_series = df.Data.apply(split_func)
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
df = pd.DataFrame(data).set_index(features).T
You may want to drop some columns that have no data or input some N/A values afterwards. You also may want to rename the variables and series to different names that make more sense to you.
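As a rough sketch of one way to get from the tag lines to the WindSpeed / PowerOutput table the question asks for, the lines can be parsed with a regular expression and pivoted. This is a different technique from the split-based answer above, and it assumes every record starts with a <DataPoint> line; the sample frame below is made up from the rows shown in the question.

```python
import pandas as pd

# Hypothetical sample mirroring the rows shown in the question.
df = pd.DataFrame({'Data': [
    '<DataPoint>',
    '<WindSpeed>0.69</WindSpeed>',
    '<PowerOutput>0</PowerOutput>',
    '<RotorSpeed>8.17</RotorSpeed>',
    '</DataPoint>',
    '<DataPoint>',
    '<WindSpeed>0.87</WindSpeed>',
    '<PowerOutput>0</PowerOutput>',
    '</DataPoint>',
]})

# Extract the tag name and its text from lines shaped like <Tag>value</Tag>;
# lines that do not match (such as <DataPoint>) become NaN.
parsed = df['Data'].str.extract(r'<(?P<tag>\w+)>(?P<value>[^<]*)</\w+>')
parsed['value'] = pd.to_numeric(parsed['value'])

# Number the records: every opening <DataPoint> line starts a new one.
parsed['point'] = (df['Data'].str.strip() == '<DataPoint>').cumsum()

# Keep only the value lines and spread the tags into columns.
parsed = parsed.dropna(subset=['tag'])
table = parsed.pivot(index='point', columns='tag', values='value')
print(table[['WindSpeed', 'PowerOutput']])
```

Tags that are missing from a record (RotorSpeed in the second record here) simply end up as NaN in that row.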

Select several years pandas dataframe

I am trying to select several years from a dataframe in monthly resolution.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import netCDF4 as nc
#-- open net-cdf and read in variables
data = nc.Dataset('test.nc')
time = nc.num2date(data.variables['Time'][:],
                   data.variables['Time'].units)
df = pd.DataFrame(data.variables['mgpp'][:,0,0], columns=['mgpp'])
df['dates'] = time
df = df.set_index('dates')
print(df.head())
This is what the head looks like:
                mgpp
dates
1901-01-01  0.040735
1901-02-01  0.041172
1901-03-01  0.053889
1901-04-01  0.066906
Now I managed to extract one year:
df_cp = df[df.index.year == 2001]
But how would I extract several years, say 1997, 2001 and 2007, and store them in the same dataframe? Is there a one- or two-line solution? My only idea so far is to iterate and then merge the dataframes, but maybe there is a better way.
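One possible approach, sketched under the assumption that the index is a pandas DatetimeIndex (as the working df.index.year filter suggests), is to test the year against a list with isin. The monthly frame below is synthetic stand-in data, not the netCDF file from the question.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for the 'mgpp' series from the netCDF file.
dates = pd.date_range('1995-01-01', '2008-12-01', freq='MS')
df = pd.DataFrame({'mgpp': np.linspace(0.0, 1.0, len(dates))}, index=dates)
df.index.name = 'dates'

# Keep every row whose year is in the wanted list.
wanted_years = [1997, 2001, 2007]
df_sel = df[df.index.year.isin(wanted_years)]

print(df_sel.index.year.unique().tolist())
```

The selected rows stay in one dataframe and keep their original order, so no loop or merge is needed.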

How to plot data based on given time?

I have a dataset like the one shown below.
Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000
I've used pandas to get the data into a DataFrame. The dataset has data for multiple days with an interval of 1 min for each row in the dataset.
I want to plot a separate graph of the voltage against the time (column 2) for each day (column 1) using Python. How can I do that?
txt = '''Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000'''
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
f = StringIO(txt)
df = pd.read_table(f, sep=';')
plt.plot(df['Time'], df['Voltage'])
plt.show()
This gives the following output (plot image not included):
I believe this will do the trick (I edited the dates so that there are two different dates):
import pandas as pd
import matplotlib.pyplot as plt
# if you use Jupyter Notebook:
%matplotlib inline
df = pd.read_csv('test.csv', sep=';', usecols=['Date','Time','Voltage'])
unique_dates = df.Date.unique()
for date in unique_dates:
    print('Date: ' + date)
    df.loc[df.Date == date].plot.line('Time', 'Voltage')
    plt.show()
You will get one plot per date (images not included).
X = df.Date.unique()
for i in X: #iterate over unique days
    temp_df = df[df.Date==i] #get df for specific day
    temp_df.plot(x = 'Time', y = 'Voltage') #plot
If you want to change the x values, you can use
x = np.arange(1, len(temp_df.Time) + 1, 1)  # one value per row
Group by hour and minute after creating a DateTime variable to handle multiple days. You can then filter the data for a specific day.
txt = '''Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000'''
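from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
f = StringIO(txt)
df = pd.read_table(f, sep=';')
# dates are day-first (16/12/2006), so give the format explicitly
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')
df.set_index('DateTime', inplace=True)
# select one day first, then group that day's rows by hour and minute
day = df[df['Date'] == '16/12/2006']
grouped = day.groupby([day.index.hour, day.index.minute])['Voltage'].mean()
grouped.plot()
plt.show()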

python's ggplot does not use year number as label on axis

In the following MWE, my year variable is shown on the x-axis as 0 to 6 instead of the actual year number. Why is this?
import pandas as pd
from pandas_datareader import wb
from ggplot import *
dat = wb.download(
    indicator=['BX.KLT.DINV.CD.WD', 'BX.KLT.DINV.WD.GD.ZS'],
    country='CN', start=2005, end=2011)
dat.reset_index(inplace=True)
print(ggplot(aes(x='year', y='BX.KLT.DINV.CD.WD'),
             data=dat) +
      geom_line() + theme_bw())
All you need to do is convert the year column from an object dtype to datetime64:
dat['year'] = pd.to_datetime(dat['year'])
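To see the effect of that conversion in isolation, here is a small sketch on a made-up frame (the values are invented, and no World Bank download is involved). Passing format='%Y' is an addition of this sketch to make the parsing of year strings explicit; the answer above calls pd.to_datetime without it.

```python
import pandas as pd

# Made-up stand-in for the downloaded data: years arrive as strings.
dat = pd.DataFrame({'year': ['2011', '2010', '2009'],
                    'value': [1.0, 2.0, 3.0]})
print(dat['year'].dtype)  # a string/object dtype before the conversion

# Convert the year strings to real datetimes (1 January of each year).
dat['year'] = pd.to_datetime(dat['year'], format='%Y')
print(dat['year'].dtype)
print(dat['year'].dt.year.tolist())
```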

Compute annual mean of values from a tuple using time information

I have daily precipitation values with time information in following form:
a = [(19500101,3.45),(19500102,1.2).......(19701231,1.4)]
I want to compute the annual mean using the date information. There may be a simple solution. This is what I have tried so far. Any suggestions?
import numpy
prcp=numpy.array(precipitation)
time=numpy.array(time)
yearly=numpy.zeros(prcp.shape)
#-----------------Get annual means-----------------
for ii in range(len(time)):
    tt=time[ii]
    if ii==0:
        year_old=tt[0:4]
        index_start=ii
    else:
        #----------------new year----------------
        year=tt[0:4]
        if year != year_old:
            year_mean=numpy.mean(prcp[index_start:ii])
            yearly[index_start:ii]=year_mean
            year_old=year
            index_start=ii
    #----------------Get the last year----------------
    if ii==len(time)-1:
        year_mean=numpy.mean(prcp[index_start:])
        yearly[index_start:]=year_mean
You could try Pandas for aggregations.
import pandas as pd
a = [(19500101,3.45),(19500102,1.2), (19701231,1.4)]
df = pd.DataFrame(a)  # convert to dataframe
df[0] = pd.to_datetime(df[0], format='%Y%m%d')  # create a datetime series
print(df.groupby(df[0].dt.year)[[1]].mean())  # group by year and take the mean of each group
          1
0
1950  2.325
1970  1.400
You could use the snippet below to do this:
First, segregate the data based on the years:
>>> list_of_data = [(19500101,3.45), (19500102,1.2), (19701231,1.4)]
>>> from collections import defaultdict
>>> data = defaultdict(list)
>>> for item in list_of_data:
...     data[str(item[0])[:4]].append(item[1])
And now, calculate the mean using
>>> for key, value in data.items():
...     print(key, sum(value)/len(value))
...
1950 2.325
1970 1.4
Note that I am making two passes over the data, and @John's pandas answer will probably be faster if you are OK with using the pandas library.
I recommend pandas, as @John-Galt suggested.
If you want a Python solution without pandas:
import numpy as np
a = [(19500101,3.45),(19500102,1.2).......(19701231,1.4)]
year=lambda x: x[0] // 10**4
years={year(x) for x in a}
annual_avg=dict()
for y in years:
    # np.mean already reduces the list to a single value, so no reduce() is needed
    annual_avg[y]=np.mean([x[1] for x in a if year(x)==y])
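The loop in the question stores each year's mean at every daily position (the yearly array has the same shape as prcp), whereas the answers above return one value per year. If the per-day layout is what is wanted, a sketch along the following lines might work; it uses groupby with transform, which is a technique swapped in here rather than taken from any of the answers, and the column names 'date', 'prcp' and 'yearly' are chosen for illustration.

```python
import pandas as pd

# Three sample records in the question's (yyyymmdd, value) form.
a = [(19500101, 3.45), (19500102, 1.2), (19701231, 1.4)]

df = pd.DataFrame(a, columns=['date', 'prcp'])
# Parse the integer dates via their string form.
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')

# transform('mean') returns one value per input row: the mean of that row's year.
df['yearly'] = df.groupby(df['date'].dt.year)['prcp'].transform('mean')
print(df)
```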
