I have an Excel sheet with dates and some values, which I want to load into a pandas DataFrame and then select only the rows that fall between certain dates.
For some reason I cannot select a row by its date index.
Raw Data in Excel file
MCU
Timestamp 50D 10P1 10P2 10P3 10P6 10P9 10P12
12-Feb-15 25.17 5.88 5.92 5.98 6.18 6.23 6.33
11-Feb-15 25.9 6.05 6.09 6.15 6.28 6.31 6.39
10-Feb-15 26.38 5.94 6.05 6.15 6.33 6.39 6.46
Code
xls = pd.ExcelFile('e:/Data.xlsx')
vols = xls.parse(asset.upper()+'VOL',header=1)
vols.set_index('Timestamp',inplace=True)
Data before set_index
Timestamp 50D 10P1 10P2 10P3 10P6 10P9 10P12 25P1 25P2 \
0 2015-02-12 25.17 5.88 5.92 5.98 6.18 6.23 6.33 2.98 3.08
1 2015-02-11 25.90 6.05 6.09 6.15 6.28 6.31 6.39 3.12 3.17
2 2015-02-10 26.38 5.94 6.05 6.15 6.33 6.39 6.46 3.01 3.16
Data after set_index
50D 10P1 10P2 10P3 10P6 10P9 10P12 25P1 25P2 25P3 \
Timestamp
2015-02-12 25.17 5.88 5.92 5.98 6.18 6.23 6.33 2.98 3.08 3.21
2015-02-11 25.90 6.05 6.09 6.15 6.28 6.31 6.39 3.12 3.17 3.32
2015-02-10 26.38 5.94 6.05 6.15 6.33 6.39 6.46 3.01 3.16 3.31
Output
>>> vols.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-02-12, ..., NaT]
Length: 1478, Freq: None, Timezone: None
>>> vols[date(2015,2,12)]
*** KeyError: datetime.date(2015, 2, 12)
I would expect this not to fail, and I should also be able to select a range of dates. I have tried many combinations without success.
Using a datetime.date instance to look up the index won't work; you just need a string representation of the date, e.g. '2015-02-12' or '2015/02/14'.
Secondly, vols[date(2015,2,12)] actually looks in your DataFrame's column labels, not the index. Use loc to select by row index label instead, e.g. vols.loc['2015-02-12']
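For instance, a minimal sketch (with a small hypothetical frame standing in for vols):

```python
import pandas as pd

# hypothetical frame standing in for vols, with a DatetimeIndex
vols = pd.DataFrame(
    {"50D": [25.17, 25.90, 26.38]},
    index=pd.to_datetime(["2015-02-12", "2015-02-11", "2015-02-10"]),
)
vols.index.name = "Timestamp"
vols = vols.sort_index()  # sorting makes label slicing predictable

row = vols.loc["2015-02-12"]                # one row, by date string
rng = vols.loc["2015-02-10":"2015-02-11"]   # inclusive date-range slice
print(float(row["50D"]))  # 25.17
print(len(rng))           # 2
```

The range slice works because loc on a sorted DatetimeIndex accepts date strings and slices inclusively at both ends.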
Related
Imagine I have a dataframe that contains minute data for different symbols:
timestamp open high low close volume trade_count vwap symbol volume_10_day
0 2022-09-26 08:20:00+00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADA 2889145.1
1 2022-09-26 08:25:00+00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADA 2889145.1
2 2022-09-26 08:30:00+00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADA 2889145.1
3 2022-09-26 08:35:00+00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADA 2889145.1
4 2022-09-26 08:40:00+00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADA 2889145.1
--
100 2022-09-26 08:20:00+00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADD 2889145.1
101 2022-09-26 08:25:00+00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADD 2889145.1
102 2022-09-26 08:30:00+00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADD 2889145.1
103 2022-09-26 08:35:00+00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADD 2889145.1
104 2022-09-26 08:40:00+00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADD 2889145.1
I want to filter the data so that it returns a single dataframe spanning multiple days, but with no day appearing under more than one symbol (unlike the example above, where both ADA and ADD appear for 2022-09-26).
How can I filter out duplicate days like this? I don't care how it's done; it could simply keep whichever symbol appears first for a given date, like this for example:
timestamp open high low close volume trade_count vwap symbol volume_10_day
0 2022-09-26 08:20:00+00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADA 2889145.1
1 2022-09-26 08:25:00+00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADA 2889145.1
2 2022-09-26 08:30:00+00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADA 2889145.1
3 2022-09-26 08:35:00+00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADA 2889145.1
4 2022-09-26 08:40:00+00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADA 2889145.1
--
100 2022-09-27 08:20:00+00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADB 2889145.1
101 2022-09-27 08:25:00+00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADB 2889145.1
102 2022-09-27 08:30:00+00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADB 2889145.1
103 2022-09-27 08:35:00+00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADB 2889145.1
104 2022-09-27 08:40:00+00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADB 2889145.1
How can I achieve this?
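One possible approach (a sketch on hypothetical data, not your actual frame): group by calendar day, find the symbol that appears first within each day, and keep only that symbol's rows.

```python
import pandas as pd

# hypothetical minute data: two symbols share 2022-09-26
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-09-26 08:20:00", "2022-09-26 08:25:00",
        "2022-09-26 08:20:00", "2022-09-26 08:25:00",
        "2022-09-27 08:20:00",
    ]),
    "symbol": ["ADA", "ADA", "ADD", "ADD", "ADB"],
})

# for each calendar day, broadcast the first-seen symbol to every row,
# then keep only the rows whose symbol matches it
day = df["timestamp"].dt.date
first_symbol = df.groupby(day)["symbol"].transform("first")
out = df[df["symbol"] == first_symbol]
print(list(out["symbol"].unique()))  # ['ADA', 'ADB']
```

This keeps all of ADA's rows for 2022-09-26 and drops ADD's, while 2022-09-27 (only ADB) is untouched.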
Update: I tried drop_duplicates as suggested by Lukas, like so.
Read from the db into a df:
df = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
Get the length (4769):
print(len(df))
And then:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.drop_duplicates(subset=['symbol', 'timestamp'])
print(len(df))
But it returns the same length.
How can I get my drop_duplicates to work with minute data?
You can use DataFrame.drop_duplicates:
df.drop_duplicates(subset=['timestamp', 'symbol'])
By default, it keeps the first appearance of each combination of values in the timestamp and symbol columns, but you can change this behaviour with the keep parameter. Note that drop_duplicates returns a new DataFrame rather than modifying df in place, so you need to assign the result back (or pass inplace=True); that is why your len(df) did not change in the update above.
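A minimal sketch of the assignment that the update above is missing (on made-up rows, one of which is an exact timestamp/symbol duplicate):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-09-26 08:20:00", "2022-09-26 08:20:00", "2022-09-26 08:25:00",
    ]),
    "symbol": ["ADA", "ADA", "ADA"],
})

# drop_duplicates does NOT modify df in place by default;
# the deduplicated frame must be assigned back
df = df.drop_duplicates(subset=["timestamp", "symbol"])
print(len(df))  # 2
```

If the length still doesn't change on your real data, the rows are probably not exact (timestamp, symbol) duplicates; deduplicating on the calendar day (df['timestamp'].dt.date) instead may be what you want.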
I have a matrix of reference values and would like to learn how Scikit-learn can be used to generate a regression model for it. I have done several types of univariate regressions in the past but it's not clear to me how to use two variables in sklearn.
I have two features (A and B) and a table of output values for certain input A/B values. See table and 3D surface below. I'd like to see how I can translate this to a two variable equation that relates the A/B inputs to the single value output, like shown in the table. The relationship looks nonlinear and it could also be quadratic, logarithmic, etc...
How do I use sklearn to perform a nonlinear regression on this tabular data?
A/B 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000
0 8.78 8.21 7.64 7.07 6.50 5.92 5.35 4.78 4.21 3.63 3.06
5 8.06 7.56 7.07 6.58 6.08 5.59 5.10 4.60 4.11 3.62 3.12
10 7.33 6.91 6.50 6.09 5.67 5.26 4.84 4.43 4.01 3.60 3.19
15 6.60 6.27 5.93 5.59 5.26 4.92 4.59 4.25 3.92 3.58 3.25
20 5.87 5.62 5.36 5.10 4.85 4.59 4.33 4.08 3.82 3.57 3.31
25 5.14 4.97 4.79 4.61 4.44 4.26 4.08 3.90 3.73 3.55 3.37
30 4.42 4.32 4.22 4.12 4.02 3.93 3.83 3.73 3.63 3.53 3.43
35 3.80 3.78 3.75 3.72 3.70 3.67 3.64 3.62 3.59 3.56 3.54
40 2.86 2.93 2.99 3.05 3.12 3.18 3.24 3.31 3.37 3.43 3.50
45 2.08 2.24 2.39 2.54 2.70 2.85 3.00 3.16 3.31 3.46 3.62
50 1.64 1.84 2.05 2.26 2.46 2.67 2.88 3.08 3.29 3.50 3.70
55 1.55 1.77 1.98 2.19 2.41 2.62 2.83 3.05 3.26 3.47 3.69
60 2.09 2.22 2.35 2.48 2.61 2.74 2.87 3.00 3.13 3.26 3.39
65 3.12 3.08 3.05 3.02 2.98 2.95 2.92 2.88 2.85 2.82 2.78
70 3.50 3.39 3.28 3.17 3.06 2.95 2.84 2.73 2.62 2.51 2.40
75 3.42 3.32 3.21 3.10 3.00 2.89 2.78 2.68 2.57 2.46 2.36
80 3.68 3.55 3.43 3.31 3.18 3.06 2.94 2.81 2.69 2.57 2.44
85 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
90 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
95 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
100 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
There is probably a succinct nonlinear relationship between your A, B, and table values, but without some knowledge of the underlying system, and without any sophisticated nonlinear modeling, here's a ridiculous model with a decent score.
the_table = """1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000
0 8.78 8.21 7.64 7.07 6.50 5.92 5.35 4.78 4.21 3.63 3.06
5 8.06 7.56 7.07 6.58 6.08 5.59 5.10 4.60 4.11 3.62 3.12
10 7.33 6.91 6.50 6.09 5.67 5.26 4.84 4.43 4.01 3.60 3.19
15 6.60 6.27 5.93 5.59 5.26 4.92 4.59 4.25 3.92 3.58 3.25
20 5.87 5.62 5.36 5.10 4.85 4.59 4.33 4.08 3.82 3.57 3.31
25 5.14 4.97 4.79 4.61 4.44 4.26 4.08 3.90 3.73 3.55 3.37
30 4.42 4.32 4.22 4.12 4.02 3.93 3.83 3.73 3.63 3.53 3.43
35 3.80 3.78 3.75 3.72 3.70 3.67 3.64 3.62 3.59 3.56 3.54
40 2.86 2.93 2.99 3.05 3.12 3.18 3.24 3.31 3.37 3.43 3.50
45 2.08 2.24 2.39 2.54 2.70 2.85 3.00 3.16 3.31 3.46 3.62
50 1.64 1.84 2.05 2.26 2.46 2.67 2.88 3.08 3.29 3.50 3.70
55 1.55 1.77 1.98 2.19 2.41 2.62 2.83 3.05 3.26 3.47 3.69
60 2.09 2.22 2.35 2.48 2.61 2.74 2.87 3.00 3.13 3.26 3.39
65 3.12 3.08 3.05 3.02 2.98 2.95 2.92 2.88 2.85 2.82 2.78
70 3.50 3.39 3.28 3.17 3.06 2.95 2.84 2.73 2.62 2.51 2.40
75 3.42 3.32 3.21 3.10 3.00 2.89 2.78 2.68 2.57 2.46 2.36
80 3.68 3.55 3.43 3.31 3.18 3.06 2.94 2.81 2.69 2.57 2.44
85 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
90 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
95 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
100 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69"""
import numpy as np
import pandas as pd
import io
df = pd.read_csv(io.StringIO(initial_value=the_table), sep=r"\s+")
df.columns = df.columns.astype(np.uint64)
df_unstacked = df.unstack()
X = df_unstacked.index.tolist()
y = df_unstacked.to_list()
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(6)
poly_X = poly.fit_transform(X)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(poly_X,y)
print(model.score(poly_X,y))
# 0.9762180339233807
To predict values for a given A and B using this model, you need to transform the input in the same way the training data was transformed. So for example,
model.predict(poly.transform([(1534,56)]))
# array([2.75275659])
Even more outrageously ridiculous ...
more_X = [(a,b,np.log1p(a),np.log1p(b),np.cos(np.pi*b/100)) for a,b in X]
poly = PolynomialFeatures(5)
poly_X = poly.fit_transform(more_X)
model.fit(poly_X,y)
print(model.score(poly_X,y))
# 0.9982994398684035
... and to predict:
more_X = [(a,b,np.log1p(a),np.log1p(b),np.cos(np.pi*b/100)) for a,b in [(1534,56)]]
model.predict(poly.transform(more_X))
# array([2.74577017])
N.B.: There are probably more Pythonic ways to program these ridiculous models.
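As an aside, sklearn's Pipeline can bundle the polynomial transform and the regression so that predict transforms inputs for you automatically. A minimal sketch on synthetic quadratic data (not the table above):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))          # two features, like A and B
y = 1.5 * X[:, 0] ** 2 - 2.0 * X[:, 1] + 3.0   # a known quadratic surface

# the pipeline applies PolynomialFeatures before every fit/predict call
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # ~1.0, since the target is exactly quadratic

# no separate poly.transform needed at prediction time
pred = model.predict([[5.0, 5.0]])  # 1.5*25 - 2*5 + 3 = 30.5
```

This removes the easy-to-forget manual transform step shown above.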
I have a dataframe with datetimes as the index and depths as the columns. I would like to plot a contour plot that looks something like the image below. Any ideas how I should go about doing this? I tried using the plt.contour() function, but I think I have to sort out the data arrays first, and I am unsure about that part.
Example of my dataframe:
Datetime -1.62 -2.12 -2.62 -3.12 -3.62 -4.12 -4.62 -5.12
2019-05-24 15:45:00 4.61 5.67 4.86 3.91 3.35 3.07 3.03 2.84
2019-05-24 15:50:00 3.76 4.82 4.13 3.32 2.84 2.40 2.18 1.89
2019-05-24 15:55:00 3.07 3.77 3.23 2.82 2.41 2.21 1.93 1.81
2019-05-24 16:00:00 2.50 2.95 2.63 2.29 1.97 1.73 1.57 1.48
2019-05-24 16:05:00 2.94 3.62 3.23 2.82 2.62 2.31 2.01 1.81
2019-05-24 16:10:00 3.07 3.77 3.23 2.82 2.51 2.31 2.10 1.89
2019-05-24 16:15:00 2.71 3.20 2.86 2.70 2.51 2.31 2.18 1.97
2019-05-24 16:20:00 2.50 3.07 2.86 2.82 2.73 2.50 2.37 2.22
2019-05-24 16:25:00 2.40 3.20 3.10 2.93 2.73 2.50 2.57 2.84
2019-05-24 16:30:00 2.21 2.95 2.86 2.70 2.73 2.72 2.91 3.49
2019-05-24 16:35:00 2.04 2.72 2.63 2.59 2.62 2.72 3.03 3.35
2019-05-24 16:40:00 1.73 2.31 2.33 2.39 2.62 2.95 3.57
Example of the plot I want:
For the X, Y, Z inputs to plt.contour(), I would like to find out what structure of data it requires. The docs say it requires a 2D array structure, but I am confused. How do I get that from my current dataframe?
I have worked out a solution. Note that the X (tt2, time) and Y (depth) inputs have to match the dimensions of the Z (mat2) matrix for plt.contourf to work. I also realised that plt.contourf produces the filled image I want, whereas plt.contour only plots the contour lines.
Example of my code:
tt2 = [...]
depth = [...]
plt.title('SSC Contour Plot')
fig = plt.contourf(tt2, depth, mat2, cmap='jet',
                   levels=[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26],
                   extend="both")
plt.gca().invert_yaxis() #to flip the depth from shallowest to deepest
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%d/%m/%Y %H:%M:%S'))
#plt.gca().xticklabels(tt)
cbar = plt.colorbar()
cbar.set_label("mg/l")
yy = len(colls2)
plt.ylim(yy-15,0) # assuming last 10 depth readings have NaN
plt.xlabel("Datetime")
plt.xticks(rotation=45)
plt.ylabel("Depth (m)")
plt.savefig(path+'SSC contour plot.png') #save plot
plt.show()
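For reference, the 2-D structure contourf needs can be taken straight from the frame: Z is the value matrix, transposed so that rows correspond to depths (y) and columns to times (x). A minimal sketch on random stand-in data (not the SSC readings above):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# stand-in frame shaped like the question's: times down, depths across
times = pd.date_range("2019-05-24 15:45", periods=6, freq="5min")
depths = [-1.62, -2.12, -2.62, -3.12]
df = pd.DataFrame(np.random.rand(6, 4), index=times, columns=depths)

# contourf(x, y, Z) wants Z of shape (len(y), len(x)),
# so transpose the values matrix to put depths on the rows
Z = df.values.T
x = mdates.date2num(df.index)  # datetimes -> matplotlib float dates
cs = plt.contourf(x, df.columns, Z, cmap="jet")
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
```

The shape rule is the whole trick: once Z.shape == (len(depths), len(times)), the 1-D x and y sequences are enough and no explicit meshgrid is required.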
Example of plot produced
I have a dataframe, df which looks like this
Open High Low Close Volume
Date
2007-03-22 2.65 2.95 2.64 2.86 176389
2007-03-23 2.87 2.87 2.78 2.78 63316
2007-03-26 2.83 2.83 2.51 2.52 54051
2007-03-27 2.61 3.29 2.60 3.28 589443
2007-03-28 3.65 4.10 3.60 3.80 1114659
2007-03-29 3.91 3.91 3.33 3.57 360501
2007-03-30 3.70 3.88 3.66 3.71 185787
I'm trying to create a new column which takes the df.Open value 5 rows ahead of each df.Open value and subtracts the current value from it.
So the loop I'm using is this:
for i in range(0, len(df.Open)): #goes through indexes values
df['5days'][i]=df.Open[i+5]-df.Open[i] #I use those index values to locate
However, this loop is yielding an error.
KeyError: '5days'
Not sure why. I got this to work temporarily by removing the df['5days'][i] part, but it seems awfully slow. Not sure if there is a more efficient way to do this.
Thank you.
Using diff
df['5Days'] = df.Open.diff(5)
print(df)
Open High Low Close Volume 5Days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 0.83
However, per your code, you may want to look ahead and align the results back. In that case:
df['5Days'] = -df.Open.diff(-5)
print(df)
Open High Low Close Volume 5Days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 1.26
2007-03-23 2.87 2.87 2.78 2.78 63316 0.83
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 NaN
2007-03-30 3.70 3.88 3.66 3.71 185787 NaN
I think you need shift with sub:
df['5days'] = df.Open.shift(5).sub(df.Open)
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 -1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 -0.83
Or maybe you need to subtract the shifted column from Open:
df['5days'] = df.Open.sub(df.Open.shift(5))
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 0.83
df['5days'] = -df.Open.sub(df.Open.shift(-5))
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 1.26
2007-03-23 2.87 2.87 2.78 2.78 63316 0.83
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 NaN
2007-03-30 3.70 3.88 3.66 3.71 185787 NaN
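In short, the whole look-ahead column is one vectorized line, with NaN automatically filling the last rows where no value 5 rows ahead exists. A minimal sketch on the Open values from the question:

```python
import pandas as pd

open_ = pd.Series([2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70])

# Open 5 rows ahead minus current Open; the last 5 entries become NaN
five_days = open_.shift(-5) - open_
print(five_days.round(2).tolist())  # [1.26, 0.83, nan, nan, nan, nan, nan]
```

This replaces the explicit loop entirely and avoids the out-of-bounds lookup the loop would hit near the end of the frame.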
This is driving me nuts: I can't plot column 'b'.
It plots only column 'A'.
This is my code; I have no idea what I'm doing wrong, probably something silly.
The dataframe seems OK. The weird part is that I can access both df['A'] and df['b'], but only df['A'].plot() works; if I issue df['b'].plot() I get this error:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\IPython\core\interactiveshell.py", line 2883, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 1, in <module>
    df['b'].plot()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2511, in plot_series
    **kwds)
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2317, in _plot
    plot_obj.generate()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 921, in generate
    self._compute_plot_data()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 997, in _compute_plot_data
    'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'Series': no numeric data to plot
import sqlalchemy
import pandas as pd
import matplotlib.pyplot as plt
engine = sqlalchemy.create_engine(
'sqlite:///C:/Users/toto/PycharmProjects/my_db.sqlite')
tables = engine.table_names()
dic = {}
for t in tables:
sql = 'SELECT t."weight" FROM "' + t + '" t WHERE t."udl"="IBE SM"'
dic[t] = (pd.read_sql(sql, engine)['weight'][0], pd.read_sql(sql, engine)['weight'][1])
df = pd.DataFrame.from_dict(dic, orient='index').sort_index()
df = df.set_index(pd.DatetimeIndex(df.index))
df.columns = ['A', 'b']
print(df)
print(df.info())
df.plot()
plt.show()
This is the output of the two prints:
A b
2014-08-05 1.81 3.39
2014-08-06 1.81 3.39
2014-08-07 1.81 3.39
2014-08-08 1.80 3.37
2014-08-11 1.79 3.35
2014-08-13 1.80 3.36
2014-08-14 1.80 3.35
2014-08-18 1.80 3.35
2014-08-19 1.79 3.34
2014-08-20 1.80 3.35
2014-08-27 1.79 3.35
2014-08-28 1.80 3.35
2014-08-29 1.79 3.35
2014-09-01 1.79 3.35
2014-09-02 1.79 3.35
2014-09-03 1.79 3.36
2014-09-04 1.79 3.37
2014-09-05 1.80 3.38
2014-09-08 1.79 3.36
2014-09-09 1.79 3.35
2014-09-10 1.78 3.35
2014-09-11 1.78 3.34
2014-09-12 1.78 3.34
2014-09-15 1.78 3.35
2014-09-16 1.78 3.35
2014-09-17 1.78 3.35
2014-09-18 1.78 3.34
2014-09-19 1.79 3.35
2014-09-22 1.79 3.36
2014-09-23 1.80 3.37
... ... ...
2014-12-10 1.73 3.29
2014-12-11 1.74 3.27
2014-12-12 1.74 3.25
2014-12-15 1.74 3.24
2014-12-16 1.74 3.27
2014-12-17 1.75 3.28
2014-12-18 1.76 3.29
2014-12-19 1.04 1.39
2014-12-22 1.04 1.39
2014-12-23 1.04 1.4
2014-12-24 1.04 1.39
2014-12-29 1.04 1.39
2014-12-30 1.04 1.4
2015-01-02 1.04 1.4
2015-01-05 1.04 1.4
2015-01-06 1.04 1.4
2015-01-07 NaN 1.39
2015-01-08 NaN 1.39
2015-01-09 NaN 1.39
2015-01-12 NaN 1.38
2015-01-13 NaN 1.38
2015-01-14 NaN 1.38
2015-01-15 NaN 1.38
2015-01-16 NaN 1.38
2015-01-19 NaN 1.39
2015-01-20 NaN 1.38
2015-01-21 NaN 1.39
2015-01-22 NaN 1.4
2015-01-23 NaN 1,4
2015-01-26 NaN 1.41
[107 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 107 entries, 2014-08-05 00:00:00 to 2015-01-26 00:00:00
Data columns (total 2 columns):
A 93 non-null float64
b 107 non-null object
dtypes: float64(1), object(1)
memory usage: 2.1+ KB
None
Process finished with exit code 0
Just got it: 'b' is of object dtype rather than float64 because of this row:
2015-01-23 NaN 1,4
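A minimal sketch of the fix: replace the decimal comma and convert the column to numeric (on hypothetical values mirroring the frame above):

```python
import pandas as pd

# a decimal comma forces the whole column to object dtype
df = pd.DataFrame({"b": ["1.39", "1,4", "1.41"]})

# normalise the separator, then convert; the column becomes float64
df["b"] = pd.to_numeric(df["b"].str.replace(",", ".", regex=False))
print(df["b"].dtype)  # float64
```

Once the column is float64, df['b'].plot() has numeric data to work with.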