How to remove duplicate days with multiple tickers in a single dataframe?

Imagine I have a dataframe that contains minute data for different symbols:
timestamp open high low close volume trade_count vwap symbol volume_10_day
0 2022-09-26 08:20:00+00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADA 2889145.1
1 2022-09-26 08:25:00+00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADA 2889145.1
2 2022-09-26 08:30:00+00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADA 2889145.1
3 2022-09-26 08:35:00+00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADA 2889145.1
4 2022-09-26 08:40:00+00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADA 2889145.1
--
100 2022-09-26 08:20:00+00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADD 2889145.1
101 2022-09-26 08:25:00+00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADD 2889145.1
102 2022-09-26 08:30:00+00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADD 2889145.1
103 2022-09-26 08:35:00+00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADD 2889145.1
104 2022-09-26 08:40:00+00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADD 2889145.1
I want to filter the dataframe so that it contains multiple days but no day is repeated (as in the example above, where ADA and ADD both appear for the date 2022-09-26).
How can I filter out duplicate days like this? I don't care how it's done - it could be just keeping whatever symbol appears first for a given date, like this for example:
timestamp open high low close volume trade_count vwap symbol volume_10_day
0 2022-09-26 08:20:00+00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADA 2889145.1
1 2022-09-26 08:25:00+00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADA 2889145.1
2 2022-09-26 08:30:00+00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADA 2889145.1
3 2022-09-26 08:35:00+00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADA 2889145.1
4 2022-09-26 08:40:00+00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADA 2889145.1
--
100 2022-09-27 08:20:00+00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADB 2889145.1
101 2022-09-27 08:25:00+00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADB 2889145.1
102 2022-09-27 08:30:00+00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADB 2889145.1
103 2022-09-27 08:35:00+00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADB 2889145.1
104 2022-09-27 08:40:00+00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADB 2889145.1
How can I achieve this?
Update, tried drop_duplicates as suggested by Lukas, like so:
Read from db in a df:
df = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
Get the length (4769):
print(len(df))
And then:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.drop_duplicates(subset=['symbol', 'timestamp'])
print(len(df))
But it returns the same length.
How can I get my drop_duplicates to work with minute data?

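You can use DataFrame.drop_duplicates. It returns a new DataFrame rather than modifying df in place, so assign the result:
df = df.drop_duplicates(subset=['timestamp', 'symbol'])
By default, it keeps the first occurrence of each combination of values in the timestamp and symbol columns, but you can change this behavior with the keep parameter. Note that this only removes rows that repeat the same timestamp for the same symbol.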
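If the goal is one symbol per calendar day rather than one row per (timestamp, symbol) pair, a minimal sketch along the following lines may help. It assumes the timestamp column is already a datetime column, and the names day, first_symbol and deduped are only illustrative. It keeps whichever symbol appears first on each day:

```python
import pandas as pd

# Small hand-made sample in the shape of the question's data
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-09-26 08:20:00+00:00",
        "2022-09-26 08:25:00+00:00",
        "2022-09-26 08:20:00+00:00",
        "2022-09-26 08:25:00+00:00",
        "2022-09-27 08:20:00+00:00",
        "2022-09-27 08:25:00+00:00",
    ]),
    "symbol": ["ADA", "ADA", "ADD", "ADD", "ADB", "ADB"],
    "close": [1.34, 1.66, 1.34, 1.66, 1.34, 1.66],
})

# Calendar day of each bar (midnight of the same day, same timezone)
day = df["timestamp"].dt.normalize()

# First symbol seen on each day, broadcast back to every row of that day
first_symbol = df.groupby(day)["symbol"].transform("first")

# Keep only the rows that belong to that first symbol
deduped = df[df["symbol"] == first_symbol].reset_index(drop=True)
print(deduped)
```

With this sample, the ADA rows are kept for 2022-09-26 and the ADD rows for the same day are dropped.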

Related

Trying to use the BeautifulSoup Python module to pull individual elements from table data

I am new to Python and currently using BeautifulSoup with Python to try and pull some table data. I cannot get the individual elements out of the td. What I have so far is:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/').text
soup = BeautifulSoup(source, 'lxml')
td = soup.find_all('td', {'class': 'text-center'})
print(td)
This displays all of the td elements that I want to extract, but I am unable to figure out how to get each individual element out of them.
Thank you in advance for the help; it is much appreciated.

Try this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/').text
soup = BeautifulSoup(source, 'lxml')
td = soup.find_all('td', {'class': 'text-center'})
print(*[text.get_text(strip=True) + '\n' for text in td])
Prints:
S10
NA
14
35.7%
0.91
1744
-48
33:19
11.2
12.4
5.5
7.0
50.0
64.3
2.71
54.2
1.00
57.1
1.14
and so on....
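To keep the values grouped per table row instead of one flat list, a sketch like the following could work. The HTML string here is a made-up stand-in for the real page (so the example runs without a network request), and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page
html = """
<table>
  <tr><td class="text-center">S10</td><td class="text-center">NA</td><td class="text-center">14</td></tr>
  <tr><td class="text-center">S10</td><td class="text-center">NA</td><td class="text-center">14</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Flat list of cell texts, one string per td
cells = soup.find_all("td", {"class": "text-center"})
values = [cell.get_text(strip=True) for cell in cells]

# One list per table row, so each element can be addressed as rows[r][c]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]
print(values[0])
print(rows[1][2])
```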

The following script extracts the table and saves it to a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/')
soup = BeautifulSoup(res.text, 'html.parser')
table = soup.find("table", class_="table_list playerslist tablesaw trhover")
columns = [i.get_text(strip=True) for i in table.find("thead").find_all("th")]
data = []
table.find("thead").extract()
for tr in table.find_all("tr"):
    data.append([td.get_text(strip=True) for td in tr.find_all("td")])
df = pd.DataFrame(data, columns=columns)
df.to_csv("data.csv", index=False)
Output:
Name Season Region Games Win rate K:D GPM GDM Game duration Kills / game Deaths / game Towers killed Towers lost FB% FT% DRAPG DRA% HERPG HER% DRA#15 TD#15 GD#15 NASHPG NASH% CSM DPM WPM VWPM WCPM
0 100 Thieves S10 NA 14 35.7% 0.91 1744 -48 33:19 11.2 12.4 5.5 7.0 50.0 64.3 2.71 54.2 1.00 57.1 1.14 0.4 -378 0.64 42.9 33.2 1937 3.0 1.19 1.31
1 CLG S10 NA 14 35.7% 0.81 1705 -120 35:25 10.6 13.2 4.9 7.9 28.6 28.6 1.93 31.5 0.57 28.6 0.64 -0.6 -1297 0.57 30.4 32.6 1826 3.2 1.17 1.37
2 Cloud9 S10 NA 14 78.6% 1.91 1922 302 28:52 15.0 7.9 8.3 3.1 64.3 64.3 3.07 72.5 1.43 71.4 1.29 0.7 2410 1.00 78.6 33.3 1921 3.0 1.10 1.26
3 Dignitas S10 NA 14 28.6% 0.86 1663 -147 32:44 8.9 10.4 3.9 8.1 42.9 35.7 2.14 41.7 0.57 28.6 0.79 -0.7 -796 0.36 25.0 32.5 1517 3.1 1.28 1.23
4 Evil Geniuses S10 NA 14 50.0% 0.85 1738 -0 34:09 11.1 13.1 6.5 6.0 64.3 57.1 2.36 48.5 1.00 53.6 1.00 0.5 397 0.50 46.5 32.3 1895 3.2 1.36 1.34
5 FlyQuest S10 NA 14 57.1% 1.28 1770 65 34:55 13.4 10.4 6.5 5.2 71.4 35.7 2.86 53.4 1.00 50.0 0.79 -0.1 69 0.71 69.2 32.7 1801 3.2 1.16 1.72
6 Golden Guardians S10 NA 14 50.0% 0.96 1740 6 36:13 10.7 11.1 6.3 6.1 50.0 35.7 3.29 62.8 0.86 42.9 1.43 0.1 711 0.50 43.6 33.7 1944 3.2 1.27 1.53
7 Immortals S10 NA 14 21.4% 0.54 1609 -246 33:54 7.5 14.0 4.3 7.9 35.7 35.7 2.29 39.9 1.00 53.6 0.79 -0.4 -1509 0.36 25.0 31.4 1734 3.3 1.37 1.47
8 Team Liquid S10 NA 14 78.6% 1.31 1796 135 35:07 11.4 8.6 7.9 4.4 42.9 64.3 2.36 43.6 0.93 50.0 1.14 0.2 522 1.21 78.6 33.1 1755 3.5 1.27 1.42
9 TSM S10 NA 14 64.3% 1.12 1768 52 34:20 11.6 10.4 7.2 5.7 50.0 78.6 2.79 51.9 1.21 64.3 0.93 0.1 -129 0.86 57.1 32.6 1729 3.2 1.33 1.33

Creating a heatmap using python and csv file

I'm trying to create a heatmap, with the x axis being time, the y axis being detectors (it's for freeway speed detection), and the colour scheme and numbers on the graph being for occupancy or basically what values the csv has at that time and detector.
My first thought is to use matplotlib in conjunction with pandas and numpy.
I've been trying lots of different approaches and feel like I've hit a brick wall in terms of getting it working.
Does anyone have a good idea about using these tools?
Cheers!
Row Labels 14142OB_L1 14142OB_L2 14140OB_E1P0 14140OB_E1P1 14140OB_E2P0 14140OB_E2P1 14140OB_L1 14140OB_L2 14140OB_M1P0 14140OB_M1P1 14140OB_M2P0 14140OB_M2P1 14140OB_M3P0 14140OB_M3P1 14140OB_S1P0 14140OB_S1P1 14140OB_S2P0 14140OB_S2P1 14140OB_S3P0 14140OB_S3P1 14138OB_L1 14138OB_L2 14138OB_L3 14136OB_L1 14136OB_L2 14136OB_L3 14134OB_L1 14134OB_L2 14134OB_L3 14132OB_L1 14132OB_L2 14132OB_L3
00 - 01 hr 0.22 1.42 0.29 0.29 0.59 0.59 0.17 1.47 0.38 0.38 0.56 0.6 0.08 0.1 0.67 0.7 0.88 0.9 0.15 0.17 0.17 1.66 0.47 0.16 1.6 0.49 0.14 0.94 1.21 0.21 1.22 0.44
01 - 02 hr 0.08 0.77 0.08 0.07 0.24 0.24 0.1 0.73 0.08 0.09 0.21 0.23 0.05 0.06 0.21 0.23 0.29 0.29 0.1 0.1 0.08 0.83 0.17 0.1 0.77 0.18 0.08 0.4 0.57 0.07 0.64 0.18
02 - 03 hr 0.08 0.73 0.06 0.06 0.23 0.23 0.06 0.73 0.07 0.07 0.23 0.24 0.02 0.02 0.16 0.17 0.32 0.34 0.06 0.07 0.06 0.77 0.16 0.06 0.78 0.17 0.07 0.3 0.66 0.06 0.68 0.19
03 - 04 hr 0.05 0.85 0.06 0.06 0.22 0.23 0.04 0.86 0.05 0.05 0.2 0.21 0.1 0.11 0.11 0.12 0.32 0.33 0.15 0.16 0.03 0.93 0.14 0.03 0.89 0.15 0.03 0.41 0.61 0.02 0.73 0.21
04 - 05 hr 0.13 1.25 0.09 0.09 0.24 0.24 0.12 1.25 0.11 0.11 0.2 0.21 0.08 0.09 0.19 0.2 0.32 0.34 0.15 0.15 0.1 1.33 0.18 0.11 1.35 0.19 0.11 0.52 1 0.07 1.08 0.29
05 - 06 hr 0.91 2.87 0.08 0.08 0.66 0.69 0.8 2.96 0.15 0.17 0.43 0.45 0.32 0.33 0.39 0.41 0.76 0.82 0.47 0.49 0.59 3.27 0.51 0.58 3.19 0.56 0.45 1.85 2.19 0.43 2.52 0.79
06 - 07 hr 3.92 5.44 1.29 1.14 4.03 4.12 3.19 6.03 1.66 1.69 3.26 3.44 1.84 1.93 13.03 14.97 13.81 19.23 4.69 5.59 3.03 6.72 3.01 2.78 6.81 3.02 1.52 4.22 7.13 2.54 5.94 2.88
07 - 08 hr 4.68 6.35 1.67 1.8 5.69 5.95 4.01 6.81 2.69 2.78 3.84 4.03 3.27 4.05 24.25 24.39 28.07 36.5 15.39 15.38 3.79 7.91 4.28 3.58 7.91 4.33 1.67 6.16 8.3 3.17 6.59 3.74
08 - 09 hr 5.21 6.31 2.51 2.82 7.46 7.72 4.53 6.65 9.03 8.98 13.94 12.77 6.73 8.55 47 48.38 50.08 48.32 22.83 21.91 4.29 8.27 5.04 4.15 8.27 5.16 2.44 6.24 9.17 3.26 6.81 4.16
09 - 10 hr 4.05 6.17 1.01 0.99 4.47 4.55 3.45 6.53 1.68 1.74 3.12 3.24 1.82 1.98 16.49 16.22 15.58 20.36 4.31 5.2 3.36 7.24 3.55 3.03 7.36 3.73 1.89 5.64 6.75 2.24 5.94 3.26
10 - 11 hr 3.62 6.64 1.14 1.15 4.11 4.18 3.23 6.87 1.79 1.87 3.03 3.13 1.72 1.89 15.02 18.75 17.25 22.61 3.06 3.24 3.06 7.69 3.23 2.87 7.49 3.56 2.06 4.99 7.05 2.26 6.2 3.07
11 - 12 hr 4.31 6.74 1.29 1.3 4.91 4.97 3.79 6.88 2.25 2.35 3.97 4.29 1.84 1.98 19.58 22.5 24.92 23.14 3.27 3.46 3.65 7.67 3.96 3.43 7.74 4 2.39 5.4 7.67 2.57 6.42 3.22
12 - 13 hr 4.53 6.9 1.4 1.39 5.81 5.9 3.96 7.18 2.69 2.86 4.94 5.28 2.15 2.29 24.46 28.34 36.59 31.06 5.4 5.39 3.95 7.98 4.54 3.7 8.03 4.69 2.36 5.99 8.29 3.01 6.61 3.37
13 - 14 hr 6.13 7.29 1.57 1.55 6.02 6.11 5.34 7.74 2.67 2.76 5.2 5.56 2.04 2.16 23.74 28.31 31.01 36.89 4.15 4.6 5.22 8.83 4.77 4.96 8.84 4.92 2.65 6.56 9.77 3.96 7.23 3.88
14 - 15 hr 8.72 8.22 2.93 3.06 8.58 8.9 8.94 9.57 17.69 17.2 18.99 23.58 2.37 3.69 38.81 53.33 49.93 45.42 5.69 4.3 8.13 10.04 5.45 7.03 9.94 5.51 3.59 7.41 12.4 5.92 8.04 4.4
15 - 16 hr 13.26 9.75 15.68 18.3 22.21 23.25 10.8 9.06 35.31 37.1 36.27 35.89 3.14 2.91 47.93 54.86 51.96 50.74 6.27 5.77 11.82 12.78 7.62 12.03 12.5 6.55 4.71 9.21 17.87 9.06 9.33 4.5
16 - 17 hr 18.25 14.92 4.95 4.63 9.68 10.2 20.14 16.68 21.38 21.39 23.92 28.11 1.75 1.86 48.15 47.31 46.65 50.4 3.46 3.31 21.52 16.97 7.37 18.47 14.84 7.51 6.88 15.52 27.8 11.17 9.35 5.34
17 - 18 hr 13.82 9.76 31.23 31.46 34.89 36.06 13.72 11.14 41.24 44.5 42 47.07 1.6 1.62 57.4 58.92 57.23 62.92 3.41 8.01 20.26 20.35 15.25 21.49 20.5 9.31 12.27 17.3 34.46 22.89 20.56 12.04
18 - 19 hr 7.51 5.81 50.48 49.94 45.97 46.43 8.65 5.95 49.26 48.28 51.04 46.46 2 3.04 56.08 56.39 54.95 59.06 3.18 6.47 13.44 13.73 25.79 17.67 21.52 19.26 6.35 11.52 22.13 11.31 10.4 5.42
19 - 20 hr 3.96 5.01 2.77 2.71 6.62 6.87 3.65 5.19 7.72 7.86 9.5 10.44 1.17 1.44 23.6 30.16 28.82 30.87 1.73 1.76 3.6 6.52 4.04 3.38 6.51 4.03 1.88 5.05 7.15 2.99 5.44 3.1
20 - 21 hr 2.16 3.72 1.75 1.74 3.96 4.02 2.03 3.72 2.62 2.73 4.32 4.54 0.76 0.79 18.41 23.69 30.91 31.05 1.31 1.26 2.1 4.76 2.97 1.93 4.75 2.97 1.43 3.43 4.9 1.73 3.9 2.27
21 - 22 hr 2.03 3.81 1.49 1.47 2.97 2.99 2 3.79 2.11 2.15 3.07 3.27 0.37 0.4 12.96 14.05 15.49 17.93 0.64 0.67 1.86 4.87 2.35 1.75 4.88 2.29 1.14 3.4 4.44 1.57 3.89 1.92
22 - 23 hr 1.33 3.2 1.21 1.22 2.46 2.5 1.21 3.23 1.75 1.79 2.36 2.48 0.35 0.38 6.19 9.26 10.48 12.16 0.57 0.58 1.28 3.85 2 1.23 3.84 1.96 0.82 2.74 3.55 1.12 3.29 1.73
23 - 24 hr 0.65 2.43 0.49 0.49 1.41 1.44 0.69 2.35 0.69 0.7 1.3 1.38 0.19 0.21 1.51 1.66 2.46 2.45 0.41 0.42 0.71 2.63 1.06 0.59 2.73 1.04 0.4 1.8 2.25 0.58 2.28 0.94
Grand Total 4.57 5.26 5.23 5.32 7.64 7.85 4.36 5.56 8.54 8.73 9.83 10.29 1.49 1.74 20.68 23.05 23.71 25.17 3.78 4.1 4.84 6.98 4.5 4.79 7.21 3.98 2.39 5.29 8.59 3.84 5.63 2.97
Here is the current script I'm using.
read_occupancy = pd.read_csv (r'C:\Users\holborm\Desktop\Visualisation\dataaxisplotstuff.csv') #read the csv file (put 'r' before the path string to address any special characters, such as '\'). Don't forget to put the file name at the end of the path + ".csv"
df = DataFrame(read_occupancy) # assign column names
#create time and detector name axis
time_axis = df.index
detector_axis = df.columns
plt.plot(df)
Using Seaborn
read_occupancy = pd.read_csv (r'C:\Users\holborm\Desktop\Visualisation\dataaxisplotstuff.csv') #read the csv file (put 'r' before the path string to address any special characters, such as '\'). Don't forget to put the file name at the end of the path + ".csv"
df = DataFrame(read_occupancy) # assign column names
#create time and detector name axis
sns.heatmap(df)
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-79-33a3388e21cc> in <module>()
6 #create time and detector name axis
7
----> 8 sns.heatmap(df)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py in heatmap(data, vmin, vmax, cmap, center, robust, annot, fmt, annot_kws, linewidths, linecolor, cbar, cbar_kws, cbar_ax, square, xticklabels, yticklabels, mask, ax, **kwargs)
515 plotter = _HeatMapper(data, vmin, vmax, cmap, center, robust, annot, fmt,
516 annot_kws, cbar, cbar_kws, xticklabels,
--> 517 yticklabels, mask)
518
519 # Add the pcolormesh kwargs here
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py in __init__(self, data, vmin, vmax, cmap, center, robust, annot, fmt, annot_kws, cbar, cbar_kws, xticklabels, yticklabels, mask)
166 # Determine good default values for the colormapping
167 self._determine_cmap_params(plot_data, vmin, vmax,
--> 168 cmap, center, robust)
169
170 # Sort out the annotations
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py in _determine_cmap_params(self, plot_data, vmin, vmax, cmap, center, robust)
203 cmap, center, robust):
204 """Use some heuristics to set good defaults for colorbar and range."""
--> 205 calc_data = plot_data.data[~np.isnan(plot_data.data)]
206 if vmin is None:
207 vmin = np.percentile(calc_data, 2) if robust else calc_data.min()
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The error occurs because the Row Labels column holds strings, which np.isnan cannot handle. You can use .set_index('Row Labels') so that the Row Labels column is treated as an axis of the heatmap rather than as data, and transpose your DataFrame with .T so that you get the time along the x-axis and the detectors along the y-axis.
sns.heatmap(df.set_index('Row Labels').T)
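As a small sketch of what that reshaping does, here is a cut-down version of the data read from an in-memory string (the three-row CSV below is an abbreviated sample, not the full file). Dropping the 'Grand Total' row is an optional assumption on my part, so that the summary row does not show up as an extra hour in the heatmap:

```python
import io
import pandas as pd

# Abbreviated sample of the CSV from the question
csv_text = """Row Labels,14142OB_L1,14142OB_L2
00 - 01 hr,0.22,1.42
01 - 02 hr,0.08,0.77
Grand Total,4.57,5.26
"""
df = pd.read_csv(io.StringIO(csv_text))

# Move the text labels into the index, drop the summary row,
# then transpose so detectors are rows and hours are columns
heat = df.set_index("Row Labels").drop(index="Grand Total").T
print(heat)
print(heat.dtypes)
```

The resulting heat frame contains only floats, with detectors as rows and hours as columns, so it can be passed to sns.heatmap(heat).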

Creating new df columns via iteration

I have a dataframe, df which looks like this
Open High Low Close Volume
Date
2007-03-22 2.65 2.95 2.64 2.86 176389
2007-03-23 2.87 2.87 2.78 2.78 63316
2007-03-26 2.83 2.83 2.51 2.52 54051
2007-03-27 2.61 3.29 2.60 3.28 589443
2007-03-28 3.65 4.10 3.60 3.80 1114659
2007-03-29 3.91 3.91 3.33 3.57 360501
2007-03-30 3.70 3.88 3.66 3.71 185787
I'm trying to create a new column that takes the df.Open value 5 days ahead of each df.Open value and subtracts the current value from it.
So the loop I'm using is this:
for i in range(0, len(df.Open)): #goes through indexes values
    df['5days'][i]=df.Open[i+5]-df.Open[i] #I use those index values to locate
However, this loop is yielding an error.
KeyError: '5days'
Not sure why. I got this to temporarily work by removing the df['5days'][i], but it seems awfully slow. Not sure if there is a more efficient way to do this.
Thank you.

Using diff
df['5Days'] = df.Open.diff(5)
print(df)
Open High Low Close Volume 5Days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 0.83
However, per your code, you may want to look ahead and align the results back. In that case
df['5Days'] = -df.Open.diff(-5)
print(df)
Open High Low Close Volume 5Days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 1.26
2007-03-23 2.87 2.87 2.78 2.78 63316 0.83
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 NaN
2007-03-30 3.70 3.88 3.66 3.71 185787 NaN

I think you need shift with sub:
df['5days'] = df.Open.shift(5).sub(df.Open)
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 -1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 -0.83
Or maybe you need to subtract the shifted column from Open:
df['5days'] = df.Open.sub(df.Open.shift(5))
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 0.83
Or, to look ahead instead:
df['5days'] = -df.Open.sub(df.Open.shift(-5))
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 1.26
2007-03-23 2.87 2.87 2.78 2.78 63316 0.83
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 NaN
2007-03-30 3.70 3.88 3.66 3.71 185787 NaN
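To match the loop in the question (the value five rows ahead minus the current value), a minimal sketch using the Open column from the question is shown here. The original KeyError arises because df['5days'] does not exist yet when the loop tries to index into it; assigning a whole column at once avoids that. Note that shift counts rows, not calendar days:

```python
import pandas as pd

# Open prices from the question, indexed by date
df = pd.DataFrame(
    {"Open": [2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70]},
    index=pd.to_datetime([
        "2007-03-22", "2007-03-23", "2007-03-26", "2007-03-27",
        "2007-03-28", "2007-03-29", "2007-03-30",
    ]),
)

# shift(-5) pulls the value from five rows ahead onto the current row;
# rows without a value five rows ahead become NaN
df["5days"] = df["Open"].shift(-5) - df["Open"]
print(df)
```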

Can't index by timestamp in pandas dataframe

I took an excel sheet which has dates and some values and want to convert them to pandas dataframe and select only rows which are between certain dates.
For some reason I cannot select a row by date index
Raw Data in Excel file
MCU
Timestamp 50D 10P1 10P2 10P3 10P6 10P9 10P12
12-Feb-15 25.17 5.88 5.92 5.98 6.18 6.23 6.33
11-Feb-15 25.9 6.05 6.09 6.15 6.28 6.31 6.39
10-Feb-15 26.38 5.94 6.05 6.15 6.33 6.39 6.46
Code
xls = pd.ExcelFile('e:/Data.xlsx')
vols = xls.parse(asset.upper()+'VOL',header=1)
vols.set_index('Timestamp',inplace=True)
Data before set_index
Timestamp 50D 10P1 10P2 10P3 10P6 10P9 10P12 25P1 25P2 \
0 2015-02-12 25.17 5.88 5.92 5.98 6.18 6.23 6.33 2.98 3.08
1 2015-02-11 25.90 6.05 6.09 6.15 6.28 6.31 6.39 3.12 3.17
2 2015-02-10 26.38 5.94 6.05 6.15 6.33 6.39 6.46 3.01 3.16
Data after set_index
50D 10P1 10P2 10P3 10P6 10P9 10P12 25P1 25P2 25P3 \
Timestamp
2015-02-12 25.17 5.88 5.92 5.98 6.18 6.23 6.33 2.98 3.08 3.21
2015-02-11 25.90 6.05 6.09 6.15 6.28 6.31 6.39 3.12 3.17 3.32
2015-02-10 26.38 5.94 6.05 6.15 6.33 6.39 6.46 3.01 3.16 3.31
Output
>>> vols.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-02-12, ..., NaT]
Length: 1478, Freq: None, Timezone: None
>>> vols[date(2015,2,12)]
*** KeyError: datetime.date(2015, 2, 12)
I would expect this not to fail, and also I should be able to select a range of dates. Tried so many combinations but not getting it.

Using a datetime.date instance to look up the index won't work; you need a string representation of the date, e.g. '2015-02-12' or '2015/02/14'.
Secondly, vols[date(2015,2,12)] actually looks in your DataFrame's column headings, not the index. Use loc to select rows by index label instead, for example vols.loc['2015-02-12'].
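For selecting a range of dates, slicing with loc also accepts date strings. The sketch below rebuilds a few rows of the question's data by hand; sorting the index first is an assumption I'm making so that the slice works on a monotonic index (the index in the question runs from newest to oldest and contains a NaT):

```python
import pandas as pd

# A few rows from the question, newest first as in the Excel sheet
vols = pd.DataFrame(
    {"50D": [25.17, 25.90, 26.38], "10P1": [5.88, 6.05, 5.94]},
    index=pd.DatetimeIndex(
        ["2015-02-12", "2015-02-11", "2015-02-10"], name="Timestamp"
    ),
)

# Drop rows whose index is NaT (none here, but present in the question's data),
# then sort so that label slicing works on a monotonic index
vols = vols[vols.index.notna()].sort_index()

# Single day, selected by row label
one_day = vols.loc["2015-02-12"]

# Range of dates; both endpoints are included
subset = vols.loc["2015-02-10":"2015-02-11"]
print(subset)
```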

pandas dataframe plotting 1 column over 2

This is driving me nuts: I can't plot column 'b'; only column 'A' is plotted.
This is my code. I have no idea what I'm doing wrong; it's probably something silly.
The dataframe seems OK. What's also weird is that I can access both df['A'] and df['b'], but only df['A'].plot() works. If I call df['b'].plot(), I get this error:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\IPython\core\interactiveshell.py", line 2883, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
df['b'].plot()
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2511, in plot_series
**kwds)
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2317, in _plot
plot_obj.generate()
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 921, in generate
self._compute_plot_data()
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 997, in _compute_plot_data
'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'Series': no numeric data to plot
import sqlalchemy
import pandas as pd
import matplotlib.pyplot as plt
engine = sqlalchemy.create_engine(
'sqlite:///C:/Users/toto/PycharmProjects/my_db.sqlite')
tables = engine.table_names()
dic = {}
for t in tables:
    sql = 'SELECT t."weight" FROM "' + t + '" t WHERE t."udl"="IBE SM"'
    dic[t] = (pd.read_sql(sql, engine)['weight'][0], pd.read_sql(sql, engine)['weight'][1])
df = pd.DataFrame.from_dict(dic, orient='index').sort_index()
df = df.set_index(pd.DatetimeIndex(df.index))
df.columns = ['A', 'b']
print(df)
print(df.info())
df.plot()
plt.show()
These are the two printouts:
A b
2014-08-05 1.81 3.39
2014-08-06 1.81 3.39
2014-08-07 1.81 3.39
2014-08-08 1.80 3.37
2014-08-11 1.79 3.35
2014-08-13 1.80 3.36
2014-08-14 1.80 3.35
2014-08-18 1.80 3.35
2014-08-19 1.79 3.34
2014-08-20 1.80 3.35
2014-08-27 1.79 3.35
2014-08-28 1.80 3.35
2014-08-29 1.79 3.35
2014-09-01 1.79 3.35
2014-09-02 1.79 3.35
2014-09-03 1.79 3.36
2014-09-04 1.79 3.37
2014-09-05 1.80 3.38
2014-09-08 1.79 3.36
2014-09-09 1.79 3.35
2014-09-10 1.78 3.35
2014-09-11 1.78 3.34
2014-09-12 1.78 3.34
2014-09-15 1.78 3.35
2014-09-16 1.78 3.35
2014-09-17 1.78 3.35
2014-09-18 1.78 3.34
2014-09-19 1.79 3.35
2014-09-22 1.79 3.36
2014-09-23 1.80 3.37
... ... ...
2014-12-10 1.73 3.29
2014-12-11 1.74 3.27
2014-12-12 1.74 3.25
2014-12-15 1.74 3.24
2014-12-16 1.74 3.27
2014-12-17 1.75 3.28
2014-12-18 1.76 3.29
2014-12-19 1.04 1.39
2014-12-22 1.04 1.39
2014-12-23 1.04 1.4
2014-12-24 1.04 1.39
2014-12-29 1.04 1.39
2014-12-30 1.04 1.4
2015-01-02 1.04 1.4
2015-01-05 1.04 1.4
2015-01-06 1.04 1.4
2015-01-07 NaN 1.39
2015-01-08 NaN 1.39
2015-01-09 NaN 1.39
2015-01-12 NaN 1.38
2015-01-13 NaN 1.38
2015-01-14 NaN 1.38
2015-01-15 NaN 1.38
2015-01-16 NaN 1.38
2015-01-19 NaN 1.39
2015-01-20 NaN 1.38
2015-01-21 NaN 1.39
2015-01-22 NaN 1.4
2015-01-23 NaN 1,4
2015-01-26 NaN 1.41
[107 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 107 entries, 2014-08-05 00:00:00 to 2015-01-26 00:00:00
Data columns (total 2 columns):
A 93 non-null float64
b 107 non-null object
dtypes: float64(1), object(1)
memory usage: 2.1+ KB
None
Process finished with exit code 0

Just got it: 'b' is of object type rather than float64 because of this line:
2015-01-23 NaN 1,4
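One possible way to repair the column, sketched on a tiny hand-made frame (the values are taken from the printout above; the cleaning step is my assumption about how to handle the stray comma):

```python
import pandas as pd

# Column 'b' mixes floats with one string that uses a decimal comma
df = pd.DataFrame({"A": [1.04, None, None], "b": [1.39, "1,4", 1.41]})

# Turn everything into text, swap the decimal comma for a point,
# then parse the column back into numbers
cleaned = df["b"].astype(str).str.replace(",", ".", regex=False)
df["b"] = pd.to_numeric(cleaned)
print(df.dtypes)
```

Once 'b' is numeric, df.plot() should draw both columns.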
