Issue with connecting points for 3d line plot in plotly - python
I have a dataframe of the following format:
x y z aa_num frame_num cluster
1 1.86 3.11 8.62 1 1 1
2 1.77 3.32 8.31 2 1 1
3 1.59 3.17 8.00 3 1 1
4 1.67 3.49 7.81 4 1 1
5 2.04 3.59 7.81 5 1 1
6 2.20 3.34 7.57 6 1 1
7 2.09 3.19 7.25 7 1 1
8 2.13 3.30 6.89 8 1 1
9 2.17 3.63 6.70 9 1 1
10 2.22 3.63 6.33 10 1 1
11 2.06 3.83 6.04 11 1 1
12 2.31 3.75 5.76 12 1 1
13 2.15 3.45 5.59 13 1 1
14 2.21 3.28 5.26 14 1 1
15 2.00 3.13 4.98 15 1 1
16 2.13 2.86 4.74 16 1 1
17 1.97 2.78 4.41 17 1 1
18 2.20 2.76 4.10 18 1 1
19 2.43 2.46 4.14 19 1 1
20 2.34 2.23 3.85 20 1 1
21 2.61 2.16 3.59 21 1 1
22 2.42 1.92 3.36 22 1 1
23 2.44 1.95 2.98 23 1 1
24 2.26 1.62 2.94 24 1 1
25 2.19 1.35 3.20 25 1 1
26 1.92 1.11 3.08 26 1 1
27 1.93 0.83 3.33 27 1 1
28 1.83 0.72 3.68 28 1 1
29 1.95 0.47 3.95 29 1 1
30 1.84 0.36 4.29 30 1 1
31 0.56 3.93 7.07 1 2 1
32 0.66 3.84 7.42 2 2 1
33 0.87 3.54 7.49 3 2 1
34 0.84 3.19 7.33 4 2 1
35 0.76 3.32 6.98 5 2 1
36 0.88 3.23 6.63 6 2 1
37 1.10 3.46 6.43 7 2 1
38 1.35 3.49 6.15 8 2 1
39 1.72 3.50 6.23 9 2 1
40 1.88 3.67 5.93 10 2 1
41 2.25 3.72 5.97 11 2 1
42 2.43 3.48 5.74 12 2 1
43 2.23 3.35 5.44 13 2 1
44 2.23 3.38 5.06 14 2 1
45 2.01 3.38 4.76 15 2 1
46 2.02 3.44 4.38 16 2 1
47 1.98 3.10 4.20 17 2 1
48 2.05 3.13 3.83 18 2 1
49 2.28 2.85 3.72 19 2 1
50 2.09 2.56 3.58 20 2 1
51 2.21 2.37 3.27 21 2 1
52 2.06 2.04 3.15 22 2 1
53 1.93 2.01 2.80 23 2 1
54 1.86 1.64 2.83 24 2 1
55 1.95 1.38 3.10 25 2 1
56 1.78 1.04 3.04 26 2 1
57 1.90 0.84 3.34 27 2 1
58 1.83 0.74 3.70 28 2 1
59 1.95 0.48 3.95 29 2 1
60 1.84 0.36 4.29 30 2 1
etc.
I'm trying to create a 3d line plot of this data, where a line consisting of 30 <x,y,z> points is plotted for each frame_num, with the points connected in the order of aa_num. The code to do this is as follows:
fig = plot_ly(output_cl, x = ~x, y = ~y, z = ~z, type = 'scatter3d', mode = 'lines+markers',
opacity = 1, line = list(width = 1, color = ~frame_num, colorscale = 'Viridis'),
marker = list(size = 2, color = ~frame_num, colorscale = 'Viridis'))
When I plot a single frame, it works fine:
However, a strange issue arises when I try to plot multiple instances.
For some reason, when I try to plot frames 1 and 2, points 1 and 30 connect to each other in frame 2. However, this doesn't happen for frame 1. Any ideas why? Is there some way to specify the ordering of points in 3d in plotly?
Plotly draws one continuous line through all points of a trace, so with a purely numeric frame_num both frames end up in a single trace and the line runs from the last point of frame 1 straight into the first point of frame 2. If you want to create two different traces based on the column frame_num, you need to pass it as a categorical variable by using factor. As an alternative you can use name = ~frame_num or split = ~frame_num to create multiple traces.
library(plotly)
output_cl <- data.frame(
x = c(1.86,1.77,1.59,1.67,2.04,2.2,2.09,
2.13,2.17,2.22,2.06,2.31,2.15,2.21,2,2.13,1.97,2.2,
2.43,2.34,2.61,2.42,2.44,2.26,2.19,1.92,1.93,1.83,1.95,
1.84,0.56,0.66,0.87,0.84,0.76,0.88,1.1,1.35,1.72,1.88,
2.25,2.43,2.23,2.23,2.01,2.02,1.98,2.05,2.28,2.09,
2.21,2.06,1.93,1.86,1.95,1.78,1.9,1.83,1.95,1.84),
y = c(3.11,3.32,3.17,3.49,3.59,3.34,3.19,
3.3,3.63,3.63,3.83,3.75,3.45,3.28,3.13,2.86,2.78,2.76,
2.46,2.23,2.16,1.92,1.95,1.62,1.35,1.11,0.83,0.72,
0.47,0.36,3.93,3.84,3.54,3.19,3.32,3.23,3.46,3.49,3.5,
3.67,3.72,3.48,3.35,3.38,3.38,3.44,3.1,3.13,2.85,2.56,
2.37,2.04,2.01,1.64,1.38,1.04,0.84,0.74,0.48,0.36),
z = c(8.62,8.31,8,7.81,7.81,7.57,7.25,6.89,
6.7,6.33,6.04,5.76,5.59,5.26,4.98,4.74,4.41,4.1,
4.14,3.85,3.59,3.36,2.98,2.94,3.2,3.08,3.33,3.68,3.95,
4.29,7.07,7.42,7.49,7.33,6.98,6.63,6.43,6.15,6.23,5.93,
5.97,5.74,5.44,5.06,4.76,4.38,4.2,3.83,3.72,3.58,
3.27,3.15,2.8,2.83,3.1,3.04,3.34,3.7,3.95,4.29),
aa_num = c(1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,
11L,12L,13L,14L,15L,16L,17L,18L,19L,20L,21L,22L,23L,
24L,25L,26L,27L,28L,29L,30L,1L,2L,3L,4L,5L,6L,7L,
8L,9L,10L,11L,12L,13L,14L,15L,16L,17L,18L,19L,20L,
21L,22L,23L,24L,25L,26L,27L,28L,29L,30L),
frame_num = c(1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,
2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,
2L,2L),
cluster = c(1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L)
)
fig = plot_ly(output_cl, x = ~x, y = ~y, z = ~z, type = 'scatter3d', mode = 'lines+markers',
              color = ~factor(frame_num), # or name = ~frame_num or split = ~frame_num
              colors = "Set2", opacity = 1, line = list(width = 1), marker = list(size = 2))
fig
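If you would rather keep everything in a single trace, another common plotly idiom is to insert a row of missing values between groups, since plotly breaks a line wherever a coordinate is NaN. A minimal Python sketch of just the reshaping step (the helper name with_gaps is invented for illustration; the same idea works in R with an NA row):

```python
import numpy as np
import pandas as pd

def with_gaps(df, group_col):
    """Concatenate the groups of df with an all-NaN spacer row between
    consecutive groups, so a line trace breaks between groups."""
    parts = []
    spacer = pd.DataFrame({c: [np.nan] for c in df.columns})
    for i, (_, g) in enumerate(df.groupby(group_col, sort=False)):
        if i:
            parts.append(spacer)  # gap row between groups
        parts.append(g)
    return pd.concat(parts, ignore_index=True)

# Toy frame standing in for output_cl: two frames of two points each
df = pd.DataFrame({"x": [1, 2, 3, 4], "frame": [1, 1, 2, 2]})
out = with_gaps(df, "frame")
```

The resulting frame has five rows, with row 2 entirely NaN, so a lines trace built from it would show two disconnected segments.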
Related
Apply rolling custom function with pandas
There are a few similar questions on this site, but I couldn't find a solution to my particular question. I have a dataframe that I want to process with a custom function (the real function has a bit more pre-processing, but the gist is contained in the toy example fun).

import statsmodels.api as sm
import numpy as np
import pandas as pd

mtcars = pd.DataFrame(sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data)

def fun(col1, col2, w1=10, w2=2):
    return np.mean(w1 * col1 + w2 * col2)

# This is the behavior I would expect for the full dataset, currently working
mtcars.apply(lambda x: fun(x.cyl, x.mpg), axis=1)

# This was my approach to do the same with a rolling function
mtcars.rolling(3).apply(lambda x: fun(x.cyl, x.mpg))

The rolling version returns this error:

AttributeError: 'Series' object has no attribute 'cyl'

I figured I don't fully understand how rolling works, since adding a print statement to the beginning of my function shows that fun is not getting the full dataset but an unnamed series of 3. What is the approach to apply this rolling function in pandas? Just in case, I am running

>>> pd.__version__
'1.5.2'

Update

Looks like there is a very similar question here which might partially overlap with what I'm trying to do. For completeness, here's how I would do this in R with the expected output.
library(dplyr)

fun <- function(col1, col2, w1=10, w2=2){
  return(mean(w1*col1 + w2*col2))
}

mtcars %>%
  mutate(roll = slider::slide2(.x = cyl, .y = mpg, .f = fun, .before = 1, .after = 1))

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb     roll
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      102
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 96.53333
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1     96.8
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 101.9333
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 105.4667
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1    107.4
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 97.86667
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 94.33333
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 90.93333
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     93.2
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 102.2667
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3 107.6667
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3    112.6
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3    108.6
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4      104
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 103.6667
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4      105
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1      105
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 104.4667
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1     97.2
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1    100.6
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2 101.4667
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2 109.3333
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4    111.8
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2 106.5333
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 101.6667
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     95.8
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2 101.4667
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4 103.9333
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6      107
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8     97.4
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2     96.4
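The error in the question can be reproduced minimally: DataFrame.rolling(...).apply invokes the callback once per column, handing it a bare window Series rather than a sub-DataFrame, so attributes like x.cyl do not exist inside the callback. A small sketch with invented data that records what the callback actually receives:

```python
import pandas as pd

# Tiny stand-in frame (values invented for illustration)
df = pd.DataFrame({"cyl": [6, 6, 4, 8], "mpg": [21.0, 21.0, 22.8, 18.7]})

seen = []  # record every window the callback receives

def probe(window):
    # window is a plain Series holding values of ONE column only
    seen.append(list(window))
    return window.mean()

df.rolling(2).apply(probe, raw=False)
```

Every recorded window has length 2 and contains values from a single column, which is why a two-column function cannot be applied this way directly.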
There is no really elegant way to do this. Here is a suggestion. First, install numpy_ext (use pip install numpy_ext or pip install numpy_ext --user). Second, you'll need to compute your column separately and concat it to your original dataframe:

import statsmodels.api as sm
import pandas as pd
from numpy_ext import rolling_apply as rolling_apply_ext
import numpy as np

mtcars = pd.DataFrame(sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data).reset_index()

def fun(col1, col2, w1=10, w2=2):
    return w1 * col1 + w2 * col2

Col = pd.DataFrame(rolling_apply_ext(fun, 3, mtcars.cyl.values, mtcars.mpg.values)).rename(columns={2: 'rolling'})
mtcars.join(Col["rolling"])

to get:

                  index   mpg  cyl   disp   hp  drat     wt   qsec  vs  am
0             Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1
1         Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1
2            Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1
3        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0
4     Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0
5               Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0
6            Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0
7             Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0
8              Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0
9              Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0
10            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0
11           Merc 450SE  16.4    8  275.8  180  3.07  4.070  17.40   0   0
12           Merc 450SL  17.3    8  275.8  180  3.07  3.730  17.60   0   0
13          Merc 450SLC  15.2    8  275.8  180  3.07  3.780  18.00   0   0
14   Cadillac Fleetwood  10.4    8  472.0  205  2.93  5.250  17.98   0   0
15  Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0
16    Chrysler Imperial  14.7    8  440.0  230  3.23  5.345  17.42   0   0
17             Fiat 128  32.4    4   78.7   66  4.08  2.200  19.47   1   1
18          Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1
19       Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1
20        Toyota Corona  21.5    4  120.1   97  3.70  2.465  20.01   1   0
21     Dodge Challenger  15.5    8  318.0  150  2.76  3.520  16.87   0   0
22          AMC Javelin  15.2    8  304.0  150  3.15  3.435  17.30   0   0
23           Camaro Z28  13.3    8  350.0  245  3.73  3.840  15.41   0   0
24     Pontiac Firebird  19.2    8  400.0  175  3.08  3.845  17.05   0   0
25            Fiat X1-9  27.3    4   79.0   66  4.08  1.935  18.90   1   1
26        Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.70   0   1
27         Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.90   1   1
28       Ford Pantera L  15.8    8  351.0  264  4.22  3.170  14.50   0   1
29         Ferrari Dino  19.7    6  145.0  175  3.62  2.770  15.50   0   1
30        Maserati Bora  15.0    8  301.0  335  3.54  3.570  14.60   0   1
31           Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.60   1   1

    gear  carb  rolling
0      4     4      NaN
1      4     4      NaN
2      4     1     85.6
3      3     1    102.8
4      3     2    117.4
5      3     1     96.2
6      3     4    108.6
7      4     2     88.8
8      4     2     85.6
9      4     4     98.4
10     4     4     95.6
11     3     3    112.8
12     3     3    114.6
13     3     3    110.4
14     3     4    100.8
15     3     4    100.8
16     3     4    109.4
17     4     1    104.8
18     4     2    100.8
19     4     1    107.8
20     3     1     83.0
21     3     2    111.0
22     3     2    110.4
23     3     4    106.6
24     3     2    118.4
25     4     1     94.6
26     5     2     92.0
27     5     2    100.8
28     5     4    111.6
29     5     6     99.4
30     5     8    110.0
31     4     2     82.8
You can use the function below for rolling apply. It might be slow compared to pandas' built-in rolling in certain situations, but it has additional functionality. The arguments win_size and min_periods behave like their pandas counterparts (integer input only). In addition, the after parameter controls the window: it shifts the window forward to include observations after the current row.

def roll_apply(df, fn, win_size, min_periods=None, after=None):
    if min_periods is None:
        min_periods = win_size
    else:
        assert min_periods >= 1
    if after is None:
        after = 0
    before = win_size - 1 - after
    i = np.arange(df.shape[0])
    s = np.maximum(i - before, 0)
    e = np.minimum(i + after, df.shape[0]) + 1
    res = [fn(df.iloc[si:ei]) for si, ei in zip(s, e) if (ei - si) >= min_periods]
    idx = df.index[(e - s) >= min_periods]
    types = {type(ri) for ri in res}
    if len(types) != 1:
        return pd.Series(res, index=idx)
    t = list(types)[0]
    if t == pd.Series:
        return pd.DataFrame(res, index=idx)
    elif t == pd.DataFrame:
        return pd.concat(res, keys=idx)
    else:
        return pd.Series(res, index=idx)

mtcars['roll'] = roll_apply(mtcars, lambda x: fun(x.cyl, x.mpg), win_size=3, min_periods=1, after=1)

index                mpg  cyl  disp  hp  drat  wt    qsec  vs am gear carb roll
Mazda RX4           21.0 6 160.0 110 3.9 2.62 16.46 0 1 4 4 102.0
Mazda RX4 Wag       21.0 6 160.0 110 3.9 2.875 17.02 0 1 4 4 96.53333333333335
Datsun 710          22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1 96.8
Hornet 4 Drive      21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 101.93333333333332
Hornet Sportabout   18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2 105.46666666666665
Valiant             18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1 107.40000000000002
Duster 360          14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4 97.86666666666667
Merc 240D           24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2 94.33333333333333
Merc 230            22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 90.93333333333332
Merc 280            19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 93.2
Merc 280C           17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 102.26666666666667
Merc 450SE          16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 107.66666666666667
Merc 450SL          17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 112.59999999999998
Merc 450SLC         15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3 108.60000000000001
Cadillac Fleetwood  10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4 104.0
Lincoln Continental 10.4 8 460.0 215 3.0 5.424 17.82 0 0 3 4 103.66666666666667
Chrysler Imperial   14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 105.0
Fiat 128            32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1 105.0
Honda Civic         30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 104.46666666666665
Toyota Corolla      33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1 97.2
Toyota Corona       21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1 100.60000000000001
Dodge Challenger    15.5 8 318.0 150 2.76 3.52 16.87 0 0 3 2 101.46666666666665
AMC Javelin         15.2 8 304.0 150 3.15 3.435 17.3 0 0 3 2 109.33333333333333
Camaro Z28          13.3 8 350.0 245 3.73 3.84 15.41 0 0 3 4 111.8
Pontiac Firebird    19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 106.53333333333335
Fiat X1-9           27.3 4 79.0 66 4.08 1.935 18.9 1 1 4 1 101.66666666666667
Porsche 914-2       26.0 4 120.3 91 4.43 2.14 16.7 0 1 5 2 95.8
Lotus Europa        30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 101.46666666666665
Ford Pantera L      15.8 8 351.0 264 4.22 3.17 14.5 0 1 5 4 103.93333333333332
Ferrari Dino        19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6 107.0
Maserati Bora       15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8 97.39999999999999
Volvo 142E          21.4 4 121.0 109 4.11 2.78 18.6 1 1 4 2 96.4

You can pass more complex functions to roll_apply. Below are a few examples:

roll_apply(mtcars, lambda d: pd.Series({'A': d.sum().sum(), 'B': d.std().std()}), win_size=3, min_periods=1, after=1)  # Simple example to illustrate the use case
roll_apply(mtcars, lambda d: d, win_size=3, min_periods=3, after=1)  # This will return a rolling dataframe
I'm not aware of a way to do this calculation easily and efficiently by applying a single function to a pandas dataframe, because you're calculating values across multiple rows and columns. An efficient way is to first calculate the column you want the rolling average of, then calculate the rolling average:

import statsmodels.api as sm
import pandas as pd

mtcars = pd.DataFrame(sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data)

# Create column
def df_fun(df, col1, col2, w1=10, w2=2):
    return w1 * df[col1] + w2 * df[col2]

mtcars['fun_val'] = df_fun(mtcars, 'cyl', 'mpg')

# Calculate rolling average
mtcars['fun_val_r3m'] = mtcars['fun_val'].rolling(3, center=True, min_periods=0).mean()

This gives the correct answer and is efficient, since each step should be optimized for performance. I found that separating the row and column calculations like this is about 10 times faster than the latest approach you proposed, with no need to import numpy. If you don't want to keep the intermediate calculation, fun_val, you can overwrite it with the rolling average value, fun_val_r3m. If you really need to do this in one line with apply, I'm not aware of another way than what you've done in your latest post. numpy array based approaches may perform better, though less readably.
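The two-step idea above can be checked by hand on a tiny synthetic frame (invented here, since the real mtcars fetch needs network access; min_periods=1 is used, which behaves the same as min_periods=0 for a mean):

```python
import pandas as pd

# Synthetic stand-in for mtcars
df = pd.DataFrame({"cyl": [6, 6, 4], "mpg": [21.0, 21.0, 22.8]})

# Step 1: the row-wise part, fully vectorized -> [102.0, 102.0, 85.6]
df["fun_val"] = 10 * df["cyl"] + 2 * df["mpg"]

# Step 2: the window part runs on the precomputed column only
df["fun_val_r3m"] = df["fun_val"].rolling(3, center=True, min_periods=1).mean()
```

With center=True the first window averages rows 0-1, the middle one rows 0-2, and the last rows 1-2, so the expected values are 102.0, (102 + 102 + 85.6) / 3, and 93.8.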
After much searching and fighting with arguments, I found an approach inspired by this answer:

def fun(series, w1=10, w2=2):
    col1 = mtcars.loc[series.index, 'cyl']
    col2 = mtcars.loc[series.index, 'mpg']
    return np.mean(w1 * col1 + w2 * col2)

mtcars['roll'] = mtcars.rolling(3, center=True, min_periods=0)['mpg'] \
    .apply(fun, raw=False)

mtcars

                     mpg  cyl   disp   hp  ...  am  gear  carb        roll
Mazda RX4           21.0    6  160.0  110  ...   1     4     4  102.000000
Mazda RX4 Wag       21.0    6  160.0  110  ...   1     4     4   96.533333
Datsun 710          22.8    4  108.0   93  ...   1     4     1   96.800000
Hornet 4 Drive      21.4    6  258.0  110  ...   0     3     1  101.933333
Hornet Sportabout   18.7    8  360.0  175  ...   0     3     2  105.466667
Valiant             18.1    6  225.0  105  ...   0     3     1  107.400000
Duster 360          14.3    8  360.0  245  ...   0     3     4   97.866667
Merc 240D           24.4    4  146.7   62  ...   0     4     2   94.333333
Merc 230            22.8    4  140.8   95  ...   0     4     2   90.933333
Merc 280            19.2    6  167.6  123  ...   0     4     4   93.200000
Merc 280C           17.8    6  167.6  123  ...   0     4     4  102.266667
Merc 450SE          16.4    8  275.8  180  ...   0     3     3  107.666667
Merc 450SL          17.3    8  275.8  180  ...   0     3     3  112.600000
Merc 450SLC         15.2    8  275.8  180  ...   0     3     3  108.600000
Cadillac Fleetwood  10.4    8  472.0  205  ...   0     3     4  104.000000
Lincoln Continental 10.4    8  460.0  215  ...   0     3     4  103.666667
Chrysler Imperial   14.7    8  440.0  230  ...   0     3     4  105.000000
Fiat 128            32.4    4   78.7   66  ...   1     4     1  105.000000
Honda Civic         30.4    4   75.7   52  ...   1     4     2  104.466667
Toyota Corolla      33.9    4   71.1   65  ...   1     4     1   97.200000
Toyota Corona       21.5    4  120.1   97  ...   0     3     1  100.600000
Dodge Challenger    15.5    8  318.0  150  ...   0     3     2  101.466667
AMC Javelin         15.2    8  304.0  150  ...   0     3     2  109.333333
Camaro Z28          13.3    8  350.0  245  ...   0     3     4  111.800000
Pontiac Firebird    19.2    8  400.0  175  ...   0     3     2  106.533333
Fiat X1-9           27.3    4   79.0   66  ...   1     4     1  101.666667
Porsche 914-2       26.0    4  120.3   91  ...   1     5     2   95.800000
Lotus Europa        30.4    4   95.1  113  ...   1     5     2  101.466667
Ford Pantera L      15.8    8  351.0  264  ...   1     5     4  103.933333
Ferrari Dino        19.7    6  145.0  175  ...   1     5     6  107.000000
Maserati Bora       15.0    8  301.0  335  ...   1     5     8   97.400000
Volvo 142E          21.4    4  121.0  109  ...   1     4     2   96.400000

[32 rows x 12 columns]

There are several things that are needed for this to perform as I wanted:
- raw=False gives fun access to the series, if only to call .index ("False: passes each row or column as a Series to the function."). This is dumb and inefficient, but it works.
- I needed my window centered, so center=True.
- I also needed the NaN filled with available info, so I set min_periods=0.

There are a few things that I don't like about this approach:
- It seems to me that calling mtcars from outside the fun scope is potentially dangerous and might cause bugs.
- Multiple indexing with .loc line by line does not scale well and probably has worse performance (doing the rolling more times than needed).
How do I get all the tables from a website using pandas
I am trying to get 3 tables from a particular website, but only the first two are showing up. I have even tried to get the data using BeautifulSoup, but the third seems to be hidden somehow. Is there something I am missing?

url = "https://fbref.com/en/comps/9/keepersadv/Premier-League-Stats"
html = pd.read_html(url, header=1)
print(html[0])
print(html[1])
print(html[2])  # This prompts an error that the table does not exist

The first two tables are the squad tables. The table not showing up is the individual player table. This also happens with similar pages from the same site.
You could use Selenium as suggested, but I think it is a bit overkill. The table is available in the static HTML, just inside HTML comments, so you need to pull the comments out with BeautifulSoup to get those tables. To get all the tables:

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://fbref.com/en/comps/9/keepersadv/Premier-League-Stats'
response = requests.get(url)
tables = pd.read_html(response.text, header=1)

# Get the tables within the comments
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'table' in str(each):
        try:
            table = pd.read_html(str(each), header=1)[0]
            table = table[table['Rk'].ne('Rk')].reset_index(drop=True)
            tables.append(table)
        except:
            continue

Output:

for table in tables:
    print(table)

              Squad  # Pl   90s  GA  PKA  ...  Stp  Stp%  #OPA  #OPA/90  AvgDist
0           Arsenal     2  12.0  17    0  ...   10   8.8     6     0.50     14.6
1       Aston Villa     2  12.0  20    0  ...    6   6.8    13     1.08     16.2
2         Brentford     2  12.0  17    1  ...   10   9.9    18     1.50     15.6
3          Brighton     2  12.0  14    2  ...   17  16.2    13     1.08     15.3
4           Burnley     1  12.0  20    0  ...   14  11.7    17     1.42     16.6
5           Chelsea     2  12.0   4    2  ...    8   8.5     5     0.42     14.0
6    Crystal Palace     1  12.0  17    0  ...    7   7.5     6     0.50     13.5
7           Everton     2  12.0  19    0  ...    8   7.4     7     0.58     13.7
8      Leeds United     1  12.0  20    1  ...    8  12.5    15     1.25     16.3
9    Leicester City     1  12.0  21    2  ...    9   8.4     7     0.58     13.0
10        Liverpool     2  12.0  11    0  ...    9   9.7    16     1.33     17.0
11  Manchester City     2  12.0   6    1  ...    5   8.1    16     1.33     17.5
12   Manchester Utd     1  12.0  21    0  ...    4   4.4     2     0.17     13.3
13    Newcastle Utd     2  12.0  27    4  ...   10   9.8     4     0.33     13.9
14     Norwich City     1  12.0  27    2  ...    6   5.1     5     0.42     12.4
15      Southampton     1  12.0  14    0  ...   16  13.9     2     0.17     12.9
16        Tottenham     1  12.0  17    1  ...    3   2.7     5     0.42     14.1
17          Watford     2  12.0  20    1  ...    6   5.5     9     0.75     15.4
18         West Ham     1  12.0  14    0  ...    6   5.3     1     0.08     11.9
19           Wolves     1  12.0  12    3  ...    9  10.0    10     0.83     15.5

[20 rows x 28 columns]

                 Squad  # Pl   90s  GA  PKA  ...  Stp  Stp%  #OPA  #OPA/90  AvgDist
0           vs Arsenal     2  12.0  13    0  ...    4   5.9    11     0.92     15.5
1       vs Aston Villa     2  12.0  16    2  ...   11   8.0     7     0.58     14.8
2         vs Brentford     2  12.0  16    1  ...   16  14.0     9     0.75     15.7
3          vs Brighton     2  12.0  12    3  ...   11  12.5     8     0.67     15.9
4           vs Burnley     1  12.0  14    0  ...   16  10.7    12     1.00     15.1
5           vs Chelsea     2  12.0  30    2  ...   10  11.1    11     0.92     14.2
6    vs Crystal Palace     1  12.0  18    2  ...    7   7.2     9     0.75     14.4
7           vs Everton     2  12.0  16    3  ...    7   7.6     7     0.58     13.8
8      vs Leeds United     1  12.0  12    1  ...    8   7.3     5     0.42     14.2
9    vs Leicester City     1  12.0  16    0  ...    2   3.3     7     0.58     14.3
10        vs Liverpool     2  12.0  35    1  ...   12   9.9    14     1.17     13.7
11  vs Manchester City     2  12.0  25    0  ...    8   6.7     4     0.33     13.1
12   vs Manchester Utd     1  12.0  20    0  ...    7   7.8     7     0.58     14.7
13    vs Newcastle Utd     2  12.0  15    0  ...    8   8.0     8     0.67     15.3
14     vs Norwich City     1  12.0   7    2  ...    5   5.7    16     1.33     17.3
15      vs Southampton     1  12.0  11    2  ...    4   3.7     9     0.75     14.0
16        vs Tottenham     1  12.0  11    1  ...    9  12.2     9     0.75     16.0
17          vs Watford     2  12.0  16    0  ...    8   8.2     9     0.75     15.3
18         vs West Ham     1  12.0  23    0  ...   13  10.5     6     0.50     13.8
19           vs Wolves     1  12.0  12    0  ...    5   6.8     9     0.75     15.3

[20 rows x 28 columns]

    Rk             Player Nation  Pos  ...  #OPA  #OPA/90  AvgDist  Matches
0    1            Alisson br BRA   GK  ...    15     1.36     17.1  Matches
1    2  Kepa Arrizabalaga es ESP   GK  ...     1     1.00     18.8  Matches
2    3    Daniel Bachmann at AUT   GK  ...     1     0.25     12.2  Matches
3    4      Asmir Begović ba BIH   GK  ...     0     0.00     15.0  Matches
4    5        Karl Darlow eng ENG  GK  ...     4     0.50     14.9  Matches
5    6            Ederson br BRA  GK  ...    14     1.27     17.5  Matches
6    7   Łukasz Fabiański pl POL  GK  ...     1     0.08     11.9  Matches
7    8   Álvaro Fernández es ESP  GK  ...     5     1.67     15.3  Matches
8    9         Ben Foster eng ENG  GK  ...     8     1.00     16.8  Matches
9   10       David de Gea es ESP  GK  ...     2     0.17     13.3  Matches
10  11     Vicente Guaita es ESP  GK  ...     6     0.50     13.5  Matches
11  12  Caoimhín Kelleher ie IRL  GK  ...     1     1.00     14.6  Matches
12  13           Tim Krul nl NED  GK  ...     5     0.42     12.4  Matches
13  14         Bernd Leno de GER  GK  ...     1     0.33     13.1  Matches
14  15        Hugo Lloris fr FRA  GK  ...     5     0.42     14.1  Matches
15  16  Emiliano Martínez ar ARG  GK  ...    12     1.09     16.4  Matches
16  17      Alex McCarthy eng ENG GK  ...     2     0.17     12.9  Matches
17  18      Edouard Mendy sn SEN  GK  ...     4     0.36     13.3  Matches
18  19      Illan Meslier fr FRA  GK  ...    15     1.25     16.3  Matches
19  20    Jordan Pickford eng ENG GK  ...     7     0.64     13.6  Matches
20  21          Nick Pope eng ENG GK  ...    17     1.42     16.6  Matches
21  22     Aaron Ramsdale eng ENG GK  ...     5     0.56     14.9  Matches
22  23         David Raya es ESP  GK  ...    13     1.44     15.7  Matches
23  24            José Sá pt POR  GK  ...    10     0.83     15.5  Matches
24  25     Robert Sánchez es ESP  GK  ...    13     1.18     15.4  Matches
25  26  Kasper Schmeichel dk DEN  GK  ...     7     0.58     13.0  Matches
26  27       Jason Steele eng ENG GK  ...     0     0.00     13.0  Matches
27  28          Jed Steer eng ENG  GK  ...     1     1.00     14.3  Matches
28  29       Zack Steffen us USA  GK  ...     2     2.00     17.8  Matches
29  30    Freddie Woodman eng ENG GK  ...     0     0.00     11.6  Matches

[30 rows x 34 columns]
I wish to optimize this code in a Pythonic way using lambda and pandas
I have the following dataframe:

    TEST_NUMBER     D1   D10   D50    D90     D99  Q3_15  Q3_10  l-Q3_63     RPM PRODUCT_NO
0            11   4.77  12.7  34.9   93.7  213.90  13.74   5.98    21.44     0.0     BQ0066
1            21   4.43  10.8  25.2   39.8   49.73  20.04   8.45     0.10   953.5     BQ0066
2            22   4.52  11.3  27.7   48.0   60.51  17.58   7.50     0.58   904.2     BQ0066
3            23   4.67  11.5  27.2   44.8   56.64  17.49   7.24     0.25   945.2     BQ0066
4            24   4.41  10.9  26.8   44.5   57.84  18.95   8.31     0.54   964.2     BQ0066
5            25  28.88  47.3  71.8  140.0  249.40   0.26   0.12    63.42   964.2     BQ0066
6            31  16.92  23.1  34.3   48.4   92.41   0.51   0.13     1.78  1694.5     BQ0066
7            32  16.35  22.2  33.0   45.8   59.14   0.53   0.11     0.64  1735.4     BQ0066
8            33  16.42  21.9  32.6   45.9   56.91   0.51   0.10     0.36  1754.4     BQ0066
9            34   3.47   7.3  14.1   20.7   26.52  56.59  23.71     0.00  1754.4     BQ0066
10           11   5.16  14.2  38.6  123.4  263.80  11.03   4.82    26.90     0.0     BQ0067
11           21   4.72  11.6  27.5   44.5   54.91  17.05   7.05     0.20   964.0     BQ0067
12           22   4.48  11.2  26.4   42.4   52.22  18.38   7.68     0.12   983.5     BQ0067
13           23    NaN   NaN   NaN    NaN     NaN    NaN    NaN      NaN   983.5     BQ0067
14           31  14.80  22.4  33.2   45.5   58.11   1.05   0.36     0.56  1753.4     BQ0067
15           32  16.30  22.1  32.1   44.7   55.12   0.57   0.13     0.23  1773.8     BQ0067
16           33   3.44   7.2  14.0   21.0   26.34  56.72  24.69     0.00  1773.8     BQ0067
17           11    NaN   NaN   NaN    NaN     NaN    NaN    NaN      NaN     0.0     BQ0068
18           21   4.83  11.9  28.1   44.2   54.63  16.76   6.70     0.19   953.7     BQ0068
19           22   4.40  10.7  26.3   43.4   57.55  19.85   8.59     0.53   974.9     BQ0068
20           23  17.61  43.8  67.9  122.6  221.20   0.75   0.33    58.27   974.9     BQ0068
21           31  15.09  22.3  33.3   45.6   59.45   0.98   0.38     0.73  1773.7     BQ0068

I wish to do the following things:

Steps:
1. Whenever TEST_NUMBER 11 has NaN (null) values, I need to remove all rows of that particular PRODUCT_NO. For example, in the given dataframe, PRODUCT_NO BQ0068 has TEST_NUMBER 11 with NaN values, hence all rows of BQ0068 should be removed.
2. If any TEST_NUMBER other than TEST_NUMBER 11 has NaN values, then only that particular TEST_NUMBER's row should be removed. For example, PRODUCT_NO BQ0067 has a row of TEST_NUMBER 23 with NaN values. Hence only that particular row of TEST_NUMBER 23 should be removed.

After doing the above steps, I need to do the computation. For example, for PRODUCT_NO BQ0066 I need to compute the differences between rows in the following way: TEST_NUMBER 21 - TEST_NUMBER 11, TEST_NUMBER 22 - TEST_NUMBER 11, TEST_NUMBER 23 - TEST_NUMBER 11, TEST_NUMBER 24 - TEST_NUMBER 11, TEST_NUMBER 25 - TEST_NUMBER 11. And then TEST_NUMBER 31 - TEST_NUMBER 25, TEST_NUMBER 32 - TEST_NUMBER 25, TEST_NUMBER 33 - TEST_NUMBER 25, TEST_NUMBER 34 - TEST_NUMBER 25. The same procedure carries on for successive PRODUCT_NOs. As you can see, the TEST_NUMBER frequency is different for each PRODUCT_NO, but in all cases every PRODUCT_NO will have only one TEST_NUMBER 11, and the other TEST_NUMBERs will be in the ranges 21 to 29 (i.e. 21, 22, 23, 24, 25, 26, 27, 28, 29) and 31 to 39 (i.e. 31, 32, 33, 34, 35, 36, 37, 38, 39).

PYTHON CODE

def pick_closest_sample(sample_list, sample_no):
    sample_list = sorted(sample_list)
    buffer = []
    for number in sample_list:
        if sample_no // 10 == number // 10:
            break
        else:
            buffer.append(number)
    if len(buffer) > 0:
        return buffer[-1]
    return sample_no

def add_closest_sample_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        if subset.iloc[0].isnull().sum() == 0:
            subset.dropna(inplace=True)
            sample_list = subset['TEST_NUMBER'].to_list()
            subset['target_sample'] = subset['TEST_NUMBER'].apply(lambda x: pick_closest_sample(sample_list, x))
            out.append(subset)
    if len(out) > 0:
        out = pd.concat(out)
        out.dropna(inplace=True)
    return out

Output of the above two functions:

    TEST_NUMBER     D1   D10   D50    D90     D99  Q3_15  Q3_10  l-Q3_63     RPM PRODUCT_NO  target_sample
0            11   4.77  12.7  34.9   93.7  213.90  13.74   5.98    21.44     0.0     BQ0066             11
1            21   4.43  10.8  25.2   39.8   49.73  20.04   8.45     0.10   953.5     BQ0066             11
2            22   4.52  11.3  27.7   48.0   60.51  17.58   7.50     0.58   904.2     BQ0066             11
3            23   4.67  11.5  27.2   44.8   56.64  17.49   7.24     0.25   945.2     BQ0066             11
4            24   4.41  10.9  26.8   44.5   57.84  18.95   8.31     0.54   964.2     BQ0066             11
5            25  28.88  47.3  71.8  140.0  249.40   0.26   0.12    63.42   964.2     BQ0066             11
6            31  16.92  23.1  34.3   48.4   92.41   0.51   0.13     1.78  1694.5     BQ0066             25
7            32  16.35  22.2  33.0   45.8   59.14   0.53   0.11     0.64  1735.4     BQ0066             25
8            33  16.42  21.9  32.6   45.9   56.91   0.51   0.10     0.36  1754.4     BQ0066             25
9            34   3.47   7.3  14.1   20.7   26.52  56.59  23.71     0.00  1754.4     BQ0066             25
10           11   5.16  14.2  38.6  123.4  263.80  11.03   4.82    26.90     0.0     BQ0067             11
11           21   4.72  11.6  27.5   44.5   54.91  17.05   7.05     0.20   964.0     BQ0067             11
12           22   4.48  11.2  26.4   42.4   52.22  18.38   7.68     0.12   983.5     BQ0067             11
14           31  14.80  22.4  33.2   45.5   58.11   1.05   0.36     0.56  1753.4     BQ0067             22
15           32  16.30  22.1  32.1   44.7   55.12   0.57   0.13     0.23  1773.8     BQ0067             22
16           33   3.44   7.2  14.0   21.0   26.34  56.72  24.69     0.00  1773.8     BQ0067             22

As you can see, all rows of PRODUCT_NO BQ0068 are removed because TEST_NUMBER 11 had NaN values. Also, only the row of TEST_NUMBER 23 of PRODUCT_NO BQ0067 is removed because it had NaN values. So the requirements of the first two steps are met. Now the computation for PRODUCT_NO BQ0067 will be TEST_NUMBER 31 - TEST_NUMBER 22, TEST_NUMBER 32 - TEST_NUMBER 22, TEST_NUMBER 33 - TEST_NUMBER 22.

PYTHON CODE

def compute_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        target_list = list(subset['target_sample'].unique())
        for target in target_list:
            target_df = subset[subset['target_sample'] == target]
            target_subset = [subset[subset['TEST_NUMBER'] == target]] * len(target_df)
            target_subset = pd.concat(target_subset)
            if len(target_subset) > 0:
                target_subset.index = target_df.index
                diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
                for col in diff_cols:
                    target_df[col + '_diff'] = target_df[col] - target_subset[col]
                out.append(target_df)
    if len(out) > 0:
        out = pd.concat(out)
    return out

Output of the above function:

    TEST_NUMBER     D1   D10   D50    D90     D99  Q3_15  Q3_10  l-Q3_63     RPM  ...  target_sample  D1_diff  D10_diff  D50_diff  D90_diff  D99_diff  Q3_15_diff  Q3_10_diff  l-Q3_63_diff  RPM_diff
0            11   4.77  12.7  34.9   93.7  213.90  13.74   5.98    21.44     0.0  ...             11     0.00       0.0       0.0       0.0      0.00        0.00        0.00          0.00       0.0
1            21   4.43  10.8  25.2   39.8   49.73  20.04   8.45     0.10   953.5  ...             11    -0.34      -1.9      -9.7     -53.9   -164.17        6.30        2.47        -21.34     953.5
2            22   4.52  11.3  27.7   48.0   60.51  17.58   7.50     0.58   904.2  ...             11    -0.25      -1.4      -7.2     -45.7   -153.39        3.84        1.52        -20.86     904.2
3            23   4.67  11.5  27.2   44.8   56.64  17.49   7.24     0.25   945.2  ...             11    -0.10      -1.2      -7.7     -48.9   -157.26        3.75        1.26        -21.19     945.2
4            24   4.41  10.9  26.8   44.5   57.84  18.95   8.31     0.54   964.2  ...             11    -0.36      -1.8      -8.1     -49.2   -156.06        5.21        2.33        -20.90     964.2
5            25  28.88  47.3  71.8  140.0  249.40   0.26   0.12    63.42   964.2  ...             11    24.11      34.6      36.9      46.3     35.50      -13.48       -5.86         41.98     964.2
6            31  16.92  23.1  34.3   48.4   92.41   0.51   0.13     1.78  1694.5  ...             25   -11.96     -24.2     -37.5     -91.6   -156.99        0.25        0.01        -61.64     730.3
7            32  16.35  22.2  33.0   45.8   59.14   0.53   0.11     0.64  1735.4  ...             25   -12.53     -25.1     -38.8     -94.2   -190.26        0.27       -0.01        -62.78     771.2
8            33  16.42  21.9  32.6   45.9   56.91   0.51   0.10     0.36  1754.4  ...             25   -12.46     -25.4     -39.2     -94.1   -192.49        0.25       -0.02        -63.06     790.2
9            34   3.47   7.3  14.1   20.7   26.52  56.59  23.71     0.00  1754.4  ...             25   -25.41     -40.0     -57.7    -119.3   -222.88       56.33       23.59        -63.42     790.2
10           11   5.16  14.2  38.6  123.4  263.80  11.03   4.82    26.90     0.0  ...             11     0.00       0.0       0.0       0.0      0.00        0.00        0.00          0.00       0.0
11           21   4.72  11.6  27.5   44.5   54.91  17.05   7.05     0.20   964.0  ...             11    -0.44      -2.6     -11.1     -78.9   -208.89        6.02        2.23        -26.70     964.0
12           22   4.48  11.2  26.4   42.4   52.22  18.38   7.68     0.12   983.5  ...             11    -0.68      -3.0     -12.2     -81.0   -211.58        7.35        2.86        -26.78     983.5
14           31  14.80  22.4  33.2   45.5   58.11   1.05   0.36     0.56  1753.4  ...             22    10.32      11.2       6.8       3.1      5.89      -17.33       -7.32          0.44     769.9
15           32  16.30  22.1  32.1   44.7   55.12   0.57   0.13     0.23  1773.8  ...             22    11.82      10.9       5.7       2.3      2.90      -17.81       -7.55          0.11     790.3
16           33   3.44   7.2  14.0   21.0   26.34  56.72  24.69     0.00  1773.8  ...             22    -1.04      -4.0     -12.4     -21.4    -25.88       38.34       17.01         -0.12     790.3

Kindly help me optimize the three functions I posted, so I can write them in a more Pythonic way.
Points 1. and 2. can be achieved in a few lines with pandas functions. You can then calculate "target_sample" and your diff columns in the same loop using groupby:

```python
# 1. Whenever TEST_NUMBER == 11 has D1 value NaN, remove all rows with this PRODUCT_NO
drop_prod_no = df[(df.TEST_NUMBER == 11) & (df.D1.isna())]["PRODUCT_NO"]
df.drop(df[df.PRODUCT_NO.isin(drop_prod_no)].index, axis=0, inplace=True)

# 2. Drop remaining rows with NaN values
df.dropna(inplace=True)

# 3. Set column "target_sample" and calculate diffs
new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    closest_sample = last_sample = 11
    for index, row in subset.iterrows():
        if row.TEST_NUMBER // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        subset.at[index, "target_sample"] = closest_sample
        last_sample = row.TEST_NUMBER
        for col in diff_cols:
            subset.at[index, col + "_diff"] = subset.at[index, col] - float(subset[subset.TEST_NUMBER == closest_sample][col])
    new_df = pd.concat([new_df, subset])
print(new_df)
```

Output:

```
    TEST_NUMBER     D1   D10   D50    D90     D99  Q3_15  Q3_10  l-Q3_63  ...  D1_diff  D10_diff  D50_diff  D90_diff  D99_diff  Q3_15_diff  Q3_10_diff  l-Q3_63_diff  RPM_diff
0            11   4.77  12.7  34.9   93.7  213.90  13.74   5.98    21.44  ...     0.00       0.0       0.0       0.0      0.00        0.00        0.00          0.00       0.0
1            21   4.43  10.8  25.2   39.8   49.73  20.04   8.45     0.10  ...    -0.34      -1.9      -9.7     -53.9   -164.17        6.30        2.47        -21.34     953.5
2            22   4.52  11.3  27.7   48.0   60.51  17.58   7.50     0.58  ...    -0.25      -1.4      -7.2     -45.7   -153.39        3.84        1.52        -20.86     904.2
3            23   4.67  11.5  27.2   44.8   56.64  17.49   7.24     0.25  ...    -0.10      -1.2      -7.7     -48.9   -157.26        3.75        1.26        -21.19     945.2
4            24   4.41  10.9  26.8   44.5   57.84  18.95   8.31     0.54  ...    -0.36      -1.8      -8.1     -49.2   -156.06        5.21        2.33        -20.90     964.2
5            25  28.88  47.3  71.8  140.0  249.40   0.26   0.12    63.42  ...    24.11      34.6      36.9      46.3     35.50      -13.48       -5.86         41.98     964.2
6            31  16.92  23.1  34.3   48.4   92.41   0.51   0.13     1.78  ...   -11.96     -24.2     -37.5     -91.6   -156.99        0.25        0.01        -61.64     730.3
7            32  16.35  22.2  33.0   45.8   59.14   0.53   0.11     0.64  ...   -12.53     -25.1     -38.8     -94.2   -190.26        0.27       -0.01        -62.78     771.2
8            33  16.42  21.9  32.6   45.9   56.91   0.51   0.10     0.36  ...   -12.46     -25.4     -39.2     -94.1   -192.49        0.25       -0.02        -63.06     790.2
9            34   3.47   7.3  14.1   20.7   26.52  56.59  23.71     0.00  ...   -25.41     -40.0     -57.7    -119.3   -222.88       56.33       23.59        -63.42     790.2
10           11   5.16  14.2  38.6  123.4  263.80  11.03   4.82    26.90  ...     0.00       0.0       0.0       0.0      0.00        0.00        0.00          0.00       0.0
11           21   4.72  11.6  27.5   44.5   54.91  17.05   7.05     0.20  ...    -0.44      -2.6     -11.1     -78.9   -208.89        6.02        2.23        -26.70     964.0
12           22   4.48  11.2  26.4   42.4   52.22  18.38   7.68     0.12  ...    -0.68      -3.0     -12.2     -81.0   -211.58        7.35        2.86        -26.78     983.5
14           31  14.80  22.4  33.2   45.5   58.11   1.05   0.36     0.56  ...    10.32      11.2       6.8       3.1      5.89      -17.33       -7.32          0.44     769.9
15           32  16.30  22.1  32.1   44.7   55.12   0.57   0.13     0.23  ...    11.82      10.9       5.7       2.3      2.90      -17.81       -7.55          0.11     790.3
16           33   3.44   7.2  14.0   21.0   26.34  56.72  24.69     0.00  ...    -1.04      -4.0     -12.4     -21.4    -25.88       38.34       17.01         -0.12     790.3
```

Edit: you can avoid using iterrows by applying lambda functions like you did:

```python
# 3. Set column "target_sample" and calculate diffs
def get_closest_sample(samples, test_no):
    closest_sample = last_sample = 11
    for smpl in samples:
        if smpl // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        if smpl == test_no:
            break
        last_sample = smpl
    return closest_sample

new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    sample_list = list(subset["TEST_NUMBER"])
    subset["target_sample"] = subset["TEST_NUMBER"].apply(lambda x: get_closest_sample(sample_list, x))
    for col in diff_cols:
        subset[col + "_diff"] = subset.apply(lambda row: row[col] - float(subset[subset.TEST_NUMBER == row["target_sample"]][col]), axis=1)
    new_df = pd.concat([new_df, subset])
print(new_df)
```
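To make the reference-sample rule concrete, here is the `get_closest_sample` helper from the edit run on a toy, made-up list of test numbers: tests in the 20s are diffed against test 11, and tests in the 30s against the last test of the 20s block.

```python
def get_closest_sample(samples, test_no):
    # Walk the ordered test numbers; when a test's "decade" exceeds the
    # current reference's decade by more than one (e.g. 25 -> 31 while the
    # reference is still 11), the previously seen test becomes the new
    # reference sample.
    closest_sample = last_sample = 11
    for smpl in samples:
        if smpl // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        if smpl == test_no:
            break
        last_sample = smpl
    return closest_sample

samples = [11, 21, 22, 23, 24, 25, 31, 32, 33, 34]
print([get_closest_sample(samples, t) for t in samples])
# → [11, 11, 11, 11, 11, 11, 25, 25, 25, 25]
```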
invalid literal for float():
I'm new to Python, so maybe there is something really basic here I'm missing, but I can't figure it out. For my work I'm trying to read a txt file and apply KNN on it. The file content is as follows; it has three columns, with the third one as the class, and the separator is a space. 0.85 17.45 2 0.75 15.6 2 3.3 15.45 2 5.25 14.2 2 4.9 15.65 2 5.35 15.85 2 5.1 17.9 2 4.6 18.25 2 4.05 18.75 2 3.4 19.7 2 2.9 21.15 2 3.1 21.85 2 3.9 21.85 2 4.4 20.05 2 7.2 14.5 2 7.65 16.5 2 7.1 18.65 2 7.05 19.9 2 5.85 20.55 2 5.5 21.8 2 6.55 21.8 2 6.05 22.3 2 5.2 23.4 2 4.55 23.9 2 5.1 24.4 2 8.1 26.35 2 10.15 27.7 2 9.75 25.5 2 9.2 21.1 2 11.2 22.8 2 12.6 23.1 2 13.25 23.5 2 11.65 26.85 2 12.45 27.55 2 13.3 27.85 2 13.7 27.75 2 14.15 26.9 2 14.05 26.55 2 15.15 24.2 2 15.2 24.75 2 12.2 20.9 2 12.15 21.45 2 12.75 22.05 2 13.15 21.85 2 13.75 22 2 13.95 22.7 2 14.4 22.65 2 14.2 22.15 2 14.1 21.75 2 14.05 21.4 2 17.2 24.8 2 17.7 24.85 2 17.55 25.2 2 17 26.85 2 16.55 27.1 2 19.15 25.35 2 18.8 24.7 2 21.4 25.85 2 15.8 21.35 2 16.6 21.15 2 17.45 20.75 2 18 20.95 2 18.25 20.2 2 18 22.3 2 18.6 22.25 2 19.2 21.95 2 19.45 22.1 2 20.1 21.6 2 20.1 20.9 2 19.9 20.35 2 19.45 19.05 2 19.25 18.7 2 21.3 22.3 2 22.9 23.65 2 23.15 24.1 2 24.25 22.85 2 22.05 20.25 2 20.95 18.25 2 21.65 17.25 2 21.55 16.7 2 21.6 16.3 2 21.5 15.5 2 22.4 16.5 2 22.25 18.1 2 23.15 19.05 2 23.5 19.8 2 23.75 20.2 2 25.15 19.8 2 25.5 19.45 2 23 18 2 23.95 17.75 2 25.9 17.55 2 27.65 15.65 2 23.1 14.6 2 23.5 15.2 2 24.05 14.9 2 24.5 14.7 2 14.15 17.35 1 14.3 16.8 1 14.3 15.75 1 14.75 15.1 1 15.35 15.5 1 15.95 16.45 1 16.5 17.05 1 17.35 17.05 1 17.15 16.3 1 16.65 16.1 1 16.5 15.15 1 16.25 14.95 1 16 14.25 1 15.9 13.2 1 15.15 12.05 1 15.2 11.7 1 17 15.65 1 16.9 15.35 1 17.35 15.45 1 17.15 15.1 1 17.3 14.9 1 17.7 15 1 17 14.6 1 16.85 14.3 1 16.6 14.05 1 17.1 14 1 17.45 14.15 1 17.8 14.2 1 17.6 13.85 1 17.2 13.5 1 17.25 13.15 1 17.1 12.75 1 16.95 12.35 1 16.5 12.2 1 16.25 12.5 1 16.05 11.9 1 16.65 10.9 1 16.7 11.4 1 16.95 11.25 1 
17.3 11.2 1 18.05 11.9 1 18.6 12.5 1 18.9 12.05 1 18.7 11.25 1 17.95 10.9 1 18.4 10.05 1 17.45 10.4 1 17.6 10.15 1 17.7 9.85 1 17.3 9.7 1 16.95 9.7 1 16.75 9.65 1 19.8 9.95 1 19.1 9.55 1 17.5 8.3 1 17.55 8.1 1 17.85 7.55 1 18.2 8.35 1 19.3 9.1 1 19.4 8.85 1 19.05 8.85 1 18.9 8.5 1 18.6 7.85 1 18.7 7.65 1 19.35 8.2 1 19.95 8.3 1 20 8.9 1 20.3 8.9 1 20.55 8.8 1 18.35 6.95 1 18.65 6.9 1 19.3 7 1 19.1 6.85 1 19.15 6.65 1 21.2 8.8 1 21.4 8.8 1 21.1 8 1 20.4 7 1 20.5 6.35 1 20.1 6.05 1 20.45 5.15 1 20.95 5.55 1 20.95 6.2 1 20.9 6.6 1 21.05 7 1 21.85 8.5 1 21.9 8.2 1 22.3 7.7 1 21.85 6.65 1 21.3 5.05 1 22.6 6.7 1 22.5 6.15 1 23.65 7.2 1 24.1 7 1 21.95 4.8 1 22.15 5.05 1 22.45 5.3 1 22.45 4.9 1 22.7 5.5 1 23 5.6 1 23.2 5.3 1 23.45 5.95 1 23.75 5.95 1 24.45 6.15 1 24.6 6.45 1 25.2 6.55 1 26.05 6.4 1 25.3 5.75 1 24.35 5.35 1 23.3 4.9 1 22.95 4.75 1 22.4 4.55 1 22.8 4.1 1 22.9 4 1 23.25 3.85 1 23.45 3.6 1 23.55 4.2 1 23.8 3.65 1 23.8 4.75 1 24.2 4 1 24.55 4 1 24.7 3.85 1 24.7 4.3 1 24.9 4.75 1 26.4 5.7 1 27.15 5.95 1 27.3 5.45 1 27.5 5.45 1 27.55 5.1 1 26.85 4.95 1 26.6 4.9 1 26.85 4.4 1 26.2 4.4 1 26 4.25 1 25.15 4.1 1 25.6 3.9 1 25.85 3.6 1 24.95 3.35 1 25.1 3.25 1 25.45 3.15 1 26.85 2.95 1 27.15 3.15 1 27.2 3 1 27.95 3.25 1 27.95 3.5 1 28.8 4.05 1 28.8 4.7 1 28.75 5.45 1 28.6 5.75 1 29.25 6.3 1 30 6.55 1 30.6 3.4 1 30.05 3.45 1 29.75 3.45 1 29.2 4 1 29.45 4.05 1 29.05 4.55 1 29.4 4.85 1 29.5 4.7 1 29.9 4.45 1 30.75 4.45 1 30.4 4.05 1 30.8 3.95 1 31.05 3.95 1 30.9 5.2 1 30.65 5.85 1 30.7 6.15 1 31.5 6.25 1 31.65 6.55 1 32 7 1 32.5 7.95 1 33.35 7.45 1 32.6 6.95 1 32.65 6.6 1 32.55 6.35 1 32.35 6.1 1 32.55 5.8 1 32.2 5.05 1 32.35 4.25 1 32.9 4.15 1 32.7 4.6 1 32.75 4.85 1 34.1 4.6 1 34.1 5 1 33.6 5.25 1 33.35 5.65 1 33.75 5.95 1 33.4 6.2 1 34.45 5.8 1 34.65 5.65 1 34.65 6.25 1 35.25 6.25 1 34.35 6.8 1 34.1 7.15 1 34.45 7.3 1 34.7 7.2 1 34.85 7 1 34.35 7.75 1 34.55 7.85 1 35.05 8 1 35.5 8.05 1 35.8 7.1 1 36.6 6.7 1 36.75 7.25 1 36.5 7.4 1 35.95 7.9 1 36.1 8.1 1 36.15 8.4 1 
37.6 7.35 1 37.9 7.65 1 29.15 4.4 1 34.9 9 1 35.3 9.4 1 35.9 9.35 1 36 9.65 1 35.75 10 1 36.7 9.15 1 36.6 9.8 1 36.9 9.75 1 37.25 10.15 1 36.4 10.15 1 36.3 10.7 1 36.75 10.85 1 38.15 9.7 1 38.4 9.45 1 38.35 10.5 1 37.7 10.8 1 37.45 11.15 1 37.35 11.4 1 37 11.75 1 36.8 12.2 1 37.15 12.55 1 37.25 12.15 1 37.65 11.95 1 37.95 11.85 1 38.6 11.75 1 38.5 12.2 1 38 12.95 1 37.3 13 1 37.5 13.4 1 37.85 14.5 1 38.3 14.6 1 38.05 14.45 1 38.35 14.35 1 38.5 14.25 1 39.3 14.2 1 39 13.2 1 38.95 12.9 1 39.2 12.35 1 39.5 11.8 1 39.55 12.3 1 39.75 12.75 1 40.2 12.8 1 40.4 12.05 1 40.45 12.5 1 40.55 13.15 1 40.45 14.5 1 40.2 14.8 1 40.65 14.9 1 40.6 15.25 1 41.3 15.3 1 40.95 15.7 1 41.25 16.8 1 40.95 17.05 1 40.7 16.45 1 40.45 16.3 1 39.9 16.2 1 39.65 16.2 1 39.25 15.5 1 38.85 15.5 1 38.3 16.5 1 38.75 16.85 1 39 16.6 1 38.25 17.35 1 39.5 16.95 1 39.9 17.05 1

My code:

```python
import csv
import random
import math
import operator


def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(3):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])


def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)


def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors


def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]


def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0


def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset('Jain.txt', split, trainingSet, testSet)
    print 'Train set: ' + repr(len(trainingSet))
    print 'Test set: ' + repr(len(testSet))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
```
Here:

```python
lines = csv.reader(csvfile)
```

You have to tell csv.reader what separator to use - else it will use the default Excel ',' separator. Note that in the example you posted, the separator might actually NOT be "a space", but either a tab ("\t" in Python) or just a random number of spaces - in which case it's not a csv-like format and you'll have to parse the lines yourself.

Also, your code is far from Pythonic. First things first: Python's for loops are really "for each" loops, i.e. they directly yield values from the object you iterate on. The proper way to iterate over a list is:

```python
lst = ["a", "b", "c"]
for item in lst:
    print(item)
```

so there's no need for range() and indexed access here. Note that if you want the index too, you can use enumerate(sequence), which yields (index, item) pairs, i.e.:

```python
lst = ["a", "b", "c"]
for index, item in enumerate(lst):
    print("item at {} is {}".format(index, item))
```

So your loadDataset() function could be rewritten as:

```python
def loadDataset(filename, split, trainingSet=None, testSet=None):
    # fix the mutable default argument gotcha
    # cf https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
    if trainingSet is None:
        trainingSet = []
    if testSet is None:
        testSet = []
    with open(filename, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter="\t")
        for row in reader:
            row = tuple(float(x) for x in row)
            if random.random() < split:
                trainingSet.append(row)
            else:
                testSet.append(row)
    # so the caller can get the values back
    return trainingSet, testSet
```

Note that if any value in your file is not a proper representation of a float, you'll still get a ValueError in `row = tuple(float(x) for x in row)`. The solution here is to catch the error and handle it one way or another - either by reraising it with additional debugging info (which value is wrong and which line of the file it belongs to), or by logging the error and ignoring the row, or however it makes sense in the context of your app / lib:

```python
for row in reader:
    try:
        row = tuple(float(x) for x in row)
    except ValueError:
        # here we choose to just log the error
        # and ignore the row, but you may want
        # to do otherwise, your choice...
        print("wrong value in line {}: {}".format(reader.line_num, row))
        continue
    if random.random() < split:
        trainingSet.append(row)
    else:
        testSet.append(row)
```

Also, if you want to iterate over two lists in parallel (getting list1[x], list2[x] pairs), you can use zip():

```python
lst1 = ["a", "b", "c"]
lst2 = ["x", "y", "z"]
for pair in zip(lst1, lst2):
    print(pair)
```

and there is a sum() built-in to sum values from an iterable, i.e.:

```python
lst = [1, 2, 3]
print(sum(lst))
```

so your euclideanDistance function can be rewritten as:

```python
def euclideanDistance(instance1, instance2, length):
    pairs = zip(instance1[:length], instance2[:length])
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))
```

etc., etc.
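As a quick sanity check, the zip/sum version of euclideanDistance can be exercised on a classic 3-4-5 right triangle (a minimal, self-contained sketch; the trailing element of each tuple stands in for the class label that `length` excludes):

```python
import math

def euclideanDistance(instance1, instance2, length):
    # pair up the first `length` coordinates and sum the squared differences
    pairs = zip(instance1[:length], instance2[:length])
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

# 3-4-5 triangle: the last element (the class label) is ignored
print(euclideanDistance((0.0, 0.0, 2), (3.0, 4.0, 1), 2))  # → 5.0
```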
Trouble preparing data for contourf plot [duplicate]
This question already has an answer here: Plotting Isolines/contours in matplotlib from (x, y, z) data set (1 answer). Closed 4 years ago.

I would like to prepare some data for a contourf plot in matplotlib. I would like to do something like the following:

```python
x = np.arange(0, 10, 0.5)
y = np.arange(0, 10, 0.5)
z = x**2 - y

fig, ax = plt.subplots()
cs = ax.contourf(x, y, z)
```

But this raises the following error:

```
TypeError: Input z must be a 2D array.
```

Can someone recommend a way for me to massage my data to make contourf happy? Ideally, if someone could also explain why the format of my data doesn't work, that would be greatly helpful as well.

Note: the actual data I'm using is read from a data file; it is the same as above, except x, y, z are replaced with c_a, a, Energy respectively.

```
c_a
0      1.60
1      1.61
2      1.62
3      1.63
4      1.64
5      1.65
6      1.66
7      1.67
8      1.68
9      1.69
10     1.70
11     1.60
12     1.61
13     1.62
14     1.63
15     1.64
16     1.65
17     1.66
18     1.67
19     1.68
20     1.69
21     1.70
22     1.60
23     1.61
24     1.62
25     1.63
26     1.64
27     1.65
28     1.66
29     1.67
...
91     1.63
92     1.64
93     1.65
94     1.66
95     1.67
96     1.68
97     1.69
98     1.70
99     1.60
100    1.61
101    1.62
102    1.63
103    1.64
104    1.65
105    1.66
106    1.67
107    1.68
108    1.69
109    1.70
110    1.60
111    1.61
112    1.62
113    1.63
114    1.64
115    1.65
116    1.66
117    1.67
118    1.68
119    1.69
120    1.70
Name: c_a, Length: 121, dtype: float64

a
0      6.00
1      6.00
2      6.00
3      6.00
4      6.00
5      6.00
6      6.00
7      6.00
8      6.00
9      6.00
10     6.00
11     6.01
12     6.01
13     6.01
14     6.01
15     6.01
16     6.01
17     6.01
18     6.01
19     6.01
20     6.01
21     6.01
22     6.02
23     6.02
24     6.02
25     6.02
26     6.02
27     6.02
28     6.02
29     6.02
...
91     6.08
92     6.08
93     6.08
94     6.08
95     6.08
96     6.08
97     6.08
98     6.08
99     6.09
100    6.09
101    6.09
102    6.09
103    6.09
104    6.09
105    6.09
106    6.09
107    6.09
108    6.09
109    6.09
110    6.10
111    6.10
112    6.10
113    6.10
114    6.10
115    6.10
116    6.10
117    6.10
118    6.10
119    6.10
120    6.10
Name: a, Length: 121, dtype: float64

Energy
0     -250.647503
1     -250.647661
2     -250.647758
3     -250.647791
4     -250.647762
5     -250.647668
6     -250.647511
7     -250.647297
8     -250.647031
9     -250.646721
10    -250.646378
11    -250.647624
12    -250.647758
13    -250.647831
14    -250.647841
15    -250.647788
16    -250.647671
17    -250.647493
18    -250.647258
19    -250.646972
20    -250.646644
21    -250.646282
22    -250.647726
23    -250.647835
24    -250.647884
25    -250.647871
26    -250.647794
27    -250.647655
28    -250.647456
29    -250.647200
...
91    -250.647657
92    -250.647449
93    -250.647182
94    -250.646860
95    -250.646488
96    -250.646071
97    -250.645620
98    -250.645140
99    -250.647896
100   -250.647841
101   -250.647729
102   -250.647559
103   -250.647330
104   -250.647043
105   -250.646702
106   -250.646310
107   -250.645876
108   -250.645407
109   -250.644912
110   -250.647847
111   -250.647769
112   -250.647635
113   -250.647444
114   -250.647193
115   -250.646887
116   -250.646526
117   -250.646116
118   -250.645665
119   -250.645180
120   -250.644669
Name: Energy, Length: 121, dtype: float64
```
x and y also need to be 2D (see meshgrid):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10, 0.5)
y = np.arange(0, 10, 0.5)
X, Y = np.meshgrid(x, y)
Z = X**2 - Y

fig, ax = plt.subplots()
cs = ax.contourf(X, Y, Z)
plt.show()
```

Output:

[contour plot]

Edit: if your input is in a one-dimensional array, you need to know how to reshape it. Maybe it's sqrt(length) = 11 in your case? (Assuming your domain is square.)

```python
import numpy as np
import matplotlib.pyplot as plt

c_a = np.array([np.linspace(1.6, 1.7, num=11) for _ in range(11)]).flatten()
a = np.array([[1.6 + i * 0.1 for j in range(11)] for i in range(11)]).flatten()
energy = np.array([-250.644 - np.random.random() * 0.003 for i in range(121)])

x = c_a.reshape((11, 11))
y = a.reshape((11, 11))
z = energy.reshape((11, 11))

fig, ax = plt.subplots()
cs = ax.contourf(x, y, z)
plt.show()
```

Output:

[contour plot]
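If the flat columns live in a pandas DataFrame like the one in the question, an alternative to a hand-computed reshape is pivot_table, which rebuilds the 2D grid from the coordinate columns themselves. A sketch on made-up data, assuming the question's column names c_a, a, Energy:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the example runs without a display
import matplotlib.pyplot as plt

# Toy long-format data shaped like the question's three flat columns
# (the Energy values here are made up for illustration).
c_a_vals = np.linspace(1.60, 1.70, 11).round(2)
a_vals = np.linspace(6.00, 6.10, 11).round(2)
df = pd.DataFrame(
    [(c, a, -250.6 - (c - 1.65) ** 2 - (a - 6.05) ** 2)
     for a in a_vals for c in c_a_vals],
    columns=["c_a", "a", "Energy"],
)

# pivot_table turns the flat columns into the 2D grid contourf wants:
# rows indexed by `a`, columns by `c_a`, cell values = Energy.
grid = df.pivot_table(index="a", columns="c_a", values="Energy")
X, Y = np.meshgrid(grid.columns.values, grid.index.values)

fig, ax = plt.subplots()
cs = ax.contourf(X, Y, grid.values)
fig.colorbar(cs)
fig.savefig("energy_contour.png")
```

This also works when the rows are not sorted, since pivot_table groups by the coordinate values rather than relying on row order the way reshape does.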