I have a pandas dataframe with 10000 columns and 3000 rows. I want to call the function below for each column, and this takes over a minute. I would like to reduce that to around 10s; the same code in Rust takes only about 1 second to complete. I know Python will not match Rust's performance, but any reduction is welcome. Is there something I can do to vectorize this or otherwise improve performance?
def extract_cycles_v3(series: pd.Series):
    from collections import deque
    import numpy as np

    points = deque()
    result = []

    def format_output(point1, point2, count):
        x1 = point1
        x2 = point2
        rng = np.abs(x1 - x2)
        mean = 0.5 * (x1 + x2)
        return rng, mean, count

    for point in series:
        points.append(point)

        while len(points) >= 3:
            # Form ranges X and Y from the three most recent points
            x1, x2, x3 = points[-3], points[-2], points[-1]
            X = np.abs(x3 - x2)
            Y = np.abs(x2 - x1)

            if X < Y:
                # Read the next point
                break
            elif len(points) == 3:
                # Y contains the starting point
                # Count Y as one-half cycle and discard the first point
                result.append(format_output(points[0], points[1], 0.5))
                points.popleft()
            else:
                # Count Y as one cycle and discard the peak and the valley of Y
                result.append(format_output(points[-3], points[-2], 1.0))
                last = points.pop()
                points.pop()
                points.pop()
                points.append(last)
    else:
        # Count the remaining ranges as one-half cycles
        while len(points) > 1:
            result.append(format_output(points[0], points[1], 0.5))
            points.popleft()

    return result
Below is the per-line time reported by the profiler:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
9 def extract_cycles_v3(series):
10 5625 200287.0 35.6 0.0 from collections import deque
11 5625 31852.0 5.7 0.0 import numpy as np
12 5625 49075.0 8.7 0.0 points = deque()
13 5625 17245.0 3.1 0.0 result = []
14 5625 32680.0 5.8 0.0 def format_output(point1, point2, count):
15 x1 = point1
16 x2 = point2
17 rng = np.abs(x1 - x2)
18 mean = 0.5 * (x1 + x2)
19 return rng, mean, count
20 9241245 30028829.0 3.2 3.8 for point in series:
21 9241245 30075819.0 3.3 3.8 points.append(point)
22 13715870 49794814.0 3.6 6.3 while len(points) >= 3:
23 # Form ranges X and Y from the three most recent points
24 13715870 55754712.0 4.1 7.1 x1, x2, x3 = points[-3], points[-2], points[-1]
25 13715870 158680523.0 11.6 20.2 X = np.abs(x3 - x2)
26 13715870 143759753.0 10.5 18.3 Y = np.abs(x2 - x1)
27 9096216 25824510.0 2.8 3.3 if X < Y:
28 # Read the next point
29 9096216 20523135.0 2.3 2.6 break
30 4577013 15633704.0 3.4 2.0 elif len(points) == 3:
31 # Y contains the starting point
32 # Count Y as one-half cycle and discard the first point
33 42641 1982939.0 46.5 0.3 result.append(format_output(points[0], points[1], 0.5))
34 42641 182248.0 4.3 0.0 points.popleft()
35 else:
36 # Count Y as one cycle and discard the peak and the valley of Y
37 4577013 193884083.0 42.4 24.6 result.append(format_output(points[-3], points[-2], 1.0))
38 4577013 15812058.0 3.5 2.0 last = points.pop()
39 4577013 13912050.0 3.0 1.8 points.pop()
40 4577013 13846706.0 3.0 1.8 points.pop()
41 4577013 14957967.0 3.3 1.9 points.append(last)
42 else:
43 # Count the remaining ranges as one-half cycles
44 38953 155899.0 4.0 0.0 while len(points) > 1:
45 38953 1695531.0 43.5 0.2 result.append(format_output(points[0], points[1], 0.5))
46 38953 157556.0 4.0 0.0 points.popleft()
47 5625 14289.0 2.5 0.0 return result
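From the profile, most of the time goes into the two np.abs() lines and the format_output calls, which I suspect is per-call NumPy dispatch overhead on scalar values rather than the algorithm itself. Below is a variant I am considering that iterates over plain Python floats and uses the built-in abs() instead (a sketch of the same algorithm, not a measured benchmark):

def extract_cycles_v4(series):
    from collections import deque
    points = deque()
    result = []
    # series.to_list() yields plain Python floats instead of NumPy scalars
    for point in series.to_list():
        points.append(point)
        while len(points) >= 3:
            x1, x2, x3 = points[-3], points[-2], points[-1]
            if abs(x3 - x2) < abs(x2 - x1):
                break  # X < Y: read the next point
            elif len(points) == 3:
                # count Y as one-half cycle and discard the first point
                a, b = points[0], points[1]
                result.append((abs(a - b), 0.5 * (a + b), 0.5))
                points.popleft()
            else:
                # count Y as one cycle and discard the peak and valley of Y
                a, b = points[-3], points[-2]
                result.append((abs(a - b), 0.5 * (a + b), 1.0))
                last = points.pop()
                points.pop()
                points.pop()
                points.append(last)
    # count the remaining ranges as one-half cycles
    while len(points) > 1:
        a, b = points[0], points[1]
        result.append((abs(a - b), 0.5 * (a + b), 0.5))
        points.popleft()
    return result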
I have a simple exercise that I am not sure how to do. I have the following data sets:
male100
Year Time
0 1896 12.00
1 1900 11.00
2 1904 11.00
3 1906 11.20
4 1908 10.80
5 1912 10.80
6 1920 10.80
7 1924 10.60
8 1928 10.80
9 1932 10.30
10 1936 10.30
11 1948 10.30
12 1952 10.40
13 1956 10.50
14 1960 10.20
15 1964 10.00
16 1968 9.95
17 1972 10.14
18 1976 10.06
19 1980 10.25
20 1984 9.99
21 1988 9.92
22 1992 9.96
23 1996 9.84
24 2000 9.87
25 2004 9.85
26 2008 9.69
and the second one:
female100
Year Time
0 1928 12.20
1 1932 11.90
2 1936 11.50
3 1948 11.90
4 1952 11.50
5 1956 11.50
6 1960 11.00
7 1964 11.40
8 1968 11.00
9 1972 11.07
10 1976 11.08
11 1980 11.06
12 1984 10.97
13 1988 10.54
14 1992 10.82
15 1996 10.94
16 2000 11.12
17 2004 10.93
18 2008 10.78
I have the following code:
y = -0.014*male100['Year']+38
plt.plot(male100['Year'],y,'r-',color = 'b')
ax = plt.gca() # gca stands for 'get current axis'
ax = male100.plot(x=0,y=1, kind ='scatter', color='g', label="Mens 100m", ax = ax)
female100.plot(x=0,y=1, kind ='scatter', color='r', label="Womens 100m", ax = ax)
Which produces this result:
I need to plot a line that goes exactly between them, so that all of the green points lie below it and all of the red points above it. How do I do so?
I've tried playing with the parameters of y, but to no avail. I also tried fitting a linear regression to male100, female100, and to the two merged across rows, but couldn't get any results.
Any help would be appreciated!
A solution is to use a support vector machine (SVM). You can find the two margins that separate the two classes of points; the average line of the two support-vector margins is then your answer. Note that this only works when the two sets of points are linearly separable.
You can use the following code to see the result:
Data Entry
male = [
(1896 , 12.00),
(1900 , 11.00),
(1904 , 11.00),
(1906 , 11.20),
(1908 , 10.80),
(1912 , 10.80),
(1920 , 10.80),
(1924 , 10.60),
(1928 , 10.80),
(1932 , 10.30),
(1936 , 10.30),
(1948 , 10.30),
(1952 , 10.40),
(1956 , 10.50),
(1960 , 10.20),
(1964 , 10.00),
(1968 , 9.95),
(1972 , 10.14),
(1976 , 10.06),
(1980 , 10.25),
(1984 , 9.99),
(1988 , 9.92),
(1992 , 9.96),
(1996 , 9.84),
(2000 , 9.87),
(2004 , 9.85),
(2008 , 9.69)
]
female = [
(1928, 12.20),
(1932, 11.90),
(1936, 11.50),
(1948, 11.90),
(1952, 11.50),
(1956, 11.50),
(1960, 11.00),
(1964, 11.40),
(1968, 11.00),
(1972, 11.07),
(1976, 11.08),
(1980, 11.06),
(1984, 10.97),
(1988, 10.54),
(1992, 10.82),
(1996, 10.94),
(2000, 11.12),
(2004, 10.93),
(2008, 10.78)
]
Main Code
Note that the value of C is important here. If it is set to 1, you won't get the preferred result.
from sklearn import svm
import numpy as np
import matplotlib.pyplot as plt

X = np.array(male + female)
Y = np.array([0] * len(male) + [1] * len(female))

# fit the model
clf = svm.SVC(kernel='linear', C=1000)  # C is important here
clf.fit(X, Y)

# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-1000, 10000)
yy = a * xx - clf.intercept_[0] / w[1]

plt.figure(figsize=(8, 4))
plt.plot(xx, yy, "k-")  # this is the separator line
plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
            edgecolors="k")
plt.xlim((1890, 2010))
plt.ylim((9, 13))
plt.show()
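To also draw the two margins the answer refers to, here is a small sketch following the standard scikit-learn maximum-margin example (reusing clf, a, xx and yy from above; the geometric margin is 1/||w||). Add these lines before plt.show():

# sketch: dashed lines one margin-width on either side of the separator
margin = 1 / np.sqrt(np.sum(clf.coef_ ** 2))
yy_down = yy - np.sqrt(1 + a ** 2) * margin
yy_up = yy + np.sqrt(1 + a ** 2) * margin
plt.plot(xx, yy_down, "k--")
plt.plot(xx, yy_up, "k--")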
I believe your idea of making use of regression lines is correct - without them, the line would be merely superficial (and impossible to justify if the points overlap, as happens with messy data).
Therefore, using some randomly made data with a known linear relationship, we can do the following:
import random
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
x_values = np.arange(0, 51, 1)
x_points = x_values.reshape(-1, 1)
y_points_1 = [i * 2 + random.randint(5, 30) for i in x_values]
y_points_2 = [i - random.randint(5, 30) for i in x_values]
def regression(x, y):
    model = LinearRegression().fit(x, y)
    y_pred = model.predict(x)
    return y_pred
barrier = [(regression(x=x_points, y=y_points_1)[i] + value) / 2 for i, value in enumerate(regression(x=x_points, y=y_points_2))]
plt.plot(x_points, regression(x=x_points, y=y_points_1))
plt.plot(x_points, regression(x=x_points, y=y_points_2))
plt.plot(x_points, barrier)
plt.scatter(x_values, y_points_1)
plt.scatter(x_values, y_points_2)
plt.grid(True)
plt.show()
Giving us the following plot:
This method also works for an overlap in the data points, so if we change the random data slightly and apply the same process:
x_values = np.arange(0, 51, 1)
y_points_1 = [i * 2 + random.randint(-10, 30) for i in x_values]
y_points_2 = [i - random.randint(-10, 30) for i in x_values]
We get something like the following:
It is important to note that the lists used here are of the same length, so you would need to add some predicted points to the female data after applying regression in order to make use of the line between them. These points would simply lie along the female regression line, with x-values matching those present in the male data.
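A minimal sketch of that last step, assuming the male100 and female100 dataframes from the question:

# sketch: predict female times at the male years so both series share x-values
female_model = LinearRegression().fit(female100[['Year']], female100['Time'])
female_at_male_years = female_model.predict(male100[['Year']])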
Because sklearn might be a bit over the top for a linear fit, and to get rid of the condition that the male and female data must have the same number of points, here is the same approach implemented with numpy.polyfit. It also demonstrates that averaging the two fits does not fully solve the problem.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
#data import
male = pd.read_csv("test1.txt", delim_whitespace=True)
female = pd.read_csv("test2.txt", delim_whitespace=True)
#linear fit of both populations
pmale = np.polyfit(male.Year, male.Time, 1)
pfemale = np.polyfit(female.Year, female.Time, 1)
#more appealing presentation, let's pretend we do not just fit a line
x_fitmin=min(male.Year.min(), female.Year.min())
x_fitmax=max(male.Year.max(), female.Year.max())
x_fit=np.linspace(x_fitmin, x_fitmax, 100)
#create functions for the three fit lines
male_fit = np.poly1d(pmale)
print(male_fit)
female_fit = np.poly1d(pfemale)
print(female_fit)
sep = np.poly1d(np.mean([pmale, pfemale], axis=0))
print(sep)
#plot all markers and lines
ax = male.plot(x="Year", y="Time", c="blue", kind="scatter", label="male")
female.plot(x="Year", y="Time", c="red", kind="scatter", ax=ax, label="female")
ax.plot(x_fit, male_fit(x_fit), c="blue", ls="dotted", label="male fit")
ax.plot(x_fit, female_fit(x_fit), c="red", ls="dotted", label="female fit")
ax.plot(x_fit, sep(x_fit), c="black", ls="dashed", label="separator")
plt.legend()
plt.show()
Sample output:
-0.01333 x + 36.42
-0.01507 x + 40.92
-0.0142 x + 38.67
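A quick check (a sketch reusing male, female and sep from above) prints the points that end up on the wrong side of the separator:

# sketch: male points should lie below the separator line, female points above it
print(male[male.Time >= sep(male.Year)])        # male points on or above the line
print(female[female.Time <= sep(female.Year)])  # female points on or below the line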
And one point is still in the wrong section. However - I find this question so interesting because I expected answers from the sklearn crowd for non-linear data groups. I even installed sklearn in anticipation! If nobody posts a good SVM solution in the next few days, I will set a bounty on this question.
One solution is a geometrical approach. You can find the convex hull of each data class, then find a line that passes between the two hulls. One way to find such a line is to compute the inner tangent line between the two convex hulls using this code, and then rotate it a little.
You can use the following code:
from scipy.spatial import ConvexHull
import numpy as np
import matplotlib.pyplot as plt

male = np.array(male)
female = np.array(female)
hull_male = ConvexHull(male)
hull_female = ConvexHull(female)

plt.plot(male[:,0], male[:,1], 'o')
for simplex in hull_male.simplices:
    plt.plot(male[simplex, 0], male[simplex, 1], 'k-')

# Here, the separator line comes from the SVM result,
# just to show a separator as an example:
# plt.plot(xx, yy, "k-")

plt.plot(female[:,0], female[:,1], 'o')
for simplex in hull_female.simplices:
    plt.plot(female[simplex, 0], female[simplex, 1], 'k-')

plt.xlim((1890, 2010))
plt.ylim((9, 13))
plt.show()
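A line separates two convex point sets exactly when it separates their hull vertices, so you can verify any candidate line against the two hulls. Here is a sketch, reusing a, w and clf from the SVM answer above:

# sketch: every hull vertex of one class must fall on one side of the line,
# every hull vertex of the other class on the opposite side
def all_same_side(pts, f):
    signs = np.sign(pts[:, 1] - f(pts[:, 0]))
    return np.all(signs == signs[0]), signs[0]

line = lambda v: a * v - clf.intercept_[0] / w[1]
ok_m, side_m = all_same_side(male[hull_male.vertices], line)
ok_f, side_f = all_same_side(female[hull_female.vertices], line)
print(ok_m and ok_f and side_m != side_f)  # True if the line separates the hulls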
I am working on a project for my thesis, which has to do with the capitalization of Research & Development (R&D) expenses for a data set of companies that I have.
For those who are not familiar with financial terminology, I am trying to accumulate each year's R&D expense with those of the following years, decaying (or "depreciating") its value every time period.
I was able to apply the following code to get the gist of the operation:
df['rd_capital'] = [(df['r&d_exp'].iloc[:i] * (1 - df['dep_rate'].iloc[:i]*np.arange(i)[::-1])).sum() for i in range(1,len(df)+1)]
However, there is a major flaw with this method: it keeps subtracting the depreciation even after a value has reached zero, driving it into negative territory.
For example if we have Apple's R&D expenses for 5 years at a constant depreciation rate of 20%, the code above gives me the following result:
year r&d_exp dep_rate r&d_capital
0 1999 10 0.2 10
1 2000 8 0.2 16
2 2001 12 0.2 24.4
3 2002 7 0.2 25.4
4 2003 15 0.2 33
5 2004 8 0.2 30.6
6 2005 11 0.2 29.6
However, the value for the year 2005 is incorrect as it should be 31.6!
If it was not clear, r&d_capital is retrieved the following way:
2000 = 10*(1-0.2) + 8
2001 = 10*(1-0.4) + 8*(1-0.2) + 12
2002 = 10*(1-0.6) + 8*(1-0.4) + 12*(1-0.2) + 7
2003 = 10*(1-0.8) + 8*(1-0.6) + 12*(1-0.4) + 7*(1-0.2) + 15
the key problem comes here as the code above does the following:
2004 = 10*(1-1) + 8*(1-0.8) + 12*(1-0.6) + 7*(1-0.4) + 15*(1-0.2) + 8
2005 = 10*(1-1.2) + 8*(1-1) + 12*(1-0.8) + 7*(1-0.6) + 15*(1-0.4) + 8*(1-0.2) + 11
Instead it should discard the values once the value reaches zero, just like this:
2004 = 8*(1-0.8) + 12*(1-0.6) + 7*(1-0.4) + 15*(1-0.2) + 8
2005 = 12*(1-0.8) + 7*(1-0.6) + 15*(1-0.4) + 8*(1-0.2) + 11
Thank you in advance for any help that you will give, really appreciate it :)
A possible way would be to compute the residual part of each investment. The assumption is that there is a finite and known number of years after which any investment is fully depreciated. Here I will use 6 years (5 would be enough, but it demonstrates how to avoid negative depreciations):
# cumulated depreciation rates:
cum_rate = pd.DataFrame(index=df.index)
for i in range(2, 7):
    cum_rate['cum_rate' + str(i)] = df['dep_rate'].rolling(i).sum().shift(1 - i)
cum_rate['cum_rate1'] = df['dep_rate']
cum_rate[cum_rate > 1] = 1  # avoid negative rates

# residual values
resid = pd.DataFrame(index=df.index)
for i in range(1, 7):
    resid['r' + str(i)] = (df['r&d_exp'] * (1 - cum_rate['cum_rate' + str(i)])).shift(i)

# compute the capital
df['r&d_capital'] = resid.apply('sum', axis=1) + df['r&d_exp']
It gives the expected result:
year r&d_exp dep_rate r&d_capital
0 1999 10 0.2 10.0
1 2000 8 0.2 16.0
2 2001 12 0.2 24.4
3 2002 7 0.2 25.4
4 2003 15 0.2 33.0
5 2004 8 0.2 30.6
6 2005 11 0.2 31.6
You have to keep track of the absolute depreciation and stop depreciating when the asset reaches value zero. Look at the following code:
>>> exp = [10, 8, 12, 7, 15, 8, 11]
>>> dep = [0.2*x for x in exp]
>>> cap = [0]*7
>>> for i in range(7):
...     x = exp[:i+1]
...     for j in range(i):
...         x[j] -= (i-j)*dep[j]
...         x[j] = max(x[j], 0)
...     cap[i] = sum(x)
...
>>> cap
[10, 16.0, 24.4, 25.4, 33.0, 30.599999999999998, 31.6]
>>>
In the for loops I calculate for every year the remaining value of all assets (in variable x). When this reaches zero, I stop depreciating. That is what the statement x[j] = max(x[j], 0) does. The sum of the value of all assets is then stored in cap[i].
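If the explicit loops become slow for long expense histories, here is a vectorized sketch of the same computation (keeping the question's assumption that each expense depreciates at the rate of its own year):

import numpy as np

# sketch: the remaining fraction of expense j in year i is clip(1 - rate[j]*(i-j), 0, 1)
exp = df['r&d_exp'].to_numpy(float)
rate = df['dep_rate'].to_numpy(float)
n = len(exp)
age = np.arange(n)[:, None] - np.arange(n)[None, :]  # age[i, j] = i - j
factor = np.clip(1 - rate[None, :] * age, 0, 1)      # floored at zero, never negative
factor[age < 0] = 0                                  # expenses from future years do not count
df['r&d_capital'] = factor @ exp                     # sum of residual values per year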
I want to display mean and standard deviation values above each of the boxplots in the grouped boxplot (see picture).
My code is
import pandas as pd
import seaborn as sns
from os.path import expanduser as ospath
df = pd.read_excel(ospath('~/Documents/Python/Kandidatspeciale/TestData.xlsx'),'Ark1')
bp = sns.boxplot(y='throw angle', x='incident angle',
                 data=df,
                 palette="colorblind",
                 hue='Bat type')
bp.set_title('Rubber Comparison',fontsize=15,fontweight='bold', y=1.06)
bp.set_ylabel('Throw Angle [degrees]',fontsize=11.5)
bp.set_xlabel('Incident Angle [degrees]',fontsize=11.5)
Where my dataframe, df, is
Bat type incident angle throw angle
0 euro 15 28.2
1 euro 15 27.5
2 euro 15 26.2
3 euro 15 27.7
4 euro 15 26.4
5 euro 15 29.0
6 euro 30 12.5
7 euro 30 14.7
8 euro 30 10.2
9 china 15 29.9
10 china 15 31.1
11 china 15 24.9
12 china 15 27.5
13 china 15 31.2
14 china 15 24.4
15 china 30 9.7
16 china 30 9.1
17 china 30 9.5
I tried the following code. It needs to be independent of the number of x values (incident angles); for instance, it should also work for additional angles such as 45, 60, etc.
m = df.mean(axis=0)   # Mean values
st = df.std(axis=0)   # Standard deviation values
for i, line in enumerate(bp['medians']):
    x, y = line.get_xydata()[1]
    text = ' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i])
    bp.annotate(text, xy=(x, y))
Can somebody help?
This question brought me here since I was also looking for a similar solution with seaborn.
After some trial and error, you just have to change the for loop to:
for i in range(len(m)):
    bp.annotate(
        ' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i]),
        xy=(i, m[i]),
        horizontalalignment='center'
    )
This change worked for me (although I just wanted to print the actual median values). You can also add changes like the fontsize, color or style (i.e., weight) just by adding them as arguments in annotate.
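If you want the statistics per group of boxes rather than per column, here is a sketch along the same lines (assuming the df layout from the question; seaborn places the x categories at integer positions 0, 1, ...):

# sketch: per-incident-angle mean and std, annotated at each category center
stats = df.groupby('incident angle')['throw angle'].agg(['mean', 'std'])
for i, (mu, sigma) in enumerate(zip(stats['mean'], stats['std'])):
    bp.annotate(' μ={:.2f}\n σ={:.2f}'.format(mu, sigma),
                xy=(i, mu), horizontalalignment='center')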
This question is a follow-up to an earlier question and to an answer from @JoeKington here. Both of these solutions work excellently for my needs.
However, I have been trying to overlay a basemap on the contours. Going by the example here http://matplotlib.org/basemap/users/examples.html, I do not seem to get it right. I think my basic problem is converting the contour x, y values into map coordinates. I reproduce below the code for 1) the contours (as given by @usethedeathstar, which works very well) and 2) the map object and the plotting.
#!/usr/bin/python
from mpl_toolkits.basemap import Basemap
import numpy as np
from scipy.interpolate import griddata

class d():
    def __init__(self):
        A0 = open("meansr2.txt", "rb")
        A1 = A0.readlines()
        A = np.zeros((len(A1), 3))
        for i, l in enumerate(A1):
            li = l.split()
            A[i,0] = float(li[0])
            A[i,1] = float(li[1])
            A[i,2] = float(li[2])
        self.Lon = A[:,0]
        self.Lat = A[:,1]
        self.Z = A[:,2]

data = d()
numcols, numrows = 300, 300
xi = np.linspace(data.Lon.min(), data.Lon.max(), numrows)
yi = np.linspace(data.Lat.min(), data.Lat.max(), numcols)
xi, yi = np.meshgrid(xi, yi)

x, y, z = data.Lon, data.Lat, data.Z
points = np.vstack((x,y)).T
values = z
wanted = (xi, yi)
zi = griddata(points, values, wanted)
Defining map object
m = Basemap(projection = 'merc',llcrnrlon = 21, llcrnrlat = -18, urcrnrlon = 34, urcrnrlat = -8, resolution='h')
m.drawcountries(linewidth=0.5, linestyle='solid', color='k', antialiased=1, ax=None, zorder=None)
m.drawmapboundary(fill_color = 'white')
m.fillcontinents(color='coral',lake_color='blue')
parallels = np.arange(-18, -8, 2.)
m.drawparallels(parallels, color = 'black', linewidth = 0.5, labels=[True,False,False,False])
meridians = np.arange(22,34, 2.)
m.drawmeridians(meridians, color = '0.25', linewidth = 0.5, labels=[False,False,False,True])
import pylab as plt
An attempt to transform from lat/lon to map coordinates:
#lat = list(data.Lat)
#lon = list(data.Lon)
#x, y = m(lon,lat)
Comment: contourf was tried with (x, y, zi); then all the above definitions were rewritten with xi, yi, including many different attempts to redefine x, y and lon, lat.
The plot functions
fig = plt.figure(0, figsize=(8,4.5))
im = plt.contourf(xi, yi, zi)
plt.scatter(data.Lon, data.Lat, c= data.Z)
plt.colorbar()
plt.show()
The above gives two plots side by side.
Here is some data in case there is need to test
Lon Lat Z Z2 pos
32.6 -13.6 41 9 CHIP
27.1 -16.9 43 12 CHOM
32.7 -10.2 46 14 ISOK
24.2 -13.6 33 13 KABO
28.5 -14.4 43 11 KABW
28.1 -12.6 33 16 KAFI
27.9 -15.8 46 13 KAFU
24.8 -14.8 44 9 KAOM
31.1 -10.2 35 14 KASA
25.9 -13.5 24 8 KASE
29.1 -9.8 10 13 KAWA
25.8 -17.8 39 11 LIVI
33.2 -12.3 44 8 LUND
28.3 -15.4 46 12 LUSA
27.6 -16.1 47 9 MAGO
28.9 -11.1 31 15 MANS
31.3 -8.9 39 9 MBAL
31.9 -13.3 45 9 MFUW
23.1 -15.3 31 9 MONG
31.4 -11.9 39 9 MPIK
27.1 -15.0 42 12 MUMB
24.4 -11.8 15 9 MWIN
28.6 -13.0 39 9 NDOL
31.3 -14.3 44 12 PETA
23.3 -16.1 39 5 SENA
30.2 -13.2 38 11 SERE
24.3 -17.5 32 10 SESH
26.4 -12.2 23 12 SOLW
23.1 -13.5 27 14 ZAMB
Any assistance will be appreciated
I would like to thank all those who have looked at my problem and may have tried to work on it. Through persistent trying, I found that overlaying the basemap on the contours actually works with the following lines.
After the map object definition
m = Basemap(projection = 'merc',llcrnrlon = 21, llcrnrlat = -18, urcrnrlon = 34, urcrnrlat = -8, resolution='h')
I have
x, y = m(xi, yi)
fig=plt.figure(figsize=(8,4.5))
cs = m.contour(x,y,zi,colors='b',linewidths=1.)
m.contour(x, y, zi) plots the contours on the map. Since I was using contourf, I still have to find out why contourf does not give me the filled contours.
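In case it helps others, one common cause (an assumption on my part, not verified on this data) is that griddata fills grid points outside the convex hull of the stations with NaN, which can break contourf; masking the invalid cells usually restores the filled contours:

# sketch: mask the NaNs produced by griddata before calling contourf
zi_masked = np.ma.masked_invalid(zi)
cs = m.contourf(x, y, zi_masked)
plt.colorbar(cs)
plt.show()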
Thank you very much all for the patience and tolerance.