I have a sparse dataframe and would like to get all non-empty values as a single column. See the image that I made up to illustrate the problem. I somehow managed to solve it using the Python code below. However, I feel there might be a better, simpler, or more efficient way to solve it.
import pandas as pd
list1 = ["x1","x2","?","?","?","?"]
list2 = ["?","?","y1","y2","?","?"]
list3 = ["?","?","?","?","z1","z2"]
df_sparse = pd.DataFrame({"A":list1,"B":list2,"C":list3})
values_vect = []
for col in df_sparse.columns:
    values = [i for i in list(df_sparse[col]) if i != "?"]
    values_vect.extend(values)
df_sparse["D"] = pd.DataFrame(values_vect,columns=["D"])
display(df_sparse)
import numpy as np
df_sparse["D"] = df_sparse.replace("?", np.nan).ffill(axis="columns").iloc[:, -1]
replace "?"s with NaNs
forward fill the values along columns so that non-NaN values will slide to the rightmost positions
query the rightmost column, that's where the values are
to get
>>> df_sparse
A B C D
0 x1 ? ? x1
1 x2 ? ? x2
2 ? y1 ? y1
3 ? y2 ? y2
4 ? ? z1 z1
5 ? ? z2 z2
Using masking, stack and groupby.last:
df_sparse['D'] = (df_sparse
.where(df_sparse.ne('?'))
.stack()
.groupby(level=0).last()
)
print(df_sparse)
Output:
A B C D
0 x1 ? ? x1
1 x2 ? ? x2
2 ? y1 ? y1
3 ? y2 ? y2
4 ? ? z1 z1
5 ? ? z2 z2
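Both approaches above pick one value per row, which matches the loop in the question only because each row here holds exactly one non-"?" entry. A minimal sketch that instead mirrors the original loop (collecting values column by column), assuming "?" is the only placeholder; the name values is just illustrative:

```python
import pandas as pd

df_sparse = pd.DataFrame({
    "A": ["x1", "x2", "?", "?", "?", "?"],
    "B": ["?", "?", "y1", "y2", "?", "?"],
    "C": ["?", "?", "?", "?", "z1", "z2"],
})

# Mask the "?" placeholders as NaN, unpivot column by column with melt,
# then drop the NaNs so only the real values remain, in column order.
values = (
    df_sparse.where(df_sparse.ne("?"))
    .melt()["value"]
    .dropna()
    .reset_index(drop=True)
)

# Only safe to assign directly when the count matches the number of rows.
if len(values) == len(df_sparse):
    df_sparse["D"] = values

print(df_sparse)
```

If the number of non-empty values differs from the number of rows, the result is probably better kept as its own Series than forced into column D.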
I'm quite new to python and pandas so I hope I can get some help.
I have a train_df that looks like this:
x y1 y2 y3 y4
0 -20.0 -0.702864 10.392012 1.013891 -8794.9050
1 -19.9 -0.591605 9.450884 1.231116 -8667.2340
2 -19.8 -0.983952 10.240055 0.675153 -8541.5720
And an ideal_df that looks like this:
x y1 y2 ... y48 y49 y50
0 -20.0 -0.912945 0.408082 ... -0.186278 0.912945 0.396850
1 -19.9 -0.867644 0.497186 ... -0.215690 0.867644 0.476954
2 -19.8 -0.813674 0.581322 ... -0.236503 0.813674 0.549129
Both have 400 rows.
I want to sum up the squared deviation (distance) between the y-values of train_df and ideal_df at each given x-value, e.g.:
For the 1st value of x, y1 from train_df and y1 from ideal_df, then y1 from train_df and y2 from ideal_df, etc.
Then repeat the same for every one of the 400 rows of y1 from train_df.
After that, repeat it for y2, y3, and y4 of train_df, but that is the easy part.
I wrote this
squared_deviations_y1_train = (((train_df.y1)-(ideal_df.loc[:,"y1":"y50"])) ** 2).sum()
But I have no idea what I'm doing to be honest.
Merge/join the two dataframes by index and then, for each y column of train_df, compute the squared deviation against every y column of ideal_df and sum them per row:
import pandas as pd
train_df = pd.DataFrame(data=[ [-20.0,-0.702864,10.392012,1.013891,-8794.9050], [-19.9,-0.591605,9.450884,1.231116,-8667.2340], [-19.8,-0.983952,10.240055,0.675153,-8541.5720] ], columns=["x","y1","y2","y3","y4"])
ideal_df = pd.DataFrame(data=[ [-20.0,-0.912945,0.408082,-0.186278,0.912945,0.396850], [-19.9,-0.867644,0.497186,-0.215690,0.867644,0.476954], [-19.8,-0.813674,0.581322,-0.236503,0.813674,0.549129] ], columns=["x","y1","y2","y48","y49","y50"])
# Suffix the ideal columns so they do not collide with train_df's columns after the merge
ideal_df = ideal_df.add_suffix("_i")
result_df = train_df.merge(ideal_df, left_index=True, right_index=True, how="left")
for t_col in train_df.columns:
    if t_col != "x":
        # Row-wise sum of the squared deviations from every ideal y column
        result_df[f"{t_col}_sd"] = sum([(result_df[t_col] - result_df[i_col]) ** 2 for i_col in ideal_df.columns if i_col != "x_i"])
[Output]:
x y1 y2 y3 y4 x_i y1_i y2_i y48_i y49_i y50_i y1_sd y2_sd y3_sd y4_sd
0 -20.0 -0.702864 10.392012 1.013891 -8794.905 -20.0 -0.912945 0.408082 -0.186278 0.912945 0.396850 5.365406 529.137105 5.911037 3.867627e+08
1 -19.9 -0.591605 9.450884 1.231116 -8667.234 -19.9 -0.867644 0.497186 -0.215690 0.867644 0.476954 4.674201 434.286809 7.737567 3.756179e+08
2 -19.8 -0.983952 10.240055 0.675153 -8541.572 -19.8 -0.813674 0.581322 -0.236503 0.813674 0.549129 8.619554 508.005021 3.091597 3.648075e+08
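If the goal is instead one sum per ideal column (summing over the rows), the expression in the question is close. The catch, as far as I can tell, is that subtracting a Series from a DataFrame aligns the Series index with the DataFrame's column labels, so nothing matches. A minimal sketch using DataFrame.sub with axis=0, which aligns on the row index instead; it uses only the first two ideal columns from the sample, and sq_dev and best are illustrative names:

```python
import pandas as pd

train_df = pd.DataFrame({
    "x": [-20.0, -19.9, -19.8],
    "y1": [-0.702864, -0.591605, -0.983952],
})
ideal_df = pd.DataFrame({
    "x": [-20.0, -19.9, -19.8],
    "y1": [-0.912945, -0.867644, -0.813674],
    "y2": [0.408082, 0.497186, 0.581322],
})

# Keep only the ideal y columns.
ideal_y = ideal_df.drop(columns="x")

# axis=0 subtracts train y1 from every ideal column row by row;
# squaring and summing then gives one total per ideal column.
sq_dev = ideal_y.sub(train_df["y1"], axis=0).pow(2).sum()

# Label of the ideal column with the smallest summed squared deviation.
best = sq_dev.idxmin()

print(sq_dev)
print(best)
```

The same line can be repeated for y2, y3 and y4 of train_df, for example in a loop over those column names.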
I did go through multiple StackOverflow posts to get an idea of how to solve this but couldn't come up with anything.
So, I have a dataframe with three attributes: id, X1, Y1.
I need to pass each instance/entry of the dataframe to a function (e.g., func) which returns two values: X2, Y2. The operation basically looks like this:
X2, Y2 = func(X1, Y1)
I need to save the X2, Y2 for each entry as a new column so that the new dataframe looks like: id, X1, Y1, X2, Y2
I am not sure how to perform this with pandas. Could you please give me some pointers?
Thanks a lot for your effort and time!
I believe this will do what your question asks (note that func() has been given an arbitrary sample implementation in this example):
import pandas as pd
df = pd.DataFrame({
'X1' : [1,2,3,4,5],
'Y1' : [2,2,3,3,4]
})
def func(a, b):
return a - b, a + b
df[['X2', 'Y2']] = pd.DataFrame(df.apply(lambda x: func(x['X1'], x['Y1']), axis=1).tolist(), columns=['foo', 'bar'])
print(df)
Output:
X1 Y1 X2 Y2
0 1 2 -1 3
1 2 2 0 4
2 3 3 0 6
3 4 3 1 7
4 5 4 1 9
I'm pretty sure we need more details, but you can do this with
df[["X2", "Y2"]] = df.apply(lambda row: func(row["X1"], row["Y1"]), axis=1, result_type="expand")
Better would be
df["X2"] = df["id"] + df["X1"] + df["Y1"]
I believe the latter is vectorized, while the former calls the function once per row, much like a for loop.
Hope this helps
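To make the vectorized idea concrete for a function that returns two values: a minimal sketch, assuming func only uses operations that work elementwise on whole columns (true for the sample func from the other answer, not necessarily for yours). The fallback column names X2_loop and Y2_loop are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"X1": [1, 2, 3, 4, 5], "Y1": [2, 2, 3, 3, 4]})

def func(a, b):
    # Sample implementation; works on scalars and on whole Series alike.
    return a - b, a + b

# Vectorized: call func once on the full columns and unpack both results.
df["X2"], df["Y2"] = func(df["X1"], df["Y1"])

# Fallback when func only accepts scalars: call it per row, then expand
# the list of (X2, Y2) tuples into two columns (matched by position).
pairs = [func(a, b) for a, b in zip(df["X1"], df["Y1"])]
df[["X2_loop", "Y2_loop"]] = pd.DataFrame(pairs, index=df.index)

print(df)
```

The first form avoids the per-row Python call entirely; the second still loops, but skips the overhead of building a row object for each call.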
I have a really large dataframe similar to this:
CustomerId Latitude Longitude
0. a x1 y1
1. a x2 y2
2. b x3 y3
3. c x4 y4
And I have a second dataframe that corresponds to a sample of the first one, like this:
CustomerId Latitude Longitude
0. a x1 y1
3. c x4 y4
My goal is to get a new dataframe just like the original, but with NaN's instead of the coordinates of the rows with indexes that don't exist on the second dataframe. This is the result I would need:
CustomerId Latitude Longitude
0. a x1 y1
1. a NaN NaN
2. b NaN NaN
3. c x4 y4
I am new to Python and I haven't found any question like this one. Does anybody have an idea of how to solve it?
First we create a mask with pandas.Series.isin.
After that we use np.where and ask for the opposite with ~:
import numpy as np
mask = df.CustomerId.isin(df2.CustomerId)
df['Latitude'] = np.where(~mask, np.nan, df['Latitude'])
df['Longitude'] = np.where(~mask, np.nan, df['Longitude'])
print(df)
CustomerId Latitude Longitude
0.0 a x1 y1
1.0 a x2 y2
2.0 b NaN NaN
3.0 c x4 y4
Explanation:
np.where works as follows: np.where(condition, value if true, value if false)
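Note that the mask above matches on CustomerId, so row 1 (customer a) keeps its coordinates. If the rows should be matched on the index instead, as the question describes, a minimal sketch could look like this. The numeric coordinates are hypothetical stand-ins for x1..x4 and y1..y4, and keep and result are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric coordinates in place of x1..x4 / y1..y4.
df = pd.DataFrame({
    "CustomerId": ["a", "a", "b", "c"],
    "Latitude": [1.0, 2.0, 3.0, 4.0],
    "Longitude": [10.0, 20.0, 30.0, 40.0],
})
df2 = df.loc[[0, 3]]  # the sample: rows with index 0 and 3

# True for rows whose index label also appears in the sample.
keep = df.index.isin(df2.index)

# Work on a copy and blank out the coordinates of all other rows.
result = df.copy()
result.loc[~keep, ["Latitude", "Longitude"]] = np.nan

print(result)
```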
How do I read the data from the file and plot the rectangle?
The given text file has the following format up to 50 rows:
x1 y1 x2 y2 x3 y3 x4 y4
where (x1,y1), (x2,y2), (x3,y3) and (x4,y4) are the four vertices of the rectangle. The rectangles have random orientation. How do I plot the series of rectangles in gnuplot?
If someone can tell me how to read from a file while using set object polygon, that may also be helpful
What I want: set object polygon from x1,y1 to x2,y2 to x3,y3 to x4,y4.
Or, is there any other simpler code in gnuplot? Alternatively is there a python solution?
My Gnuplot solution (not just for rectangles, but any kind of polygons):
plot 'rectal.dat' u 1:2:($3-$1):($4-$2) with vectors nohead lc 1 title 'Rectangle', \
'' u 3:4:($5-$3):($6-$4) with vectors nohead lc 1 notitle, \
'' u 5:6:($7-$5):($8-$6) with vectors nohead lc 1 notitle, \
'' u 7:8:($1-$7):($2-$8) with vectors nohead lc 1 notitle
from this data:
0 0 0 1 1 1 1 0
0 0 0 2 2 2 2 0
-0.5 -0.5 -1 1 -0.5 2 0 1
Here's the Python code to do so as we discussed.
Your input file is of format:
__ __ x1 y1 x2 y2 x3 y3 x4 y4
The code.
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('car.txt')
data = [i[2:] for i in data]
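for d in data:
    Xs = d[::2]   # x1, x2, x3, x4
    Ys = d[1::2]  # y1, y2, y3, y4
    for i in range(4):
        if i < 3:
            # edge from vertex i to vertex i+1
            plt.plot([Xs[i],Xs[i+1]],[Ys[i],Ys[i+1]],'k-',lw=2)
        elif i == 3:
            # closing edge from the last vertex back to the first
            plt.plot([Xs[i],Xs[0]],[Ys[i],Ys[0]],'k-',lw=2)
plt.show()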
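The inner loop over the four edges can be avoided by closing each outline before plotting. A minimal sketch with NumPy only, using two of the sample rows from the gnuplot answer (eight columns, no leading extra columns); verts and closed are illustrative names, and the plotting call is left as a comment:

```python
import numpy as np

# Two rectangles, one per row: x1 y1 x2 y2 x3 y3 x4 y4
rows = np.array([
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 2, 2, 2, 2, 0],
], dtype=float)

# Reshape to (n_rectangles, 4 vertices, 2 coordinates).
verts = rows.reshape(-1, 4, 2)

# Append the first vertex to the end so each outline is closed.
closed = np.concatenate([verts, verts[:, :1, :]], axis=1)

# Each closed outline can then be drawn with a single call, e.g.:
#   for ring in closed:
#       plt.plot(ring[:, 0], ring[:, 1], 'k-', lw=2)
print(closed.shape)
```

For a file with two leading columns, as described above, the rows would first need slicing with data[:, 2:].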
I am trying to implement multivariate linear regression using numpy. There are several questions in this forum regarding that, but none seems to answer my question. I have the following independent variables (X1, X2, X3, X4) and dependent variable Y. I want to predict the values of Y'.
X1 X2 X3 X4 Y Y'
1 0 1 0 1 ? // ? -> referring this value as y'1
0 0 1 1 0 ? // ? -> referring this value as y'2
0 1 0 1 0 ? // ? -> referring this value as y'3
0 0 0 1 1 ? // ? -> referring this value as y'4
1 0 1 1 0 ? // ? -> referring this value as y'5
So, I am using numpy as:
>>> import numpy as np
>>> X1 = np.array([1,0,0,0,1])
>>> X2 = np.array([0,0,1,0,0])
>>> X3 = np.array([1,1,0,0,1])
>>> X4 = np.array([0,1,1,1,1])
>>> Y = np.array([1,0,0,1,0])
>>> x = np.array([X1,X2,X3,X4], np.int32)
>>> n = np.max(x.shape)
>>> X = np.vstack([np.ones(n), x]).T
>>> print(np.linalg.lstsq(X, Y, rcond=None)[0])
[ 2.00000000e+00 -2.22044605e-16 -1.00000000e+00 -1.00000000e+00 -1.00000000e+00]
So, I have the equation y = a + b1*x1 + b2*x2 + b3*x3 + b4*x4. From above, I have got the values of a, b1, b2, b3, b4.
So, how do I calculate the values of Y', which are y'1, y'2, y'3, y'4, y'5, from the above coefficient values?
The point of OLS is to fit parameters based on data you have and use that to predict a new Y. Try ...
>>> import numpy as np
>>> X = np.array([[1,0,1,0], [0,0,1,1], [0,1,0,1], [0,0,0,1], [1,0,1,1]])
>>> Y = np.array([1,0,0,1,0]).reshape((5,1))
>>> b = np.linalg.inv((X.T).dot(X)).dot(X.T).dot(Y)
>>> b
out [1]: array([[0.666], [-0.333], [-0.333], [0.333]])
Then use this to predict a new Y given 4 new X's. Also, if your Y's are binary (all zeros and ones), you should look at using Logistic Regression.
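In code, the prediction step is just the design matrix times the fitted coefficients. A minimal sketch using the data from the question, with an intercept column added as in the question's own setup; X_design, coef and Y_pred are illustrative names, and np.linalg.lstsq is used here in place of the explicit inverse.

```python
import numpy as np

X = np.array([
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
], dtype=float)
Y = np.array([1, 0, 0, 1, 0], dtype=float)

# Prepend a column of ones so the first coefficient acts as the intercept a.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares fit; coef holds a, b1, b2, b3, b4 in that order.
coef, *_ = np.linalg.lstsq(X_design, Y, rcond=None)

# Y' = a + b1*x1 + b2*x2 + b3*x3 + b4*x4, computed for every row at once.
Y_pred = X_design @ coef

print(coef)
print(Y_pred)
```

With five coefficients and only five rows, the fit reproduces Y exactly on this data, so it says little about how well the model would predict new rows. For new inputs, build the design matrix the same way and multiply by coef.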