Repeating same operation for multiple columns of another df - python

I'm quite new to python and pandas so I hope I can get some help.
I have a train_df that looks like this:
x y1 y2 y3 y4
0 -20.0 -0.702864 10.392012 1.013891 -8794.9050
1 -19.9 -0.591605 9.450884 1.231116 -8667.2340
2 -19.8 -0.983952 10.240055 0.675153 -8541.5720
And an ideal_df that looks like this:
x y1 y2 ... y48 y49 y50
0 -20.0 -0.912945 0.408082 ... -0.186278 0.912945 0.396850
1 -19.9 -0.867644 0.497186 ... -0.215690 0.867644 0.476954
2 -19.8 -0.813674 0.581322 ... -0.236503 0.813674 0.549129
Both have 400 rows.
I want to sum up the squared deviation (distance) between the y-values of train_df and ideal_df at each given x-value, e.g.:
For the 1st value of x, y1 from train_df and y1 from ideal_df, then y1 from train_df and y2 from ideal_df, etc.
Then repeat the same for every one of the 400 rows of y1 from train_df.
After that, repeat it for y2, y3, and y4 of train_df, but that is the easy part.
I wrote this
squared_deviations_y1_train = (((train_df.y1)-(ideal_df.loc[:,"y1":"y50"])) ** 2).sum()
But I have no idea what I'm doing to be honest.

Merge/join the two dataframes by index, and then for each y column of train_df (y1 to y4), compute the squared deviation:
import pandas as pd
train_df = pd.DataFrame(data=[ [-20.0,-0.702864,10.392012,1.013891,-8794.9050], [-19.9,-0.591605,9.450884,1.231116,-8667.2340], [-19.8,-0.983952,10.240055,0.675153,-8541.5720] ], columns=["x","y1","y2","y3","y4"])
ideal_df = pd.DataFrame(data=[ [-20.0,-0.912945,0.408082,-0.186278,0.912945,0.396850], [-19.9,-0.867644,0.497186,-0.215690,0.867644,0.476954], [-19.8,-0.813674,0.581322,-0.236503,0.813674,0.549129] ], columns=["x","y1","y2","y48","y49","y50"])
ideal_df = ideal_df.add_suffix("_i")
result_df = train_df.merge(ideal_df, left_index=True, right_index=True, how="left")
for t_col in train_df.columns:
    if t_col != "x":
        result_df[f"{t_col}_sd"] = sum([(result_df[t_col] - result_df[i_col]) ** 2 for i_col in ideal_df.columns if i_col != "x_i"])
[Output]:
x y1 y2 y3 y4 x_i y1_i y2_i y48_i y49_i y50_i y1_sd y2_sd y3_sd y4_sd
0 -20.0 -0.702864 10.392012 1.013891 -8794.905 -20.0 -0.912945 0.408082 -0.186278 0.912945 0.396850 5.365406 529.137105 5.911037 3.867627e+08
1 -19.9 -0.591605 9.450884 1.231116 -8667.234 -19.9 -0.867644 0.497186 -0.215690 0.867644 0.476954 4.674201 434.286809 7.737567 3.756179e+08
2 -19.8 -0.983952 10.240055 0.675153 -8541.572 -19.8 -0.813674 0.581322 -0.236503 0.813674 0.549129 8.619554 508.005021 3.091597 3.648075e+08
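If what is needed is instead one total per pair of columns (the squared deviations summed over all rows, which is what the .sum() in the question's attempt suggests), a sketch along these lines may help. The helper name sum_squared_deviations and its x_col parameter are made up for illustration, and the frames are cut down to a few rows and columns. It assumes both frames list the same x-values in the same row order. The key point is sub(..., axis=0), which aligns the train column with the rows of ideal_df rather than with its column labels.

```python
import pandas as pd

# Small stand-ins for the two frames in the question (first rows only).
train_df = pd.DataFrame({
    "x": [-20.0, -19.9, -19.8],
    "y1": [-0.702864, -0.591605, -0.983952],
    "y2": [10.392012, 9.450884, 10.240055],
})
ideal_df = pd.DataFrame({
    "x": [-20.0, -19.9, -19.8],
    "y1": [-0.912945, -0.867644, -0.813674],
    "y2": [0.408082, 0.497186, 0.581322],
    "y50": [0.396850, 0.476954, 0.549129],
})


def sum_squared_deviations(train, ideal, x_col="x"):
    """Hypothetical helper: rows = ideal columns, columns = train columns."""
    ideal_y = ideal.drop(columns=x_col)
    return pd.DataFrame({
        # axis=0 subtracts the train column row by row from every ideal column
        t_col: ideal_y.sub(train[t_col], axis=0).pow(2).sum()
        for t_col in train.columns.drop(x_col)
    })


ssd = sum_squared_deviations(train_df, ideal_df)
best_fit = ssd.idxmin()  # ideal column with the smallest total, per train column
print(ssd)
print(best_fit)
```

Here ssd.loc["y2", "y1"] is the summed squared deviation between train y1 and ideal y2, and idxmin picks the closest ideal column for each train column.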

Related

Stack/Unstack Pandas Data Frame

I have the following dataset Excel Dummy DataSet that consists of a concatenation of several Tables in an Excel sheet. They are all stacked vertically. The columns of the different tables are same, col_x, col_y, col_t, except for the column Y that varies as the tables change (See the figure below).
I somehow managed to get the output. However, I wonder if there is a simpler or more efficient way to do this?
This is what I have tried
import pandas as pd
# Import Data
path = r"/content/test_data.xlsx"
df_original = pd.read_excel(path, skiprows=4, usecols= range(0,4), header=None)
df_original.columns=["col_x","col_y","col_z","col_t"]
# Beginning of the code
mask_col_x = df_original["col_x"] == "col_x"
df_break = df_original[mask_col_x]
index_break_list = df_break.index
range_list = []
# Each header row starts a table that runs until the next header row
for i, val in enumerate(index_break_list):
    if i < len(index_break_list)-1:
        span1 = (val+1,index_break_list[i+1],df_original["col_y"][val])
        range_list.append(span1)
# The last table runs to the end of the sheet
span1 = (val+1,len(df_original),df_original["col_y"][val])
range_list.append(span1)
dataframe_list = []
for elt in range_list:
    df_sub = df_original.iloc[elt[0]:elt[1]].copy()
    df_sub["Value y"] = elt[2]
    dataframe_list.append(df_sub)
new_df = pd.concat(dataframe_list,axis=0)
new_df.to_csv("test_data_result_combined.csv")
You can create the Value y column by masking with Series.where, then forward fill the missing values with ffill, and finally filter out the header rows with the inverted mask (~):
path = "test_data.xlsx"
df_original = pd.read_excel(path, skiprows=4, usecols= range(0,4), header=None)
df_original.columns=["col_x","col_y","col_z","col_t"]
mask_col_x = df_original["col_x"] == "col_x"
df_original['Value y'] = df_original["col_y"].where(mask_col_x).ffill()
new_df = df_original[~mask_col_x]
print (new_df)
col_x col_y col_z col_t Value y
1 index1 val_y1_table1 val_z1_table1 val_t1_table1 y1
2 index2 val_y2_table1 val_z2_table1 val_t2_table1 y1
3 index3 val_y3_table1 val_z3_table1 val_t3_table1 y1
4 index4 val_y4_table1 val_z4_table1 val_t4_table1 y1
6 index5 val_y1_table2 val_z1_table2 val_t1_table2 y2
7 index6 val_y2_table2 val_z2_table2 val_t2_table2 y2
8 index7 val_y3_table2 val_z3_table2 val_t3_table2 y2
10 index8 val_y1_table3 val_z1_table3 val_t1_table3 y3
11 index9 val_y2_table3 val_z2_table3 val_t2_table3 y3
13 index10 val_y1_table4 val_z1_table4 val_t1_table4 y4
15 index11 val_y1_table5 val_z1_table5 val_t1_table5 y5
16 index12 val_y2_table5 val_z2_table5 val_t2_table5 y5
17 index13 val_y3_table5 val_z3_table5 val_t3_table5 y5
18 index14 val_y4_table5 val_z4_table5 val_t4_table5 y5
19 index15 val_y5_table5 val_z5_table5 val_t5_table5 y5
20 index16 val_y6_table5 val_z6_table5 val_t6_table5 y5
21 index17 val_y7_table5 val_z7_table5 val_t7_table5 y5
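The same where/ffill idea can be tried without the Excel file. The sketch below uses a tiny in-memory frame with made-up values (two stacked tables whose header rows repeat the string "col_x"); only the column names are taken from the question.

```python
import pandas as pd

# Two stacked tables; rows where col_x == "col_x" are the repeated header rows,
# and the header's col_y cell holds the table label (y1, y2).
df = pd.DataFrame({
    "col_x": ["col_x", "index1", "index2", "col_x", "index3"],
    "col_y": ["y1", "a", "b", "y2", "c"],
})

is_header = df["col_x"] == "col_x"
# Keep col_y only on header rows (NaN elsewhere), then carry each label downward
df["Value y"] = df["col_y"].where(is_header).ffill()
# Drop the header rows themselves
result = df[~is_header]
print(result)
```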

Get non empty values of dataframe as a single column

I have a sparse dataframe and would like to get all non-empty values as a single column. See the image that I made up to illustrate the problem. I somehow managed to solve it using the Python code below. However, I feel there might be a better, simpler, or more efficient way to solve it.
import pandas as pd
list1 = ["x1","x2","?","?","?","?"]
list2 = ["?","?","y1","y2","?","?"]
list3 = ["?","?","?","?","z1","z2"]
df_sparse = pd.DataFrame({"A":list1,"B":list2,"C":list3})
values_vect = []
for col in df_sparse.columns:
    values = [ i for i in list(df_sparse[col]) if i !="?"]
    values_vect.extend(values)
df_sparse["D"] = pd.DataFrame(values_vect,columns=["D"])
display(df_sparse)
import numpy as np
df_sparse["D"] = df_sparse.replace("?", np.nan).ffill(axis="columns").iloc[:, -1]
This replaces the "?"s with NaNs,
forward fills the values along the columns so that the non-NaN values slide to the rightmost position,
and selects the rightmost column, which is where the values end up,
to get
>>> df_sparse
A B C D
0 x1 ? ? x1
1 x2 ? ? x2
2 ? y1 ? y1
3 ? y2 ? y2
4 ? ? z1 z1
5 ? ? z2 z2
Using masking, stack and groupby.last:
df_sparse['D'] = (df_sparse
    .where(df_sparse.ne('?'))
    .stack()
    .groupby(level=0).last()
)
print(df_sparse)
Output:
A B C D
0 x1 ? ? x1
1 x2 ? ? x2
2 ? y1 ? y1
3 ? y2 ? y2
4 ? ? z1 z1
5 ? ? z2 z2
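Both answers above pick one value per row. The loop in the question instead collects the values column by column, which gives the same result for this data but would differ if a row had several non-empty cells. A possible sketch that keeps the question's column-by-column order, using mask, melt and dropna (this is a swapped-in technique, not what either answer does):

```python
import pandas as pd

df_sparse = pd.DataFrame({
    "A": ["x1", "x2", "?", "?", "?", "?"],
    "B": ["?", "?", "y1", "y2", "?", "?"],
    "C": ["?", "?", "?", "?", "z1", "z2"],
})

values = (
    df_sparse.mask(df_sparse.eq("?"))   # "?" -> NaN
    .melt(value_name="D")["D"]          # stack the columns one after another
    .dropna()                           # keep only the real values
    .reset_index(drop=True)             # renumber 0..n-1 so it aligns with the rows
)
df_sparse["D"] = values
print(df_sparse)
```

If there are fewer non-empty values than rows, the remaining cells of D are left as NaN.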

Use multiple columns of a dataframe for an operation and save the result in multiple columns

I did go through multiple StackOverflow posts to get an idea of how to solve this but couldn't come up with anything.
So, I have a dataframe with three attributes: id, X1, Y1.
I need to pass each instance/entry of the dataframe to a function (e.g., func) that returns two values: X2, Y2. The operation basically looks like this:
X2, Y2 = func(X1, Y1)
I need to save the X2, Y2 for each entry as a new column so that the new dataframe looks like: id, X1, Y1, X2, Y2
I am not sure how to perform this with pandas. Could you please give me some pointers?
Thanks a lot for your effort and time!
I believe this will do what your question asks (note that func() has been given an arbitrary sample implementation in this example):
import pandas as pd
df = pd.DataFrame({
'X1' : [1,2,3,4,5],
'Y1' : [2,2,3,3,4]
})
def func(a, b):
    return a - b, a + b
df[['X2', 'Y2']] = pd.DataFrame(df.apply(lambda x: func(x['X1'], x['Y1']), axis=1).tolist(), columns=['foo', 'bar'])
print(df)
Output:
X1 Y1 X2 Y2
0 1 2 -1 3
1 2 2 0 4
2 3 3 0 6
3 4 3 1 7
4 5 4 1 9
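I'm pretty sure we need more details, but you can do this with
df.apply(lambda row: func(row["X1"], row["Y1"]), axis=1, result_type="expand")
Better, if the function can be written as column arithmetic, would be something like
df["X2"] = df["id"] + df["X1"] + df["Y1"]
I believe the latter is vectorized, while the former runs as a Python-level loop over the rows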
Hope this helps
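Two other ways to unpack a two-value function into two columns, sketched with the same arbitrary sample func as above. The first only works if func uses operations that act elementwise on whole Series; the second calls func once per row and makes no such assumption. The column names X2_loop and Y2_loop are made up so both results can be compared.

```python
import pandas as pd

df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5],
    "Y1": [2, 2, 3, 3, 4],
})


def func(a, b):
    # Arbitrary sample implementation, as in the answer above
    return a - b, a + b


# Option 1: pass whole columns; func returns a tuple of two Series
df["X2"], df["Y2"] = func(df["X1"], df["Y1"])

# Option 2: call func per row and transpose the list of pairs with zip
df["X2_loop"], df["Y2_loop"] = zip(*map(func, df["X1"], df["Y1"]))

print(df)
```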

Replace some values in a dataframe with NaN's if the index of the row does not exist in another dataframe

I have a really large dataframe similar to this:
CustomerId Latitude Longitude
0. a x1 y1
1. a x2 y2
2. b x3 y3
3. c x4 y4
And I have a second dataframe that corresponds to a sample of the first one, like this:
CustomerId Latitude Longitude
0. a x1 y1
3. c x4 y4
My goal is to get a new dataframe just like the original, but with NaN's instead of the coordinates of the rows with indexes that don't exist on the second dataframe. This is the result I would need:
CustomerId Latitude Longitude
0. a x1 y1
1. a NaN NaN
2. b NaN NaN
3. c x4 y4
I am new to Python and I haven't found any question like this one. Does anybody have an idea of how to solve it?
First we create a mask on CustomerId with pandas.Series.isin
After that we use np.where and ask for the opposite with ~
mask = df.CustomerId.isin(df2.CustomerId)
df['Latitude'] = np.where(~mask, np.nan, df['Latitude'])
df['Longitude'] = np.where(~mask, np.nan, df['Longitude'])
print(df)
CustomerId Latitude Longitude
0.0 a x1 y1
1.0 a x2 y2
2.0 b NaN NaN
3.0 c x4 y4
Explanation:
np.where works as follows: np.where(condition, value if true, value if false)
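The mask above matches on CustomerId, so every row of a sampled customer keeps its coordinates. Since the question asks to match on the row index, a sketch of an index-based variant follows. The numeric coordinates are placeholders standing in for x1..x4 and y1..y4, and result is a name chosen here so the original frame stays untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CustomerId": ["a", "a", "b", "c"],
    "Latitude": [1.0, 2.0, 3.0, 4.0],       # placeholder values
    "Longitude": [10.0, 20.0, 30.0, 40.0],  # placeholder values
})
df2 = df.loc[[0, 3]]  # the sample: rows with index 0 and 3

result = df.copy()
# True for rows whose index label does not appear in the sample
missing = ~result.index.isin(df2.index)
result.loc[missing, ["Latitude", "Longitude"]] = np.nan
print(result)
```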

I want to plot a rectangle with given 4 coordinates in a text file in gnuplot. The rectangle may be at an angle to x axis

How do I read the data from the file and plot the rectangle?
The given text file has the following format up to 50 rows:
x1 y1 x2 y2 x3 y3 x4 y4
where (x1,y1), (x2,y2), (x3,y3) and (x4,y4) are the four vertices of the rectangle. The rectangles have random orientation. How do I plot the series of rectangles in gnuplot?
If someone can tell me how to read from a file while using set object polygon, that may also be helpful
What I want: set object polygon from x1,y1 to x2,y2 to x3,y3 to x4,y4.
Or, is there any other simpler code in gnuplot? Alternatively is there a python solution?
My Gnuplot solution (not just for rectangles, but any kind of polygons):
plot 'rectal.dat' u 1:2:($3-$1):($4-$2) with vectors nohead lc 1 title 'Rectangle', \
'' u 3:4:($5-$3):($6-$4) with vectors nohead lc 1 notitle, \
'' u 5:6:($7-$5):($8-$6) with vectors nohead lc 1 notitle, \
'' u 7:8:($1-$7):($2-$8) with vectors nohead lc 1 notitle
from this data file:
0 0 0 1 1 1 1 0
0 0 0 2 2 2 2 0
-0.5 -0.5 -1 1 -0.5 2 0 1
Here's the Python code to do so as we discussed.
Your input file is of format:
__ __ x1 y1 x2 y2 x3 y3 x4 y4
The code.
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('car.txt')
data = [i[2:] for i in data]
for d in data:
    Xs = d[::2]
    Ys = d[1::2]
    for i in range(4):
        if i < 3:
            plt.plot([Xs[i],Xs[i+1]],[Ys[i],Ys[i+1]],'k-',lw=2)
        elif i == 3:
            plt.plot([Xs[i],Xs[0]],[Ys[i],Ys[0]],'k-',lw=2)
plt.show()
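An alternative sketch that draws each row as a closed outline with matplotlib.patches.Polygon instead of four separate line segments. It reads inline sample data in the x1 y1 x2 y2 x3 y3 x4 y4 layout from the question (the same three rows as the gnuplot example); the Agg backend is an assumption made so the sketch also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Polygon

# Inline stand-in for the text file; pass a real file path to np.loadtxt instead.
sample = io.StringIO(
    "0 0 0 1 1 1 1 0\n"
    "0 0 0 2 2 2 2 0\n"
    "-0.5 -0.5 -1 1 -0.5 2 0 1\n"
)
rows = np.loadtxt(sample, ndmin=2)  # ndmin=2 keeps a single-row file 2-D

fig, ax = plt.subplots()
for row in rows:
    corners = row.reshape(4, 2)  # four (x, y) vertices in file order
    ax.add_patch(Polygon(corners, closed=True, fill=False, edgecolor="k", linewidth=2))
ax.autoscale_view()      # fit the axes to the added patches
ax.set_aspect("equal")   # keep right angles looking like right angles
```

Call plt.show() or fig.savefig(...) afterwards to view the result. If the file has two extra leading columns, as in the answer above, slice them off first with rows[:, 2:].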
