Stack/Unstack Pandas Data Frame - python

I have the following dataset (an Excel dummy dataset) that consists of several tables concatenated in a single Excel sheet. They are all stacked vertically. The columns of the different tables are the same, col_x, col_y, col_z, col_t, except for the Y label in each table's header row, which varies from table to table (see the figure below).
I somehow managed to get the output I want. However, I wonder if there is a simpler or more efficient way to do this?
This is what I have tried:
import pandas as pd
# Import Data
path = r"/content/test_data.xlsx"
df_original = pd.read_excel(path, skiprows=4, usecols= range(0,4), header=None)
df_original.columns=["col_x","col_y","col_z","col_t"]
# Beginning of the code
mask_col_x = df_original["col_x"] == "col_x"   # repeated header rows mark the start of each table
df_break = df_original[mask_col_x]
index_break_list = df_break.index
range_list = []
for i, val in enumerate(index_break_list):
    if i < len(index_break_list) - 1:
        span1 = (val + 1, index_break_list[i + 1], df_original["col_y"][val])
    else:
        span1 = (val + 1, len(df_original), df_original["col_y"][val])
    range_list.append(span1)
dataframe_list = []
for elt in range_list:
    df_sub = df_original.iloc[elt[0]:elt[1]].copy()
    df_sub["Value y"] = elt[2]
    dataframe_list.append(df_sub)
new_df = pd.concat(dataframe_list, axis=0)
new_df.to_csv("test_data_result_combined.csv")

You can create the column Value y by masking col_y with Series.where, forward filling the missing values with ffill, and then filtering out the repeated header rows with the inverted mask (~):
path = "test_data.xlsx"
df_original = pd.read_excel(path, skiprows=4, usecols=range(0, 4), header=None)
df_original.columns = ["col_x", "col_y", "col_z", "col_t"]

mask_col_x = df_original["col_x"] == "col_x"
df_original['Value y'] = df_original["col_y"].where(mask_col_x).ffill()
new_df = df_original[~mask_col_x]
print(new_df)
col_x col_y col_z col_t Value y
1 index1 val_y1_table1 val_z1_table1 val_t1_table1 y1
2 index2 val_y2_table1 val_z2_table1 val_t2_table1 y1
3 index3 val_y3_table1 val_z3_table1 val_t3_table1 y1
4 index4 val_y4_table1 val_z4_table1 val_t4_table1 y1
6 index5 val_y1_table2 val_z1_table2 val_t1_table2 y2
7 index6 val_y2_table2 val_z2_table2 val_t2_table2 y2
8 index7 val_y3_table2 val_z3_table2 val_t3_table2 y2
10 index8 val_y1_table3 val_z1_table3 val_t1_table3 y3
11 index9 val_y2_table3 val_z2_table3 val_t2_table3 y3
13 index10 val_y1_table4 val_z1_table4 val_t1_table4 y4
15 index11 val_y1_table5 val_z1_table5 val_t1_table5 y5
16 index12 val_y2_table5 val_z2_table5 val_t2_table5 y5
17 index13 val_y3_table5 val_z3_table5 val_t3_table5 y5
18 index14 val_y4_table5 val_z4_table5 val_t4_table5 y5
19 index15 val_y5_table5 val_z5_table5 val_t5_table5 y5
20 index16 val_y6_table5 val_z6_table5 val_t6_table5 y5
21 index17 val_y7_table5 val_z7_table5 val_t7_table5 y5
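If you then need each original table back as its own DataFrame, a small follow-up sketch (assuming new_df from above) is to split the combined frame by the recovered Value y label:
tables = {label: g.drop(columns="Value y") for label, g in new_df.groupby("Value y")}
Each entry of tables holds the rows of one table, keyed by y1, y2, and so on.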

Related

Repeating same operation for multiple columns of another df

I'm quite new to python and pandas so I hope I can get some help.
I have a train_df that looks like this:
x y1 y2 y3 y4
0 -20.0 -0.702864 10.392012 1.013891 -8794.9050
1 -19.9 -0.591605 9.450884 1.231116 -8667.2340
2 -19.8 -0.983952 10.240055 0.675153 -8541.5720
And an ideal_df that looks like this:
x y1 y2 ... y48 y49 y50
0 -20.0 -0.912945 0.408082 ... -0.186278 0.912945 0.396850
1 -19.9 -0.867644 0.497186 ... -0.215690 0.867644 0.476954
2 -19.8 -0.813674 0.581322 ... -0.236503 0.813674 0.549129
Both have 400 rows.
I want to sum up the squared deviation (distance) between the y-values of train_df and ideal_df at each given x-value, e.g.:
For the 1st value of x, y1 from train_df and y1 from ideal_df, then y1 from train_df and y2 from ideal_df, etc.
Then repeat the same for every one of the 400 rows of y1 from train_df.
After that, repeat it for y2, y3, and y4 of train_df, but that is the easy part.
I wrote this
squared_deviations_y1_train = (((train_df.y1)-(ideal_df.loc[:,"y1":"y50"])) ** 2).sum()
But I have no idea what I'm doing to be honest.
Merge/join the two dataframes by index and then, for each yx column of train_df, compute the squared deviation:
train_df = pd.DataFrame(data=[ [-20.0,-0.702864,10.392012,1.013891,-8794.9050], [-19.9,-0.591605,9.450884,1.231116,-8667.2340], [-19.8,-0.983952,10.240055,0.675153,-8541.5720] ], columns=["x","y1","y2","y3","y4"])
ideal_df = pd.DataFrame(data=[ [-20.0,-0.912945,0.408082,-0.186278,0.912945,0.396850], [-19.9,-0.867644,0.497186,-0.215690,0.867644,0.476954], [-19.8,-0.813674,0.581322,-0.236503,0.813674,0.549129] ], columns=["x","y1","y2","y48","y49","y50"])
ideal_df = ideal_df.add_suffix("_i")
result_df = train_df.merge(ideal_df, left_index=True, right_index=True, how="left")
for t_col in train_df.columns:
    if t_col != "x":
        result_df[f"{t_col}_sd"] = sum(
            [(result_df[t_col] - result_df[i_col]) ** 2
             for i_col in ideal_df.columns if i_col != "x_i"]
        )
[Output]:
x y1 y2 y3 y4 x_i y1_i y2_i y48_i y49_i y50_i y1_sd y2_sd y3_sd y4_sd
0 -20.0 -0.702864 10.392012 1.013891 -8794.905 -20.0 -0.912945 0.408082 -0.186278 0.912945 0.396850 5.365406 529.137105 5.911037 3.867627e+08
1 -19.9 -0.591605 9.450884 1.231116 -8667.234 -19.9 -0.867644 0.497186 -0.215690 0.867644 0.476954 4.674201 434.286809 7.737567 3.756179e+08
2 -19.8 -0.983952 10.240055 0.675153 -8541.572 -19.8 -0.813674 0.581322 -0.236503 0.813674 0.549129 8.619554 508.005021 3.091597 3.648075e+08
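As a possible alternative to the column loop, the same row-wise sums can be computed with NumPy broadcasting. This is only a sketch and assumes train_df and ideal_df are row-aligned as above, with ideal_df already carrying the _i suffix:
import numpy as np
train_y = train_df.drop(columns="x").to_numpy()      # shape (n_rows, 4)
ideal_y = ideal_df.drop(columns="x_i").to_numpy()    # shape (n_rows, n_ideal)
# (n_rows, 4, 1) - (n_rows, 1, n_ideal) -> (n_rows, 4, n_ideal); square, then sum over the ideal axis
sd = ((train_y[:, :, None] - ideal_y[:, None, :]) ** 2).sum(axis=2)
sd_df = pd.DataFrame(sd, columns=[f"{c}_sd" for c in train_df.columns if c != "x"])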

Split - apply - save csv for pandas using pure pandas / apply

I have a boolean dataframe indexed by timestamps
df
>>>
timestamp x0 x1 x2
2020-01-01 True False True
2020-01-02 True False True
2020-01-03 False True True
I want to save a csv of the column names for each row where the column is True, with the current timestamp as the csv filename. So in the example above, the desired output would be 3 csv's:
20200101.csv:
x0,
x2,
20200102.csv:
x0,
x2,
20200103.csv:
x1,
x2,
I have managed to do this using a for loop and some pandas methods, but it seems clunky. (This would be almost a one-liner in R, like using split and lapply.)
import numpy as np
import pandas as pd

for idx, row in df.iterrows():
    tmp = row.replace({False: np.nan}).dropna()
    tmp = pd.DataFrame({"my_col": tmp.index.tolist()})
    file_name = ''.join(str(idx.date()).split('-'))
    tmp.to_csv(f"{file_name}.csv", index=False)
Is there a clean way to do this using pure pandas / map reduce / pandas apply and avoiding for loops?
Had to stick with a loop to write out the CSVs.
df_out = df.reset_index().melt(id_vars='timestamp').loc[lambda x: x['value']].sort_values('timestamp')
print(df_out)
timestamp variable value
0 2020-01-01 x0 True
6 2020-01-01 x2 True
1 2020-01-02 x0 True
7 2020-01-02 x2 True
5 2020-01-03 x1 True
8 2020-01-03 x2 True
Resorted to the much-maligned loop for output to CSV:
import re

for t, frame in df_out.groupby('timestamp').variable:
    frame.to_csv(re.sub('-', '', fr'd:\jchtempnew\SO\{t}.csv'),
                 index=None, header=None, line_terminator=',\r\n')
20200101.csv:
x0,
x2,
20200102.csv:
x0,
x2,
20200103.csv:
x1,
x2,
Note that line_terminator=',\r\n' is included in to_csv to put a comma at the end of each line.
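A loop-free looking variant is possible with groupby().apply(), although apply still visits one group at a time under the hood. A sketch, assuming df_out from above and filenames written to the working directory:
df_out.groupby('timestamp').apply(
    lambda g: g['variable'].to_csv(
        f"{str(g.name)[:10].replace('-', '')}.csv",   # g.name is the group's timestamp
        index=False, header=False, line_terminator=',\r\n'))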

Use multiple columns of a dataframe for an operation and save the result in multiple columns

I did go through multiple StackOverflow posts to get an idea of how to solve this but couldn't come up with anything.
So, I have a dataframe with three attributes: id, X1, Y1.
I need to pass each instance/entry of the dataframe to a function (e.g., func) that returns two values: X2, Y2. The operation basically looks like this:
X2, Y2 = func(X1, Y1)
I need to save the X2, Y2 for each entry as a new column so that the new dataframe looks like: id, X1, Y1, X2, Y2
I am not sure how to perform this with pandas. Could you please give me some pointers?
Thanks a lot for your effort and time!
I believe this will do what your question asks (note that func() has been given an arbitrary sample implementation in this example):
import pandas as pd

df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'Y1': [2, 2, 3, 3, 4]
})

def func(a, b):
    return a - b, a + b

df[['X2', 'Y2']] = pd.DataFrame(
    df.apply(lambda x: func(x['X1'], x['Y1']), axis=1).tolist(),
    columns=['foo', 'bar']
)
print(df)
Output:
X1 Y1 X2 Y2
0 1 2 -1 3
1 2 2 0 4
2 3 3 0 6
3 4 3 1 7
4 5 4 1 9
I'm pretty sure we need more details, but you can do something like this with
df.apply(func, axis=1, result_type='expand')
(here func would have to accept a whole row rather than two separate arguments).
Better would be
df["X2"] = df["id"] + df["X1"] + df["Y1"]
I believe the latter is vectorized, while the former is effectively run as a for loop.
Hope this helps
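Another common pattern for the two-return-value case is to unpack with zip and assign both columns at once. A sketch, assuming the df and func from the first answer:
df['X2'], df['Y2'] = zip(*df.apply(lambda r: func(r['X1'], r['Y1']), axis=1))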

How can I link the mean values to the middle values of the other column repetitively in this data frame?

I have a file where data is saved in a way that x represents the data value and t represents the time of the data point as below:
x1 t1
x2 t2
x3 t3
x4 t4
x5 t5
-----
x6 t6
x7 t7
x8 t8
x9 t9
x10 t10
-----
.
.
.
So as you see above, one column holds the data samples and the other holds the time points.
Now what I want to do is take the mean of each five data points and associate it with the middle value (not the mean) of the corresponding five time points, so that I end up with a plot of mean values against those middle time points.
To make it more clear the new desired array will be like:
mean(x1, x2, x3, x4, x5) ----> t3
mean(x6, x7, x8, x9, x10) ----> t8
.
.
.
I could use the pandas module for this, for instance, but I couldn't figure out the algorithm.
I created my own data to show an example:
import pandas as pd
import numpy as np

x = pd.date_range(start='01-01-2020', end='31-10-2020')
df = pd.DataFrame({
    'x': x,
    'y': np.random.rand(len(x))
})
df
Output
x y
0 2020-01-01 0.939691
1 2020-01-02 0.835836
2 2020-01-03 0.893328
3 2020-01-04 0.887928
4 2020-01-05 0.393777
.. ... ...
300 2020-10-27 0.072485
301 2020-10-28 0.797486
302 2020-10-29 0.236217
303 2020-10-30 0.619942
304 2020-10-31 0.471080
[305 rows x 2 columns]
To compute the middle timestep and the mean value, I group by the index divided by 5 using integer division:
df.groupby(df.index // 5).apply(
    lambda g: pd.Series([g['x'].iloc[2], np.mean(g['y'])])
)
Output
0 1
0 2020-01-03 0.790112
1 2020-01-08 0.700751
2 2020-01-13 0.437752
3 2020-01-18 0.531026
4 2020-01-23 0.597368
.. ... ...
56 2020-10-09 0.549869
57 2020-10-14 0.589078
58 2020-10-19 0.388551
59 2020-10-24 0.679042
60 2020-10-29 0.439442
[61 rows x 2 columns]
You could also use the agg() method to aggregate the dataframe. You additionally need to pass in a dictionary specifying the function to use for each column:
N = 5
agg_dictionary = {'data_column': 'mean', 'time_column': lambda col: col.tolist()[N//2]}
df.groupby(df.index // N).agg(agg_dictionary)
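For the example frame built in the first answer (column 'x' holding the timestamps and 'y' the data), the dictionary could look like this (a sketch; it assumes every group has the full N rows):
N = 5
out = df.groupby(df.index // N).agg({'y': 'mean', 'x': lambda col: col.iloc[N // 2]})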
I am not familiar enough with pandas to give a clever one-liner solution, so here is a manual solution using a simple loop:
with open('data.txt', 'r') as f:
    averages_over_five = []   # resulting (x, t) pairs will go here
    current_group = []        # building a group of 5
    for line in f:
        row = line.strip().split()   # splitting on whitespace
        x = int(row[0])   # use float() instead if x is not integer
        t = int(row[1])   # use datetime.datetime.strptime() instead if t is a timestamp
        current_group.append((x, t))
        if len(current_group) == 5:
            xaverage = sum(x for x, t in current_group) / 5
            tmiddle = current_group[2][1]
            averages_over_five.append((xaverage, tmiddle))
            current_group = []
    if len(current_group) > 0:   # if the number of lines was not a multiple of five, what to do with the remainder?
        xaverage = sum(x for x, t in current_group) / len(current_group)
        tmiddle = current_group[len(current_group) // 2][1]
        averages_over_five.append((xaverage, tmiddle))

import matplotlib.pyplot as plt
plt.plot([t for x, t in averages_over_five], [x for x, t in averages_over_five])
Complete solution including reading the file. I assume the chunks are really divided by '-----' lines in the file. If not, you could e.g. check len(values) == 5 instead of looking for the '-'. No special libraries required.
def get_mean(values, times):
    return (sum(values) / len(values), times[len(times) // 2])

result = []
values = []
times = []
with open("filename", "r") as f:
    for line in f:                 # read the file line by line
        if line.startswith("-"):   # one chunk complete
            result.append(get_mean(values, times))
            values.clear()
            times.clear()
        else:                      # normal data line
            l = line.split()
            values.append(float(l[0]))
            times.append(l[1])
if values:   # if the file doesn't end with '-----', append the last chunk
    result.append(get_mean(values, times))
print(result)
Output:
[(3.0, 't3'), (8.0, 't8')]
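Since the question mentions pandas, here is a hedged sketch that combines the file reading with the chunk-of-five grouping shown earlier. It assumes whitespace-separated "x t" rows; the '-----' separator lines parse with a missing t value and are simply dropped:
import pandas as pd

raw = pd.read_csv("filename", sep=r"\s+", header=None, names=["x", "t"])
raw = raw.dropna(subset=["t"]).reset_index(drop=True)   # drop the '-----' separator rows
raw["x"] = raw["x"].astype(float)

out = raw.groupby(raw.index // 5).agg(
    x_mean=("x", "mean"),
    t_mid=("t", lambda s: s.iloc[len(s) // 2]),
)
print(out)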

Replace some values in a dataframe with NaN's if the index of the row does not exist in another dataframe

I have a really large dataframe similar to this:
CustomerId Latitude Longitude
0. a x1 y1
1. a x2 y2
2. b x3 y3
3. c x4 y4
And I have a second dataframe that corresponds to a sample of the first one, like this:
CustomerId Latitude Longitude
0. a x1 y1
3. c x4 y4
My goal is to get a new dataframe just like the original, but with NaN's instead of the coordinates of the rows with indexes that don't exist on the second dataframe. This is the result I would need:
CustomerId Latitude Longitude
0. a x1 y1
1. a NaN NaN
2. b NaN NaN
3. c x4 y4
I am new to Python and I haven't found any question like this one. Does anybody have an idea of how to solve it?
First we create a mask with pandas.Series.isin.
After that we use np.where and invert the mask with ~:
import numpy as np

mask = df.CustomerId.isin(df2.CustomerId)
df['Latitude'] = np.where(~mask, np.nan, df['Latitude'])
df['Longitude'] = np.where(~mask, np.nan, df['Longitude'])
print(df)
CustomerId Latitude Longitude
0.0 a x1 y1
1.0 a x2 y2
2.0 b NaN NaN
3.0 c x4 y4
Explanation:
np.where works as follows: np.where(condition, value_if_true, value_if_false)
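Note that this matches rows on CustomerId. If the rows should instead be matched on the index, as the expected output in the question suggests, a sketch of that variant is:
mask = df.index.isin(df2.index)
df.loc[~mask, ['Latitude', 'Longitude']] = np.nan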
