I have a boolean dataframe indexed by timestamps
df
>>>
timestamp x0 x1 x2
2020-01-01 True False True
2020-01-02 True False True
2020-01-03 False True True
I want to save a CSV of the column names for each row where the column is True, using the row's timestamp as the CSV filename. So in the example above, the desired output would be three CSVs:
20200101.csv:
x0,
x2,
20200102.csv:
x0,
x2,
20200103.csv:
x1,
x2,
I have managed to do this using a for loop and some pandas methods, but it seems clunky. (This would be almost a one-liner in R, like using split and lapply.)
import numpy as np
import pandas as pd

for idx, row in df.iterrows():
    tmp = row.replace({False: np.nan}).dropna()
    tmp = pd.DataFrame({"my_col": tmp.index.tolist()})
    file_name = ''.join(str(idx.date()).split('-'))
    tmp.to_csv(f"{file_name}.csv", index=False)
Is there a clean way to do this using pure pandas / map reduce / pandas apply and avoiding for loops?
Had to stick with a loop to write out the CSVs.
# if 'timestamp' is the index, make it a column first with df = df.reset_index()
df_out = df.melt(id_vars='timestamp').loc[lambda x: x['value']].sort_values('timestamp')
print(df_out)
timestamp variable value
0 2020-01-01 x0 True
6 2020-01-01 x2 True
1 2020-01-02 x0 True
7 2020-01-02 x2 True
5 2020-01-03 x1 True
8 2020-01-03 x2 True
Resorted to the much-maligned loop for output to CSV:
import re

for t, frame in df_out.groupby('timestamp').variable:
    frame.to_csv(re.sub('-', '', fr'd:\jchtempnew\SO\{t}.csv'),
                 index=None, header=None, line_terminator=',\r\n')
20200101.csv:
x0,
x2,
20200102.csv:
x0,
x2,
20200103.csv:
x1,
x2,
Note that line_terminator=',\r\n' is included in to_csv to put a comma at the end of each line (in pandas 1.5+ this argument is spelled lineterminator).
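If you want to avoid the explicit loop, the write step can also be phrased as a groupby-apply. This is a sketch, assuming timestamp is a datetime column (inside apply, .name holds the group key) and writing plain CSVs to the working directory rather than the path above:

df_out.groupby('timestamp').variable.apply(
    lambda s: s.to_csv(f"{s.name:%Y%m%d}.csv", index=False, header=False))

Under the hood this is still a per-group loop, just written without the for.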
I have the following dataset Excel Dummy DataSet that consists of a concatenation of several tables stacked vertically in an Excel sheet. The columns of the different tables are the same (col_x, col_y, col_z, col_t), except that the value Y varies as the tables change (see the figure below).
I somehow managed to get the output I want. However, I wonder if there is a simpler / more efficient way to do this.
This is what I have tried
import pandas as pd

# Import data
path = r"/content/test_data.xlsx"
df_original = pd.read_excel(path, skiprows=4, usecols=range(0, 4), header=None)
df_original.columns = ["col_x", "col_y", "col_z", "col_t"]

# Beginning of the code: find the header rows that separate the stacked tables
mask_col_x = df_original["col_x"] == "col_x"
df_break = df_original[mask_col_x]
index_break_list = df_break.index

# Build (start, stop, y-value) spans for each sub-table
range_list = []
for i, val in enumerate(index_break_list):
    if i < len(index_break_list) - 1:
        span1 = (val + 1, index_break_list[i + 1], df_original["col_y"][val])
    else:
        span1 = (val + 1, len(df_original), df_original["col_y"][val])
    range_list.append(span1)

# Slice out each sub-table and tag it with its y-value
dataframe_list = []
for elt in range_list:
    df_sub = df_original.iloc[elt[0]:elt[1]].copy()
    df_sub["Value y"] = elt[2]
    dataframe_list.append(df_sub)

new_df = pd.concat(dataframe_list, axis=0)
new_df.to_csv("test_data_result_combined.csv")
You can create the column Value y by masking with Series.where, then forward-filling the missing values with ffill, and finally filtering out the header rows with the inverted mask ~:
import pandas as pd

path = "test_data.xlsx"
df_original = pd.read_excel(path, skiprows=4, usecols=range(0, 4), header=None)
df_original.columns = ["col_x", "col_y", "col_z", "col_t"]

mask_col_x = df_original["col_x"] == "col_x"
df_original['Value y'] = df_original["col_y"].where(mask_col_x).ffill()
new_df = df_original[~mask_col_x]
print(new_df)
col_x col_y col_z col_t Value y
1 index1 val_y1_table1 val_z1_table1 val_t1_table1 y1
2 index2 val_y2_table1 val_z2_table1 val_t2_table1 y1
3 index3 val_y3_table1 val_z3_table1 val_t3_table1 y1
4 index4 val_y4_table1 val_z4_table1 val_t4_table1 y1
6 index5 val_y1_table2 val_z1_table2 val_t1_table2 y2
7 index6 val_y2_table2 val_z2_table2 val_t2_table2 y2
8 index7 val_y3_table2 val_z3_table2 val_t3_table2 y2
10 index8 val_y1_table3 val_z1_table3 val_t1_table3 y3
11 index9 val_y2_table3 val_z2_table3 val_t2_table3 y3
13 index10 val_y1_table4 val_z1_table4 val_t1_table4 y4
15 index11 val_y1_table5 val_z1_table5 val_t1_table5 y5
16 index12 val_y2_table5 val_z2_table5 val_t2_table5 y5
17 index13 val_y3_table5 val_z3_table5 val_t3_table5 y5
18 index14 val_y4_table5 val_z4_table5 val_t4_table5 y5
19 index15 val_y5_table5 val_z5_table5 val_t5_table5 y5
20 index16 val_y6_table5 val_z6_table5 val_t6_table5 y5
21 index17 val_y7_table5 val_z7_table5 val_t7_table5 y5
I did go through multiple StackOverflow posts to get an idea of how to solve this but couldn't come up with anything.
So, I have a dataframe with three attributes: id, X1, Y1.
I need to pass each instance/entry of the dataframe to a function (e.g., func) which returns two values: X2, Y2. The operation basically looks like this:
X2, Y2 = func(X1, Y1)
I need to save X2 and Y2 for each entry as new columns, so that the new dataframe looks like: id, X1, Y1, X2, Y2.
I am not sure how to perform this with pandas. Could you please give me some pointers?
Thanks a lot for your effort and time!
I believe this will do what your question asks (note that func() has been given an arbitrary sample implementation in this example):
import pandas as pd

df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'Y1': [2, 2, 3, 3, 4]
})

def func(a, b):
    return a - b, a + b

df[['X2', 'Y2']] = pd.DataFrame(
    df.apply(lambda x: func(x['X1'], x['Y1']), axis=1).tolist(),
    index=df.index)
print(df)
Output:
X1 Y1 X2 Y2
0 1 2 -1 3
1 2 2 0 4
2 3 3 0 6
3 4 3 1 7
4 5 4 1 9
I'm pretty sure we need more details, but you can do this with
df[['X2', 'Y2']] = df.apply(lambda r: func(r['X1'], r['Y1']),
                            axis=1, result_type='expand')
Better would be
df["X2"] = df["id"] + df["X1"] + df["Y1"]
I believe the latter is vectorized, while the former runs as a Python-level loop.
Hope this helps
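For instance, here is a minimal sketch of the fully vectorized route, reusing the sample func(a, b) = (a - b, a + b) from the answer above:

# Vectorized: whole-column arithmetic, no per-row Python calls
df['X2'] = df['X1'] - df['Y1']
df['Y2'] = df['X1'] + df['Y1']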
I have data like this in a CSV file:
x Y
[2,3,4] [3.4,2.5,3.1]
[4,5,2] [6.2,7.5,9.7]
[2,6,9] [4.6,2.5,2.4]
[1,3,6] [8.9,7.5,9.2]
I want to calculate the mean for each list in a row
x Y
[2,3,4] < mean [3.4,2.5,3.1] < mean
[4,5,2] < mean [6.2,7.5,9.7] < mean
[2,6,9] < mean [4.6,2.5,2.4] < mean
[1,3,6] < mean [8.9,7.5,9.2] < mean
and output the mean values to a CSV file.
How can I achieve this using Python (pandas)?
EDIT
After some research, I found the solution to my issue above:
import pandas as pd
import numpy as np
from ast import literal_eval

# CSV file you want to import
filename = "xy.csv"
fields = ['X', 'Y']  # field names
df = pd.read_csv(filename, usecols=fields, quotechar='"', sep=',', low_memory=True)

df.X = df.X.apply(literal_eval)  # parse the string "[2,3,4]" into a real list
df.X = df.X.apply(np.mean)       # calculates the mean of the list in field 'X'
print(df.X)                      # print result

df.Y = df.Y.apply(literal_eval)
df.Y = df.Y.apply(np.mean)       # calculates the mean of the list in field 'Y'
print(df.Y)
Via applymap:

# if the cells are still strings like "[2,3,4]", evaluate them first:
# df = df.applymap(lambda x: sum(eval(x)) / len(eval(x)))
df = df.applymap(np.mean)  # suggested by alex
# or, without numpy, once the cells hold real lists:
# df = df.applymap(lambda x: sum(x) / len(x))
OUTPUT:
x Y
0 3.000000 3.000000
1 3.666667 7.800000
2 5.666667 3.166667
3 3.333333 8.533333
You can use .applymap() with np.mean() to map the dataframe element-wise.
import numpy as np
df = df.applymap(eval) # optional step if your column is a string like a list instead of truly a list
df = df.applymap(np.mean)
Result:
print(df)
x Y
0 3.000000 3.000000
1 3.666667 7.800000
2 5.666667 3.166667
3 3.333333 8.533333
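Note that on recent pandas versions (2.1+), applymap is deprecated in favor of the element-wise DataFrame.map, which takes the same callable:

df = df.map(np.mean)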
I have the following dataframe,
I want to create a new row called test in the index, which checks whether the pd row has the same sign as the cn row (negative is FALSE and positive is TRUE). The new row should say TRUE if and only if pd and cn have the same sign (TRUE with positive, or FALSE with negative). NaN is considered 0, and I want it to count as a different sign.
So I want the final result to look like this,
I know how to add new columns based on conditions and np.where, but I don't know how to add things row-wise.
I have no idea where to start. Any help with like examples or advice on where to start would be great.
Update:
I have added the code for creating the dataframe:

import numpy as np
import pandas as pd

data = {'x1': [-0.00137, True, 0.7], 'x2': [0.00658, False, 0.7],
        'x3': [0.004332, np.nan, np.nan], 'x4': [-0.005762, np.nan, np.nan],
        'x5': [0.005905, np.nan, np.nan], 'x6': [0.001333, False, 0.7],
        'x7': [0.001611, False, 0.7], 'x8': [-0.00089, False, 1],
        'x9': [0.000042, np.nan, np.nan], 'x10': [0.004027, np.nan, np.nan]}
df = pd.DataFrame(data, index=['pd', 'cn', 'td2'])
Here are the test results I wish to reproduce:

data2 = {'x1': [False], 'x2': [False], 'x3': [False], 'x4': [False],
         'x5': [False], 'x6': [False], 'x7': [False], 'x8': [True],
         'x9': [False], 'x10': [False]}
df2 = pd.DataFrame(data2, index=['test'])
You can use a simple comparison:
df.loc['test', :] = df.loc['pd'].ge(0).eq(df.loc['cn'])
Output:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
pd -0.00137 0.00658 0.004332 -0.005762 0.005905 0.001333 0.001611 -0.00089 4.2e-05 0.004027
cn True False NaN NaN NaN False False False NaN NaN
td2 0.7 0.7 NaN NaN NaN 0.7 0.7 1 NaN NaN
test False False False False False False False True False False
Explanation:
In pandas, ge means "greater than or equal", just like >=. Positive values therefore return True (and negative values False), and that boolean row can be compared element-wise with the cn row (it's numpy under the hood).
Note that NaN values will always compare as False (I think that is what you want). If you want to give NaN values a meaning, you can replace df.loc['cn'] with df.loc['cn'].fillna(False), choosing True or False as the fill value.
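For example, to treat a NaN in cn as a negative sign, the one-liner becomes:

df.loc['test', :] = df.loc['pd'].ge(0).eq(df.loc['cn'].fillna(False))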
You should definitely consider sharing your data as text next time, so that we can reproduce your case with the original data.
Anyway, you can do operations row-wise by accessing them individually using .loc, like
import numpy as np

# encode cn as a sign: NaN -> 0, True -> +1, False -> -1
cn_row = [0 if np.isnan(i) else 1 if i else -1 for i in df.loc['cn']]
# same sign when the product is positive; NaN (sign 0) always gives False
df.loc['test'] = (df.loc['pd'] * cn_row) > 0
df.loc['test'] = df.loc['test'].astype(bool)
Edit: Now with your code it is easy to test. We have the following output:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
pd -0.00137 0.00658 0.004332 -0.005762 0.005905 0.001333 0.001611 -0.00089 4.2e-05 0.004027
cn True False NaN NaN NaN False False False NaN NaN
td2 0.7 0.7 NaN NaN NaN 0.7 0.7 1 NaN NaN
test False False False False False False False True False False
As your data is laid out, it would be better to work with the transposed DataFrame. In that case you can just add 'test' as a regular column, checking both signs and treating NaN as a mismatch:

new = df.T
new['test'] = [(new.iloc[row]['pd'] > 0) == (new.iloc[row]['cn'] == True)
               and pd.notna(new.iloc[row]['cn'])
               for row in range(new.shape[0])]
I have a file where data is saved in a way that x represents the data value and t represents the time of the data point as below:
x1 t1
x2 t2
x3 t3
x4 t4
x5 t5
-----
x6 t6
x7 t7
x8 t8
x9 t9
x10 t10
-----
.
.
.
So as you see above, one column holds the data samples and the other holds the time points.
Now what I want to do is take the mean of each five data points and associate it with the middle value (not the mean) of the corresponding five time points, so that I get a plot of mean values where the time axis uses the middle time point of each group.
To make it more clear the new desired array will be like:
mean(x1, x2, x3, x4, x5) ----> t3
mean(x6, x7, x8, x9, x10) ----> t8
.
.
.
I could also use the pandas module for this, for instance, but I couldn't figure out the algorithm.
I created my own data to show an example:

import pandas as pd
import numpy as np

x = pd.date_range(start='01-01-2020', end='31-10-2020')
df = pd.DataFrame({
    'x': x,
    'y': np.random.rand(len(x))
})
df
Output
x y
0 2020-01-01 0.939691
1 2020-01-02 0.835836
2 2020-01-03 0.893328
3 2020-01-04 0.887928
4 2020-01-05 0.393777
.. ... ...
300 2020-10-27 0.072485
301 2020-10-28 0.797486
302 2020-10-29 0.236217
303 2020-10-30 0.619942
304 2020-10-31 0.471080
[305 rows x 2 columns]
To compute the middle timestep and the mean value, I group by the index divided by 5 using integer division:
df.groupby(df.index // 5).apply(
    lambda g: pd.Series([g['x'].iloc[2], np.mean(g['y'])])  # .iloc[2] is positional, so it works in every group
)
Output
0 1
0 2020-01-03 0.790112
1 2020-01-08 0.700751
2 2020-01-13 0.437752
3 2020-01-18 0.531026
4 2020-01-23 0.597368
.. ... ...
56 2020-10-09 0.549869
57 2020-10-14 0.589078
58 2020-10-19 0.388551
59 2020-10-24 0.679042
60 2020-10-29 0.439442
[61 rows x 2 columns]
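If you prefer named columns over 0 and 1, you can return a labelled Series instead (a small variation on the code above; the names t_mid and y_mean are arbitrary):

df.groupby(df.index // 5).apply(
    lambda g: pd.Series({'t_mid': g['x'].iloc[2], 'y_mean': g['y'].mean()})
)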
You could also use the agg() method to aggregate the dataframe. You additionally need to pass in a dictionary that specifies the function used to aggregate each column:
N = 5
agg_dictionary = {'data_column': 'mean', 'time_column': lambda col: col.tolist()[N//2]}
df.groupby(df.index // N).agg(agg_dictionary)
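Applied to the example dataframe above (with its column names x and y), that would look like:

N = 5
df.groupby(df.index // N).agg({'y': 'mean', 'x': lambda col: col.tolist()[N // 2]})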
I am not familiar enough with pandas to give a clever one-liner solution, so here is a manual solution using a simple loop:
averages_over_five = []  # resulting (x, t) pairs will go here
current_group = []       # building a group of 5

with open('data.txt', 'r') as f:
    for line in f:
        if line.startswith('-'):  # skip the ----- divider lines
            continue
        row = line.strip().split()  # splitting on whitespace
        x = int(row[0])  # use float() instead if x is not an integer
        t = int(row[1])  # use datetime.datetime.strptime() instead if t is a timestamp
        current_group.append((x, t))
        if len(current_group) == 5:
            xaverage = sum(x for x, t in current_group) / 5
            tmiddle = current_group[2][1]
            averages_over_five.append((xaverage, tmiddle))
            current_group = []

if len(current_group) > 0:  # number of lines was not a multiple of five: what to do with the remainder?
    xaverage = sum(x for x, t in current_group) / len(current_group)
    tmiddle = current_group[len(current_group) // 2][1]
    averages_over_five.append((xaverage, tmiddle))

import matplotlib.pyplot as plt
plt.plot([t for x, t in averages_over_five], [x for x, t in averages_over_five])
Complete solution including reading the file. I assume the chunks are really divided by lines of - characters in the file. If not, you could e.g. check len(values) == 5 instead of looking for the -. No special libraries required.
def get_mean(values, times):
    return (sum(values) / len(values), times[len(times) // 2])

result = []
values = []
times = []

with open("filename", "r") as f:
    for line in f:  # read the file line by line
        if line.startswith("-"):  # one chunk complete
            result.append(get_mean(values, times))
            values.clear()
            times.clear()
        else:  # normal data line
            l = line.split()
            values.append(float(l[0]))
            times.append(l[1])

if values:  # if the file doesn't end with ---, append the last chunk
    result.append(get_mean(values, times))

print(result)
Output:
[(3.0, 't3'), (8.0, 't8')]