Fastest way to calculate in Pandas? - python

Given these two dataframes:
df1 =
Name Start End
0 A 10 20
1 B 20 30
2 C 30 40
df2 =
0 1
0 5 10
1 15 20
2 25 30
df2 has no column names, but you can assume column 0 is an offset of df1.Start and column 1 is an offset of df1.End. I would like to transpose df2 onto df1 to get the Start and End differences. The final df1 dataframe should look like this:
Name Start End Start_Diff_0 End_Diff_0 Start_Diff_1 End_Diff_1 Start_Diff_2 End_Diff_2
0 A 10 20 5 10 -5 0 -15 -10
1 B 20 30 15 20 5 10 -5 0
2 C 30 40 25 30 15 20 5 10
I have a solution that works, but I'm not satisfied with it because it takes too long to run when processing a dataframe that has millions of rows. Below is a sample test case to simulate processing 30,000 rows. As you can imagine, running the original solution (method_1) on a 1GB dataframe is going to be a problem. Is there a faster way to do this using Pandas, Numpy, or maybe another package?
UPDATE: I've added the provided solutions to the benchmarks.
# Import required modules
import numpy as np
import pandas as pd
import timeit
# Original
def method_1():
df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
# Store data for new columns in a dictionary
new_columns = {}
for index1, row1 in df1.iterrows():
for index2, row2 in df2.iterrows():
key_start = 'Start_Diff_' + str(index2)
key_end = 'End_Diff_' + str(index2)
if (key_start in new_columns):
new_columns[key_start].append(row1[1]-row2[0])
else:
new_columns[key_start] = [row1[1]-row2[0]]
if (key_end in new_columns):
new_columns[key_end].append(row1[2]-row2[1])
else:
new_columns[key_end] = [row1[2]-row2[1]]
# Add dictionary data as new columns
for key, value in new_columns.items():
df1[key] = value
# jezrael - https://stackoverflow.com/a/60843750/452587
def method_2():
df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
# Convert selected columns to 2d numpy array
a = df1[['Start', 'End']].to_numpy()
b = df2[[0, 1]].to_numpy()
# Output is 3d array; convert it to 2d array
c = (a - b[:, None]).swapaxes(0, 1).reshape(a.shape[0], -1)
# Generate columns names and with DataFrame.join; add to original
cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
df1 = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))
# sammywemmy - https://stackoverflow.com/a/60844078/452587
def method_3():
df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
# Create numpy arrays of df1 and df2
df1_start = df1.loc[:, 'Start'].to_numpy()
df1_end = df1.loc[:, 'End'].to_numpy()
df2_start = df2[0].to_numpy()
df2_end = df2[1].to_numpy()
# Use np tile to create shapes that allow elementwise subtraction
tiled_start = np.tile(df1_start, (len(df2), 1)).T
tiled_end = np.tile(df1_end, (len(df2), 1)).T
# Subtract df2 from df1
start = np.subtract(tiled_start, df2_start)
end = np.subtract(tiled_end, df2_end)
# Create columns for start and end
start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
# Create dataframes of start and end
start_df = pd.DataFrame(start, columns=start_columns)
end_df = pd.DataFrame(end, columns=end_columns)
# Lump start and end into one dataframe
lump = pd.concat([start_df, end_df], axis=1)
# Sort the columns by the digits at the end
filtered = lump.columns[lump.columns.str.contains('\d')]
cols = sorted(filtered, key=lambda x: x[-1])
lump = lump.reindex(cols, axis='columns')
# Hook lump back to df1
df1 = pd.concat([df1,lump],axis=1)
print('Method 1:', timeit.timeit(method_1, number=3))
print('Method 2:', timeit.timeit(method_2, number=3))
print('Method 3:', timeit.timeit(method_3, number=3))
Output:
Method 1: 50.506279182
Method 2: 0.08886280600000163
Method 3: 0.10297686199999845

I suggest use here numpy - convert selected columns to 2d numpy array in first step::
a = df1[['Start','End']].to_numpy()
b = df2[[0,1]].to_numpy()
Output is 3d array, convert it to 2d array:
c = (a - b[:, None]).swapaxes(0,1).reshape(a.shape[0],-1)
print (c)
[[ 5 10 -5 0 -15 -10]
[ 15 20 5 10 -5 0]
[ 25 30 15 20 5 10]]
Last generate columns names and with DataFrame.join add to original:
cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
df = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))
print (df)
Name Start End Start_Diff_0 End_Diff_0 Start_Diff_1 End_Diff_1 \
0 A 10 20 5 10 -5 0
1 B 20 30 15 20 5 10
2 C 30 40 25 30 15 20
Start_Diff_2 End_Diff_2
0 -15 -10
1 -5 0
2 5 10

Don't use iterrows(). If you're simply subtracting values, use vectorization with Numpy (Pandas also offers vectorization, but Numpy is faster).
For instance:
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
col_names = "Start_Diff_1 End_Diff_1".split()
df3 = pd.DataFrame(df2.to_numpy() - 10, columns=colnames)
Here df3 equals:
Start_Diff_1 End_Diff_1
0 -5 0
1 5 10
2 15 20
You can also change column names by doing:
df2.columns = "Start_Diff_0 End_Diff_0".split()
You can use f-strings to change column names in a loop, i.e., f"Start_Diff_{i}", where i is a number in a loop
You can also combine multiple dataframes with:
df = pd.concat([df1, df2],axis=1)

This is one way to go about it:
#create numpy arrays of df1 and 2
df1_start = df1.loc[:,'Start'].to_numpy()
df1_end = df1.loc[:,'End'].to_numpy()
df2_start = df2[0].to_numpy()
df2_end = df2[1].to_numpy()
#use np tile to create shapes
#that allow element wise subtraction
tiled_start = np.tile(df1_start,(len(df2),1)).T
tiled_end = np.tile(df1_end,(len(df2),1)).T
#subtract df2 from df1
start = np.subtract(tiled_start,df2_start)
end = np.subtract(tiled_end, df2_end)
#create columns for start and end
start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
#create dataframes of start and end
start_df = pd.DataFrame(start,columns=start_columns)
end_df = pd.DataFrame(end, columns = end_columns)
#lump start and end into one dataframe
lump = pd.concat([start_df,end_df],axis=1)
#sort the columns by the digits at the end
filtered = final.columns[final.columns.str.contains('\d')]
cols = sorted(filtered, key = lambda x: x[-1])
lump = lump.reindex(cols,axis='columns')
#hook lump back to df1
final = pd.concat([df1,lump],axis=1)

Related

how to apply multiplication within pandas dataframe

please advice how to get the following output:
df1 = pd.DataFrame([['1, 2', '2, 2','3, 2','1, 1', '2, 1','3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
import numpy as np
df22 = df2.rename(index = lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), inplace=False, axis=1)
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print (df)
Output:
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 is 'address' of df2 in row, col format (1,2 is first row, second column which is 2, 2,2 is 4 3,2 is 6 etc.)
I need to add values from the 3rd and 4th columns to get something like (2*100x, 4*200y, 6*300x, 1*100x, 3*200y, 5*300x)
the output should be 5000(sum of x's and y's), 0.28 (1400/5000 - % of y's)
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
Output:
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear in your question why df1 is a Dataframe or if it can have more than one row, therefore I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
i, j = map(int, c.split(','))
df22.loc[i, 'val'] += df22.loc[i, j]*df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.

Fill a dataframe with Carthesian product of variably shaped input lists

I want to create a script that fills a dataframe with values that are the Carthesian product of parameters I want to vary in a series of experiments.
My first thought was to use the product function of itertools, however it seems to require a fixed set of input lists.
The output I'm looking for can be generated using this sample:
cols = ['temperature','pressure','power']
l1 = [1, 100, 50.0 ]
l2 = [1000, 10, np.nan]
l3 = [0, 100, np.nan]
data = []
for val in itertools.product(l1,l2,l3): #use itertools to get the Carthesian product of the lists
data.append(val) #make a list of lists to store each variation
df = pd.DataFrame(data, columns=cols).dropna(0) #make a dataframe from the list of lists (dropping NaN values)
However, I would like instead to extract the parameters from dataframes of arbitrary shape and then fill up a dataframe with the product, like so (code doesn't work):
data = [{'parameter':'temperature','value1':1,'value2':100,'value3':50},
{'parameter':'pressure','value1':1000,'value2':10},
{'parameter':'power','value1':0,'value2':100},
]
df = pd.DataFrame(data)
l = []
cols = []
for i in range(df.shape[0]):
l.append(df.iloc[i][1:].to_list()) #store the values of each df row to a separate list
cols.append(df.iloc[i][0]) #store the first value of the row as column header
data = []
for val in itertools.product(l): #ask itertools to parse a list of lists
data.append(val)
df2 = pd.DataFrame(data, columns=cols).dropna(0)
Can you recommend a way about this? My goal is creating the final dataframe, so it's not a requirement to use itertools.
Another alternative without product (nothing wrong with product, though) could be to use .join() with how="cross" to produce successive cross-products:
df2 = df.T.rename(columns=df.iloc[:, 0]).drop(df.columns[0])
df2 = (
df2.iloc[:, [0]]
.join(df2.iloc[:, [1]], how="cross")
.join(df2.iloc[:, [2]], how="cross")
.dropna(axis=0)
)
Result:
temperature pressure power
0 1 1000 0
1 1 1000 100
3 1 10 0
4 1 10 100
9 100 1000 0
10 100 1000 100
12 100 10 0
13 100 10 100
18 50.0 1000 0
19 50.0 1000 100
21 50.0 10 0
22 50.0 10 100
A compacter version with product:
from itertools import product
df2 = pd.DataFrame(
product(*df.set_index("parameter", drop=True).itertuples(index=False)),
columns=df["parameter"]
).dropna(axis=0)

pandas replace all values of a column with a column values that increment by n starting at 0

Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K=(6*125)+1
m = []
for i in range(0,K,125):
m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
it's pretty straight forward, if you're replacing all the data you just need to do
df['TS'] =m
example :
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90

Pandas: Help transforming data and writing better code

I have two data sources I can join by a field and want to summarize them in a chart:
Data
The two DataFrames share column A:
ROWS = 1000
df = pd.DataFrame.from_dict({'A': np.arange(ROWS),
'B': np.random.randint(0, 60, size=ROWS),
'C': np.random.randint(0, 100, size=ROWS)})
df.head()
A B C
0 0 10 11
1 1 7 64
2 2 22 12
3 3 1 67
4 4 34 57
And other which I joined as such:
other = pd.DataFrame.from_dict({'A': np.arange(ROWS),
'D': np.random.choice(['One', 'Two'], ROWS)})
other.set_index('A', inplace=True)
df = df.join(other, on=['A'], rsuffix='_right')
df.head()
A B C D
0 0 10 11 One
1 1 7 64 Two
2 2 22 12 One
3 3 1 67 Two
4 4 34 57 One
Question
A proper way to get a column chart with the count of:
C is GTE50 and D is One
C is GTE50 and D is Two
C is LT50 and D is One
C is LT50 and D is Two
Grouped by B, binned into 0, 1-10, 11-20, 21-30, 21-40, 41+.
IIUC, this can be dramatically simplified to a single groupby, taking advantage of clip and np.ceil to form your groups. A single unstack with 2 levels gives us the B-grouping as our x-axis with bars for each D-C combination:
If you want slightly nicer labels, you can map the groupby values:
(df.groupby(['D',
df.C.ge(50).map({True: 'GE50', False: 'LT50'}),
np.ceil(df.B.clip(lower=0, upper=41)/10).map({0: '0', 1: '1-10', 2: '11-20', 3: '21-30', 4: '31-40', 5: '41+'})
])
.size().unstack([0,1]).plot.bar())
Also it's equivalent to group B on:
pd.cut(df['B'],
bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
right=False,
labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
I arrived to this solution after days of grinding, going back and forth, but there are many things I consider code smells:
groupby returns a sort-of pivot table and melt's purpose is to unpivot data.
The use of dummies for Cx, but not for D? Ultimately they are both categorical data with 2 options. After two days, when I got this first solution I needed a break before trying another branch that treat these two equally.
reset_index, only to set_index lines later. Having to sort_values before set_index
That last summary.unstack().unstack() reads like a big hack.
# %% Cx
df['Cx'] = df['C'].apply(lambda x: 'LT50' if x < 50 else 'GTE50')
df.head()
# %% Bins
df['B_binned'] = pd.cut(df['B'],
bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
right=False,
labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
df.head()
# %% Dummies
s = df['D']
dummies = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
df = pd.concat([df, dummies], axis=1)
df.head()
# %% Summary
summary = df.groupby(['B_binned', 'Cx']).agg({'One': 'sum', 'Two': 'sum'})
summary.reset_index(inplace=True)
summary = pd.melt(summary,
id_vars=['B_binned', 'Cx'],
value_vars=['One', 'Two'],
var_name='D',
value_name='count')
summary.sort_values(['B_binned', 'D', 'Cx'], inplace=True)
summary.set_index(['B_binned', 'D', 'Cx'], inplace=True)
summary
# %% Chart
summary.unstack().unstack().plot(kind='bar')
Numpy
Using numpy arrays to count then construct the DataFrame to plot
labels = np.array(['0', '1-10', '11-20', '21-30', '31-40', '41+'])
ge_lbl = np.array(['GE50', 'LT50'])
u, d = np.unique(df.D.values, return_inverse=True)
bins = np.array([1, 11, 21, 31, 41]).searchsorted(df.B)
ltge = (df.C.values >= 50).astype(int)
shape = (len(u), len(labels), len(ge_lbl))
out = np.zeros(shape, int)
np.add.at(out, (d, bins, ltge), 1)
pd.concat({
d_: pd.DataFrame(o, labels, ge_lbl)
for d_, o in zip(u, out)
}, names=['Cx', 'D'], axis=1).plot.bar()
Tried a different way of doing it.
df['Bins'] = np.where(df['B'].isin([0]), '0',
np.where(df['B'].isin(range(1,11)), '1-10',
np.where(df['B'].isin(range(11,21)), '11-20',
np.where(df['B'].isin(range(21,31)), '21-30',
np.where(df['B'].isin(range(31,40)), '31-40','41+')
))))
df['Class_type'] = np.where(((df['C']>50) & (df['D']== 'One') ), 'C is GTE50 and D is One',
np.where(((df['C']>50) & (df['D']== 'Two')) , 'C is GTE50 and D is Two',
np.where(((df['C']<50) & (df['D']== 'One') ), 'C is LT50 and D is One',
'C is LT50 and D is Two')
))
df.groupby(['Bins', 'Class_type'])['C'].sum().unstack().plot(kind='bar')
plt.show()
#### Output ####
WARNING: Not sure how optimal the solution is.And also it consumes extra space so space complexity may increase.

How to concatenate Pandas Dataframe columns dynamically?

I have a dataframe df (see program below) whose column names and number are not fixed.
However, there is a list ls which will have the list of columns of df that needs to be appended together.
I tried
df['combined'] = df[ls].apply(lambda x: '{}{}{}'.format(x[0], x[1], x[2]), axis=1)
but here I am assuming that the list ls has 3 elements which is hard coding and incorrect.What if the list has 10 elements.. I want to dynamically read the list and append the columns of the dataframe.
import pandas as pd
def main():
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7],
'col_3': [14, 15, 16, 19],
'col_4': [22, 23, 24, 25],
'col_5': [30, 31, 32, 33],
})
ls = ['col_1','col_4', 'col_3']
df['combined'] = df[ls].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
print(df)
if __name__ == '__main__':
main()
You can use ''.join after converting the columns' data type to str:
df[ls].astype(str).apply(''.join, axis=1)
#0 02214
#1 12315
#2 22416
#3 32519
#dtype: object
You can use cumulative sum over strings for this for more speed i.e
df[ls].astype(str).cumsum(1).iloc[:,-1].values
Output :
0 02214
1 12315
2 22416
3 32519
Name: combined, dtype: object
If you need to add space then first add ' ' then find sum i.e
n = (df[ls].astype(str)+ ' ').sum(1)
0 0 22 14
1 1 23 15
2 2 24 16
3 3 25 19
dtype: object
Timings :
ndf = pd.concat([df]*10000)
%%timeit
ndf[ls].astype(str).cumsum(1).iloc[:,-1].values
1 loop, best of 3: 538 ms per loop
%%timeit
ndf[ls].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 1.93 s per loop

Categories

Resources