Pandas: Help transforming data and writing better code - python

I have two data sources I can join by a field and want to summarize them in a chart:
Data
The two DataFrames share column A:
import numpy as np
import pandas as pd

ROWS = 1000
df = pd.DataFrame.from_dict({'A': np.arange(ROWS),
                             'B': np.random.randint(0, 60, size=ROWS),
                             'C': np.random.randint(0, 100, size=ROWS)})
df.head()
A B C
0 0 10 11
1 1 7 64
2 2 22 12
3 3 1 67
4 4 34 57
And other, which I joined like this:
other = pd.DataFrame.from_dict({'A': np.arange(ROWS),
                                'D': np.random.choice(['One', 'Two'], ROWS)})
other.set_index('A', inplace=True)
df = df.join(other, on=['A'], rsuffix='_right')
df.head()
A B C D
0 0 10 11 One
1 1 7 64 Two
2 2 22 12 One
3 3 1 67 Two
4 4 34 57 One
Question
What is a proper way to get a column chart with the counts of:
C is GTE50 and D is One
C is GTE50 and D is Two
C is LT50 and D is One
C is LT50 and D is Two
Grouped by B, binned into 0, 1-10, 11-20, 21-30, 31-40, 41+.

IIUC, this can be dramatically simplified to a single groupby, taking advantage of clip and np.ceil to form your groups. A single unstack with 2 levels gives us the B-grouping as our x-axis with bars for each D-C combination:
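(A sketch of that single groupby, inferred from the labeled variant below; the answer's original unlabeled block isn't reproduced here:)
(df.groupby(['D',                                # One / Two
             df.C.ge(50),                        # True where C >= 50
             np.ceil(df.B.clip(upper=41) / 10)]) # bins 0, 1..5 for B
   .size()
   .unstack([0, 1])
   .plot.bar())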
If you want slightly nicer labels, you can map the groupby values:
(df.groupby(['D',
             df.C.ge(50).map({True: 'GE50', False: 'LT50'}),
             np.ceil(df.B.clip(lower=0, upper=41)/10).map({0: '0', 1: '1-10', 2: '11-20', 3: '21-30', 4: '31-40', 5: '41+'})])
   .size()
   .unstack([0, 1])
   .plot.bar())
The B grouping above is also equivalent to grouping on:
pd.cut(df['B'],
       bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
       right=False,
       labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
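For example, plugging that pd.cut binning into the same groupby (a sketch reusing the C mapping from the snippet above):
b_bins = pd.cut(df['B'],
                bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
                right=False,
                labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
(df.groupby(['D', df.C.ge(50).map({True: 'GE50', False: 'LT50'}), b_bins])
   .size().unstack([0, 1]).plot.bar())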

I arrived at this solution after days of grinding, going back and forth, but there are many things I consider code smells:
groupby returns a sort-of pivot table and melt's purpose is to unpivot data.
The use of dummies for Cx, but not for D? Ultimately they are both categorical data with 2 options. After two days, when I got this first solution, I needed a break before trying another branch that treats these two equally.
reset_index, only to set_index a few lines later. Having to sort_values before set_index.
That last summary.unstack().unstack() reads like a big hack.
# %% Cx
df['Cx'] = df['C'].apply(lambda x: 'LT50' if x < 50 else 'GTE50')
df.head()

# %% Bins
df['B_binned'] = pd.cut(df['B'],
                        bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
                        right=False,
                        labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
df.head()

# %% Dummies
s = df['D']
dummies = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
df = pd.concat([df, dummies], axis=1)
df.head()

# %% Summary
summary = df.groupby(['B_binned', 'Cx']).agg({'One': 'sum', 'Two': 'sum'})
summary.reset_index(inplace=True)
summary = pd.melt(summary,
                  id_vars=['B_binned', 'Cx'],
                  value_vars=['One', 'Two'],
                  var_name='D',
                  value_name='count')
summary.sort_values(['B_binned', 'D', 'Cx'], inplace=True)
summary.set_index(['B_binned', 'D', 'Cx'], inplace=True)
summary

# %% Chart
summary.unstack().unstack().plot(kind='bar')
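An aside on the dummies smell above: one compact way to treat D and Cx symmetrically is pd.crosstab; a sketch using the columns created in the cells above (not part of the original attempt):
pd.crosstab(index=df['B_binned'], columns=[df['D'], df['Cx']]).plot.bar()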

Numpy
Using numpy arrays to count then construct the DataFrame to plot
labels = np.array(['0', '1-10', '11-20', '21-30', '31-40', '41+'])
ge_lbl = np.array(['GE50', 'LT50'])

u, d = np.unique(df.D.values, return_inverse=True)
bins = np.array([1, 11, 21, 31, 41]).searchsorted(df.B)
ltge = (df.C.values >= 50).astype(int)

shape = (len(u), len(labels), len(ge_lbl))
out = np.zeros(shape, int)
np.add.at(out, (d, bins, ltge), 1)

pd.concat({
    d_: pd.DataFrame(o, labels, ge_lbl)
    for d_, o in zip(u, out)
}, names=['Cx', 'D'], axis=1).plot.bar()

Tried a different way of doing it.
import matplotlib.pyplot as plt

df['Bins'] = np.where(df['B'].isin([0]), '0',
             np.where(df['B'].isin(range(1, 11)), '1-10',
             np.where(df['B'].isin(range(11, 21)), '11-20',
             np.where(df['B'].isin(range(21, 31)), '21-30',
             np.where(df['B'].isin(range(31, 41)), '31-40', '41+')))))

df['Class_type'] = np.where((df['C'] >= 50) & (df['D'] == 'One'), 'C is GTE50 and D is One',
                   np.where((df['C'] >= 50) & (df['D'] == 'Two'), 'C is GTE50 and D is Two',
                   np.where((df['C'] < 50) & (df['D'] == 'One'), 'C is LT50 and D is One',
                            'C is LT50 and D is Two')))

# Count rows per (Bins, Class_type) group and plot
df.groupby(['Bins', 'Class_type']).size().unstack().plot(kind='bar')
plt.show()
(Output: grouped bar chart, image not reproduced here.)
WARNING: I'm not sure how optimal this solution is, and it creates extra columns, so memory usage may increase.

Related

Dataframe filtering with multiple conditions on different columns

Let's say we have the following dataframe:
data = {'Item': ['1', '2', '3', '4', '5'],
        'A': [142, 11, 50, 60, 12],
        'B': [55, 65, 130, 14, 69],
        'C': [68, -18, 65, 16, 17],
        'D': [60, 0, 150, 170, 130],
        'E': [230, 200, 5, 10, 160]}
df = pd.DataFrame(data)
representing different items and the corresponding values related to some of their parameters (e.g. length, width, and so on). However, not all the reported values are acceptable, in fact each item has a different range of allowed values:
A and B go from -100 to +100
C goes from -70 to +70
D and E go from +100 to +300
So, as you can see, some items show values outside limits for just one parameter (e.g. item 2 is outside for parameter D), while others are outside for more than one parameter (e.g. item 1 is outside for both parameters A and D).
I need to analyze these data and get a table reporting:
how many items are outside the limits for just one parameter, and the name of that parameter
how many items are outside the limits for more than one parameter, and the names of those parameters
To be clearer: I need a simple way to know how many items failed and for which parameters. For example: four items out of five failed, and 2 of them (item #1 and item #3) for two parameters (item #1 for A and D, item #3 for B and E), while items #2 and #4 are out for one parameter (item #2 for D, item #4 for E).
I have tried to define the following masks:
df_mask_1 = abs(df['A'])>=100
df_mask_2 = abs(df['B'])>=100
df_mask_3 = abs(df['C'])>=70
df_mask_4 = ((df['D']<=110) | (df['D']>=800))
df_mask_5 = ((df['E']<=110) | (df['E']>=800))
to get the filtered dataframe:
filtered_df = df[df_mask_1 & df_mask_2 & df_mask_3 & df_mask_4 & df_mask_5]
but what I obtain is just an empty dataframe. I have also tried with
filtered_df = df.loc[(df_mask_1) & (df_mask_2) & (df_mask_3) & (df_mask_4) & (df_mask_5)]
but the result does not change.
Any suggestion?
Use a condition list, then flatten your dataframe with melt, keep the rows where the condition is False (~x), and aggregate the failing parameter names per item with groupby + apply:
condlist = [df['A'].between(-100, 100),
            df['B'].between(-100, 100),
            df['C'].between(-70, 70),
            df['D'].between(100, 300),
            df['E'].between(100, 300)]

df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
               .loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)
# Output:
Item A B C D E Fail
0 1 142 55 68 60 230 [A, D]
1 2 11 65 -18 0 200 [D]
2 3 50 130 65 150 5 [B, E]
3 4 60 14 16 170 10 [E]
4 5 12 69 17 130 160 NaN
Note: if your dataframe is large and you only need to display failed items, use : df[df['Fail'].notna()] to filter out your dataframe.
Note 2: variable and value are the default column names when you melt a dataframe.
The result from the example code is what it should be. The code filtered_df = df[df_mask_1 & df_mask_2 & df_mask_3 & df_mask_4 & df_mask_5] applies all masks to the data at once, hence df_mask_1 and df_mask_2 alone would already produce an empty table (there are no items where A and B are both at least 100 in absolute value).
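For example, to list the items that break at least one limit, those same masks can be combined with | (OR) instead of & (a small illustration reusing the masks defined in the question):
failed_any = df[df_mask_1 | df_mask_2 | df_mask_3 | df_mask_4 | df_mask_5]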
If you are looking to create a table with information on how many items have parameters outside the limits, I'd suggest following Corralien's answer and summing the condition matrix per row.
Taking Corralien's answer, then
a = pd.concat([df['A'].between(-100, 100),
               df['B'].between(-100, 100),
               df['C'].between(-70, 70),
               df['D'].between(100, 300),
               df['E'].between(100, 300)], axis=1)
b = a.sum(axis=1)
where b gives, per item, the number of parameters that are within their limits (the conditions in a are True when a value is inside its allowed range):
0 3
1 4
2 3
3 4
4 5
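If the number of limits broken per item is wanted instead, inverting the mask before summing gives it directly (a small sketch building on a above):
broken = (~a).sum(axis=1)
# 0    2
# 1    1
# 2    2
# 3    1
# 4    0
# dtype: int64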
Another approach is to use the pandas.DataFrame.agg function like so:
import pandas as pd
# Defining a limit checker function
def limit_check(values, limit):
    return (abs(values) >= limit).sum()

# Applying different limit checks to columns
res = df.agg({
    'A': lambda x: limit_check(x, 100),
    'B': lambda x: limit_check(x, 100),
    'C': lambda x: limit_check(x, 70),
    'D': lambda x: limit_check(x, 800),
    'E': lambda x: limit_check(x, 800),
})
print(res)
For the sample data you provided it will result in
A 1
B 1
C 0
D 0
E 0
dtype: int64
Using the answer by Corralien and taking advantage of what's written by tiitinha, and considering the possibility to have some NaN values, here is how I put it all together:
df.replace([np.nan], np.inf, inplace=True)

# parentheses around the == comparisons are required because | binds more tightly than ==
condlist = [df['A'].between(-100, 100) | (df['A'] == np.inf),
            df['B'].between(-100, 100) | (df['B'] == np.inf),
            df['C'].between(-70, 70) | (df['C'] == np.inf),
            df['D'].between(100, 300) | (df['D'] == np.inf),
            df['E'].between(100, 300) | (df['E'] == np.inf)]
To get the total number of failed parameters for each item:
bool_df = ~pd.concat(condlist, axis=1).astype('bool')
df['#Fails'] = bool_df.sum(axis=1)
To know which parameters are out of limits for each item:
df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
               .loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)
In this way I get two columns with the wanted results.

Fastest way to calculate in Pandas?

Given these two dataframes:
df1 =
Name Start End
0 A 10 20
1 B 20 30
2 C 30 40
df2 =
0 1
0 5 10
1 15 20
2 25 30
df2 has no column names, but you can assume column 0 is an offset of df1.Start and column 1 is an offset of df1.End. I would like to transpose df2 onto df1 to get the Start and End differences. The final df1 dataframe should look like this:
Name Start End Start_Diff_0 End_Diff_0 Start_Diff_1 End_Diff_1 Start_Diff_2 End_Diff_2
0 A 10 20 5 10 -5 0 -15 -10
1 B 20 30 15 20 5 10 -5 0
2 C 30 40 25 30 15 20 5 10
I have a solution that works, but I'm not satisfied with it because it takes too long to run when processing a dataframe that has millions of rows. Below is a sample test case to simulate processing 30,000 rows. As you can imagine, running the original solution (method_1) on a 1GB dataframe is going to be a problem. Is there a faster way to do this using Pandas, Numpy, or maybe another package?
UPDATE: I've added the provided solutions to the benchmarks.
# Import required modules
import numpy as np
import pandas as pd
import timeit
# Original
def method_1():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Store data for new columns in a dictionary
    new_columns = {}
    for index1, row1 in df1.iterrows():
        for index2, row2 in df2.iterrows():
            key_start = 'Start_Diff_' + str(index2)
            key_end = 'End_Diff_' + str(index2)
            if key_start in new_columns:
                new_columns[key_start].append(row1[1] - row2[0])
            else:
                new_columns[key_start] = [row1[1] - row2[0]]
            if key_end in new_columns:
                new_columns[key_end].append(row1[2] - row2[1])
            else:
                new_columns[key_end] = [row1[2] - row2[1]]
    # Add dictionary data as new columns
    for key, value in new_columns.items():
        df1[key] = value

# jezrael - https://stackoverflow.com/a/60843750/452587
def method_2():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Convert selected columns to a 2d numpy array
    a = df1[['Start', 'End']].to_numpy()
    b = df2[[0, 1]].to_numpy()
    # Output is a 3d array; convert it to a 2d array
    c = (a - b[:, None]).swapaxes(0, 1).reshape(a.shape[0], -1)
    # Generate column names and add them to the original with DataFrame.join
    cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
    df1 = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))

# sammywemmy - https://stackoverflow.com/a/60844078/452587
def method_3():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Create numpy arrays of df1 and df2
    df1_start = df1.loc[:, 'Start'].to_numpy()
    df1_end = df1.loc[:, 'End'].to_numpy()
    df2_start = df2[0].to_numpy()
    df2_end = df2[1].to_numpy()
    # Use np.tile to create shapes that allow elementwise subtraction
    tiled_start = np.tile(df1_start, (len(df2), 1)).T
    tiled_end = np.tile(df1_end, (len(df2), 1)).T
    # Subtract df2 from df1
    start = np.subtract(tiled_start, df2_start)
    end = np.subtract(tiled_end, df2_end)
    # Create columns for start and end
    start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
    end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
    # Create dataframes of start and end
    start_df = pd.DataFrame(start, columns=start_columns)
    end_df = pd.DataFrame(end, columns=end_columns)
    # Lump start and end into one dataframe
    lump = pd.concat([start_df, end_df], axis=1)
    # Sort the columns by the digits at the end
    filtered = lump.columns[lump.columns.str.contains(r'\d')]
    cols = sorted(filtered, key=lambda x: x[-1])
    lump = lump.reindex(cols, axis='columns')
    # Hook lump back to df1
    df1 = pd.concat([df1, lump], axis=1)
print('Method 1:', timeit.timeit(method_1, number=3))
print('Method 2:', timeit.timeit(method_2, number=3))
print('Method 3:', timeit.timeit(method_3, number=3))
Output:
Method 1: 50.506279182
Method 2: 0.08886280600000163
Method 3: 0.10297686199999845
I suggest using numpy here - convert the selected columns to a 2d numpy array in the first step:
a = df1[['Start','End']].to_numpy()
b = df2[[0,1]].to_numpy()
Output is 3d array, convert it to 2d array:
c = (a - b[:, None]).swapaxes(0,1).reshape(a.shape[0],-1)
print (c)
[[ 5 10 -5 0 -15 -10]
[ 15 20 5 10 -5 0]
[ 25 30 15 20 5 10]]
Last, generate the column names and add them to the original with DataFrame.join:
cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
df = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))
print (df)
Name Start End Start_Diff_0 End_Diff_0 Start_Diff_1 End_Diff_1 \
0 A 10 20 5 10 -5 0
1 B 20 30 15 20 5 10
2 C 30 40 25 30 15 20
Start_Diff_2 End_Diff_2
0 -15 -10
1 -5 0
2 5 10
Don't use iterrows(). If you're simply subtracting values, use vectorization with Numpy (Pandas also offers vectorization, but Numpy is faster).
For instance:
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
col_names = "Start_Diff_1 End_Diff_1".split()
df3 = pd.DataFrame(df2.to_numpy() - 10, columns=col_names)
Here df3 equals:
Start_Diff_1 End_Diff_1
0 -5 0
1 5 10
2 15 20
You can also change column names by doing:
df2.columns = "Start_Diff_0 End_Diff_0".split()
You can use f-strings to build the column names in a loop, e.g. f"Start_Diff_{i}", where i is the loop counter.
You can also combine multiple dataframes with:
df = pd.concat([df1, df2],axis=1)
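Putting the f-string naming and pd.concat tips together, a rough sketch of such a loop (variable names here are illustrative, not from the original answer):
frames = []
for i, (start_off, end_off) in enumerate(df2.to_numpy()):
    diff = pd.DataFrame(df1[['Start', 'End']].to_numpy() - [start_off, end_off],
                        columns=[f"Start_Diff_{i}", f"End_Diff_{i}"])
    frames.append(diff)
result = pd.concat([df1] + frames, axis=1)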
This is one way to go about it:
# Create numpy arrays of df1 and df2
df1_start = df1.loc[:, 'Start'].to_numpy()
df1_end = df1.loc[:, 'End'].to_numpy()
df2_start = df2[0].to_numpy()
df2_end = df2[1].to_numpy()

# Use np.tile to create shapes that allow elementwise subtraction
tiled_start = np.tile(df1_start, (len(df2), 1)).T
tiled_end = np.tile(df1_end, (len(df2), 1)).T

# Subtract df2 from df1
start = np.subtract(tiled_start, df2_start)
end = np.subtract(tiled_end, df2_end)

# Create columns for start and end
start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
end_columns = [f'End_Diff_{num}' for num in range(len(df2))]

# Create dataframes of start and end
start_df = pd.DataFrame(start, columns=start_columns)
end_df = pd.DataFrame(end, columns=end_columns)

# Lump start and end into one dataframe
lump = pd.concat([start_df, end_df], axis=1)

# Sort the columns by the digits at the end
filtered = lump.columns[lump.columns.str.contains(r'\d')]
cols = sorted(filtered, key=lambda x: x[-1])
lump = lump.reindex(cols, axis='columns')

# Hook lump back to df1
final = pd.concat([df1, lump], axis=1)

How to do filter by two criteria in creating a new column in pandas

I want to create a new column that returns a value when criteria are matched in both columns of an existing df.
df = pd.DataFrame({
    'first_column': [1, 2, 3, 5],
    'second_column': [10, 20, 30, 50]
})
df.loc[df.first_column <= 3, 'new_column'] = 'good'
df.loc[df.first_column == 1, 'new_column'] = 'bad'
df.loc[df.first_column >= 4, 'new_column'] = 'great'
This works for one condition (though I assume there is a way to say "between 2 and 3", which is what I really want for the first line).
But I can't figure out how to do something where I could say: if df.first_column >= 4 AND df.second_column >= 50, then 'super great'.
Method 1: pd.cut
What you want is pd.cut, which assigns 'labels' to certain 'ranges', in this case called bins:
df['new_column'] = pd.cut(df['first_column'],
                          bins=[-np.inf, 1, 3, 50, np.inf],
                          labels=['bad', 'good', 'great', 'supergreat'])
first_column second_column new_column
0 1 10 bad
1 2 20 good
2 3 30 good
3 5 50 great
Method 2: np.select
We can also use numpy.select, which takes multiple conditions and, based on those conditions, returns a value (choice):
conditions = [
    df['first_column'] <= 1,
    df['first_column'].between(1, 3),
    (df['first_column'] >= 4) & (df['second_column'] >= 50),
    df['first_column'] >= 4
]
choices = ['bad', 'good', 'supergreat', 'great']
df['new_column'] = np.select(conditions, choices)
first_column second_column new_column
0 1 10 bad
1 2 20 good
2 3 30 good
3 5 50 supergreat
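The single combined condition from the question can also be written directly with .loc, in the same style as the question's own attempts (a minimal sketch, complementary to the two methods above):
df.loc[df['first_column'].between(2, 3), 'new_column'] = 'good'
df.loc[(df['first_column'] >= 4) & (df['second_column'] >= 50), 'new_column'] = 'super great'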

How to concatenate Pandas Dataframe columns dynamically?

I have a dataframe df (see program below) whose column names and number are not fixed.
However, there is a list ls which will have the list of columns of df that needs to be appended together.
I tried
df['combined'] = df[ls].apply(lambda x: '{}{}{}'.format(x[0], x[1], x[2]), axis=1)
but here I am assuming that the list ls has 3 elements, which is hard-coded and incorrect. What if the list has 10 elements? I want to read the list dynamically and append those columns of the dataframe.
import pandas as pd

def main():
    df = pd.DataFrame({
        'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7],
        'col_3': [14, 15, 16, 19],
        'col_4': [22, 23, 24, 25],
        'col_5': [30, 31, 32, 33],
    })
    ls = ['col_1', 'col_4', 'col_3']
    df['combined'] = df[ls].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
    print(df)

if __name__ == '__main__':
    main()
You can use ''.join after converting the columns' data type to str:
df[ls].astype(str).apply(''.join, axis=1)
#0 02214
#1 12315
#2 22416
#3 32519
#dtype: object
For more speed, you can use a cumulative sum over the strings and take the last column, i.e.
df[ls].astype(str).cumsum(1).iloc[:,-1].values
Output :
0 02214
1 12315
2 22416
3 32519
Name: combined, dtype: object
If you need to add a space between values, first append ' ' and then take the sum, i.e.
n = (df[ls].astype(str)+ ' ').sum(1)
0 0 22 14
1 1 23 15
2 2 24 16
3 3 25 19
dtype: object
Timings :
ndf = pd.concat([df]*10000)
%%timeit
ndf[ls].astype(str).cumsum(1).iloc[:,-1].values
1 loop, best of 3: 538 ms per loop
%%timeit
ndf[ls].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 1.93 s per loop

How to place existing columns under a hierarchy?

My imported data from Excel has been multi-indexed with the time stamp column in Pandas. I would like to place the remaining columns into their respective hierarchical groups: the frequency band columns ('delta', 'theta', 'alpha', 'beta', 'high beta', 'gamma') under a hierarchy column labeled 'EMG', and the biological measures ('Heart Rate Variabilty' and 'GSR') under 'Biofeedback'.
Is there a straightforward way to do this?
The second part is: how can a single-level dataframe be appended to this multi-index hierarchical dataframe without flattening the hierarchy created in the first part?
You can create MultiIndex.from_arrays and then reindex:
cols = ['delta', 'theta', 'alpha', 'beta', 'Heart Rate Variabilty', 'high beta', 'gamma', 'GSR']
df = pd.DataFrame([[4, 5, 8, 3, 1, 0, 9, 2]], columns=cols)
print(df)
   delta  theta  alpha  beta  Heart Rate Variabilty  high beta  gamma  GSR
0      4      5      8     3                      1          0      9    2

c1 = ['delta', 'theta', 'alpha', 'beta', 'high beta', 'gamma']
c2 = ['Heart Rate Variabilty', 'GSR']
mux = pd.MultiIndex.from_arrays([['EMG'] * len(c1) + ['Biofeedback'] * len(c2), c1 + c2])
print(mux)
MultiIndex(levels=[['Biofeedback', 'EMG'],
                   ['GSR', 'Heart Rate Variabilty', 'alpha', 'beta',
                    'delta', 'gamma', 'high beta', 'theta']],
           labels=[[1, 1, 1, 1, 1, 1, 0, 0], [4, 7, 2, 3, 6, 5, 1, 0]])

df = df.reindex(columns=mux, level=1)
print(df)
    EMG                                   Biofeedback
  delta theta alpha beta high beta gamma  Heart Rate Variabilty GSR
0     4     5     8    3         0     9                      1   2
EDIT, from the comments - thank you to the OP for the final solution:
df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
df = pd.concat([df, df1], axis=1)
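As a self-contained illustration of that edit against the two-level example built in this answer (df1 here is a hypothetical single-level frame whose column names are invented for the example):
df1 = pd.DataFrame([[7, 8]], columns=['Session', 'Subject'])     # single-level columns (hypothetical)
df1.columns = pd.MultiIndex.from_tuples([(c, '') for c in df1])  # lift to the same number of levels as df
combined = pd.concat([df, df1], axis=1)                          # df's hierarchy stays intact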
