I have two data frames, and I'm trying to use the entries in df1 to cap the amounts in df2, then add them up. My code seems to cap correctly, but it isn't adding the amounts up.
Code:
import pandas as pd
df1 = pd.DataFrame({'Caps':['25','45','65']})
df2 = pd.DataFrame({'Amounts':['45','25','65','35','85']})
df1['Capped'] = df1.apply(lambda row: df2['Amounts'].where(
df2['Amounts'] <= row['Caps'], row['Caps']).sum(), axis=1)
Output:
>>> df1
Caps Capped
0 25 2525252525
1 45 4525453545
2 65 4525653565
First, it's necessary to convert the values to integers with Series.astype; with strings, the comparison is lexicographic and sum concatenates rather than adds:
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
df1['Capped'] = df1.apply(lambda row: df2['Amounts'].where(
df2['Amounts'] <= row['Caps'], row['Caps']).sum(), axis=1)
print (df1)
Caps Capped
0 25 125
1 45 195
2 65 235
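As a side note, the same per-row capping can be written with Series.clip; a minimal sketch on the integer-converted columns:
df1['Capped'] = df1['Caps'].apply(lambda c: df2['Amounts'].clip(upper=c).sum())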
To improve performance, it's possible to use numpy.where with broadcasting:
import numpy as np
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
#pandas below 0.24
#am = df2['Amounts'].values
#ca = df1['Caps'].values
df1['Capped'] = np.where(am <= ca[:, None], am[None, :], ca[:, None]).sum(axis=1)
print (df1)
Caps Capped
0 25 125
1 45 195
2 65 235
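Since capping at a threshold is just an element-wise minimum, the np.where line can also be written with numpy.minimum; a minimal equivalent sketch reusing am and ca from above:
df1['Capped'] = np.minimum(am, ca[:, None]).sum(axis=1)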
I understand the title might not be very clear, but please hear me out.
I have a pandas dataframe column of ~850 unique display sizes, e.g.
1 320x480
2 480x320
3 382x215
4 676x320
5 694x320
6 1080x2123
7 2094x1080
8 1080x2020
I want to match/convert them to the closest possible standard display sizes (there are ~20 of them provided in the use-case dataset).
320x350
320x480
480x320
640x360
800x600
1024x768
1280x720
1280x800
1280x1024
1360x768
1366x768
1440x900
1536x864
1600x900
I tried separating height and width into separate columns and rounding them up, but it's still creating a lot of non-standard display sizes (for my use case).
How can I achieve this?
The idea is to convert both columns/Series to DataFrames with Series.str.split, then cross join them with DataFrame.merge, compute and sum the absolute differences, select the rows with minimal differences via DataFrameGroupBy.idxmin and DataFrame.loc, and finally join the pieces back together, using DataFrame.pop to consume and drop the columns:
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
df['diff'] = df['0_x'].sub(df['0_y']).abs() + df['1_x'].sub(df['1_y']).abs()
df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('0_y').astype(str), 'x')
df['b'] = df.pop('1_x').astype(str).str.cat(df.pop('1_y').astype(str), 'x')
print (df)
a diff b
1 320x320 0 480x480
28 382x320 197 215x350
16 480x480 0 320x320
45 676x640 76 320x360
59 694x640 94 320x360
106 1080x1280 1196 2020x1024
78 1080x1280 1299 2123x1024
97 2094x1600 674 1080x900
A similar idea with Euclidean distance; note that with the sample data some matches differ from the Manhattan version (compare the rows for 382x215):
import numpy as np
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
df['diff'] = np.sqrt(df['0_x'].sub(df['0_y']) ** 2 + df['1_x'].sub(df['1_y']) ** 2)
df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('0_y').astype(str), 'x')
df['b'] = df.pop('1_x').astype(str).str.cat(df.pop('1_y').astype(str), 'x')
print (df)
a diff b
1 320x320 0.000000 480x480
30 382x480 143.627992 215x320
16 480x480 0.000000 320x320
45 676x640 53.814496 320x360
59 694x640 67.201190 320x360
106 1080x1280 1015.881883 2020x1024
78 1080x1280 1117.050133 2123x1024
97 2094x1600 525.771814 1080x900
Another numpy solution:
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
a1 = np.sqrt(np.square(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) +
np.square(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))
df1['b1'] = df2['col'].to_numpy()[np.argmin(a1, axis=1)]
a2 = (np.abs(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) +
np.abs(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))
df1['b2'] = df2['col'].to_numpy()[np.argmin(a2, axis=1)]
print (df1)
col b1 b2
1 320x480 320x480 320x480
2 480x320 480x320 480x320
3 382x215 480x320 320x350
4 676x320 640x360 640x360
5 694x320 640x360 640x360
6 1080x2123 1280x1024 1280x1024
7 2094x1080 1600x900 1600x900
8 1080x2020 1280x1024 1280x1024
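If SciPy is installed, the pairwise distances can also be delegated to scipy.spatial.distance.cdist; a sketch, assuming df11 and df22 from the previous block:
from scipy.spatial.distance import cdist
# pairwise (width, height) distances between observed and standard sizes
d_euc = cdist(df11.to_numpy(), df22.to_numpy(), metric='euclidean')
d_man = cdist(df11.to_numpy(), df22.to_numpy(), metric='cityblock')
df1['b1'] = df2['col'].to_numpy()[d_euc.argmin(axis=1)]
df1['b2'] = df2['col'].to_numpy()[d_man.argmin(axis=1)]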
Say I have a dataframe like the one below, read in from a file (note: *.ene is a plain-text file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to change only the TS column. I wish to replace all the values of the 'TS' column with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight into how to do this in a general way.
I used a for loop to store the values into a list:
K = (6*125) + 1
m = []
for i in range(0, K, 125):
    m.append(i)
I thought of using .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward: if you're replacing all of the column's values, you just need to do
df['TS'] = m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
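One caveat for the original question: the assigned list must contain exactly one value per row. range(0,751,125) yields seven values (0, 125, 250, 375, 500, 625, 750), so it won't fit a six-row frame as-is (the desired output above, which skips 375, appears to drop one value). A minimal sketch that always matches the row count, assuming df was read with pd.read_fwf as in the question:
# one value per row, stepping by 125
df['TS'] = range(0, len(df) * 125, 125)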
I'm trying something new. I want to populate a new df column based on conditions on the values in another column.
I have a data frame with two columns (ID, Retailer). I want to populate the Retailer column based on the ids in the ID column. I know how to do this in SQL, using a CASE statement, but how can I go about it in pandas?
I've had a look at this example, but it isn't exactly what I'm looking for:
Python : populate a new column with an if/else statement
import pandas as pd
data = {'ID':['112','5898','32','9985','23','577','17','200','156']}
df = pd.DataFrame(data)
df['Retailer']=''
if df['ID'] in (112,32):
    df['Retailer']='Webmania'
elif df['ID'] in (5898):
    df['Retailer']='DataHub'
elif df['ID'] in (9985):
    df['Retailer']='TorrentJunkie'
elif df['ID'] in (23):
    df['Retailer']='Apptronix'
else: df['Retailer']='Other'
print(df)
The output I'm expecting to see would be something along these lines:
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
Use numpy.select; to test multiple values, use Series.isin. Since the sample data holds strings, compare against strings such as '112' rather than the number 112:
import numpy as np
m1 = df['ID'].isin(['112','32'])
m2 = df['ID'] == '5898'
m3 = df['ID'] == '9985'
m4 = df['ID'] == '23'
vals = ['Webmania', 'DataHub', 'TorrentJunkie', 'Apptronix']
masks = [m1, m2, m3, m4]
df['Retailer'] = np.select(masks, vals, default='Other')
print(df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
If there are many categories, it's also possible to adapt your approach into a custom function:
def get_data(x):
    if x in ('112','32'):
        return 'Webmania'
    elif x == '5898':
        return 'DataHub'
    elif x == '9985':
        return 'TorrentJunkie'
    elif x == '23':
        return 'Apptronix'
    else:
        return 'Other'
df['Retailer'] = df['ID'].apply(get_data)
print (df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
Or use Series.map with a dictionary; IDs with no match become NaN, so fillna is added:
d = {'112': 'Webmania','32':'Webmania',
'5898':'DataHub',
'9985':'TorrentJunkie',
'23':'Apptronix'}
df['Retailer'] = df['ID'].map(d).fillna('Other')
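When several IDs share the same retailer, the dictionary can also be built with dict.fromkeys to avoid repeating values; a small equivalent sketch:
d = {**dict.fromkeys(['112', '32'], 'Webmania'),
     '5898': 'DataHub',
     '9985': 'TorrentJunkie',
     '23': 'Apptronix'}
df['Retailer'] = df['ID'].map(d).fillna('Other')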
I have a pandas DataFrame with hierarchical column names like this
import pandas as pd
import numpy as np
np.random.seed(1542)
dates = pd.date_range('29/01/17', periods = 6)
pd.DataFrame(np.random.randn(6,6), index = dates,\
columns = [['g1', 'g1', 'g1', 'g2', 'g2', 'g2'],\
['c1', 'c2', 'c3', 'c1', 'c2', 'c3']])
And I want to apply a function that, for each group in the first level of columns, takes the columns 'c2' and 'c3' and returns a single value.
An example of the function (that in the real case is more complex) can be
def function(first_column, second_column):
    return(max(first_column) - max(second_column))
When I apply it to my DataFrame, I want to get back a DataFrame that gives me the output of 'function' for each group, so, in this case, just 2 numbers for 'g1' and 'g2'.
Note that I want it to work also in the case of groupby(), so that I get the result of the function for each group ('g1' and 'g2') and for each groupby subset.
For the case above, if I want to aggregate by month, the result should be:
g1 g2
1 0.909464 1.638375
2 0.698515 0.33819
I think you need to group by the first level of the MultiIndex and apply a custom function that uses xs to select the second level:
np.random.seed(1542)
df = pd.DataFrame(np.random.randn(6,6), index = range(6),\
columns = [['g1', 'g1', 'g1', 'g2', 'g2', 'g2'],\
['c1', 'c2', 'c3', 'c1', 'c2', 'c3']])
print (df)
g1 g2
c1 c2 c3 c1 c2 c3
0 -0.556376 -0.295627 0.618673 -0.409434 0.107020 -1.143460
1 -0.145909 0.017417 0.117667 -0.301128 0.880918 -1.027282
2 2.287448 1.528137 -1.528636 0.052728 -1.842634 -0.757457
3 -0.651587 -1.075176 1.128277 0.632036 -0.240965 0.421812
4 -1.620718 0.146108 0.030500 -0.446294 -0.206774 0.819859
5 -0.757296 1.826793 -0.352837 -2.048026 1.362865 1.024671
def f(x):
    a = x.xs('c2', axis=1, level=1)[x.name].max()
    b = x.xs('c3', axis=1, level=1)[x.name].max()
    return a - b
s = df.groupby(level=0, axis=1).apply(f)
print (s)
g1 0.698516
g2 0.338194
dtype: float64
Similar solution:
def f(x):
    a = x.xs('c2', axis=1, level=1).squeeze()
    b = x.xs('c3', axis=1, level=1).squeeze()
    return a.max() - b.max()
a = df.groupby(level=0, axis=1).apply(f)
print (a)
g1 0.698516
g2 0.338194
dtype: float64
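For this particular function, the same numbers can even be computed without groupby, because xs already returns one column per group; a minimal sketch:
s = df.xs('c2', axis=1, level=1).max() - df.xs('c3', axis=1, level=1).max()
print (s)
g1    0.698516
g2    0.338194
dtype: float64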
EDIT: to resample by month, df must be built with the DatetimeIndex from the question (dates = pd.date_range('29/01/17', periods=6)) rather than the integer index used above:
def f(x):
    a = x.xs('c2', axis=1, level=1)[x.name]
    b = x.xs('c3', axis=1, level=1)[x.name]
    return a - b
s = df.resample('M').max().groupby(level=0, axis=1).apply(f)
print (s)
g1 g2
2017-01-31 0.909464 1.638375
2017-02-28 0.698516 0.338194
print (df.resample('M').max())
g1 g2
c1 c2 c3 c1 c2 c3
2017-01-31 2.287448 1.528137 0.618673 0.052728 0.880918 -0.757457
2017-02-28 -0.651587 1.826793 1.128277 0.632036 1.362865 1.024671
EDIT1:
The solution can be simplified further:
a = df.resample('M').max()
b = a.xs('c2', axis=1, level=1)
c = a.xs('c3', axis=1, level=1)
d = b - c
print (d)
g1 g2
2017-01-31 0.909464 1.638375
2017-02-28 0.698516 0.338194
Thanks to jezrael for the useful input. Building on it, I've written a solution to the problem: apply a complex function, one that takes two or more arrays as input and returns a single value, to a dataframe with hierarchical column names, combined with resample over a datetime index.
First, here is the table I will use for the example:
import numpy as np
import pandas as pd
mat = np.random.randint(0, 101, size = (10, 6))
index = pd.date_range(start = '25 Jan 2018', periods = 10)
first_column_name = ['Group1']*3 + ['Group2']*3
second_column_name = ['Col1', 'Col2', 'Col3']*2
df = pd.DataFrame(mat, index = index, columns = [first_column_name,\
second_column_name])
Group1 Group2
Col1 Col2 Col3 Col1 Col2 Col3
2018-01-25 11 36 80 88 31 33
2018-01-26 30 32 61 53 55 43
2018-01-27 64 26 21 63 33 93
2018-01-28 52 59 23 54 91 60
2018-01-29 93 88 27 16 88 7
2018-01-30 28 76 48 5 38 1
2018-01-31 7 29 45 86 53 96
2018-02-01 18 89 69 3 34 34
2018-02-02 0 7 94 99 5 68
2018-02-03 29 13 98 25 51 44
Now I want to apply the function:
def my_fun(arr1, arr2):
    arr1 = np.array(arr1)
    arr2 = np.array(arr2)
    tmp = np.abs(arr1 - arr2)
    return(np.sum(tmp))
Note that this is a simple case: in the real case the function is far more complex, and workarounds aren't an option!
The desired output is the following, when I apply the function to 'Col1' and 'Col3':
Group1 Group2
2018-01-31 296 124
2018-02-28 214 81
To do that, I applied a bit of object-oriented programming to combine resample with groupby, creating this class:
class ApplyFunction():
    def __init__(self, column_names, fun, resample = None):
        self.cn = column_names
        self.fun = fun
        self.resample = resample
        # Initialize the stored values
        self.stored_values = dict()
        for name in self.cn:
            self.stored_values[name] = []

    def __store(self, x):
        self.stored_values[self.to_store].append(x.values.copy())

    def wrapper_with_resample(self, x):
        if self.resample is None:
            print('Cannot use this function with resample = None')
            return np.nan
        # Get the name of the group
        # (x.columns.labels was renamed to x.columns.codes in pandas >= 0.24)
        group_name = x.columns.levels[0][x.columns.labels[0][0]]
        # Get the time steps output by resample (via a dummy operation)
        self.timesteps = x.resample(self.resample).apply(lambda x: len(x)).index
        # Store the resampled variables
        for name in self.cn:
            self.to_store = name
            x[(group_name, name)].resample(self.resample).apply(self.__store)
        # Create a new Series for the output
        out = []
        for i in range(len(self.timesteps)):
            out.append(self.fun(*[self.stored_values[name][i] for name in self.cn]))
        out = pd.Series(out, index = self.timesteps)
        # Reset self.stored_values
        for name in self.cn:
            self.stored_values[name] = []
        return out
And then I use it as follows:
f = ApplyFunction(column_names = ['Col1', 'Col3'], fun = my_fun, resample = 'M')
output = df.groupby(level = 0, axis = 1).apply(f.wrapper_with_resample)
I wrote this solution because we want to apply groupby and resample together, and I couldn't find a suitable way to do that in pandas.
I hope this solution is useful for someone; of course there is room for improvement so feel free to post alternative and more efficient solutions!
Thanks. Marco
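Following up on the invitation above: a shorter route to the same groupby-plus-resample pairing may be to group each top-level block with pd.Grouper, since DataFrameGroupBy.apply receives the whole monthly sub-frame. A sketch, assuming df and my_fun as defined above, which should reproduce the desired output:
out = pd.DataFrame({
    g: df[g].groupby(pd.Grouper(freq='M')).apply(
        lambda w: my_fun(w['Col1'], w['Col3']))
    for g in df.columns.levels[0]
})
print(out)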
I have a dataframe with complex numbers as shown below,
>>>df.head()
96 98
1 3.608719+23.596415j 3.782185+22.818022j
2 7.239342+28.990936j 6.801471+26.978671j
3 3.671441+23.73544j 3.842973+22.940337j
4 3.935188+24.191112j 4.063692+23.269834j
5 3.544675+23.561207j 3.852818+22.951067j
I am trying to create a new multiindexed dataframe out of it as below,
96 98
R X R X
1 3.608719 23.596415 3.782185 22.818022
2 7.239342 28.990936 6.801471 26.978671
3 3.671441 23.73544 3.842973 22.940337
4 3.935188 24.191112 4.063692 23.269834
5 3.544675 23.561207 3.852818 22.951067
I have tried splitting them into real & imaginary dataframes and then merge/concat them, but was not successful.
You can use:
df = pd.concat([df.apply(lambda x: x.real), df.apply(lambda x: x.imag)],
axis=1,
keys=('R','X')) \
.swaplevel(0,1,1) \
.sort_index(1)
print (df)
96 98
R X R X
1 3.608719 23.596415 3.782185 22.818022
2 7.239342 28.990936 6.801471 26.978671
3 3.671441 23.735440 3.842973 22.940337
4 3.935188 24.191112 4.063692 23.269834
5 3.544675 23.561207 3.852818 22.951067
Another solution:
import numpy as np
a = df.values
mux = pd.MultiIndex.from_product([['R','X'], df.columns])
df1 = (pd.DataFrame(np.concatenate([a.real, a.imag], axis=1), columns=mux)
         .swaplevel(0,1,1)
         .sort_index(1))
print (df1)
96 98
R X R X
0 3.608719 23.596415 3.782185 22.818022
1 7.239342 28.990936 6.801471 26.978671
2 3.671441 23.735440 3.842973 22.940337
3 3.935188 24.191112 4.063692 23.269834
4 3.544675 23.561207 3.852818 22.951067
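A variant that keeps the original row index (the NumPy route above resets it to 0..n-1), assuming a pandas version with Series.to_numpy; it builds a small R/X frame per column and concatenates them:
df1 = pd.concat(
    {c: pd.DataFrame({'R': df[c].to_numpy().real,
                      'X': df[c].to_numpy().imag}, index=df.index)
     for c in df.columns},
    axis=1)
print (df1)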