My understanding of a Pandas dataframe vectorization (through Pandas vectorization itself or through Numpy) is applying a function to an array, similar to .apply() (Please correct me if I'm wrong). Suppose I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'color' : ['red','blue','yellow','orange','green',
'white','black','brown','orange-red','teal',
'beige','mauve','cyan','goldenrod','auburn',
'azure','celadon','lavender','oak','chocolate'],
'group' : [1,1,1,1,1,
1,1,1,1,1,
1,2,2,2,2,
4,4,5,6,7]})
df = df.set_index('color')
df
For this data, I want to apply a special counter for each unique value in the group column. Here's my current implementation of it:
df['C'] = 0
for value in set(df['group'].values):
    filtered_df = df[df['group'] == value]
    adj_counter = 0
    initialize_counter = -1
    spacing_counter = 20
    special_counters = [0,1,-1,2,-2,3,-3,4,-4,5,-5,6,-6,7,-7]
    for color,rows in filtered_df.iterrows():
        if len(filtered_df.index) < 7:
            initialize_counter +=1
            df.loc[color,'C'] = (46+special_counters[initialize_counter])
        else:
            spacing_counter +=1
            if spacing_counter > 5:
                spacing_counter = 0
            df.loc[color,'C'] = spacing_counter
df
Is there a faster way to implement this that doesn't involve iterrows or itertuples? Since the counting in column C is very irregular, I'm not sure how I could implement this through apply or even through vectorization.
What you can do is first create the column 'C' with groupby on the column 'group' and cumcount. Within each group this gives 0, 1, 2, ..., which almost matches spacing_counter or initialize_counter, depending on whether len(filtered_df.index) < 7.
df['C'] = df.groupby('group').cumcount()
Now you need to select the appropriate rows for the if and the else branches of your code. One way is to use groupby again with transform to build a series holding, for each row, the size of its group. Then use loc on your df with this series: where the size is smaller than 7, map the values through special_counters; otherwise take the value modulo 6.
special_counters = [0,1,-1,2,-2,3,-3,4,-4,5,-5,6,-6,7,-7] # same list as in the question
ser_size = df.groupby('group')['C'].transform('size')
df.loc[ser_size < 7,'C'] = df.loc[ser_size < 7,'C'].map(lambda x: 46 + special_counters[x])
df.loc[ser_size >= 7,'C'] %= 6
At the end, you get the expected result:
print (df)
group C
color
red 1 0
blue 1 1
yellow 1 2
orange 1 3
green 1 4
white 1 5
black 1 0
brown 1 1
orange-red 1 2
teal 1 3
beige 1 4
mauve 2 46
cyan 2 47
goldenrod 2 45
auburn 2 48
azure 4 46
celadon 4 47
lavender 5 46
oak 6 46
chocolate 7 46
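For reference, the steps above can be put together into one self-contained sketch. It rebuilds the example frame from the question; the names position, group_size and small are only illustrative, and it assumes (as the original loop does) that a group with fewer than 7 rows never needs more entries than special_counters has.

```python
import pandas as pd

colors = ['red', 'blue', 'yellow', 'orange', 'green',
          'white', 'black', 'brown', 'orange-red', 'teal',
          'beige', 'mauve', 'cyan', 'goldenrod', 'auburn',
          'azure', 'celadon', 'lavender', 'oak', 'chocolate']
groups = [1] * 11 + [2] * 4 + [4, 4, 5, 6, 7]
df = pd.DataFrame({'color': colors, 'group': groups}).set_index('color')

special_counters = [0, 1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7]

# 0-based position of each row within its group.
position = df.groupby('group').cumcount()
# Size of the group each row belongs to, aligned with df's index.
group_size = df.groupby('group')['group'].transform('size')

small = group_size < 7
# Large groups: cycle 0..5. Small groups: look up the special counter.
df['C'] = position % 6
df.loc[small, 'C'] = position[small].map(lambda i: 46 + special_counters[i])
print(df)
```

Keeping position and group_size as separate series avoids overwriting 'C' halfway through, which makes the two branches easier to check one at a time.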
I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
While looping over the rows, I want to replicate data whenever the difference in Hour between two consecutive rows is greater than 4.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
That is, I want to replicate rows when, iterating through all the rows, the difference in row.hour is greater than 4.
For example, row.hour[0] = 1 and row.hour[1] = 2, so the difference is 1.
But row.hour[2] = 4 and row.hour[3] = 10, so the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (greater than 4) is fulfilled.
I can replicate the data with df = pd.concat([df]*2, ignore_index=False), but it does not replicate when I run it inside an if statement.
I tried the code below, but nothing happens.
for i in range(0,len(df)-1):
    if (df.iloc[i,0] - df.iloc[i+1,0]) > 4 :
        df = pd.concat([df]*2, ignore_index=False)
My understanding is: you want to compare 'Hour' values for two successive rows.
If the difference is > 4 you want to add the previous row to the DF.
If that is what you want try this:
Create a DF:
j = pd.DataFrame({'Hour':[1, 2, 4,10,15,16,17,19],
'Wage':[15,17,20,25,26,30,40,15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d)-2):
        if (abs(d.iloc[x+1].Hour - d.iloc[x+2].Hour) > 4):
            idx = x + 0.5
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4-10 instead of 10-4, so you check -6 > 4 instead of 6 > 4.
You have to swap the operands:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations (> 4 and < -4):
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger), you would see it.
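As a hedged illustration of the same comparison without an explicit loop, the sketch below uses Series.diff() on the question's data to locate the rows that follow a jump larger than 4. The names gap, big_jump and jump_rows are made up for this example, and it only finds the jumps; what to replicate at those positions is left open, as in the question.

```python
import pandas as pd

df = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                   'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})

# diff() computes Hour[i] - Hour[i-1]; the first row has no predecessor (NaN).
gap = df['Hour'].diff()
# abs() catches jumps in either direction; a NaN comparison evaluates to False.
big_jump = gap.abs() > 4
# Index labels of the rows that come right after a big jump.
jump_rows = df.index[big_jump].tolist()
print(jump_rows)
```

Printing gap next to the Hour column is a quick way to confirm the sign of the subtraction, which is the mistake described above.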
from itertools import product
import pandas as pd
df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
# c1 c2
# 0 0 0
# 1 0 1
# 2 0 2
# 3 0 3
# 4 0 4
# .. .. ..
# 85 9 4
# 86 9 5
# 87 9 7
# 88 9 8
# 89 9 9
#
# [90 rows x 2 columns]
How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?
An example of symmetric pair is that '(0, 1)' is equal to '(1, 0)'. The latter should be removed.
The algorithm must be fast, so it is recommended to use numpy. Converting to python object is not allowed.
You can sort the values, then groupby:
import numpy as np
a = np.sort(df.to_numpy(), axis=1)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()
Option 2: If you have a lot of pairs c1, c2, groupby can be slow. In that case, we can assign new values and filter by drop_duplicates:
a= np.sort(df.to_numpy(), axis=1)
(df.assign(one=a[:,0], two=a[:,1]) # one and two can be changed
.drop_duplicates(['one','two']) # taken from above
.reindex(df.columns, axis=1)
)
One way is using np.unique with return_index=True and use the result to index the dataframe:
a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)
print(df.iloc[ix, :])
c1 c2
0 0 0
1 0 1
20 2 0
3 0 3
40 4 0
50 5 0
6 0 6
70 7 0
8 0 8
9 0 9
11 1 1
21 2 1
13 1 3
41 4 1
51 5 1
16 1 6
71 7 1
...
Using frozenset:
mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()
df[~mask]
I would do:
df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]
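To make the one-liner above easier to follow, here is a small worked sketch on a hypothetical five-row frame (pairs, normalized and is_repeat are illustrative names). Sorting each row puts both orderings of a pair into the same form, so duplicated() flags every later occurrence.

```python
import numpy as np
import pandas as pd

# Toy data: (1, 0) repeats (0, 1), and (0, 2) repeats (2, 0).
pairs = pd.DataFrame({'c1': [0, 1, 1, 2, 0], 'c2': [1, 0, 1, 0, 2]})

# Sort within each row so that (1, 0) and (0, 1) both become (0, 1).
normalized = np.sort(pairs.to_numpy(), axis=1)
# duplicated() is False for the first occurrence and True for later ones.
is_repeat = pd.DataFrame(normalized).duplicated().to_numpy()
# Drop the later occurrences; the original values and index are kept.
deduped = pairs[~is_repeat]
print(deduped)
```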
Using pandas crosstab and numpy triu:
s=pd.crosstab(df.c1,df.c2)
s=s.mask(np.triu(np.ones(s.shape)).astype(bool) & s==0).stack().reset_index()
Here's one NumPy based one for integers -
def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a,axis=1)
    idx = np.ravel_multi_index(b.T,(b.max(0)+1))
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    m = np.r_[True,p[:-1]!=p[1:]]
    a_out = a[np.sort(sidx[m])]
    df_out = pd.DataFrame(a_out)
    return df_out
If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].
For generic numbers (ints/floats, etc.), we will use a view-based one -
# https://stackoverflow.com/a/44999009/ #Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()
and simply replace the step to get idx with idx = view1D(b) in remove_symm_pairs.
If this needs to be fast, and if your variables are integer, then the following trick may help: let v,w be the columns of your vector; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back to [v, w] = [(x+y), (x-y)]/2.
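A minimal sketch of this trick, under a few assumptions: the columns are integers, drop_symmetric_pairs and its parameter names are hypothetical, and np.unique stands in for the sort-and-deduplicate step. Instead of mapping [x, y] back to [v, w], this variant uses the first-occurrence indices to select the original rows, which keeps the original order within each pair.

```python
import numpy as np
import pandas as pd

def drop_symmetric_pairs(frame, col_v='c1', col_w='c2'):
    v = frame[col_v].to_numpy()
    w = frame[col_w].to_numpy()
    # (v + w, |v - w|) is identical for (v, w) and (w, v) and
    # differs between any two distinct unordered pairs.
    key = np.column_stack([v + w, np.abs(v - w)])
    # return_index gives the position of the first occurrence of each key.
    _, first_seen = np.unique(key, axis=0, return_index=True)
    # Sort the positions to preserve the frame's row order.
    return frame.iloc[np.sort(first_seen)]

pairs = pd.DataFrame({'c1': [0, 1, 1, 2, 0], 'c2': [1, 0, 1, 0, 2]})
unique_pairs = drop_symmetric_pairs(pairs)
print(unique_pairs)
```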
Suppose I have the following dataframe:
CategoryID Days Views
a 1 19
a 2 2000
a 5 5667
a 7 7899
b 1 2
b 3 245
c 1 1
c 2 252
c 7 2657
Given a threshold = n, I want to create two lists and I'll append them until I reach that threshold + 1 element for each category.
So, if n < 4, I expect for category a:
days_list = [1,2,5]
views_list = [19, 2000, 5667]
After that, I want to apply a function to those lists and then start the iteration on the next category. However, I'm facing two issues with the following code:
I can't iterate properly when i == 0.
The iteration does not go to the next category.
df['interpolated'] = int
days_list = []
views_list = []
for i,post in enumerate(category):
    if df['category_id'].iloc[i-1] != post:
        days_list.append(df['days new'].iloc[i])
        views_list.append(df['views'].iloc[i])
    elif df['category_id'].iloc[i] == post and df['category_id'].iloc[i-1] == post:
        if df['days new'].iloc[i] < 3:
            days_list.append(df['days new'].iloc[i])
            views_list.append(df['views'].iloc[i])
        elif df['days new'].iloc[i] != 3:
            days_list.append(df['days new'].iloc[i])
            views_list.append(df['views'].iloc[i])
            break
# Calculate the interpolation
interpolator = log_interp1d(days_list,views_list)
df['interpolated'] = round(interpolator(4).astype(int))
# Reset the lists after the category loop
days_list = []
views_list = []
Can someone shed some light on this? Thanks!
You can use a row_number type operation.
....
df['row_number'] = df.groupby('CategoryID').cumcount() + 1
Then, you will have a dataframe
CategoryID Days Views row_number
a 1 19 1
a 2 2000 2
a 5 5667 3
a 7 7899 4
b 1 2 1
b 3 245 2
c 1 1 1
c 2 252 2
c 7 2657 3
Then, you should be able to use boolean filtering to get what you want. So for your example,
df_category_a_filtered_4 = df[(df['row_number'] <= 3) & (df['CategoryID'] == 'a')]
This filters your dataframe so that the two lists you want are its Days and Views columns. You can wrap this in a function to do whatever you need.
If you want a more specific output, please specify what that would look like.
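As a rough sketch of how this could be extended to every category at once, the example below rebuilds the question's frame and collects the first rows of each category into a pair of lists. The names max_rows, kept and lists_by_category are hypothetical, and taking the first three rows is an assumption based on the expected output for category a.

```python
import pandas as pd

df = pd.DataFrame({
    'CategoryID': list('aaaabbccc'),
    'Days': [1, 2, 5, 7, 1, 3, 1, 2, 7],
    'Views': [19, 2000, 5667, 7899, 2, 245, 1, 252, 2657],
})

max_rows = 3  # assumed: how many rows to keep per category

# 1-based row number within each category.
df['row_number'] = df.groupby('CategoryID').cumcount() + 1
kept = df[df['row_number'] <= max_rows]

# One (days_list, views_list) pair per category.
lists_by_category = {
    cat: (grp['Days'].tolist(), grp['Views'].tolist())
    for cat, grp in kept.groupby('CategoryID')
}
print(lists_by_category['a'])
```

Each pair can then be passed to whatever function needs the two lists, one category at a time, without tracking i - 1 by hand.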
Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to mark the rows that fall within the first 2 positions (or whatever limiting value you like) of each consecutive run. Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook, section on grouping: "Grouping like Python's itertools.groupby"
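The combination can be unpacked into its individual steps. The sketch below is only an illustration of the same idea with made-up intermediate names (run_start, run_id, pos_in_run) and an explicit limit variable.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2

# shift: flag rows whose value differs from the previous row (start of a run).
run_start = df['A'] != df['A'].shift()
# cumsum: turn the flags into an id shared by all rows of the same run.
run_id = run_start.cumsum()
# groupby + cumcount: 0-based position of each row inside its run.
pos_in_run = df.groupby(run_id).cumcount()
# Keep a 1 only while it is among the first `limit` rows of its run.
df['B'] = ((pos_in_run < limit) & (df['A'] == 1)).astype(int)
print(df)
```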
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(df['A'][x] == 1 and not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
import numpy
a = df['A'].to_numpy(copy=True)
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    # note: asarray does not copy, so an ndarray passed in is modified in place
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
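Here is a hedged variant of the general method, run on the question's data. trim_runs_copy is a hypothetical name; compared with the function above it copies its input and slices the convolution to len(a), which also keeps the result well-defined for a cutoff of 0. It still assumes the values are only 0 or 1.

```python
import numpy as np

def trim_runs_copy(values, cutoff):
    a = np.array(values)  # np.array copies, so the caller's data is untouched
    window = np.ones(cutoff + 1, dtype=int)
    # Number of 1's among the current element and the `cutoff` before it.
    ones_in_window = np.convolve(a, window)[:len(a)]
    # A full window of 1's means the run has exceeded the cutoff.
    a[ones_in_window > cutoff] = 0
    return a

original = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
trimmed = trim_runs_copy(original, 2)
print(trimmed)
```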
I have a pandas dataframe in which one of the columns is:
a = [1,0,1,0,1,3,4,6,4,6]
Now I want to create another column such that any value greater than 0 and less than 5 is assigned 1 and the rest are assigned 0, i.e.:
a = [1,0,1,0,1,3,4,6,4,6]
b = [1,0,1,0,1,1,1,0,1,0]
So far I have done this:
dtaframe['b'] = dtaframe['a'].loc[0 < dtaframe['a'] < 5] = 1
dtaframe['b'] = dtaframe['a'].loc[dtaframe['a'] >4 or dtaframe['a']==0] = 0
but the code throws an error. What should I do?
You can use between to get Boolean values, then astype to convert from Boolean values to 0/1:
dtaframe['b'] = dtaframe['a'].between(0, 5, inclusive='neither').astype(int)
The resulting output:
a b
0 1 1
1 0 0
2 1 1
3 0 0
4 1 1
5 3 1
6 4 1
7 6 0
8 4 1
9 6 0
Edit
For multiple ranges, you could use pandas.cut:
dtaframe['b'] = pd.cut(dtaframe['a'], bins=[0,1,6,9], labels=False, include_lowest=True)
You'll need to be careful about how you define bins. Using labels=False will return integer indicators for each bin, which happens to correspond with the labels you provided. You could also manually specify the labels for each bin, e.g. labels=[0,1,2], labels=[0,17,19], labels=['a','b','c'], etc. You may need to use astype if you manually specify the labels, as they'll be returned as categories.
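To see how the bin edges behave, here is a small sketch on the question's values. The label names 'low', 'mid' and 'high' are hypothetical. With the default right-closed bins and include_lowest=True, the intervals are [0, 1], (1, 6] and (6, 9].

```python
import pandas as pd

a = pd.Series([1, 0, 1, 0, 1, 3, 4, 6, 4, 6])
bins = [0, 1, 6, 9]

# labels=False returns the integer position of the bin for each value.
codes = pd.cut(a, bins=bins, labels=False, include_lowest=True)

# Explicit labels come back as a categorical; astype converts them.
named = pd.cut(a, bins=bins, labels=['low', 'mid', 'high'], include_lowest=True)
named_as_str = named.astype(str)
print(codes.tolist())
print(named_as_str.tolist())
```

Note that 1 falls in the first bin and 6 in the second because each bin includes its right edge.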
Alternatively, you could combine loc and between to manually specify each range:
dtaframe.loc[dtaframe['a'].between(0,1), 'b'] = 0
dtaframe.loc[dtaframe['a'].between(2,6), 'b'] = 1
dtaframe.loc[dtaframe['a'].between(7,9), 'b'] = 2
When using comparison operators and boolean logic to filter dataframes, you can't use the pythonic idiom a < myseries < b. Instead you need to write (a < myseries) & (myseries < b):
cond1 = (0 < dtaframe['a'])
cond2 = (dtaframe['a'] < 5)
dtaframe['b'] = (cond1 & cond2) * 1
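A self-contained sketch of the element-wise form on the question's data is shown below (in_range and b_where are illustrative names). The parentheses matter because & binds more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

dtaframe = pd.DataFrame({'a': [1, 0, 1, 0, 1, 3, 4, 6, 4, 6]})

# Each comparison yields a boolean Series; & combines them element-wise.
in_range = (dtaframe['a'] > 0) & (dtaframe['a'] < 5)
dtaframe['b'] = in_range.astype(int)

# The same mask can drive np.where to pick 1 or 0 per row.
b_where = np.where(in_range, 1, 0)
print(dtaframe)
```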
Try this with np.where:
dtaframe['b'] = np.where((dtaframe['a'] > 4) | (dtaframe['a'] == 0), 0, 1)