I have a pandas DataFrame, one of whose columns is:
a = [1,0,1,0,1,3,4,6,4,6]
Now I want to create another column such that any value greater than 0 and less than 5 is assigned 1 and the rest are assigned 0, i.e.:
a = [1,0,1,0,1,3,4,6,4,6]
b = [1,0,1,0,1,1,1,0,1,0]
Now I have done this:
dtaframe['b'] = dtaframe['a'].loc[0 < dtaframe['a'] < 5] = 1
dtaframe['b'] = dtaframe['a'].loc[dtaframe['a'] >4 or dtaframe['a']==0] = 0
but the code throws an error. What should I do?
You can use between to get Boolean values, then astype to convert from Booleans to 0/1:
dtaframe['b'] = dtaframe['a'].between(0, 5, inclusive='neither').astype(int)
(In pandas versions before 1.3, the exclusive-bounds behaviour was spelled inclusive=False.)
The resulting output:
a b
0 1 1
1 0 0
2 1 1
3 0 0
4 1 1
5 3 1
6 4 1
7 6 0
8 4 1
9 6 0
Edit
For multiple ranges, you could use pandas.cut:
dtaframe['b'] = pd.cut(dtaframe['a'], bins=[0,1,6,9], labels=False, include_lowest=True)
You'll need to be careful about how you define the bins. Using labels=False will return integer indicators for each bin, which here happen to correspond to the values you wanted. You could also manually specify the labels for each bin, e.g. labels=[0,1,2], labels=[0,17,19], labels=['a','b','c'], etc. You may need astype if you manually specify the labels, as they'll be returned as categories; see the sketch below.
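For example, here's a minimal sketch of manually specified labels with the same bin edges (the column values are the ones from the question):
import pandas as pd

dtaframe = pd.DataFrame({'a': [1, 0, 1, 0, 1, 3, 4, 6, 4, 6]})
# Manually labelled bins; cut returns a Categorical, hence the astype.
dtaframe['b'] = pd.cut(dtaframe['a'], bins=[0, 1, 6, 9],
                       labels=[0, 1, 2], include_lowest=True).astype(int)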
Alternatively, you could combine loc and between to manually specify each range:
dtaframe.loc[dtaframe['a'].between(0,1), 'b'] = 0
dtaframe.loc[dtaframe['a'].between(2,6), 'b'] = 1
dtaframe.loc[dtaframe['a'].between(7,9), 'b'] = 2
When using comparison operators and boolean logic to filter dataframes, you can't use the Pythonic idiom a < myseries < b. Instead, you need (a < myseries) & (myseries < b):
cond1 = (0 < dtaframe['a'])
cond2 = (dtaframe['a'] < 5)
dtaframe['b'] = (cond1 & cond2) * 1  # multiplying a Boolean Series by 1 gives 0/1 integers
Try this with np.where:
dtaframe['b'] = np.where((dtaframe['a'] > 4) | (dtaframe['a'] == 0), 0, 1)  # requires: import numpy as np
Related
I have a Pandas dataframe that looks like this:
df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]})
gp_id A
0 1 1
1 2 2
2 1 3
3 2 4
I want to assign the value -1 to the first row of the group with the id 2 (gp_id = 2), to get the following output:
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
To do this, I've tried the following code:
df[df.gp_id == 2].A.iloc[0] = -1
But this doesn't do anything, as I'm assigning a value in the sub-dataframe df[df.gp_id == 2] (which is a copy) rather than modifying the original dataframe df.
Is there an easy way to solve this problem?
You could do:
df.loc[(df.gp_id == 2).argmax(), 'A'] = -1
since pd.Series.argmax returns the position of the first occurrence of the maximum (here, the first True); with the default RangeIndex, positions and labels coincide, so loc works.
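As a quick illustration (a hypothetical series, assuming the default RangeIndex):
import pandas as pd

s = pd.Series([False, True, False, True])
print(s.argmax())  # 1 -- the position of the first True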
If you are not sure that the value is present in the dataframe, you could do:
cond = (df.gp_id == 2)
if cond.sum():
    df.loc[cond.argmax(), 'A'] = -1
A general solution, which also works when the mask matches no rows, is to chain a second mask built from the cumulative sum of the first (combined with & for bitwise AND) and set values with DataFrame.loc:
m = df.gp_id == 2
df.loc[m & (m.cumsum() == 1), 'A'] = -1
This works well if there is no match: no assignment, no error, and no incorrect assignment:
m = df.gp_id == 7
df.loc[m & (m.cumsum() == 1), 'A'] = -1
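To see why m & (m.cumsum() == 1) picks out only the first matching row, here's a minimal sketch printing the intermediate series for the example frame:
import pandas as pd

df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]})
m = df.gp_id == 2
print(m.tolist())                        # [False, True, False, True]
print(m.cumsum().tolist())               # [0, 1, 1, 2]
print((m & (m.cumsum() == 1)).tolist())  # [False, True, False, False]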
If the mask is guaranteed to match at least one row, a simpler solution is:
idx = df[df.gp_id == 2].index[0]
df.loc[idx, 'A'] = -1
print (df)
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
If there is no match, this solution raises an error rather than making an incorrect assignment.
I have created a new column by comparing two boolean columns. If both are positive, I assign a 1, otherwise a 0. This is my code below; is there a way to make it more Pythonic? I tried a list comprehension but failed.
lst = []
for i, k in zip(df['new_customer'], df['y']):
    if i == 1 & k == 1:
        lst.append(1)
    else:
        lst.append(0)
df['new_customer_subscription'] = lst
Use np.sign:
m = np.sign(df[['new_customer', 'y']]) >= 0
df['new_customer_subscription'] = m.all(axis=1).astype(int)
If you want to consider only positive non-zero values, change >= 0 to > 0 (since np.sign(0) is 0).
# Sample DataFrame.
df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
df
A B
0 0.511684 -0.512633
1 -1.254813 -1.721734
2 0.751830 0.285449
3 -0.934877 1.407998
4 -1.686066 -0.947015
# Get the sign of the numbers.
m = np.sign(df[['A', 'B']]) >= 0
m
A B
0 True False
1 False False
2 True True
3 False True
4 False False
# Find all rows where both columns are `True`.
m.all(axis=1).astype(int)
0 0
1 0
2 1
3 0
4 0
dtype: int64
Another solution if you have to deal with only two columns would be:
df['new_customer_subscription'] = (
df['new_customer'].gt(0) & df['y'].gt(0)).astype(int)
To generalise to multiple columns, use logical_and.reduce:
df['new_customer_subscription'] = np.logical_and.reduce(
df[['new_customer', 'y']] > 0, axis=1).astype(int)
Or,
df['new_customer_subscription'] = (df[['new_customer', 'y']] > 0).all(1).astype(int)
Another way to do this is using np.where from the NumPy module:
df['Indicator'] = np.where((df.A > 0) & (df.B > 0), 1, 0)
Output
A B Indicator
0 -0.464992 0.418243 0
1 -0.902320 0.496530 0
2 0.219111 1.052536 1
3 -1.377076 0.207964 0
4 1.051078 2.041550 1
The np.where function works like this:
np.where(condition, value if True, value if False)
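A minimal standalone sketch of that signature:
import numpy as np

a = np.array([-2, 0, 3, 5])
print(np.where(a > 0, 1, 0))  # [0 0 1 1]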
My understanding of Pandas dataframe vectorization (through Pandas itself or through NumPy) is that it applies a function to an array, similar to .apply() (please correct me if I'm wrong). Suppose I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'color' : ['red','blue','yellow','orange','green',
'white','black','brown','orange-red','teal',
'beige','mauve','cyan','goldenrod','auburn',
'azure','celadon','lavender','oak','chocolate'],
'group' : [1,1,1,1,1,
1,1,1,1,1,
1,2,2,2,2,
4,4,5,6,7]})
df = df.set_index('color')
df
For this data, I want to apply a special counter to each unique value in group. Here's my current implementation:
df['C'] = 0
for value in set(df['group'].values):
    filtered_df = df[df['group'] == value]
    adj_counter = 0
    initialize_counter = -1
    spacing_counter = 20
    special_counters = [0,1,-1,2,-2,3,-3,4,-4,5,-5,6,-6,7,-7]
    for color, rows in filtered_df.iterrows():
        if len(filtered_df.index) < 7:
            initialize_counter += 1
            df.loc[color,'C'] = (46 + special_counters[initialize_counter])
        else:
            spacing_counter += 1
            if spacing_counter > 5:
                spacing_counter = 0
            df.loc[color,'C'] = spacing_counter
df
Is there a faster way to implement this that doesn't involve iterrows or itertuples? Since the counting in the C column is very irregular, I'm not sure how I could implement it through apply or even through vectorization.
What you can do is first create the column 'C' with groupby on the column 'group' and cumcount; that almost gives you spacing_counter or initialize_counter, depending on whether len(filtered_df.index) < 7 or not.
df['C'] = df.groupby('group').cumcount()
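For the example frame, cumcount simply numbers the rows within each group starting from 0 (a quick check):
print(df.groupby('group').cumcount().tolist())
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 0, 1, 0, 0, 0]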
Now you need to select the appropriate rows for the if and the else parts of your code. One way is to use groupby again with transform to get the size of the group associated with each row. Then use loc on your df with this series: if the value is smaller than 7, map your values through special_counters; otherwise just take them modulo 6:
special_counters = [0,1,-1,2,-2,3,-3,4,-4,5,-5,6,-6,7,-7]  # as defined in the question
ser_size = df.groupby('group')['C'].transform('size')
df.loc[ser_size < 7, 'C'] = df.loc[ser_size < 7, 'C'].map(lambda x: 46 + special_counters[x])
df.loc[ser_size >= 7, 'C'] %= 6
At the end, you get the expected result:
print (df)
group C
color
red 1 0
blue 1 1
yellow 1 2
orange 1 3
green 1 4
white 1 5
black 1 0
brown 1 1
orange-red 1 2
teal 1 3
beige 1 4
mauve 2 46
cyan 2 47
goldenrod 2 45
auburn 2 48
azure 4 46
celadon 4 47
lavender 5 46
oak 6 46
chocolate 7 46
Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to limit the length of each run of 1s to some limiting value? Say the limit is 2; then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python's itertools.groupby".
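To see the mechanics, here's a small breakdown of the intermediate steps on the example column (a sketch, not part of the original answer):
import pandas as pd

A = pd.Series([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])
run_id = (A != A.shift()).cumsum()         # label each consecutive run
print(run_id.tolist())                     # [1, 1, 1, 2, 3, 3, 3, 3, 4, 5]
pos_in_run = A.groupby(run_id).cumcount()  # position within each run
print(pos_in_run.tolist())                 # [0, 1, 2, 0, 0, 1, 2, 3, 0, 0]
print(((pos_in_run <= 1) & A.astype(bool)).astype(int).tolist())
# [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]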
Another way (checking whether the previous two values are all 1; on Python 3, the map needs to be wrapped in list):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object; it can just be a normal NumPy array):
a = df['A'].to_numpy(copy=True)  # as_matrix() in older pandas; copy=True avoids mutating 'A'
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
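A quick usage check on the example column (note that asarray returns a view when given an integer NumPy array, so such input is modified in place; pass a list or a copy if that matters):
import numpy

print(trim_runs([1, 1, 1, 0, 1, 1, 1, 1, 0, 1], 2))
# [1 1 0 0 1 1 0 0 0 1]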
I am always surprised by this:
> data = DataFrame({'x':[1, 2], 'y':[2, 1]})
> data = data.sort_values('y')  # DataFrame.sort in older pandas
> data
x y
1 2 1
0 1 2
> data['x'][0]
1
Is there a way I can cause the indices to be reassigned to fit the new ordering?
For my part, I'm glad that sort doesn't throw away the index information. If it did, there wouldn't be much point to having an index in the first place, as opposed to another column.
If you want to reset the index to a range, you could:
>>> data
x y
1 2 1
0 1 2
>>> data.reset_index(drop=True)
x y
0 2 1
1 1 2
You could reassign the result or use inplace=True, as you like. If instead the real issue is that you want to access by position, independent of the index, you could use iloc:
>>> data['x']
1 2
0 1
Name: x, dtype: int64
>>> data['x'][0]
1
>>> data['x'].iloc[0]
2
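Putting the two together, a minimal sketch of the sort-then-renumber pattern:
import pandas as pd

data = pd.DataFrame({'x': [1, 2], 'y': [2, 1]})
data = data.sort_values('y').reset_index(drop=True)
print(data['x'][0])  # 2 -- the labels now follow the sorted order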