Pythonic way of making dummy column from sum of two values - python

I have a dataframe with one column called label which has the values [0,1,2,3,4,5,6,8,9].
I would like to make dummy columns out of this, but I would like some labels to be joined together, so for example I want dummy_012 to be 1 if the observation has either label 0, 1 or 2.
If I use the command df2 = pd.get_dummies(df, columns=['label']), it creates 9 columns, one for each label.
I know I can use df2['dummy_012'] = df2['dummy_0'] + df2['dummy_1'] + df2['dummy_2'] afterwards to turn them into one joint column, but I want to know whether there's a more Pythonic way of doing it (or some function where I can just change the parameters for the joins).

Maybe this approach can give an idea:
groups = ['012', '345', '6789']
for gp in groups:
    df.loc[df['Label'].isin([int(x) for x in gp]), 'Label_Group'] = f'dummies_{gp}'
Output:
   Label   Label_Group
0      0   dummies_012
1      1   dummies_012
2      2   dummies_012
3      3   dummies_345
4      4   dummies_345
5      5   dummies_345
6      6  dummies_6789
7      8  dummies_6789
8      9  dummies_6789
And then apply get_dummies:
df_dummies = pd.get_dummies(df['Label_Group'])
   dummies_012  dummies_345  dummies_6789
0            1            0             0
1            1            0             0
2            1            0             0
3            0            1             0
4            0            1             0
5            0            1             0
6            0            0             1
7            0            0             1
8            0            0             1
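A variant of the same idea is to map each label directly to its group name and one-hot encode the result in one pass; a minimal sketch (the grouping dict below is illustrative, not from the question):
import pandas as pd

df = pd.DataFrame({'Label': [0, 1, 2, 3, 4, 5, 6, 8, 9]})

# illustrative grouping: which labels collapse into which dummy column
groups = {'dummies_012': [0, 1, 2], 'dummies_345': [3, 4, 5], 'dummies_6789': [6, 8, 9]}
label_to_group = {lab: name for name, labs in groups.items() for lab in labs}

df_dummies = pd.get_dummies(df['Label'].map(label_to_group))
Labels that appear in no group simply map to NaN and get no dummy column.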

I don't know that this is the most Pythonic way, since a more elegant solution might exist, but it does allow you to change parameters and it's vectorized. I've read that get_dummies() can be a bit slow with large amounts of data, and vectorizing pandas operations is good practice in general, so I vectorized this function and had it do its calculations with NumPy arrays. It should give you a performance boost over similar functions as the dataset grows in size.
This function will take your dataframe and a list of numbers as strings and will return your dataframe with the column you wanted.
def get_dummy(df, column_nos):
    # e.g. column_nos = ['0', '1', '2'] -> new column 'dummy_012'
    new_col_name = 'dummy_' + ''.join(column_nos)
    # element-wise sum of the chosen columns as numpy arrays
    vector_sum = sum(df[i].values for i in column_nos)
    # 1 wherever any of the source dummies was 1
    df[new_col_name] = [1 if i > 0 else 0 for i in vector_sum]
    return df
In case you'd rather the input be integers than strings, you can tweak the function to look like below.
def get_dummy(df, column_nos):
    # e.g. column_nos = [0, 1, 2] -> reads dummy_0, dummy_1, dummy_2
    column_names = ['dummy_' + str(i) for i in column_nos]
    new_col_name = 'dummy_' + ''.join(str(i) for i in sorted(column_nos))
    # element-wise sum of the chosen dummy columns
    vector_sum = sum(df[name].values for name in column_names)
    df[new_col_name] = [1 if i > 0 else 0 for i in vector_sum]
    return df
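A possible usage sketch for the integer version, assuming the one-hot columns were created with a dummy_ prefix so the names match what the function expects:
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 2, 3, 4, 5, 6, 8, 9]})
# prefix='dummy' yields columns dummy_0, dummy_1, ..., dummy_9
df2 = pd.get_dummies(df, columns=['label'], prefix='dummy')
df2 = get_dummy(df2, [0, 1, 2])  # adds a combined dummy_012 column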

Related

How to merge strings that have certain number of substrings in common to produce some groups in a data frame in Python

I asked a similar question before, but that was a simpler one and it has been resolved: how to merge strings that have substrings in common to produce some groups in a data frame in Python.
Here I have a more advanced version of that question.
I have some sample data:
a = pd.DataFrame({'ACTIVITY': ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']})
What I want to do is merge some strings if they have substrings in common. In this example, the strings 'b,c', 'a' and 'a,c,d,e' should be merged together because they can be linked to each other, and 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope I can have something like:
           group
'b,c'          0
'a'            0
'a,c,d,e'      0
'f,g,h,i'      1
'j,k,l'        2
'k,l,m'        2
So I can have three groups, and there are no common substrings between any two groups.
Now I am trying to build up a similarity data frame, in which 1 means two strings have substrings in common. Here is my code:
commonWords = 1
for i in np.arange(a.shape[0]):
    a.loc[:, a.loc[i, 'ACTIVITY']] = 0
for i in a.loc[:, 'ACTIVITY']:
    il = i.split(',')
    for j in a.loc[:, 'ACTIVITY']:
        jl = j.split(',')
        c = [x in il for x in jl]
        c1 = [x for x in c if x == True]
        a.loc[(a.loc[:, 'ACTIVITY'] == i), j] = 1 if len(c1) >= commonWords else 0
a
The result is:
  ACTIVITY  b,c  a  a,c,d,e  f,g,h,i  j,k,l  k,l,m
0      b,c    1  0        1        0      0      0
1        a    0  1        1        0      0      0
2  a,c,d,e    1  1        1        0      0      0
3  f,g,h,i    0  0        0        1      0      0
4    j,k,l    0  0        0        0      1      1
5    k,l,m    0  0        0        0      1      1
In this code, commonWords means how many substrings I hope two strings have in common. For example, if commonWords=2, then two strings will be merged together only if they have two or more substrings in common. When commonWords=2, the groups should be:
           group
'b,c'          0
'a'            1
'a,c,d,e'      2
'f,g,h,i'      3
'j,k,l'        4
'k,l,m'        4
Use:
from itertools import combinations, chain
from collections import Counter
import numpy as np
import pandas as pd

a = pd.DataFrame({'ACTIVITY': ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']})

# split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
commonWords = 2
# create edges (each edge can only connect two nodes)
L2_nested = [list(combinations(l, commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
# convert values to sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]
# create new columns for matched sets
for val in f1:
    j = ','.join(val)
    a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print(a)
# forward-fill the new columns across each row and factorize for groups
new = pd.factorize(a.assign(ACTIVITY=a.index).ffill(axis=1).iloc[:, -1])[0]
a = a[['ACTIVITY']].assign(group=new)
print(a)
  ACTIVITY  group
0      b,c      0
1        a      1
2  a,c,d,e      2
3  f,g,h,i      3
4    j,k,l      4
5    k,l,m      4
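An alternative way to think about the grouping step is as connected components in a graph whose nodes are the rows and whose edges connect rows sharing at least commonWords tokens. A sketch with a hand-rolled union-find (not from the original answer, but it reproduces both expected outputs):
from itertools import combinations
import pandas as pd

a = pd.DataFrame({'ACTIVITY': ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']})
commonWords = 1
tokens = [set(s.split(',')) for s in a['ACTIVITY']]
parent = list(range(len(tokens)))

def find(i):
    # follow parents up to the root, halving the path as we go
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

# union any two rows that share at least commonWords tokens
for i, j in combinations(range(len(tokens)), 2):
    if len(tokens[i] & tokens[j]) >= commonWords:
        parent[find(i)] = find(j)

a['group'] = pd.factorize([find(i) for i in range(len(tokens))])[0]
print(a)  # commonWords=1 gives groups 0,0,0,1,2,2; commonWords=2 gives 0,1,2,3,4,4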

Manipulating data with string titles

I have sales data whose columns include
sales_2000, sales_2001, sales_2002...sales_2020
I am trying to extract rows that have the following features:
the first 8 years have zero values
on the 9th year, the value is larger than 0
Any suggestions on how to code this using pandas?
I have simplified this problem for my answer because I didn't want to do that much typing. In the future, please provide actual sample data and any code you may have already tried in order to solve this problem. That being said, here is how you could find rows where the first two years equal 0 and the third is greater than 0:
In:
import pandas as pd

df = pd.DataFrame(dict(
    sales_2000=[1, 0, 10, 0, 5, 0],
    sales_2001=[2, 0, 8, 1, 0, 0],
    sales_2002=[1, 2, 3, 0, 0, 4],
))
print(f'Original DataFrame:\n{df}')

df_extracted = df[
    (df['sales_2000'] == 0)
    & (df['sales_2001'] == 0)
    & (df['sales_2002'] > 0)
]
print(f'\nExtracted DataFrame:\n{df_extracted}')
Out:
Original DataFrame:
   sales_2000  sales_2001  sales_2002
0           1           2           1
1           0           0           2
2          10           8           3
3           0           1           0
4           5           0           0
5           0           0           4

Extracted DataFrame:
   sales_2000  sales_2001  sales_2002
1           0           0           2
5           0           0           4
The key here is to wrap each condition in round brackets, (condition), and use the & operator to combine the conditions. The and keyword will not work here.
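To scale this up to the original question (first 8 years zero, 9th year above zero) without typing every condition, you could build the mask from a list of column names. A minimal sketch, assuming the columns really are named sales_2000 through sales_2020:
# years[:8] are the first 8 year columns, years[8] is the 9th
years = [f'sales_{y}' for y in range(2000, 2021)]
mask = (df[years[:8]] == 0).all(axis=1) & (df[years[8]] > 0)
df_extracted = df[mask]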

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
One fully vectorized solution is to use the shift-groupby-cumsum-cumcount combination[1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
          .astype(int)  # cast the boolean Series back to integers
This produces the new column in the DataFrame:
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
[1] See the pandas cookbook, in the section on grouping: "Grouping like Python's itertools.groupby".
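Unpacking that one-liner into named steps may make the limit easier to tweak; this is the same logic, just spelled out:
limit = 2
# give every run of consecutive equal values its own id
run_id = (df.A != df.A.shift()).cumsum()
# position of each row within its run: 0, 1, 2, ...
pos_in_run = df.groupby(run_id).cumcount()
# keep a 1 only while its position in the run is below the limit
df['B'] = ((pos_in_run < limit) & df.A.astype(bool)).astype(int)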
Another way (checking whether the previous two values are all 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))  # list(...) needed in Python 3
In [446]: df
Out[446]:
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object; it can just be a normal NumPy array):
import numpy
a = df['A'].to_numpy(copy=True)  # .as_matrix() in older pandas, now removed
and convolve it with a sequence of 1s that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    # zero out every element that sits in a run of 1s longer than cutoff
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
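A quick usage sketch of trim_runs on the column from the question; copy=True keeps the function from zeroing df['A'] in place through a shared buffer:
import numpy
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = trim_runs(df['A'].to_numpy(copy=True), 2)
print(df['B'].tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]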

When using Pandas DataFrame.sort(), can I make it actually renumber the rows?

I am always surprised by this:
> data = DataFrame({'x': [1, 2], 'y': [2, 1]})
> data = data.sort_values('y')  # data.sort('y') in older pandas versions
> data
   x  y
1  2  1
0  1  2
> data['x'][0]
1
Is there a way I can cause the indices to be reassigned to fit the new ordering?
For my part, I'm glad that sort doesn't throw away the index information. If it did, there wouldn't be much point to having an index in the first place, as opposed to another column.
If you want to reset the index to a range, you could:
>>> data
   x  y
1  2  1
0  1  2
>>> data.reset_index(drop=True)
   x  y
0  2  1
1  1  2
Where you could reassign or use inplace=True as you liked. If instead the real issue is that you want to access by position independent of index, you could use iloc:
>>> data['x']
1    2
0    1
Name: x, dtype: int64
>>> data['x'][0]
1
>>> data['x'].iloc[0]
2
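In recent pandas versions (1.0 and later), sort_values can also renumber in the same step via its ignore_index flag:
>>> data.sort_values('y', ignore_index=True)
   x  y
0  2  1
1  1  2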

Matrix of zeroes and ones without numpy

How would I create a matrix of single zeroes and ones in a size I specify, without numpy? I tried looking this up but I only found results that use it. I guess it would involve loops? Unless there's a simpler method?
For example, the size I specify could be 3 and the grid would be 3x3.
       Col 0  Col 1  Col 2
Row 0      0      1      0
Row 1      0      0      1
Row 2      1      1      1
You could use a list comprehension:
def m(s):
    # one row of s zeros, repeated s times
    return [s * [0] for _ in range(s)]  # xrange in Python 2
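The question's example grid mixes zeroes and ones; if those entries are meant to be random, a small sketch using only the standard-library random module could look like this:
import random

def random_binary_matrix(size):
    # size x size grid of independent random 0/1 values
    return [[random.randint(0, 1) for _ in range(size)] for _ in range(size)]

for row in random_binary_matrix(3):
    print(row)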
