I asked a question like this before, but that was a simpler one and it has been resolved: how to merge strings that have substrings in common to produce groups in a data frame in Python.
Here, I have a more advanced version of that question.
I have some sample data:
import pandas as pd
a = pd.DataFrame({'ACTIVITY': ['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
What I want to do is merge strings if they have substrings in common. In this example, 'b,c', 'a', and 'a,c,d,e' should be merged together because they can be linked to each other ('b,c' shares 'c' with 'a,c,d,e', which shares 'a' with 'a'). 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope to have something like:
           group
'b,c'          0
'a'            0
'a,c,d,e'      0
'f,g,h,i'      1
'j,k,l'        2
'k,l,m'        2
So I can have three groups, and no two groups share any common substring.
Now I am trying to build up a similarity data frame, in which 1 means two strings have substrings in common. Here is my code:
import numpy as np

commonWords = 1
for i in np.arange(a.shape[0]):
    a.loc[:, a.loc[i, 'ACTIVITY']] = 0
for i in a.loc[:, 'ACTIVITY']:
    il = i.split(',')
    for j in a.loc[:, 'ACTIVITY']:
        jl = j.split(',')
        c = [x in il for x in jl]
        c1 = [x for x in c if x]
        a.loc[a.loc[:, 'ACTIVITY'] == i, j] = 1 if len(c1) >= commonWords else 0
a
The result is:
ACTIVITY b,c a a,c,d,e f,g,h,i j,k,l k,l,m
0 b,c 1 0 1 0 0 0
1 a 0 1 1 0 0 0
2 a,c,d,e 1 1 1 0 0 0
3 f,g,h,i 0 0 0 1 0 0
4 j,k,l 0 0 0 0 1 1
5 k,l,m 0 0 0 0 1 1
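Once a 0/1 similarity matrix like this exists, one way to turn it into group labels is to treat it as an adjacency matrix and take its connected components. A minimal sketch, assuming scipy is available and that the new columns were created in the same order as the rows:

from scipy.sparse.csgraph import connected_components

sim = a.drop(columns='ACTIVITY').to_numpy()           # the 6x6 0/1 matrix above
n_groups, labels = connected_components(sim, directed=False)
a['group'] = labels                                   # [0, 0, 0, 1, 2, 2] for commonWords=1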
In this code, commonWords is how many substrings two strings must have in common. For example, if commonWords=2, two strings will be merged only if they share two or more substrings. When commonWords=2, the groups should be:
           group
'b,c'          0
'a'            1
'a,c,d,e'      2
'f,g,h,i'      3
'j,k,l'        4
'k,l,m'        4
Use:
import numpy as np
import pandas as pd
from itertools import combinations, chain
from collections import Counter

a = pd.DataFrame({'ACTIVITY': ['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})

# split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
commonWords = 2
# create edges (each edge connects two nodes)
L2_nested = [list(combinations(l, commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
# convert values to sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]
# create new columns for matched sets
for val in f1:
    j = ','.join(val)
    a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print (a)
# forward fill values of the new columns and use factorize for group labels
new = pd.factorize(a.assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]
a = a[['ACTIVITY']].assign(group = new)
print (a)
ACTIVITY group
0 b,c 0
1 a 1
2 a,c,d,e 2
3 f,g,h,i 3
4 j,k,l 4
5 k,l,m 4
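To see why the last two lines work, here is a sketch of the intermediate values (not printed in the answer above). For commonWords=2 the only pair that occurs at least twice across rows is ('k', 'l'), so f1 == [{'k', 'l'}] and only rows 4 and 5 get a value in the new column (its name may come out as 'k,l' or 'l,k', since sets are unordered). Replacing ACTIVITY with the row index before the forward fill means every unmatched row keeps its own index as its group key and becomes a singleton group:

# what factorize sees (commonWords=2), assuming the steps above:
# ffill(axis=1).iloc[:, -1]  ->  [0, 1, 2, 3, 'k,l', 'k,l']
# pd.factorize(...)[0]       ->  [0, 1, 2, 3, 4, 4]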
I have sales data whose columns include
sales_2000, sales_2001, sales_2002, ..., sales_2020
I am trying to extract rows with the following features:
the first 8 years all have a value of zero;
in the 9th year, the value is larger than 0.
Any suggestions on how to code this using pandas?
I have simplified this problem for my answer because I didn't want to do that much typing. In the future, please provide actual sample data and any code you have already tried in order to solve the problem. That being said, here is how you could find rows whose first two years equal 0 and whose third year is greater than 0, using boolean masks:
In:
import pandas as pd
df = pd.DataFrame(dict(
sales_2000 = [1,0,10,0,5,0],
sales_2001 = [2,0,8,1,0,0],
sales_2002 = [1,2,3,0,0,4],
))
print(f'Original DataFrame:\n{df}')
df_extracted = df[
(
df['sales_2000'] == 0
) & (
df['sales_2001'] == 0
) & (
df['sales_2002'] > 0
)
]
print(f'\nExtracted DataFrame:\n{df_extracted}')
Out:
Original DataFrame:
sales_2000 sales_2001 sales_2002
0 1 2 1
1 0 0 2
2 10 8 3
3 0 1 0
4 5 0 0
5 0 0 4
Extracted DataFrame:
sales_2000 sales_2001 sales_2002
1 0 0 2
5 0 0 4
The key here is to wrap each condition in round brackets and combine the conditions with the & operator; the Python and keyword will not work here.
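For the original 8-year question, the same idea scales better with a list of column names than with twenty hand-typed conditions. A minimal sketch, assuming the columns are named sales_2000 through sales_2020 so that the 9th year is sales_2008:

first_eight = [f'sales_{y}' for y in range(2000, 2008)]  # the first 8 years
mask = (df[first_eight] == 0).all(axis=1) & (df['sales_2008'] > 0)
df_extracted = df[mask]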
Given a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to cap the length of each run of 1s at some limiting value? Say the limit is 2; then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to flag, within each consecutive run, only the first 2 positions (or whatever limiting value you like). Then & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook, the section on grouping: "Grouping like Python's itertools.groupby".
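Breaking the one-liner into steps may make it easier to follow. This is the same computation with intermediate names (runs and pos are labels of mine, not from the answer above):

runs = (df.A != df.A.shift()).cumsum()      # label each consecutive run of equal values
pos = df.groupby(runs).cumcount()           # 0-based position within each run
df['B'] = ((pos <= 1) & df.A).astype(int)   # keep only the first 2 elements of each run of 1s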
Another way (checking whether the previous limit values are all 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(df['A'][x] and not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, you can use a little trick involving convolution. Make a copy of your column (it need not be a Pandas object; a plain NumPy array is fine):

import numpy
a = df['A'].to_numpy(copy=True)  # df['A'].as_matrix() in older pandas

Convolve it with a sequence of 1s that is one longer than the cutoff you want, then chop off the last cutoff elements. For a cutoff of 2, you would do:

long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]

The resulting array gives, for each position, the number of 1s among the 3 elements up to and including that position. If that number is 3, you are in a run that has exceeded length 2, so set those elements to zero:

a[long_run_count > 2] = 0

You can now assign the resulting array to a new column in your DataFrame:

df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
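A quick usage sketch, passing a copy because trim_runs modifies the array in place when it is handed a NumPy array:

df['B'] = trim_runs(df['A'].to_numpy(copy=True), 2)  # -> [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]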
I am always surprised by this:
> from pandas import DataFrame
> data = DataFrame({'x': [1, 2], 'y': [2, 1]})
> data = data.sort_values('y')  # DataFrame.sort was removed; use sort_values
> data
x y
1 2 1
0 1 2
> data['x'][0]
1
Is there a way I can cause the indices to be reassigned to fit the new ordering?
For my part, I'm glad that sort_values doesn't throw away the index information. If it did, there wouldn't be much point to having an index in the first place, as opposed to just another column.
If you want to reset the index to a range, you could:
>>> data
x y
1 2 1
0 1 2
>>> data.reset_index(drop=True)
x y
0 2 1
1 1 2
You could reassign the result or use inplace=True, as you like. If instead the real issue is that you want to access by position independently of the index, you can use iloc:
>>> data['x']
1 2
0 1
Name: x, dtype: int64
>>> data['x'][0]
1
>>> data['x'].iloc[0]
2
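If your pandas is recent enough (1.0 or later), sort_values can also reset the index in a single step via ignore_index:

>>> data.sort_values('y', ignore_index=True)
   x  y
0  2  1
1  1  2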
How would I create a matrix of zeroes and ones of a size I specify, without numpy? I tried looking this up, but I only found results that use it. I guess it would be done with loops? Or is there a simpler method?
For example, the size I specify could be 3 and the grid would be 3x3.
Col 0 Col 1 Col 2
Row 0 0 1 0
Row 1 0 0 1
Row 2 1 1 1
You could use a list comprehension:
def m(s):
    return [s * [0] for _ in range(s)]  # xrange in Python 2
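This builds an s-by-s grid of zeroes whose cells can then be set individually. A small usage sketch:

grid = m(3)
grid[0][1] = 1
print(grid)  # [[0, 1, 0], [0, 0, 0], [0, 0, 0]]

Note that s * [0] is evaluated once per row inside the comprehension, so each row is an independent list; the common pitfall [[0] * s] * s would instead make every row an alias of the same list.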