I have sales data with columns including
sales_2000, sales_2001, sales_2002...sales_2020
I am trying to extract rows that have the following features:
the first 8 years have a value of zero
the 9th year has a value larger than 0.
Any suggestions on how to code this using pandas?
I have simplified this problem for my answer because I didn't want to do that much typing. In the future, please provide actual sample data and any code you may have already tried in order to solve this problem. That being said, here is how you could find rows that have the first two years equal to 0 and the third greater than 0 using boolean masks:
In:
import pandas as pd
df = pd.DataFrame(dict(
sales_2000 = [1,0,10,0,5,0],
sales_2001 = [2,0,8,1,0,0],
sales_2002 = [1,2,3,0,0,4],
))
print(f'Original DataFrame:\n{df}')
df_extracted = df[
    (
        df['sales_2000'] == 0
    ) & (
        df['sales_2001'] == 0
    ) & (
        df['sales_2002'] > 0
    )
]
print(f'\nExtracted DataFrame:\n{df_extracted}')
Out:
Original DataFrame:
sales_2000 sales_2001 sales_2002
0 1 2 1
1 0 0 2
2 10 8 3
3 0 1 0
4 5 0 0
5 0 0 4
Extracted DataFrame:
sales_2000 sales_2001 sales_2002
1 0 0 2
5 0 0 4
The key here is to wrap each condition in parentheses, as in (condition), and to combine the conditions with the & operator. The and keyword will not work here, because it cannot compare two Series element by element.
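The same idea can be scaled to the original question (first 8 years zero, 9th year positive) without writing one condition per column. This is only a sketch: the three sample rows are made up for illustration, and it assumes the columns are named sales_2000 through sales_2020 as described in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: columns sales_2000 .. sales_2020, three rows
cols = [f'sales_{year}' for year in range(2000, 2021)]
data = np.zeros((3, len(cols)), dtype=int)
data[0, 8:] = 5   # row 0: zero for 2000-2007, positive from 2008 on
data[1, 3:] = 2   # row 1: already positive in 2003
                  # row 2: zero in every year
df = pd.DataFrame(data, columns=cols)

first_eight = cols[:8]   # sales_2000 .. sales_2007
ninth = cols[8]          # sales_2008

# True where all of the first eight years are 0 and the ninth year is > 0
mask = df[first_eight].eq(0).all(axis=1) & df[ninth].gt(0)
df_extracted = df[mask]
print(df_extracted.index.tolist())
```

Selecting the columns by position in a list keeps the condition short however many years have to be checked.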
I want to count, for each row, the number of previous rows that have a greater value than the current row in a column, and store it in a new column. It would be like a rolling countif that goes back to the beginning of the column. The desired example output below shows the given Value column and the Count column I want to create.
Desired Output:
Value Count
5 0
7 0
4 2
12 0
3 4
4 3
1 6
I plan on using this code with a large dataframe so the fastest way possible is appreciated.
We can use np.subtract.outer from NumPy to build the matrix of pairwise differences, keep its lower triangle, check where the difference is less than 0, and sum those hits per row:
a = np.sum(np.tril(np.subtract.outer(df.Value.values,df.Value.values), k=0)<0, axis=1)
# results in array([0, 0, 2, 0, 4, 3, 6])
df['Count'] = a
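For reference, here is a self-contained version of the same idea, using the sample values from the question. Note that the outer difference builds an n-by-n matrix, so memory grows quadratically with the number of rows; for a very large dataframe that may be the limiting factor.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [5, 7, 4, 12, 3, 4, 1]})
values = df['Value'].to_numpy()

# diff[i, j] = values[i] - values[j]
diff = np.subtract.outer(values, values)

# Keep only j <= i (the current row and the rows before it); the diagonal is
# always 0, so the current row never counts itself.
lower = np.tril(diff, k=0)

# A negative entry means an earlier value is greater than the current one.
df['Count'] = (lower < 0).sum(axis=1)
print(df['Count'].tolist())
```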
IMPORTANT: this only works with pandas < 1.0.0, and the error seems to be a pandas bug. An issue has already been opened at https://github.com/pandas-dev/pandas/issues/35203
We can do this with expanding and applying a function which checks for values that are higher than the last element in the expanding array.
import pandas as pd
import numpy as np
# setup
df = pd.DataFrame([5,7,4,12,3,4,1], columns=['Value'])
# calculate countif
df['Count'] = df.Value.expanding(1).apply(lambda x: np.sum(np.where(x > x[-1], 1, 0))).astype('int')
Input
Value
0 5
1 7
2 4
3 12
4 3
5 4
6 1
Output
Value Count
0 5 0
1 7 0
2 4 2
3 12 0
4 3 4
5 4 3
6 1 6
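A possible variant for newer pandas versions is sketched below. It assumes the failure comes from apply handing the lambda a Series by default (raw=False), so that x[-1] is treated as a label lookup; passing raw=True hands over a plain NumPy array instead, where x[-1] is the last element.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([5, 7, 4, 12, 3, 4, 1], columns=['Value'])

# raw=True passes each expanding window as a NumPy array, so x[-1] is the
# current (last) value and x > x[-1] marks the earlier values that exceed it.
df['Count'] = (
    df['Value']
    .expanding(1)
    .apply(lambda x: np.sum(x > x[-1]), raw=True)
    .astype('int')
)
print(df['Count'].tolist())
```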
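values = [5, 7, 4, 12, 3, 4, 1]  # sample values from the question
counts = []
for i in range(len(values)):
    count = 0
    # count the earlier elements that are greater than the current one
    for j in values[:i]:
        if values[i] < j:
            count += 1
    counts.append(count)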
The below generator will do what you need. You may be able to further optimize this if needed.
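def generator(data):
    # Works for integer values only: it scans every integer between v and max(data).
    m = max(data)  # computed once instead of on every iteration
    count_dict = {}  # how many times each value has been seen so far
    for v in data:
        count_dict[v] = count_dict.get(v, 0) + 1
        # earlier values strictly greater than v; m + 1 so the maximum itself is included
        t = sum(count_dict.get(j, 0) for j in range(v + 1, m + 1))
        yield t

d = [1, 5, 7, 3, 5, 8]
foo = generator(d)
result = [b for b in foo]
print(result)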
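Another option, offered here only as a sketch, is to keep the values seen so far in a sorted list and use binary search from the standard library's bisect module. Unlike the range scan above, it does not require integer values. The function name count_previous_greater is made up for this example.

```python
from bisect import bisect_right, insort

def count_previous_greater(values):
    """For each element, count the earlier elements that are strictly greater."""
    seen = []     # earlier values, kept in sorted order
    counts = []
    for v in values:
        # Everything to the right of bisect_right(seen, v) is strictly greater than v.
        counts.append(len(seen) - bisect_right(seen, v))
        insort(seen, v)  # insert v while keeping the list sorted
    return counts

print(count_previous_greater([5, 7, 4, 12, 3, 4, 1]))
```

The lookup is logarithmic, but insort still shifts list elements on every insert, so the worst case remains quadratic, just with a small constant factor.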
I asked a similar but simpler question, which has been resolved: how to merge strings that have substrings in common to produce some groups in a data frame in Python.
Here I have a more advanced version of that question.
I have some sample data:
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
What I want to do is merge strings that have substrings in common. In this example, the strings 'b,c', 'a' and 'a,c,d,e' should be merged together because they can be linked to each other. 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope to have something like:
group
'b,c', 0
'a', 0
'a,c,d,e', 0
'f,g,h,i', 1
'j,k,l', 2
'k,l,m' 2
That gives three groups, with no common substrings between any two groups.
Now I am trying to build a similarity data frame in which 1 means that two strings have substrings in common. Here is my code:
commonWords=1
for i in np.arange(a.shape[0]):
    a.loc[:,a.loc[i,'ACTIVITY']]=0
for i in a.loc[:,'ACTIVITY']:
    il=i.split(',')
    for j in a.loc[:,'ACTIVITY']:
        jl=j.split(',')
        c=[x in il for x in jl]
        c1=[x for x in c if x==True]
        a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0
a
The result is:
ACTIVITY b,c a a,c,d,e f,g,h,i j,k,l k,l,m
0 b,c 1 0 1 0 0 0
1 a 0 1 1 0 0 0
2 a,c,d,e 1 1 1 0 0 0
3 f,g,h,i 0 0 0 1 0 0
4 j,k,l 0 0 0 0 1 1
5 k,l,m 0 0 0 0 1 1
In this code, commonWords is the number of substrings that two strings must have in common. For example, if commonWords=2, two strings will be merged only if they share two or more substrings. When commonWords=2, the groups should be:
group
'b,c', 0
'a', 1
'a,c,d,e', 2
'f,g,h,i', 3
'j,k,l', 4
'k,l,m' 4
Use:
import numpy as np
import pandas as pd
from itertools import combinations, chain
from collections import Counter
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
#split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
commonWords=2
#create edges (can only connect two nodes)
L2_nested = [list(combinations(l,commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
#convert values to sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]
#create new columns for matched sets
for val in f1:
    j = ','.join(val)
    a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print (a)
#forward filling values of new columns and use factorize for groups
#(ffill must run on the full frame a so that the matched-set columns are included)
new = pd.factorize(a.assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]
a = a[['ACTIVITY']].assign(group = new)
print (a)
ACTIVITY group
0 b,c 0
1 a 1
2 a,c,d,e 2
3 f,g,h,i 3
4 j,k,l 4
5 k,l,m 4
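Since the question describes strings that "can be linked to each other", the grouping can also be read as finding connected components. The sketch below is a different technique from the answer above: a small union-find over all pairs of rows, where two rows are linked if they share at least common_words substrings. The function name group_by_shared_tokens and its parameter are made up for this example, and the pairwise loop is quadratic in the number of rows.

```python
import pandas as pd

def group_by_shared_tokens(strings, common_words=1):
    """Label each string with the id of its connected component."""
    sets = [set(s.split(',')) for s in strings]
    parent = list(range(len(sets)))  # union-find forest, one node per row

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of rows that shares enough substrings.
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if len(sets[i] & sets[j]) >= common_words:
                parent[find(j)] = find(i)

    roots = pd.Series([find(i) for i in range(len(sets))])
    # factorize renumbers the roots 0, 1, 2, ... in order of first appearance
    return pd.factorize(roots)[0]

a = pd.DataFrame({'ACTIVITY': ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']})
a['group'] = group_by_shared_tokens(a['ACTIVITY'], common_words=1)
print(a)
```

With common_words=1 this reproduces the three groups from the first table in the question, and with common_words=2 it reproduces the five groups from the second.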
I have a dataframe with one column called label which has the values [0,1,2,3,4,5,6,8,9].
I would like to make dummy columns out of this, but I would like some labels to be joined together, so for example I want dummy_012 to be 1 if the observation has either label 0, 1 or 2.
If I use the command df2 = pd.get_dummies(df, columns=['label']), it would create 9 columns, 1 for each label.
I know I can use df2['dummy_012']=df2['dummy_0']+df2['dummy_1']+df2['dummy_2'] after that to turn it into one joint column, but I want to know if there's a more pythonic way of doing it (or some function where I can just change the parameters to define the joins).
Maybe this approach can give you an idea:
groups = ['012', '345', '6789']
for gp in groups:
    # each character of the group string is one label, e.g. '012' -> [0, 1, 2]
    df.loc[df['Label'].isin([int(x) for x in gp]), 'Label_Group'] = f'dummies_{gp}'
Output:
Label Label_Group
0 0 dummies_012
1 1 dummies_012
2 2 dummies_012
3 3 dummies_345
4 4 dummies_345
5 5 dummies_345
6 6 dummies_6789
7 8 dummies_6789
8 9 dummies_6789
And then apply get_dummies to the new column:
df_dummies = pd.get_dummies(df['Label_Group'])
dummies_012 dummies_345 dummies_6789
0 1 0 0
1 1 0 0
2 1 0 0
3 0 1 0
4 0 1 0
5 0 1 0
6 0 0 1
7 0 0 1
8 0 0 1
I don't know that this is pythonic, because a more elegant solution might exist, but it does allow you to change parameters and it's vectorized. I've read that get_dummies() can be a bit slow with large amounts of data, and vectorizing pandas code is good practice in general. So I vectorized this function and had it do its calculations with numpy arrays. It should give you a boost in performance as the dataset increases in size compared to similar functions.
This function will take your dataframe and a list of numbers as strings and will return your dataframe with the column you wanted.
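def get_dummy(df,column_nos):
    # column_nos is a list of label numbers as strings, e.g. ['0', '1', '2'];
    # this assumes the dummy columns are named dummy_<n>, as in the question
    new_col_name = 'dummy_'+''.join([i for i in column_nos])
    vector_sum = sum([df['dummy_'+i].values for i in column_nos])
    # 1 where at least one of the selected dummy columns is set
    df[new_col_name] = (vector_sum > 0).astype(int)
    return df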
In case you'd rather the input be integers rather than strings, you can tweak the above function to look like below.
def get_dummy(df,column_nos):
    # column_nos is a list of label numbers as integers, e.g. [0, 1, 2]
    column_names = ['dummy_'+str(i) for i in column_nos]
    new_col_name = 'dummy_'+''.join([str(i) for i in sorted(column_nos)])
    vector_sum = sum([df[i].values for i in column_names])
    # 1 where at least one of the selected dummy columns is set
    df[new_col_name] = (vector_sum > 0).astype(int)
    return df
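A short usage sketch for the integer version follows. It assumes the dummy columns are created with prefix='dummy', so that they are named dummy_0, dummy_1 and so on as in the question, and it passes dtype=int so the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

def get_dummy(df, column_nos):
    # Combine the existing dummy_<n> columns for the given labels into one column.
    column_names = ['dummy_' + str(i) for i in column_nos]
    new_col_name = 'dummy_' + ''.join(str(i) for i in sorted(column_nos))
    vector_sum = sum(df[c].values for c in column_names)
    df[new_col_name] = (vector_sum > 0).astype(int)
    return df

df = pd.DataFrame({'label': [0, 1, 2, 3, 4, 5, 6, 8, 9]})

# prefix='dummy' yields columns dummy_0, dummy_1, ... (one per label)
df2 = pd.get_dummies(df, columns=['label'], prefix='dummy', dtype=int)
df2 = get_dummy(df2, [0, 1, 2])
print(df2['dummy_012'].tolist())
```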
Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby"
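To make the shift-groupby-cumsum-cumcount combination easier to follow, here it is broken into named steps. The intermediate names (starts, run_id, pos, limit) are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2

# True wherever the value differs from the previous row, i.e. a new run starts
starts = df['A'] != df['A'].shift()

# Running total of run starts gives every consecutive run its own id
run_id = starts.cumsum()

# Position of each row inside its run: 0, 1, 2, ...
pos = df.groupby(run_id).cumcount()

# Keep a 1 only while its position inside the run is below the limit
df['B'] = ((pos < limit) & (df['A'] == 1)).astype(int)
print(df)
```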
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(df['A'][x] == 1 and not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].to_numpy().copy()  # as_matrix() was removed in pandas 1.0; copy so df['A'] is left untouched
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
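import numpy

def trim_runs(array, cutoff):
    a = numpy.array(array)  # numpy.array copies, so the caller's data is not modified
    # windowed count of 1's over the current element and the cutoff elements before it
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a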
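A quick check of this approach against the example column is sketched below. The function is repeated so the snippet runs on its own, with one small change that is my own choice: the result of the convolution is sliced with [:len(a)] rather than [:-cutoff], which gives the same window counts but also behaves for cutoff=0.

```python
import numpy as np

def trim_runs(array, cutoff):
    a = np.array(array)  # work on a copy of the input
    # window_sum[i] = number of 1's among element i and the cutoff elements before it
    window_sum = np.convolve(a, np.ones(cutoff + 1, dtype=int))[:len(a)]
    # a full window of 1's means the run has exceeded the cutoff at this element
    a[window_sum > cutoff] = 0
    return a

column_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
print(trim_runs(column_a, 2).tolist())
```

As stated in the answer, this relies on the values being only 0 or 1.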
I have a dataframe which shows: 1) dates, 2) prices and 3) the difference between two prices by row.
dates | data | result | change
24-09 24 0 none
25-09 26 2 pos
26-09 27 1 pos
27-09 28 1 pos
28-09 26 -2 neg
I want to create a summary of the above data in a new dataframe. The summary would have 4 columns: 1) start date, 2) end date 3) number of days 4) run
For example, using the above there was a positive run of +4 from 25-09 to 27-09, so I would want this as one row of the new dataframe.
In the new dataframe there would be one new row for every change in the value of result from positive to negative. Where run = 0, this indicates no change from the previous day's price and would also need its own row in the dataframe, like so:
start date | end date | num days | run
25-09 27-09 3 4
27-09 28-09 1 -2
23-09 24-09 1 0
The first step I think would be to create a new column "change" based on the value of run which then shows either of: "positive","negative" or "no change". Then maybe I could groupby this column.
A couple of useful functions for this style of problem are diff() and cumsum().
I added some extra datapoints to your sample data to flesh out the functionality.
The ability to pick and choose different (and more than one) aggregation functions assigned to different columns is a super feature of pandas.
import pandas as pd

df = pd.DataFrame({'dates': ['24-09', '25-09', '26-09', '27-09', '28-09', '29-09', '30-09','01-10','02-10','03-10','04-10'],
                   'data': [24, 26, 27, 28, 26,25,30,30,30,28,25],
                   'result': [0,2,1,1,-2,0,5,0,0,-2,-3]})

def cat(x):
    return 1 if x > 0 else -1 if x < 0 else 0

df['cat'] = df['result'].map(cat) # probably there is a better way to do this
df['change'] = df['cat'].diff()
df['change_flag'] = df['change'].map(lambda x: 1 if x != 0 else x)
df['change_cum_sum'] = df['change_flag'].cumsum() # which gives us our groupings
# 'dates' is listed first so the aggregated columns line up with the names assigned below
foo = df.groupby(['change_cum_sum']).agg({'dates' : ['min','max','count'], 'result' : 'sum'})
foo.reset_index(inplace=True)
foo.columns = ['id','start date','end date','num days','run' ]
print(foo)
which yields:
id start date end date num days run
0 1 24-09 24-09 1 0
1 2 25-09 27-09 3 4
2 3 28-09 28-09 1 -2
3 4 29-09 29-09 1 0
4 5 30-09 30-09 1 5
5 6 01-10 02-10 2 0
6 7 03-10 04-10 2 -5
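A more compact variant of the same idea is sketched below, using np.sign for the category and named aggregation (available since pandas 0.25). The column names start_date, end_date, num_days and run are chosen for this example. It takes the first and last date of each run instead of min and max, which assumes the rows are already in date order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'dates': ['24-09', '25-09', '26-09', '27-09', '28-09', '29-09',
              '30-09', '01-10', '02-10', '03-10', '04-10'],
    'result': [0, 2, 1, 1, -2, 0, 5, 0, 0, -2, -3],
})

# -1, 0 or 1 depending on the direction of the daily change
sign = np.sign(df['result'])

# A new run starts whenever the sign differs from the previous row
run_id = (sign != sign.shift()).cumsum().rename('run_id')

summary = (
    df.groupby(run_id)
      .agg(start_date=('dates', 'first'),
           end_date=('dates', 'last'),
           num_days=('dates', 'count'),
           run=('result', 'sum'))
      .reset_index(drop=True)
)
print(summary)
```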