Python pandas: how to pick up certain values by internal numbering?

I have a dataframe that looks like this:
Answers all_answers Score
0 0.0 0 72
1 0.0 0 73
2 0.0 0 74
3 1.0 1 1
4 -1.0 1 2
5 1.0 1 3
6 -1.0 1 4
7 1.0 1 5
8 0.0 0 1
9 0.0 0 2
10 -1.0 1 1
11 0.0 0 1
12 0.0 0 2
13 1.0 1 1
14 0.0 0 1
15 0.0 0 2
16 1.0 1 1
The first column (Answers) is a signal whose sign changes during the calculation flow.
The second column (all_answers) is the first one with the minus sign removed.
The third column (Score) is an internal counter over the second column: how many ones or zeros have occurred in a row.
I want to add a fourth column that keeps only the ones that run for, say, at least 5 rows in a row, while preserving the sign from the first column.
To get something like this:
Answers all_answers Score New
0 0.0 0 72 0
1 0.0 0 73 0
2 0.0 0 74 0
3 1.0 1 1 1
4 -1.0 1 2 -1
5 1.0 1 3 1
6 -1.0 1 4 -1
7 1.0 1 5 1
8 0.0 0 1 0
9 0.0 0 2 0
10 -1.0 1 1 0
11 0.0 0 1 0
12 0.0 0 2 0
13 1.0 1 1 0
14 0.0 0 1 0
15 0.0 0 2 0
16 1.0 1 1 0
17 0.0 0 1 0
Is it possible to do this with pandas?

You can use:
# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# mask small groups
df['New'] = df['Answers'].where(m, 0)
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
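For intuition, here is a minimal self-contained sketch (using only an all_answers column, as in the question) of how the grouping key works: g increments whenever all_answers changes, so every run of equal values gets its own label, and transform('size') broadcasts each run's length back onto its rows:
import pandas as pd

df_small = pd.DataFrame({'all_answers': [0, 0, 1, 1, 1, 0, 1]})
g = df_small['all_answers'].ne(df_small['all_answers'].shift()).cumsum()
print(g.tolist())  # [1, 1, 2, 2, 2, 3, 4] -> one label per run
print(df_small.groupby(g)['all_answers'].transform('size').tolist())  # [2, 2, 3, 3, 3, 1, 1]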

A faster way (with regex):
import pandas as pd
import re

def repl5(m):
    # rewrite each run of five or more '1's as the same number of '5's
    return '5' * len(m.group())

s = df['all_answers'].astype(str).str.cat()  # flatten the 0/1 column into one string
d = re.sub('(?:1{5,})', repl5, s)            # mark qualifying runs
d = [x == '5' for x in d]                    # per-row boolean mask
df['New'] = df['Answers'].where(d, 0.0)
df
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0

Related

Pandas apply function to column taking the value of previous column

I have to create a timeseries using column values for computing the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute this dataframe:
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
There is a 0 if the customer bought something in that month and a 1 otherwise. The column names indicate time periods, with column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
Just use the apply function to iterate over the rows of the dataframe and do the manipulation there.
def apply_function(row):
    return [item if i == 0 else 0 if item == 0 else item + row[i-1]
            for i, item in enumerate(row)]

new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # just to restore the previous column names
Do you insist on using the column structure? It is common with time series to use rows, e.g., a dataframe with columns CustomerID, hasBoughtThisMonth. You can then easily add the Recency column by using a pandas transform().
I cannot post comments yet, hence the question asked this way.
Edit: here is another way to go about it. I took two customers as an example, and some random numbers for whether or not they bought something in a given month.
Basically, you pivot your table and use a groupby + cumsum to get your result. Notice that this approach avoids your dummy column.
import pandas as pd
import numpy as np

np.random.seed(1)

# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})

# Make Recency column by finding contiguous groups of ones, and groupby
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = df.groupby(by=['CustomerID', contiguous_groups],
                           as_index=False)['hasBoughtThisMonth'].cumsum().reset_index(drop=True)
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
It would be easier if you first set CustomerID as the index and transpose your dataframe, then apply your custom function, i.e. something like:
df.T.apply(custom_func)
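For completeness, here is a minimal sketch of what such a custom_func could look like (a hypothetical helper, not part of the original answer; it assumes that after transposing, each column holds one customer's monthly 0/1 flags in time order, with 0 meaning the customer bought that month):
import pandas as pd

def custom_func(col):
    # R(t) = 0 if the customer bought in month t, otherwise R(t-1) + 1
    recency = []
    prev = 0
    for flag in col:
        prev = 0 if flag == 0 else prev + 1
        recency.append(prev)
    return pd.Series(recency, index=col.index)

# usage sketch: one customer per column after transposing
# recency_df = df.set_index('CustomerID').T.apply(custom_func)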

Remove groups from a DataFrame that contain only a single unique value in one column

I am processing data with pandas. Column 'A' is an ID column and column 'E' contains either 1 or 0. I want to keep only the groups whose column E contains both 0 and 1. (I want to delete the rows where column A is 2 or 4, as those groups contain only 1s and only 0s respectively, leaving only the rows where column A is 1, 3, or 5.)
What is the best way to do this?
A B C D E F
1 1 0 0 0 1 1163.7
2 1 0.8 0.8 2.2 0 0
3 1 0.2 0.2 4.4 0 0
4 1 0.8 0.4 0.4 0 0
5 1 0.5 0.7 3.8 0 0
6 2 1 1 8.9 1 116
7 2 1.5 1.5 1.7 1 116
8 2 2 2 8.7 1 116
9 3 3 3 5. 0 0
10 3 4.5 4.5 2.2 0 0
11 3 6.0 6.5 0.8 0 0
12 3 8 8 0.3 0 0
13 3 5.3 0 0 1 116
14 3 0 0 0 1 116
15 4 0.8 0.8 1.1 0 0
16 4 0.2 0.5 3.4 0 0
17 4 0.4 0.8 3.2 0 0
18 4 0.7 0.5 3.0 0 0
19 5 1 1 1.5 0 0
20 5 1.5 1.5 1.7 0 0
21 5 2 2 7.9 1 116
I want to get the following data.
A B C D E F
1 1 0 0 0 1 1163.7
2 1 0.8 0.8 2.2 0 0
3 1 0.2 0.2 4.4 0 0
4 1 0.8 0.4 0.4 0 0
5 1 0.5 0.7 3.8 0 0
6 3 3 3 2.2 0 0
7 3 4.5 4.5 2.2 0 0
8 3 6.0 6.5 0.8 0 0
9 3 8 8 0.3 0 0
10 3 5.3 0 0 1 116
11 3 0 0 0 1 116
12 5 1 1 1.5 0 0
13 5 1.5 1.5 1.7 0 0
14 5 2 2 7.9 1 116
Use Series.groupby on column E and transform using any to create a boolean mask:
m = (df['E'].eq(0).groupby(df['A']).transform('any') &
     df['E'].eq(1).groupby(df['A']).transform('any'))
df1 = df[m]
Or, another idea if column E consists only of zeros and ones:
m = df.groupby('A')['E'].nunique().eq(2)
df1 = df[df['A'].isin(m[m].index)]
Result:
print(df1)
A B C D E F
1 1 0.0 0.0 0.0 1 1163.7
2 1 0.8 0.8 2.2 0 0.0
3 1 0.2 0.2 4.4 0 0.0
4 1 0.8 0.4 0.4 0 0.0
5 1 0.5 0.7 3.8 0 0.0
9 3 3.0 3.0 5.0 0 0.0
10 3 4.5 4.5 2.2 0 0.0
11 3 6.0 6.5 0.8 0 0.0
12 3 8.0 8.0 0.3 0 0.0
13 3 5.3 0.0 0.0 1 116.0
14 3 0.0 0.0 0.0 1 116.0
19 5 1.0 1.0 1.5 0 0.0
20 5 1.5 1.5 1.7 0 0.0
21 5 2.0 2.0 7.9 1 116.0
You can use drop_duplicates on columns A and E followed by groupby.size to see which groups of A have 2 distinct values of E (since E is only 0 or 1), then keep the rows whose A is in the index where the size equals 2:
s = df[['A','E']].drop_duplicates().groupby('A').size()
df_ = df[df['A'].isin(s[s.eq(2)].index)].copy()
print(df_)
A B C D E F
1 1 0.0 0.0 0.0 1 1163.7
2 1 0.8 0.8 2.2 0 0.0
3 1 0.2 0.2 4.4 0 0.0
4 1 0.8 0.4 0.4 0 0.0
5 1 0.5 0.7 3.8 0 0.0
9 3 3.0 3.0 5.0 0 0.0
10 3 4.5 4.5 2.2 0 0.0
11 3 6.0 6.5 0.8 0 0.0
12 3 8.0 8.0 0.3 0 0.0
13 3 5.3 0.0 0.0 1 116.0
14 3 0.0 0.0 0.0 1 116.0
19 5 1.0 1.0 1.5 0 0.0
20 5 1.5 1.5 1.7 0 0.0
21 5 2.0 2.0 7.9 1 116.0
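A related idiom, shown here only as a sketch (it assumes, as above, that E holds nothing but 0s and 1s): groupby.filter keeps whole groups that satisfy a predicate, which reads naturally even though it is usually slower than the transform-based mask.
# keep only the groups of A whose E column contains both values
df1 = df.groupby('A').filter(lambda g: g['E'].nunique() == 2)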

Find nearest neighbors

I have a large dataframe of the form:
user_id time_interval A B C D E F G H ... Z
0 12166 2.0 3.0 1.0 1.0 1.0 3.0 1.0 1.0 1.0 ... 0.0
1 12167 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
2 12168 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
3 12169 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
4 12170 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
I would like to find, for each user_id, the closest neighbors within a 'radius' distance r, using columns A-Z as coordinates. The output should look like this, for example for r=0.1:
user_id neighbors
12166 [12251,12345, ...]
12167 [12168, 12169,12170, ...]
... ...
I tried looping over the user_id list with a for loop, but it takes ages.
I did something like this:
from scipy.spatial.distance import euclidean

neighbors = []
for i in range(len(dataframe)):
    user_neighbors = [dataframe["user_id"][j] for j in range(i + 1, len(dataframe))
                      if euclidean(dataframe.values[i][2:], dataframe.values[j][2:]) < 0.1]
    neighbors.append([dataframe["user_id"][i], user_neighbors])
and I have been waiting for hours.
Is there a pythonic way to improve this?
Here's how I've done it using the apply method.
The dummy data consists of columns A-D, with an added column for the neighbors:
print(df)
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 NaN
1 12167 0 1 4 3 3 NaN
2 12168 0 4 3 3 1 NaN
3 12169 0 2 2 3 2 NaN
4 12170 0 3 3 1 1 NaN
the custom function:
def func(row):
    r = 2.5  # the threshold
    out = df[(((df.iloc[:, 2:-1] - row[2:-1])**2).sum(axis=1)**0.5).le(r)]['user_id'].to_list()
    out.remove(row['user_id'])
    df.loc[row.name, ['neighbors']] = str(out)

df.apply(func, axis=1)
the output:
print(df):
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 [12169, 12170]
1 12167 0 1 4 3 3 [12169]
2 12168 0 4 3 3 1 [12169, 12170]
3 12169 0 2 2 3 2 [12166, 12167, 12168]
4 12170 0 3 3 1 1 [12166, 12168]
Let me know if it outperforms the for-loop approach.
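For large frames, a radius query on a KD-tree usually scales much better than a pairwise apply; here is a sketch under the assumption that every column after user_id and time_interval is a coordinate:
from scipy.spatial import cKDTree

coords = dataframe.iloc[:, 2:].to_numpy()         # columns A-Z as coordinates
tree = cKDTree(coords)
idx_lists = tree.query_ball_point(coords, r=0.1)  # indices within radius r of each point

user_ids = dataframe['user_id'].to_numpy()
neighbors = {user_ids[i]: [user_ids[j] for j in idx if j != i]
             for i, idx in enumerate(idx_lists)}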

Sum of dataframe columns into another dataframe column gives NaN (Python)

I want to sum the rows and columns of two dataframes (pdf and wdf) and save the results in columns of another dataframe (to_hex).
It worked for one dataframe but not for the other (it gives NaN), and I cannot understand what the difference is.
to_hex = pd.DataFrame(0, index=np.arange(len(sasiedztwo)), columns=['ID','podroze','p_rozmyte'])
to_hex.loc[:,'ID']= wdf.index+1
to_hex.index=pdf.index
to_hex.loc[:,'podroze']= pd.DataFrame(pdf.sum(axis=0))[:]
to_hex.index=wdf.index
to_hex.loc[:,'p_rozmyte']= pd.DataFrame(wdf.sum(axis=0))[:]
This is what the pdf dataframe looks like:
0 1 2 3 4 5 6 7 8
0 0 0 10 0 0 0 0 0 100
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 1000
8 0 0 0 0 0 0 0 0 0
This is wdf:
0 1 2 3 4 5 6 7 8
0 2.5 5.0 35.0 0.0 27.5 55.0 25.0 50.0 102.5
1 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
2 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 25.0
3 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
4 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 250.0
6 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
7 0.0 0.0 250.0 0.0 250.0 500.0 250.0 500.0 1000.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 500.0
And this is the result in to_hex:
ID podroze p_rozmyte
0 1 0 NaN
1 2 0 NaN
2 3 10 NaN
3 4 0 NaN
4 5 0 NaN
5 6 0 NaN
6 7 0 NaN
7 8 0 NaN
8 9 1100 NaN
SOLUTION:
One option to solve it is to modify your code as follows:
to_hex.loc[:,'ID']= wdf.index+1
# to_hex.index=pdf.index # no need
to_hex.loc[:,'podroze']= pdf.sum(axis=0) # modified; directly use the series output from SUM()
# to_hex.index=wdf.index # no need
to_hex.loc[:,'p_rozmyte']= wdf.sum(axis=0) # modified
Then you get:
ID podroze p_rozmyte
0 1 0 2.5
1 2 0 5.0
2 3 10 302.5
3 4 0 0.0
4 5 0 277.5
5 6 0 555.0
6 7 0 275.0
7 8 0 550.0
8 9 1100 3527.5
I think the reason that you get NaN for one case and correct values for the other case lies in to_hex.dtypes:
ID int64
podroze int64
p_rozmyte int64
dtype: object
As you can see, the to_hex dataframe has int64 columns. This is fine when you assign the pdf sums (since they have the same dtype):
pd.DataFrame(pdf.sum(axis=0))[:].dtypes
0 int64
dtype: object
but it does not work when you assign the wdf sums:
pd.DataFrame(wdf.sum(axis=0))[:].dtypes
0 float64
dtype: object
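As an alternative sketch (not part of the original answer, and assuming pdf and wdf are the frames shown above), you can sidestep the dtype issue entirely by building to_hex from the sums in one go, letting pandas infer a float column where needed:
import pandas as pd

to_hex = pd.DataFrame({'ID': wdf.index + 1,
                       'podroze': pdf.sum(axis=0).to_numpy(),
                       'p_rozmyte': wdf.sum(axis=0).to_numpy()})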

Pandas, create new columns based on existing with repeated count

It's a bit complicated to explain, so I'll do my best. I have a pandas DataFrame with two columns: hour (from 1 to 24) and value (corresponding to each hour). The index is huge, but the hour column keeps repeating on a 24-hour basis (from 1 to 24). I am trying to create 24 new columns: value -1, value -2, value -3 ... value -24, where each row holds the value from 1 hour earlier, 2 hours earlier, and so on (taken from the rows above).
hour | value | value -1 | value -2 | value -3| ... | value - 24
1 10 0 0 0 0
2 11 10 0 0 0
3 12 11 10 0 0
4 13 12 11 10 0
...
24 32 31 30 29 0
1 33 32 31 30 10
2 34 33 32 31 11
and so on...
All the value numbers are just for the example. As I said, there are lots of rows: not only 24 rows for the hours of a single day, but the following days' series from 1 to 24 as well, and so on.
Thanks in advance and may the force be with you!
Is this what you need?
df = pd.DataFrame([[1,10],[2,11],
                   [3,12],[4,13]], columns=['hour','value'])

for i in range(1, 24):
    df['value -' + str(i)] = df['value'].shift(i).fillna(0)
Is this what you are looking for?
import pandas as pd

df = pd.DataFrame({'hour': list(range(24))*2,
                   'value': list(range(48))})

shift_cols_n = 10
for shift in range(1, shift_cols_n):
    new_columns_name = 'value - ' + str(shift)

    # Assuming that you don't have any NAs in your dataframe
    df[new_columns_name] = df['value'].shift(shift).fillna(0)

    # A safer (and less simple) way, in case you have NAs in your dataframe
    df[new_columns_name] = df['value'].shift(shift)
    df.loc[:shift - 1, new_columns_name] = 0  # the first `shift` rows have no earlier value
print(df.head(9))
hour value value - 1 value - 2 value - 3 value - 4 value - 5 \
0 0 0 0.0 0.0 0.0 0.0 0.0
1 1 1 0.0 0.0 0.0 0.0 0.0
2 2 2 1.0 0.0 0.0 0.0 0.0
3 3 3 2.0 1.0 0.0 0.0 0.0
4 4 4 3.0 2.0 1.0 0.0 0.0
5 5 5 4.0 3.0 2.0 1.0 0.0
6 6 6 5.0 4.0 3.0 2.0 1.0
7 7 7 6.0 5.0 4.0 3.0 2.0
8 8 8 7.0 6.0 5.0 4.0 3.0
value - 6 value - 7 value - 8 value - 9
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
7 1.0 0.0 0.0 0.0
8 2.0 1.0 0.0 0.0
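As a sketch of an alternative to the loop in either answer: building all the lag columns first and attaching them with a single pd.concat avoids inserting columns into the frame one at a time (column names assumed to match the question):
import pandas as pd

lags = pd.concat({'value - ' + str(i): df['value'].shift(i).fillna(0) for i in range(1, 25)},
                 axis=1)
df = pd.concat([df, lags], axis=1)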
