I would like to make a table of frequency and percent by container, class, and score.
import pandas as pd

df = pd.read_csv('https://drive.google.com/file/d/1pL8fHCc25-XRBYgj9n6NdRt5VHrIr-p1/view?usp=sharing', sep=',')
df.groupby([ 'Containe', 'Class']).count()
The output should be a table of frequency and percent by container, class, and score, but that script does not work!
First, we stack the values so that there is one score per row:
>>> df1 = (df.set_index(["Containe", "Class"])
... .stack()
... .reset_index(name='Score')
... .rename(columns={'level_2':'letters'}))
Then, we use a groupby to get the size of each combination of values, like so:
>>> df_grouped = df1.groupby(['Containe', 'Class', 'letters', 'Score'], as_index=False).size()
Finally, we use the pivot_table method to get the expected result:
>>> pd.pivot_table(df_grouped, values='size', index=['letters', 'Class', 'Containe'], columns=['Score']).fillna(0)
Score 0 1 2
letters Class Containe
AB A 1 2.0 1.0 1.0
2 1.0 2.0 1.0
B 3 2.0 1.0 1.0
4 1.0 2.0 1.0
AC A 1 0.0 2.0 2.0
2 1.0 2.0 1.0
B 3 1.0 2.0 1.0
4 2.0 2.0 0.0
AD A 1 2.0 0.0 2.0
2 1.0 3.0 0.0
B 3 2.0 1.0 1.0
4 1.0 1.0 2.0
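The question also asks for percentages; a minimal sketch of deriving them from the pivoted counts (assuming row-wise percentages per letters/Class/Containe combination are wanted):
>>> piv = pd.pivot_table(df_grouped, values='size',
...                      index=['letters', 'Class', 'Containe'],
...                      columns=['Score']).fillna(0)
>>> piv.div(piv.sum(axis=1), axis=0).mul(100).round(1)  # each row now sums to 100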
I have to create a timeseries using column values for computing the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute this dataframe:
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
In which there's a 0 if the customer has bought something in that month and a 1 otherwise. The column names indicate a time period, with the column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
Just use the apply function to iterate over the rows of the dataframe and build the recency values cumulatively:
def apply_function(row):
    out = []
    for i, item in enumerate(row):
        # keep the first column as-is; reset to 0 on a purchase; otherwise previous recency + 1
        out.append(item if i == 0 else 0 if item == 0 else out[-1] + 1)
    return out
new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # just to restore the previous column names
Do you insist on using the column structure? It is common with time series to use rows, e.g., a dataframe with columns CustomerID, hasBoughtThisMonth. You can then easily add the Recency column by using a pandas transform().
I cannot yet place comments hence the question in this way.
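A minimal sketch of that row-based layout built from the questioner's wide frame (the reshaping and the recency helper below are my assumptions, not code from the question; the wide frame shown there is assumed to be called df):
# melt the month columns into rows; the dummy column may be labelled -1 or '-1',
# and the month labels are assumed to be integers so that sorting is chronological
df_long = (df.drop(columns=[-1])
             .melt(id_vars='CustomerID', var_name='Month', value_name='notBought')
             .sort_values(['CustomerID', 'Month']))

def recency(s):
    # s holds one customer's monthly flags in month order (0 = bought, 1 = not bought)
    bought = s.eq(0)
    return s.groupby(bought.cumsum()).cumsum()

df_long['Recency'] = df_long.groupby('CustomerID')['notBought'].transform(recency)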
Edit: here is another way to go about it. I took two customers as an example, with some random indicators of whether or not they bought something in each month.
Basically, you keep the data in long format and use a groupby plus cumsum to get your result. Notice that this avoids your dummy column.
import pandas as pd
import numpy as np
np.random.seed(1)
# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})

# Make Recency column by finding contiguous groups of ones, and groupby
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = df.groupby(by=['CustomerID', contiguous_groups],
                           as_index=False)['hasBoughtThisMonth'].cumsum().reset_index(drop=True)
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
It would be easier if you first set CustomerID as index and transpose your dataframe.
Then apply your custom function, i.e. something like:
df.T.apply(custom_func)
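For completeness, a sketch of what such a custom function could look like (the name recency and setting CustomerID as the index are my assumptions):
def recency(col):
    # after the transpose, col is one customer's monthly flags (0 = bought, 1 = not bought)
    out, prev = [], 0
    for flag in col:
        prev = 0 if flag == 0 else prev + 1
        out.append(prev)
    return pd.Series(out, index=col.index)

result = df.set_index('CustomerID').T.apply(recency).T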
Hey all, I want to change rows based on a condition on a column:
where column "type" == "A", I want the columns col1 to col5 to become 1 if the value is bigger than 2,
else I want the value to be 0.
The data:
import numpy as np
import pandas as pd

data = {"col1": [np.nan, 3, 4, 5, 9, 2, 6],
        "col2": [4, 2, 4, 6, 0, 1, 5],
        "col3": [7, 6, 0, 11, 3, 6, 7],
        "col4": [14, 11, 22, 8, 6, np.nan, 9],
        "col5": [0, 5, 7, 3, 8, 2, 9],
        "type": ["A", "A", "C", "A", "B", "A", "E"],
        "number": ["one", "two", "two", "one", "one", "two", "two"]}
df=pd.DataFrame.from_dict(data)
df
How I expect the data to look:
data = {"col1": [0, 1, 4, 1, 9, 0, 6],
        "col2": [1, 0, 4, 1, 0, 0, 5],
        "col3": [1, 1, 0, 1, 3, 1, 7],
        "col4": [1, 1, 22, 1, 6, 0, 9],
        "col5": [0, 1, 7, 1, 1, 0, 9],
        "type": ["A", "A", "C", "A", "B", "A", "E"],
        "number": ["one", "two", "two", "one", "one", "two", "two"]}
df=pd.DataFrame.from_dict(data)
df
You can use df.query to get all type A rows, then df._get_numeric_data (or df.select_dtypes('number')) to get the numeric fields, then compare with df.gt and cast the mask to int using df.astype, and finally write the new values back into the DataFrame with df.update:
df.update(df.query('type == "A"')._get_numeric_data().gt(2).astype(int))
# alternatively: .select_dtypes('number') instead of ._get_numeric_data()
df
col1 col2 col3 col4 col5 type number
0 0.0 1.0 1.0 1.0 0.0 A one
1 1.0 0.0 1.0 1.0 1.0 A two
2 4.0 4.0 0.0 22.0 7.0 C two
3 1.0 1.0 1.0 1.0 1.0 A one
4 9.0 0.0 3.0 6.0 8.0 B one
5 0.0 0.0 1.0 0.0 0.0 A two
6 6.0 5.0 7.0 9.0 9.0 E two
Use DataFrame.loc to select the rows where type equals A and the columns between the first and last column name, then compare for greater than 2 with DataFrame.gt; converting the boolean mask to integers maps True/False to 1/0, and the result is written back with DataFrame.update:
df.update(df.loc[df['type'].eq('A'), 'col1':'col5'].gt(2).astype(int))
print (df)
col1 col2 col3 col4 col5 type number
0 0.0 1.0 1.0 1.0 0.0 A one
1 1.0 0.0 1.0 1.0 1.0 A two
2 4.0 4.0 0.0 22.0 7.0 C two
3 1.0 1.0 1.0 1.0 1.0 A one
4 9.0 0.0 3.0 6.0 8.0 B one
5 0.0 0.0 1.0 0.0 0.0 A two
6 6.0 5.0 7.0 9.0 9.0 E two
Or by assigning back:
m = df['type'].eq('A')
df.loc[m, 'col1':'col5'] = df.loc[m, 'col1':'col5'].gt(2).astype(int)
print (df)
col1 col2 col3 col4 col5 type number
0 0.0 1 1 1.0 0 A one
1 1.0 0 1 1.0 1 A two
2 4.0 4 0 22.0 7 C two
3 1.0 1 1 1.0 1 A one
4 9.0 0 3 6.0 8 B one
5 0.0 0 1 0.0 0 A two
6 6.0 5 7 9.0 9 E two
I have a Pandas (0.23.4) DataFrame with several categorical columns.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([True, False, np.nan], (6, 4)), columns=['a', 'b', 'c', 'd'])
a b c d
0 NaN 1.0 NaN NaN
1 NaN 1.0 NaN 0.0
2 1.0 NaN 1.0 NaN
3 0.0 NaN 0.0 1.0
4 NaN 1.0 NaN NaN
5 NaN 1.0 0.0 1.0
I have two sets of columns of interest:
cross_cols = ['a', 'b']
type_cols = ['c', 'd']
I would like to get a cross tab of counts of each cross_col variable with each type_col variable (a with c and d, and b with c and d), excluding NaN, all displayed side-by-side. The desired result is:
c d
0.0 1.0 All 0.0 1.0 All
a 0.0 0 0 0 1 1 2
1.0 2 1 3 1 0 1
All 2 1 3 2 1 3
b 0.0 0 0 0 0 1 1
1.0 2 1 3 2 0 2
All 2 1 3 2 1 3
Notice that I am not interested in counts for different combinations of a and b or of c and d, which is what I'm getting by changing the index and columns parameters of pd.crosstab.
Currently I'm using the following code:
cross_rows = []
for col in cross_cols:
    cross_rows.append(pd.concat([pd.crosstab(df[col], df[type_var], margins=True) for type_var in type_cols],
                                axis=1, keys=type_cols, sort=True))
results = pd.concat(cross_rows, keys=cross_cols, sort=True)
It gives the following result:
c d
c 0.0 1.0 All 0.0 1.0 All
a 1.0 2.0 1.0 3.0 1 0 1
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 1 1 2
b 1.0 2.0 1.0 3.0 2 0 2
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 0 1 1
The result is fine, but the code is slow and a bit ugly. I suspect that there's a faster and more Pythonic approach. Is there a single function call that would get the job done, or another faster solution?
I have a big dataframe with many columns (like 1000). I have a list of columns (generated by a script ~10). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
    df[list_of_cols[0]].notnull() |
    df[list_of_cols[1]].notnull() |
    ...
    df[list_of_cols[6]].notnull()
]
I can also iterate over the list of cols and create a mask which I would then apply to df, but this looks too tedious. Knowing how powerful pandas is with respect to dealing with NaN, I would expect that there is a much easier way to achieve what I want.
Use the thresh parameter in the dropna() method. By setting thresh=1, you specify that a row is kept if it has at least one non-null item.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
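Note that this returns only the columns in list_of_cols; if you want the full rows of the original dataframe (which is how I read the question), you can reuse the surviving index:
df.loc[df[list_of_cols].dropna(thresh=1).index].head()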
Starting with this:
import numpy as np
import pandas as pd

data = {'a': [np.nan, 0, 0, 0, 0, 0, np.nan, 0, 0, 0, 0, 0, 9, 9],
        'b': [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 7],
        'c': [np.nan, np.nan, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1],
        'd': [np.nan, np.nan, 7, 9, 6, 9, 7, np.nan, 6, 6, 7, 6, 9, 6]}
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are null (this removes row index 0):
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
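The question restricts the check to a given list of columns; the same idea works on a subset (here list_of_cols is assumed to name some of the columns, e.g. ['a', 'b']):
df[~df[list_of_cols].isnull().all(axis=1)]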
One can use boolean indexing:
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean array whose negation is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask for the dataframe df[list_of_cols] with True values for the null elements in df[list_of_cols], False otherwise
all() returns True if all of the elements are True (row-wise axis=1)
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df = pd.DataFrame({'A': [11, 22, 33, np.NaN],
...                    'B': ['x', np.NaN, np.NaN, 'w'],
...                    'C': ['2016-03-13', np.NaN, '2016-03-14', '2016-03-15']})
>>> list_of_cols = ['B', 'C']
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
Negated isnull mask:
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1) row-wise:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
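An equivalent spelling of the same mask uses notnull().any(axis=1), which reads directly as "keep rows with at least one non-null value in the listed columns" and produces the same selection:
>>> df[df[list_of_cols].notnull().any(axis=1)]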