I would like to make a table of frequency and percent by container, class, and score.
import pandas as pd

df = pd.read_csv('https://drive.google.com/file/d/1pL8fHCc25-XRBYgj9n6NdRt5VHrIr-p1/view?usp=sharing', sep=',')
df.groupby([ 'Containe', 'Class']).count()
The output should be a table of frequency and percent by container, class, and score, but that script does not work!
First, we stack the values so that there is one score per row:
>>> df1 = (df.set_index(["Containe", "Class"])
... .stack()
... .reset_index(name='Score')
... .rename(columns={'level_2':'letters'}))
Then, we use a groupby to get the size of each combination of values, like so:
>>> df_grouped = df1.groupby(['Containe', 'Class', 'letters', 'Score'], as_index=False).size()
Finally, we use the pivot_table method to get the expected result:
>>> pd.pivot_table(df_grouped, values='size', index=['letters', 'Class', 'Containe'], columns=['Score']).fillna(0)
Score 0 1 2
letters Class Containe
AB A 1 2.0 1.0 1.0
2 1.0 2.0 1.0
B 3 2.0 1.0 1.0
4 1.0 2.0 1.0
AC A 1 0.0 2.0 2.0
2 1.0 2.0 1.0
B 3 1.0 2.0 1.0
4 2.0 2.0 0.0
AD A 1 2.0 0.0 2.0
2 1.0 3.0 0.0
B 3 2.0 1.0 1.0
4 1.0 1.0 2.0
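The question also asks for percentages; a minimal sketch of deriving them from the pivoted counts (assuming row-wise percentages per letters/Class/Containe combination are wanted):
>>> piv = pd.pivot_table(df_grouped, values='size',
...                      index=['letters', 'Class', 'Containe'],
...                      columns=['Score']).fillna(0)
>>> piv.div(piv.sum(axis=1), axis=0).mul(100).round(1)  # each row now sums to 100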
I have to create a timeseries using column values for computing the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute this dataframe:
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
In which there's a 0 if the customer has bought something in that month and a 1 otherwise. The column names indicate a time period, with the column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
Just use the apply function to iterate over the rows of the dataframe and build the recency values cumulatively:
def apply_function(row):
    out = []
    for i, item in enumerate(row):
        # keep the first column as-is; reset to 0 on a purchase; otherwise previous recency + 1
        out.append(item if i == 0 else 0 if item == 0 else out[-1] + 1)
    return out
new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # just to restore the previous column names
Do you insist on using the column structure? It is common with time series to use rows, e.g., a dataframe with columns CustomerID, hasBoughtThisMonth. You can then easily add the Recency column by using a pandas transform().
I cannot yet place comments hence the question in this way.
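A minimal sketch of that row-based layout built from the questioner's wide frame (the reshaping and the recency helper below are my assumptions, not code from the question; the wide frame shown there is assumed to be called df):
# melt the month columns into rows; the dummy column may be labelled -1 or '-1',
# and the month labels are assumed to be integers so that sorting is chronological
df_long = (df.drop(columns=[-1])
             .melt(id_vars='CustomerID', var_name='Month', value_name='notBought')
             .sort_values(['CustomerID', 'Month']))

def recency(s):
    # s holds one customer's monthly flags in month order (0 = bought, 1 = not bought)
    bought = s.eq(0)
    return s.groupby(bought.cumsum()).cumsum()

df_long['Recency'] = df_long.groupby('CustomerID')['notBought'].transform(recency)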
Edit: here is another way to go about it. I took two customers as an example, with some random indicators of whether or not they bought something in each month.
Basically, you keep the data in long format and use a groupby plus cumsum to get your result. Notice that this avoids your dummy column.
import pandas as pd
import numpy as np
np.random.seed(1)
# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})

# Make Recency column by finding contiguous groups of ones, and groupby
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = df.groupby(by=['CustomerID', contiguous_groups],
                           as_index=False)['hasBoughtThisMonth'].cumsum().reset_index(drop=True)
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
It would be easier if you first set CustomerID as index and transpose your dataframe.
Then apply your custom function, i.e. something like:
df.T.apply(custom_func)
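For completeness, a sketch of what such a custom function could look like (the name recency and setting CustomerID as the index are my assumptions):
def recency(col):
    # after the transpose, col is one customer's monthly flags (0 = bought, 1 = not bought)
    out, prev = [], 0
    for flag in col:
        prev = 0 if flag == 0 else prev + 1
        out.append(prev)
    return pd.Series(out, index=col.index)

result = df.set_index('CustomerID').T.apply(recency).T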
Hey all, I want to change rows based on a condition on a column:
where column "type" == "A", I want the columns col1 to col5 to become 1 if the value is bigger than 2,
else I want the value to be 0.
The data:
import numpy as np
import pandas as pd

data = {"col1": [np.nan, 3, 4, 5, 9, 2, 6],
        "col2": [4, 2, 4, 6, 0, 1, 5],
        "col3": [7, 6, 0, 11, 3, 6, 7],
        "col4": [14, 11, 22, 8, 6, np.nan, 9],
        "col5": [0, 5, 7, 3, 8, 2, 9],
        "type": ["A", "A", "C", "A", "B", "A", "E"],
        "number": ["one", "two", "two", "one", "one", "two", "two"]}
df=pd.DataFrame.from_dict(data)
df
How I expect the data to look:
data = {"col1": [0, 1, 4, 1, 9, 0, 6],
        "col2": [1, 0, 4, 1, 0, 0, 5],
        "col3": [1, 1, 0, 1, 3, 1, 7],
        "col4": [1, 1, 22, 1, 6, 0, 9],
        "col5": [0, 1, 7, 1, 1, 0, 9],
        "type": ["A", "A", "C", "A", "B", "A", "E"],
        "number": ["one", "two", "two", "one", "one", "two", "two"]}
df=pd.DataFrame.from_dict(data)
df
You can use df.query to get all type A rows, then df._get_numeric_data (or df.select_dtypes('number')) to get the numeric fields, then compare with df.gt and cast the mask to int using df.astype, and finally write the new values back into the DataFrame with df.update:
df.update(df.query('type == "A"')._get_numeric_data().gt(2).astype(int))
# alternatively: .select_dtypes('number') instead of ._get_numeric_data()
df
col1 col2 col3 col4 col5 type number
0 0.0 1.0 1.0 1.0 0.0 A one
1 1.0 0.0 1.0 1.0 1.0 A two
2 4.0 4.0 0.0 22.0 7.0 C two
3 1.0 1.0 1.0 1.0 1.0 A one
4 9.0 0.0 3.0 6.0 8.0 B one
5 0.0 0.0 1.0 0.0 0.0 A two
6 6.0 5.0 7.0 9.0 9.0 E two
Use DataFrame.loc to select the rows where type equals A and the columns between the first and last column name, then compare for greater than 2 with DataFrame.gt; converting the boolean mask to integers maps True/False to 1/0, and the result is written back with DataFrame.update:
df.update(df.loc[df['type'].eq('A'), 'col1':'col5'].gt(2).astype(int))
print (df)
col1 col2 col3 col4 col5 type number
0 0.0 1.0 1.0 1.0 0.0 A one
1 1.0 0.0 1.0 1.0 1.0 A two
2 4.0 4.0 0.0 22.0 7.0 C two
3 1.0 1.0 1.0 1.0 1.0 A one
4 9.0 0.0 3.0 6.0 8.0 B one
5 0.0 0.0 1.0 0.0 0.0 A two
6 6.0 5.0 7.0 9.0 9.0 E two
Or by assigning back:
m = df['type'].eq('A')
df.loc[m, 'col1':'col5'] = df.loc[m, 'col1':'col5'].gt(2).astype(int)
print (df)
col1 col2 col3 col4 col5 type number
0 0.0 1 1 1.0 0 A one
1 1.0 0 1 1.0 1 A two
2 4.0 4 0 22.0 7 C two
3 1.0 1 1 1.0 1 A one
4 9.0 0 3 6.0 8 B one
5 0.0 0 1 0.0 0 A two
6 6.0 5 7 9.0 9 E two
I have a Pandas (0.23.4) DataFrame with several categorical columns.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([True, False, np.nan], (6, 4)), columns=['a', 'b', 'c', 'd'])
a b c d
0 NaN 1.0 NaN NaN
1 NaN 1.0 NaN 0.0
2 1.0 NaN 1.0 NaN
3 0.0 NaN 0.0 1.0
4 NaN 1.0 NaN NaN
5 NaN 1.0 0.0 1.0
I have two sets of columns of interest:
cross_cols = ['a', 'b']
type_cols = ['c', 'd']
I would like to get a cross tab of counts of each cross_col variable with each type_col variable (a with c and d, and b with c and d), excluding NaN, all displayed side-by-side. The desired result is:
c d
0.0 1.0 All 0.0 1.0 All
a 0.0 0 0 0 1 1 2
1.0 2 1 3 1 0 1
All 2 1 3 2 1 3
b 0.0 0 0 0 0 1 1
1.0 2 1 3 2 0 2
All 2 1 3 2 1 3
Notice that I am not interested in counts for different combinations of a and b or of c and d, which is what I'm getting by changing the index and columns parameters of pd.crosstab.
Currently I'm using the following code:
cross_rows = []
for col in cross_cols:
    cross_rows.append(pd.concat([pd.crosstab(df[col], df[type_var], margins=True) for type_var in type_cols],
                                axis=1, keys=type_cols, sort=True))
results = pd.concat(cross_rows, keys=cross_cols, sort=True)
It gives the following result:
c d
c 0.0 1.0 All 0.0 1.0 All
a 1.0 2.0 1.0 3.0 1 0 1
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 1 1 2
b 1.0 2.0 1.0 3.0 2 0 2
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 0 1 1
The result is fine, but the code is slow and a bit ugly. I suspect that there's a faster and more Pythonic approach. Is there a single function call that would get the job done, or another faster solution?
I have a big dataframe with many columns (like 1000). I have a list of columns (generated by a script ~10). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
    df[list_of_cols[0]].notnull() |
    df[list_of_cols[1]].notnull() |
    ...
    df[list_of_cols[6]].notnull()
]
I can also iterate over the list of cols and create a mask which I would then apply to df, but this looks too tedious. Knowing how powerful pandas is with respect to dealing with NaN, I would expect that there is a much easier way to achieve what I want.
Use the thresh parameter in the dropna() method. By setting thresh=1, you specify that a row is kept if it has at least one non-null item.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
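Note that this returns only the columns in list_of_cols; if you want the full rows of the original dataframe (which is how I read the question), you can reuse the surviving index:
df.loc[df[list_of_cols].dropna(thresh=1).index].head()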
Starting with this:
import numpy as np
import pandas as pd

data = {'a': [np.nan, 0, 0, 0, 0, 0, np.nan, 0, 0, 0, 0, 0, 9, 9],
        'b': [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 7],
        'c': [np.nan, np.nan, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1],
        'd': [np.nan, np.nan, 7, 9, 6, 9, 7, np.nan, 6, 6, 7, 6, 9, 6]}
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are null (this removes row index 0):
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
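The question restricts the check to a given list of columns; the same idea works on a subset (here list_of_cols is assumed to name some of the columns, e.g. ['a', 'b']):
df[~df[list_of_cols].isnull().all(axis=1)]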
One can use boolean indexing:
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean array whose negation is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask for the dataframe df[list_of_cols] with True values for the null elements in df[list_of_cols], False otherwise
all() returns True if all of the elements are True (row-wise axis=1)
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df = pd.DataFrame({'A': [11, 22, 33, np.NaN],
...                    'B': ['x', np.NaN, np.NaN, 'w'],
...                    'C': ['2016-03-13', np.NaN, '2016-03-14', '2016-03-15']})
>>> list_of_cols = ['B', 'C']
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
Negated isnull mask:
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1) row-wise:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
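An equivalent spelling of the same mask uses notnull().any(axis=1), which reads directly as "keep rows with at least one non-null value in the listed columns" and produces the same selection:
>>> df[df[list_of_cols].notnull().any(axis=1)]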