pandas cut multiple columns - python

I am looking to apply a bin across a number of columns.
a = [1, 2, 9, 1, 5, 3]
b = [9, 8, 7, 8, 9, 1]
c = [a, b]
print(pd.cut(c, 3, labels=False))
which works great and creates:
[[0 0 2 0 1 0]
[2 2 2 2 2 0]]
However, i would like to apply the 'cut' to create a dataframe with number and bin it as below.
Values bin
0 1 0
1 2 0
2 9 2
3 1 0
4 5 1
5 3 0
Values bin
0 9 2
1 8 2
2 7 2
3 8 2
4 9 2
5 1 0
This is a simple example of what im looking to do. In reality i 63 separate dataframes and a & b are examples of a column from each dataframe.

Use zip with a list comp to build a list of dataframes -
c = [a, b]
r = pd.cut(c, 3, labels=False)
df_list = [pd.DataFrame({'Values' : v, 'Labels' : l}) for v, l in zip(c, r)]
df_list
[ Labels Values
0 0 1
1 0 2
2 2 9
3 0 1
4 1 5
5 0 3, Labels Values
0 2 9
1 2 8
2 2 7
3 2 8
4 2 9
5 0 1]

Related

df.loc behavior when assigning a dict to a column

Lets say we have a df like below:
df = pd.DataFrame({'A': [3, 9, 3, 4], 'B': [7, 1, 6, 0], 'C': [9, 0, 3, 4], 'D': [1, 8, 0, 0]})
Starting df:
A B C D
0 3 7 9 1
1 9 1 0 8
2 3 6 3 0
3 4 0 4 0
If we wanted to assign new values to column A, I would expect the following to work:
d = {0:10,1:20,2:30,3:40}
df.loc[:,'A'] = d
Output:
A B C D
0 0 7 9 1
1 1 1 0 8
2 2 6 3 0
3 3 0 4 0
The values that are assigned instead are the keys of the dictionary.
If however, instead of assigning the dictionary to an existing column, if we create a new column, we will get the same result the first time we run it, but running the same code again will get the expected result. We then are able to select any column and it will output the expected output.
First time running df.loc[:,'E'] = {0:10,1:20,2:30,3:40}
Output:
A B C D E
0 0 7 9 1 0
1 1 1 0 8 1
2 2 6 3 0 2
3 3 0 4 0 3
Second time running df.loc[:,'E'] = {0:10,1:20,2:30,3:40}
A B C D E
0 0 7 9 1 10
1 1 1 0 8 20
2 2 6 3 0 30
3 3 0 4 0 40
Then if we run the same code as we did at first, we get a different result:
df.loc[:,'A'] = {0:10,1:20,2:30,3:40}
Output:
A B C D E
0 10 7 9 1 10
1 20 1 0 8 20
2 30 6 3 0 30
3 40 0 4 0 40
Is this the intended behavior? (I am running pandas version 1.4.2)

How to count the number of occurrences of semi-duplicate rows and make the count a new column

I have a pandas dataframe as follows:
df = pd.DataFrame({'A':[4, 4, 1, 5, 1, 1],
'B':[2, 2, 2, 5, 2, 2],
'C':[1, 1, 3, 5, 3, 3],
'D':['q', 'e', 'r', 'y', 'u',' w']})
which looks like
A B C D
0 4 2 1 q
1 4 2 1 e
2 1 2 3 r
3 5 5 5 y
4 1 2 3 u
5 1 2 3 w
I would like to add a new column that is the count of duplicate rows, with respect to only the columns A, B, and C. This would look like
A B C D Count
0 4 2 1 q 2
1 4 2 1 e 2
2 1 2 3 r 3
3 5 5 5 y 1
4 1 2 3 u 3
5 1 2 3 w 3
I'm guessing this will be something like df.groupby(['A','B','C']).size() but I am unsure how to map the values back to the new 'Count' column. Thanks!
We can do transform
df['Count'] = df.groupby(['A','B','C']).D.transform('count')
df['Count']
0 2
1 2
2 3
3 1
4 3
5 3
Name: Count, dtype: int64

How do you add an array to each previous row in pandas?

If I have an array [1, 2, 3, 4, 5] and a Pandas Dataframe
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the Pandas DataFrame adding my array to each previous row?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row, which you can create using np.arange(len(df))[:,None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
df = pd.DataFrame([
[1,1,1],
[0,0,0],
[0,0,0],
])
s = pd.Series([1,2,3])
# add to every row except first, then cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment:
df[1:] = (df+l).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df+l).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
After cumsum, you can shift and add back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)

DataFrame of lists to appropriate series

Suppose I have the dataframe df defined below
df = pd.DataFrame([
[[1, 2], [3, 4, 5]],
[[6], [7, 8, 9, 0]]
], list('AB'), list('XY'))
How do I get it to
A X 0 1
1 2
Y 0 3
1 4
2 5
B X 0 6
Y 0 7
1 8
2 9
3 0
dtype: int64
What I have tried
I started by admonishing the person who did this. That did not work.
Calling stack a couple times and applying pd.Series:
df.stack().apply(pd.Series).stack().astype(int)
The resulting output:
A X 0 1
1 2
Y 0 3
1 4
2 5
B X 0 6
Y 0 7
1 8
2 9
3 0
dtype: int32

Pandas dataframe: how to group by values in a column and create new columns out of grouped values

I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the values in column x are repeated all N times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on the values in x being repeated in order N times,
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it
is less robust since it relies on the values in y appearing in the right order.

Categories

Resources