Aggregating and plotting multiple columns using matplotlib - python

I've got data in a pandas dataframe that looks like this:
ID A B C D
100 0 1 0 1
101 1 1 0 1
102 0 0 0 1
...
The idea is to create a barchart plot that shows the total of each (sum of the total number of A's, B's, etc.). Something like:
X
X X
x X X
A B C D
This should be so simple...

Set 'ID' aside, sum, and plot.bar:
df.set_index('ID').sum().plot.bar()
# or
df.drop(columns=['ID']).sum().plot.bar()
Output: [bar chart of the column totals]
Just for fun, a text version:
print(df.drop(columns='ID')
        .replace({0: ' ', 1: 'X'})
        .apply(sorted, reverse=True)
        .to_string(index=False)
      )
Output:
A B C D
X X   X
  X   X
      X
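As a minimal, self-contained sketch of the first approach: the frame below is rebuilt from the three sample rows shown in the question (an assumption, since the real data has more rows), and the name totals is only illustrative. The plotting call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# Sample rows copied from the question; the real frame is assumed to be longer.
df = pd.DataFrame({
    "ID": [100, 101, 102],
    "A": [0, 1, 0],
    "B": [1, 1, 0],
    "C": [0, 0, 0],
    "D": [1, 1, 1],
})

# Drop the identifier column, then sum each remaining column.
totals = df.drop(columns="ID").sum()
print(totals)

# With matplotlib installed, this draws one bar per column:
# totals.plot.bar()
```

The sum is computed first and plotted second, so the intermediate totals Series can be inspected before anything is drawn.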

Related

Pandas dataframe: Creating a new column based on data from other columns

I have a pandas dataframe, df:
foo bar
0 Supplies Sample X
1 xyz A
2 xyz B
3 Supplies Sample Y
4 xyz C
5 Supplies Sample Z
6 xyz D
7 xyz E
8 xyz F
I want to create a new df that looks something like this:
bar
0 Sample X - A
1 Sample X - B
2 Sample Y - C
3 Sample Z - D
4 Sample Z - E
5 Sample Z - F
I am new to Pandas so I don't know how to achieve this. Could someone please help?
I tried DataFrame.iterrows, but had no luck.
You can use boolean indexing and ffill:
m = df['foo'].ne('Supplies')
out = (df['bar'].mask(m).ffill()[m]
       .add(' - '+df.loc[m, 'bar'])
       .to_frame().reset_index(drop=True)
      )
Output:
bar
0 Sample X - A
1 Sample X - B
2 Sample Y - C
3 Sample Z - D
4 Sample Z - E
5 Sample Z - F
You can do:
s = (df["bar"].mask(df.foo == "xyz").ffill() + "-" + df["bar"]).reindex(
    df.loc[df.foo == "xyz"].index
)
df = s.to_frame()
print(df)
Output:
bar
1 Sample X-A
2 Sample X-B
4 Sample Y-C
6 Sample Z-D
7 Sample Z-E
8 Sample Z-F
Another possible solution:
g = df.groupby(np.cumsum(df.bar.str.startswith('Sample')))
pd.DataFrame([x[1].bar.values[0] + ' - ' + y
              for x in g for y in x[1].bar.values[1:]], columns=['bar'])
Output:
bar
0 Sample X - A
1 Sample X - B
2 Sample Y - C
3 Sample Z - D
4 Sample Z - E
5 Sample Z - F
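The forward-fill idea behind the first two answers can be spelled out step by step. This is only a sketch on the question's sample data; the names is_item and header are illustrative and do not come from the answers.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "foo": ["Supplies", "xyz", "xyz", "Supplies", "xyz",
            "Supplies", "xyz", "xyz", "xyz"],
    "bar": ["Sample X", "A", "B", "Sample Y", "C",
            "Sample Z", "D", "E", "F"],
})

# True for the item rows, False for the "Supplies" header rows.
is_item = df["foo"].ne("Supplies")

# Keep only the header labels, then carry each one down to the rows below it.
header = df["bar"].where(~is_item).ffill()

# Combine the carried-down header with the item value on item rows only.
out = (header[is_item] + " - " + df.loc[is_item, "bar"]).to_frame("bar")
out = out.reset_index(drop=True)
print(out)
```

Splitting the chain into named steps makes it easier to see that ffill only works because the header rows come before their items.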

calculating the percentage of count in pandas groupby

I want to discover the underlying pattern between my features and the target, so I tried groupby. Instead of the raw count, though, I want the ratio or percentage of each count relative to the total count of its class.
The following code is similar to what I have done:
fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])
You can achieve this more simply with:
out = df.groupby('class').value_counts(normalize=True).mul(100)
Output:
class fet1 fet2
0 A Y 13.859275
B Y 12.366738
X 12.153518
C X 11.513859
Y 10.660981
B Z 10.447761
A Z 10.021322
C Z 9.594883
A X 9.381663
1 A Y 14.124294
C Z 13.935970
B Z 11.676083
Y 11.111111
C Y 11.111111
X 11.111111
A X 10.169492
B X 9.416196
A Z 7.344633
dtype: float64
If you want the same order of multiindex:
out = (df
       .groupby('class').value_counts(normalize=True).mul(100)
       .reorder_levels(['fet1', 'fet2', 'class']).sort_index()
      )
Output:
fet1 fet2 class
A X 0 9.381663
1 10.169492
Y 0 13.859275
1 14.124294
Z 0 10.021322
1 7.344633
B X 0 12.153518
1 9.416196
Y 0 12.366738
1 11.111111
Z 0 10.447761
1 11.676083
C X 0 11.513859
1 11.111111
Y 0 10.660981
1 11.111111
Z 0 9.594883
1 13.935970
dtype: float64
I achieved it by doing this
fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])/df.groupby(['class'])['class'].agg(['count'])*100
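The same per-class percentage can also be written as "count divided by the class total", which may be easier to check by hand. The sketch below uses a small hand-made frame instead of random data (an assumption made so the numbers are reproducible); counts and pct are illustrative names.

```python
import pandas as pd

# Small deterministic stand-in for the random data in the question.
df = pd.DataFrame({
    "fet1":  ["A", "A", "A", "B", "B", "B"],
    "fet2":  ["X", "X", "Y", "X", "X", "Y"],
    "class": ["0", "0", "0", "0", "1", "1"],
})

# Number of rows for each (fet1, fet2, class) combination.
counts = df.groupby(["fet1", "fet2", "class"]).size()

# Divide each count by the total of its class, broadcast via transform.
pct = counts / counts.groupby(level="class").transform("sum") * 100
print(pct)
```

Within each class the percentages add up to 100, which is a quick sanity check for any of the variants in this thread.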

pd.DataFrame.update puts the result at the top of the dataframe

Let's say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
I want to replace the zeroes in the z column of nf with the values from the z column of mf, but only in the rows whose x key appears in mf.
When I call
nf.update(mf)
I get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000
You need to make the indexes of both dataframes match, because update aligns rows on the index. Here is how you can do it:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)
nf = nf.reset_index()
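If you would rather not touch the index at all, one possible alternative (not part of the answer above) is to look up each x key with Series.map and keep the existing z where there is no match. A sketch on the question's data; lookup is an illustrative name.

```python
import pandas as pd

# Data copied from the question.
nf = pd.DataFrame({"x": ["a", "b", "c", "d", "e"],
                   "y": ["1", "2", "3", "4", "5"],
                   "z": ["0", "0", "0", "0", "0"]})
mf = pd.DataFrame({"x": ["b", "d", "e"],
                   "z": ["10", "100", "1000"]})

# Series mapping each key in mf to its new z value.
lookup = mf.set_index("x")["z"]

# Keys missing from mf map to NaN, so fall back to the old z for those rows.
nf["z"] = nf["x"].map(lookup).fillna(nf["z"])
print(nf)
```

This keeps the default integer index of nf, so no set_index/reset_index round trip is needed.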

How to change values in a dataframe Python

I've searched for an answer for the past 30 min, but the only solutions are either for a single column or in R. I have a dataset in which I want to change the ('Y/N') values to 1 and 0 respectively. I feel like copying and pasting the code below 17 times is very inefficient.
df.loc[df.infants == 'n', 'infants'] = 0
df.loc[df.infants == 'y', 'infants'] = 1
df.loc[df.infants == '?', 'infants'] = 1
My solution is the following. It doesn't raise an error, but the values in the dataframe don't change. I'm assuming I need to do something like df = df_new, but how do I do this?
for coln in df:
    for value in coln:
        if value == 'y':
            value = '1'
        elif value == 'n':
            value = '0'
        else:
            value = '1'
EDIT: There are 17 columns in this dataset, but there is another dataset I'm hoping to tackle which contains 56 columns.
republican n y n.1 y.1 y.2 y.3 n.2 n.3 n.4 y.4 ? y.5 y.6 y.7 n.5 y.8
0 republican n y n y y y n n n n n y y y n ?
1 democrat ? y y ? y y n n n n y n y y n n
2 democrat n y y n ? y n n n n y n y n n y
3 democrat y y y n y y n n n n y ? y y y y
4 democrat n y y n y y n n n n n n y y y y
This should work:
for col in df.columns:
    df.loc[df[col] == 'n', col] = 0
    df.loc[df[col] == 'y', col] = 1
    df.loc[df[col] == '?', col] = 1
I think the simplest way is to use replace with a dict:
np.random.seed(100)
df = pd.DataFrame(np.random.choice(['n','y','?'], size=(5,5)),
                  columns=list('ABCDE'))
print (df)
A B C D E
0 n n n ? ?
1 n ? y ? ?
2 ? ? y n n
3 n n ? n y
4 y ? ? n n
d = {'n':0,'y':1,'?':1}
df = df.replace(d)
print (df)
A B C D E
0 0 0 0 1 1
1 0 1 1 1 1
2 1 1 1 0 0
3 0 0 1 0 1
4 1 1 1 0 0
This should do (the keys must match the lowercase values in the data):
df.infants = df.infants.map({'y': 1, 'n': 0, '?': 1})
Maybe you can try apply:
import pandas as pd
# create dataframe
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
df_new.head()
# map one row's category to the number 0/1
def tran_cat_to_num(df):
    if df['sex'] == 'male':
        return 1
    elif df['sex'] == 'female':
        return 0
# create sex_new by applying the function to every row
df_new['sex_new']=df_new.apply(tran_cat_to_num,axis=1)
df_new
Raw data:
number sex
0 1 male
1 2 female
2 3 female
3 4 female
4 5 male
After using apply:
number sex sex_new
0 1 male 1
1 2 female 0
2 3 female 0
3 4 female 0
4 5 male 1
You can change the values using the map function.
Ex.:
x = {'y': 1, 'n': 0, '?': 1}
for col in df.columns:
    df[col] = df[col].map(x)
This way you map each column of your dataframe. Any value that is not a key of x becomes NaN.
All the solutions above are correct, but what you could also do is:
df["infants"] = df["infants"].replace("y", 1).replace("n", 0).replace("?", 1)
which, now that I read more carefully, is very similar to using replace with a dict!
Use dataframe.replace():
df.replace({'infants':{'y':1,'?':1,'n':0}},inplace=True)
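Since the real dataset also has a party column that should stay as text, it may help to convert only the vote columns in one go. The sketch below uses a tiny made-up frame; the column names party, infants and water are assumptions, as are the names mapping and vote_cols.

```python
import pandas as pd

# Tiny made-up stand-in for the voting dataset (column names are assumed).
df = pd.DataFrame({
    "party":   ["republican", "democrat", "democrat"],
    "infants": ["n", "?", "y"],
    "water":   ["y", "y", "n"],
})

# 'n' becomes 0; 'y' and '?' both become 1, as in the question.
mapping = {"n": 0, "y": 1, "?": 1}

# Every column except the party label is treated as a vote column.
vote_cols = df.columns.drop("party")

# Map each vote column through the dict and write the result back.
df[vote_cols] = df[vote_cols].apply(lambda col: col.map(mapping))
print(df)
```

Because the column list is derived from df.columns, the same lines should work unchanged on the 56-column dataset, provided its label column is excluded in the same way.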

Pandas calculate length of consecutive equal values from a grouped dataframe

I want to do what they've done in the answer here: "Calculating the number of specific consecutive equal values in a vectorized way in pandas", but using a grouped dataframe instead of a series.
So given a dataframe with several columns
A B C
------------
x x 0
x x 5
x x 2
x x 0
x x 0
x x 3
x x 0
y x 1
y x 10
y x 0
y x 5
y x 0
y x 0
I want to groupby columns A and B, then count the number of consecutive zeros in C. After that I'd like to return counts of the number of times each length of zeros occurred. So I want output like this:
A B num_consecutive_zeros count
---------------------------------------
x x 1 2
x x 2 1
y x 1 1
y x 2 1
I don't know how to adapt the answer from the linked question to deal with grouped dataframes.
Here is the code. count_consecutive_zeros() uses numpy functions and value_counts() to get the run lengths and how often each occurs, groupby().apply(count_consecutive_zeros) calls it for every group, and reset_index() turns the resulting MultiIndex into columns:
import pandas as pd
import numpy as np
from io import StringIO
text = """A B C
x x 0
x x 5
x x 2
x x 0
x x 0
x x 3
x x 0
y x 1
y x 10
y x 0
y x 5
y x 0
y x 0"""
# parse the whitespace-separated sample data
df = pd.read_csv(StringIO(text), sep=r"\s+")
def count_consecutive_zeros(s):
    # +1 marks the start of a run of zeros, -1 marks one past its end
    v = np.diff(np.r_[0, s.values==0, 0])
    # run length = end - start; then count how often each length occurs
    s = pd.Series(np.where(v == -1)[0] - np.where(v == 1)[0]).value_counts()
    s.index.name = "num_consecutive_zeros"
    s.name = "count"
    return s
df.groupby(["A", "B"]).C.apply(count_consecutive_zeros).reset_index(name="count")
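A pandas-only variant of the same idea, offered as a sketch rather than as the method from the linked answer: give every run a label with a cumulative sum, measure the zero runs, then count the lengths. The names is_zero, new_run, run_lengths and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": list("xxxxxxx") + list("yyyyyy"),
    "B": ["x"] * 13,
    "C": [0, 5, 2, 0, 0, 3, 0, 1, 10, 0, 5, 0, 0],
})

is_zero = df["C"].eq(0)

# A new run starts whenever the zero/non-zero state or the group key changes.
new_run = (
    is_zero.astype(int).diff().ne(0)
    | df["A"].ne(df["A"].shift())
    | df["B"].ne(df["B"].shift())
)
df["run"] = new_run.cumsum()

# Length of every run of zeros, one entry per run.
run_lengths = df[is_zero].groupby(["A", "B", "run"]).size()

# How many times each run length occurs within each (A, B) group.
result = (
    run_lengths.rename("num_consecutive_zeros")
    .reset_index()
    .groupby(["A", "B", "num_consecutive_zeros"])
    .size()
    .reset_index(name="count")
)
print(result)
```

Including the group keys in new_run keeps a run of zeros from being merged across the boundary between two groups.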
