Aggregating and plotting multiple columns using matplotlib


I've got data in a pandas dataframe that looks like this:
ID A B C D
100 0 1 0 1
101 1 1 0 1
102 0 0 0 1
...
The idea is to create a barchart plot that shows the total of each (sum of the total number of A's, B's, etc.). Something like:
      X
  X   X
x X   X
A B C D
This should be so simple...

Set 'ID' aside, sum, and plot.bar:
df.set_index('ID').sum().plot.bar()
# or
df.drop(columns=['ID']).sum().plot.bar()
Output: a bar chart with one bar per column (image not preserved).
Just for fun, a text version:
print(df.drop(columns='ID')
.replace({0: ' ', 1: 'X'})
.apply(sorted, reverse=True)
.to_string(index=False)
)
Output:
A B C D
X X   X
  X   X
      X
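A minimal, self-contained sketch of the sum-and-plot approach, assuming the three sample rows from the question are the whole dataset. The `Agg` backend and the axis label are assumptions added so the snippet runs without a display; they are not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import pandas as pd

# The three sample rows shown in the question
df = pd.DataFrame({
    "ID": [100, 101, 102],
    "A": [0, 1, 0],
    "B": [1, 1, 0],
    "C": [0, 0, 0],
    "D": [1, 1, 1],
})

# Move ID out of the way, then sum each remaining column
totals = df.set_index("ID").sum()

# One bar per column; plot.bar returns a matplotlib Axes
ax = totals.plot.bar()
ax.set_ylabel("total")
```

In an interactive session, the default backend plus `matplotlib.pyplot.show()` would display the chart instead.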

Related

Pandas dataframe: Creating a new column based on data from other columns

I have a pandas dataframe, df:
foo bar
0 Supplies Sample X
1 xyz A
2 xyz B
3 Supplies Sample Y
4 xyz C
5 Supplies Sample Z
6 xyz D
7 xyz E
8 xyz F
I want to create a new df that looks something like this:
bar
0 Sample X - A
1 Sample X - B
2 Sample Y - C
3 Sample Z - D
4 Sample Z - E
5 Sample Z - F
I am new to Pandas so I don't know how to achieve this. Could someone please help?
I tried DataFrame.iterrows but had no luck.

You can use boolean indexing and ffill:
m = df['foo'].ne('Supplies')
out = (df['bar'].mask(m).ffill()[m]
.add(' - '+df.loc[m, 'bar'])
.to_frame().reset_index(drop=True)
)
Output:
bar
0 Sample X - A
1 Sample X - B
2 Sample Y - C
3 Sample Z - D
4 Sample Z - E
5 Sample Z - F
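To make the intermediate steps of the mask-and-ffill idea easier to follow, here is a sketch that rebuilds the sample frame from the question and names each step. The variable names `is_item` and `header` are illustrative only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "foo": ["Supplies", "xyz", "xyz", "Supplies", "xyz",
            "Supplies", "xyz", "xyz", "xyz"],
    "bar": ["Sample X", "A", "B", "Sample Y", "C",
            "Sample Z", "D", "E", "F"],
})

# True for the item rows, False for the "Supplies" header rows
is_item = df["foo"].ne("Supplies")

# Blank out the item rows, then carry each header value forward
header = df["bar"].mask(is_item).ffill()

# Keep only the item rows and glue header and item together
out = (
    (header[is_item] + " - " + df.loc[is_item, "bar"])
    .to_frame("bar")
    .reset_index(drop=True)
)
```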

You can do:
s = (df["bar"].mask(df.foo == "xyz").ffill() + "-" + df["bar"]).reindex(
df.loc[df.foo == "xyz"].index
)
df = s.to_frame()
print(df):
bar
1 Sample X-A
2 Sample X-B
4 Sample Y-C
6 Sample Z-D
7 Sample Z-E
8 Sample Z-F

Another possible solution:
g = df.groupby(np.cumsum(df.bar.str.startswith('Sample')))
pd.DataFrame([x[1].bar.values[0] + ' - ' + y
              for x in g for y in x[1].bar.values[1:]], columns=['bar'])
Output:
bar
0 Sample X - A
1 Sample X - B
2 Sample Y - C
3 Sample Z - D
4 Sample Z - E
5 Sample Z - F

calculating the percentage of count in pandas groupby

I want to discover the underlying pattern between my features and target, so I tried groupby. Instead of the raw count, I want the ratio or percentage relative to the total count of each class.
The following code is similar to the work I have done:
fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])

You can achieve this more simply with:
out = df.groupby('class').value_counts(normalize=True).mul(100)
Output:
class  fet1  fet2
0      A     Y       13.859275
       B     Y       12.366738
             X       12.153518
       C     X       11.513859
             Y       10.660981
       B     Z       10.447761
       A     Z       10.021322
       C     Z        9.594883
       A     X        9.381663
1      A     Y       14.124294
       C     Z       13.935970
       B     Z       11.676083
             Y       11.111111
       C     Y       11.111111
             X       11.111111
       A     X       10.169492
       B     X        9.416196
       A     Z        7.344633
dtype: float64
If you want the same MultiIndex level order as in your code:
out = (df
.groupby('class').value_counts(normalize=True).mul(100)
.reorder_levels(['fet1', 'fet2', 'class']).sort_index()
)
Output:
fet1  fet2  class
A     X     0         9.381663
            1        10.169492
      Y     0        13.859275
            1        14.124294
      Z     0        10.021322
            1         7.344633
B     X     0        12.153518
            1         9.416196
      Y     0        12.366738
            1        11.111111
      Z     0        10.447761
            1        11.676083
C     X     0        11.513859
            1        11.111111
      Y     0        10.660981
            1        11.111111
      Z     0         9.594883
            1        13.935970
dtype: float64

I achieved it by doing this:
fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])/df.groupby(['class'])['class'].agg(['count'])*100
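The division above relies on pandas aligning the two results on the shared `class` level. A small deterministic sketch can make that alignment explicit with `Series.div(..., level=...)`. The six-row frame below is made-up illustration data, not the random data from the question, and it uses a single feature column to keep the arithmetic easy to check.

```python
import pandas as pd

# Hypothetical toy data: one feature and a binary class
df = pd.DataFrame({
    "fet1":  ["A", "A", "B", "B", "B", "A"],
    "class": ["0", "0", "0", "1", "1", "1"],
})

# Count of each (feature, class) combination
counts = df.groupby(["fet1", "class"]).size()

# Total number of rows in each class
class_totals = df.groupby("class").size()

# Divide each count by the total of its class, matching on the "class" level
pct = counts.div(class_totals, level="class").mul(100)
```

Within each class the percentages add up to 100, which is a quick sanity check for this kind of normalisation.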

pd.DataFrame.update puts the result at the top of the dataframe

Let's say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
I want to update the zeroes in the z column of nf with the values from the z column of mf, but only in the rows whose x key appears in mf.
When I call
nf.update(mf)
I get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000

You need to align the indexes of both dataframes, because update matches rows by index label. Here is how you can do it:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)
nf = nf.reset_index()
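As an alternative sketch that leaves the index of nf untouched, the same lookup can be written with `Series.map`. This is a different technique from the `update` call above, and it assumes the x values in mf are unique. The name `lookup` is illustrative.

```python
import pandas as pd

nf = pd.DataFrame({
    "x": ["a", "b", "c", "d", "e"],
    "y": ["1", "2", "3", "4", "5"],
    "z": ["0", "0", "0", "0", "0"],
})
mf = pd.DataFrame({"x": ["b", "d", "e"], "z": ["10", "100", "1000"]})

# Build a key -> new value lookup from mf (assumes unique keys in mf["x"])
lookup = mf.set_index("x")["z"]

# Map each key in nf to its new value; keys missing from mf give NaN,
# which fillna replaces with the existing z value
nf["z"] = nf["x"].map(lookup).fillna(nf["z"])
```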

How to change values in a dataframe Python

I've searched for an answer for the past 30 min, but the only solutions are either for a single column or in R. I have a dataset in which I want to change the ('Y/N') values to 1 and 0 respectively. I feel like copying and pasting the code below 17 times is very inefficient.
df.loc[df.infants == 'n', 'infants'] = 0
df.loc[df.infants == 'y', 'infants'] = 1
df.loc[df.infants == '?', 'infants'] = 1
My solution is the following. This doesn't cause an error, but the values in the dataframe don't change. I'm assuming I need to do something like df = df_new. But how do I do this?
for coln in df:
    for value in coln:
        if value == 'y':
            value = '1'
        elif value == 'n':
            value = '0'
        else:
            value = '1'
EDIT: There are 17 columns in this dataset, but there is another dataset I'm hoping to tackle which contains 56 columns.
republican n y n.1 y.1 y.2 y.3 n.2 n.3 n.4 y.4 ? y.5 y.6 y.7 n.5 y.8
0 republican n y n y y y n n n n n y y y n ?
1 democrat ? y y ? y y n n n n y n y y n n
2 democrat n y y n ? y n n n n y n y n n y
3 democrat y y y n y y n n n n y ? y y y y
4 democrat n y y n y y n n n n n n y y y y

This should work:
for col in df.columns:
    df.loc[df[col] == 'n', col] = 0
    df.loc[df[col] == 'y', col] = 1
    df.loc[df[col] == '?', col] = 1

I think the simplest way is to use replace with a dict:
np.random.seed(100)
df = pd.DataFrame(np.random.choice(['n','y','?'], size=(5,5)),
columns=list('ABCDE'))
print (df)
A B C D E
0 n n n ? ?
1 n ? y ? ?
2 ? ? y n n
3 n n ? n y
4 y ? ? n n
d = {'n':0,'y':1,'?':1}
df = df.replace(d)
print (df)
A B C D E
0 0 0 0 1 1
1 0 1 1 1 1
2 1 1 1 0 0
3 0 0 1 0 1
4 1 1 1 0 0
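For a frame like the one in the question, where the first column holds the party name rather than a vote, the mapping probably should not touch that column. Here is a hedged sketch that applies a dict only to the vote columns. The column names `party`, `infants`, and `water` are assumed for illustration, and `Series.map` is used in place of `replace`.

```python
import pandas as pd

# Hypothetical miniature of the voting data
votes = pd.DataFrame({
    "party":   ["republican", "democrat", "democrat"],
    "infants": ["n", "?", "n"],
    "water":   ["y", "y", "?"],
})

mapping = {"n": 0, "y": 1, "?": 1}

# Every column except the label column
vote_cols = votes.columns.drop("party")

# Map each vote column through the dict; values missing from the dict would become NaN
votes[vote_cols] = votes[vote_cols].apply(lambda col: col.map(mapping))
```

The same pattern scales to 17 or 56 columns, because the column list is derived from the frame instead of being typed out.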

This should do:
df.infants = df.infants.map({'y': 1, 'n': 0, '?': 1})

Maybe you can try apply:
import pandas as pd
# create dataframe
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
df_new.head()
# create def for category to number 0/1
def tran_cat_to_num(df):
    if df['sex'] == 'male':
        return 1
    elif df['sex'] == 'female':
        return 0
# create sex_new
df_new['sex_new']=df_new.apply(tran_cat_to_num,axis=1)
df_new
Original dataframe:
number sex
0 1 male
1 2 female
2 3 female
3 4 female
4 5 male
After using apply:
number sex sex_new
0 1 male 1
1 2 female 0
2 3 female 0
3 4 female 0
4 5 male 1

You can change the values using the map function.
Ex.:
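x = {'y': 1, 'n': 0, '?': 1}
for col in df.columns:
    df[col] = df[col].map(x)
This way you map each column of your dataframe. Any value that is not a key in the dict becomes NaN, which is why '?' is included in the mapping.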

All the solutions above are correct, but what you could also do is:
df["infants"] = df["infants"].replace("y", 1).replace("n", 0).replace("?", 1)
which, now that I read more carefully, is very similar to using replace with a dict!

Use dataframe.replace():
df.replace({'infants':{'y':1,'?':1,'n':0}},inplace=True)

Pandas calculate length of consecutive equal values from a grouped dataframe

I want to do what they've done in the answer here: "Calculating the number of specific consecutive equal values in a vectorized way in pandas", but using a grouped dataframe instead of a series.
So given a dataframe with several columns
A B C
------------
x x 0
x x 5
x x 2
x x 0
x x 0
x x 3
x x 0
y x 1
y x 10
y x 0
y x 5
y x 0
y x 0
I want to groupby columns A and B, then count the number of consecutive zeros in C. After that I'd like to return counts of the number of times each length of zeros occurred. So I want output like this:
A B num_consecutive_zeros count
---------------------------------------
x x 1 2
x x 2 1
y x 1 1
y x 2 1
I don't know how to adapt the answer from the linked question to deal with grouped dataframes.

Here is the code. count_consecutive_zeros() uses numpy functions and value_counts() to get the run lengths and how often each occurs, and groupby().apply(count_consecutive_zeros) calls it for every group. Finally, reset_index() turns the MultiIndex into columns:
import pandas as pd
import numpy as np
from io import BytesIO
text = """A B C
x x 0
x x 5
x x 2
x x 0
x x 0
x x 3
x x 0
y x 1
y x 10
y x 0
y x 5
y x 0
y x 0"""
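# sep=r"\s+" splits on any run of whitespace
df = pd.read_csv(BytesIO(text.encode()), sep=r"\s+")

def count_consecutive_zeros(s):
    # +1 marks the start of a run of zeros, -1 the position just past its end
    v = np.diff(np.r_[0, s.values == 0, 0])
    run_lengths = np.where(v == -1)[0] - np.where(v == 1)[0]
    counts = pd.Series(run_lengths).value_counts()
    counts.index.name = "num_consecutive_zeros"
    counts.name = "count"
    return counts

df.groupby(["A", "B"]).C.apply(count_consecutive_zeros).reset_index(name="count")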
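The `np.diff` trick is easier to see on a single array. The sketch below walks through it for the C values of the first group (A = x, B = x) from the question. The intermediate names are illustrative.

```python
import numpy as np

# C values of the first group in the question
values = np.array([0, 5, 2, 0, 0, 3, 0])

# 1 where the value is zero, 0 elsewhere
is_zero = (values == 0).astype(int)

# Pad with a 0 on both sides so runs touching either end are detected,
# then take differences: +1 where a run starts, -1 just past where it ends
edges = np.diff(np.r_[0, is_zero, 0])

starts = np.where(edges == 1)[0]
ends = np.where(edges == -1)[0]

# Length of each run of consecutive zeros, in order of appearance
run_lengths = ends - starts
```

Here `run_lengths` holds 1, 2 and 1, which matches the expected result for this group: two runs of length 1 and one run of length 2.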
