I am making a graph to plot the gender count for time series data that looks like the following. Each row represents one hour of data for the respective patient.
HR   SBP  DBP  Sepsis  Gender  P_ID
92   120  80   0       0       0
98   115  85   0       0       0
93   125  75   1       1       1
95   130  90   1       1       1
102  120  80   0       0       2
109  115  75   0       0       2
94   135  100  0       0       2
97   100  70   1       1       3
85   120  80   1       1       3
88   115  75   1       1       3
93   125  85   1       1       3
78   130  90   1       0       4
115  140  110  1       0       4
102  120  80   0       1       5
98   140  110  0       1       5
This is my code:
gender = df_n['Gender'].value_counts()
plt.figure(figsize=(7, 6))
ax = gender.plot(kind='bar', rot=0, color="c")
ax.set_title("Bar Graph of Gender", y = 1)
ax.set_xlabel('Gender')
ax.set_ylabel('Number of People')
ax.set_xticklabels(('Male', 'Female'))
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
What happens now is that the code counts the total number of rows (0: Male, 1: Female) and plots that. But I want to plot the number of male and female patients, not the number of 0s and 1s, because the same patient has multiple rows of data (as per P_ID). In other words: how many patients are male and how many are female?
Can someone help me out? I guess maybe sns.countplot can be used. But I don't know how.
Thanks for helping me out >.<
__________ Update ________________
How can I group the genders by sepsis (1) or no sepsis (0)?
__________ Update 2 ___________
So, I got the total actual count of Male and Female, thanks to #Shaido.
In the whole dataset there are only 2932 septic patients; the rest are non-septic. This is what I got from #JohanC's answer.
Now, the problem is that although there are 2932 septic patients, the graph suggests that only 426 patients (251 male and 175 female) are septic, out of the actual 2932, and that the rest are non-septic. But this is not true. Please help. Thanks.
I have a working example for selecting the unique IDs. It looks ugly, so there is probably a better way, but it works:
import pandas as pd
# example of data:
data = {'gender': [0, 0, 1, 1, 1, 1, 0, 0], 'id': [1, 1, 2, 2, 3, 3, 4, 4]}
df = pd.DataFrame(data)
# get all unique ids:
ids = set(df.id)
# Go over all id, get first element of gender:
g = [list(df[df['id'] == i]['gender'])[0] for i in ids]
# count genders, the lazy way, using pandas since the rest of the code also assumes a dataframe for plotting:
gender_counts = pd.DataFrame(g).value_counts()
# from here you can use your plot function.
# Or Counter
from collections import Counter
gender_counts = Counter(g)
# You have to create another method for plotting the gender.
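The same idea can be written more compactly by keeping one row per patient and counting on that. This is only a sketch: the toy frame `df_n` below is an assumption that mirrors the question's `Gender` and `P_ID` columns, and it relies on each patient having a single gender value.

```python
import pandas as pd

# Toy data mirroring the question's layout: several hourly rows per patient.
df_n = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1],
    "P_ID":   [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5],
})

# Keep the first row of every patient, then count genders among patients (not rows).
per_patient = df_n.drop_duplicates(subset="P_ID")
gender_counts = per_patient["Gender"].value_counts().sort_index()

# Replace the 0/1 codes by names (0: Male, 1: Female, as stated in the question).
gender_counts.index = gender_counts.index.map({0: "Male", 1: "Female"})
print(gender_counts)
```

The resulting `gender_counts` Series can be passed to the bar-plot code from the question in place of `df_n['Gender'].value_counts()`.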
You can group by 'P_ID' and take the first row for each of them (supposing a 'P_ID' has only one gender and only one sepsis). Then you can call sns.countplot on that dataframe, using gender for x and sepsis for hue (or vice versa). You can rename the values in the columns to show their names in the legend and in the tick labels.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from io import StringIO
data_str = '''
HR|SBP|DBP|Sepsis|Gender|P_ID
92|120|80|0|0|0
98|115|85|0|0|0
93|125|75|1|1|1
95|130|90|1|1|1
102|120|80|0|0|2
109|115|75|0|0|2
94|135|100|0|0|2
97|100|70|1|1|3
85|120|80|1|1|3
88|115|75|1|1|3
93|125|85|1|1|3
78|130|90|1|0|4
115|140|110|1|0|4
102|120|80|0|1|5
98|140|110|0|1|5
'''
df = pd.read_csv(StringIO(data_str), delimiter='|')
# new df: take Sepsis and Gender from the first row for every P_ID
df_per_PID = df.groupby('P_ID')[['Sepsis', 'Gender']].first()
# give names to the values in the columns
df_per_PID = df_per_PID.replace({'Gender': {0: 'Male', 1: 'Female'}, 'Sepsis': {0: 'No sepsis', 1: 'Sepsis'}})
# show counts per Gender and Sepsis
ax = sns.countplot(data=df_per_PID, x='Gender', hue='Sepsis', palette='rocket')
ax.legend(title='') # remove title, as it is clear from the legend items
ax.set_xlabel('')
for bars in ax.containers:
    ax.bar_label(bars)
# ax.margins(y=0.1) # make some extra space for the labels
ax.locator_params(axis='y', integer=True)
sns.despine()
plt.show()
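Regarding Update 2 in the question: one possible cause of the low septic count, stated here as an assumption because the full dataset is not shown, is that a patient's Sepsis value changes from 0 to 1 in later hours. In that case `.first()` picks up the early 0 and the patient is counted as non-septic. A minimal sketch that treats a patient as septic if any of their rows has Sepsis equal to 1 (the toy data is invented for illustration):

```python
import pandas as pd

# Toy data: patient 1 starts with Sepsis = 0 and becomes septic later.
df = pd.DataFrame({
    "Sepsis": [0, 0, 0, 1, 0, 0, 1, 1],
    "Gender": [0, 0, 1, 1, 0, 0, 1, 1],
    "P_ID":   [0, 0, 1, 1, 2, 2, 3, 3],
})

# One row per patient: Gender from the first row, Sepsis = 1 if any hourly row is 1.
per_patient = df.groupby("P_ID").agg(
    Gender=("Gender", "first"),
    Sepsis=("Sepsis", "max"),
)

# Number of patients for every (Gender, Sepsis) combination.
counts = pd.crosstab(per_patient["Gender"], per_patient["Sepsis"])
print(counts)
```

If this assumption holds for the real data, `per_patient` can be given to `sns.countplot` in the same way as `df_per_PID` above.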
I have a dataset of four years' worth of ACT participation percentages by state entitled 'part_ACT'. Here's a snippet of it:
Index State ACT17 ACT18 ACT19 ACT20
0 Alabama 100 100 100 100
1 Alaska 65 33 38 33
2 Arizona 62 66 73 71
3 Arkansas 100 100 100 100
4 California 31 27 23 19
5 Colorado 100 30 27 25
6 Connecticut 31 26 22 19
I'm trying to produce a line graph with each of the four column headings on the x-axis and their values on the y-axis (1-100). I would prefer to display all of these lines in a single figure.
What's the easiest way to do this? I'm fine with Pandas, Matplotlib, Seaborn, or whatever. Thanks much!
One solution is to melt the df and plot with hue
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
    'State': ['A', 'B', 'C', 'D'],
    'x18': sorted(np.random.randint(0, 100, 4)),
    'x19': sorted(np.random.randint(0, 100, 4)),
    'x20': sorted(np.random.randint(0, 100, 4)),
    'x21': sorted(np.random.randint(0, 100, 4)),
})
df_melt = df.melt(id_vars='State', var_name='year')
sns.relplot(
    kind='line',
    data=df_melt,
    x='year', y='value',
    hue='State'
)
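To make the reshaping step concrete, here is a small sketch of what melt does to two rows of the question's data. The names `wide` and `long` are chosen for illustration only.

```python
import pandas as pd

# Two states and two years taken from the question's snippet.
wide = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "ACT17": [100, 65],
    "ACT18": [100, 33],
})

# Wide to long: one row per (State, year) pair, with the percentage in 'value'.
long = wide.melt(id_vars="State", var_name="year", value_name="value")
print(long)
```

Each former column heading becomes a value in the `year` column, which is what lets seaborn map it to the x-axis and use `State` for hue.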
Creating a plot is all about the shape of the DataFrame.
One way to accomplish this is by converting the DataFrame from wide to long, with melt, but this isn't necessary.
The primary requirement is to set 'State' as the index.
Plots can be generated directly with df, or df.T (.T is the transpose of the DataFrame).
The OP requests a line plot, but this is discrete data, and the correct way to visualize discrete data is with a bar plot, not a line plot.
pandas v1.2.3, seaborn v0.11.1, and matplotlib v3.3.4
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut'],
'ACT17': [100, 65, 62, 100, 31, 100, 31],
'ACT18': [100, 33, 66, 100, 27, 30, 26],
'ACT19': [100, 38, 73, 100, 23, 27, 22],
'ACT20': [100, 33, 71, 100, 19, 25, 19]}
df = pd.DataFrame(data)
# set State as the index - this is important
df.set_index('State', inplace=True)
# display(df)
ACT17 ACT18 ACT19 ACT20
State
Alabama 100 100 100 100
Alaska 65 33 38 33
Arizona 62 66 73 71
Arkansas 100 100 100 100
California 31 27 23 19
Colorado 100 30 27 25
Connecticut 31 26 22 19
# display(df.T)
State Alabama Alaska Arizona Arkansas California Colorado Connecticut
ACT17 100 65 62 100 31 100 31
ACT18 100 33 66 100 27 30 26
ACT19 100 38 73 100 23 27 22
ACT20 100 33 71 100 19 25 19
Plot 1
Use pandas.DataFrame.plot
df.T.plot()
plt.legend(title='State', bbox_to_anchor=(1.05, 1), loc='upper left')
# get rid of the ticks between the labels - not necessary
plt.xticks(ticks=range(0, len(df.T)))
plt.show()
Plot 2 & 3
Use pandas.DataFrame.plot with kind='bar' or kind='barh'
The bar plot is much better at conveying the yearly changes in the data, and allows for an easy comparison between states.
df.plot(kind='bar')
plt.legend(title='Year', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
[Figures: bar plots produced with kind='bar' and kind='barh']
Plot 4
Use seaborn.lineplot
seaborn.lineplot plots a wide dataframe directly, using the index for the x-axis and drawing one line per column.
sns.lineplot(data=df.T)
plt.legend(title='State', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
I have the same 3 models (M1, M2, M3), each producing results for 5 customers (x1, x2, x3, x4, x5). The business has now chosen one model for each customer; the chosen models are listed in the Best_Models dataframe. I have to select, for each customer, the result of the model the business chose, as shown in the Output dataframe. How can I do that?
import pandas as pd
data1 = {'x1': [86,23,32,13,45,12],
'x2': [96,98,34,12,22,19],
'x3': [56,23,44,12,32,33],
'x4': [96,43,84,72,42,97],
'x5': [16,33,64,82,92,44]
}
Model1 = pd.DataFrame(data1,
columns=['x1','x2','x3','x4','x5']
)
data2 = {'x1': [36,23,32,13,66,12],
'x2': [56,98,64,12,22,19],
'x3': [86,23,44,52,32,33],
'x4': [96,43,74,72,42,97],
'x5': [16,53,64,82,77,44]
}
Model2 = pd.DataFrame(data1,
columns=['x1','x2','x3','x4','x5'])
data3 = {'x1': [36,43,32,13,66,12],
'x2': [56,48,64,12,22,19],
'x3': [86,23,44,54,32,33],
'x4': [96,44,74,44,42,97],
'x5': [16,53,64,82,44,44]
}
Model3 = pd.DataFrame(data3,
columns=['x1','x2','x3','x4','x5'])
Model3
data4 = {"Customer":["x1","x2","x3","x4","x5"],
"Best_Model":["M2","M3","M1","M2","M3"]
}
Best_Models = pd.DataFrame(data4, columns=['Customer', 'Best_Model'])
Best_Models
data5 = {'x1': [36,23,32,13,66,12],
'x2': [56,48,64,12,22,19],
'x3': [56,23,44,12,32,33],
'x4': [96,43,74,72,42,97],
'x5': [16,53,64,82,44,44]
}
Output = pd.DataFrame(data5,
columns=['x1','x2','x3','x4','x5'],
index=['I1', 'I2','I3','I4','I5','I6'])
Output
What I tried:
I tried pivoting the Best_Models dataframe and then mapping the results, but that did not work for me. Could anyone suggest a better way to code this?
Let's try concat, then select the columns with loc:
(pd.concat([Model1,Model2,Model3], keys=['M1','M2','M3'], axis=1)
.loc[:,[(m,c) for m,c in zip(Best_Models.Best_Model, Best_Models.Customer)]]
)
Output:
   M2  M3  M1  M2  M3
   x1  x2  x3  x4  x5
0  86  56  56  96  16
1  23  48  23  43  53
2  32  64  44  84  64
3  13  12  12  72  82
4  45  22  32  42  44
5  12  19  33  97  44
Best_Models.apply(lambda r:
{'M1': Model1, 'M2': Model2, 'M3': Model3}[
r['Best_Model']][r['Customer']], axis=1).T.rename(
columns=Best_Models.Customer)
Output:
   x1  x2  x3  x4  x5
0  86  56  56  96  16
1  23  48  23  43  53
2  32  64  44  84  64
3  13  12  12  72  82
4  45  22  32  42  44
5  12  19  33  97  44
Create a dictionary that maps each best-model name to the actual model dataframe.
Since the customer names in Best_Models match the column names of the model dataframes, we can index the models directly.
Finally, rename the columns of the result with the corresponding customer names.
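The dictionary lookup can also be written as a plain loop over (customer, model) pairs. The sketch below uses only the first two rows of each model to stay short. Note one assumption: `Model2` is built from the question's `data2` values here, whereas the question's code passes `data1` to `Model2`, which looks like a typo and would explain why the outputs printed above do not match the expected Output frame.

```python
import pandas as pd

cols = ["x1", "x2", "x3", "x4", "x5"]

# First two rows of data1, data2 and data3 from the question.
Model1 = pd.DataFrame({"x1": [86, 23], "x2": [96, 98], "x3": [56, 23], "x4": [96, 43], "x5": [16, 33]}, columns=cols)
Model2 = pd.DataFrame({"x1": [36, 23], "x2": [56, 98], "x3": [86, 23], "x4": [96, 43], "x5": [16, 53]}, columns=cols)
Model3 = pd.DataFrame({"x1": [36, 43], "x2": [56, 48], "x3": [86, 23], "x4": [96, 44], "x5": [16, 53]}, columns=cols)

Best_Models = pd.DataFrame({"Customer": cols, "Best_Model": ["M2", "M3", "M1", "M2", "M3"]})

# Map model names to the model dataframes.
models = {"M1": Model1, "M2": Model2, "M3": Model3}

# For each customer, take that customer's column from the chosen model.
result = pd.DataFrame({
    customer: models[best][customer]
    for customer, best in zip(Best_Models["Customer"], Best_Models["Best_Model"])
})
print(result)
```

With these inputs, `result` agrees with the first two rows of the question's Output frame.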
gender  math score  reading score  writing score
female  65          73             74
male    69          66             64
Given the dataframe (see above) how can we add a line that would calculate the difference between the row values in the following way :
gender      math score  reading score  writing score
female      65          73             74
male        69          66             64
Difference  -4          7              10
Or is there a more convenient way of expressing the difference between the rows?
Thank you in advance
Let the dataframe be:
import pandas as pd
df = pd.DataFrame({"A":[5, 10], "B":[9, 8], "gender": ["female", "male"]}).set_index("gender")
df.loc['Difference'] = df.apply(lambda x: x["female"]-x["male"])
In a one-liner with .loc[] and .diff():
df.loc['Difference'] = df.diff(-1).dropna().values.tolist()[0]
Another idea would be to work with a transposed dataframe and then transpose it back:
import pandas as pd
df = pd.DataFrame({'gender':['female','male'],'math score':[65,69],'reading score':[73,66],'writing score':[74,64]}).set_index('gender')
df = df.T
# periods=-1 subtracts the next column, so 'female' holds female - male
df['Difference'] = df.diff(periods=-1, axis=1)['female'].values
df = df.T
Output:
            math score  reading score  writing score
gender
female            65.0           73.0           74.0
male              69.0           66.0           64.0
Difference        -4.0            7.0           10.0
You can calculate the diff by selecting each row and then subtracting. But as you've correctly guessed, that is not the best way to do this. A more convenient way would be to transpose the df and then do subtraction:
import pandas as pd
df = pd.DataFrame([[65, 73, 74], [69, 66, 64]],
index=['female', 'male'],
columns=['math score', 'reading score', 'writing score'])
df_ = df.T
df_['Difference'] = df_['female'] - df_['male']
This is what you get:
               female  male  Difference
math score         65    69          -4
reading score      73    66           7
writing score      74    64          10
If you want, you can transpose it again with df_.T to return it to its initial form.
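For comparison, the direct approach of selecting each row and subtracting needs no transpose. A short sketch using the question's numbers:

```python
import pandas as pd

df = pd.DataFrame(
    [[65, 73, 74], [69, 66, 64]],
    index=["female", "male"],
    columns=["math score", "reading score", "writing score"],
)

# Select both rows by label and append their difference as a new row.
df.loc["Difference"] = df.loc["female"] - df.loc["male"]
print(df)
```

Which variant is more convenient is a matter of taste; the transposed form is handy when the difference is wanted as a column for further calculations.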
I have two csv files: one very large, with thousands of rows, and one of normal size. Each file has a column containing product names, which I call ProductName. The large csv holds the names of all products in one column and their labels in another. The smaller csv contains some of the product names from the larger csv, plus some names that do not exist there. I want to read every row of the ProductName column in the smaller csv and check whether the same name appears in the ProductName column of the large csv. If a match is found, I need to copy the label of that product from the large csv into a new column in the smaller csv. I'm using pandas, and I could get what I was looking for. Here is my code:
import pandas as pd
df = pd.read_csv('Products.csv')  # small csv file
df2 = pd.read_csv('ProductsMain.csv')  # large csv file
rowCounter = 0
for name in df['ProductName']:
    nameCounter = df2.ProductName.str.contains(name).sum()
    if nameCounter > 0:  # only checking for the product label if it exists in the larger csv
        rowNum = df2[df2['ProductName'] == name].index[0]
        label = df2.iloc[rowNum, -1]  # Label column is the last column in df2
        df.loc[rowCounter, 'Label'] = label  # set a single cell by row label and column name
        df.to_csv('Products.csv', index=False)
    rowCounter += 1
I have two questions here. First, is there a better way to do this? In particular, when the csv file is very large, I'm not sure this is the fastest way to find the matching name in the larger csv file. Second, what if I don't know the position of the label column and want to address it by column name together with the row index? iloc doesn't accept names and numbers together, so I cannot use df2.iloc[rowNum,'label'], but I would like to know some way to do this.
Edit: Please take a look at this example if the description above is not clear enough. Let's say I have two csv files as follows:
ProductsMain.csv:
ProductName 0 1 2 3 Label
X1 29 74 30 60 0
X2 18 25 84 70 0
X3 10 45 72 43 1
X4 35 70 65 39 0
Y1 14 35 80 58 2
Y2 25 65 40 30 2
Y3 40 60 18 90 2
Y4 10 20 35 70 1
Products.csv:
ProductName 0 1 2 3
X2 18 25 84 70
Y1 14 35 80 58
Y5 19 37 49 75
X1 29 74 30 60
After running the code:
Products.csv:
ProductName 0 1 2 3 Label
X2 18 25 84 70 0
Y1 14 35 80 58 2
Y5 19 37 49 75
X1 29 74 30 60 0
In other words, I first look up each product name from Products.csv in ProductsMain.csv. If I find a matching name, I take the corresponding label of that product and save it in a new column (called Label) in Products.csv. If the name doesn't exist in ProductsMain.csv, I do nothing and continue with the next ProductName in Products.csv, until I reach the end of Products.csv.
Edit2: I also figured out I can use ix instead of iloc to reach cells by name and index: label=df2.ix[rowNum,'label']
You can use the merge function in pandas to merge the two data frames as follows:
import pandas as pd
df_productsMain = pd.DataFrame({'ProductName': ['P0', 'P1', 'P3'],
'X1': ['X10', 'X11', 'X13'],
'X2': ['X20', 'X21', 'X23'],
'Label': ['L0', 'L1', 'L3']},
index=[0, 1, 2])
df_products= pd.DataFrame({'ProductName': ['P0', 'P1', 'P2', 'P3', 'P4'],
'Y1': ['Y0', 'Y1', 'Y2', 'Y3', 'Y4'],
'Y2': ['Y0', 'Y1', 'Y2', 'Y3', 'Y4'],
'Y3': ['Y0', 'Y1', 'Y2', 'Y3', 'Y4']},
index=[0, 1, 2, 3, 4])
df_mergedResult = pd.merge(df_products, df_productsMain[['ProductName', 'Label']], on='ProductName', how='left' )
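Applied to the example data from the question, the same left merge might look like the sketch below. The frame names `products` and `products_main` are illustrative, and the numeric columns 0 to 3 are left out for brevity. The last lines also show `.loc`, which accepts a row label together with a column name; with the default integer index the row label equals the row number.

```python
import pandas as pd

# ProductName and Label columns of ProductsMain.csv from the question.
products_main = pd.DataFrame({
    "ProductName": ["X1", "X2", "X3", "X4", "Y1", "Y2", "Y3", "Y4"],
    "Label": [0, 0, 1, 0, 2, 2, 2, 1],
})
# ProductName column of Products.csv from the question.
products = pd.DataFrame({"ProductName": ["X2", "Y1", "Y5", "X1"]})

# Left merge: every row of `products` is kept; unmatched names get NaN as Label.
merged = products.merge(products_main[["ProductName", "Label"]], on="ProductName", how="left")
print(merged)

# Access by row label and column name with .loc.
label_row_4 = products_main.loc[4, "Label"]
label_of_y1 = products_main.loc[products_main["ProductName"] == "Y1", "Label"].iloc[0]
```

Because of the missing value for Y5, pandas stores the merged Label column as floats; whether that matters depends on how the column is used afterwards.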
Data Frames: