I want a grouped bar chart, but the default plot doesn't have the groupings the way I'd like, and I'm struggling to get them rearranged properly.
The dataframe looks like this:
user year cat1 cat2 cat3 cat4 cat5
0 Brad 2014 309 186 119 702 73
1 Brad 2015 280 177 100 625 75
2 Brad 2016 306 148 127 671 74
3 Brian 2014 298 182 131 702 73
4 Brian 2015 295 125 117 607 76
5 Brian 2016 298 137 97 596 75
6 Chris 2014 309 171 111 654 72
7 Chris 2015 251 146 105 559 76
8 Chris 2016 231 130 105 526 75
etc
Elsewhere, the code produces two variables, user1 and user2. I want to produce a bar chart that compares the numbers for those two users over time in cat1, cat2, and cat3. So for example if user1 and user2 were Brian and Chris, I would want a chart that looks something like this:
On an aesthetic note: I'd prefer the year labels be vertical text or a font size that fits on a single line, but it's really the dataframe pivot that's confusing me at the moment.
Select the subset of users you want to plot against. Use pivot_table later to transform the DF to the required format to be plotted by transposing and unstacking it.
import matplotlib.pyplot as plt
def select_user_plot(user_1, user_2, cats, frame, idx, col):
frame = frame[(frame[idx[0]] == user_1)|(frame[idx[0]] == user_2)]
frame_pivot = frame.pivot_table(index=idx, columns=col, values=cats).T.unstack()
frame_pivot.plot.bar(legend=True, cmap=plt.get_cmap('RdYlGn'), figsize=(8,8), rot=0)
Finally,
Choose the users and categories to be included in the bar plot.
user_1 = 'Brian'
user_2 = 'Chris'
cats = ['cat1', 'cat2', 'cat3']
select_user_plot(user_1, user_2, cats, frame=df, idx=['user'], col=['year'])
Note: This gives close to the plot that the OP had posted.(Year appearing as Legends instead of the tick labels)
Related
My dataset is called df:
year
french
flemish
2014
200
200
2015
170
210
2016
130
220
2017
120
225
2018
210
250
I want to create a histogram in seaborn with french and flemish on the x-axis and year as the hue.
I tried this, but it didn't work successfully:
sns.histplot(data=df, x="french", hue="year", multiple="dodge", shrink=.8)
The y-axis should show the height of the number of the columns of french and flemish.
You need to use a bar function, not a histogram function. Histogram functions take raw data and count them, but your data are already counted.
You need to melt the french and flemish columns into "long form." Then x will be the language, and y will be the counts.
sns.barplot(data=df.melt("year", var_name="language", value_name="count"),
x="language",
y="count",
hue="year")
plt.legend(loc=(1.05, 0))
The melted dataframe for reference:
>>> df.melt("year", var_name="language", value_name="count")
# year language count
# 0 2014 french 200
# 1 2015 french 170
# 2 2016 french 130
# 3 2017 french 120
# 4 2018 french 210
# 5 2014 flemish 200
# 6 2015 flemish 210
# 7 2016 flemish 220
# 8 2017 flemish 225
# 9 2018 flemish 250
I have two dataframes:
Researchers: a list of all researcher and their id_number
Samples: a list of samples and all researchers related to it, there may be several researchers in the same cell.
I want to go through every row in the researcher table and check if they occur in each row of the Table Samples. If they do I want to get: a) their id from the researcher table and the sample number from the Samples table.
Table researcher
id_researcher full_name
0 1 Jack Sparrow
1 2 Demi moore
2 3 Bickman
3 4 Charles Darwin
4 5 H. Haffer
Table samples
sample_number collector
230 INPA A 231 Haffer
231 INPA A 232 Charles Darwin
232 INPA A 233 NaN
233 INPA A 234 NaN
234 INPA A 235 Jack Sparrow; Demi Moore ; Bickman
Output I want:
id_researcher num_samples
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235
I was able to it with a loop in regular python with the following code, but it is extremely low and quite long. Does anyone know a faster and simpler way? perhaps with pandas apply?
id_researcher = []
id_num_sample = []
for c in range(len(data_researcher)):
for a in range(len(data_samples)):
if pd.isna(data_samples['collector'].iloc[a]) == False and data_researcher['full_name'].iloc[c] in data_samples['collector'].iloc[a]:
id_researcher.append(data_researcher['id_researcher'].iloc[c])
id_num_sample.append(data_samples['No TEC'].iloc[a])
data_researcher_sample = pd.DataFrame.from_dict({'id_pesq': id_researcher, 'num_sample': id_num_sample}).sort_values(by='num_amostra')
You have a few data cleaning job to do such as 'Moore' in lowercase, 'Haffer' with first name initials in one case and none in the other, etc. After normalizing your two dataframes, you can split and explode collections and use merge:
samples['collector'] = samples['collector'].str.split(';')
samples = samples.explode('collector')
samples['collector'] = samples['collector'].str.strip()
out = researchers.merge(samples, right_on='collector', left_on='full_name', how='left')[['id_researcher','sample_number']].sort_values(by='sample_number').reset_index(drop=True)
Output:
id_researcher sample_number
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235
I have two datasets: one with cancer positive patients (df_pos), and the other with the cancer negative patients (df_neg).
df_pos
id
0 123
1 124
2 125
df_neg
id
0 234
1 235
2 236
I want to compile these datasets into one with an extra column if the patient has cancer or not (yes or no).
Here is my desired outcome:
id outcome
0 123 yes
1 124 yes
2 125 yes
3 234 no
4 235 no
5 236 no
What would be a smarter approach to compile these?
Any suggestions would be appreciated. Thanks!
Use pandas.DataFrame.append and pandas.DataFrame.assign:
>>> df_pos.assign(outcome='Yes').append(df_neg.assign(outcome='No'), ignore_index=True)
id outcome
0 123 Yes
1 124 Yes
2 125 Yes
3 234 No
4 235 No
5 236 No
df_pos['outcome'] = True
df_neg['outcome'] = False
df = pd.concat([df_pos, df_neg]).reset_index(drop=True)
I tried looking for a succinct answer and nothing helped. I am trying to add a row to a dataframe that takes a string for the first column and then for each column grabbing the sum. I ran into a scalar issue, so I tried to make the desired row into a series then convert to a dataframe, but apparently I was adding four rows with one column value instead of one row with the four column values.
My code:
def country_csv():
# loop through absolute paths of each file in source
for filename in os.listdir(source):
filepath = os.path.join(source, filename)
if not os.path.isfile(filepath):
continue
df = pd.read_csv(filepath)
df = df.groupby(['Country']).sum()
df.reset_index()
print(df)
# df.to_csv(os.path.join(path1, filename))
Sample dataframe:
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
Would like to see this as the first row
World 632 27 109
import pandas as pd
import datetime as dt
df
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
df.loc['World'] = [df['Confirmed'].sum(),df['Deaths'].sum(),df['Recovered'].sum()]
df.sort_values(by=['Confirmed'], ascending=False)
Confirmed Deaths Recovered
Country
World 632 27 109
Albania 333 20 99
Afghanistan 299 7 10
IIUC, you can create a dict then repass it into a dataframe to concat.
data = df.sum(axis=0).to_dict()
data.update({'Country' : 'World'})
df2 = pd.concat([pd.DataFrame(data,index=[0]).set_index('Country'),df],axis=0)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99
or a oner liner using assign and Transpose
df2 = pd.concat(
[df.sum(axis=0).to_frame().T.assign(Country="World").set_index("Country"), df],
axis=0,
)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99
This is my dataset.
Country Type Disaster Count
0 CHINA P REP Industrial Accident 415
1 CHINA P REP Transport Accident 231
2 CHINA P REP Flood 175
3 INDIA Transport Accident 425
4 INDIA Flood 206
5 INDIA Storm 121
6 UNITED STATES Storm 348
7 UNITED STATES Transport Accident 159
8 UNITED STATES Flood 92
9 PHILIPPINES Storm 249
10 PHILIPPINES Transport Accident 84
11 PHILIPPINES Flood 71
12 INDONESIA Transport Accident 136
13 INDONESIA Flood 110
14 INDONESIA Seismic Activity 77
I would like to make a triple bar chart and the label is based on the column 'Type'. I would also like to group the bar based on the column 'Country'.
I have tried using (with df as the DataFrame object of the pandas library),
df.groupby('Country').plot.bar()
but the result came out as multiple bar charts representing each group in the 'Country' column.
The expected output is similar to this:
What are the codes that I need to run in order to achieve this graph?
There are two ways -
df.set_index('Country').pivot(columns='Type').plot.bar()
df.set_index(['Country','Type']).plot.bar()