I have the following function:
def plot_distribution(df, var, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    facet.map(sns.kdeplot, var, shade=True)
    facet.set(xlim=(0, df[var].max()))
    facet.add_legend()

plot_distribution(asma_df, var='ADDITIONAL_ASMA_40', target='RUNWAY', row='RUNWAY')
This function creates the following chart:
I want to change it so that the X axis shows the months, while the Y axis shows the average value of ADDITIONAL_ASMA_40 for each month.
This is a sample DataFrame df:
month ADDITIONAL_ASMA_40 RUNWAY
1 20 32L
1 22 32L
1 18 32R
2 25 32L
2 26 32L
2 25 32L
2 25 32R
Simply use the groupby function, selecting the numeric column so that the string column RUNWAY is not averaged:
df.groupby('month')['ADDITIONAL_ASMA_40'].mean()
and plot the result.
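As a minimal sketch of that idea, assuming the sample DataFrame from the question and assuming one line per runway is wanted (the name `monthly` and the headless `Agg` backend are choices made for this illustration, not part of the original answer):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2, 2],
    "ADDITIONAL_ASMA_40": [20, 22, 18, 25, 26, 25, 25],
    "RUNWAY": ["32L", "32L", "32R", "32L", "32L", "32L", "32R"],
})

# Average per month and runway, then one column per runway
monthly = (
    df.groupby(["month", "RUNWAY"])["ADDITIONAL_ASMA_40"]
      .mean()
      .unstack("RUNWAY")
)

# X axis: month, Y axis: mean ADDITIONAL_ASMA_40, one line per runway
ax = monthly.plot(marker="o")
ax.set_xlabel("month")
ax.set_ylabel("mean ADDITIONAL_ASMA_40")
```

Dropping "RUNWAY" from the groupby keys (and the `unstack` call) would give a single line with the overall monthly average instead.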
I have a dataframe (df_1) which contains coordinates and value data with no order that looks like this:
   x_grid  y_grid  n_value
0   204.0    32.0       45
1   204.0    33.0       32
2   204.0    34.0       94
3   204.0    35.0       92
4   204.0    36.0       84
I wanted to shape it into another dataframe (df_2) to be able to create a heatmap. So I created an empty dataframe where the column indexes are the x_grid values and the row indexes are the y_grid values.
Then, in a for loop, I tried the following: if the row index is equal to the x_grid value, set the column whose index is the y_grid value to the n_value.
Here is my code:
for i, row in enumerate(df_2.iterrows()):
    row_ind = index_list[i]
    for j, item in enumerate(df_1.iterrows()):
        x_ind = item[1].x_grid
        if row_ind == x_ind:
            col_ind = item[1].y_grid
            row[1].col_ind = item[1].n_value
When I run this loop, I see new values filling the dataframe, but they do not seem right. The coordinates and values in the second dataframe do not match the first one.
Second dataframe (df_2) partially looks something like this:
       0   25   26   27
0      0    0   27    0
195    0    0   32   36
196    0   65    0    0
197    0    0    0   24
198    0   73   58    0
Is there a better way to do this? I would also appreciate any other methods to turn the initial dataframe into a heatmap.
IIUC:
# rows are y_grid values, columns are x_grid values
df_2 = df_1.pivot(index='y_grid', columns='x_grid', values='n_value') \
           .reindex(index=pd.RangeIndex(0, df_1['y_grid'].max()+1),
                    columns=pd.RangeIndex(0, df_1['x_grid'].max()+1),
                    fill_value=0)
If you have duplicated values for the same (x, y), use pivot_table:
df_2 = df_1.pivot_table(values='n_value', index='y_grid', columns='x_grid', aggfunc='mean') \
           .reindex(index=pd.RangeIndex(df_1['y_grid'].min(), df_1['y_grid'].max()+1),
                    columns=pd.RangeIndex(df_1['x_grid'].min(), df_1['x_grid'].max()+1))
Example:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(2022)
df_1 = pd.DataFrame(np.random.randint(0, 20, (1000, 3)),
columns=['x_grid', 'y_grid', 'n_value'])
df_2 = df_1.pivot_table(values='n_value', index='y_grid', columns='x_grid', aggfunc='mean') \
           .reindex(index=pd.RangeIndex(df_1['y_grid'].min(), df_1['y_grid'].max()+1),
                    columns=pd.RangeIndex(df_1['x_grid'].min(), df_1['x_grid'].max()+1))
sns.heatmap(df_2, vmin=0, vmax=df_1['n_value'].max())
plt.show()
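For comparison, here is a minimal sketch of the same reshaping applied to the five sample rows from the question, without the reindex step. It assumes, as the question describes, that rows should be y_grid values and columns x_grid values:

```python
import pandas as pd

# The five sample rows shown in the question
df_1 = pd.DataFrame({
    "x_grid": [204.0, 204.0, 204.0, 204.0, 204.0],
    "y_grid": [32.0, 33.0, 34.0, 35.0, 36.0],
    "n_value": [45, 32, 94, 92, 84],
})

# One row per y_grid value, one column per x_grid value
df_2 = df_1.pivot(index="y_grid", columns="x_grid", values="n_value")
print(df_2)
```

Each cell can then be looked up by its coordinates, for example `df_2.loc[34.0, 204.0]`, which makes it easy to check that the values landed where expected.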
I have a dataset that consists of around 33 variables. The dataset contains patient information and the outcome of interest is binary in nature. Below is a snippet of the data.
The dataset is stored as a pandas dataframe
df.head()
ID Age GAD PHQ Outcome
1 23 17 23 1
2 54 19 21 1
3 61 23 19 0
4 63 16 13 1
5 37 14 8 0
I want to run independent t-tests looking at the differences in patient information based on outcome. So, if I were to run a t-test for each alone, I would do:
age_neg_outcome = df.loc[df.outcome ==0, ['Age']]
age_pos_outcome = df.loc[df.outcome ==1, ['Age']]
t_age, p_age = stats.ttest_ind(age_neg_outcome ,age_pos_outcome, unequal = True)
print('\t Age: t= ', t_age, 'with p-value= ', p_age)
How can I do this in a for loop for each of the variables?
I've seen this post which is slightly similar but couldn't manage to use it.
Python : T test ind looping over columns of df
You are almost there. ttest_ind accepts multi-dimensional arrays too:
cols = ['Age', 'GAD', 'PHQ']
cond = df['outcome'] == 0
neg_outcome = df.loc[cond, cols]
pos_outcome = df.loc[~cond, cols]
# ttest_ind has no `unequal` parameter, so it is left out here;
# pass equal_var=False if you want Welch's t-test (unequal variances)
t, p = stats.ttest_ind(neg_outcome, pos_outcome)
for i, col in enumerate(cols):
    print(f'\t{col}: t = {t[i]:.5f}, with p-value = {p[i]:.5f}')
Output:
Age: t = 0.12950, with p-value = 0.90515
GAD: t = 0.32937, with p-value = 0.76353
PHQ: t = -0.96683, with p-value = 0.40495
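A self-contained sketch of the same approach, assuming randomly generated stand-in data (the real dataset is not available), the column name `Outcome` as printed by df.head(), and Welch's variant via `equal_var=False`; the `results` table is an addition for this illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in data: 40 patients, alternating outcome 0/1
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(20, 70, 40),
    "GAD": rng.integers(0, 22, 40),
    "PHQ": rng.integers(0, 28, 40),
    "Outcome": np.tile([0, 1], 20),
})

cols = ["Age", "GAD", "PHQ"]
neg = df.loc[df["Outcome"] == 0, cols].to_numpy()
pos = df.loc[df["Outcome"] == 1, cols].to_numpy()

# One call tests every column (axis=0 is the default); equal_var=False is Welch's t-test
t, p = stats.ttest_ind(neg, pos, equal_var=False)

# Collect the per-variable results in one table
results = pd.DataFrame({"t": t, "p": p}, index=cols)
print(results)
```

Collecting the statistics in a DataFrame makes it easy to sort or filter by p-value once there are 33 variables rather than three.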
I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
'apple_sales': [100],
'apple_monthly_avg':[80],
'apple_st_dev': [12],
'pears_monthly_avg': [33],
'pears_yearly_avg': [35],
'pears_sales': [40],
'pears_st_dev':[8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit price, say 'pears', and subtract each average sales from current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear ['some_op'] = df_pear['Value'] - df_pear['temp']
The above works, by creating a temporary column holding pear_sales of 40, backfill it and then use it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I do get the common warning saying I should use '.loc[row_indexer, col_indexer]', even though the output still works.
For the second set of operations, I need to add 5 rows (the value of 'new_purchases') to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again, I'm worried about readability since there's another temporary column created, and then the indexing is rather ugly?
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, the result of splitting of Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
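To illustrate the split on its own, here is a minimal sketch on a hypothetical three-row frame (the rows are a subset of the question's data, chosen for this example):

```python
import pandas as pd

# Hypothetical three-row subset of the question's data
df = pd.DataFrame({
    "Description": ["apple_yearly_avg", "apple_sales", "pears_st_dev"],
    "Value": [57, 100, 8],
})

# n=1 splits only at the first underscore, so "yearly_avg" and "st_dev" stay whole
df[["Fruit", "Descr"]] = df["Description"].str.split("_", n=1, expand=True)
print(df)
```

Without `n=1`, "apple_yearly_avg" would be split into three parts and the assignment to two columns would fail.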
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop the Fruit column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
    .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
Edit
I'm not sure whether Description should also be copied from the "st_dev" row to the new rows. If you want some other content there, set it in the reformat function, after wrk2 is created.
I have grouped the dataset by month and day, and I have added a third column that counts the rows for each day.
Dataframe before
month day
0 1 1
1 1 1
2 1 1
..
3000 12 31
3001 12 31
3002 12 31
Dataframe now:
month day count
0 1 1 300
1 1 2 500
2 1 3 350
..
363 12 28 700
364 12 29 1300
365 12 30 1000
How can I draw a subplot for each month, where x is the day and y is the count?
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df= pd.read_csv('/home/rand/Downloads/Flights.csv')
by_month= df.groupby(['month','day']).day.agg('count').to_frame('count').reset_index()
I'm a beginner in the data science field.
Try this
fig, ax = plt.subplots()
ax.set_xticks(df['day'].unique())
df.groupby(["day", "month"]).mean()['count'].unstack().plot(ax=ax)
The code above will give you 12 lines, one per month, in a single plot. If you want 12 individual subplots for those months, try this:
fig = plt.figure()
for i in range(1, 13):
    df_monthly = df[df['month'] == i]  # select dataframe with month = i
    ax = fig.add_subplot(12, 1, i)  # add subplot in the i-th position on a grid 12x1
    ax.plot(df_monthly['day'], df_monthly['count'])
    ax.set_xticks(df_monthly['day'].unique())  # set x axis
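A variation on the same loop, sketched under assumptions: the data below is a randomly generated stand-in for the aggregated frame (28 days per month for simplicity), and the 3x4 grid, shared y axis, and headless `Agg` backend are choices made for this illustration rather than part of the answer above:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the aggregated frame with columns month, day, count
rng = np.random.default_rng(0)
by_month = pd.DataFrame(
    [(m, d, int(rng.integers(100, 1500))) for m in range(1, 13) for d in range(1, 29)],
    columns=["month", "day", "count"],
)

# A 3x4 grid keeps each panel readable; sharey makes the months comparable
fig, axes = plt.subplots(3, 4, figsize=(16, 9), sharey=True)
for ax, (month, grp) in zip(axes.flat, by_month.groupby("month")):
    ax.plot(grp["day"], grp["count"])
    ax.set_title(f"Month {month}")
    ax.set_xlabel("day")
fig.tight_layout()
```

Iterating over `groupby("month")` avoids filtering the frame twelve times, and `plt.subplots` returns all axes at once instead of adding them one by one.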
I think you could use pandas.DataFrame.pivot to change the shape of your table to make it more convenient for the plot. So in your code you could do something like this:
plot_data= df.pivot(index='day', columns='month', values='count')
plot_data.plot()
plt.show()
This assumes you have an equal number of days in every month, since in the sample you included, month 12 only has 30 days. More on pivot.
Try this:
df = pd.DataFrame({
'month': list(range(1, 13))*3,
'days': np.random.randint(1,11, 12*3),
'count': np.random.randint(10,20, 12*3)})
df.set_index(['month', 'days'], inplace=True)
df = df.sort_index()  # sort_index returns a new frame, so assign the result
df = df.groupby(level=[0, 1]).sum()
Code to plot it:
df.reset_index(inplace=True)
df.pivot(index='days', columns='month', values='count').fillna(0).plot()
df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
df
A mask values
0 11 0 10
1 11 0 15
2 22 0 20
3 22 1 25
Now, how can I group by A, keep the column names intact, and yet put the result of a custom function into Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['B'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean
df['Z'] = df.groupby('A').agg(calculate_df_stats) # does not work
and generate:
A mask values Z
0 11 0 10 12.5
1 22 0 20 25
Whatever I do it only replaces values column with the masked mean.
Also, can your solution be applied to a function of two columns that returns its result in a new column?
Thanks!
Edit:
To clarify further: let's say I have a table like this in MySQL:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me such result:
http://pastebin.com/qXiaWcJq
If I run now this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID plate_ID coordinate_x coordinate_y res_value wavelength ignore avg(action_value)
193 1 0 0 2.1783 560 NULL 2.090027083333334
481 2 0 0 1.7544 560 NULL 1.4695583333333333
769 3 0 0 2.0161 560 NULL 1.6637885416666673
How can I replicate this behaviour in Pandas? Note that all the column names stay the same, the first value is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe. You will have to aggregate your original columns in some way; I took the first occurring value as an example:
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
mask values
A
11 0 10
22 0 20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
...     mask_ = list(dfs['mask'])
...     mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
...     return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
mask values Z
A
11 0 10 12.5
22 0 20 20.0
In your function definition you can always use more columns (just by their name) to return the result.
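Put together as a script, the approach might look like the sketch below. The function name `masked_mean`, the explicit column selection before `apply`, and the final `reset_index` (to get A back as an ordinary column) are choices made for this illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [11, 11, 22, 22],
    "mask": [0, 0, 0, 1],
    "values": np.arange(10, 30, 5),
})

def masked_mean(group):
    # Uses two columns of the group: "values" for the data, "mask" to exclude entries
    return np.ma.array(group["values"].to_numpy(), mask=group["mask"].to_numpy()).mean()

grouped = df.groupby("A")
result = grouped.first()                                  # first value of each original column
result["Z"] = grouped[["mask", "values"]].apply(masked_mean)  # new column from the custom function
result = result.reset_index()                             # turn A back into a regular column
print(result)
```

Because `result` and the output of `apply` are both indexed by A, the new column aligns with the right group automatically.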