I have grouped the dataset by month and day, and added a third column that counts the rows for each day.
Dataframe before:
month day
0 1 1
1 1 1
2 1 1
..
3000 12 31
3001 12 31
3002 12 31
Dataframe now:
month day count
0 1 1 300
1 1 2 500
2 1 3 350
..
363 12 28 700
364 12 29 1300
365 12 30 1000
How can I create a subplot for each month, where x is the day and y is the count?
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df= pd.read_csv('/home/rand/Downloads/Flights.csv')
by_month= df.groupby(['month','day']).day.agg('count').to_frame('count').reset_index()
I'm a beginner in the data science field.
Try this
fig, ax = plt.subplots()
ax.set_xticks(df['day'].unique())
df.groupby(["day", "month"]).mean()['count'].unstack().plot(ax=ax)
The code above will give you 12 lines, one per month, in a single plot. If you want 12 individual subplots for those months, try this:
fig = plt.figure()
for i in range(1, 13):
    df_monthly = df[df['month'] == i]  # select the rows with month == i
    ax = fig.add_subplot(12, 1, i)  # add a subplot in the i-th position on a 12x1 grid
    ax.plot(df_monthly['day'], df_monthly['count'])
    ax.set_xticks(df_monthly['day'].unique())  # set the x-axis ticks
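A 12x1 layout tends to get cramped. Here is a minimal sketch of the same idea on a 3x4 grid using plt.subplots, assuming the grouped frame is called by_month as in the question. The data is synthetic and only there to make the example runnable, and the Agg backend line is an assumption for running it as a script.

```python
import calendar

import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the grouped frame from the question (counts are made up).
rng = np.random.default_rng(0)
days = pd.date_range("2021-01-01", "2021-12-31", freq="D")
by_month = pd.DataFrame({
    "month": days.month,
    "day": days.day,
    "count": rng.integers(300, 1300, size=len(days)),
})

# One axes per month on a 3x4 grid; shared axes keep the months comparable.
fig, axes = plt.subplots(3, 4, figsize=(16, 9), sharex=True, sharey=True)
for (month, group), ax in zip(by_month.groupby("month"), axes.flat):
    ax.plot(group["day"], group["count"])
    ax.set_title(calendar.month_abbr[int(month)])

fig.supxlabel("day")
fig.supylabel("count")
fig.tight_layout()
```

Iterating over the groupby object avoids filtering the frame twelve times, and months with fewer days simply produce shorter lines.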
I think you could use pandas.DataFrame.pivot to change the shape of your table to make it more convenient for the plot. So in your code you could do something like this:
plot_data= df.pivot(index='day', columns='month', values='count')
plot_data.plot()
plt.show()
This assumes every month has the same number of days, since in the sample you included month 12 only has 30 days. More on pivot.
Try this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'month': list(range(1, 13))*3,
    'days': np.random.randint(1,11, 12*3),
    'count': np.random.randint(10,20, 12*3)})
df.set_index(['month', 'days'], inplace=True)
df = df.sort_index()  # sort_index returns a new frame
df = df.groupby(level=[0, 1]).sum()
Code to plot it:
df.reset_index(inplace=True)
df.pivot(index='days', columns='month', values='count').fillna(0).plot()
I have a dataframe (df_1) which contains coordinates and value data with no order that looks like this:
   x_grid  y_grid  n_value
0   204.0    32.0       45
1   204.0    33.0       32
2   204.0    34.0       94
3   204.0    35.0       92
4   204.0    36.0       84
I wanted to reshape it into another dataframe (df_2) so that I can create a heatmap. So I created an empty dataframe whose column labels are the x_grid values and whose row labels are the y_grid values.
Then, in a for loop, I tried the following: if the row index equals the x_grid value, set the column whose label is the y_grid value to n_value.
Here is my code:
for i, row in enumerate(df_2.iterrows()):
    row_ind = index_list[i]
    for j, item in enumerate(df_1.iterrows()):
        x_ind = item[1].x_grid
        if row_ind == x_ind:
            col_ind = item[1].y_grid
            row[1].col_ind = item[1].n_value
When I run this loop, I see new values filling the dataframe, but they do not seem right. The coordinates and values in the second dataframe do not match the first one.
Second dataframe (df_2) partially looks something like this:
      0  25  26  27
0     0   0  27   0
195   0   0  32  36
196   0  65   0   0
197   0   0   0  24
198   0  73  58   0
Is there a better way to do this? I would also appreciate any other methods to turn the initial dataframe into a heatmap.
IIUC:
df_2 = df_1.pivot(index='x_grid', columns='y_grid', values='n_value') \
           .reindex(index=pd.RangeIndex(0, df_1['x_grid'].max()+1),
                    columns=pd.RangeIndex(0, df_1['y_grid'].max()+1),
                    fill_value=0)
If you have duplicated values for the same (x, y), use pivot_table:
df_2 = df_1.pivot_table(values='n_value', index='x_grid', columns='y_grid', aggfunc='mean') \
           .reindex(index=pd.RangeIndex(df_1['x_grid'].min(), df_1['x_grid'].max()+1),
                    columns=pd.RangeIndex(df_1['y_grid'].min(), df_1['y_grid'].max()+1))
Example:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(2022)
df_1 = pd.DataFrame(np.random.randint(0, 20, (1000, 3)),
                    columns=['x_grid', 'y_grid', 'n_value'])
df_2 = df_1.pivot_table(values='n_value', index='x_grid', columns='y_grid', aggfunc='mean') \
           .reindex(index=pd.RangeIndex(df_1['x_grid'].min(), df_1['x_grid'].max()+1),
                    columns=pd.RangeIndex(df_1['y_grid'].min(), df_1['y_grid'].max()+1))
sns.heatmap(df_2, vmin=0, vmax=df_1['n_value'].max())
plt.show()
I have the following dataframe, df, with millions of rows:
id score
140 0.1223
142 0.01123
148 0.1932
166 0.0226
.. ..
My problem is:
How can I study the distribution of each percentile?
My idea was to divide score into percentiles and see what percentage corresponds to each one.
I would like to get something like
percentile countofindex percentage
1 154.000 %20
2 100.320 %17
3 250.000 %21
...
where countofindex is the number of distinct ids, and percentage is the share represented by the first, second, ... percentile.
For this, I tried df['percentage'] = df['score'] / df['score'].sum() * 100, but this is each score's percentage of the total over all the data.
To get the percentage of each score you can get the sum of all scores and divide each one by it:
df= pd.DataFrame({'score': [0.1223,0.01123,0.1932]})
df['percentage'] = df['score'] / df['score'].sum() * 100
score percentage
0 0.12230 37.431518
1 0.01123 3.437089
2 0.19320 59.131393
To sort you can use .sort_values:
df = df.sort_values(by=['percentage'], ascending=False)
df.insert(1, 'percentile', range(1,len(df)+1))
score percentile percentage
2 0.19320 1 59.131393
0 0.12230 2 37.431518
1 0.01123 3 3.437089
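If the goal is to bin the scores into percentile groups and summarise each group, pandas.qcut is one option. This is a minimal sketch on made-up data, assuming ten equal-frequency bins. The column names percentile, countofindex and percentage mirror the question, and reading "percentage" as each bin's share of the total score is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the real frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"id": np.arange(1000), "score": rng.random(1000)})

# Assign each row to one of 10 equal-frequency bins (deciles); use q=100 for percentiles.
df["percentile"] = pd.qcut(df["score"], q=10, labels=False) + 1

summary = (
    df.groupby("percentile")
      .agg(countofindex=("id", "nunique"), score_sum=("score", "sum"))
      .reset_index()
)
# Share of the total score that falls in each bin.
summary["percentage"] = summary["score_sum"] / summary["score_sum"].sum() * 100
print(summary)
```

Because qcut builds equal-frequency bins, countofindex is roughly the same in every row. If the bins should instead have equal width on the score axis, pandas.cut would be the function to look at.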
Let's go through the following example.
print(df)
0
0 0.127975
1 0.146976
2 0.721326
3 0.003722
df[0].sum()
1.0
Now, to create the chart:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
data = (df[0] * 100).round(2).tolist()  # values as percentages
langs = [str(v) for v in data]  # bar labels
ax.bar(langs,data)
plt.ylabel("Percentiles")
plt.xlabel("Values")
plt.xticks(rotation=45)
plt.show()
If you want to add it to the dataset, you can use the code below.
df['Percentiles (%)']=df.apply(lambda x: round(x*100,2))
print(df)
0 Percentiles (%)
0 0.127975 12.80
1 0.146976 14.70
2 0.721326 72.13
3 0.003722 0.37
How can I plot a bubble chart from a dataframe that has been created from a pandas crosstab of another dataframe?
Imports;
import plotly as py
import plotly.graph_objects as go
from plotly.subplots import make_subplots
The crosstab was created using;
df = pd.crosstab(raw_data['Speed'], raw_data['Height'].fillna('n/a'))
The df contains mostly zeros, however where a number appears I want a point where the value controls the point size. I want to set the Index values as the x axis and the columns name values as the Y axis.
The df would look something like;
10 20 30 40 50
1000 0 0 0 0 5
1100 0 0 0 7 0
1200 1 0 3 0 0
1300 0 0 0 0 0
1400 5 0 0 0 0
I’ve tried using scatter & Scatter like this;
fig.add_trace(go.Scatter(x=df.index.values, y=df.columns.values, size=df.values,
mode='lines'),
row=1, col=3)
This returned a TypeError: 'Module' object not callable.
Any help is really appreciated. Thanks.
UPDATE
The answers below are close to what I ended up with, main difference being that I reference 'Speed' in the melt line;
df = df.reset_index()
df = df.melt(id_vars="Speed")
df = df.rename(columns={"index":"Engine Speed",
                        "variable":"Height",
                        "value":"Count"})
df = df[df!=0].dropna()
scale=1000
fig.add_trace(go.Scatter(x=df["Speed"], y=df["Height"],mode='markers',marker_size=df["Count"]/scale),
row=1, col=3)
This works however my main problem now is that the dataset is huge and plotly is really struggling to deal with it.
Update 2
Using Scattergl allows Plotly to deal with the large dataset very well!
If this is the case, you can use plotly.express. This is very similar to @Erik's answer but shouldn't return errors.
import pandas as pd
import plotly.express as px
from io import StringIO
txt = """
10 20 30 40 50
1000 0 0 0 0 5
1100 0 0 0 7 0
1200 1 0 3 0 0
1300 0 0 0 0 0
1400 5 0 0 0 0
"""
df = pd.read_csv(StringIO(txt), sep=r"\s+")
df = df.reset_index()\
.melt(id_vars="index")\
.rename(columns={"index":"Speed",
"variable":"Height",
"value":"Count"})
fig = px.scatter(df, x="Speed", y="Height",size="Count")
fig.show()
UPDATE
In case you get an error, please check your pandas version with pd.__version__ and try running this line by line
df = pd.read_csv(StringIO(txt), sep=r"\s+")
df = df.reset_index()
df = df.melt(id_vars="index")
df = df.rename(columns={"index":"Speed",
"variable":"Height",
"value":"Count"})
and report in which line it breaks.
I recommend using a tidy format to represent your data. We say a dataframe is tidy if and only if:
Each row is an observation
Each column is a variable
Each value must have its own cell
To create a tidier dataframe you can do
df = pd.crosstab(raw_data["Speed"], raw_data["Height"])
df = df.reset_index()
df = df.melt(id_vars="Speed", var_name="Height", value_name="Counts")
df = df[df["Counts"] > 0].reset_index(drop=True)  # keep only the combinations that occur
Speed Height Counts
0 1000 10 2
1 1100 20 1
2 1200 10 1
3 1200 30 1
4 1300 40 1
5 1400 50 1
The next step is to do the actual plotting.
# when scale is increased bubbles will become larger
scale = 10
# create the scatter plot
scatter = go.Scatter(
    x=df["Speed"],
    y=df["Height"],
    marker_size=df["Counts"]*scale,
    mode='markers')
fig = go.Figure(scatter)
fig.show()
This will create a plot as shown below.
I have the following function:
def plot_distribution(df, var, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row = row, col = col)
    facet.map(sns.kdeplot, var, shade=True)
    facet.set(xlim=(0, df[var].max()))
    facet.add_legend()

plot_distribution(asma_df, var = 'ADDITIONAL_ASMA_40', target = 'RUNWAY', row = 'RUNWAY')
This function creates the following chart:
I want to change it so that the X axis shows the months, while the Y axis shows the average value of ADDITIONAL_ASMA_40 for each month.
This is a sample DataFrame df:
month ADDITIONAL_ASMA_40 RUNWAY
1 20 32L
1 22 32L
1 18 32R
2 25 32L
2 26 32L
2 25 32L
2 25 32R
Simply use the groupby function
df.groupby('month').mean(numeric_only=True)
and plot its columns of interest
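A minimal sketch of the groupby-then-plot idea, using the sample frame from the question. Splitting the result per runway with unstack is an assumption about what the chart should show, and the Agg backend line is only there so the snippet runs as a script.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# The sample frame from the question.
df = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2, 2],
    "ADDITIONAL_ASMA_40": [20, 22, 18, 25, 26, 25, 25],
    "RUNWAY": ["32L", "32L", "32R", "32L", "32L", "32L", "32R"],
})

# Mean per month and runway, reshaped so that each runway becomes one column.
monthly = (
    df.groupby(["month", "RUNWAY"])["ADDITIONAL_ASMA_40"]
      .mean()
      .unstack("RUNWAY")
)

# One line per runway: months on the x axis, monthly mean on the y axis.
ax = monthly.plot(marker="o")
ax.set_xlabel("month")
ax.set_ylabel("mean ADDITIONAL_ASMA_40")
```

Dropping "RUNWAY" from the groupby keys and skipping unstack gives a single line with the overall monthly mean.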
I have a Pandas dataframe with the columns ['week', 'price_per_unit', 'total_units']. I wish to create a new column called 'weighted_price' as follows: first group by 'week' and then for each week calculate price_per_unit * total_units / sum(total_units) for that week. I have code that does this:
import pandas as pd
import numpy as np
def create_features_by_group(df):
    # first group data
    grouped = df.groupby(['week'])
    df_temp = pd.DataFrame(columns=['weighted_price'])
    # run through the groups and create the weighted_price per group
    for name, group in grouped:
        res = (group['total_units'] * group['price_per_unit']) / np.sum(group['total_units'])
        for idx in res.index:
            df_temp.loc[idx] = [res[idx]]
    # join returns a new frame, so return its result
    return df.join(df_temp['weighted_price'])
The only problem is that this is very, very slow. Is there some faster way to do this?
I used the following code to test the function.
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['week', 'price_per_unit', 'total_units'])
for i in range(10):
    df.loc[i] = [round(int(i % 3), 0) , 10 * np.random.rand(), round(10 * np.random.rand(), 0)]
I think you need to do it this way:
df
price total_units week
0 5 100 1
1 7 200 1
2 9 150 2
3 11 250 2
4 13 125 2
def fun(table):
    table['measure'] = table['price'] * (table['total_units'] / table['total_units'].sum())
    return table
df.groupby('week').apply(fun)
price total_units week measure
0 5 100 1 1.666667
1 7 200 1 4.666667
2 9 150 2 2.571429
3 11 250 2 5.238095
4 13 125 2 3.095238
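The same per-row result can also be computed without a Python-level function by broadcasting each week's total with groupby(...).transform. A minimal sketch, reusing the numbers from the example above but with the column names from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "week": [1, 1, 2, 2, 2],
    "price_per_unit": [5, 7, 9, 11, 13],
    "total_units": [100, 200, 150, 250, 125],
})

# transform("sum") returns a Series aligned with df that holds each row's weekly total.
weekly_units = df.groupby("week")["total_units"].transform("sum")

# Fully vectorised: price * units / weekly units, row by row.
df["weighted_price"] = df["price_per_unit"] * df["total_units"] / weekly_units
print(df)
```

Since everything stays in vectorised pandas operations, this tends to scale much better than filling a frame row by row with .loc.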
I have grouped the dataset by 'Week' to calculate the weighted price for each week.
Then I joined the original dataset with the grouped dataset to get the result:
# importing the libraries
import pandas as pd
import numpy as np
# creating the dataset
df = {
    'Week' : [1,1,1,1,2,2],
    'price_per_unit' : [10,11,22,12,12,45],
    'total_units' : [10,10,10,10,10,10]
}
df = pd.DataFrame(df)
df['price'] = df['price_per_unit'] * df['total_units']
# calculate the total sales and total number of units sold in each week
df_grouped_week = df.groupby(by = 'Week').agg({'price' : 'sum', 'total_units' : 'sum'}).reset_index()
# calculate the weighted price
df_grouped_week['wt_price'] = df_grouped_week['price'] / df_grouped_week['total_units']
# merging df and df_grouped_week
df_final = pd.merge(df, df_grouped_week[['Week', 'wt_price']], how = 'left', on = 'Week')