Plotting a trend graph in Python - python

I have the following data in a DataFrame:
+----------------------+--------------+-------------------+
| Physician Profile Id | Program Year | Value Of Interest |
+----------------------+--------------+-------------------+
| 1004777 | 2013 | 83434288.00 |
| 1004777 | 2014 | 89237990.00 |
| 1004777 | 2015 | 96321258.00 |
| 1004777 | 2016 | 186993309.00 |
| 1004777 | 2017 | 205274459.00 |
| 1315076 | 2013 | 127454475.84 |
| 1315076 | 2014 | 156388338.20 |
| 1315076 | 2015 | 199733425.11 |
| 1315076 | 2016 | 242766959.37 |
+----------------------+--------------+-------------------+
I want to plot a trend graph with the Program year on the x-axis and Value of Interest on the y-axis and different lines for each Physician Profile ID. What is the best way to get this done?

Two routes I'd consider:
1. Basic, fast, easy: matplotlib, which would look something like this:
   - install it, like pip install matplotlib
   - use it, like import matplotlib.pyplot as plt, alongside a matplotlib cheatsheet
2. Graphically compelling, and you can drop your pandas DataFrame right into it: Bokeh
I hope that helps you get started!

I tried a few things and was able to implement it:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

years = df["Program_Year"].unique()
PhysicianIds = sorted(df["Physician_Profile_ID"].unique())
pd.options.mode.chained_assignment = None  # silence the chained-assignment warning on df_filter
for ID in PhysicianIds:
    df_filter = df[df["Physician_Profile_ID"] == ID]
    # add a zero-valued row for every year this physician has no data for
    for year in years:
        found = False
        for index, row in df_filter.iterrows():
            if row["Program_Year"] == year:
                found = True
                break
        if not found:
            df_filter.loc[index+1] = [ID, year, 0]
    VoI = list(df_filter["Value_of_Interest"])
    sns.lineplot(x=years, y=VoI, label=ID, linestyle='-')
plt.ylabel("Value of Interest (in 100,000,000)")
plt.xlabel("Year")
plt.title("Top 10 Physicians")
plt.legend(title="Physician Profile ID")
plt.show()
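A shorter route to the same kind of chart, sketched under the assumption that the columns are named Physician_Profile_ID, Program_Year and Value_of_Interest as in the code above: pivot the frame so that each physician becomes a column, then let pandas draw one line per column. Years without data stay NaN and show up as gaps rather than zeros. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook or desktop session
import matplotlib.pyplot as plt
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "Physician_Profile_ID": [1004777] * 5 + [1315076] * 4,
    "Program_Year": [2013, 2014, 2015, 2016, 2017, 2013, 2014, 2015, 2016],
    "Value_of_Interest": [
        83434288.00, 89237990.00, 96321258.00, 186993309.00, 205274459.00,
        127454475.84, 156388338.20, 199733425.11, 242766959.37,
    ],
})

# One row per year, one column per physician; missing years become NaN.
wide = df.pivot(index="Program_Year", columns="Physician_Profile_ID",
                values="Value_of_Interest")

ax = wide.plot(marker="o")  # pandas draws one line per column
ax.set_xticks(wide.index)   # show whole years on the x-axis
ax.set_xlabel("Year")
ax.set_ylabel("Value of Interest")
ax.legend(title="Physician Profile ID")
# plt.show()  # uncomment in an interactive session
```

Because the pivot keeps the years as the index, every line shares the same x-axis and no manual padding of missing years is needed.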

Related

How to regroup string data into sub-families and make graphs from them

I have a CSV with several columns, including string data. Here are the first rows out of about 2000:
| | Title | FormeJuridique | Siren | TVA | NAFAPE | TypeAct | DateCrea | DateClo | DureeExe | Adresse | Coordonee |
| 0 | AGILIS COMPTABILITE | Société à responsabilité limitée (sans autre indication) | 902252782 | FR33902252782 | 6920Z | activités comptables | 01-09-2021 | 30-04-2022 | 241 days, 0:00:00 | Mâcon | (46.3036683, 4.8322266) |
| 1 | ALD VOLAILLES | SAS, société par actions simplifiée | 877535864 | FR56877535864 | 4639B | commerce de gros | 20-09-2019 | 19-04-2022 | 942 days, 0:00:00 | Montceau-les-Mines | (46.6740455, 4.3631681) |
First, I would like to group rows into sub-families. For example, with the NAFAPE variable, group all the rows whose code starts with 45 into a "Restaurant" family. Is that possible? Another example: group the address variable by city, with one group per city.
The other point is making graphs from string data, whether histograms or pie charts; I have trouble making them. Here is one of my attempts:
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
data_pie = brg.groupby("FormeJuridique").count()['NAFAPE']
explode = (0,0,0,0.05,0,0,0,0,0.05)
plt.pie(x=data_pie, autopct="%.1f%%", explode=explode,
pctdistance=1.1, labels = data_pie.keys())
plt.title("FormeJuridique", fontsize=14);
plt.legend(formju,
loc="center left",
bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(formju, size=8, weight="bold")
It's not really readable: I have problems with the legend, the labels around the pie, and so on.
For graphs like histograms it's a disaster. I think grouping values into sub-families could make it easier, because the NAFAPE variable, for example, contains almost only unique values, which makes the graph unreadable.
Thanks for your help!
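One possible way to build such sub-families, sketched with pandas only. The prefix-to-family mapping and the lump_small helper are hypothetical names introduced for illustration, and the two rows are the sample rows shown in the question. The idea is to derive a family from the first two characters of the NAFAPE code, count rows per family or per city, and merge rare categories into a single "Other" slice before plotting.

```python
import pandas as pd

# Two sample rows from the question; the real file has about 2000.
brg = pd.DataFrame({
    "Title": ["AGILIS COMPTABILITE", "ALD VOLAILLES"],
    "FormeJuridique": [
        "Société à responsabilité limitée (sans autre indication)",
        "SAS, société par actions simplifiée",
    ],
    "NAFAPE": ["6920Z", "4639B"],
    "Adresse": ["Mâcon", "Montceau-les-Mines"],
})

# Hypothetical prefix-to-family mapping; extend it with your own families.
family_by_prefix = {"69": "Accounting", "46": "Wholesale trade", "45": "Restaurant"}

# The first two characters of the code pick the family; unknown prefixes become "Other".
brg["Family"] = brg["NAFAPE"].str[:2].map(family_by_prefix).fillna("Other")

# Row counts per family and per city, ready for a bar or pie chart.
family_counts = brg["Family"].value_counts()
city_counts = brg.groupby("Adresse").size()


def lump_small(counts, min_share=0.05, other_label="Other"):
    """Merge categories below min_share of the total into one slice."""
    share = counts / counts.sum()
    small = share < min_share
    lumped = counts[~small].copy()
    if small.any():
        lumped[other_label] = lumped.get(other_label, 0) + counts[small].sum()
    return lumped


pie_data = lump_small(brg["FormeJuridique"].value_counts())
```

A series such as pie_data or family_counts has far fewer slices than the raw column, which tends to make the labels and legend of a pie or bar chart readable.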

How to boxplot different columns from a dataframe (y axis) vs groupby a range of hours (x axis) using plotly

Good morning,
I'm trying to boxplot the 'columns' from 1 to 6 vs the 'ElapsedTime(hours)' column with the use of plotly library.
Here is my dataframe :
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| Date | Time | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | ElapsedTime(hours) |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:48:37 | 0,011535 | 8,4021 | 0,00027 | 0,027806 | 8,431 | 0,000362 | 0 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:59:37 | 0,013458 | 8,4421 | 0,000314 | 0,032214 | 8,4738 | 0,000416 | 0,183333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 07:14:37 | 0,017793 | 8,4993 | 0,000384 | 0,038288 | 8,5372 | 0,000486 | 0,433333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:12:50 | 0,018808 | 8,545 | 0,000414 | 0,042341 | 8,5891 | 0,000539 | 24,9702778 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:42:50 | 0,025931 | 8,3627 | 0,000534 | 0,032379 | 8,3556 | 0,000557 | 25,9036111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:57:50 | 0,025164 | 8,5518 | 0,000505 | 0,041134 | 8,6516 | 0,000254 | 26,1536111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:45:28 | 0,026561 | 8,6266 | 0,000533 | 0,050387 | 8,6718 | 0,00065 | 46,9475 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:55:28 | 0,027744 | 8,6455 | 0,000543 | 0,051511 | 8,6916 | 0,000664 | 47,11416667 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 06:05:28 | 0,028854 | 8,485 | 0,000342 | 0,05693 | 8,6934 | 0,000695 | 47,28083333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
For now, I only know how to boxplot each column on its own, using these lines of code:
import warnings
import pandas as pd
import plotly.graph_objects as go
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category= UserWarning)
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
fig = go.Figure()
fig.add_trace(
    go.Box(y=da['Column1'], name='Column1'))
fig.add_trace(
    go.Box(y=da['Column2'], name='Column2'))
fig.add_trace(
    go.Box(y=da['Column3'], name='Column3'))
fig.add_trace(
    go.Box(y=da['Column4'], name='Column4'))
fig.add_trace(
    go.Box(y=da['Column5'], name='Column5'))
fig.add_trace(
    go.Box(y=da['Column6'], name='Column6'))
fig.update_layout(legend=dict(
    yanchor="top",
    y=1.24,
    xanchor="left",
    x=0.15
))
from plotly import offline
offline.plot(fig)
Output:
I can choose to show one column:
What I want (if possible): plot my columns 1 to 6 against ranges of ElapsedTime(hours). For example, if I choose a range of 10 hours, each box should cover that range and contain all the values that fall inside it.
PS: if I add x=da['ElapsedTime(hours)'] inside go.Box(), each value of columns 1 to 6 is plotted against a single value of the ElapsedTime column, which I don't want. I want one box per range of ElapsedTime.
Extra: if possible, I want columns 1 to 6 in a dropdown button so that I can click and choose which column to see over the ElapsedTime range I chose.
Thank you for your time, and have a great day!
EDIT:
I tried these lines. The problem is that I get an error saying the DataFrame doesn't have a name attribute (name=data.name), and if I remove name=data.name, I get a plot that is not a box plot. Do you have any idea how to overcome this problem?
da["DateTime"] = pd.to_datetime(da.Date + " " + da.Time)
columns = [c for c in da.columns if c.startswith("Column")]
da.set_index("DateTime")[columns].resample("1D")
fig = go.Figure()
for start_datetime, data in da.set_index("DateTime")[columns].resample("1D"):
    fig.add_trace(
        go.Box(x=data.index, y=data.values, name=data.name))
fig.update_layout(legend=dict(
    yanchor="top",
    y=1.24,
    xanchor="left",
    x=0.15
))
fig.update_layout(boxmode='group')
from plotly import offline
offline.plot(fig)
Here are some suggestions.
Merge the Date and Time columns into a DateTime column:
import pandas as pd
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
df = da # df is more natural for me ;)
df["DateTime"] = pd.to_datetime(df.Date + " " + df.Time, format="%d/%m/%Y %H:%M:%S")
I use the following to mark the interesting columns:
columns = [c for c in df.columns if c.startswith("Column")]
Use the resample method to aggregate data over some time range.
For example, to aggregate over one day, use
df.set_index("DateTime")[columns].resample("1D")
The result is a resampler object that you can run aggregations on, e.g. compute the mean for each sample:
df.set_index("DateTime")[columns].resample("1D").mean()
If you want to leverage plotly's functionality to create the boxplot, I would use a loop though:
for start_datetime, data in df.set_index("DateTime")[columns].resample("1D"):
    print(start_datetime)
    print(data)
    print()
Instead of the print functions, use the plotly commands to create a box in the boxplot.

Using pandas apply to pass in both a row and the entire dataframe with it [duplicate]

This question already has an answer here:
Pandas - Finding percent contributed by each group
(1 answer)
Closed 2 years ago.
I have a df and I want to create some new cols with it. How would I use the apply function to both pass in the row, and the entire df with it? I need the entire df to do some filtering, and the data is subject to the values in each row.
Or maybe I don't need to use apply, but that's the first thing that came to my mind. Thank you and all help is appreciated!
Ex of df:
+----+--------+--------+
| ID | Family | Amount |
+----+--------+--------+
| 1 | A | 2 |
| 2 | A | 10 |
| 3 | B | 4 |
| 4 | B | 7 |
+----+--------+--------+
Result:
+----+--------+--------+-----------+------------+
| ID | Family | Amount | Total_Fam | Id_Percent |
+----+--------+--------+-----------+------------+
| 1 | A | 2 | 12 | .166 |
| 2 | A | 10 | 12 | .833 |
| 3 | B | 4 | 11 | .363 |
| 4 | B | 7 | 11 | .636 |
+----+--------+--------+-----------+------------+
First, group by Family and use transform to broadcast each family's total Amount back onto its rows. Then divide Amount by the new column.
df['Total_Fam'] = df.groupby('Family')['Amount'].transform('sum')
df['Id_Percent'] = df['Amount']/df['Total_Fam']
df
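For reference, here is that approach run end to end on the question's sample table, as a small self-contained sketch. Passing the string "sum" to transform avoids the need to import NumPy.

```python
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Family": ["A", "A", "B", "B"],
    "Amount": [2, 10, 4, 7],
})

# transform returns one value per original row, so the result aligns with df.
df["Total_Fam"] = df.groupby("Family")["Amount"].transform("sum")
df["Id_Percent"] = df["Amount"] / df["Total_Fam"]
print(df)
```

Within each family the Id_Percent values add up to 1, which is a quick sanity check on the grouping.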
Using apply on a column passes each value individually. Using apply on the whole DataFrame with axis=1 passes each row as a Series, so the function can read every column of that row. As you can see in the example below, df['new_2'] is built by a function that I apply to the DataFrame, and I do not need to pass the df to it.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('iris')
df['new'] = df['species'].apply(lambda x: x[:2])
def sumIsMore(dataframe):
    # called once per row; dataframe is that row as a Series
    x = dataframe['sepal_length']
    y = dataframe['sepal_width']
    return x+y >= 8.5
df['new_2'] = df.apply(sumIsMore, axis=1)
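If a row-wise function really does need the whole DataFrame as well as the row, one option is the args parameter of DataFrame.apply, which forwards extra positional arguments to the function. The helper name share_of_family is hypothetical, and this is only a sketch: it rescans the frame for every row, so the groupby/transform approach is usually faster.

```python
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Family": ["A", "A", "B", "B"],
    "Amount": [2, 10, 4, 7],
})


def share_of_family(row, full_df):
    """Hypothetical helper: this row's Amount divided by its family's total."""
    family_total = full_df.loc[full_df["Family"] == row["Family"], "Amount"].sum()
    return row["Amount"] / family_total


# axis=1 passes each row as a Series; args=(df,) passes the full frame alongside it.
df["Id_Percent"] = df.apply(share_of_family, axis=1, args=(df,))
```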

Creating area chart from csv file containing multiple values in one column

I have a model that produces an output in CSV. The columns are as follows (just a fictional example):
| Car | Price | Year |
The car column has different car manufacturers for example, with an average car price for each year in column 'Price'.
Example
| Car | Price | Year |
| BMW | 34000 | 1990 |
| BMW | 35000 | 1991 |
| BMW | 37000 | 1993 |
| AUDI | 32000 | 1991 |
| AUDI | 33500 | 1992 |
| AUDI | 34000 | 1993 |
| AUDI | 35500 | 1994 |
| SEAT | 25600 | 1994 |
...
I would like to be able to plot:
An area chart with all the prices for each car manufacturer in the years that the prices are available, within a 20 year period (for example 1990-2010).
In some years there is no price available for some of the car manufacturers, so not every car manufacturer has 20 rows of data in the CSV; the output just skips the whole year and row. See BMW in the example, which lacks 1992.
Since I run the model with different inputs, the actual names of the "Cars" change (and so do the prices), so I need the code to pick up a certain car name and then plot the available values for each run.
This is just an example for simplification, but the layout of the actual data is the same. Would much appreciate some help on this one!
Try this; I think it might work. (I am a beginner, not a pro.)
import pandas as pd
import matplotlib.pyplot as plt
med_path = "path for csv file"
med = pd.read_csv(med_path)
fig, ax = plt.subplots(dpi=120)
# prices is not defined here; build it from med first. Replace a, b, c, d with the years.
area = pd.DataFrame(prices, columns=['a', 'b', 'c', 'd'])
area.plot(kind='area', ax=ax)
plt.title('Graph for Area plot')
plt.show()
Hardcoding all the values like this is probably not ideal, but you can use a for loop to iterate through the CSV file's contents.
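A possible way to avoid hardcoding, sketched under the assumption that the file is comma-separated with the columns Car, Price and Year (the inline csv_text stands in for the real file): group by Car so the car names are picked up from whatever the current run contains, and fill the area under each car's available years with matplotlib's fill_between. A skipped year is bridged by a straight line between its neighbours, and a car with a single row (SEAT here) produces no visible area.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the model's CSV output; replace with pd.read_csv(your_path).
csv_text = """Car,Price,Year
BMW,34000,1990
BMW,35000,1991
BMW,37000,1993
AUDI,32000,1991
AUDI,33500,1992
AUDI,34000,1993
AUDI,35500,1994
SEAT,25600,1994
"""
med = pd.read_csv(io.StringIO(csv_text))

fig, ax = plt.subplots(dpi=120)
areas = {}
for car, rows in med.groupby("Car"):
    rows = rows.sort_values("Year")
    # Fill between the x-axis and this car's prices, only over the years it has.
    areas[car] = ax.fill_between(rows["Year"], rows["Price"], alpha=0.4, label=car)

ax.set_xlim(1990, 2010)  # the 20-year window mentioned in the question
ax.set_xlabel("Year")
ax.set_ylabel("Price")
ax.legend()
# plt.show()  # uncomment in an interactive session
```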

Barplot comparing two columns

I would like to draw a barplot graph that would compare the evolution of 2 variables of revenues on a monthly time-axis (12 months of invoices).
I wanted to use sns.barplot, but can't use "hue" (cause the 2 variables aren't subcategories?). Is there another way, as simple as with hue? Can I "create" a hue?
Here is a small sample of my data:
(I did transform my table into a pivot table)
[In]
data_pivot['Revenue-Small-Seller-in'] = data_pivot["Small-Seller"] + data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot['Revenue-Not-Small-Seller-in'] = data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot
[Out]
InvoiceNo Month Year Revenue-Small-Seller-in Revenue-Not-Small-Seller-in
536365 12 2010 139.12 139.12
536366 12 2010 22.20 11.10
536367 12 2010 278.73 246.93
(sorry for the ugly presentation of my data, see the picture to see the complete table (as there are multiple columns))
You can do:
render_df = data_pivot[data_pivot.columns[-2:]]
fig, ax = plt.subplots(1,1)
render_df.plot(kind='bar', ax=ax)
ax.legend()
plt.show()
Output:
Or sns style, like you requested:
render_df = data_pivot[data_pivot.columns[-2:]].stack().reset_index()
sns.barplot(x='level_0', y=0, hue='level_1',
            data=render_df)
here render_df after stack() is:
+---+---------+-----------------------------+--------+
| | level_0 | level_1 | 0 |
+---+---------+-----------------------------+--------+
| 0 | 0 | Revenue-Small-Seller-in | 139.12 |
| 1 | 0 | Revenue-Not-Small-Seller-in | 139.12 |
| 2 | 1 | Revenue-Small-Seller-in | 22.20 |
| 3 | 1 | Revenue-Not-Small-Seller-in | 11.10 |
| 4 | 2 | Revenue-Small-Seller-in | 278.73 |
| 5 | 2 | Revenue-Not-Small-Seller-in | 246.93 |
+---+---------+-----------------------------+--------+
and output:
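To get a hue column with readable names on a monthly axis, one alternative to stack() is to sum per month and then melt the two revenue columns into long format. This is a sketch built on the three sample rows from the question; the names RevenueType and Revenue are chosen here for illustration.

```python
import pandas as pd

# Sample rows from the question's output.
data_pivot = pd.DataFrame({
    "InvoiceNo": [536365, 536366, 536367],
    "Month": [12, 12, 12],
    "Year": [2010, 2010, 2010],
    "Revenue-Small-Seller-in": [139.12, 22.20, 278.73],
    "Revenue-Not-Small-Seller-in": [139.12, 11.10, 246.93],
})
revenue_cols = ["Revenue-Small-Seller-in", "Revenue-Not-Small-Seller-in"]

# Sum invoices per month, then turn the two revenue columns into rows.
monthly = data_pivot.groupby(["Year", "Month"], as_index=False)[revenue_cols].sum()
long_df = monthly.melt(id_vars=["Year", "Month"], value_vars=revenue_cols,
                       var_name="RevenueType", value_name="Revenue")
print(long_df)
```

The long frame can then be handed to seaborn, for example sns.barplot(data=long_df, x="Month", y="Revenue", hue="RevenueType"), which gives one pair of bars per month.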
