I have a csv with several columns including string data, here are the first rows out of about 2000
| | Title | FormeJuridique | Siren | TVA | NAFAPE | TypeAct | DateCrea | DateClo | DureeExe | Adresse | Coordonee |
| 0 | AGILIS COMPTABILITE | Société à responsabilité limitée (sans autre indication) | 902252782 | FR33902252782 | 6920Z | activités comptables | 01-09-2021 | 30-04-2022 | 241 days, 0:00:00 | Mâcon | (46.3036683, 4.8322266) |
| 1 | ALD VOLAILLES | SAS, société par actions simplifiée | 877535864 | FR56877535864 | 4639B | commerce de gros | 20-09-2019 | 19-04-2022 | 942 days, 0:00:00 | Montceau-les-Mines | (46.6740455, 4.3631681) |
first I would like to group data together, as a sub-family, such as the NAFAPE variable, group all the lines that start with 45--- which will correspond to a "Restaurant" family. it's possible ? Or another example group the address variable by city. make one group per city.
another point is to make graphs with string data, whether histograms or pie, I have trouble making them. I put you an example of one of my tries.
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
data_pie = brg.groupby("FormeJuridique").count()['NAFAPE']
explode = (0,0,0,0.05,0,0,0,0,0.05)
plt.pie(x=data_pie, autopct="%.1f%%", explode=explode,
pctdistance=1.1, labels = data_pie.keys())
plt.title("FormeJuridique", fontsize=14);
plt.legend(formju,
loc="center left",
bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(formju, size=8, weight="bold")
it's not really readable I have problems with the legend, the name of the variables around the pie, ..
For graphs like histograms it's a disaster and I think that grouping variables into sub-family could make it easier, because for the NAFAPE variable for example there are almost only single variables and this makes the graph unreadable
Thanks for your help !
Good morning,
I'm trying to boxplot the 'columns' from 1 to 6 vs the 'ElapsedTime(hours)' column with the use of plotly library.
Here is my dataframe :
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| Date | Time | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | ElapsedTime(hours) |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:48:37 | 0,011535 | 8,4021 | 0,00027 | 0,027806 | 8,431 | 0,000362 | 0 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:59:37 | 0,013458 | 8,4421 | 0,000314 | 0,032214 | 8,4738 | 0,000416 | 0,183333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 07:14:37 | 0,017793 | 8,4993 | 0,000384 | 0,038288 | 8,5372 | 0,000486 | 0,433333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:12:50 | 0,018808 | 8,545 | 0,000414 | 0,042341 | 8,5891 | 0,000539 | 24,9702778 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:42:50 | 0,025931 | 8,3627 | 0,000534 | 0,032379 | 8,3556 | 0,000557 | 25,9036111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:57:50 | 0,025164 | 8,5518 | 0,000505 | 0,041134 | 8,6516 | 0,000254 | 26,1536111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:45:28 | 0,026561 | 8,6266 | 0,000533 | 0,050387 | 8,6718 | 0,00065 | 46,9475 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:55:28 | 0,027744 | 8,6455 | 0,000543 | 0,051511 | 8,6916 | 0,000664 | 47,11416667 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 06:05:28 | 0,028854 | 8,485 | 0,000342 | 0,05693 | 8,6934 | 0,000695 | 47,28083333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
for now, i just know how to boxplot each column vs nothing using these lines of code :
import warnings
import pandas as pd
import plotly.graph_objects as go
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category= UserWarning)
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
fig = go.Figure()
fig.add_trace(
go.Box(y=da['Column1'], name='Column1'))
fig.add_trace(
go.Box(y=da['Column2'], name='Column2'))
fig.add_trace(
go.Box(y=da['Column3'], name='Column3'))
fig.add_trace(
go.Box(y=da['Column4'], name='Column4'))
fig.add_trace(
go.Box(y=da['Column5'], name='Column5'))
fig.add_trace(
go.Box(y=da['Column6'], name='Column6'))
fig.update_layout(legend=dict(
yanchor="top",
y=1.24,
xanchor="left",
x=0.15
))
from plotly import offline
offline.plot(fig)
Output :
I can choose to show one column :
What i want (if possible) : Plot my columns from 1 to 6 vs a range of ElapsedTime(hours). For exemple i choose to have a range of 10 hours, so the boxplots will be taking in consideration that range and plot all the values of that range into one box.
PS : if i add x=da['ElapsedTime(hours)'] inside the go.Box(), i will be ploting each value of columns 1 to 6 versus one value from the ElapsedTime column and i don't want that, I want a box in a range of an ElapsedTime.
Extra : If possible, i want the columns from 1 to 6 to be in a dropdown button so that i can click and choose which column i wanna see in the range of the ElapsedTime i choosed.
Thank you for your time, and have a great day !
EDIT :#################################################
I tried these lines. The problem is that i have an error saying dataframe doesn't have a name argument (name=data.name) and if i get rid of that, let's say i don't use name=data.name, i will get a plot that is not Box. Do you have any idea on how to overcome this problem ?
da["DateTime"] = pd.to_datetime(da.Date + " " + da.Time)
columns = [c for c in da.columns if c.startswith("Column")]
da.set_index("DateTime")[columns].resample("1D")
fig = go.Figure()
for start_datetime, data in da.set_index("DateTime")[columns].resample("1D"):
fig.add_trace(
go.Box(x=data.index, y=data.values, name=data.name))
fig.update_layout(legend=dict(
yanchor="top",
y=1.24,
xanchor="left",
x=0.15
))
fig.update_layout(boxmode='group')
from plotly import offline
offline.plot(fig)
Here are some suggestions.
Merge the Date and Time columns into a DateTime column:
import pandas as pd
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
da["DateTime"] = pd.to_datetime(df.Date + " " + df.Time)
df = da # df is more natural for me ;)
I use the following to mark the interesting columns:
columns = [c for c in df.columns if c.startswith("Column")]
Use the aggregate method, to aggregate data over some time range.
For example to aggregate over one day, use
df.set_index("DateTime")[columns].resample("1D")
The result is an object, that you can either run some aggregations on, e.g. compute the mean for each such sample:
df.set_index("DateTime")[columns].resample("1D").mean()
If you want to leverage plotly's functionality to create the boxplot, I would use a loop though:
for start_datetime, data in df.set_index("DateTime")[columns].resample("1D"):
print(start_datetime)
print(data)
print()
Instead of the print functions, use the plotly commands to create a box in the boxplot.
I have a model that produces an output in csv. The columns are as follows (just an fictive example):
| Car | Price | Year |
The car column has different car manufacturers for example, with an average car price for each year in column 'Price'.
Example
| Car | Price | Year |
| BMW | 34000 | 1990 |
| BMW | 35000 | 1991 |
| BMW | 37000 | 1993 |
| AUDI | 32000 | 1991 |
| AUDI | 33500 | 1992 |
| AUDI | 34000 | 1993 |
| AUDI | 35500 | 1994 |
| SEAT | 25600 | 1994 |
...
I would like to be able to plot:
An area chart with all the prices for each car manufacturer in the years that the prices are available, within a 20 year period (for example 1990-2010).
Some years, there is no price available for some of the car manufacturers, and for that reason not all car manufacturer has 20 rows of data in the csv, the output just skips the whole year and row. See the BWM in the example, lacking 1992.
Since I run the model with different inputs, the actual names of the "Cars" change (and so do the prices), so I need the code to pick up a certain car name and then plot the available values for each run.
This is just an example for simplification, but the layout of the actual data is the same. Would much appreciate some help on this one!
Try this I think this might work. Also, I am not a pro just a beginner
import pandas as pd
import matplotlib.pyplot as plt
med_path = "path for csv file"
med = pd.read_csv(med_path)
fig, ax = plt.subplots(dpi=120)
area = pd.DataFrame(prices, columns=[‘a’, ‘b’, ‘c’, ‘d’]) # in the places of a,b,c replace with years
area.plot(kind=’area’,ax=ax)
plt.title(‘Graph for Area plot’)
plt.show()
I think this might not be an ideal way to hardcode all the values but you can use for loop to iterate through the csv file's content