How to group string data into sub-families and make graphs from them - python

I have a CSV with several columns of string data; here are the first rows out of about 2000:
| | Title | FormeJuridique | Siren | TVA | NAFAPE | TypeAct | DateCrea | DateClo | DureeExe | Adresse | Coordonee |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AGILIS COMPTABILITE | Société à responsabilité limitée (sans autre indication) | 902252782 | FR33902252782 | 6920Z | activités comptables | 01-09-2021 | 30-04-2022 | 241 days, 0:00:00 | Mâcon | (46.3036683, 4.8322266) |
| 1 | ALD VOLAILLES | SAS, société par actions simplifiée | 877535864 | FR56877535864 | 4639B | commerce de gros | 20-09-2019 | 19-04-2022 | 942 days, 0:00:00 | Montceau-les-Mines | (46.6740455, 4.3631681) |
First, I would like to group the data into sub-families. For example, for the NAFAPE variable, group all the rows whose code starts with 45, which would correspond to a "Restaurant" family. Is that possible? Another example: group the Adresse variable by city, with one group per city.
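A minimal sketch of that grouping, assuming the dataframe is called `brg` (as in the code further down) and using a small made-up sample in place of the real CSV:

```python
import pandas as pd

# Small made-up sample standing in for the real CSV ("GARAGE EXEMPLE" is invented)
brg = pd.DataFrame({
    "Title": ["AGILIS COMPTABILITE", "ALD VOLAILLES", "GARAGE EXEMPLE"],
    "NAFAPE": ["6920Z", "4639B", "4520A"],
    "Adresse": ["Mâcon", "Montceau-les-Mines", "Mâcon"],
})

# Sub-family = the first two digits of the NAF/APE code
brg["NAF_famille"] = brg["NAFAPE"].str[:2]

# All rows whose code starts with "45" form one family
famille_45 = brg[brg["NAFAPE"].str.startswith("45")]

# One group per city
par_ville = brg.groupby("Adresse").size()
```

The same `str[:2]` column can then be used with `groupby` for any per-family statistic.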
Another point is making graphs from string data, whether histograms or pie charts; I have trouble with them. Here is an example of one of my tries:
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

# Count companies per legal form (any non-null column works for the count)
data_pie = brg.groupby("FormeJuridique").count()['NAFAPE']
explode = (0, 0, 0, 0.05, 0, 0, 0, 0, 0.05)
wedges, texts, autotexts = plt.pie(data_pie, autopct="%.1f%%", explode=explode,
                                   pctdistance=1.1, labels=data_pie.index)
plt.title("FormeJuridique", fontsize=14)
plt.legend(wedges, data_pie.index,
           loc="center left",
           bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(texts, size=8, weight="bold")
plt.show()
It's not really readable: I have problems with the legend, the names of the variables around the pie, and so on.
For graphs like histograms it's a disaster, and I think grouping the variables into sub-families could make things easier, because the NAFAPE variable, for example, contains almost only unique values, which makes the graph unreadable.
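Once the codes are collapsed into two-digit families, the bar chart becomes much more readable. A sketch along those lines (the codes here are made up, and the Agg backend is used so it runs headless; use TkAgg interactively):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; use TkAgg interactively
import matplotlib.pyplot as plt
import pandas as pd

# Made-up NAF/APE codes; in practice this would be brg["NAFAPE"]
codes = pd.Series(["6920Z", "4639B", "4520A", "4511Z", "5610A"])

# Count by two-digit family instead of by full (almost always unique) code
counts = codes.str[:2].value_counts().sort_index()

ax = counts.plot(kind="bar")
ax.set_xlabel("NAF family (first two digits)")
ax.set_ylabel("Number of companies")
plt.tight_layout()
```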
Thanks for your help !

Related

Sum column data while merging zip_code polygons to MultiPolygons in geopandas

I'm working with Python in a Jupyter notebook.
I have the following dataset:
+-------+------------+----------+---------------------------------------------------+
| zip | population | area# | polygon |
+-------+------------+----------+---------------------------------------------------+
| 12345 | 50 | 55 | POLYGON ((-55.66788 40.04416, -55.66790 40.044... |
| 12346 | 100 | 55 | POLYGON ((-55.54666 40.40131, -55.54678 40.400... |
| . | . | . | . |
| . | . | . | . |
| 98765 | 236667 | 155 | POLYGON ((-155.04682 78.53585, -155.04680 78.5.. |
+-------+------------+----------+---------------------------------------------------+
Where the polygon column is a geopandas.GeoSeries and each geometry element is a shapely.geometry.polygon.Polygon.
I transformed the dataset into a geodataframe:
from geopandas import GeoDataFrame
dataset = GeoDataFrame(dataset)
And used the set_geometry function to assign the geometry column:
dataset = dataset.set_geometry("polygon")
Everything seems to be working fine and I am able to plot heatmaps using this GeoDataFrame.
The issue I am having is that I am trying to create a dataset grouping the population per area, but I also have to group the polygons, which I have been failing to do.
The final dataset should look like this: all the zip polygons with the same area# collapsed into a single row with a MultiPolygon geometry and the sum of their population values:
+------------+----------+--------------------------------------------------------+
| population | area# | polygon |
+------------+----------+--------------------------------------------------------+
| 150        | 55       | MULTIPOLYGON ((-55.66788 40.04416, -55.66790 40.044... |
| . | . | . |
| . | . | . |
| . | . | . |
| 236667     | 155      | MULTIPOLYGON ((-155.04682 78.53585, -155.04680 78.5.. |
+------------+----------+--------------------------------------------------------+
I don't really need to follow the steps I outlined above; those are just steps I found here on Stack Overflow. I am OK with doing something else from scratch.
The geopandas spatial equivalent of a pandas .groupby().aggregate() operation is dissolve. Take a look through the docs; they're really helpful.
One key argument to note is the aggfunc argument. From the docs:
The aggfunc = argument defaults to ‘first’ which means that the first row of attributes values found in the dissolve routine will be assigned to the resultant dissolved geodataframe. However it also accepts other summary statistic options as allowed by pandas.groupby including:
‘first’
‘last’
‘min’
‘max’
‘sum’
‘mean’
‘median’
function
string function name
list of functions and/or function names, e.g. [np.sum, ‘mean’]
dict of axis labels -> functions, function names or list of such.
If you're looking to group on area, and sum the populations within each area, as well as unify the polygons, you can use aggfunc={"population": "sum"}, e.g.:
aggregated = dataset.dissolve("area#", aggfunc={"population": "sum"})
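Setting the geometry aside, the attribute half of that dissolve is an ordinary pandas groupby. A sketch using the sample numbers from the table above (dissolve does this plus the polygon union, so this is only an illustration of the aggregation step):

```python
import pandas as pd

# Attribute columns from the example table; dissolve additionally unions the polygons
df = pd.DataFrame({
    "zip": [12345, 12346, 98765],
    "population": [50, 100, 236667],
    "area#": [55, 55, 155],
})

# Same aggregation as aggfunc={"population": "sum"}, minus the geometry
aggregated = df.groupby("area#", as_index=False).agg({"population": "sum"})
```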

Plotting a trend graph in Python

I have the following data in a DataFrame:
+----------------------+--------------+-------------------+
| Physician Profile Id | Program Year | Value Of Interest |
+----------------------+--------------+-------------------+
| 1004777 | 2013 | 83434288.00 |
| 1004777 | 2014 | 89237990.00 |
| 1004777 | 2015 | 96321258.00 |
| 1004777 | 2016 | 186993309.00 |
| 1004777 | 2017 | 205274459.00 |
| 1315076 | 2013 | 127454475.84 |
| 1315076 | 2014 | 156388338.20 |
| 1315076 | 2015 | 199733425.11 |
| 1315076 | 2016 | 242766959.37 |
+----------------------+--------------+-------------------+
I want to plot a trend graph with the Program year on the x-axis and Value of Interest on the y-axis and different lines for each Physician Profile ID. What is the best way to get this done?
Two routes I'd consider going with this:
Basic, fast, easy: matplotlib, which would look something like this:
install it, like pip install matplotlib
use it, like import matplotlib.pyplot as plt and this cheatsheet
Graphically compelling and you can drop your pandas dataframe right into it: Bokeh
I hope that helps you get started!
I tried a few things and was able to implement it:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

years = df["Program_Year"].unique()
PhysicianIds = sorted(df["Physician_Profile_ID"].unique())
pd.options.mode.chained_assignment = None
for ID in PhysicianIds:
    df_filter = df[df["Physician_Profile_ID"] == ID]
    # fill in missing years with a value of 0
    for year in years:
        found = False
        for index, row in df_filter.iterrows():
            if row["Program_Year"] == year:
                found = True
                break
        else:
            found = False
        if not found:
            df_filter.loc[index + 1] = [ID, year, 0]
    VoI = list(df_filter["Value_of_Interest"])
    sns.lineplot(x=years, y=VoI, label=ID, linestyle='-')
plt.ylabel("Value of Interest (in 100,000,000)")
plt.xlabel("Year")
plt.title("Top 10 Physicians")
plt.legend(title="Physician Profile ID")
plt.show()
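The loop above can be replaced by a pivot: one column per physician, one row per year, missing years filled with 0. A sketch with the sample data (column names assumed to match the self-answer above):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Physician_Profile_ID": [1004777] * 5 + [1315076] * 4,
    "Program_Year": [2013, 2014, 2015, 2016, 2017,
                     2013, 2014, 2015, 2016],
    "Value_of_Interest": [83434288.00, 89237990.00, 96321258.00,
                          186993309.00, 205274459.00, 127454475.84,
                          156388338.20, 199733425.11, 242766959.37],
})

# One column per physician, one row per year; missing years become 0
pivot = (df.pivot_table(index="Program_Year",
                        columns="Physician_Profile_ID",
                        values="Value_of_Interest")
           .fillna(0))

ax = pivot.plot(marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Value of Interest")
plt.legend(title="Physician Profile ID")
```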

Plotting in Python extracting only specific columns from a CSV

EDIT: as suggested, shortening the question:
Quite new to Python and programming. I would like to plot the 1st and 4th columns on a log(x), log(y) graph, and honestly I don't know how to extract only the two columns I need from this:
16:58:58 | 2.090 | 26.88 | 1.2945E-9 | 45.8
16:59:00 | 2.031 | 27.00 | 1.3526E-9 | 132.1
16:59:02 | 2.039 | 26.90 | 1.3843E-9 | 178.5
16:59:04 | 2.031 | 26.98 | 1.4628E-9 | 228.9
16:59:06 | 2.031 | 27.04 | 1.5263E-9 | 259.8
16:59:08 | 2.027 | 26.84 | 1.6010E-9 | 271.8
Using pandas:
import pandas as pd
df = pd.read_csv("data.txt", sep=r"\s*\|\s*", header=None, index_col=0, engine="python")
df.plot(y=4)
(Note that this ignores the logarithmic scaling because it's not clear what the logarithm of a time should be)
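To switch to log scales, one reading of "1st and 4th" (treating the timestamp as the index, so the data columns are labelled 1 to 4 after parsing) could be sketched as follows; swap in whichever pair of columns is actually meant:

```python
from io import StringIO

import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import pandas as pd

# The sample rows from the question, inlined for a self-contained example
sample = """16:58:58 | 2.090 | 26.88 | 1.2945E-9 | 45.8
16:59:00 | 2.031 | 27.00 | 1.3526E-9 | 132.1
16:59:02 | 2.039 | 26.90 | 1.3843E-9 | 178.5"""

df = pd.read_csv(StringIO(sample), sep=r"\s*\|\s*",
                 header=None, index_col=0, engine="python")

# Plot column 4 against column 3 with both axes logarithmic
ax = df.plot(x=3, y=4, loglog=True)
```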
If you want to not use the excellent pandas, here is a from-scratch approach.
import matplotlib.pyplot as plt
import math
import datetime as dt
test = """16:58:58 | 2.090 | 26.88 | 1.2945E-9 | 45.8\n
16:59:00 | 2.031 | 27.00 | 1.3526E-9 | 132.1\n
16:59:02 | 2.039 | 26.90 | 1.3843E-9 | 178.5\n
16:59:04 | 2.031 | 26.98 | 1.4628E-9 | 228.9\n
16:59:06 | 2.031 | 27.04 | 1.5263E-9 | 259.8\n
16:59:08 | 2.027 | 26.84 | 1.6010E-9 | 271.8\n"""
lines = [line for line in test.splitlines() if line != ""]
# Here is the real code
subset = []
for line in lines:
parts = line.split('|')
ts = dt.datetime.strptime(parts[0].strip(), "%H:%M:%S")
num = math.log(float(parts[3].strip()))
subset.append((ts, num))
# now there is a list of tuples with your datapoints (note the second elements
# are natural logs of the original values), looking like
# [(datetime.datetime(1900, 1, 1, 16, 58, 58), -20.465...), (datetime.datetime(1900, 1, 1, 16, 59), ...]
# I made this list intentionally so that you can see how one can gather everything in a tidy way from the
# raw string data.
# Now lets separate things for plotting
times = [elem[0] for elem in subset]
values = [elem[1] for elem in subset]
# now to plot, I'm going to use the matplotlib plot_date function.
plt.figure()
plt.plot_date(times, values)
# do some formatting on the date axis
plt.gcf().autofmt_xdate()
plt.show()

how to stop seaborn visualisation sorting the values on the X axis of a bar chart

I have a column in a pandas dataframe called month_name which contains some months of the year.
I have tried feeding this data to seaborn visualisation library as both an object and categorical datatype.
I have also tried sorting the list before passing it to the seaborn library. In both instances the resulting graph looks like this (with month_name out of order):
[bar chart: y-axis "amount", x-axis "month_name"; the months appear in the order August, July, June, October, September]
How do I get the month_names to appear in the correct order?
And how do I put the actual exact value on top of each bar, like in the year example over here, just above the DOT PLOT example?
From the docs:
{x, hue, col, row}_order : list-like, optional
Order of levels plotted on various dimensions of the figure.
Default is to use sorted level values.
This is an issue with the new pd.Categorical type in seaborn, which is currently not supported, but is slated for addition in version 0.6.0. See this github issue: https://github.com/mwaskom/seaborn/issues/361
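In practice, per the docs excerpt above, passing `order=` with the months in chronological order does it. A sketch that builds that order with the standard-library calendar module (the column name `month_name` is taken from the question; the data here is made up):

```python
import calendar

import pandas as pd

# Made-up data standing in for the month_name column
df = pd.DataFrame({"month_name": ["August", "July", "June", "October",
                                  "September", "July", "June"]})

# Chronological month order, restricted to the months actually present
month_order = [m for m in calendar.month_name[1:]
               if m in set(df["month_name"])]

# Pass this to seaborn, e.g. sns.barplot(..., order=month_order);
# the same order also works for a plain pandas bar chart:
counts = df["month_name"].value_counts().reindex(month_order)
```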

Plotting a line plot with error bars and datapoints from a pandas DataFrame

I've been racking my brain to try to figure out how to plot a pandas DataFrame the way I want but to no avail.
The DataFrame has a MultiIndex and it looks like this:
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| | | | | | run_001 | run_002 | run_003 | run_004 | run_005 |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| file_type | server_count | file_count | thread_count | cacheclear_type | | | | | |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| gor | 01servers | 05files | 20threads | ccALWAYS | 15.918 | 16.275 | 15.807 | 17.781 | 16.233 |
| gor | 01servers | 10files | 20threads | ccALWAYS | 17.322 | 17.636 | 16.096 | 16.484 | 16.715 |
| gor | 01servers | 15files | 20threads | ccALWAYS | 19.265 | 17.128 | 17.630 | 18.739 | 16.833 |
| gor | 01servers | 20files | 20threads | ccALWAYS | 23.744 | 20.539 | 21.416 | 22.921 | 22.794 |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
What I want to do is plot a line graph where the x values are the 'file_count' value, and the y value for each is the average of all the run_xxx values for the corresponding line in the DataFrame.
If possible I would like to add error bars and even the data points themselves so that I can see the distribution of the data behind that average.
Here's a (crappy) mockup of roughly what I'm talking about:
I've been able to create a boxplot using the boxplot() function built into pandas' DataFrame by doing:
df.transpose().boxplot()
This looks almost okay but a little bit cluttered and doesn't have the actual data points plotted.
A beeswarm plot works very nicely in this situation, especially when you have a lot of dots and want to show their distribution. You need, however, to supply the position parameter to beeswarm, as by default it starts at 0. The boxplot method of a pandas DataFrame, on the other hand, plots boxes at x = 1, 2, ...
It comes down to just these:
from beeswarm import *
import numpy as np

D1 = beeswarm(df.values, positions=np.arange(len(df.values)) + 1)
D2 = df.transpose().boxplot(ax=D1[1])
For completeness I'll include the way I finally managed to do this here:
import numpy as np
import matplotlib.pyplot as plt
import random

dft = df.sortlevel(2).transpose()
fig, ax = plt.subplots()
x = []
y = []
y_err = []
scatterx = []
scattery = []
for n, col in enumerate(dft.columns):
    x.append(n)
    y.append(np.mean(dft[col]))
    y_err.append(np.std(dft[col]))
    # jitter the data points slightly around each x position
    for v in dft[col]:
        scattery.append(v)
        scatterx.append(n + ((random.random() - 0.5) * 0.05))
p = plt.plot(x, y, label="mean")
color = p[0].get_color()
plt.errorbar(x, y, yerr=y_err, color=color)
plt.scatter(scatterx, scattery, alpha=0.3, color=color)
plt.legend(loc=2)
ax.set_xticks(range(len(dft.columns)))
ax.set_xticklabels([c[2] for c in dft.columns])
plt.show()
This will show a line chart with error bars and data points. There may be some errors in the above code. I copied it and simplified a bit before pasting here.
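As a side note, the per-column loop that computes the means and standard deviations can also be written as row-wise pandas aggregations. A sketch with a couple of run values from the table above (pandas `std` uses the sample standard deviation, ddof=1, whereas `np.std` in the loop uses ddof=0):

```python
import pandas as pd

# Two rows of run values from the example table
df = pd.DataFrame({
    "run_001": [15.918, 17.322],
    "run_002": [16.275, 17.636],
    "run_003": [15.807, 16.096],
})

# Mean and sample standard deviation across the runs, one value per row
means = df.mean(axis=1)
stds = df.std(axis=1)

# These feed straight into plt.errorbar(x, means, yerr=stds)
```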
