I am trying to plot data from my pandas DataFrame, which contains counts of men and women at different academic function levels. However, when I plot the pyramid chart the data is swapped: PhD and Assistant Professor are swapped, as are Associate Professor and Postdoc. I can't find the problem or mistake.
import altair as alt
from vega_datasets import data
import pandas as pd
df_natuur_vrouw = df_natuur[df_natuur['geslacht'] == 'V']
df_natuur_man = df_natuur[df_natuur['geslacht'] == 'M']
df_techniek_vrouw = df_techniek[df_techniek['geslacht'] == 'V']
df_techniek_man = df_techniek[df_techniek['geslacht'] == 'M']
slider = alt.binding_range(min=2011, max=2020, step=1)
select_year = alt.selection_single(name='year', fields=['year'],
                                   bind=slider, init={'year': 2020})

base_vrouw = alt.Chart(df_natuur_vrouw).add_selection(
    select_year
).transform_filter(
    select_year
).properties(
    width=250
)

base_man = alt.Chart(df_natuur_man).add_selection(
    select_year
).transform_filter(
    select_year
).properties(
    width=250
)

color_scale = alt.Scale(domain=['M', 'V'],
                        range=['#003865', '#ee7203'])

left = base_vrouw.encode(
    y=alt.Y('functieniveau:O', axis=None),
    x=alt.X('percentage_afgerond:Q',
            title='percentage',
            scale=alt.Scale(domain=[0, 100], reverse=True)),
    color=alt.Color('geslacht:N', scale=color_scale, legend=None)
).mark_bar().properties(title='Female')

middle = base_vrouw.encode(
    y=alt.Y('functieniveau:O', axis=None, sort=['Professor', 'Associate Professor', 'Assistant Professor', 'Postdoc', 'PhD']),
    text=alt.Text('functieniveau:O'),
).mark_text().properties(width=110)

right = base_man.encode(
    y=alt.Y('functieniveau:O', axis=None),
    x=alt.X('percentage_afgerond:Q', title='percentage', scale=alt.Scale(domain=[0, 100])),
    color=alt.Color('geslacht:N', scale=color_scale, legend=None)
).mark_bar().properties(title='Male')

alt.concat(left, middle, right, spacing=5, title='Percentage male and female employees per academic level in nature sector 2011-2020')
This is the data I want to show; however, the values for PhD and Assistant Professor are swapped, and so are the values for Associate Professor and Postdoc.
It is a little hard to tell without having a sample of the data to run the code, but the problem is likely that you are sorting the middle plot but not the left and right plots. Try applying the same Y sort order to the bar plots as you are using for the text and see if that works.
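For example, something like this (a sketch reusing the column names and sort order from your code; I can't test it without the data):

sort_order = ['Professor', 'Associate Professor', 'Assistant Professor', 'Postdoc', 'PhD']

left = base_vrouw.encode(
    y=alt.Y('functieniveau:O', axis=None, sort=sort_order),
    x=alt.X('percentage_afgerond:Q',
            title='percentage',
            scale=alt.Scale(domain=[0, 100], reverse=True)),
    color=alt.Color('geslacht:N', scale=color_scale, legend=None)
).mark_bar().properties(title='Female')

right = base_man.encode(
    y=alt.Y('functieniveau:O', axis=None, sort=sort_order),
    x=alt.X('percentage_afgerond:Q', title='percentage', scale=alt.Scale(domain=[0, 100])),
    color=alt.Color('geslacht:N', scale=color_scale, legend=None)
).mark_bar().properties(title='Male')

The middle text chart would then use sort=sort_order as well, exactly as it already does.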
I tried to write code that creates a visualization of all forest fires that happened during the year 2021. The CSV file containing the data is around 1.5 GB. The program looks correct to me, but when I try to run it, it gets stuck without displaying any visualization or error message. The last time I tried, it ran for almost half a day until Python crashed.
I don't know if I have an infinite loop, if the file is simply too big, or if there is something else I am missing.
Can anyone provide feedback, please?
Here is my code:
import csv
from datetime import datetime
from plotly.graph_objs import Scattergeo, Layout
from plotly import offline

filename = 'fire_nrt_J1V-C2_252284.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    lats, lons, brights, dates = [], [], [], []
    for row in reader:
        date = datetime.strptime(row[5], '%Y-%m-%d')
        lat = row[0]
        lon = row[1]
        bright = row[2]
        lats.append(lat)
        lons.append(lon)
        brights.append(bright)
        dates.append(date)

data = [{
    'type': 'scattergeo',
    'lon': lons,
    'lat': lats,
    'text': dates,
    'marker': {
        'size': [5*bright for bright in brights],
        'color': brights,
        'colorscale': 'Reds',
        'colorbar': {'title': 'Fire brightness'},
    }
}]

my_layout = Layout(title="Forestfires during the year 2021")
fig = {'data': data, 'layout': my_layout}
offline.plot(fig, filename='global_fires_2021.html')
I have found the data you describe here: https://wifire-data.sdsc.edu/dataset/viirs-i-band-375-m-active-fire-data/resource/3ce73b20-f584-44f7-996b-2f319c480294
Plotly uses resources for every point plotted on a scatter, so there is a limit to how many points you can plot before you run out of resources.
There are other approaches to plotting larger numbers of points:
https://plotly.com/python/mapbox-density-heatmaps/ has fewer limits, but is still limited on very large data sets
https://plotly.com/python/datashader/ can work with very large data sets because it generates an image, but it is more challenging to work with (installation and navigating the API)
data sourcing
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
df = pd.read_csv("https://firms.modaps.eosdis.nasa.gov/data/active_fire/noaa-20-viirs-c2/csv/J1_VIIRS_C2_Global_7d.csv")
df
scatter_geo
limited to random sample of 1000 rows
px.scatter_geo(
    df.sample(1000),
    lat="latitude",
    lon="longitude",
    color="bright_ti4",
    # size="size",
    hover_data=["acq_date"],
    color_continuous_scale="reds",
)
density mapbox
px.density_mapbox(
    df.sample(5000),
    lat="latitude",
    lon="longitude",
    z="bright_ti4",
    radius=3,
    color_continuous_scale="reds",
    zoom=1,
    mapbox_style="carto-positron",
)
datashader Mapbox
all data
some of these libraries are more difficult to install and use
you also need to deal with this issue: https://community.plotly.com/t/datashader-image-distorted-when-passed-to-mapbox/39375/2
import datashader as ds, colorcet
from pyproj import Transformer

t3857_to_4326 = Transformer.from_crs(3857, 4326, always_xy=True)

# project CRS to ensure image overlays appropriately back over mapbox
# https://community.plotly.com/t/datashader-image-distorted-when-passed-to-mapbox/39375/2
df.loc[:, "longitude_3857"], df.loc[:, "latitude_3857"] = ds.utils.lnglat_to_meters(
    df.longitude, df.latitude
)

RESOLUTION = 1000
cvs = ds.Canvas(plot_width=RESOLUTION, plot_height=RESOLUTION)
agg = cvs.points(df, x="longitude_3857", y="latitude_3857")
img = ds.tf.shade(agg, cmap=colorcet.fire).to_pil()

fig = go.Figure(go.Scattermapbox())
fig.update_layout(
    mapbox={
        "style": "carto-positron",
        "layers": [
            {
                "sourcetype": "image",
                "source": img,
                # the coordinates array contains [longitude, latitude] pairs for the image corners,
                # listed in clockwise order: top left, top right, bottom right, bottom left
                "coordinates": [
                    t3857_to_4326.transform(
                        agg.coords["longitude_3857"].values[a],
                        agg.coords["latitude_3857"].values[b],
                    )
                    for a, b in [(0, -1), (-1, -1), (-1, 0), (0, 0)]
                ],
            }
        ],
    },
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
)
Software versions
pandas: 1.3.3,
datashader: 0.13.0,
bokeh: 2.3.3,
holoviews: 1.14.6
What I want to achieve/My current problem
I make scatterplots of categorical data with bokeh/holoviews. Sometimes the data sets are big, so I want to use datashader.
But in many cases my data is too sparse to look good (1672 points in this case), so I have to spread it. The spread result does not look good:
(Without spreading the data only about 9 pixels are visible; I do not show a picture of this.)
For such small sizes it is also possible to use holoviews without datashader. There the picture looks much better:
Following the ideas in Datashader: categorical colormapping of GeoDataFrames I tried to use aggregator=ds.by(cat_color, ds.any()) instead of aggregator=ds.by(cat_color) in the datashade-function.
The result is strange:
When you do not spread the result, you get the same strange olive background color, just more transparent.
Interestingly, this background color is not always the same.
Reproducible code example
import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
import datashader as ds
from datashader.colors import Sets1to3
from holoviews.operation.datashader import datashade, dynspread

raw_data = [('Alice', 60, 'London', 5),
            ('Bob', 14, 'Delhi', 7),
            ('Charlie', 66, np.NaN, 11),
            ('Dave', np.NaN, 'Delhi', 15),
            ('Eveline', 33, 'Delhi', 4),
            ('Fred', 32, 'New York', np.NaN),
            ('George', 95, 'Paris', 11)
            ]

# Create a DataFrame object
df = pd.DataFrame(raw_data, columns=['Name', 'Age', 'City', 'Experience'])
df['City'] = pd.Categorical(df['City'])

x = 'Age'
y = 'Experience'
color = 'City'
cats = df[color].cat.categories

# Make dummy-points (currently the only way to make a legend: https://holoviews.org/user_guide/Large_Data.html)
for cat in cats:
    # Just to make clear how many points of a given category we have
    print(cat, ((df[color] == cat) & (df[x].notnull()) & (df[y].notnull())).sum())

color_key = [(name, col) for name, col in zip(cats, Sets1to3)]
color_points = hv.NdOverlay({n: hv.Points([0, 0], label=str(n)).opts(color=c, size=0) for n, c in color_key})

# Create the plot with datashader
points = hv.Points(df, [x, y], label="%s vs %s" % (x, y))  # .redim.range(Age=(0,90), Experience=(0,14))
datashaded1 = datashade(points, aggregator=ds.by(color)).opts(width=550, height=480)
datashaded2 = datashade(points, aggregator=ds.by(color, ds.any())).opts(width=550, height=480)
dynspread(datashaded1)*color_points + dynspread(datashaded2)*color_points
When you remove ds.any() everything works more or less (there are some minor problems, as discussed at https://github.com/holoviz/holoviews/issues/5070), but with ds.any() the dynspread does not work at all. This problem is also present in my actual data, but there I will probably just use spread, which works better. Is there a reason for this?
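For reference, by "just use spread" I mean something roughly like this (a sketch; the px value is only an example):

from holoviews.operation.datashader import spread
spread(datashaded1, px=4)*color_points + spread(datashaded2, px=4)*color_points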
Is there something that I am missing?
I am trying to plot the number of times a satellite passes over a certain location, using Python and a heatmap. I can easily generate the satellite data, but I am having issues displaying it in a nice manner. I am trying to follow this example, since I can use the style function to lower the opacity. I am having some issues replicating it, though, as it seems that the GeoJson version they were using no longer accepts the same inputs. This is the dataframe I am using:
print(df.head())
latitude longitude countSp geometry
0 -57.9 151.1 1.0 POLYGON ((151.05 -57.95, 151.15 -57.95, 151.15...
1 -57.9 151.2 2.0 POLYGON ((151.15 -57.95, 151.25 -57.95, 151.25...
2 -57.8 151.2 1.0 POLYGON ((151.15 -57.84999999999999, 151.25 -5...
3 -57.8 151.3 3.0 POLYGON ((151.25 -57.84999999999999, 151.35 -5...
4 -57.8 151.4 2.0 POLYGON ((151.35 -57.84999999999999, 151.45 -5...
I then call folium through:
hmap = folium.Map(location=[42.5, -80], zoom_start=7)

colormap_dept = branca.colormap.StepColormap(
    colors=['#00ae53', '#86dc76', '#daf8aa',
            '#ffe6a4', '#ff9a61', '#ee0028'],
    vmin=0,
    vmax=max_amt,
    index=[0, 2, 4, 6, 8, 10, 12])

style_func = lambda x: {
    'fillColor': colormap_dept(x['countSp']),
    'color': '',
    'weight': 0.0001,
    'fillOpacity': 0.1
}

folium.GeoJson(
    df,
    style_function=style_func,
).add_to(hmap)
This is the error I get when I run my code:
ValueError: Cannot render objects with any missing geometries: latitude longitude countSp geometry
I know that I can use the HeatMap plugin from folium to get most of this done, but I have found a couple of issues with that. First, I cannot easily generate a legend (though I have been able to work around this). Second, it is way too opaque, and I have not found any way of reducing that. I have tried playing around with the radius and blur parameters of HeatMap without much change. I think the fillOpacity of the style_func above is a much better way of making my data translucent.
By the way, I generate the polygon in my df with the following command, so in my dataframe all I need folium to know about is the geometry and countSp (the number of times a satellite passes over a certain area, roughly a 10 km x 10 km square).
df['geometry'] = df.apply(lambda row: Polygon([(row.longitude-0.05, row.latitude-0.05),
                                               (row.longitude+0.05, row.latitude-0.05),
                                               (row.longitude+0.05, row.latitude+0.05),
                                               (row.longitude-0.05, row.latitude+0.05)]), axis=1)
Is there a good way of going about this issue?
Once again, the goal was to express this as a heat map, so I used Plotly's data on airline arrivals and departures to visualize it.
Only flights to and from the U.S. mainland were used.
Excluded IATA codes: ['LIH', 'HNL', 'STT', 'STX', 'SJU', 'OGG', 'KOA']
Draw a straight line on the map from the latitude and longitude of the departure airport to the latitude and longitude of the arrival airport.
Draw a heat map with data on the number of arrivals and departures by airport.
Since we cannot use a discrete colormap, we will create a linear colormap and add it.
Embed the heatmap as a layer named Traffic
import pandas as pd
df_airports = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_us_airport_traffic.csv')
df_airports.sort_values('cnt', ascending=False)
df_air = df_airports[['lat','long','cnt']]
df_flight_paths = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_aa_flight_paths.csv')
df_flight_paths = df_flight_paths[~df_flight_paths['airport1'].isin(['HNL','STT','SJU','OGG','KOA'])]
df_flight_paths = df_flight_paths[~df_flight_paths['airport2'].isin(['LIH','HNL','STT','STX','SJU'])]
df_flight_paths = df_flight_paths[['start_lat', 'start_lon', 'end_lat', 'end_lon', 'cnt']]
import folium
from folium.plugins import HeatMap
import branca.colormap as cm
from collections import defaultdict
steps=10
colormap = cm.linear.YlGnBu_09.scale(0, 1).to_step(steps)
gradient_map=defaultdict(dict)
for i in range(steps):
    gradient_map[1/steps*i] = colormap.rgb_hex_str(1/steps*i)
m = folium.Map(location=[32.500, -97.500], zoom_start=4, tiles="cartodbpositron")
data = []
for idx, row in df_flight_paths.iterrows():
    folium.PolyLine([[row.start_lat, row.start_lon], [row.end_lat, row.end_lon]],
                    weight=2, color="red", opacity=0.4
                    ).add_to(m)

HeatMap(
    df_air.values,
    gradient=gradient_map,
    name='Traffic',
    min_opacity=0.1,
    radius=15,
    blur=5
).add_to(m)
folium.LayerControl().add_to(m)
colormap.add_to(m)
m
I am currently trying to learn how to work with CSV data via pandas and matplotlib. I have a dataset that clearly has spikes which I need to "clean up" before evaluating anything from it, but I am having difficulty understanding how to "detect" spikes in a graph...
The dataset I am working with is as follows:
df = pd.DataFrame({'price':[340.6, 35.66, 33.98, 38.67, 32.99, 32.04, 37.64,
38.22, 37.13, 38.57, 32.4, 34.98, 36.74, 32.9,
32.52, 38.83, 33.9, 32.62, 38.93, 32.14, 33.09,
34.25, 34.39, 33.28, 38.13, 36.25, 38.91, 38.9,
36.85, 32.17, -2.07, 34.49, 35.7, 32.54, 37.91,
37.35, 32.05, 38.03, 0.32, 33.87, 33.16, 34.74,
32.47, 33.31, 34.54, 36.6, 36.09, 35.49, 370.51,
37.33, 37.54, 33.32, 35.09, 33.08, 38.3, 34.32,
37.01, 33.63, 36.35, 33.77, 33.74, 36.62, 36.74,
37.76, 35.58, 38.76, 36.57, 37.05, 35.33, 36.41,
35.54, 37.48, 36.22, 36.19, 36.43, 34.31, 34.85,
38.76, 38.52, 38.02, 36.67, 32.51, 321.6, 37.82,
34.76, 33.55, 32.85, 32.99, 35.06]},
index = pd.date_range('2014-03-03 06:00','2014-03-06 22:00',freq='H'))
Which produces this graph:
So all of these values are in the range of 32 to 38. I've intentionally placed outlier values at indexes [0, 30, 38, 48, 82] to create spikes (and dips) in the graph.
Now I was trying to look up how to do this so-called "step detection" on a graph, and the only really useful answer I found is this question here; using that, I came up with the following code...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema
df = pd.DataFrame({'price':[340.6, 35.66, 33.98, 38.67, 32.99, 32.04, 37.64,
38.22, 37.13, 38.57, 32.4, 34.98, 36.74, 32.9,
32.52, 38.83, 33.9, 32.62, 38.93, 32.14, 33.09,
34.25, 34.39, 33.28, 38.13, 36.25, 38.91, 38.9,
36.85, 32.17, -2.07, 34.49, 35.7, 32.54, 37.91,
37.35, 32.05, 38.03, 0.32, 33.87, 33.16, 34.74,
32.47, 33.31, 34.54, 36.6, 36.09, 35.49, 370.51,
37.33, 37.54, 33.32, 35.09, 33.08, 38.3, 34.32,
37.01, 33.63, 36.35, 33.77, 33.74, 36.62, 36.74,
37.76, 35.58, 38.76, 36.57, 37.05, 35.33, 36.41,
35.54, 37.48, 36.22, 36.19, 36.43, 34.31, 34.85,
38.76, 38.52, 38.02, 36.67, 32.51, 321.6, 37.82,
34.76, 33.55, 32.85, 32.99, 35.06]},
index = pd.date_range('2014-03-03 06:00','2014-03-06 22:00',freq='H'))
# df.plot()
# plt.show()
threshold = int(len(df['price']) * 0.75)
maxPeaks = argrelextrema(df['price'].values, np.greater, order=threshold)
minPeaks = argrelextrema(df['price'].values, np.less, order=threshold)
df2 = df.copy()
price_column_index = df2.columns.get_loc('price')
allPeaks = maxPeaks + minPeaks
for peakList in allPeaks:
    for peak in peakList:
        print(df2.iloc[peak]['price'])
But the issue with this is that it only seems to return the indexes 30 and 82; it's not grabbing the large value at index 0, and it's also not grabbing anything in the negative dips. I am quite sure I am using these methods incorrectly.
Now, I understand that for this SPECIFIC issue I COULD just look for values in the column that are greater or less than a certain value, but I am thinking of situations with 1000+ entries where the "lowest/highest normal values" cannot be accurately determined, so I would like a spike detection that works regardless of scale.
So my questions are as follows:
1) The information I've been looking at about step detection seemed really dense and very difficult for me to comprehend. Could anyone provide a general rule about how to approach these "step detection" issues?
2) Are there any public libraries that allow this kind of work to be done with a little more ease? If so, what are they?
3) How can you achieve the same results using vanilla Python? I've been in many workplaces that do not allow any other libraries to be installed, forcing solutions that do not use these useful external libraries, so I am wondering if there is some kind of formula/function that could be written to achieve similar results...
4) What other approaches could I use from a data-analysis standpoint to deal with this issue? I read something about correlation and standard deviation, but I don't actually know how any of these can be used to identify WHERE the spikes are...
EDIT: I also found this answer using scipy's find_peaks method, but reading its documentation I don't really understand what the parameters represent or where the values passed came from... Any clarification would be greatly appreciated.
Solution using scipy.signal.find_peaks
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
df = pd.DataFrame({'price':[340.6, 35.66, 33.98, 38.67, 32.99, 32.04, 37.64,
38.22, 37.13, 38.57, 32.4, 34.98, 36.74, 32.9,
32.52, 38.83, 33.9, 32.62, 38.93, 32.14, 33.09,
34.25, 34.39, 33.28, 38.13, 36.25, 38.91, 38.9,
36.85, 32.17, -2.07, 34.49, 35.7, 32.54, 37.91,
37.35, 32.05, 38.03, 0.32, 33.87, 33.16, 34.74,
32.47, 33.31, 34.54, 36.6, 36.09, 35.49, 370.51,
37.33, 37.54, 33.32, 35.09, 33.08, 38.3, 34.32,
37.01, 33.63, 36.35, 33.77, 33.74, 36.62, 36.74,
37.76, 35.58, 38.76, 36.57, 37.05, 35.33, 36.41,
35.54, 37.48, 36.22, 36.19, 36.43, 34.31, 34.85,
38.76, 38.52, 38.02, 36.67, 32.51, 321.6, 37.82,
34.76, 33.55, 32.85, 32.99, 35.06]},
index = pd.date_range('2014-03-03 06:00','2014-03-06 22:00',freq='H'))
x = df['price'].values
x = np.insert(x, 0, 0) # added padding to catch any initial peaks in data
# for positive peaks
peaks, _ = find_peaks(x, height=50)  # height is the threshold value
peaks = peaks - 1
print("The indices for peaks in the dataframe: ", peaks)
print(" ")
print("The values extracted from the dataframe")
print(df['price'][peaks])
# for negative peaks
x = x * -1
neg_peaks, _ = find_peaks(x, height=0)  # height is the threshold value
neg_peaks = neg_peaks - 1
print(" ")
print("The indices for negative peaks in the dataframe: ", neg_peaks)
print(" ")
print("The values extracted from the dataframe")
print(df['price'][neg_peaks])
First note that the algorithm works by making comparisons between values. The upshot is that the first value of the array gets ignored; I suspect that this was the problem with the solution you posted.
To get around this I padded the x array with an extra 0 at position 0; the value you put there is up to you,
x = np.insert(x, 0, 0)
The algorithm then returns, in the variable peaks, the indices where the peak values are found in the array,
peaks, _ = find_peaks(x, height=50)  # height is the threshold value
As I have added an initial value I have to subtract 1 from each of these indices,
peaks = peaks - 1
I can now use these indices to extract the peak values from the dataframe,
print(df['price'][peaks])
In terms of not detecting the peak at the beginning of the data, what you would usually do is re-sample the data set periodically and overlap the start of each sample with the end of the previous sample by a little bit. This "sliding window" over the data helps you avoid exactly this scenario: missing peaks on the boundary between scans of the data. The overlap should be greater than whatever your signal-detection width is; in the examples above it appears to be a single data point.
For instance, if you are looking at daily data over a period of a month, with a resolution of one day, you would start your scan on the last day of the previous month in order to detect a peak that happened on the first day of this month.
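A rough sketch of that sliding-window idea applied to the example data (the window and overlap sizes here are arbitrary illustrations, not recommendations):

import numpy as np
from scipy.signal import find_peaks

values = df['price'].values
window, overlap = 24, 2      # e.g. scan one day of hourly data at a time, overlapping by 2 points
found = set()
start = 0
while True:
    stop = min(start + window, len(values))
    chunk = np.insert(values[start:stop], 0, 0)   # same padding trick as above
    peaks, _ = find_peaks(chunk, height=50)
    found.update(start + p - 1 for p in peaks)    # undo the padding offset
    if stop == len(values):
        break
    start = stop - overlap                        # re-scan the boundary points
print(sorted(found))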