Create a scatterplot from the data of two dataframes?

I have two dataframes in Python with the following content:
Table = conn
A    B    relevance
1    3    0.7
2    7    0.1
5    20   2
6    2    7
Table = point
Point    Lat     Lon
1        45.3    -65.2
2        34.4    -60.2
3        40.2    -60.1
20       40.4    -63.1
In the first table, column A represents an origin, column B a destination, and relevance the weight of the link between them.
The second table gives the coordinates of each point (origin or destination).
I want to create a visualization in Python that looks up the coordinates of each origin and destination (columns A and B of the first table) in the second table and makes a scatterplot from them. Then I want to link each origin to its destination with a line whose thickness grows with the relevance.
Here, "link" refers to the line that joins two points in the plot.
Any ideas? I've started with a very basic approach, but I'm having trouble getting any further.
for _, row in conn.iterrows():
    row['A']          # origin
    row['B']          # destination
    row['relevance']  # relevance of the link

You have two DataFrames, point and conn, right?
import matplotlib.pyplot as plt

# Index "point" by its "Point" column so coordinates can be looked up by label
point.set_index(point.Point, inplace=True)
# Configure line widths: map relevance linearly onto [min_width, max_width]
min_width = 0.5
max_width = 4.0
min_relevance = conn.relevance.min()
max_relevance = conn.relevance.max()
slope = (max_width - min_width)/(max_relevance - min_relevance)
widths = min_width + slope*(conn.relevance - min_relevance)
# Plot lines (every origin and destination must exist in "point")
for i in range(len(conn)):
    origin = conn.loc[i, 'A']
    destin = conn.loc[i, 'B']
    lat = point.loc[[origin, destin], 'Lat']
    lon = point.loc[[origin, destin], 'Lon']
    plt.plot(lat, lon, c='red', lw=widths[i])
# Plot points
plt.plot(point.Lat, point.Lon, ls='', marker='o', c='blue')
plt.show()
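The coordinate lookup described in the question can also be done with merges rather than a loop. The following is a minimal sketch, not the answerer's method: the DataFrames are rebuilt from the sample tables, and the names origins, dests, links and the lat_a/lon_a/lat_b/lon_b columns are assumptions introduced for illustration. An inner merge silently drops links whose origin or destination has no row in point (points 5, 6 and 7 in the sample data), whereas a label lookup with point.loc raises a KeyError for them in recent pandas versions.

```python
import pandas as pd

# Sample data copied from the question
conn = pd.DataFrame({"A": [1, 2, 5, 6],
                     "B": [3, 7, 20, 2],
                     "relevance": [0.7, 0.1, 2.0, 7.0]})
point = pd.DataFrame({"Point": [1, 2, 3, 20],
                      "Lat": [45.3, 34.4, 40.2, 40.4],
                      "Lon": [-65.2, -60.2, -60.1, -63.1]})

# Hypothetical helper frames: one copy of "point" keyed as origin, one as destination
origins = point.rename(columns={"Point": "A", "Lat": "lat_a", "Lon": "lon_a"})
dests = point.rename(columns={"Point": "B", "Lat": "lat_b", "Lon": "lon_b"})

# Inner merges keep only links whose two endpoints both have coordinates
links = conn.merge(origins, on="A", how="inner").merge(dests, on="B", how="inner")
print(links)
```

Each row of links then holds both endpoints of one line, so it can be drawn with plt.plot([row.lat_a, row.lat_b], [row.lon_a, row.lon_b], lw=...) while looping over links.itertuples().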

Related

Extracting ID and Relevant data from a csv dataset in python

I am making a program for my final year project.
The program takes the longitude and latitude coordinates from a .csv dataset and plots them on a map.
The issue I am having is that there are multiple IDs, which together total 445,000+ points.
How can I refine it so the program can differentiate between the IDs?
def create_image(self, color, width=2):
    # Creates an image that contains the map and the GPS record
    # color = color of the GPS line
    # width = width of the GPS line
    data = pd.read_csv(self.data_path, header=0)
    # print a summary of the loaded columns
    data.info()
    self.result_image = Image.open(self.map_path, 'r')
    img_points = []
    gps_data = tuple(zip(data['latitude'].values, data['longitude'].values))
    for d in gps_data:
        x1, y1 = self.scale_to_img(d, (self.result_image.size[0], self.result_image.size[1]))
        img_points.append((x1, y1))
    draw = ImageDraw.Draw(self.result_image)
    draw.line(img_points, fill=color, width=width)
I have also attached the GitHub project. The program works, but I am trying to limit how many users it plots at once.
Thanks in advance.

To select a specific ID you can create a boolean filter. For this dataframe
   long  lat     ID
0    10    5  test1
1    15   20  test2
you could do the following:
id_filt = df_data['ID'] == 'test1'
This mask keeps only the entries of the dataframe that have the ID 'test1':
df_data[id_filt]
   long  lat     ID
0    10    5  test1
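When every ID needs its own line rather than just one, grouping may be more convenient than building one filter per ID. This is a hedged sketch, assuming the same column names as the example dataframe above; the third row and the tracks dictionary are made up for illustration.

```python
import pandas as pd

# Example frame from the answer, plus one extra (made-up) row for 'test1'
df_data = pd.DataFrame({"long": [10, 15, 12],
                        "lat": [5, 20, 7],
                        "ID": ["test1", "test2", "test1"]})

# Collect the (lat, long) pairs of each ID separately
tracks = {}
for track_id, group in df_data.groupby("ID"):
    tracks[track_id] = list(zip(group["lat"].tolist(), group["long"].tolist()))

print(tracks)
```

Each entry of tracks could then be scaled and passed to its own draw.line call, optionally restricted to a chosen subset of IDs.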

Pandas stacked bar plotting with different shapes

I'm currently experimenting with pandas and matplotlib.
I have created a Pandas dataframe which stores data like this:
cmc|coloridentity
1 | G
1 | R
2 | G
3 | G
3 | B
4 | B
What I want to do now is make a stacked bar plot that shows how many entries exist per cmc, with the counts for each coloridentity stacked on top of each other.
My thoughts so far:
# get all unique values of coloridentity
unique_values = df['coloridentity'].unique()
# Create two dictionaries. One for the number of entries per cost and one
# to store the different costs for each color
color_dict_values = {}
color_dict_index = {}
for u in unique_values:
    temp_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.array(temp_df)
    color_dict_index[u] = temp_df.index.to_numpy()
width = 0.4
p1 = plt.bar(color_dict_index['G'], color_dict_values['G'], width, color='g')
p2 = plt.bar(color_dict_index['R'], color_dict_values['R'], width,
             bottom=color_dict_values['G'], color='r')
plt.show()
But this gives me an error: in the line where I set the bottom of the second plot to the values of the first, the two arrays have different numpy shapes.
Does anyone know a solution? I thought of adding 0 values so that the shapes match, but I don't know whether this is the best solution and, if so, what the best way to do it would be.

Working with a fixed index (the range of cmc values) makes things easier. That way, the color_dict_values of a color_id give a count for each of the possible cmc values (the count stays zero when there are none).
The color_dict_index isn't needed any more. To fill in the color_dict_values, we iterate through the temporary dataframe with the value_counts.
To plot the bars, the x-axis is now the range of possible cmc values. I added [1:] to each array to skip the zero at the beginning which would look ugly in the plot.
The bottom starts at zero, and gets incremented by the color_dict_values of the color that has just been plotted. (Thanks to numpy, the constant 0 added to an array will be that array.)
In the code I generated some random numbers similar to the format in the question.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
N = 50
df = pd.DataFrame({'cmc': np.random.randint(1, 10, N), 'coloridentity': np.random.choice(['R', 'G'], N)})
# get all unique values of coloridentity
unique_values = df['coloridentity'].unique()
# find the range of all cmc indices
max_cmc = df['cmc'].max()
cmc_range = range(max_cmc + 1)
# dictionary for each coloridentity: array of values of each possible cmc
color_dict_values = {}
for u in unique_values:
    value_counts_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.zeros(max_cmc + 1, dtype=int)
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for ind, cnt in value_counts_df.items():
        color_dict_values[u][ind] = cnt
width = 0.4
bottom = 0
for col_id, col in zip(['G', 'R'], ['limegreen', 'crimson']):
    plt.bar(cmc_range[1:], color_dict_values[col_id][1:], bottom=bottom, width=width, color=col)
    bottom += color_dict_values[col_id][1:]
plt.xticks(cmc_range[1:])  # make sure every cmc gets a tick label
plt.tick_params(axis='x', length=0)  # hide the tick marks
plt.xlabel('cmc')
plt.ylabel('count')
plt.show()
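A different route to the same zero-filled table of counts, not the one used in the answer above, is to let pandas build it with pd.crosstab. This is a sketch using the six rows from the question; the variable name counts is an assumption.

```python
import pandas as pd

# The six rows shown in the question
df = pd.DataFrame({"cmc": [1, 1, 2, 3, 3, 4],
                   "coloridentity": ["G", "R", "G", "G", "B", "B"]})

# Rows: cmc values, columns: coloridentity, cells: number of entries (0 where none exist)
counts = pd.crosstab(df["cmc"], df["coloridentity"])
print(counts)
```

Because every column of counts has the same length, the shape mismatch from the question cannot occur, and counts.plot.bar(stacked=True) draws the stacked bars directly when matplotlib is installed.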

How can I auto-adjust my scatterplot labels without them being overlapped by other labels in python?

I have been working on this for a bit and wanted to see if someone could look at why I can't auto-adjust my scatterplot labels. While searching for a solution I came across the adjustText library (https://github.com/Phlya/adjustText). It seems like it should work, but I can't find an example that plots from a dataframe, and when I tried replicating the adjustText examples I got an error. This is my current code:
df["category"] = df["category"].astype(int)
df2 = df.sort_values(by=['count'], ascending=False).head()
ax = df.plot.scatter(x="category", y="count")
a = df2['category']
b = df2['count']
texts = []
for xy in zip(a, b):
    texts.append(plt.text(xy))
adjust_text(texts, arrowprops=dict(arrowstyle="->", color='r', lw=0.5))
plt.title("Count of {column} in {table}".format(**sql_dict))
But then I got this error: TypeError: text() missing 2 required positional arguments: 'y' and 's'. Before that, I had used the version below to label the coordinates. It works, but the labels overlap.
df["category"] = df["category"].astype(int)
df2 = df.sort_values(by=['count'], ascending=False).head()
ax = df.plot.scatter(x="category", y="count")
a = df2['category']
b = df2['count']
for xy in zip(a, b):
    ax.annotate('(%s, %s)' % xy, xy=xy)
As you can see, my df is constructed from tables in SQL. This specific table compares the length of stay in days with how many people stayed that long.
I made the second dataframe above so that only the highest values are labelled on the plot. This is one of my first experiences with graphing visualizations in Python, so any help would be appreciated.
[Figure: picture of a graph of overlapping items]
A sample of the data may look like this:
los_days    count
3           350
1           4000
15          34
and so forth. Thanks so much. Let me know if you need anything else.
Here is an example of the df
category count
0 2 29603
1 4 33980
2 9 21387
3 11 17661
4 18 10618
5 20 8395
6 27 5293
7 29 4121

After some reverse engineering with an example from adjustText library and my own example, I just had to change my for loop to create the labels and it worked fantastically.
labels = ['{}'.format(i) for i in zip(a, b)]
texts = []
for x, y, text in zip(a, b, labels):
    texts.append(ax.text(x, y, text))
adjust_text(texts, force_text=0.05, arrowprops=dict(arrowstyle="-|>",
                                                    color='r', alpha=0.5))

How to generate a heat map with 3 dimensional scattered data in python?

I have a dataframe with one column for latitude, one for longitude, and one for mm. How can I plot a heat map using latitude and longitude as the grid and mm as the heat value? The mm values are not gridded. For example:
lat = [1,1,2,2]
lon = [1,2,1,2]
mm = [1,2,3,4]
Or, I guess, I want to ask how to turn these three lists into a grid:
     1    2
1    1    3
2    2    4

If there aren't too many points to worry about (less than ~100,000), it should be enough to just map latitude/longitude tuples to a heat-map value, like so:
lat = [1,1,2,2]
lon = [1,2,1,2]
mm = [1,2,3,4]
heat_map = dict()
for latitude, longitude, heat_value in zip(lat, lon, mm):
    heat_map[(latitude, longitude)] = heat_value
print(heat_map)
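To get from the three lists to the grid shown in the question, one option is a pandas pivot. This is a minimal sketch rather than part of the answer above; it assumes each (lat, lon) pair occurs only once, since DataFrame.pivot raises an error on duplicate pairs, and the names df and grid are assumptions.

```python
import pandas as pd

lat = [1, 1, 2, 2]
lon = [1, 2, 1, 2]
mm = [1, 2, 3, 4]

df = pd.DataFrame({"lat": lat, "lon": lon, "mm": mm})

# Rows are longitudes and columns are latitudes, matching the grid in the question
grid = df.pivot(index="lon", columns="lat", values="mm")
print(grid)
```

The resulting grid can be handed to a plotting function such as matplotlib's plt.pcolormesh(grid.columns, grid.index, grid.values). Pairs that are missing from the data show up as NaN cells.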

Python Histogram using matplotlib

I've written a python script that parses a trace file and retrieves a list of objects (vehicle objects) containing the vehicle id, timestep and the number of other vehicles in radio range of a particular vehicle for that timestep:
for d_obj in global_list_of_nbrs:
    print(d_obj.id, d_obj.time, d_obj.num_nbrs, sep="\t")
The sample output from the test file I am using is:
0 0 1
0 1 2
0 2 0
1 0 1
1 1 2
2 0 0
2 1 2
This can be interpreted as vehicle with id 0 at timestep 0 has 1 neighbouring vehicle, vehicle with id 0 at timestep 1 has 2 neighbouring vehicles (i.e. in radio range) etc.
I would like to plot a histogram using matplotlib to represent this data but am confused what I should do with bins etc and how I should represent the list (currently a list of objects).
Can anyone please advise on this?
Many thanks in advance.

Here's an example of something you might be able to do with this data set:
Note: You'll need to install pandas (along with numpy and matplotlib) for this example to work.
from numpy.random import randint, poisson
from pandas import DataFrame
from matplotlib.pyplot import subplots, show

# simulate 10000 observations for 3 vehicle IDs
n = 10000
id_col = randint(3, size=n)
lam = 10
num_nbrs = poisson(lam, size=n)
d = DataFrame({'id': id_col, 'num_nbrs': num_nbrs})
fig, axs = subplots(2, 3, figsize=(12, 4))
def plotter(ax, grp, grp_name, method, **kwargs):
    # call ax.plot or ax.hist on the group's neighbor counts
    getattr(ax, method)(grp.num_nbrs.values, **kwargs)
    ax.set_title('ID: %i' % grp_name)
gb = d.groupby('id')
for row, method in zip((0, 1), ('plot', 'hist')):
    for ax, (grp_name, grp) in zip(axs[row].flat, gb):
        plotter(ax, grp, grp_name, method)
show()
What I've done is created 2 plots for each of 3 IDs. The top row shows the number of neighbors as a function of time for each ID. The bottom row shows the distribution of the number of neighbors across time.
You'll probably want to play around with sharing axes, axes labelling and all the other fun things that matplotlib offers.
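For the question about bins, here is a small sketch that works directly on the list of objects from the question. The Vehicle namedtuple is a hypothetical stand-in for the asker's vehicle class, filled with the sample output shown in the question. Since num_nbrs is a small integer, one bin per integer value, centred on that value, is a reasonable starting point.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the vehicle objects in the question
Vehicle = namedtuple("Vehicle", ["id", "time", "num_nbrs"])
global_list_of_nbrs = [
    Vehicle(0, 0, 1), Vehicle(0, 1, 2), Vehicle(0, 2, 0),
    Vehicle(1, 0, 1), Vehicle(1, 1, 2),
    Vehicle(2, 0, 0), Vehicle(2, 1, 2),
]

# Flatten the objects into the one list of values the histogram needs
num_nbrs = [v.num_nbrs for v in global_list_of_nbrs]

# How often each neighbour count occurs
counts = Counter(num_nbrs)

# Bin edges at -0.5, 0.5, 1.5, ... so each integer gets its own bin
bins = [b - 0.5 for b in range(max(num_nbrs) + 2)]
print(counts, bins)
```

With matplotlib available, plt.hist(num_nbrs, bins=bins) draws one bar per neighbour count from these edges.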
