Python stats and visualization - python

I am new to Python and am currently working on a set of real estate data from Redfin.
Currently my data looks like this:
There are many different neighborhoods in the dataset. I would like to:
- get the average homes_sold per month (the date field was cut out of the screenshot) per neighborhood
- graph the above using only the neighborhoods I wish to use (about 4).
Any help is greatly appreciated.

As I understand it, you have several homes-sold-per-month values for each neighborhood and you want to average them. If so, try this code (substitute your own data):
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Jupyter-only magic; remove the next line when running as a plain script
%matplotlib inline
data = pd.DataFrame({'neighborhood':['n1','n1','n2','n3','n3','n4','n5'],'homes_sold per month':[5,7,2,6,4,1,5],'something_else':[5,3,3,5,5,5,5]})
neighborhoods_to_plot = ['n1','n2','n4','n5'] #provide here a list you want to plot
plot = pd.DataFrame()
for n in neighborhoods_to_plot:
    plot.at[n,'homes_sold per month'] = data.loc[data['neighborhood']==n]['homes_sold per month'].mean()
plot.index.name = 'neighborhood'
plt.figure(figsize=(4,3),dpi=300,tight_layout=True)
sns.barplot(x=plot.index,y=plot['homes_sold per month'],data=plot)
plt.savefig('graph.png', bbox_inches='tight')

Okay, so I am going to assume that you are using Pandas and Matplotlib to handle this data. To get the average number of homes sold per month for each neighborhood, you just need to do:
import pandas as pd
mean_number_of_homes_sold = data[['neighborhood','homes_sold']].groupby('neighborhood').agg('mean')
To plot only the neighborhoods you want, you will need something like this:
import pandas as pd
import matplotlib.pyplot as plt
#fill this list with strings representing the names of the data you need plotted
neighborhoods_to_plot = ['Albany Park', 'Tinley Park']
data_to_graph = data[data.neighborhood.isin(neighborhoods_to_plot)]
fig, ax = plt.subplots()
# pass ax=ax so the scatter is drawn on the figure that is titled and saved below
data_to_graph.plot(kind='scatter', x='avg_sale_to_list', y='inventory_mom', ax=ax)
ax.set(title='Relationship between time to sale from listing and inventory momentum for selected neighborhoods')
fig.savefig('neighborhood.png', transparent=False, dpi=300, bbox_inches="tight")
You can obviously change which data is graphed or the type of graph but this should give you a decent starting point.
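As a rough sketch of the two steps together (monthly average per neighborhood, then a plot of a chosen subset), something like the following may help. The column names `date` and `homes_sold` and all sample values are assumptions made up for illustration, since the real columns were cut off in the screenshot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data: column names and values are assumptions
data = pd.DataFrame({
    "neighborhood": ["Albany Park", "Albany Park", "Albany Park",
                     "Tinley Park", "Tinley Park", "Loop"],
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10",
                            "2020-01-15", "2020-02-15", "2020-01-10"]),
    "homes_sold": [4, 6, 8, 3, 5, 10],
})

# Keep only the neighborhoods of interest
neighborhoods_to_plot = ["Albany Park", "Tinley Park"]
selected = data[data["neighborhood"].isin(neighborhoods_to_plot)]

# Mean homes_sold per neighborhood and calendar month,
# reshaped to one column per neighborhood
monthly_mean = (
    selected.groupby(["neighborhood", selected["date"].dt.to_period("M")])["homes_sold"]
    .mean()
    .unstack("neighborhood")
)

# One group of bars per month, one bar per neighborhood
fig, ax = plt.subplots()
monthly_mean.plot(kind="bar", ax=ax)
ax.set(xlabel="month", ylabel="average homes_sold")
```

Grouping by the month period as well as the neighborhood is what keeps the per-month detail; grouping by neighborhood alone would collapse all months into one number.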

Related

Plot multi categorical data in Python

Month,Cluster,Count
7,Linux,42
7,Linux,56
7,Pct,6
7,Pct(C),11
7,Memory,28
10,Latency,73
10,Linux,47
10,Pct,21
10,Pct(C),18
10,Swap,41
10,Protection ,509
I need to compare months 7 and 10 and plot the clusters for each month. How can I visualize this data in Python? I need to distinguish the cluster counts for the two months.
You only have one observation each for Latency, Memory and Swap, so you can't plot a line for the change in these, but you could combine a scatter plot with a line plot like so:
import pandas as pd
import io
strdata = '''
Month,Cluster,Count
7,Linux,42
7,Linux,56
7,Pct,6
7,Pct(C),11
7,Memory,28
10,Latency,73
10,Linux,47
10,Pct,21
10,Pct(C),18
10,Swap,41
'''
df = pd.read_csv(io.StringIO(strdata),sep=",")
df.drop_duplicates(subset=['Month','Cluster']).set_index("Month").groupby("Cluster")["Count"].plot(legend=True, marker=".")
df.set_index("Month").groupby("Cluster")["Count"].plot(legend=True, style=".")
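Another option, sketched here under assumptions, is a grouped bar chart with one bar per month for every cluster. It assumes that duplicate rows (the two Linux rows in month 7) should be summed and that the trailing space in "Protection " is unintentional; neither is confirmed by the question.

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

strdata = """Month,Cluster,Count
7,Linux,42
7,Linux,56
7,Pct,6
7,Pct(C),11
7,Memory,28
10,Latency,73
10,Linux,47
10,Pct,21
10,Pct(C),18
10,Swap,41
10,Protection ,509
"""

df = pd.read_csv(io.StringIO(strdata))
df["Cluster"] = df["Cluster"].str.strip()  # assumption: trailing space is a typo

# Rows = clusters, columns = months; duplicates are summed (an assumption),
# and clusters missing from a month get 0
totals = df.pivot_table(index="Cluster", columns="Month", values="Count",
                        aggfunc="sum", fill_value=0)

fig, ax = plt.subplots()
totals.plot(kind="bar", ax=ax)
ax.set(ylabel="Count", title="Cluster count by month")
```

Because Protection (509) is much larger than the other counts, a logarithmic y axis or a separate plot for that cluster may make the smaller bars easier to compare.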

Unable to draw KDE on python

I've created a Brownian motion and then I have taken the last values of 1000 entries repeated 10000 times. I was able to plot the histogram using the following code as follows:
import seaborn as sns
import matplotlib.pyplot as plt
# BM represents the list of values generated by the Brownian motion
fig, (ax1,ax2) = plt.subplots(2)
ax1.hist(BM[:,-1],12)
I've been able to draw the KDE as follows, but I am unable to merge the two diagrams. Can someone please help me?
sns.kdeplot(data=BM[:,-1])
Try with sns.kdeplot(BM['col1']) where 'col1' is the name of the column you want to plot.
I'll give you a reproducible example that works for me.
import seaborn as sns
import pandas as pd
import numpy as np
BM = pd.DataFrame(np.array([-0.00871515, -0.0001227 , -0.01449098, 0.01808527, 0.00074193, 0.01145541]),
                  columns=['col1'])
BM.head(2)
col1
0 -0.008715
1 -0.000123
sns.kdeplot(BM['col1'])
Edit based on your additional question:
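To get the histogram and a KDE plot together, use this:
# distplot is deprecated since seaborn 0.11; sns.histplot(BM['col1'], kde=True) is the newer equivalent
sns.distplot(BM['col1'])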

matplotlib: Changing x limit dates

I would like to be able to change the x limits so it shows a time frame of my choice.
reproducible example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# libraries for Data Download
import datetime # if you want to play with the dates
from pandas_datareader import data as pdr
import yfinance as yf
df = pdr.get_data_yahoo('ETH-USD', interval = '1d', period = "5y")
plt.figure(figsize=(24,10), dpi=140)
plt.grid()
df['Close'].plot()
df['Close'].ewm(span=50).mean().plot(c = '#4d00ff')
df['Close'].ewm(span=100).mean().plot(c = '#9001f0')
df['Close'].ewm(span=200).mean().plot(c = '#d102e8')
df['Close'].ewm(span=300).mean().plot(c = '#f101c2')
df['Close'].rolling(window=200).mean().plot(c = '#e80030')
plt.title('ETH-USD PLOT',fontsize=25, ha='center')
plt.legend(['C L O S E', 'EMA 50','EMA 100','EMA 200','EMA 300','MA 200', ])
# plt.xlim(['2016-5','2017-05']) # My attempt
plt.show()
When I uncomment the line above I get:
I would have liked '2016-5' to '2017-05' to take up the whole plot so that I can see more detail.
It seems to me that your xlim works well. However, if I understand your question correctly, you also need to adjust ylim to stretch the data vertically so that it fills the graph. Judging from your graph, (0,100) looks reasonable, since the data within the specified period does not seem to go past 100.
Try adding plt.ylim((0,100)) together with your commented-out line.
Output:
with your plt.xlim(['2016-5','2017-05']) and plt.ylim((0,100))
with your plt.xlim(['2016-5','2017-05']) and plt.ylim((0,40))
As you can see, because the data varies so much over the period, a fixed ylim may cut off data at later dates or give a less clear picture of the movement at earlier dates.
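Rather than guessing the y limit by eye, it can be computed from the data that falls inside the chosen dates. This is a minimal sketch on a synthetic, steadily growing series (a stand-in for the downloaded prices, so no network access is needed); the 5% headroom is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for df['Close']: five years of daily values
dates = pd.date_range("2015-08-01", periods=5 * 365, freq="D")
close = pd.Series(np.exp(np.linspace(0.0, 7.0, len(dates))), index=dates)

start, end = pd.Timestamp("2016-05-01"), pd.Timestamp("2017-05-01")
window = close.loc[start:end]  # only the rows inside the chosen period

fig, ax = plt.subplots()
ax.plot(close.index, close.values)
ax.set_xlim(start, end)
# Scale the y axis to the visible data, with 5% headroom
ax.set_ylim(0, window.max() * 1.05)
```

Plotting `window` alone instead of the full series would give a similar picture, because matplotlib then autoscales both axes to the sliced data.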

How to plot time series graph in jupyter?

I have tried to plot the data in order to achieve something like this:
But I could not and I just achieved this graph with plotly:
Here is the small sample of my data
Does anyone know how to achieve that graph?
Thanks in advance
You'll find a lot of good material on time series at plotly.ly/python. Still, I'd like to share some practical steps that I find very useful:
- organize your data in a pandas dataframe
- set up a basic plotly structure using fig=go.Figure(go.Scatter())
- make your desired additions to that structure using fig.add_traces(go.Scatter())
Code:
import plotly.graph_objects as go
import pandas as pd
import numpy as np
# random data or other data sources
np.random.seed(123)
observations = 200
timestep = np.arange(0, observations/10, 0.1)
dates = pd.date_range('1/1/2020', periods=observations)
val1 = np.sin(timestep)
val2=val1+np.random.uniform(low=-1, high=1, size=observations)#.tolist()
# organize data in a pandas dataframe
df= pd.DataFrame({'Timestep':timestep, 'Date':dates,
'Value_1':val1,
'Value_2':val2})
# Main plotly figure structure
fig = go.Figure([go.Scatter(x=df['Date'], y=df['Value_2'],
                            marker_color='black',
                            opacity=0.6,
                            name='Value 2')])
# One of many possible additions
fig.add_traces([go.Scatter(x=df['Date'], y=df['Value_1'],
                           marker_color='blue',
                           name='Value 1')])
# plot figure
fig.show()

Add date tickers to a matplotlib/python chart

I have a question that sounds simple, but it has been driving me mad for some days. I have a historical time series stored in two lists: the first list contains prices, let's say P = [1, 1.5, 1.3 ...], while the second list contains the related dates, let's say D = [01/01/2010, 02/01/2010...]. What I would like to do is show SOME of these dates as tick labels, with more detail appearing when you zoom in. I say "some" because the "best" result I have got so far shows all of them, which creates a black cloud of unreadable labels on the x-axis. This picture shows the automatic range that Matplotlib currently generates:
Instead of 0, 200, 400 etc. I would like to have the dates that correspond to the plotted data points. Moreover, when I zoom in I get the following:
Just as I get more detail between 0 and 200 (20, 40 etc.), I would like to get the corresponding dates from the list.
I'm sure this is a simple problem to solve but I'm new to Matplotlib as well as to Python and any hint would be appreciated. Thanks in advance
Matplotlib has sophisticated support for plotting dates. I'd recommend the use of AutoDateFormatter and AutoDateLocator. They are even locale-specific, so they choose month-names according to your locale.
import matplotlib.pyplot as plt
from matplotlib.dates import AutoDateFormatter, AutoDateLocator
xtick_locator = AutoDateLocator()
xtick_formatter = AutoDateFormatter(xtick_locator)
ax = plt.axes()
ax.xaxis.set_major_locator(xtick_locator)
ax.xaxis.set_major_formatter(xtick_formatter)
EDIT
For use with multiple subplots, use multiple locator/formatter pairs:
import datetime
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import AutoDateFormatter, AutoDateLocator, date2num
x = [datetime.datetime.now() + datetime.timedelta(days=30*i) for i in range(20)]
y = np.random.random((20))
for i in range(4):
    ax = plt.subplot(2,2,i+1)
    # create a separate locator/formatter pair for each subplot
    xtick_locator = AutoDateLocator()
    xtick_formatter = AutoDateFormatter(xtick_locator)
    ax.xaxis.set_major_locator(xtick_locator)
    ax.xaxis.set_major_formatter(xtick_formatter)
    ax.plot(date2num(x),y)
plt.show()
You can also make a time series plot with pandas.
For details, refer to http://pandas.pydata.org/pandas-docs/dev/timeseries.html and
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Series.plot.html
import pandas as pd
DateStrList = ['01/01/2010','02/01/2010']
P = [2,3]
D = pd.Series([pd.to_datetime(date) for date in DateStrList])
series =pd.Series(P, index=D)
pd.Series.plot(series)
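import matplotlib.pyplot as plt
import pandas
# pandas.TimeSeries no longer exists in current pandas; a Series with a datetime index does the same job
pandas.Series(P, index=D).plot()
plt.show()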
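Putting the pieces together for the two lists from the question, a minimal sketch could look like this. The sample values are invented, and the dates are assumed to be day-first strings (dd/mm/yyyy); swap the format string if they are month-first.

```python
import datetime as dt
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive zooming
import matplotlib.pyplot as plt
from matplotlib.dates import AutoDateLocator, ConciseDateFormatter

# Hypothetical sample data in the shape described in the question
P = [1, 1.5, 1.3, 1.7, 1.6]
D = ["01/01/2010", "02/01/2010", "03/01/2010", "04/01/2010", "05/01/2010"]

# Convert the strings to datetime objects (assumed day-first)
dates = [dt.datetime.strptime(s, "%d/%m/%Y") for s in D]

fig, ax = plt.subplots()
ax.plot(dates, P)  # real dates on x, so ticks are dates rather than 0, 200, 400

# The locator picks a sensible number of ticks and re-picks them on zoom
locator = AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(ConciseDateFormatter(locator))
fig.autofmt_xdate()  # rotate the labels so they do not overlap
```

The key step is passing actual datetime values as x instead of plotting P against its list position; the locator then limits the number of tick labels at every zoom level.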
