How to plot a CDF based on two selected pandas series - python

Background
I have a dataframe containing three variables:
city: the city names within China.
pop: the population number of the corresponding city.
conc: the concentration of ambient pollutant of the corresponding city.
I want to investigate the cumulative distribution of the concentration by the population.
The sample figure is shown like this:
The sample dataset is uploaded here
My solution
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("./data/test.csv")
df = df[df.columns[1:]]
df = df.sort_values(by=['pm25'], ascending=False)
df = df.reset_index()
x_ = df['pm25'].values
y_ = []
for i in range(0, len(df) - 1, 1):
    y_.append(df['pop'].iloc[:i+1].sum() / df['pop'].sum())
y_.append(1.0)
plt.plot(x_, y_)
1. Any better method is highly appreciated!
2. Also, how can I make the curve as smooth as in the first plot?

You can replace the loop with pd.Series.cumsum. Note that attribute access df.pop won't work here, because pop is a DataFrame method, so use bracket indexing:
y_ = df['pop'].cumsum() / df['pop'].sum()
For smoothing, you can use pd.Series.rolling:
plt.plot(x_, y_.rolling(3).mean())
which applies a low-pass filter (a moving average of length 3). You should consider whether that is what you want, however - your plot seems correct as it is.
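Putting it together, a minimal sketch with made-up city data (column names pm25 and pop as in the question, standing in for the uploaded CSV):

```python
import pandas as pd

# Hypothetical sample data standing in for the uploaded test.csv
df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "pm25": [80, 60, 40, 20],
    "pop": [100, 300, 400, 200],
})

# Sort by concentration (descending), then accumulate the population share
df = df.sort_values(by="pm25", ascending=False).reset_index(drop=True)
x_ = df["pm25"].values
y_ = df["pop"].cumsum() / df["pop"].sum()

print(list(y_))  # the last value is 1.0 by construction
```

Plotting x_ against y_ then gives the same curve as the loop version, without the off-by-one bookkeeping.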

Related

bin value of histograms from grouped data

I am a beginner in Python and I am making separate histograms of travel distance per departure hour. The data I'm using is about 2500 rows like this. Distance is float64, Departuretime is str. However, for further calculations I'd like to have the value of each bin in a histogram, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color = 'red',
edgecolor = 'black',figsize=(15,15),sharex=True,density=True)
This creates, in my case, a figure with 21 small histograms (the histogram output I'm receiving).
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With single histograms, I'd paste counts, bins, bars = in front of the entire line and the variable counts would contain the data I was looking for; however, in this case it does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!
First of all, note that the bins used in the different histograms you are generating don't have the same edges (you can see this because, with sharex=True, the resulting bars don't have the same width); in all cases you get 10 bins (the default), but not the same 10 bins.
This makes it impossible to combine them all in a single table in any meaningful way. You could provide a fixed list of bin edges as the bins parameter to standardize this.
Alternatively, I suggest you calculate a new column that describes which bin each row belongs to; this way we also unify the bin calculation.
You can do this with the cut function, which also gives you the same freedom to choose the number of bins or the specific bin edges the same way as with hist.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then, you can use pivot_table to obtain a table of counts with DistanceBin as rows and Departuretime as columns, as you asked (passing a values column so the count appears once rather than once per remaining column):
df.pivot_table(index='DistanceBin', columns='Departuretime', values='Distance', aggfunc='count')
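A runnable sketch of the cut-then-pivot approach on made-up travel data (the column names Distance and Departuretime match the question; the values themselves are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the travel data: Distance (float) per departure hour (str)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Distance": rng.uniform(0, 50, size=100),
    "Departuretime": rng.choice(["07", "08", "09"], size=100),
})

# Shared bin edges across all departure hours
df["DistanceBin"] = pd.cut(df["Distance"], bins=10)

# Rows: distance bins, columns: departure hours, cells: counts
table = df.pivot_table(index="DistanceBin", columns="Departuretime",
                       values="Distance", aggfunc="count", observed=False)
print(table)
```

Because every row now refers to the same 10 bin intervals, the columns are directly comparable, which was not true of the per-hour histograms.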

Averaging a certain number of values from a data and plotting it wrt another set of data

I have run a simulation and got a set of data. It consists of three rows: row 1 contains time, row 2 contains energy values, and row 3 a specific wavelength.
Now for every wavelength value there are 10 energy values, and likewise for each energy value there is a time.
Now suppose I have 10 wavelengths, for which I have 10*10 = 100 energy values. What I want to do is write a code which first averages the energy values for a specific wavelength and then plots the average energy vs wavelength.
I have been stuck for almost a week; any help would be much appreciated.
I am not exactly sure if this is what you are looking for, if not, give an example of your data.
# Dummy data
energy = list(range(0,100))
wavelength = list(range(0,10))
# Compute how many energy values for each wavelength
k = int(len(energy)/len(wavelength))
# Compute average energy for each block of k values
energy_avg = [sum(energy[i:i+k])/k for i in range(0, len(energy), k)]
# Plot
import matplotlib.pyplot as plt
plt.plot(wavelength, energy_avg, '.')
plt.xlabel('wavelength')
plt.ylabel('average energy')
plt.show()
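If the data really is a flat array with a fixed block of k energy values per wavelength, the same averaging can also be done in one step with a NumPy reshape (this assumes len(energy) is an exact multiple of len(wavelength)):

```python
import numpy as np

energy = np.arange(100, dtype=float)  # dummy data, as above
wavelength = np.arange(10)

k = energy.size // wavelength.size    # energy values per wavelength
energy_avg = energy.reshape(-1, k).mean(axis=1)

print(energy_avg)  # one average per wavelength
```

The reshape view groups each consecutive block of k values into a row, so mean(axis=1) averages exactly the values belonging to one wavelength.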

Cleaning up x-axis because there are too many datapoints

I have a data set that is like this
Date     Time     Cash
1/1/20   12:00pm  2
1/1/20   12:02pm  15
1/1/20   12:03pm  20
1/1/20   15:06pm  30
2/1/20   11:28am  5
...      ...      ...
3/1/20   15:00pm  3
I basically grouped all the data by date along the y-axis and time along the x-axis, and plotted a facetgrid as shown below:
df_new= df[:300]
g = sns.FacetGrid(df_new.groupby(['Date','Time']).Cash.sum().reset_index(), col="Date", col_wrap=3)
g = g.map(plt.plot, "Time", "Cash", marker=".")
g.set_xticklabels(rotation=45)
What I got back was hideous (as shown below). So I'm wondering, is there any way to tidy up the x-axis? Maybe having 5-10 time labels so the times are visible, or maybe expanding the image?
Edit: I am plotting using seaborn. I will want it to look something like that below where the x-axis has only a couple of labels:
Thanks for your inputs.
Have you tried using a moving average instead of the actual data? You can compute the moving average of any data with the following function:
import numpy as np

def moving_average(a, n=10):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
Set n to the window length you need; you can play around with that value. a is, in your case, the variable Cash represented as a numpy array.
After that, set the column Cash to the moving average computed from the real values and plot it. The plotted curve will be smoother.
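A quick usage sketch of that function on dummy values (the real Cash column would be passed as df['Cash'].to_numpy()):

```python
import numpy as np

def moving_average(a, n=10):
    # Cumulative-sum trick: difference of cumsums n apart gives window sums
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

cash = np.array([2, 15, 20, 30, 5, 3], dtype=float)
smoothed = moving_average(cash, n=3)
print(smoothed)  # len(cash) - n + 1 values, each the mean of a 3-wide window
```

Note the output is shorter than the input by n - 1 samples, so the x values used for plotting must be trimmed to match.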
P.S. The plot of suicides you added in the edit is really unreadable, as the y-axis range is far larger than needed. In practice, try to avoid such plots.
Edit
I did not notice at first how you aggregate the data; you might want to work with date and time merged into one column. I do not know where you load the data from; if you read it from a csv, you can pass parse_dates=[['Date', 'Time']] to read_csv. If not, you can build it in the dataframe:
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
This creates a new datetime column you can work with directly.
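A minimal sketch of that merge, plus one way (my suggestion, not part of the answer above) to thin out the x-axis ticks using matplotlib's date locators; the date format string is an assumption based on the sample rows:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

df = pd.DataFrame({
    "Date": ["1/1/20", "1/1/20", "1/1/20", "2/1/20"],
    "Time": ["12:00pm", "12:02pm", "12:03pm", "11:28am"],
    "Cash": [2, 15, 20, 5],
})
# Merge Date and Time into one datetime column (day/month/year assumed)
df["datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%d/%m/%y %I:%M%p")

fig, ax = plt.subplots()
ax.plot(df["datetime"], df["Cash"], marker=".")
# Cap the number of tick labels so the axis stays readable
locator = mdates.AutoDateLocator(maxticks=5)
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
fig.autofmt_xdate()
```

With a proper datetime axis, matplotlib spaces the points by actual elapsed time instead of treating each timestamp as an evenly spaced category, which is usually what makes a crowded string-labelled axis unreadable.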

Concise method to get corresponding columns for a query based selection in pandas

Current plot and anticipated plot
I'm new to python. I'm trying to get a subset of the housing index dataset from https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb
I have imported the dataset as 'housing'. I am trying to plot just the outliers above the 0.95 quantile on top of the plot which shows all the values for median_house_value
import matplotlib.image as mpimg
housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1)
this gets a plot of all the rows (i); I am trying to select the corresponding median_income rows for the subset of median_house_value above the 0.95 quantile and plot them over the top in orange (j)
Below is my best attempt so far, which is not getting the correct values
plt.plot(housing.groupby('median_house_value').quantile(q=quant)["median_income"], housing.groupby('median_house_value').quantile(q=quant).index.get_level_values('median_house_value'),"or")
I can get the median_house_value rows in the quantile by doing..
quantile = int(round(housing["median_house_value"].quantile(q=0.95)))
housing.median_house_value > quantile
I want to end up with two pandas arrays: one for the x axis, an array of median_income rows, corresponding to the second array, the median_house_value rows that make up the quantile
Thanks in advance.
IIUC - Simply filter your main dataset since you have a boolean index: housing["median_house_value"] > quantile.
# REQUIRED THRESHOLD
quantile = int(round(housing["median_house_value"].quantile(q=0.95)))
# FILTER BY BOOLEAN
upper_housing = housing[housing["median_house_value"] > quantile]
# PLOTTING (pass the first Axes to the second call so both land on one figure)
ax = housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1, c='blue')
upper_housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1, c='red', ax=ax)
plt.show()
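A self-contained sketch of the threshold-and-filter step on made-up data (the column names match the housing dataset; the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing frame
rng = np.random.default_rng(1)
housing = pd.DataFrame({
    "median_income": rng.uniform(0, 10, 1000),
    "median_house_value": rng.uniform(50_000, 500_000, 1000),
})

# Threshold at the 0.95 quantile, then keep only the rows above it
quantile = housing["median_house_value"].quantile(q=0.95)
upper_housing = housing[housing["median_house_value"] > quantile]

print(len(upper_housing))  # roughly 5% of the rows
```

Because boolean indexing keeps whole rows, upper_housing carries the matching median_income values along automatically; no separate lookup is needed for the x axis.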

Finding the extrapolation from the structure of one dataset and applying it to the other

I have two filters, shown in the plot. They are supposed to have the same structure, but they are scaled differently, and the data for the top filter is truncated before 10000. I just set the value to zero at 10000, but I would like to extrapolate the top filter so it follows the structure of the bottom filter. The data for each filter is provided in the links. I don't know how to obtain the tail structure from the data in the bottom filter and apply it to the top one, considering they have been scaled differently. Note that I need to use the upper-panel filter because my other filters are calibrated accordingly.
I can obtain the interpolation for the lower filter using interp1d, but I don't know how to rescale it properly so it can be used for the top filter.
from scipy.interpolate import interp1d
import numpy as np

u = np.loadtxt('WFI_I.res')
f = interp1d(u[:, 0], u[:, 1])
x = np.arange(7050, 12000)
y = f(x)
I will be grateful for any suggestion or code to do that.
Assuming that you have two filter arrays with y values of filter1 and filter2 and x (wavelength) values of wave1 and wave2, then something like this should work (untested though):
import numpy as np

wave_match = 9500  # wavelength at which to match the two filters
index1 = np.searchsorted(wave1, wave_match)
index2 = np.searchsorted(wave2, wave_match)
# Scale factor that makes the two curves agree at the matching point
match1 = filter1[index1]
match2 = filter2[index2]
scale = match1 / match2
# Splice: first filter up to the match point, rescaled second filter beyond it
wave12 = np.concatenate([wave1[:index1], wave2[index2:]])
filter12 = np.concatenate([filter1[:index1], scale * filter2[index2:]])
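A quick self-contained check of that splice logic on two synthetic filter curves (same shape, different scale, the top one truncated; everything here is made up for illustration):

```python
import numpy as np

# Two filter curves sharing one shape: filter2 is filter1 scaled by 0.5
wave1 = np.linspace(7000, 10000, 301)  # top filter, truncated at 10000
wave2 = np.linspace(7000, 12000, 501)  # bottom filter, full coverage
shape = lambda w: np.exp(-((w - 9000) / 800.0) ** 2)
filter1 = shape(wave1)
filter2 = 0.5 * shape(wave2)

wave_match = 9500  # match the curves where both are still well sampled
index1 = np.searchsorted(wave1, wave_match)
index2 = np.searchsorted(wave2, wave_match)
scale = filter1[index1] / filter2[index2]

wave12 = np.concatenate([wave1[:index1], wave2[index2:]])
filter12 = np.concatenate([filter1[:index1], scale * filter2[index2:]])
```

Since the two curves differ only by a constant factor here, the spliced result reproduces the top filter's shape all the way out to 12000, which is exactly the behaviour wanted for the truncated tail.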
