I often work with data that has a very 'long tail'. I want to plot histograms to summarize the distribution, but when I try using pandas I wind up with a bar graph that has one giant visible bar and everything else invisible.
Here is an example of the series I am working with. Since it's very long, I used value_counts() so it will fit on this page.
In [10]: data.value_counts().sort_index()
Out[10]:
0 8012
25 3710
100 10794
200 11718
300 2489
500 7631
600 34
700 115
1000 3099
1200 1766
1600 63
2000 1538
2200 41
2500 208
2700 2138
5000 515
5500 201
8800 10
10000 10
10900 465
13000 9
16200 74
20000 518
21500 65
27000 64
53000 82
56000 1
106000 35
530000 3
I'm guessing that the answer involves binning the less common results into larger groups somehow (53000, 56000, 106000, and 530000 into one group of >50000, etc.), and also changing the y-axis to show the percentage of occurrences rather than the absolute number. However, I don't understand how I would go about doing that automatically.
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
mydict = {0: 8012,25: 3710,100: 10794,200: 11718,300: 2489,500: 7631,600: 34,700: 115,1000: 3099,1200: 1766,1600: 63,2000: 1538,2200: 41,2500: 208,2700: 2138,5000: 515,5500: 201,8800: 10,10000: 10,10900: 465,13000: 9,16200: 74,20000: 518,21500: 65,27000: 64,53000: 82,56000: 1,106000: 35,530000: 3}
mylist = []
# expand each (value, count) pair back into individual observations
for key, count in mydict.items():
    mylist.extend([key] * count)
df = pd.DataFrame(mylist,columns=['value'])
df2 = df[df.value <= 5000]
Plot as a bar:
fig = df.value.value_counts().sort_index().plot(kind="bar")
plt.savefig("figure.png")
As a histogram (limited to values of 5000 and under, which covers >97% of your data):
I like using linspace to control buckets.
df2 = df[df.value <= 5000]
df2.hist(bins=np.linspace(0,5000,101))
plt.savefig('hist1')
EDIT: Changed np.linspace(0,5000,100) to np.linspace(0,5000,101) & updated histogram.
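The question also asks about showing percentages on the y-axis instead of raw counts. A minimal sketch of one way to do that, assuming the df2 defined above: give every observation a weight of 1/len(df2) so the bar heights sum to 1.
import numpy as np
from matplotlib import pyplot as plt

# Each observation contributes 1/len(df2), so bar heights read as
# fractions of the data rather than absolute counts.
ax = df2.value.hist(bins=np.linspace(0, 5000, 101),
                    weights=np.ones(len(df2)) / len(df2))
ax.set_ylabel('fraction of observations')
plt.savefig('hist_pct')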
I am trying to add to my several line plots a background that shows a range from value x (column "Min") to value y (column "Max") for each year. My dataset looks like this:
Country Model Year Costs Min Max
494 FR 1 1990 300 250 350
495 FR 1 1995 250 300 400
496 FR 1 2000 220 330 640
497 FR 1 2005 210 289 570
498 FR 2 1990 400 250 350
555 JPN 8 1990 280 250 350
556 JPN 8 1995 240 300 400
557 JPN 8 2000 200 330 640
558 JPN 8 2005 200 289 570
I used the following code:
example_1 = sns.relplot(data=example, x="Year", y="Costs", hue="Model",
                        style="Model", col="Country", kind="line", col_wrap=4,
                        height=4, dashes=True, markers=True,
                        palette=palette, style_order=style_order)
I would like something like this with the range being my "Min" and "Max" by year.
Is it possible to do it?
Thank you very much!
Usually, grid.map is the tool for this, as shown in many examples in the multi-plot grids tutorial. But you are using relplot to combine lineplot with a FacetGrid, as suggested in the docs (last example), which lets you use some extra styling parameters.
Because relplot processes the data a bit differently than if you first initiated a FacetGrid and then mapped a lineplot (you can check this with grid.data), using grid.map(plt.bar, ...) to plot the ranges is quite cumbersome, as it requires editing the grid.data dataframe as well as the x- and y-axis labels.
The simplest way to plot the ranges is to loop through the grid.axes. This can be done with grid.axes_dict.items() which provides the column names (i.e. countries) that you can use to select the appropriate data for the bars (useful if the ranges were to differ, contrary to this example).
The default figure legend does not contain the key for the ranges, but the legend stored in the first ax object does, so that one is displayed instead of the default legend in the following example. Note that I have edited the data you shared so that the min/max ranges make more sense:
import io
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import seaborn as sns # v 0.11.0
data ='''
Country Model Year Costs Min Max
494 FR 1 1990 300 250 350
495 FR 1 1995 250 200 300
496 FR 1 2000 220 150 240
497 FR 1 2005 210 189 270
555 JPN 8 1990 280 250 350
556 JPN 8 1995 240 200 300
557 JPN 8 2000 200 150 240
558 JPN 8 2005 200 189 270
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
# Create seaborn FacetGrid with line plots
grid = sns.relplot(data=df, x='Year', y='Costs', hue='Model', style='Model',
                   height=3.9, col='Country', kind='line', markers=True,
                   palette='tab10')
# Loop through axes of the FacetGrid to plot bars for ranges and edit x ticks
for country, ax in grid.axes_dict.items():
    df_country = df[df['Country'] == country]
    cost_range = df_country['Max'] - df_country['Min']
    ax.bar(x=df_country['Year'], height=cost_range, bottom=df_country['Min'],
           color='black', alpha=0.1, label='Min/max\nrange')
    ax.set_xticks(df_country['Year'])
# Remove default seaborn figure legend and show instead the full legend
# stored in the first ax (grab the title before removing the original)
legend_title = grid._legend.get_title().get_text()
grid._legend.remove()
grid.axes.flat[0].legend(bbox_to_anchor=(2.1, 0.5), loc='center left',
                         frameon=False, title=legend_title)
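If a continuous shaded band is preferred over bars, the same loop could use fill_between instead. A sketch under the same df and grid as above (drop_duplicates guards against repeated years when a country has several models):
for country, ax in grid.axes_dict.items():
    # shade the vertical span between Min and Max for each year
    df_country = df[df['Country'] == country].drop_duplicates('Year')
    ax.fill_between(df_country['Year'], df_country['Min'], df_country['Max'],
                    color='black', alpha=0.1, label='Min/max\nrange')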
I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The methods that come to mind are calculating the total number of users, working out 20% of this, sorting the dataframe with sort_values(), and then using head() or nlargest(), but I'd like to know if there is a simpler, more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have a dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
The top 2 users alone generate 23.3% of the total revenue.
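To actually select the rows that make up the top 20% of revenue, a sketch building on the frame above: keep each row whose preceding cumulative share is still below the threshold, so the row that crosses 20% is included.
# shift(fill_value=0) looks at the cumulative share *before* each row;
# the first row is always kept and the row crossing 20% is included.
top_20 = df[df['%revenue_cum'].shift(fill_value=0) < 0.20]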
This seems to be a case for df.quantile. Per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
A case example from your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id':[234,2873,827,12,8942],
'revenue':[100,200,489,237,28934]})
df.quantile([0.8,1],interpolation='nearest')
This would print the top 2 rows in value:
user_id revenue
0.8 2873 489
1.0 8942 28934
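If you want every row at or above that cutoff rather than just the quantile rows themselves, you could filter against the scalar quantile (a sketch using the same df):
# all users whose revenue is at or above the 80th percentile value
cutoff = df['revenue'].quantile(0.8, interpolation='nearest')
top_rows = df[df['revenue'] >= cutoff]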
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd
def n_percent_revenue_generating_users(df, col, n_percent):
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100 * df[f'{col}_cs'] / df[col].sum()
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp'] - n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)
Let's say I have an Excel document with the following format. I'm reading said Excel doc with pandas and plotting data using matplotlib and numpy. Everything is great!
Buttttt..... I want more constraints. Now I want to constrain my data so that I can filter for only specific zenith angles and azimuth angles. More specifically: I only want zenith when it is between 30 and 90, and I only want azimuth when it is between 30 and 330.
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
0 81 10
70 35 7
110 90 17
270 45 23
330 45 13
345 47 6
175 82 7
220 7 8
This is an example of the sort of constraint I'm looking for.
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
70 35 7
110 90 17
270 45 23
330 45 13
175 82 7
The following is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

P_file = file1
out_file = file2
out_file2 = file3
data = pd.read_csv(file1, header=None, sep=' ')
df = pd.DataFrame(data=data)
df.to_csv(file2, sep=',', header=[...])  # 19 headers; the ones that matter here are 'DateTime', 'Zenith', 'Azimuth', and 'Ozone Amount'
df = pd.read_csv(file2, header='infer')
mask = df[df['DateTime'].str.contains('20141201')]  # keep anything containing the locator for the given day
mask.to_csv(file2)  # update file2 so that it only has the data I want
data2 = pd.read_csv(file2, header='infer')
df2 = pd.DataFrame(data=data2)

def tojuliandate(date):
    # a function that changes a normal date of format %Y%m%dT%H%M%SZ
    # to julian date format %y%j
    return ...

def timeofday(date):
    # changes %Y%m%dT%H%M%SZ to %H%M%S for more narrow views of the data
    return ...

df2['Time of Day'] = df2['DateTime'].apply(timeofday)
df2.to_csv(file2)  # adds a column for "timeofday" to the file
So basically, at this point, this is all the code that goes into making the CSV I want to filter. How would I go about filtering 'Zenith' and 'Azimuth' so that they meet the criteria I specified above?
I know that I will need if statements to do this.
I tried something like this but it didn't work and I was looking for a bit of help:
df[(df["Zenith"]>30) & (df["Zenith"]<90) & (df["Azimuth"]>30) & (df["Azimuth"]<330)]
Basically a duplicate of Efficient way to apply multiple filters to pandas DataFrame or Series
You can use series between:
df[(df['Zenith'].between(30, 90)) & (df['Azimuth'].between(30, 330))]
Yields:
Azimuth Zenith Ozone Amount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7
Note that by default, these upper and lower bounds are inclusive (inclusive=True).
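If you want strict bounds instead, matching the < and > of your attempt, between accepts an inclusive argument (a string such as 'neither' on pandas >= 1.3, a boolean on older versions):
# exclusive bounds on both sides, mirroring the original < and > comparisons
df[(df['Zenith'].between(30, 90, inclusive='neither')) &
   (df['Azimuth'].between(30, 330, inclusive='neither'))]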
You can write only those entries of the dataframe that meet your boundary conditions to your file:
# replace the line df.to_csv(...) in your example with
df[(df['Zenith'] >= 30) & (df['Zenith'] <= 90) &
   (df['Azimuth'] >= 30) & (df['Azimuth'] <= 330)].to_csv('my_csv.csv')
Using pd.DataFrame.query:
df_new = df.query('30 <= Zenith <= 90 and 30 <= Azimuth <= 330')
print(df_new)
Azimuth Zenith OzoneAmount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7
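query can also reference Python variables with the @ prefix, which is handy if the bounds may change (a sketch; the bound names are made up for illustration):
z_lo, z_hi, a_lo, a_hi = 30, 90, 30, 330  # hypothetical bound variables
df_new = df.query('@z_lo <= Zenith <= @z_hi and @a_lo <= Azimuth <= @a_hi')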
I have a dataset:
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
How to plot each value against 'yearweek'?
I tried for example:
import matplotlib.pyplot as plt
import pandas as pd
new = pd.DataFrame([df['A'].values, df['yearweek'].values])
plt.plot(new)
but it doesn't work and shows
ValueError: could not convert string to float: '2014-48'
Then I tried this:
plt.scatter(df['Total'], df['yearweek'])
turns out:
ValueError: could not convert string to float: '2015-37'
Does this mean the type of yearweek has some problem? How can I fix it?
Or is it possible to change the index into a date?

The best solution I see is to calculate the date from scratch and add it to a new column as a datetime. Then you can plot it easily.
import datetime

df['date'] = df['yearweek'].map(
    lambda x: datetime.datetime.strptime(x, "%Y-%W")
              + datetime.timedelta(days=7 * (int(x.split('-')[1]) - 1)))
df.plot('date', 'A')
So I start with the first of January of the given year (strptime ignores the %W week number when no weekday directive is supplied) and go forward 7*(week-1) days, then generate the date from it.
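An alternative sketch: append an explicit weekday so the %W directive actually takes effect, and let pandas parse the whole string (assuming the same df):
# '-1' pins each value to Monday (%w: 0 = Sunday), so %W is honored
df['date'] = pd.to_datetime(df['yearweek'] + '-1', format='%Y-%W-%w')
df.plot('date', 'A')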
As of pandas 0.20.x, you can use DataFrame.plot() to generate your required plots. It uses matplotlib under the hood:
import pandas as pd
data = pd.read_csv('Your_Dataset.csv')
data.plot(['yearweek'], ['A'])
Here, yearweek will become the x-axis and A will become the y-axis. Since they are lists, you can pass multiple columns in both cases.
Note: If it still doesn't look good then you could go towards parsing the yearweek column correctly into dateformat and try again.
I have data:
Village Workers Level
Aagar 10 Small
Dhagewadi 32 Small
Sherewadi 34 Small
Shindwad 42 Small
Dhokari 84 Medium
Khanapur 65 Medium
Ambikanagar 45 Medium
Takali 127 Large
Gardhani 122 Large
Pi.Khand 120 Large
Pangri 105 Large
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_csv("/home/desktop/Desktop/t.csv")
df = df.sort_values('Workers', ascending=False)
df['Level'] = pd.qcut(df['Workers'], 3, ['Small','Medium','Large'])
df['Sum_Level_wise'] = df.groupby('Level')['Workers'].transform('sum')
df['Probability'] = df['Sum_Level_wise'].div(df['Workers'].sum()).round(2)
df['Sample'] = df['Probability'] * df.groupby('Level')['Workers'].transform('size')
df['Selected villages'] = df['Sample'].apply(np.ceil).astype(int)
def f(x):
    a = x['Village'].head(x['Selected villages'].iat[0])
    print (x['Village'])
    print (a)
    if len(x) < len(a):
        print ('original village cannot be filled to Selected village, because length is higher')
    return a
df['Selected village'] = df.groupby('Level').apply(f).reset_index(level=0)['Village']
df['Selected village'] = df['Selected village'].fillna('')
print (df)
Next, I need to get the villages which were selected in the sampling.
I just want the selected village names together with the corresponding Workers and Level columns only,
like this: (Excel screenshot)
I just want those village names, because I don't want to show every intermediate step.
The sampling selects 5 villages, and only that data should be shown. Any help?
It seems you need head:
result_df = df.head(n=5)
result_df
result_df will be:
Village Workers Level Sum_Level_wise Probability Sample Selected villages Selected village
7 Takali 127 Large 474 0.60 2.40 3 Takali
8 Gardhani 122 Large 474 0.60 2.40 3 Gardhani
9 Pi.Khand 120 Large 474 0.60 2.40 3 Pi.Khand
10 Pangri 105 Large 474 0.60 2.40 3
4 Dhokari 84 Medium 194 0.25 0.75 1 Dhokari
If you need just the columns 'Village', 'Workers' and 'Level', then try:
result_df[['Village','Workers','Level']]
It will give you:
Village Workers Level
7 Takali 127 Large
8 Gardhani 122 Large
9 Pi.Khand 120 Large
10 Pangri 105 Large
4 Dhokari 84 Medium
Update:
import numpy as np

df['Selected village'].replace('', np.nan, inplace=True)
df.dropna(subset=['Selected village'], inplace=True)
df[['Workers','Level','Selected village']]
it will give:
Workers Level Selected village
0 10 Small Aagar
4 84 Medium Dhokari
7 127 Large Takali
8 122 Large Gardhani
9 120 Large Pi.Khand