Histogram function in Python

I have a set of 100 random distances ranging from 0.5 to 25; most of them differ from each other by only about 0.01. I want to create a script that:
Reads in a vector with cut-off values
For example, the vector of cut-off values would be: [10.5, 15.2, 17.8, 20.1, 24.3]
The script would ideally create the bins itself by taking each value in the vector as a bin's minimum and the next value in the vector as its maximum.
For example:
bin 1: min = 10.5 and max = 15.2 - 0.01
bin 2: min = 15.2 and max = 17.8 - 0.01
bin 3: min = 17.8 and max = 20.1 - 0.01
bin 4: min = 20.1 and max = 24.3 - 0.01
The script would ideally take the 100 values and sort them into the bins, so values of 10.8 and 10.99 would be sorted into bin 1, which spans 10.5 to 15.2. Once the script has sorted all 100 values, it would return a list with each bin's min and max values (in other words the limits of the bin, in this case 10.5 and 15.2) and also return how many of the 100 numbers fit in the bin.
For example:
10.51
10.52
10.53
21.1
21.1
The histogram loop would then run through the numbers in the list and report that bin 1 has 3 values in it and bin 4 has 2 values in it.
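A minimal sketch of one way to do this, assuming NumPy is available and using the cut-off vector above as bin edges (the sample distances below are made up for illustration):
import numpy as np

cutoffs = [10.5, 15.2, 17.8, 20.1, 24.3]       # cut-off vector from above
distances = [10.51, 10.52, 10.53, 21.1, 21.1]  # stand-in for the 100 values

# np.histogram treats consecutive cut-offs as bin edges, so bin i covers
# [cutoffs[i], cutoffs[i+1]) -- the half-open interval plays the role of
# the "max = next value - 0.01" rule.
counts, edges = np.histogram(distances, bins=cutoffs)

for i, n in enumerate(counts):
    print(f"bin {i + 1}: min={edges[i]}, max={edges[i + 1]}, count={n}")
# bin 1: min=10.5, max=15.2, count=3
# bin 2: min=15.2, max=17.8, count=0
# bin 3: min=17.8, max=20.1, count=0
# bin 4: min=20.1, max=24.3, count=2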

Related

How to count things within a string in Python

I have data where one column is a string. This column contains text, such as:
#    financial_covenants
1    Max. Debt to Cash Flow: Value is 6.00
2    Max. Debt to Cash Flow: Decreasing from 4.00 to 3.00, Min. Fixed Charge Coverage Ratio: Value is 1.20
3    Min. Interest Coverage Ratio: Value is 3.00
4    Max. Debt to Cash Flow: Decreasing from 4.00 to 3.50, Min. Interest Coverage Ratio: Value is 3.00
5    Max. Leverage Ratio: Value is 0.6, Tangible Net Worth: 7.88e+008, Min. Fixed Charge Coverage Ratio: Value is 1.75, Min. Debt Service Coverage Ratio: Value is 2.00
I want a new column that counts how many covenants there are in "financial_covenants".
As you can see, the covenants are separated by commas.
I want my final result to look like this:
financial_covenants                                                                                    num_of_cov
Max. Debt to Cash Flow: Value is 6.00                                                                           1
Max. Debt to Cash Flow: Decreasing from 4.00 to 3.00, Min. Fixed Charge Coverage Ratio: Value is 1.20           2
Min. Interest Coverage Ratio: Value is 3.00                                                                     1
Max. Debt to Cash Flow: Decreasing from 4.00 to 3.50, Min. Interest Coverage Ratio: Value is 3.00               2
Max. Leverage Ratio: Value is 0.6, Tangible Net Worth: 7.88e+008, Min. Fixed Charge Coverage Ratio: Value is 1.75, Min. Debt Service Coverage Ratio: Value is 2.00    4
The data set is large (3,000 rows), and the phrases differ among themselves only in their values, such as:
Max. Debt to Cash Flow: Value is 3.00 versus Max. Debt to Cash Flow: Value is 6.00. I am not interested in these values; I just want to know how many covenants there are.
Do you have any idea how to do this in Python?
It looks to me like you could use:
counts = []  # structure to store the results
for financial_covenant in financial_covenants:  # your structure containing rows
    parts = financial_covenant.split(',')  # split the sentence using commas as delimiters
    count = len(parts)  # count the number of parts obtained
    counts.append(count)  # store the result in the list
print(counts)  # displays [1, 2, 1, 2, 4]
On the assumption that your data is in a pandas DataFrame called df with the columns labelled as above, you could use:
df['num_of_cov'] = df['financial_covenants'].map(lambda row: len(row.split(',')))
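A vectorized alternative (just a sketch, assuming the same df): since the number of covenants is one more than the number of commas, the pandas string accessor avoids the Python-level lambda:
# Count commas per row (vectorized) and add 1 to get the covenant count.
df['num_of_cov'] = df['financial_covenants'].str.count(',') + 1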

Faster way of finding count of a category over a window function in Python

I have a categorical column (well, a discrete-value column) and I would like to count the number of rows in each category over a centered sliding window. I am using Python with pandas and numpy to do this. I have something that works, but it is slow and not so elegant.
I was wondering if there is a faster or easier way of doing this. I am running it over around 10,000 rows now and it takes around 20 seconds, which is OK, but I'd like to run it over several 100,000 rows and up to 1,000,000 rows.
My code so far is as follows:
counted = pd.DataFrame()
for i in df[discrete_column].unique():
    counts = df[discrete_column].rolling(window_size, 0, True).apply(lambda x: np.where(x == i, 1, 0).sum())
    counted[i] = counts
Input would be like this (index, column):
index    discrete_column
58702    65030
58703    65030
58704    65030
58705    65030
58706    65030
58707    30000
58708    30000
58709    30000
58710    30000
Output (this is just a snippet); I used a window size of 20:
index       65030       30000
58703    0.684211    0.315789
58704    0.650000    0.350000
58705    0.600000    0.400000
58706    0.550000    0.450000
58707    0.500000    0.500000
58708    0.450000    0.550000
Each category from the input becomes a column, and the values in the output are the proportions of each category within the window (renormalized to sum to 1 across the row).
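A faster approach (a sketch, not from the original post, assuming df, discrete_column and window_size as above): one-hot encode the categories and take a centered rolling mean, since the mean of 0/1 indicators over a window is exactly the proportion of that category in the window:
import pandas as pd

# One indicator column per category; cast to float so the rolling .mean()
# behaves the same across pandas versions.
dummies = pd.get_dummies(df[discrete_column]).astype(float)
counted = dummies.rolling(window_size, min_periods=1, center=True).mean()
This replaces the per-category Python loop with one vectorized pass, so it should scale much better toward the 100,000 to 1,000,000 row range.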

Finding the first row which satisfies two conditions in a Python data frame

I have a data frame which looks like this:
[data frame shown as an image in the original post]
I want to write code which locates points that are less than 250 away from the next point. When it finds such a point, it should search for the first subsequent point that is more than 250 away and has a speed greater than 5.
For example, in the sample data set it would first find row 7, then locate row 10, which is more than 250 away and has a speed of 10.8, and return the index of row 10.
I have written this code so far:
for i in (number + 1 for number in range(data_gpd.index[-1] - 1)):
    if data_gpd['distance'][i + 1] < 250:
I'm not sure what I should do after this condition. I had in mind to use the next() function with conditions, but I was only able to find examples for list comprehensions with one condition.
I really appreciate your help, as I'm new to Python and not sure which syntax would work best.
You can use the pandas loc indexer with boolean conditions to return a filtered DataFrame.
First Condition:
df['distance'] < 250
Second Condition:
df['speed'] > 5
Combined Condition:
(df['distance'] < 250) & (df['speed'] > 5)
Using loc and combined condition:
df.loc[(df['distance'] < 250) & (df['speed'] > 5)]
Input:
time location distance speed
0 300 9071 9071 108.00
1 300 18376 9304 11.00
2 300 28006 9630 115.00
3 200 30506 2500 45.00
4 400 31606 1100 9.90
5 500 31706 100 0.72
6 150 31756 50 1.20
7 20 31766 10 1.80
8 50 31916 150 10.80
Output:
time location distance speed
8 50 31916 150 10.8
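If you only want the index of the first matching row rather than the whole filtered frame (a sketch, assuming the same df and that at least one row matches):
mask = (df['distance'] < 250) & (df['speed'] > 5)
first_idx = df.loc[mask].index[0]  # or equivalently: mask.idxmax()
print(first_idx)                   # 8 for the Input above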

Adding confidence intervals for population rates in a dataframe

I have a dataframe where I have created a new column which sums the first three columns (dates). Then I have created a rate for each row based on the population column.
I would like to create lower and upper 95% confidence levels for the "sum_of_days_rate" for each row in this dataset.
I can create a mean of the first three columns, but I am not sure how to create lower and upper values for the rate derived from the sum of these three columns.
Sample of the dataset below:
import pandas as pd

data = {'09/01/2021': [74, 84, 38],
        '10/11/2021': [43, 35, 35],
        '12/01/2021': [35, 37, 16],
        'population': [23000, 69000, 48000]}
df = pd.DataFrame(data, columns=['09/01/2021', '10/11/2021', '12/01/2021', 'population'])
df['sum_of_days'] = df.loc[:, df.columns[0:3]].sum(axis=1)  # sum the three date columns
df['sum_of_days_rate'] = df['sum_of_days'] / df['population'] * 100000  # rate per 100,000
To estimate a confidence interval you need to make certain assumptions about the data: how it is distributed, or what the associated error would be. I am not sure what your data points mean or why you are summing them up, etc.
A commonly used distribution for rates is the Poisson distribution, and you can construct the confidence interval given a mean:
import scipy.stats

lb, ub = scipy.stats.poisson.interval(0.95, df.sum_of_days_rate)
df['lb'] = lb
df['ub'] = ub
The arrays lb and ub are the lower and upper bounds of the 95% confidence interval. The final data frame looks like this:
09/01/2021 10/11/2021 12/01/2021 population sum_of_days sum_of_days_rate lb ub
0 74 43 35 23000 152 660.869565 611.0 712.0
1 84 35 37 69000 156 226.086957 197.0 256.0
2 38 35 16 48000 89 185.416667 159.0 213.0
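An alternative worth noting (my own suggestion, not part of the answer above): since the Poisson model more naturally applies to the raw counts than to the derived rate, you could build the interval on sum_of_days and then scale the bounds to rates per 100,000:
# Interval on the observed counts, then converted to rates per 100,000.
lb_cnt, ub_cnt = scipy.stats.poisson.interval(0.95, df['sum_of_days'])
df['lb_rate'] = lb_cnt / df['population'] * 100000
df['ub_rate'] = ub_cnt / df['population'] * 100000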

How to get the mean for each group in pandas.dataframe like seaborn.factorplot

I have a dataset formatted as a pandas dataframe. Please see this example in seaborn http://seaborn.pydata.org/generated/seaborn.factorplot.html#seaborn.factorplot
>>> import seaborn as sns
>>> sns.set(style="ticks")
>>> exercise = sns.load_dataset("exercise")
>>> g = sns.factorplot(x="time", y="pulse", hue="kind", data=exercise)
With sns.factorplot, I can see the mean of the data by group (in this instance, the chart shows the mean of pulse at 1/15/30 min, grouped by "kind").
I want to get the "values" shown in the chart directly.
For example
time     kind      mean    standard deviation
1 min    running   xx      xx
15 min   running   xx      xx
I could use a 2-level loop to get the values I want, but I think there should be something easier in pandas, since this is a common requirement.
Unlike matplotlib, which returns all the values in the plot, seaborn returns a FacetGrid object, and it seems that FacetGrid does not hold the data I want.
I think you need to group by the columns time and kind and aggregate mean and std:
print (exercise.groupby(['time','kind'])['pulse'].agg(['mean', 'std']))
#agg same as aggregate, only less typing ;)
#print (exercise.groupby(['time','kind'])['pulse'].aggregate(['mean', 'std']))
mean std
time kind
1 min rest 90.2 6.545567
walking 93.1 6.297266
running 96.1 4.483302
15 min rest 90.9 6.118279
walking 96.6 7.441625
running 117.1 12.991023
30 min rest 91.4 5.337498
walking 95.9 6.740425
running 126.0 16.964014
df1 = exercise.groupby(['time','kind'])['pulse'].agg(['mean', 'std']).reset_index()
print (df1)
time kind mean std
0 1 min rest 90.2 6.545567
1 1 min walking 93.1 6.297266
2 1 min running 96.1 4.483302
3 15 min rest 90.9 6.118279
4 15 min walking 96.6 7.441625
5 15 min running 117.1 12.991023
6 30 min rest 91.4 5.337498
7 30 min walking 95.9 6.740425
8 30 min running 126.0 16.964014
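To match the column names in the desired output exactly (a minor follow-up, assuming df1 as above):
df1 = df1.rename(columns={'std': 'standard deviation'})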
