Scaling numbers within a dataframe column to the same proportion


I have a series of numbers of two different magnitudes in a dataframe column. They are
0 154480.429000
1 154.480844
2 154480.433000
3 154.480844
4 154480.433000
......
As shown above, the column mixes two magnitudes. I am not sure how to set a condition that scales the small number 154.480844 to the same order of magnitude as the large one, 154480.433000.
How can this be done efficiently with pandas?

Use np.log10 to determine the scaling factor required. Something like this:
v = np.log10(ser).astype(int)
ser * 10 ** (v.max() - v).values
0 154480.429
1 154480.844
2 154480.433
3 154480.844
4 154480.433
Name: 1, dtype: float64
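
For reference, here is a self-contained sketch of the same idea. The series `ser` is rebuilt from the values in the question, and the sketch assumes every value is strictly positive. It swaps `astype(int)` on the raw logarithm for `np.floor` first, an adjustment that only matters for values below 1, where truncation toward zero would give the wrong exponent.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (assumed to be strictly positive).
ser = pd.Series([154480.429, 154.480844, 154480.433, 154.480844, 154480.433])

# Integer order of magnitude of each value.
v = np.floor(np.log10(ser)).astype(int)

# Multiply each value by the power of ten that lifts it to the largest magnitude.
scaled = ser * 10.0 ** (v.max() - v)
print(scaled)
```

Values already at the largest magnitude are multiplied by 1 and stay unchanged.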

Related

Create a custom percentile rank for a pandas series

I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values in series less than i + 0.5 * # of other values equal to i) / total # of values
so if I had the following series
s=pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
For the first element, 5 there are 6 values less than 5 and no other values = to 5. The rank would be (6+0x0.5)/13 or 6/13.
For the fourth element (1) it would be (0+ 2x0.5)/13 or 1/13.
How could I calculate this without using a loop? I assume a combination of s.apply and/or s.where() but can't figure it out and have tried searching. I am looking to apply to the entire series at once, with the result being a series with the percentile ranks.

You could use numpy broadcasting. First convert s to a numpy array and view it as a column. Broadcasting the column against the flat array then counts, for each element i, the number of items less than i. Count the number of items equal to i the same way (subtracting 1, since i is always equal to itself). Finally, add the two counts and build a Series:
tmp = s.to_numpy()
s_col = tmp[:, None]
less_than_i_count = (s_col>tmp).sum(axis=1)
eq_to_i_count = ((s_col==tmp).sum(axis=1) - 1) * 0.5
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
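
The broadcasting approach builds an n-by-n comparison matrix, which can get heavy for long series. As a hedged alternative (a different technique from the answer above, not a replacement for it), the same counts can be read off a sorted copy with `np.searchsorted`. The names `less`, `less_or_equal` and `equal_others` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 3, 8, 1, 9, 4, 14, 12, 6, 1, 1, 4, 15])

values = s.to_numpy()
sorted_vals = np.sort(values)

# Number of elements strictly less than each value.
less = np.searchsorted(sorted_vals, values, side="left")
# Number of elements less than or equal to each value (includes the value itself).
less_or_equal = np.searchsorted(sorted_vals, values, side="right")

# Other elements equal to each value: exclude the element itself.
equal_others = less_or_equal - less - 1

ranks = pd.Series((less + 0.5 * equal_others) / len(s), index=s.index)
print(ranks)
```

This needs one sort and two binary-search passes, so memory stays linear in the length of the series.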

Centre of mass row-wise in a dataframe and multiply each column by different mass

I'm trying to calculate the centre of mass of 20 objects, where each object has its own mass.
These objects are represented in a dataframe cm_x, and their associated masses in a list. Below I show an example of just 3 of those 20 objects, for the sake of saving space. Each object has an x, y, z coordinate, but I'll just show the x and then I can apply the same technique to the rest. Below is the head of the dataframe.
bar_head_x bar_hip_centre_x bar_left_ankle_x
0 -203.3502 -195.4573 -293.262
1 -203.4280 -195.4720 -293.251
2 -203.4954 -195.4675 -293.248
3 -203.5022 -195.9193 -293.219
4 -203.5014 -195.9092 -293.328
m_head = 0.081
m_hipc = 0.139
m_lank = 0.0465
m = [m_head,m_hipc,m_lank]
I saw in another similar question, someone has suggested this method, however this doesn't incorporate the masses, and that is where I'm having an issue:
def series_sum(pd_series):
    return np.sum(np.dot(pd_series.values, np.asarray(range(1, len(pd_series)+1)))/np.sum(pd_series))
cm_x.apply(series_sum, axis=1)
Basically I want for each row, to have an associated centre of mass, using the formula for centre of mass which is sum(x_i * m_i) / sum(m_i).
The desired result would be a new column in the dataframe like so:
cm_x
0 -214.92
1 ...
2 ...
3 ...
4 ...
Any help?

If I understand correctly, you can compute the desired column like this:
>>> df.mul(m).sum(axis=1)/sum(m)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800

Use DataFrame.dot and divide by sum of list m:
s = df.dot(m).div(sum(m))
print (s)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
dtype: float64
If you need a DataFrame, add Series.to_frame:
df1 = df.dot(m).div(sum(m)).to_frame('cm_x')
print (df1)
cm_x
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
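
Both answers compute the same mass-weighted mean per row. The sketch below rebuilds the five sample rows and three masses from the question (the frame name `df` follows the answers) and checks that the two expressions agree. The name `result` is illustrative.

```python
import pandas as pd

# Head of the frame and the masses, copied from the question.
df = pd.DataFrame({
    "bar_head_x": [-203.3502, -203.4280, -203.4954, -203.5022, -203.5014],
    "bar_hip_centre_x": [-195.4573, -195.4720, -195.4675, -195.9193, -195.9092],
    "bar_left_ankle_x": [-293.262, -293.251, -293.248, -293.219, -293.328],
})
m = [0.081, 0.139, 0.0465]

# Element-wise: scale each column by its mass, sum across the row, normalise.
cm_mul = df.mul(m).sum(axis=1) / sum(m)

# Matrix form: the row-by-mass dot product gives the same weighted sum.
cm_dot = df.dot(m) / sum(m)

# Attach the result as a new column without modifying df in place.
result = df.assign(cm_x=cm_dot)
print(result)
```

The list `m` must be in the same order as the columns of `df`, since both `mul` and `dot` match a plain list to columns by position.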

How to filter a pandas DataFrame and keep specific elements?

I have a pandas Data Frame which is a 50x50 correlation matrix. In the following picture you can see what I have as an example
What I would like to do, if it's possible of course, is to make a new data frame which has only the elements of the old one that are higher than 0.5 or lower than -0.5, indicating a strong linear relationship, but not 1, to avoid the variance parts.
I don't think what I'm asking is exactly possible, because variable x0 won't have the same strong relationships that x1 has, and so on, so the new data frame won't look very good.
But is there any way to scan fast through this data frame, find the values I mentioned and maybe at least insert them into an array?
Any insight would be helpful. Thanks

You can't really keep it as a correlation matrix if you want to drop the correlation pairs that are too weak. One thing you could do is stack the frame and keep only the relevant pairs.
Given (randomly generated as an example):
0 1 2 3 4
0 0.038142 -0.881054 -0.718265 -0.037968 -0.587288
1 0.587694 -0.135326 -0.529463 -0.508112 -0.160751
2 -0.528640 -0.434885 -0.679416 -0.455866 0.077580
3 0.158409 0.827085 0.018871 -0.478428 0.129545
4 0.825489 -0.000416 0.682744 0.794137 0.694887
you could do:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(-1, 1, (5, 5)))
df = df.stack()
df = df[((df > 0.5) | (df < -0.5)) & (df != 1)]
0  1   -0.881054
   2   -0.718265
   4   -0.587288
1  0    0.587694
   2   -0.529463
   3   -0.508112
2  0   -0.528640
   2   -0.679416
3  1    0.827085
4  0    0.825489
   2    0.682744
   3    0.794137
   4    0.694887
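
On a real correlation matrix every pair appears twice and the diagonal is exactly 1, so a variant of the stacking idea is to keep only the strict upper triangle before filtering. The sketch below is an assumption-laden illustration: the data, the column names `x0` to `x4`, and the seed are made up, and `x4` is deliberately built from `x0` so that one strong pair exists.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four independent columns plus one derived from x0.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x0", "x1", "x2", "x3"])
data["x4"] = 0.9 * data["x0"] + rng.normal(scale=0.3, size=200)
corr = data.corr()

# Strict upper triangle: drops the diagonal and the mirrored duplicates.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Keep only the strong relationships, positive or negative.
strong = pairs[pairs.abs() > 0.5]
print(strong)
```

The result is a Series indexed by (row label, column label), which can be turned into a table with `strong.reset_index()` if an array-like layout is preferred.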

finding best combination of date sets - given some constraints

I am looking for the right approach to solve the following task (using Python):
I have a dataset which is a 2D matrix. Let's say:
1 2 3
5 4 7
8 3 9
0 7 2
From each row I need to pick one number which is not 0 (I can also make it NaN if that's easier).
I need to find the combination with the lowest total sum.
So far so easy. I take the lowest value of each row.
The solution would be:
1 x x
x 4 x
x 3 x
x x 2
Sum: 10
But there is a variable minimum and maximum sum allowed for each column, so just choosing the minimum of each row may lead to an invalid combination.
Let's say min is defined as 2 in this example, no max is defined. Then the solution would be:
1 x x
5 x x
x 3 x
x x 2
Sum: 11
I need to choose 5 in row two as otherwise column one would be below the minimum (2).
I could use brute force and test all possible combinations. But due to the amount of data which needs to be analyzed (amount of data sets, not size of each data set) that's not possible.
Is this a common problem with a known mathematical/statistical or other solution?
Thanks
Robert

is there any quick function to do looking-back calculating in pandas dataframe?

I want to implement a look-back calculation, as in this simple scenario:
the value is computed as the sum of the daily data over the previous N days (N = 3 in the following example)
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
The calculation should give:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
So far I have done this with a for loop:
df['new'] = 0
for idx in df.index:
    loc = df.index.get_loc(idx)
    if loc - N >= 0:
        # sum the N rows strictly before the current one
        total = df['value'].iloc[loc - N:loc].sum()
    else:
        total = 0
    df.loc[idx, 'new'] = total
But when the dataframe is long or N is large, this calculation becomes very slow. How can I implement it faster, with a built-in function or some other way?
Also, how would this extend to more complex scenarios? Thanks.

Since you want the sum of the previous three values, excluding the current one, you can use rolling_apply over a window of four and sum all but the last value.
new = rolling_apply(df, 4, lambda x:sum(x[:-1]), min_periods=4)
This is the same as shifting afterwards with a window of three:
new = rolling_apply(df, 3, sum, min_periods=3).shift()
Then
df["new"] = new["value"].fillna(0)
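
The top-level `rolling_apply` function belongs to older pandas releases and was removed from later ones. A minimal sketch of the same shift-after-rolling idea with the `.rolling()` method follows, assuming the sample frame from the question with `date` as the index.

```python
import pandas as pd

N = 3
# Sample frame from the question, indexed by date.
df = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5, 6, 7]},
    index=pd.Index(
        [20140718, 20140721, 20140722, 20140723, 20140724, 20140725, 20140728],
        name="date",
    ),
)

# Sum over a window of N rows, then shift by one row so the current row is excluded.
df["new"] = df["value"].rolling(N, min_periods=N).sum().shift().fillna(0)
print(df)
```

The window here counts rows rather than calendar days, which matches the expected output in the question, where weekends are skipped.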
