Converting m to km and string to float in pandas DataFrame - python

I have this simplified DataFrame where I want to add a new column Distance_km.
In this new column all values should be in kilometres and converted to float dtype.
import pandas as pd

d = {'Point': ['a','b','c','d'], 'Distance': ['3km', '400m','1.1km','200m']}
dist = pd.DataFrame(data=d)
dist
  Point Distance
0     a      3km
1     b     400m
2     c    1.1km
3     d     200m

dist.dtypes
Point       object
Distance    object
dtype: object
How can I get this output?
  Point Distance  Distance_km
0     a      3km          3.0
1     b     400m          0.4
2     c    1.1km          1.1
3     d     200m          0.2

Point           object
Distance        object
Distance_km    float64
dtype: object
Thanks in advance!

You could use the pandas apply method to pass your Distance column values to a function that converts them to a standardized unit, like so.
From the documentation
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is
either the DataFrame’s index (axis=0) or the DataFrame’s columns
(axis=1). By default (result_type=None), the final return type is
inferred from the return type of the applied function. Otherwise, it
depends on the result_type argument.
First create the function that will transform the data (apply can even take a lambda):
import re

def convert_to_km(distance):
    '''
    distance can be a string with km or m as units,
    e.g. 300km, 1.1km, 200m, 4.5m
    '''
    # split the string into value and unit, e.g. ['300', 'km']
    split_dist = re.match(r'([\d\.]+)?([a-zA-Z]+)', distance)
    value = split_dist.group(1)  # 300
    unit = split_dist.group(2)   # km
    if unit == 'km':
        return float(value)
    if unit == 'm':
        return round(float(value) / 1000, 2)
d = {'Point': ['a','b','c','d'], 'Distance': ['3km', '400m','1.1km','200m']}
dist=pd.DataFrame(data=d)
You can then apply this function to your Distance column:
dist['Distance_km'] = dist.apply(lambda row: convert_to_km(row['Distance']), axis=1)
dist
The output will be:
  Point Distance  Distance_km
0     a      3km          3.0
1     b     400m          0.4
2     c    1.1km          1.1
3     d     200m          0.2
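Since convert_to_km only needs the Distance string itself, a small variation (not part of the original answer) is to apply it to that column directly, which avoids passing whole rows:
dist['Distance_km'] = dist['Distance'].apply(convert_to_km)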

You may try the following as well:
Check if the second-to-last character of the string is 'k'.
If it is, drop the last two characters (i.e. 'km') and convert the rest to float.
Otherwise, drop only the last character (i.e. 'm') and divide the float value by 1000.
Below is the implementation using apply on the Distance column:
dist['Distance_km'] = dist['Distance'].apply(lambda row: float(row[:-1])/1000 if not row[-2]=='k' else float(row[:-2]))
Result is:
Point Distance  Distance_km
a        3km            3.0
b       400m            0.4
c      1.1km            1.1
d       200m            0.2

Try:
# A "Weight" column: 1e-3 for values in "m", 1 for values in "km"
dist["Weight"] = 1e-3
dist.loc[dist["Distance"].str.contains("km"), "Weight"] = 1
# Extract the numeric part of the string and convert it to float
dist["NumericPart"] = dist["Distance"].str.extract(r"([0-9.]+)\w+").astype(float)
# Merge the numeric parts with their units (weights) by multiplication
dist["Distance_km"] = dist["NumericPart"] * dist["Weight"]
You will get:
Point Distance Weight NumericPart Distance_km
0 a 3km 1.000 3.0 3.0
1 b 400m 0.001 400.0 0.4
2 c 1.1km 1.000 1.1 1.1
3 d 200m 0.001 200.0 0.2
BTW: avoid using apply if you can; it will be very slow if your data is big.
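A more compact variant of the same vectorized idea (a sketch, not part of the original answer) extracts the number and the unit in one pass and maps the unit to a conversion factor:
num = dist["Distance"].str.extract(r"(?P<value>[0-9.]+)(?P<unit>[a-z]+)")
dist["Distance_km"] = num["value"].astype(float) * num["unit"].map({"km": 1.0, "m": 1e-3})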

Related

Cumulative sum of a pandas column until a maximum value is met, and average adjacent rows

I'm a biology student who is fairly new to Python, and I was hoping someone might be able to help with a problem I have yet to solve.
With some earlier code I have created a pandas dataframe that looks like the example below:
Distance.  No. of values  Mean rSquared
        1            500            0.6
        2             80            0.3
        3             40            0.4
        4             30            0.2
        5             50            0.2
        6             30            0.1
I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.
I need to sum the 'No. of values' column until I reach a value >= 100, and then combine the data of those rows in the adjacent columns, taking the weighted average of the distance and mean rSquared values, as seen in the example below:
Mean Distance.          No. of values      Mean rSquared
1                       500                0.6
(80*2+40*3)/120         (80+40) = 120      (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110    (30+50+30) = 110   (30*0.2+50*0.2+30*0.1)/110
etc...
I know pandas has its .cumsum() function, which I might be able to use in a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to that limit. However, I haven't a clue how to average the adjacent columns.
Any help would be appreciated!
You can use this code snippet to solve your problem.
# First, compute the weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]

min_threshold = 100
indexes = []
temp_sum = 0

# placeholder for the final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]

# reset the index to make 'df' usable in the following loop
df = df.reset_index(drop=True)

# main loop to check and compute the desired output
for index, _ in df.iterrows():
    temp_sum += df.iloc[index]["No. of values"]
    indexes.append(index)

    # if the sum reaches 'min_threshold', compute the weighted averages
    if temp_sum >= min_threshold:
        temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum

        # create a temporary dataframe and concatenate it with 'final_df'
        temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
        final_df = pd.concat([final_df, temp_df])

        # reset the variables
        temp_sum = 0
        indexes = []
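A quick usage sketch with the sample data from the question (the column names are assumed to match the snippet above; the printed figures follow from the weighted averages):
import pandas as pd

df = pd.DataFrame({"Distance": [1, 2, 3, 4, 5, 6],
                   "No. of values": [500, 80, 40, 30, 50, 30],
                   "Mean rSquared": [0.6, 0.3, 0.4, 0.2, 0.2, 0.1]})
# ... run the snippet above on this df ...
print(final_df.reset_index(drop=True))
# Expected (approximately):
#    Distance  No. of values  Mean rSquared
# 0  1.000000          500.0       0.600000
# 1  2.333333          120.0       0.333333
# 2  5.000000          110.0       0.172727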
NumPy has a function, numpy.frompyfunc. You can use it to get a cumulative value that resets at a threshold.
Here's how to implement it. With that, you can then figure out the indexes where the value goes over the threshold, and use those to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.
I also leveraged @sujanay's idea of calculating the weighted values first.
c = ['Distance','No. of values','Mean rSquared']
d = [[1,500,0.6], [2,80,0.3], [3,40,0.4],
     [4,30,0.2], [5,50,0.2], [6,30,0.1]]

import pandas as pd
import numpy as np

df = pd.DataFrame(d, columns=c)

# calculate the weighted distance and weighted mean rSquared first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]

# use numpy.frompyfunc to set up the threshold condition
sumvals = np.frompyfunc(lambda a, b: a + b if a <= 100 else b, 2, 1)

# assign value to cumvals based on the threshold
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=object)

# find all records that have >= 100 as cumulative value
idx = df.index[df['cumvals'] >= 100].tolist()

# if the last row is not in idx, add it to the list
if (len(df)-1) not in idx: idx += [len(df)-1]

# iterate through idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
    df.loc[j,'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j,'cumvals']).round(2)
    df.loc[j,'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j,'cumvals']).round(2)
    i = j+1

print(df)
The output of this will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the records that are non-NaN, you can do:
final_df = df[df['Mean Distance'].notnull()]
This will result in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I looked up BEN_YO's implementation of numpy.frompyfunc. The original SO post can be found here: Restart cumsum and get index if cumsum more than value.
If you figure out the grouping first, pandas groupby-functionality will do a lot of the remaining work for you. A loop is appropriate to get the grouping (unless somebody has a clever one-liner):
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
...     if cumsum >= 100:
...         cumsum = 0
...         group = group + 1
...     cumsum = cumsum + n
...     groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]
Before doing the grouped operations you need to use the No. of values information to get the weighting in:
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727
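To collect everything into a single result frame (a small follow-up sketch, not part of the original answer), the weighted group averages and the group sums can be joined back together:
result = df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
result["No. of values"] = sums
print(result)
#    Distance.  Mean rSquared  No. of values
# 0   1.000000       0.600000            500
# 1   2.333333       0.333333            120
# 2   5.000000       0.172727            110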

What is the difference between the args 'index' and 'values' for the pandas interpolate function?

What is the difference between the pandas DataFrame interpolate function called with args 'index' and 'values' respectively? It's ambiguous from the documentation:
pandas.DataFrame.interpolate
method : str, default ‘linear’
Interpolation technique to use. One of:
‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
‘time’: Works on daily and higher resolution data to interpolate given length of interval.
‘index’, ‘values’: use the actual numerical values of the index.
Both appear to use the numerical values of the index; is this the case?
UPDATE:
Following ansev's answer, they do indeed do the same thing
I think it's pretty clear: imagine you're going to interpolate points. The values of your DataFrame represent the Y values; the task is to fill in the missing values in Y with some logic, and an interpolation function is used for that. For the X variable there are then two options: assume a fixed step, independent of the index, or take the actual values of the index into account.
Example with linear interpolation.
Here the index increases by 1 for each row, so there is no difference between the methods:
df=pd.DataFrame({'Y':[1,np.nan,3]})
print(df)
Y
0 1.0
1 NaN
2 3.0
print(df.interpolate(method = 'index'))
Y
0 1.0
1 2.0
2 3.0
print(df.interpolate())
Y
0 1.0
1 2.0
2 3.0
but if we change the index values...
df.index = [0,1,10000]
print(df.interpolate(method = 'index'))
Y
0 1.0000
1 1.0002 # 1 + (3-1)*((1-0)/(10000-0))
10000 3.0000
print(df.interpolate())
Y
0 1.0
1 2.0
10000 3.0
df.index = [0,0.1,1]
print(df.interpolate(method = 'index'))
Y
0.0 1.0
0.1 1.2 # 1 + (3-1)*((0.1-0)/(1-0))
1.0 3.0
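As a quick sanity check (a sketch, not from the original answer), the two method names can be compared directly on a non-uniform index; they produce the same frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Y': [1, np.nan, 3]}, index=[0, 1, 10000])
print(df.interpolate(method='index').equals(df.interpolate(method='values')))  # True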

How to find the minimum distance when two points belong to the same distance

I have a dataframe like this:
A B
1 0.1
1 0.2
1 0.3
2 0.2
2 0.5
2 0.3
3 0.8
3 0.6
3 0.1
How can I find the minimum B value belonging to each point 1, 2, 3, with no conflicts, meaning for example that points 1 and 2 should not both end up with the same value 0.3?
If I understand correctly, you want to do two things:
- find the minimum B per distinct A, and
- make sure that they don't collide.
You didn't specify what to do in case of a collision, so I assume you just want to know whether there is one.
The first can be achieved with Rarblack's answer (though you should use min and not max in your case).
For the second, you can use the .nunique() method: count how many unique B values there are (it should equal the number of unique A values).
# set up the dataframe
df = pd.DataFrame.from_dict({
    'A': [1,1,1,2,2,2,3,3,3],
    'B': [0.1,0.2,0.3,0.2,0.5,0.3,0.8,0.6,0.1]
})
# find the minimum B per A
x = df.groupby('A')['B'].min()
# assert that there are no collisions:
if not (x.nunique() == len(x)):
    print("Conflicting values")
You can use groupby and the max function.
df.groupby('A').B.max()

Pandas inplace conditional value multiplication

I have tried all the solutions I could find on this topic; none of them applied to the dataframe "in place", and the multiplication never happened.
So here is what I am trying to do:
I have a multilevel column dataframe with many measurements. Every column is ordered like this:
data:
            MeasurementType
              Value  Unit Type
StudyNumber
1               1.0   m/s    a
2               1.7   m/s    v
3              10.5  cm/s    b
I am trying to convert all measurements with unit m/s to cm/s, i.e. I need to filter all Values with Unit m/s, multiply them by 10 and then change the Unit in the Unit column.
I managed the filter; however, when I perform a multiplication on it (with *10, .mul(10) directly, or by making a new assignment), it doesn't stick. Printing the dataframe afterwards shows no change in the values.
Here is the code:
unit_df = data.iloc[:, data.columns.get_level_values(1)=='Unit']
unit_col_list = []
for unitcol in unit_df.columns:
    unitget = unit_df[unitcol][unit_df[unitcol].notnull()].unique()
    if unitget.size > 1:
        unit_col_list.append(unitcol)
unit_col_list = [item[0] for item in unit_col_list]  # so I get the header of the column
data_wrongunits = data[unit_col_list]
data_wrongunits[unit_col_list[0]][data_wrongunits[unit_col_list[0]]['Unit'] == 'm/s']['Value'] *= 10
or
data_wrongunits[unit_col_list[0]][data_wrongunits[unit_col_list[0]]['Unit'] == 'm/s']['Value'].mul(10)
or
data_wrongunits[unit_col_list[0]][data_wrongunits[unit_col_list[0]]['Unit'] == 'm/s']['Value']=data_wrongunits[unit_col_list[0]][data_wrongunits[unit_col_list[0]]['Unit'] == 'm/s']['Value']*10
The filter gives me a series of the Value column. Maybe another structure would help?
You can use:
print (data)
MeasurementType MeasurementType1
Value Unit Type Value Unit Type
StudyNumber
1 1.0 m/s a 1.0 m/s a
2 1.7 m/s v 1.7 cm/s v
3 10.5 cm/s b 10.5 mm/s b
#get columns with Unit
unit_df = data.loc[:, data.columns.get_level_values(1)=='Unit']
print (unit_df)
MeasurementType MeasurementType1
Unit Unit
StudyNumber
1 m/s m/s
2 m/s cm/s
3 cm/s mm/s
#create a helper df with the units replaced by constants
#if a value is not in the dict we get NaN, which is then replaced by 1
d = {'m/s':10, 'mm/s':100}
df1 = unit_df.applymap(d.get).fillna(1).rename(columns={'Unit':'Value'})
print (df1)
MeasurementType MeasurementType1
Value Value
StudyNumber
1 10.0 10.0
2 10.0 1.0
3 1.0 100.0
#filter only the Value columns and multiply by df1
data[df1.columns] = data[df1.columns].mul(df1)
print (data)
MeasurementType MeasurementType1
Value Unit Type Value Unit Type
StudyNumber
1 10.0 m/s a 10.0 m/s a
2 17.0 m/s v 1.7 cm/s v
3 10.5 cm/s b 1050.0 mm/s b
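If the Unit labels should also be updated after the scaling (the question asks for that), one possible follow-up under the same assumptions would relabel the converted cells, e.g. for the m/s columns:
# relabel units whose values were just scaled (assuming m/s entries should now read cm/s)
unit_cols = [col for col in data.columns if col[1] == 'Unit']
data[unit_cols] = data[unit_cols].replace({'m/s': 'cm/s'})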
Another way:
# convert value
df.loc[df.Unit=='m/s', 'Value'] = \
df.loc[df.Unit=='m/s', 'Value'].mul(100) #!
# change unit
print df.set_value(df.Unit=='m/s', 'Unit', 'cm/s')
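set_value has since been removed from pandas; a minimal modern sketch of the same idea (assuming a flat frame with 'Unit' and 'Value' columns) would be:
mask = df['Unit'] == 'm/s'
df.loc[mask, 'Value'] *= 100   # scale first (1 m/s = 100 cm/s)
df.loc[mask, 'Unit'] = 'cm/s'  # then relabel the unit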

pandas interpolation : use np.interp with changing values

I have a 3-million-row dataframe that contains these values:
  d    a0   a1   a2
1.5  10.0  5.0  1.0
0.8  10.0  2.0  0.0
I want to fill a fourth column with a linear interpolation of (a0, a1, a2) evaluated at the value in the "d" column:
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 3.0
0.8 10.0 2.0 0.0 3.6
newcol is the weighted average between a[int(d)] and a[int(d+1)], e.g. when d = 0.8, newcol = 0.2 * a0 + 0.8 * a1 because 0.8 is 80% of the way between 0 and 1
I found that np.interp can be used, but there is no way for me to pass the three columns in as the values vector:
df["newcol"]=np.interp(df["d"],[0,1,2], [100,200,300])
will indeed give me
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 250.0
0.8 10.0 2.0 0.0 180.0
BUT I have no way to specify that the values vector changes from row to row:
df["newcol"]=np.interp(df["d"],[0,1,2], df[["a0","a1","a2"]])
gives me the following traceback :
File "C:\Python27\lib\site-packages\numpy\lib\function_base.py", line 1271, in interp
return compiled_interp(x, xp, fp, left, right)
ValueError: object too deep for desired array
Is there any way to use a different values vector for each line? Can you think of any workaround?
Basically, I could find no way to create this new column based on the definition:
What is the value, at x = column "d", of the function that is piecewise linear between the given points and whose values at those points are described in the columns "ai"?
Edit: before, I used scipy.interp1d, which is not memory efficient; the comment helped me partially solve my problem.
Edit 2:
I tried the approach from ev-br, who stated that I had to try to code the loop myself:
for i in range(len(tps)):
    columns = ["a1","a2","a3"]
    length = len(columns)
    x = np.maximum(0, np.minimum(df.ix[i,"d"], length-2))
    xint = np.int(x)
    xfrac = x - xint
    name1 = columns[xint]
    name2 = columns[xint+1]
    tps.ix[i,"Multiplier"] = df.ix[i,name1] + xfrac*(df.ix[i,name2] - tps.ix[i,name1])
The above loop runs at around 50 iterations per second, so I guess I have a major optimisation issue. What part of working with a DataFrame am I doing wrong?
It might come a bit too late, but I would use np.interp with pandas' apply function. Creating the DataFrame from your example:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
Then comes the apply function:
t.apply(lambda x: np.interp(x.d, [0,1,2], x['a0':]), axis=1)
which yields:
0 3.0
1 3.6
dtype: float64
This is perfectly usable on "normal" datasets. However, the size of your DataFrame might call for a better/more optimized solution. The processing time scales linearly; my machine clocks in at about 10,000 lines per second, which means 5 minutes for 3 million...
OK, I have a second solution, which uses the numexpr module. This method is much more specific, but also much faster. I've measured the complete process to take 733 milliseconds for 1 million lines, which is not bad...
So we have the original DataFrame as before:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
We import the module and use it, but it requires that we separate the two cases where we will use 'a0' and 'a1' or 'a1' and 'a2' as lower/upper limits for the linear interpolation. We also prepare the numbers so they can be fed to the same evaluation (hence the -1). We do that by creating 3 arrays with the interpolation value (originally: 'd') and the limits, depending on the value of "d". So we have:
import numexpr as ne
lim = np.where(t.d > 1, [t.d-1, t.a1, t.a2], [t.d, t.a0, t.a1])
Then we evaluate the simple linear interpolation expression and finally add it as a new column, like this:
x = ne.evaluate('(1-x)*a+x*b', local_dict={'x': lim[0], 'a': lim[1], 'b': lim[2]})
t['IP'] = x
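For the full 3-million-row case, a vectorized sketch of my own (assuming column names 'd', 'a0', 'a1', 'a2' and interpolation points at x = 0, 1, 2) avoids both apply and the per-row loop:
import numpy as np
import pandas as pd

def interp_rowwise(df, value_cols):
    # per-row values at x = 0, 1, 2, ...
    vals = df[value_cols].to_numpy()                 # shape (n_rows, n_points)
    d = df['d'].to_numpy()
    lo = np.clip(np.floor(d).astype(int), 0, vals.shape[1] - 2)
    frac = d - lo
    rows = np.arange(len(df))
    left, right = vals[rows, lo], vals[rows, lo + 1]
    return left + frac * (right - left)

t = pd.DataFrame([[1.5, 10, 5, 1], [0.8, 10, 2, 0]],
                 columns=['d', 'a0', 'a1', 'a2'])
t['newcol'] = interp_rowwise(t, ['a0', 'a1', 'a2'])  # [3.0, 3.6]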
