I have this simplified DataFrame where I want to add a new column Distance_km.
In this new column all values should be in kilometres and converted to float dtype.
import pandas as pd
d = {'Point': ['a', 'b', 'c', 'd'], 'Distance': ['3km', '400m', '1.1km', '200m']}
dist = pd.DataFrame(data=d)
dist
Point Distance
0 a 3km
1 b 400m
2 c 1.1km
3 d 200m
Point object
Distance object
dtype: object
How can I get this output?
Point Distance Distance_km
0 a 3km 3.0
1 b 400m 0.4
2 c 1.1km 1.1
3 d 200m 0.2
Point object
Distance object
Distance_km float64
dtype: object
Thanks in advance!
You could use the pandas apply method to pass each value in your Distance column to a function that converts it to a standard unit.
From the documentation
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is
either the DataFrame’s index (axis=0) or the DataFrame’s columns
(axis=1). By default (result_type=None), the final return type is
inferred from the return type of the applied function. Otherwise, it
depends on the result_type argument.
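To illustrate the axis argument described in the quoted documentation, here is a minimal sketch on a made-up frame (the column names a and b are hypothetical and not part of the question):

```python
import pandas as pd

# Hypothetical toy frame, only used to show what apply passes to the function
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_sums = df.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series indexed by column name
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_sums)
print(row_sums)
```

With axis=1 the function can read several columns of the same row, which is why the row-wise form is often used for conversions like the one below.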
First, create the function that will transform the data (apply can also take a lambda):
import re
import pandas as pd

def convert_to_km(distance):
    '''
    distance can be a string with km or m as units
    e.g. 300km, 1.1km, 200m, 4.5m
    '''
    # split the string into value and unit, e.g. '300' and 'km'
    split_dist = re.match(r'([\d\.]+)?([a-zA-Z]+)', distance)
    value = split_dist.group(1)  # 300
    unit = split_dist.group(2)  # km
    if unit == 'km':
        return float(value)
    if unit == 'm':
        # metres to kilometres, rounded to 2 decimals (values below 5 m become 0.0)
        return round(float(value) / 1000, 2)

d = {'Point': ['a', 'b', 'c', 'd'], 'Distance': ['3km', '400m', '1.1km', '200m']}
dist = pd.DataFrame(data=d)
You can then apply this function to your Distance column:
dist['Distance_km'] = dist['Distance'].apply(convert_to_km)
dist
The output will be:
Point Distance Distance_km
0 a 3km 3.0
1 b 400m 0.4
2 c 1.1km 1.1
3 d 200m 0.2
You may try the following as well:
Check whether the second-to-last character of the string is 'k'.
If it is, remove only the last two characters (i.e. 'km') and convert the rest to float.
Otherwise, take all characters except the last one (i.e. 'm'), convert them to float, and divide by 1000.
Below is the implementation, applying a lambda to the Distance column:
dist['Distance_km'] = dist['Distance'].apply(lambda s: float(s[:-2]) if s[-2] == 'k' else float(s[:-1]) / 1000)
Result is:
Point Distance Distance_km
0 a 3km 3.0
1 b 400m 0.4
2 c 1.1km 1.1
3 d 200m 0.2
Try:
# A "Weight" column holding the factor that converts each row to km: 1e-3 for "m", 1 for "km"
dist["Weight"] = 1e-3
dist.loc[dist["Distance"].str.contains("km"), "Weight"] = 1
# Extract the numeric part of the string and convert it to float
dist["NumericPart"] = dist["Distance"].str.extract(r"([0-9.]+)\w+", expand=False).astype(float)
# Combine the numeric parts with their weights by multiplication
dist["Distance_km"] = dist["NumericPart"] * dist["Weight"]
You will get:
Point Distance Weight NumericPart Distance_km
0 a 3km 1.000 3.0 3.0
1 b 400m 0.001 400.0 0.4
2 c 1.1km 1.000 1.1 1.1
3 d 200m 0.001 200.0 0.2
BTW: avoid apply when you can. It calls a Python function once per element, which becomes very slow on large data.
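The same vectorized idea can be written without helper columns. This is a minimal sketch, assuming the strings only ever end in km or m; the names parts and factors are introduced here for illustration:

```python
import pandas as pd

dist = pd.DataFrame({"Point": ["a", "b", "c", "d"],
                     "Distance": ["3km", "400m", "1.1km", "200m"]})

# Split each string into a numeric part and a unit part in one vectorized pass
parts = dist["Distance"].str.extract(r"^(?P<value>[\d.]+)(?P<unit>km|m)$")

# Conversion factors to kilometres (assumption: only km and m occur)
factors = {"km": 1.0, "m": 0.001}

# Multiply the numeric value by the factor that matches its unit
dist["Distance_km"] = parts["value"].astype(float) * parts["unit"].map(factors)
print(dist)
```

A string with any other unit would not match the pattern and would end up as NaN in Distance_km, which makes unexpected input easy to spot.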
I want to create a new column in my table by applying an equation, but there are two possible equations for the new column.
(1) frequency = (total x 100) / hour
(2) frequency = (total x 1000000) / km_length
the table is similar to this:
type hour km_length total
A 1 - 1
B - 2 1
The calculation for the "frequency" column depends on which of the columns hour and km_length has a value.
I expect the table to look like this:
type hour km_length total frequency
A 1 - 1 100
B - 2 1 500000
I have tried np.nan_to_num, but it did not produce the table I expect.
Is there any way I can do this in Python? Looking forward to any help.
Thank you.
We can use np.where for assigning values based on a condition:
df[["hour", "km_length"]] = df[["hour", "km_length"]].apply(pd.to_numeric, errors="coerce")
df["frequency"] = np.where(
    df["km_length"].isna(),
    df["total"] * 100 / df["hour"],
    df["total"] * 1_000_000 / df["km_length"]
)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
Make your values numeric, then compute both formulas. Because a missing value indicates which formula to use, and because division by NaN results in NaN, do both divisions and use .fillna to pick the valid result.
df[['hour', 'km_length']] = df[['hour', 'km_length']].apply(pd.to_numeric, errors='coerce')
s1 = df['total'].divide(df['hour']).multiply(100)
s2 = df['total'].divide(df['km_length']).multiply(10**6)
df['frequency'] = s1.fillna(s2)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
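Neither answer shows how df is built. The following is a self-contained sketch, assuming the "-" entries in the question's table are stored as strings, that runs both approaches on that table and shows they agree:

```python
import numpy as np
import pandas as pd

# Table from the question; assumption: "-" is a string placeholder for a missing value
df = pd.DataFrame({"type": ["A", "B"],
                   "hour": ["1", "-"],
                   "km_length": ["-", "2"],
                   "total": [1, 1]})

# "-" cannot be parsed as a number, so errors="coerce" turns it into NaN
df[["hour", "km_length"]] = df[["hour", "km_length"]].apply(pd.to_numeric, errors="coerce")

# Approach 1: choose the formula explicitly with np.where
freq_where = np.where(df["km_length"].isna(),
                      df["total"] * 100 / df["hour"],
                      df["total"] * 1_000_000 / df["km_length"])

# Approach 2: compute both formulas; the NaN result is filled from the other one
by_hour = df["total"] * 100 / df["hour"]
by_km = df["total"] * 1_000_000 / df["km_length"]
df["frequency"] = by_hour.fillna(by_km)

print(df)
```

If a row had values in both hour and km_length, the two approaches would differ: np.where would use the km_length formula, while fillna would keep the hour formula.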
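You can also store the data in a NumPy array.
import numpy as np
# columns: hour, km_length, total, frequency; "-" is stored as NaN and frequency starts at 0
table = np.array([[1, np.nan, 1, 0],
                  [np.nan, 2, 1, 0]], dtype=float)
for row in table:
    # iterating yields views, so assigning to row[3] updates the table in place
    if np.isnan(row[0]):
        row[3] = row[2] * 1_000_000 / row[1]
    else:
        row[3] = row[2] * 100 / row[0]
print(table)
This should print the table with the frequency column filled in.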
I have a DataFrame called data6 with 6000 rows, and I want to copy into a new 2000-row DataFrame, called result, only the Month column values where the level column value is 1.
How do I create a for loop with this rule?
Now:
in: data6 = df1[['level', 'Month']]
print(data6)
out: level Month
0 1.0 101.52
1 2.0 101.52
2 3.0 101.52
3 1.0 111.89
4 2.0 111.89
Expected after the for loop:
in: print(result)
out: level Month
0 1.0 101.52
1 1.0 111.89
2 1.0 112.27
3 1.0 89.57
4 1.0 110.35
Use Boolean indexing (see "Indexing and selecting data" in the pandas documentation):
# if level is a float
result = data6[data6.level == 1.0].reset_index(drop=True)
# if level is a string
result = data6[data6.level == '1.0'].reset_index(drop=True)
# if you only want the month column
result = pd.DataFrame(data6.Month[data6.level == 1.0]).reset_index(drop=True) # or '1.0'
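To show why no for loop is needed, here is a small self-contained sketch using only the five rows printed in the question (the variable name mask is introduced here for illustration):

```python
import pandas as pd

# First rows of data6 as printed in the question
data6 = pd.DataFrame({"level": [1.0, 2.0, 3.0, 1.0, 2.0],
                      "Month": [101.52, 101.52, 101.52, 111.89, 111.89]})

# The comparison yields a boolean Series with one True/False per row
mask = data6["level"] == 1.0

# .loc keeps only the rows where the mask is True
result = data6.loc[mask, ["level", "Month"]].reset_index(drop=True)
print(result)
```

The mask is evaluated for all rows at once, so the same two lines work unchanged on the full 6000-row frame.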
I have a situation where I want to use the results of a groupby on my training set to fill in values in my test set.
I don't think there's a straightforward way to do this in pandas, so I'm trying to use the apply method on the column in my test set.
MY SITUATION:
I want to use the average LotFrontage for each MSZoning value to infer the missing values in my LotFrontage column.
If I use the groupby method on my training set I get this:
train.groupby('MSZoning')['LotFrontage'].agg(['mean', 'count'])
giving.....
Now, I want to use these values to impute missing values on my test set, so I can't just use the transform method.
Instead, I created a function that I wanted to pass into the apply method, which can be seen here:
def fill_MSZoning(row):
    if row['MSZoning'] == 'C':
        return 69.7
    elif row['MSZoning'] == 'FV':
        return 59.49
    elif row['MSZoning'] == 'RH':
        return 58.92
    elif row['MSZoning'] == 'RL':
        return 74.68
    else:
        return 52.4
I call the function like this:
test['LotFrontage'] = test.apply(lambda x: x.fillna(fill_MSZoning), axis=1)
Now, the results for the LotFrontage column are the same as the Id column, even though I didn't specify this.
Any idea what is happening?
You can do it like this:
import pandas as pd
import numpy as np
## creating dummy data
np.random.seed(100)
raw = {
    "group": np.random.choice("A B C".split(), 10),
    "value": [np.nan if np.random.rand() > 0.8 else np.random.choice(100) for _ in range(10)]
}
df = pd.DataFrame(raw)
display(df)
## calculate mean
means = df.groupby("group").mean()
display(means)
Fill With Group Mean
## fill with mean value
def fill_group_mean(x):
    group_mean = means["value"].loc[x["group"].max()]
    return x["value"].mask(x["value"].isna(), group_mean)
r = df.groupby("group").apply(fill_group_mean)
r.reset_index(level=0)
Output
group value
0 A NaN
1 A 24.0
2 A 60.0
3 C 9.0
4 C 2.0
5 A NaN
6 C NaN
7 B 83.0
8 C 91.0
9 C 7.0
group value
0 A 42.00
1 A 24.00
2 A 60.00
5 A 42.00
7 B 83.00
3 C 9.00
4 C 2.00
6 C 27.25
8 C 91.00
9 C 7.00
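The example above fills a single frame from its own group means. For the train/test case in the question, one possible sketch is to compute the means on the training set and look them up for the test rows with map. The tiny train and test frames below are hypothetical stand-ins, and zone_means is a name introduced here:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's train and test sets
train = pd.DataFrame({"MSZoning": ["RL", "RL", "RH", "RH"],
                      "LotFrontage": [70.0, 80.0, 50.0, 60.0]})
test = pd.DataFrame({"MSZoning": ["RL", "RH", "RL"],
                     "LotFrontage": [np.nan, np.nan, 65.0]})

# Group means learned from the training set only
zone_means = train.groupby("MSZoning")["LotFrontage"].mean()

# Look up each test row's zone mean, then use it only where the value is missing
test["LotFrontage"] = test["LotFrontage"].fillna(test["MSZoning"].map(zone_means))
print(test)
```

A zone that appears in test but not in train maps to NaN, so those rows stay missing and would need a fallback such as the overall training mean.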
How to compare values to next or previous items in loop?
I need to summarize consecutive repetitions of occurrences in columns.
After that I need to create a "frequency table", so dfoutput should look like the picture at the bottom.
This code doesn't work because I can't compare to another item.
Maybe there is another, simpler way to do this without looping?
sumrep=0
df = pd.DataFrame(data = {'1' : [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],'2' : [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})
df.index= [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] # It will be easier to assign repetitions in output df - index will be equal to number of repetitions
dfoutput = pd.DataFrame(0,index=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],columns=['1','2'])
#example for column 1
for val1 in df.columns[1]:
    if val1 == 1 and val1 ==0: #can't find the way to check NEXT val1 (one row below) in column 1 :/
        if sumrep==0:
            dfoutput.loc[1,1]=dfoutput.loc[1,1]+1 #count only SINGLE occurrences of values and assign it to proper row number 1 in dfoutput
        if sumrep>0:
            dfoutput.loc[sumrep,1]=dfoutput.loc[sumrep,1]+1 #count repeated occurrences greater than 1 and assign them to proper row in dfoutput
            sumrep=0
    elif val1 == 1 and df[val1+1]==1 :
        sumrep=sumrep+1
Desired output table for column 1 - dfoutput:
I don't understand why there is no simple method to move around a DataFrame, like the Offset function in VBA in Excel :/
You can use the function defined here to perform fast run-length-encoding:
import numpy as np
def rlencode(x, dropna=False):
    """
    Run length encoding.
    Based on http://stackoverflow.com/a/32681075, which is based on the rle
    function from R.

    Parameters
    ----------
    x : 1D array_like
        Input array to encode
    dropna: bool, optional
        Drop all runs of NaNs.

    Returns
    -------
    start positions, run lengths, run values
    """
    where = np.flatnonzero
    x = np.asarray(x)
    n = len(x)
    if n == 0:
        return (np.array([], dtype=int),
                np.array([], dtype=int),
                np.array([], dtype=x.dtype))
    starts = np.r_[0, where(~np.isclose(x[1:], x[:-1], equal_nan=True)) + 1]
    lengths = np.diff(np.r_[starts, n])
    values = x[starts]
    if dropna:
        mask = ~np.isnan(values)
        starts, lengths, values = starts[mask], lengths[mask], values[mask]
    return starts, lengths, values
With this function your task becomes a lot easier:
import pandas as pd
from collections import Counter
from functools import partial
def get_frequency_of_runs(col, value=1, index=None):
    _, lengths, values = rlencode(col)
    return pd.Series(Counter(lengths[np.where(values == value)]), index=index)

df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})
df.apply(partial(get_frequency_of_runs, index=df.index)).fillna(0)
# 1 2
# 0 0.0 0.0
# 1 1.0 2.0
# 2 2.0 1.0
# 3 0.0 0.0
# 4 1.0 1.0
# 5 0.0 0.0
# 6 0.0 0.0
# 7 0.0 0.0
# 8 0.0 0.0
# 9 0.0 0.0
# 10 0.0 0.0
# 11 0.0 0.0
# 12 0.0 0.0
# 13 0.0 0.0
# 14 0.0 0.0
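On the question of comparing a value with the previous row: Series.shift plays roughly the role of an offset. As an alternative sketch that uses pandas only (the helper name run_length_counts is made up here, and this is not the approach of the answer above), runs can be labelled with shift and cumsum and then counted:

```python
import pandas as pd

df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})

def run_length_counts(col, value=1):
    # shift() lines each row up with the previous one, so != marks the start of a run
    run_id = (col != col.shift()).cumsum()
    # Each run of `value` shares one run_id; counting ids gives the run lengths
    lengths = run_id[col == value].value_counts()
    # Count how often each run length occurs
    return lengths.value_counts()

# One Series per column, aligned on run length; lengths that never occur are left out
counts = pd.DataFrame({c: run_length_counts(df[c]) for c in df.columns})
counts = counts.fillna(0).astype(int).sort_index()
print(counts)
```

Unlike the table above, this result only contains rows for run lengths that actually occur (1, 2 and 4 here); reindexing would be needed to get one row per possible length.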