I have a pandas dataframe that contains height information and I can't seem to figure out how to convert the somewhat unstructured information into an integer.
I figured the best way to approach this was with regex, but the main problem I'm having is that when I try to simplify the problem, I take the first item in the dataframe (7' 5.5") and try to use regex on it directly. It seemed impossible to put this data in a string because of the quotes, so I'm really confused about how to approach this problem.
here is my dataframe:
  HeightNoShoes HeightShoes
0       7' 5.5"         NaN
1        6' 11"    7' 0.25"
2      6' 7.75"       6' 9"
3       6' 5.5"    6' 6.75"
4        5' 11"       6' 0"
Output should be in inches:
   HeightNoShoes HeightShoes
0           89.5         NaN
1             83       84.25
2          79.75          81
3           77.5       78.75
4             71          72
My next option would be writing this to CSV and using Excel, but I would prefer to learn how to do it in Python/pandas. Any help would be greatly appreciated.
The previous answer is a good solution to the problem without using regular expressions. I will post this in case you are curious about how to approach it using your first idea (regexes).
It is possible to solve this using your approach of using a regular expression. In order to put the data you have (such as 7' 5.5") into a string in Python, you can escape the quote.
For example:
py_str = "7' 5.5\""
This, combined with a regular expression, will allow you to extract the information you need from the input data to calculate the output data. The input data consists of an integer (feet) followed by ', a space, and then a floating point number (inches). This float consists of one or more digits and then, optionally, a . and more digits. Here is a regular expression that can extract the feet and inches from the input data: ([0-9]+)' ([0-9]*\.?[0-9]+)"
The first group of the regex retrieves the feet and the second retrieves the inches. Here is an example of a function in python that returns a float, in inches, based on input data such as "7' 5.5\"", or NaN if there is no valid match:
Code:
import re

r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")

def get_inches(el):
    m = r.match(el)
    if m is None:
        return float('nan')
    else:
        return int(m.group(1)) * 12 + float(m.group(2))
Example:
>>> get_inches("7' 5.5\"")
89.5
You could apply that regular expression to the elements in the data. However, the solution of mapping your own function over the data works well too; I thought you might want to see how you could approach this using your original idea.
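For completeness, here is a minimal sketch of applying get_inches across both columns, assuming df is the dataframe from the question (note the guard for the NaN in HeightShoes, which is a float, not a string):

import numpy as np

# NaN cells are floats, so only parse actual strings
def get_inches_safe(el):
    return get_inches(el) if isinstance(el, str) else np.nan

for col in ['HeightNoShoes', 'HeightShoes']:
    df[col] = df[col].apply(get_inches_safe)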
One possible method without using regex is to write your own function and just apply it to the column/Series of your choosing.
Code:
import pandas as pd

df = pd.read_csv("test.csv")

def parse_ht(ht):
    # format: 7' 0.0"
    ht_ = ht.split("' ")
    ft_ = float(ht_[0])
    in_ = float(ht_[1].replace("\"", ""))
    return (12 * ft_) + in_

print(df["HeightNoShoes"].apply(parse_ht))
Output:
0    89.50
1    83.00
2    79.75
3    77.50
4    71.00
Name: HeightNoShoes, dtype: float64
Not perfectly elegant, but it does the job with minimal fuss. Best of all, it's easy to tweak and understand.
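If you also want to run it over HeightShoes, which contains NaN, a small guard is needed. A sketch (parse_ht_safe is a hypothetical wrapper, not part of the answer above):

def parse_ht_safe(ht):
    # pass NaN (a float) through untouched; parse only strings
    if isinstance(ht, str):
        return parse_ht(ht)
    return float('nan')

df["HeightShoes"] = df["HeightShoes"].apply(parse_ht_safe)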
Comparison versus the accepted solution:
In [9]: import re

In [10]: r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")
    ...: def get_inches(el):
    ...:     m = r.match(el)
    ...:     if m is None:
    ...:         return float('nan')
    ...:     else:
    ...:         return int(m.group(1))*12 + float(m.group(2))
    ...:
In [11]: %timeit get_inches("7' 5.5\"")
100000 loops, best of 3: 3.51 µs per loop
In [12]: %timeit parse_ht("7' 5.5\"")
1000000 loops, best of 3: 1.24 µs per loop
parse_ht is a little more than twice as fast.
First create the dataframe of height values
Let's first set up a Pandas dataframe to match the question, then convert the values shown in feet and inches to a numerical value using apply. NOTE: The questioner asks if the values can be converted to integers, but the first value in the 'HeightNoShoes' column is 7' 5.5". Since this string value is expressed in half inches, it will first be converted to a float value; you can then use the round function to round it before typecasting the values as integers.
# libraries
import numpy as np
import pandas as pd

# height data
no_shoes = ['''7' 5.5"''',
            '''6' 11"''',
            '''6' 7.75"''',
            '''6' 5.5"''',
            '''5' 11"''']

shoes = [np.nan,
         '''7' 0.25"''',
         '''6' 9"''',
         '''6' 6.75"''',
         '''6' 0"''']

# put height data into a Pandas dataframe
height_data = pd.DataFrame({'HeightNoShoes': no_shoes, 'HeightShoes': shoes})
height_data.head()
Next use a function to convert feet to float values
Here is a function that converts feet and inches to a float value.
def feet_to_float(cell_string):
    try:
        split_strings = cell_string.replace('"', '').replace("'", '').split()
        # feet * 12 + inches
        float_value = float(split_strings[0]) * 12 + float(split_strings[1])
    except (AttributeError, IndexError, ValueError):
        float_value = np.nan
    return float_value
Next, apply the function to each column in the dataframe.
# obtain a copy of the height data
df = height_data.copy()

for col in df.columns:
    print(col)
    df[col] = df[col].apply(feet_to_float)

df.head()
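The resulting frame should look like this (expected values, matching the question's desired output in inches):

   HeightNoShoes  HeightShoes
0          89.50          NaN
1          83.00        84.25
2          79.75        81.00
3          77.50        78.75
4          71.00        72.00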
Here is a function to convert float values to integers in a Pandas column that contains NaN values
If you would like to convert the dataframe to integer values with a NaN value in one column, you can use the following function and code. Note that the function rounds the values before typecasting them as integers; typecasting the float values as integers before rounding would just truncate them.
def float_to_int(cell_value):
    try:
        return int(round(cell_value, 0))
    except (TypeError, ValueError):
        return cell_value
for col in df.columns:
    df[col] = df[col].apply(float_to_int)
Note: Pandas displays columns that contain both NaN values and integers as float values.
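As an aside, if your pandas version supports nullable integer dtypes (0.24+), the 'Int64' dtype can hold true integers alongside missing values, sidestepping the float display; a minimal sketch:

df['HeightShoes'] = df['HeightShoes'].round().astype('Int64')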
Here is the code to convert a single column in the dataframe to a numerical value.
df = height_data.copy()
df['HeightNoShoes'] = df['HeightNoShoes'].apply(feet_to_float)
df.head()
This is how to convert the single column of float values to integers. Note that it's important to round the values first; typecasting them as integers before rounding will incorrectly truncate them.
df['HeightNoShoes'] = round(df['HeightNoShoes'],0).astype(int)
df.head()
There are NaN values in the second Pandas column labeled 'HeightShoes'. Both the feet_to_float and float_to_int functions found above should be able to handle these.
df = height_data.copy()
df['HeightShoes'] = df['HeightShoes'].apply(feet_to_float)
df['HeightShoes'] = df['HeightShoes'].apply(float_to_int)
df.head()
This may also serve the purpose, if you want the result in centimeters rather than inches:
def inch_to_cm(x):
    # leave NaN untouched
    if pd.isna(x):
        return x
    ft, inc = x.split("'")
    inches = float(inc.strip().strip('"'))
    return ((12 * int(ft)) + inches) * 2.54

df['Height'] = df['Height'].apply(inch_to_cm)
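A quick sanity check against the first height in the question:

print(inch_to_cm('''7' 5.5"'''))  # ~227.33 cm, i.e. 89.5 inches * 2.54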
Here is a way using str.extract()
(df.stack()
.str.extract(r"(\d+)' (\d+\.?\d*)")
.rename({0:'feet',1:'inches'},axis=1)
.astype(float)
.assign(feet = lambda x: x['feet'].mul(12))
.sum(axis=1)
.unstack())
Output:
   HeightNoShoes  HeightShoes
0          89.50          NaN
1          83.00        84.25
2          79.75        81.00
3          77.50        78.75
4          71.00        72.00
Related
I have a data frame like this:
  MONTH      TIME    PATH     RATE
0   Feb  15:24:11  enp1s0  14.71Kb
I want to create a function which can identify whether 'Kb' or 'Mb' is in the column RATE. If an entry in RATE ends with 'Kb' or 'Mb', it should strip the suffix and perform an operation to convert the value into plain b (bytes).
Here's my code so far where RATE is treated by the Dataframe as an object:
df = pd.DataFrame(listOfLists)

def strip(bytesData):
    if "Kb" in bytesData:
        bytesData/1000
    elif "Mb" in bytesData:
        bytesData/1000000

df['RATE'] = df.apply(lambda x: strip(x['byteData']), axis=1)
How can I get it to change the value within the column while stripping it of unwanted characters and converting it into the format I need? I know once this operation is complete I'll have to change it to an int, however, I can't seem to alter the data in the way I need.
Thanks in advance!
I modified your function a bit and used map(lambda x:) instead of apply, since we are working with a Series, not the full dataframe. I also added some extra rows to provide examples for both Kb and Mb, and for when neither is present:
import numpy as np
import pandas as pd

example_df = pd.DataFrame({'Month': [0, 1, 2, 3],
                           'Time': ['15:32', '16:42', '17:11', '15:21'],
                           'Path': ['xxxxx', 'yyyyy', 'zzzzz', 'aaaaa'],
                           'Rate': ['14.71Kb', '18.21Mb', '19.01Kb', 'Error_1']})

def case_1(value):
    if value[-2:] == 'Kb':
        return float(value[:-2]) * 1000
    elif value[-2:] == 'Mb':
        return float(value[:-2]) * 1000000
    else:
        return np.nan
example_df['Rate'] = example_df['Rate'].map(lambda x: case_1(x))
The logic of the function: if the value ends with Kb, multiply by 1,000; else if it ends with Mb, multiply by 1,000,000; otherwise simply return NaN (because neither of the two conditions is satisfied).
Output:
   Month   Time   Path        Rate
0      0  15:32  xxxxx     14710.0
1      1  16:42  yyyyy  18210000.0
2      2  17:11  zzzzz     19010.0
3      3  15:21  aaaaa         NaN
Here's an alternative approach that also handles other abbreviations. It does rely on the standard-library regex package re, though.
This approach makes a new column called Bytes. I often find it helpful to keep the RATE column in cases like this, to verify there aren't any edge cases I haven't thought of. I also use a mapping to obtain the power of 1000 each value must be multiplied by to get the correct number of bytes. I did add the code required to drop the original RATE column and rename the new column.
import re

def convert_to_bytes(contents):
    # split '14.71Kb' into value='14.71', label='Kb'
    value, label, _ = re.split('([A-Za-z]+)', contents)
    factors = {'Kb': 1, 'Mb': 2, 'Gb': 3, 'Tb': 4}
    return float(value) * 1000 ** factors[label]

df['Bytes'] = df['RATE'].map(convert_to_bytes)

# Drop original RATE column
df = df.drop('RATE', axis=1)

# Rename Bytes column to RATE
df = df.rename({'Bytes': 'RATE'}, axis='columns')
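For example, a quick check of the conversion (the Gb value is a made-up extra to show the other abbreviations):

print(convert_to_bytes('14.71Kb'))  # 14710.0
print(convert_to_bytes('2.5Gb'))    # 2500000000.0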
I am trying to find the mean of the given column in the data frame in Python (shown in the image). Some of the entries have ranges (e.g. 2-3 and 3-4), while others don't (e.g. 1 and 4).
Text version of the column in the dataframe:
lst = ["1", "2-3", "3-4", "4"]
df = pd.DataFrame(lst)
df

     0
0    1
1  2-3
2  3-4
3    4
I tried using the below function, but it doesn't work for the ones that don't have ranges.
# a function to split the range and take the mean
def split_mean(x):
    # split before and after the hyphen (-)
    split_num = x.split("-")
    mean = (float(split_num[0]) + float(split_num[1])) / 2
    return mean
Edit:
Had to replace the NULL values for the bottom function to work!
Change your function like this:
def split_mean(x):
    # split before and after the hyphen (-)
    split_num = x.split("-")
    if len(split_num) == 2:
        return (float(split_num[0]) + float(split_num[1])) / 2
    else:
        return float(x)
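Applying it to the example column should then give:

>>> df[0].apply(split_mean)
0    1.0
1    2.5
2    3.5
3    4.0
Name: 0, dtype: float64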
If you use instead:
from statistics import mean

df[0].str.split('-').transform(lambda x: mean(map(int, x)))
you will get the output:
0    1.0
1    2.5
2    3.5
3    4.0
Name: 0, dtype: float64
I am attempting to iterate over a specific column in my dataframe.
The column is:
df['column'] = ['1.4million', '1,235,000', '100million', np.nan, '14million', '2.5mill']
I am trying to clean this column and eventually get it all to integers to do more work with. I am stuck on the step to clean out "million." I would like to replace the "million" with five zeros when there is a decimal (ie 1.4million becomes 1.400000) and the "million" with six zeros when there is no decimal (ie 100million becomes 100000000).
To simplify, the first step I'm trying is to just focus on filtering out the values with a decimal and replace those with 5 zeros. I have attempted to use np.where for this, however I cannot use the replace method with numpy.
I also attempted to use pd.DataFrame.where, but am getting an error:
for i, row in df.iterrows():
    df.at[i,'column'] = pd.DataFrame.where('.' in df.at[i,'column'], df.at[i,'column'].replace('million',''), df.at[i,'column'])

AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I'm sure there is something I'm missing here. (I'm also sure that I'll be told that I don't need to use iterrows here, so I am open to suggestions on that as well.)
Given your sample data, it looks like you can strip out commas, then take all digits (and . characters) up to the string mill or the end of the string, and split those out, e.g.:
x = df['column'].str.replace(',', '').str.extract('(.*?)(mill.*)?$')
This'll give you:
         0        1
0      1.4  million
1  1235000      NaN
2      100  million
3      NaN      NaN
4       14  million
5      2.5     mill
Then take the number part and multiply it by a million where there's something in column 1, else multiply it by 1, e.g.:
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)
That'll give you:
0      1400000.0
1      1235000.0
2    100000000.0
3            NaN
4     14000000.0
5      2500000.0
Try this:
df['column'].apply(lambda x : x.replace('million','00000'))
Make sure your dtype is string before applying this
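For instance, you could coerce first (a sketch; note that astype(str) also turns NaN into the literal string 'nan'):

df['column'] = df['column'].astype(str)
df['column'] = df['column'].apply(lambda x: x.replace('million', '00000'))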
For the given data:
df['column'].apply(lambda x: float(str(x).split('m')[0])*10**6
if 'million' in str(x) or 'mill' in str(x) else x)
If the column may contain many different forms of "million", use a regex search instead, as sketched below.
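A minimal sketch of that idea (to_number is a hypothetical helper; it strips commas, grabs the number, and treats any 'mill...' suffix as a factor of one million):

import re
import numpy as np

def to_number(x):
    if not isinstance(x, str):
        return x  # leave NaN untouched
    m = re.match(r'([\d,.]+)\s*(mill\w*)?$', x)
    if not m:
        return np.nan
    num = float(m.group(1).replace(',', ''))
    return num * 1_000_000 if m.group(2) else num

df['column'] = df['column'].apply(to_number)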
Another simple question. I have to clean up some data, and a few of the columns need to be in int64 format instead of the objects they are now (example provided). How would I go about uniformly reformatting these columns?
print(data.Result)
0    98.8 PG/ML H
1         8.20000
2    26.8 PG/ML H
3    40.8 PG/ML H
4          CREDIT
5        15.30000
You could parse with regex:
import re
def parse_int(s):
    """
    A fast memoized function which builds a lookup dictionary
    then maps values to the series.
    """
    map_dict = {x: float(re.findall('[0-9.]+', x)[0])
                for x in s.unique() if re.search('[0-9.]+', x)}
    return s.map(map_dict)

data['Result'] = parse_int(data['Result'])
The function above takes all the unique values from the series and pairs them with its float equivalent. This is an extremely efficient approach in the case of repeated values. The function then maps these value pairs (map_dict) to the original series (s).
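Applied to the sample column, entries with no digits at all (such as CREDIT) are absent from map_dict and therefore map to NaN, so the expected result is roughly:

0    98.8
1     8.2
2    26.8
3    40.8
4     NaN
5    15.3
Name: Result, dtype: float64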
I'm working with imported lat/lon data from NASA's fireball data API (https://cneos.jpl.nasa.gov/fireballs/)
lat/lon data only have positive values
their direction (N/S and E/W) are in different columns called lat-dir/lon-dir
dataframe as below.
Now I want to:
Convert any lat values to negative (multiply by -1) if "lat-dir" == 'S'
Convert lon values to negative if "lon-dir" == 'W'
Below is roughly how I created my dataframe:
import requests
import pandas as pd
response = requests.get('https://ssd-api.jpl.nasa.gov/fireball.api')
j = response.json()
df = pd.DataFrame.from_dict(j[u'data'])
print( j[u'fields'] )
[u'date', u'energy', u'impact-e', u'lat', u'lat-dir', u'lon', u'lon-dir', u'alt', u'vel']
print( df.head() )
                     0    1      2     3     4     5     6     7     8
0  2019-12-06 10:19:57  4.6   0.15   3.3     S  37.7     W  19.5  None
1  2019-12-03 06:46:27  4.2   0.14   5.6     N  52.2     W  61.5  None
2  2019-11-28 20:30:54  2.7  0.095  35.7     N  31.7     W    35  13.0
3  2019-11-28 13:22:10  2.6  0.092  None  None  None  None  None  None
4  2019-11-28 11:55:02  2.5  0.089  22.1     S  25.7     E  22.5  24.7
Lines of code I've attempted:
Attempted to use df.apply() - though through my searching, I don't think you can easily reference two columns in this manner...
df['lat'] = df['lat'].apply(lambda x: x * -1 if (df['lat-dir'][x] == 'S'))
for i, row in df.iterrows():
    if row['lat-dir'] == 'S':
        df['lat'][i].apply(lambda x: x*-1)
For this, I get 'numpy.float64' object has no attribute 'apply' ?
Attempted to use masking:
if df['lon-dir'] == 'W':
    df['lon'] * -1
But frankly, I'm stumped on what to do next regarding applying the mask.
EDIT:
dfDate['lat'] = dfDate['lat'].apply(lambda row: row['lon'] * -1 , axis = 1 )
Attempted this as well per comments.
Yes, by either of the following:
A) using a vectorized mask: dfDate['lon-dir'] == 'W' (or the method form dfDate['lon-dir'].eq('W')) gives a boolean Series. Use that mask to negate the 'lon' column on those rows; putting the comparison inside a plain if statement fails, because the truth value of a whole Series is ambiguous.
B) using apply() row-wise: dfDate['lon'] = dfDate.apply(lambda row: ..., axis=1)
- and in your lambda selectively negate row['lon'] based on the value of row['lon-dir']
- the reason your apply call failed is that you need to apply over the rows of the dataframe, not the individual entries of one column. So: df.apply(lambda row: ..., axis=1), not df['lat'].apply(...)
lat-dir/lon-dir are essentially sign columns, you could convert them to +1/-1 when you read them in.
Code:
First some issues with your code that you'll want to fix:
Don't use the u'...' notation. Assuming you're using Python 3.x, you don't need u'...'; text is unicode by default in 3.x. And if you're not using Python 3.x, you really should switch over now, as 2.x is being sunsetted Jan 1, 2020.
Pass the JSON column names onto the dataframe, make your life easy:
df.columns = j['fields']
Reading in the JSON by passing response.json() into pd.DataFrame.from_dict() is a pain; your dataframe columns become string/'object' rather than the float ones being converted to float. Ideally we should be using pandas.read_json(..., dtype=...) for this and other convenience reasons.
You're going to want to convert the dtypes (e.g. string -> float) on numeric columns, which also automatically converts Python None -> pandas/numpy nan (so the vectorized code we're about to write gracefully handles nan instead of constantly throwing TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'). You can do this with either astype(...), pd.to_numeric(), or df.fillna(value=np.nan, inplace=True); a sketch follows this list.
Really those nan entries are going to keep being a pain for multiple reasons listed below (e.g. integers keep getting coerced back to float), so you'll probably want to drop or at least temporarily ignore the nan rows by doing:
df2 = df.dropna(how='any', inplace=False)  # probably not with inplace=True. Note that this preserves the row indices, so you can always insert the result of processing df2 back into df at the end. Read the dropna doc and figure out at what exact point you want to drop the nans.
Note that the 'vel' column actually has other nans which we want to ignore; you'll need to figure that out, or for now ignore them, e.g. do df2 = df[['date','energy','impact-e','lat','lat-dir','lon','lon-dir']].dropna(how='any', inplace=False)
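As a sketch of the dtype conversion mentioned above (column names taken from j['fields']; errors='coerce' turns None and other junk into NaN):

for col in ['energy', 'impact-e', 'lat', 'lon', 'alt', 'vel']:
    df[col] = pd.to_numeric(df[col], errors='coerce')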
Solution
Several ways to translate the lat/lon-dir columns to +/-1 signs:
A1) if you want the 'correct', nan-aware way which doesn't choke on nans...
df2['lat'] *= df2['lat-dir'].map({'N': +1, 'S': -1})
df2['lon'] *= df2['lon-dir'].map({'E': +1, 'W': -1})
A2) ...or a fast-and-dirty way:
df2['lat'] *= (-1) ** df2['lat-dir'].eq('S')
df2['lon'] *= (-1) ** df2['lon-dir'].eq('W')
B) But you can do this all in one row-wise apply() function:
def fixup_latlon_signs(row):
    row['lat'] = row['lat'] * (-1) ** (row['lat-dir'] == 'S')
    row['lon'] = row['lon'] * (-1) ** (row['lon-dir'] == 'W')
    return row

df2 = df2.apply(fixup_latlon_signs, axis=1)

# Then insert the non-NA rows we processed back into the parent dataframe:
df.update(df2)

# Strictly we can drop 'lat-dir','lon-dir' now...
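For instance, the drop that the last comment alludes to:

df = df.drop(columns=['lat-dir', 'lon-dir'])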