I have a column with data like 3.4500,00 EUR.
Now I want to compare this with another column holding float numbers like 4000.00.
How do I take this string, remove the EUR, replace the decimal comma with a dot, and then convert it to float so I can compare?
You can use regular expressions to make your conditions general that would work in all cases:
# Make example dataframe for showing answer
df = pd.DataFrame({'Value':['3.4500,00 EUR', '88.782,21 DOLLAR']})
Value
0 3.4500,00 EUR
1 88.782,21 DOLLAR
Use str.replace with a regular expression:
df['Value'].str.replace(r'[A-Za-z\.]', '', regex=True).str.replace(',', '.').astype(float)
0 34500.00
1 88782.21
Name: Value, dtype: float64
Explanation:
str.replace(r'[A-Za-z\.]', '', regex=True) removes all alphabetic characters and dots (regex=True makes sure the pattern is treated as a regular expression).
str.replace(',', '.') replaces the comma with a dot
astype(float) converts it from object (string) type to float
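Putting it together with the comparison asked about in the question (a small sketch; the column name Other is made up, since the question does not name the second column):

import pandas as pd

df = pd.DataFrame({'Value': ['3.4500,00 EUR', '88.782,21 DOLLAR'],
                   'Other': [4000.00, 90000.00]})   # 'Other' is a hypothetical float column

# strip letters and thousands dots, swap the decimal comma for a dot, cast to float
df['Value_num'] = (df['Value']
                   .str.replace(r'[A-Za-z\.]', '', regex=True)
                   .str.replace(',', '.')
                   .astype(float))

print(df['Value_num'] > df['Other'])
# 0     True
# 1    False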
Here is my solution:
mock data:
amount amount2
0 3.4500,00EUR 4000
1 3.600,00EUR 500
Use apply() to clean up the strings, then convert the data type to float:
data['amount'] = (data['amount']
                  .apply(lambda x: x.replace('EUR', ''))
                  .apply(lambda x: x.replace('.', ''))
                  .apply(lambda x: x.replace(',', '.'))
                  .astype('float'))
result:
amount amount2
0 34500.0 4000
1 3600.0 500
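With amount now numeric, the comparison from the question works directly (assuming amount2 is already numeric, as in the mock data):

data['amount'] > data['amount2']
# 0    True
# 1    True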
My task is to read data from Excel into a dataframe. The data is a bit messy, and to clean it up I've done:
df_1 = pd.read_excel(offers[0])
df_1 = df_1.rename(columns={'Наименование [Дата Файла: 29.05.2019 время: 10:29:42 ]':'good_name',
'Штрихкод':'barcode',
'Цена шт. руб.':'price',
'Остаток': 'balance'
})
df_1 = df_1[new_columns]
# I don't know why, but without replacing NaN with another character the code doesn't work
df_1.barcode = df_1.barcode.fillna('_')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to numeric
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
df_1.head()
It returns the barcode column with dtype float64 (why is that?)
0 0.000000e+00
1 7.613037e+12
2 7.613037e+12
3 7.613034e+12
4 7.613035e+12
Name: barcode, dtype: float64
Then I try to convert that column to integer.
df_1.barcode = df_1.barcode.astype(int)
But I keep getting silly negative numbers.
df_1.barcode[0:5]
0 0
1 -2147483648
2 -2147483648
3 -2147483648
4 -2147483648
Name: barcode, dtype: int32
Thanks to @Will and @micric, I eventually got to a solution.
df_1 = pd.read_excel(offers[0])
df_1 = df_1[new_columns]
# replacing NaN with 0, it'll help to convert the column explicitly to dtype integer
df_1.barcode = df_1.barcode.fillna('0')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to integer
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer')
Summary:
pd.to_numeric represents NaN as a float, so a column containing both NaN and non-NaN values should be expected to come out with dtype float64.
Check the size of the numbers you're dealing with: int32 has its limits, namely the range -2**31 .. 2**31 - 1 (a maximum of 2147483647), which 13-digit barcodes exceed.
Thanks a lot for your help, guys!
That number is the lower limit of a 32-bit integer. Your values are outside the int32 range you are trying to use, so the conversion overflows to that limit (notice that 2**32 = 4294967296; divided by 2 that is 2147483648, which is the magnitude of your number).
You should use astype('int64') instead.
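If you want to check those bounds yourself, NumPy exposes them (a quick sketch):

import numpy as np

print(np.iinfo(np.int32))   # min = -2147483648, max = 2147483647
print(np.iinfo(np.int64))   # comfortably holds 13-digit barcodes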
I ran into the same problem as OP, using
astype(np.int64)
solved mine, see the link here.
I like this solution because it's consistent with how I usually change the dtype of a pandas column; maybe someone could benchmark the performance of these solutions.
Many questions in one.
So your expected dtype...
pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
pd.to_numeric with downcast='integer' would give you an integer dtype; however, you have NaNs in your data, and pandas has to use float64 to represent NaN.
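If you do want an integer dtype while keeping the missing values, recent pandas versions have the nullable Int64 extension dtype (capital I). A minimal sketch, with made-up barcode values:

import pandas as pd

barcode = pd.Series(['7613037000000', None, '7613035000000'])   # hypothetical data

# to_numeric still produces float64 because of the missing value,
# but the nullable Int64 dtype can then hold it as <NA>
as_int = pd.to_numeric(barcode).astype('Int64')
print(as_int.dtype)   # Int64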
I need to convert a currency column in my DataFrame to float values so I can compute some stats.
Here's what the column looks like:
10.785,177
10.783,554
10.781,931
10.782,094
10.780,843
656,530
The result should be:
10785.177
10783.554
10781.931
10782.094
10780.843
656.530
I was trying to do something with regex but I don't know a lot about it. Any help is appreciated!
You can use apply() on the column like this:
df['col'].apply(lambda x: x.replace(".", "").replace(",",".")).astype(float)
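For reference, a runnable sketch with the values from the question (the column name col is assumed, as above):

import pandas as pd

df = pd.DataFrame({'col': ['10.785,177', '10.783,554', '656,530']})
print(df['col'].apply(lambda x: x.replace(".", "").replace(",", ".")).astype(float))
# 0    10785.177
# 1    10783.554
# 2      656.530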
You need to remove thousands separators (.), replace decimal separators (,) with ., and then you can use pd.to_numeric:
>>> df['col'].str.replace('.', '', regex=False).str.replace(',', '.', regex=False)\
... .transform(pd.to_numeric)
0 10785.177
1 10783.554
2 10781.931
3 10782.094
4 10780.843
5 656.530
Name: col, dtype: float64
I have a dataframe with 2 columns. I want to convert the COUNT column to int. It keeps giving me a ValueError: Unable to parse string "0.58%" at position 0.
METRIC COUNT
Scans 125487
No Reads 2541
Diverts 54710
No Code% 0.58%
No Read% 1.25%
df['COUNT'] = df['COUNT'].apply(pd.to_numeric)
How can I remove the % before the conversion?
You can use str.strip:
pd.to_numeric(df.col1.str.strip('%'))
0 1
1 2
2 3
Name: col1, dtype: int64
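Applied to the question's COUNT column it would look like this (a sketch; note the percentage rows become small floats such as 0.58, so the result ends up float64 rather than int):

# astype(str) guards against rows that were already parsed as numbers
df['COUNT'] = pd.to_numeric(df['COUNT'].astype(str).str.strip('%'))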
Try this. I'm assuming that the 0.58% is read in as a string, which means the replace function will work to strip the '%'; at that point the value can be converted to a number:
import pandas as pd
df = pd.DataFrame({'col1':['1','2','3%']})
df.col1.str.replace('%','').astype(float)
I am getting data from an American system. The numbers I get in the CSV are strings like "(100)" and I have to convert them to the integer -100. I have N columns in the data frame and I have to do it for all of them.
What I am doing now is replacing all the parentheses with an empty string or a negative sign. This is not the best solution, as it changes every value in the data frame.
import pandas as pd
df=pd.read_csv('American.csv', thousands=r',')
df=df.apply(lambda z: z.astype(str).str.replace(')', '', regex=False))
df=df.apply(lambda z: z.astype(str).str.replace('(', '-', regex=False))
What I expect:
"(100)" -> -100
"Nick (Jones)" ->"Nick **(Jones)**"
What I get:
"(100)" -> -100
"Nick (Jones)" ->"Nick **-Jones**"
I need code that does the necessary conversion for the numbers in all columns, but leaves the other values alone.
Use pandas.DataFrame.replace with regex=True:
df = pd.DataFrame(["(100)", "Nick (Jones)"])
new_df = df.replace(r'\((\d+)\)', r'-\1', regex=True)
print(new_df)
Output:
0
0 -100
1 Nick (Jones)
Regex:
It captures the digits inside the pair of parentheses (group #1) and puts a - in front of them (-group #1).
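If you then also want the numeric columns converted to numbers (the replace above leaves '-100' as a string), here is one possible follow-up sketch, assuming the columns have no missing values:

import pandas as pd

df = pd.DataFrame({'amount': ['(100)', '250'], 'name': ['Nick (Jones)', 'Anna']})   # made-up example

# turn "(100)" into "-100" everywhere, leaving other parenthesised text alone
df = df.replace(r'\((\d+)\)', r'-\1', regex=True)

# convert only the columns that are fully numeric after the replacement
for col in df.columns:
    converted = pd.to_numeric(df[col], errors='coerce')
    if converted.notna().all():
        df[col] = converted

print(df.dtypes)   # amount becomes int64, name stays object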
I have a pandas dataframe that contains height information and I can't seem to figure out how to convert the somewhat unstructured information into an integer.
I figured the best way to approach this was to use regex, but the main problem I'm having is that when I try to simplify the problem I take the first item in the dataframe (7' 5.5") and try to use regex on it directly. It seemed impossible to put this data into a string because of the quotes, so I'm really confused about how to approach this problem.
here is my dataframe:
HeightNoShoes HeightShoes
0 7' 5.5" NaN
1 6' 11" 7' 0.25"
2 6' 7.75" 6' 9"
3 6' 5.5" 6' 6.75"
4 5' 11" 6' 0"
Output should be in inches:
HeightNoShoes HeightShoes
0 89.5 NaN
1 83 84.25
2 79.75 81
3 77.5 78.75
4 71 72
My next option would be writing this to CSV and using Excel, but I would prefer to learn how to do it in Python/pandas. Any help would be greatly appreciated.
The previous answer is a good solution without using regular expressions. I will post this in case you are curious about how to approach the problem using your first idea (regexes).
It is possible to solve this using your approach of using a regular expression. In order to put the data you have (such as 7' 5.5") into a string in Python, you can escape the quote.
For example:
py_str = "7' 5.5\""
This, combined with a regular expression, will allow you to extract the information you need from the input data to calculate the output data. The input data consists of an integer (feet) followed by ', a space, and then a floating point number (inches). This float consists of one or more digits and then, optionally, a . and more digits. Here is a regular expression that can extract the feet and inches from the input data: ([0-9]+)' ([0-9]*\.?[0-9]+)"
The first group of the regex retrieves the feet and the second retrieves the inches. Here is an example of a function in python that returns a float, in inches, based on input data such as "7' 5.5\"", or NaN if there is no valid match:
Code:
import re

r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")

def get_inches(el):
    m = r.match(el)
    if m is None:
        return float('NaN')
    else:
        return int(m.group(1))*12 + float(m.group(2))
Example:
>>> get_inches("7' 5.5\"")
89.5
You could apply that regular expression to the elements in the data. However, the solution of mapping your own function over the data works well. Thought you might want to see how you could approach this using your original idea.
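For completeness, here is a sketch of applying it to both columns (it assumes the dataframe is called df, as in the question; NaN cells are floats, so they are passed through untouched):

for col in ['HeightNoShoes', 'HeightShoes']:
    df[col] = df[col].apply(lambda v: get_inches(v) if isinstance(v, str) else float('nan'))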
One possible method without using regex is to write your own function and just apply it to the column/Series of your choosing.
Code:
import pandas as pd
df = pd.read_csv("test.csv")
def parse_ht(ht):
    # format: 7' 0.0"
    ht_ = ht.split("' ")
    ft_ = float(ht_[0])
    in_ = float(ht_[1].replace("\"",""))
    return (12*ft_) + in_

print(df["HeightNoShoes"].apply(lambda x: parse_ht(x)))
Output:
0 89.50
1 83.00
2 79.75
3 77.50
4 71.00
Name: HeightNoShoes, dtype: float64
Not perfectly elegant, but it does the job with minimal fuss. Best of all, it's easy to tweak and understand.
Comparison versus the accepted solution:
In [9]: import re
In [10]: r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")
    ...: def get_inches2(el):
    ...:     m = r.match(el)
    ...:     if m is None:
    ...:         return float('NaN')
    ...:     else:
    ...:         return int(m.group(1))*12 + float(m.group(2))
    ...:
In [11]: %timeit get_inches2("7' 5.5\"")
100000 loops, best of 3: 3.51 µs per loop
In [12]: %timeit parse_ht("7' 5.5\"")
1000000 loops, best of 3: 1.24 µs per loop
parse_ht is a little more than twice as fast.
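One caveat: parse_ht assumes a string, so the NaN in HeightShoes would raise an AttributeError. A small guard (a sketch, not part of the original answer) handles it:

def parse_ht_safe(ht):
    # pass NaN through unchanged, otherwise parse as before
    return ht if pd.isna(ht) else parse_ht(ht)

df["HeightShoes"] = df["HeightShoes"].apply(parse_ht_safe)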
First create the dataframe of height values
Let's first set up a pandas dataframe to match the question, then convert the values shown in feet and inches to numerical values using apply. NOTE: the questioner asks if the values can be converted to integers; however, the first value in the 'HeightNoShoes' column is 7' 5.5". Since this string value is expressed in half inches, it will be converted to a float first. Then you can use the round function to round it before typecasting the values as integers.
# libraries
import numpy as np
import pandas as pd
# height data
no_shoes = ['''7' 5.5"''',
'''6' 11"''',
'''6' 7.75"''',
'''6' 5.5" ''',
'''5' 11"''']
shoes = [np.nan,
'''7' 0.25"''',
'''6' 9"''',
'''6' 6.75"''',
'''6' 0"''']
# put height data into a Pandas dataframe
height_data = pd.DataFrame({'HeightNoShoes':no_shoes, 'HeightShoes':shoes})
height_data.head()
Next use a function to convert feet to float values
Here is a function that converts feet and inches to a float value.
def feet_to_float(cell_string):
    try:
        split_strings = cell_string.replace('"','').replace("'",'').split()
        float_value = float(split_strings[0])*12 + float(split_strings[1])
    except:
        float_value = np.nan
    return float_value
Next, apply the function to each column in the dataframe.
# obtain a copy of the height data
df = height_data.copy()
for col in df.columns:
    print(col)
    df[col] = df[col].apply(feet_to_float)
df.head()
Here is a function to convert float values to integer values with NaN values in the Pandas column
If you would like to convert the dataframe to integer values with a NaN value in one column you can use the following function and code. Note, that the function rounds the values first before typecasting them as integers. Typecasting the float values as integers before rounding them will just truncate the values.
def float_to_int(cell_value):
    try:
        return int(round(cell_value,0))
    except:
        return cell_value

for col in df.columns:
    df[col] = df[col].apply(float_to_int)
Note: Pandas displays columns that contain both NaN values and integers as float values.
Here is the code to convert a single column in the dataframe to a numerical value.
df = height_data.copy()
df['HeightNoShoes'] = df['HeightNoShoes'].apply(feet_to_float)
df.head()
This is how to convert the single column of float values to integers. Note, that it's important to round the values first. Typecasting the values as integers before rounding them will incorrectly truncate the values.
df['HeightNoShoes'] = round(df['HeightNoShoes'],0).astype(int)
df.head()
There are NaN values in the second Pandas column labeled 'HeightShoes'. Both the feet_to_float and float_to_int functions found above should be able to handle these.
df = height_data.copy()
df['HeightShoes'] = df['HeightShoes'].apply(feet_to_float)
df['HeightShoes'] = df['HeightShoes'].apply(float_to_int)
df.head()
This may also serve the purpose (note that it converts the heights to centimetres rather than inches):
def inch_to_cm(x):
    if x is np.nan:
        return x
    else:
        ft, inc = x.split("'")
        inches = inc[1:-1]
        return ((12 * int(ft)) + float(inches)) * 2.54
df['Height'] = df['Height'].apply(inch_to_cm)
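If you want the result in inches, to match the question's expected output, the same parsing works without the 2.54 factor (a variant sketch, not part of the original answer; the function name is made up):

def inch_str_to_inches(x):
    if x is np.nan:
        return x
    ft, inc = x.split("'")
    return (12 * int(ft)) + float(inc[1:-1])

df['Height'] = df['Height'].apply(inch_str_to_inches)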
Here is a way using str.extract()
(df.stack()
.str.extract(r"(\d+)' (\d+\.?\d*)")
.rename({0:'feet',1:'inches'},axis=1)
.astype(float)
.assign(feet = lambda x: x['feet'].mul(12))
.sum(axis=1)
.unstack())
Output:
HeightNoShoes HeightShoes
0 89.50 NaN
1 83.00 84.25
2 79.75 81.00
3 77.50 78.75
4 71.00 72.00