Converting a DataFrame entry to float in Python/Pandas

I'm trying to save the values from the populationEst column in float variables using Python 3 & Pandas.
I have the following table:
Name           populationEst
Amsterdam      872757
Netherlands    17407585
I have tried to separate the two values as follows:
populationAM = pops['populationEst'][pops.Name == 'Amsterdam']
populationNL = pops['populationEst'][pops.Name == 'Netherlands']
However, when I try to print out the value with print(populationAM), I get this output:
0 872757
Name: PopulationEstimate2020-01-01, dtype: int64
and I think that populationAM & populationNL are not int values, because whenever I try to run an arithmetic operation on them I do not get the desired value.
For example, I have tried to calculate the fraction of populationAM against populationNL using this formula:
frac = populationAM.astype(float) * 100 / populationNL.astype(float)
and I did not get the desired output, which should be 5.013659276, but instead got this one:
0   NaN
1   NaN
Name: PopulationEst, dtype: float64
Can anybody tell me where I am going wrong here, or how I can save these values in simple float variables?

Is this what you are looking for?
populationAM = pops.loc[pops.Name == 'Amsterdam', 'populationEst'].iloc[0]
populationNL = pops.loc[pops.Name == 'Netherlands', 'populationEst'].iloc[0]
frac = populationAM * 100 / populationNL
The value of frac here is 5.013659275539944, while populationAM and populationNL are the integers corresponding to the respective populations (as you can see, the type of these variables is not a problem for computing the correct value of frac). In your code, the issue is that populationAM and populationNL are pandas Series instead of integers; iloc[0] retrieves the value at the first position of the Series.
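If you only want plain scalars, here is a minimal sketch of an alternative (assuming the same pops DataFrame) using Series.item(), which returns the single value and raises an error if the filter does not match exactly one row:
populationAM = pops.loc[pops.Name == 'Amsterdam', 'populationEst'].item()    # plain int
populationNL = pops.loc[pops.Name == 'Netherlands', 'populationEst'].item()  # plain int
frac = populationAM * 100 / populationNL                                     # 5.013659275539944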

Is this what you are trying to do?
populationAM = pops[pops['Name'] == 'Amsterdam']['populationEst']
populationNL = pops[pops['Name'] == 'Netherlands']['populationEst']

Maybe try this:
populationAM = pops['populationEst'][pops['Name'] == 'Amsterdam']
populationNL = pops['populationEst'][pops['Name'] == 'Netherlands']
It will be dtype: int64, but you can easily transform it to float.
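For instance, here is a minimal sketch of that conversion, assuming the one-element Series returned by the lines above (populationAM_float is just an illustrative name):
populationAM = populationAM.astype(float)          # still a one-element Series, now float64
populationAM_float = float(populationAM.iloc[0])   # plain Python float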

Related

Python- Generate values for a new column using wildcard list search on another column

Currently I am working on assigning different inflow rates (float values) to each product based on the product code. There should be 2 columns: 'PRODUCT_CODE' and 'INFLOW_RATE'. The product code has 4 characters and the rules are as follows:
If the code starts with 'L','H' or 'M': assign float value = 1.0 to 'INFLOW_RATE' column.
If the codes are 'SVND' or 'SAVL': assign float value = 0.1 to 'INFLOW_RATE' column.
Other cases: assign float value = 0.5 to 'INFLOW_RATE' column.
The sample data is as follows:
There are > 50 product codes so I believe it is best to check the conditions and assign values using wildcards. So far I managed to come up with this code:
import re
CFIn_01 = ['SVND','SAVL']
CFIn_10 = ["M.+","L.+","H.+"]
file_consol['INFLOW_RATE'] = 0.5
file_consol.loc[file_consol['PRODUCT_CODE'].isin(CFIn_01), 'INFLOW_RATE'] = 0.1
file_consol.loc[file_consol['PRODUCT_CODE'].isin(CFIn_10), 'INFLOW_RATE'] = 1.0
However, when I check the result, every row of 'INFLOW_RATE' is still filled with 0.5 instead of following the rules I expected. I'm not sure what the appropriate code for this problem would be. Any help or advice is appreciated!
Create a custom function, just as you would for a plain string:
def my_func(word: str):
    if word.startswith('L') or word.startswith('H') or word.startswith('M'):
        out = 1.0
    elif word == 'SVND' or word == 'SAVL':
        out = 0.1
    else:
        out = 0.5
    return out
Then apply the function:
file_consol['INFLOW_RATE'] = file_consol['PRODUCT_CODE'].apply(my_func)
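If you prefer to avoid apply, here is a vectorized sketch using numpy.select; it assumes the file_consol DataFrame and PRODUCT_CODE column from the question, and starts_lhm / exact_codes are just illustrative intermediate names:
import numpy as np

# Rule 1: codes starting with 'L', 'H' or 'M' get 1.0
starts_lhm = file_consol['PRODUCT_CODE'].str.startswith(('L', 'H', 'M'))
# Rule 2: exact codes 'SVND' or 'SAVL' get 0.1
exact_codes = file_consol['PRODUCT_CODE'].isin(['SVND', 'SAVL'])
# Everything else defaults to 0.5
file_consol['INFLOW_RATE'] = np.select([starts_lhm, exact_codes], [1.0, 0.1], default=0.5)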

Pandas DataFrame: How to cut off float decimal points without rounding?

I have longitude and latitude in two dataframes that are close together. If I run an exact similarity check such as
test_similar = test1_latlon.loc[~test1_latlon['cr'].isin(test2_latlon['cr'])]
I get a lot of failures because a lot of the numbers are off at the 5th decimal place. I want to truncate after the 3rd decimal. I've seen people format the values so they display truncated, but I want to change the actual value. Using round() rounds the data and I get even more errors, so is there a way to just drop everything after 3 decimal places?
You may want to use numpy.trunc:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1.2366, 1.2310], [1, 1]])
df1 = np.trunc(1000 * df) / 1000
print(df1, type(df1))
#        0      1
# 0  1.236  1.231
# 1  1.000  1.000 <class 'pandas.core.frame.DataFrame'>
Note that df1 is still a DataFrame, not a NumPy array.
As suggested here you can do:
x = 1.123456
float( '%.3f'%(x) )
If you want more decimal places, just change the 3 to whatever number of places you need.
import math
value1 = 1.1236
value2 = 1.1266
value1 = math.trunc(1000 * value1) / 1000
value2 = math.trunc(1000 * value2) / 1000
#value1 output
1.123
#value2 output
1.126
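Applied back to the comparison in the question, here is a hedged sketch (it assumes the 'cr' columns hold numeric coordinate values; test1_cr and test2_cr are just illustrative names) that truncates both sides to 3 decimals before the isin check:
import numpy as np

# truncate (not round) both columns to 3 decimal places
test1_cr = np.trunc(test1_latlon['cr'] * 1000) / 1000
test2_cr = np.trunc(test2_latlon['cr'] * 1000) / 1000
# rows of test1_latlon whose truncated value has no match in test2_latlon
test_similar = test1_latlon.loc[~test1_cr.isin(test2_cr)]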

Python / Pandas / Data Frame / Calculate date difference

I have a DataFrame, and I am doing the following:
def calculate_planungsphase(audit, phase1, phase2):
    datum_first_milestone = data_audit[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase1)]
    datum_second_milestone = data_audit[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase2)]
    print(datum_first_milestone['GeplantesErledigungsdatum'])
    print(datum_second_milestone['GeplantesErledigungsdatum'])
    print(datum_first_milestone['GeplantesErledigungsdatum'] - datum_second_milestone['GeplantesErledigungsdatum'])
The result of print(datum_first_milestone['GeplantesErledigungsdatum']) =
2018-01-01
Name: GeplantesErledigungsdatum, dtype: datetime64[ns]
The result of print(datum_second_milestone['GeplantesErledigungsdatum']) =
2018-01-02
Name: GeplantesErledigungsdatum, dtype: datetime64[ns]
The result of the difference calculation is:
0 NaT
1 NaT
Name: GeplantesErledigungsdatum, dtype: timedelta64[ns]
Why is the result of the calculation NaT? And why do I have two results when I am doing only one calculation? (index 0 and index 1 = NaT)
Thank you for your help!
The problem is that the index values differ, so the two Series are not aligned in the subtraction.
A possible solution, if both filtered Series have the same size, is to assign the same index values:
datum_first_milestone.index = datum_second_milestone.index
The solution can also be simplified if you only need the one column, by filtering with loc + the column name:
datum_first_milestone = data_audit.loc[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase1), 'GeplantesErledigungsdatum']
datum_second_milestone = data_audit.loc[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase2), 'GeplantesErledigungsdatum']
print(datum_first_milestone)
print(datum_second_milestone)
and if exactly one value is always returned, Series.item() gives you the scalar:
print (datum_first_milestone.item() - datum_second_milestone.item())
More generally, if there are one or more values, you can select the first value to get scalars:
print (datum_first_milestone.iat[0] - datum_second_milestone.iat[0])
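As a minimal, self-contained sketch of why the original subtraction produced NaT: two one-row Series with different index labels do not align, while extracting scalars sidesteps alignment entirely (a and b are toy Series standing in for the filtered data).
import pandas as pd

a = pd.Series(pd.to_datetime(['2018-01-01']), index=[0])
b = pd.Series(pd.to_datetime(['2018-01-02']), index=[1])
print(a - b)                 # index 0 and 1 are both NaT, because nothing aligns
print(a.iat[0] - b.iat[0])   # a plain Timedelta of -1 days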

How do I multiply a dataframe column by a float constant?

I'm trying to multiply a column by a float. I have the code for it here:
if str(cMachineName)==str("K42"):
df_temp.loc[:, "P"] *= float((105.0* 59.0*math.pi*0.95/1000)/3540)
But it gives me this error:
TypeError: can't multiply sequence by non-int of type 'float'.
How do I solve it?
I think the problem is some non-numeric values, like '45' stored as a string.
The solution is converting to float (or int) with astype:
import math
import pandas as pd

df_temp = pd.DataFrame({'P': [1, 2.5, '45']})
print (df_temp['P'].dtype)
object
df_temp["P"] = df_temp["P"].astype(float)
df_temp["P"] *= float((105.0* 59.0*math.pi*0.95/1000)/3540)
print (df_temp)
P
0 0.005223
1 0.013057
2 0.235030
Another problem is non-numeric data like 'gh'; to convert it you need to_numeric with errors='coerce', which turns such values into NaN:
df_temp = pd.DataFrame({'P':[1,2.5,'gh']})
print (df_temp['P'].dtype)
object
df_temp["P"] = pd.to_numeric(df_temp["P"], errors='coerce')
print (df_temp)
P
0 1.0
1 2.5
2 NaN
df_temp["P"] *= float((105.0* 59.0*math.pi*0.95/1000)/3540)
print (df_temp)
P
0 0.005223
1 0.013057
2 NaN
Maybe this is too simple an answer, but it worked for me and keeps things straightforward:
dataframe["new_column"] = dataframe["old_column"] * float_constant
January1st["weight_lb"] = January1st["weight_kg"] * 2.2
Use January1st.head() to see whether it worked.

Converting an object to float in pandas along with replacing a $ sign

I am fairly new to Pandas and I am working on a project where I have a column that looks like the following:
AverageTotalPayments
$7064.38
$7455.75
$6921.90
ETC
I am trying to get the cost factor out of it, where the cost could be anything above 7000. First, this column is an object, so I know that I probably cannot compare it to a number directly. The code that I have looks like the following:
import pandas as pd
health_data = pd.read_csv("inpatientCharges.csv")
state = input("What is your state: ")
issue = input("What is your issue: ")
#This line of code will create a new dataframe based on the two letter state code
state_data = health_data[(health_data.ProviderState == state)]
#With the new data set I search it for the injury the person has.
issue_data=state_data[state_data.DRGDefinition.str.contains(issue.upper())]
#I then make it replace the $ sign with a '' so I have a number. I also believe at this point my code may be starting to break down.
issue_data = issue_data['AverageTotalPayments'].str.replace('$', '')
#Since the previous line took out the $ I convert it from an object to a float
issue_data = issue_data[['AverageTotalPayments']].astype(float)
#I attempt to print out the values.
cost = issue_data[(issue_data.AverageTotalPayments >= 10000)]
print(cost)
When I run this code I simply get NaN back, which is not exactly what I want. Any help with what is wrong would be great! Thank you in advance.
Try this:
In [83]: df
Out[83]:
AverageTotalPayments
0 $7064.38
1 $7455.75
2 $6921.90
3 aaa
In [84]: df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000
Out[84]:
0 True
1 True
2 False
3 False
Name: AverageTotalPayments, dtype: bool
In [85]: df[df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000]
Out[85]:
AverageTotalPayments
0 $7064.38
1 $7455.75
Consider the pd.Series s
s
0 $7064.38
1 $7455.75
2 $6921.90
Name: AverageTotalPayments, dtype: object
This gets the float values
pd.to_numeric(s.str.replace('$', ''), 'ignore')
0 7064.38
1 7455.75
2 6921.90
Name: AverageTotalPayments, dtype: float64
Filter s
s[pd.to_numeric(s.str.replace('$', ''), 'ignore') > 7000]
0 $7064.38
1 $7455.75
Name: AverageTotalPayments, dtype: object
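One hedged, version-dependent note on the same idea: older pandas versions treat the str.replace pattern as a regular expression by default, where a bare '$' matches the end of the string rather than the dollar sign, so being explicit with regex=False (and errors='coerce' to turn unparsable entries into NaN) is a safer sketch ('cleaned' is just an illustrative name):
cleaned = pd.to_numeric(s.str.replace('$', '', regex=False), errors='coerce')
s[cleaned > 7000]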
