RuntimeWarning: invalid value encountered in longlong_scalars

RuntimeWarning: invalid value encountered in longlong_scalars - python

What I'm trying to do
I want to report the weekly rejection rate for multiple users. I use a for loop to go through a monthly dataset to get the numbers for every user. The final dataframe, rates, should look something like:
The end product, rates
Description
I have an initial dataframe (numbers), that contains only the ACCEPT, REJECT and REVIEW numbers, where I added these rows and columns:
Rows: Grand Total, Rejection Rate
Columns: Grand Total
Here's how numbers look like:
|---|--------|--------|--------|--------|-------------|
| | Week 1 | Week 2 | Week 3 | Week 4 | Grand Total |
|---|--------|--------|--------|--------|-------------|
| 0 | 994 | 699 | 529 | 877 | 3099 |
|---|--------|--------|--------|--------|-------------|
| 1 | 27 | 7 | 8 | 13 | 55 |
|---|--------|--------|--------|--------|-------------|
| 2 | 100 | 86 | 64 | 107 | 357 |
|---|--------|--------|--------|--------|-------------|
| 3 | 1121 | 792 | 601 | 997 | 3511 |
|---|--------|--------|--------|--------|-------------|
The indexes represent the following values:
0 - ACCEPT
1 - REJECT
2 - REVIEW
3 - TOTAL (Accept+Reject+Review)
I wrote 2 pre-defined functions:
get_decline_rates(df): The get the decline rates by week in the numbers dataframe.
copy(empty_df, data): To transfer all data to a new dataframe with "double" headers (for reporting purposes).
Here's my code where I add rows and columns to numbers, then re-format it:
# Adding "Grand Total" column and rows
totals = numbers.sum(axis=0) # column sum
numbers = numbers.append(totals, ignore_index=True)
grand_total = numbers.sum(axis=1) # row sum
numbers.insert(len(numbers.columns), "Grand Total", grand_total)
# Adding "Rejection Rate" and re-indexing numbers
decline_rates = get_decline_rates(numbers)
numbers = numbers.append(decline_rates, ignore_index=True)
numbers.index = ["ACCEPT","REJECT","REVIEW","Grand Total","Rejection Rate"]
# Creating a new df with report format requirements
final = pd.DataFrame(0, columns=numbers.columns, index=["User A"]+list(numbers.index))
final.ix["User A",:] = final.columns
# Copying data from numbers to newly formatted df
copy(final,numbers)
# Append final df of this user to the final dataframe
rates = rates.append(final)
I'm using Python 3.5.2 and Pandas 0.19.2. If it helps, here's how the initial dataset looks like:
Data format
I do a resampling on the date column to get the data by week.
What's going wrong
Here's the funny part - the code runs fine and I get all the required information in rates. However, I'm seeing this warning message:
RuntimeWarning: invalid value encountered in longlong_scalars
If i break down the code and run it line by line, this message does not appear. Even the message looks weird (what does longlong_scalars even mean?) Does anyone know what this warning message mean, and what's causing it?
UPDATE:
I just ran a similar script that takes in exactly the same input and produces a similar output (except I get daily rejection rates instead of weekly). I get the same Runtime warning, except more information is given:
RuntimeWarning: invalid value encountered in longlong_scalars
rej_rate = str(int(round((col.ix[1 ]/col.ix[3 ])*100))) + "%"
I suspect something must have gone wrong when I was trying to calculate the decline rates with my pre-defined function, get_decline_rates(df). Could it be due to the dtype of the values? All columns on the input df, numbers, are int64.
Here's the code for my pre-defined function (the input, numbers, can be found under Description):
# Description: Get rejection rates for all weeks.
# Parameters: Pandas Dataframe with ACCEPT, REJECT, REVIEW count by week.
# Output: Pandas Series with rejection rates for all days in input df.
def get_decline_rates(df):
decline_rates = []
for i in range(len(df.columns)):
col = df.ix[:,i]
try:
rej_rate = str(int(round((col[1]/col[3])*100))) + "%"
except ValueError:
rej_rate = "0%"
decline_rates.append(rej_rate)
return pd.Series(decline_rates, index=df.columns)

I had the same RuntimeWarning, and after looking into the data, it was because of a null-division. I did not have the time to look into your sample, but you could look around id=0, or some other records, where null-division or such could occur.

Related

Get just one value of a nested list inside a dictionary to create a Dataframe Update #1

I am using an API that returns a dictionary with a nested list inside, lets name it coins_best The result looks like this:
{'bitcoin': [[1603782192402, 13089.646908288987],
[1603865643028, 13712.070136258053]],
'ethereum': [[1603782053064, 393.6741989091851],
[1603865024078, 404.86117057956386]]}
The first value on the list is a timestamp, while the second is a price in dollars. I want to create a DataFrame with the prices and having the timestamps as index. I tried with this code to do it in just one step:
d = pd.DataFrame()
for id, obj in coins_best.items():
for i in range(0,len(obj)):
temp = pd.DataFrame({
obj[i][1]
}
)
d = pd.concat([d, temp])
d
This attempt gave me a DataFrame with just one column and not the two required, because using the columns argument threw errors (TypeError: Index(...) must be called with a collection of some kind, 'bitcoin' was passed) when I tried with id
Then I tried with comprehensions to preprocess the dictionary and their lists:
for k in coins_best.keys():
inner_lists = (coins_best[k] for inner_dict in coins_best.values())
items = (item[1] for ls in inner_lists for item in ls)
I could not obtain the both elements in the dictionary, just the last.
I know is possible to try:
df = pd.DataFrame(coins_best, columns=coins_best.keys())
Which gives me:
bitcoin ethereum
0 [1603782192402, 13089.646908288987] [1603782053064, 393.6741989091851]
1 [1603785693143, 13146.275972229188] [1603785731599, 394.6174435303511]
And then try to remove the first element in every list of every row, but was even harder to me. The required answer is:
bitcoin ethereum
1603782192402 13089.646908288987 393.6741989091851
1603785693143 13146.275972229188 394.6174435303511
Do you know how to process the dictionary before creating the DataFrame in order the get this result?
Is my first question, I tried to be as clear as possible. Thank you very much.
Update #1
The answer by Sander van den Oord also solved the problem of timestamps and is useful for its purpose. However, the sample code while correct (as it used the info provided) was limited to these two keys. This is the final code that solved the problem for every key in the dictionary.
for k in coins_best:
df_coins1 = pd.DataFrame(data=coins_best[k], columns=['timestamp', k])
df_coins1['timestamp'] = pd.to_datetime(df_coins1['timestamp'], unit='ms')
df_coins = pd.concat([df_coins1, df_coins], sort=False)
df_coins_resampled = df_coins.set_index('timestamp').resample('d').mean()
Thank you very much for your answers.

I think you shouldn't ignore the fact that values of coins are taken at different times. You could do something like this:
import pandas as pd
import hvplot.pandas
coins_best = {
'bitcoin': [[1603782192402, 13089.646908288987],
[1603865643028, 13712.070136258053]],
'ethereum': [[1603782053064, 393.6741989091851],
[1603865024078, 404.86117057956386]],
}
df_bitcoin = pd.DataFrame(data=coins_best['bitcoin'], columns=['timestamp', 'bitcoin'])
df_bitcoin['timestamp'] = pd.to_datetime(df_bitcoin['timestamp'], unit='ms')
df_ethereum = pd.DataFrame(data=coins_best['ethereum'], columns=['timestamp', 'ethereum'])
df_ethereum['timestamp'] = pd.to_datetime(df_ethereum['timestamp'], unit='ms')
df_coins = pd.concat([df_ethereum, df_bitcoin], sort=False)
Your df_coins will now look like this:
+----+----------------------------+------------+-----------+
| | timestamp | ethereum | bitcoin |
|----+----------------------------+------------+-----------|
| 0 | 2020-10-27 07:00:53.064000 | 393.674 | nan |
| 1 | 2020-10-28 06:03:44.078000 | 404.861 | nan |
| 0 | 2020-10-27 07:03:12.402000 | nan | 13089.6 |
| 1 | 2020-10-28 06:14:03.028000 | nan | 13712.1 |
+----+----------------------------+------------+-----------+
Now if you want values to be on the same line, you could use resampling, here I do it per day: all values of the same day for a coin type are averaged:
df_coins_resampled = df_coins.set_index('timestamp').resample('d').mean()
df_coins_resampled will look like this:
+---------------------+------------+-----------+
| timestamp | ethereum | bitcoin |
|---------------------+------------+-----------|
| 2020-10-27 00:00:00 | 393.674 | 13089.6 |
| 2020-10-28 00:00:00 | 404.861 | 13712.1 |
+---------------------+------------+-----------+
I like to use hvplot to get an interactive plot of the result:
df_coins_resampled.hvplot.scatter(
x='timestamp',
y=['bitcoin', 'ethereum'],
s=20, padding=0.1
)
Resulting plot:

There are different timestamps, so the correct output looks differently, than what you presented, but other than that, its a oneliner (where d is your input dictionary):
pd.concat([pd.DataFrame(val, columns=['timestamp', key]).set_index('timestamp') for key, val in d.items()], axis=1)

How can I get pandas to adjust my formula based on a specific value in a dataframe?

I have a pandas dataframe that looks like this:
Emp_ID | Weekly_Hours | Hire_Date | Termination_Date | Salary_Paid | Multiplier | Hourly_Pay
A1 | 35 | 01/01/1990 | 06/04/2020 | 5000 | 0.229961 | 32.85
B2 | 35 | 02/01/2020 | NaN | 10000 | 0.229961 | 65.70
C3 | 30 | 23/03/2020 | NaN | 5800 | 0.229961 | 44.46
The multiplier is a static figure for all employees, calculated as 7 / 30.44. The hourly pay is worked out by multiplying the monthly salary by the multiplier and dividing by the weekly contracted hours.
Now my challenge is to get Pandas to recognise a date in the Termination Date field, and adjust the calculation. For instance, the first record would need to be updated to show that the employee was actually paid 5k through the payroll for 4 business days, not the full month, given that they resigned on 06/04/2020. So the expected hourly pay figure would be (5000 / 4 * 7 / 35) = 250.
I can code the calculation quite easily; my struggle is adding a column to reflect the business days (4 in the above example) in a fresh column for all April leavers (not interested in any other months). So far I have tried.
df['T_Mth_Workdays'] = np.where(df['Termination_Date'].notnull(), np.busday_count('2020-04-01', df['Termination_Date']), 0)
However the above approach returns an error stating that:
iterator operand 0 dtype could not be cast from dtype(' m8 [ns] ') to dtype(' m8 [d] ')
I should add here that I had to change the dates to datetime[ns64] format manually.
Any pointers gratefully received. Thanks!

The issue with your np.where function call is that it is trying to pass the entire series df["Termination_Date"] as an argument to np.busday_count. The count function fails because it requires arguments to be in the np.datetime64[D] format (i.e., value only specified to the day), and the Series cannot be easily converted to this format.
One solution is to write a custom function that only calls that np.busday_count on elements that are not NaTs, converting those to the datetime64[D] type before calling np.busday_count. Then, you can apply the custom function to the df["Termination_Date"] series, as below:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
DATE_FORMAT = "%d-%m-%Y"
# Reproduce raw data
raw_data = [
["A1", 35, "01/01/1990", "06/04/2020", 5000, 0.229961, 32.85],
["B2", 35, "02/01/2020", None, 10000, 0.229961, 65.70],
["C3", 35, "23/03/2020", "NAT", 5800, 0.229961, 44.46],
]
# Convert raw dates to ISO format, then np.datetime64
def parse_raw_dates(s):
try:
spl = s.split("/")
ds = "%s-%s-%s" %(spl[2], spl[1], spl[0])
except:
ds = "NAT"
return np.datetime64(ds)
for line in raw_data:
line[2] = parse_raw_dates(line[2])
# Create dataframe
df = pd.DataFrame(
data = raw_data,
columns = [
"Emp_ID", "Weekly_Hours", "Hire_Date", "Termination_Date",
"Salary_Paid", "Multiplier", "Hourly_Pay"],
)
# Create special conversion function
def myfunc(d):
d = d.to_numpy().astype('datetime64[D]')
if np.isnat(d):
return 0
else:
return np.busday_count('2020-04-01', d)
df['T_Mth_Workdays'] = df["Termination_Date"].apply(myfunc)
def format_date(d):
d = d.to_numpy().astype('datetime64[D]')
if np.isnat(d):
return ""
else:
return pd.to_datetime(d).strftime(DATE_FORMAT)
df["Hire_Date"] = df["Hire_Date"].apply(format_date)
df["Termination_Date"] = df["Termination_Date"].apply(format_date)

Posting my approach here in case it helps others in the future. Firstly code for creating the dataframe:
d = {'Emp_ID': ['A1', 'B2', 'C3'], 'Weekly Hours': ['35', '35', '30'], 'Hire_Date': ['01/01/1990', '02/01/2020', '23/03/2020'],
'Termination_Date': ['06/04/2020', np.nan, np.nan], 'Salary_Paid': [5000, 10000, 5800]}
df = pd.DataFrame(data=d)
df
The first step was to convert the dates to a more useable format - this is where pd.to_datetime() comes in handy -the adjustment needed was to specify the format.
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'], format='%d/%m/%Y')
df['Termination_Date'] = pd.to_datetime(df['Termination_Date'], format='%d/%m/%Y')
This has the desired effect; whereby the dates are correctly represented and April is picked up as the right month of termination for employee A1.
I now (slightly) adjusted Ken's custom solution for calculating the working days in April:
def workday_calc(d):
d = d.to_numpy().astype('datetime64[D]')
if np.isnat(d):
return 30.44
else:
d = d.astype(str)
d = dt.datetime.strptime(d, '%Y-%m-%d')
e = (d + dt.timedelta(1)).strftime('%Y-%m-%d')
return np.busday_count('2020-04-01', e, weekmask=[1,1,1,1,1,0,0])
I spotted the error while reviewing numpy documentation on np.busday_count(). There are two useful pointers to note:
The use of the datetime64[D] is mandatory in the first line of the function - you can't use pd.to_datetime(). This is because the datetime64[D] format is a pre-requisite to being able to call the np.isnat() function.
However, the minute we deal with the NaT in the dataframe, we need to switch back to a string format, which is needed for the datetime.strptime() function.
Using the datetime.strptime() feature, we tell Python that the date is a) represented in the ISO format, and we need to retain it as a string. The advantage with both datetime.strptime() and np.busday_count() is that they are both built to handle strings.
Also, the np.busday_count() excludes the end date, so I used timedelta() to increment the end date by one, so that all the dates in the interim are counted. This may or may not be appropriate given what you're trying to do, but I wanted an inclusive count of days worked in April. So in this case, the employee has worked for 4 business days in April.
We then simply apply the custom function and create a new column.
df['Days_Worked_April'] = df['Termination_Date'].apply(workday_calc)
I was now able to use the freshly created column to derive my multiplier - using the same old approach. The rest is simple, but I'm including the code and results below for completeness.
df['Multiplier'] = df.apply(lambda x: 7 / x['Days_Worked_April'], axis=1)
df['Hourly_Pay_Calc'] = round((df.apply(lambda x: x['Salary_Paid'] * x['Multiplier'] / x['Weekly Hours'], axis=1)), 2)
Output:
Emp_ID Weekly Hours Hire_Date Termination_Date Salary_Paid Days_Worked_April Multiplier Hourly_Pay_Calc
0 A1 35.0 1990-01-01 2020-04-06 5000 4.00 1.750000 250.00
1 B2 35.0 2020-01-02 NaT 10000 30.44 0.229961 65.70
2 C3 30.0 2020-03-23 NaT 5800 30.44 0.229961 44.46

Compare multiple cells in openpyxl

I need to make a comparison of multiple cells in openpyxl but I have not been successful. To be more precise, I have an .xlsx file that I import into my python script, which contains 4 columns, and around 70,000 rows. The rows that have the same first 3 columns, must be joined and add the digit that appears in the fourth column.
For example
Row 1 .. Type of material: A | Location: NY | Month of sale: January | Cost: 100
..
Row 239 Type of material: A | Location: NY | Month of sale: January | Cost: 150
..
Row 1020 Type of material: A | Location: NY | Month of sale: January | Cost: 80
..
etc
Assuming that only such matches existed, a new data table must be generated (for example in a data sheet) where only one row appears in this way:
Type of material: A | Location: NY | Month of sale: January | Cost: 330 (sum of costs)
And so on, with all the data in .xlsx file to get a new consolidated table.
I hope to have been clear with the explanation, but if it was not, I can be even more precise if necessary.
As I mentioned at the beginning, I have not been successful so far, so I will appreciate any help!
Thank you very much

instead of reading it via openpyxl, I would use pandas
import pandas as pd
raw_data = pd.read_excel(filename, header=0)
summary = raw_data.groupby(['Type of material', 'Location', 'Month of sale'])['Cost'].sum()
If this raises some KeyErrors you'll need to fix the labels

Generating a retention cohort from a pandas dataframe

I have a pandas dataframe that looks like this:
+-----------+------------------+---------------+------------+
| AccountID | RegistrationWeek | Weekly_Visits | Visit_Week |
+-----------+------------------+---------------+------------+
| ACC1 | 2015-01-25 | 0 | NaT |
| ACC2 | 2015-01-11 | 0 | NaT |
| ACC3 | 2015-01-18 | 0 | NaT |
| ACC4 | 2014-12-21 | 14 | 2015-02-12 |
| ACC5 | 2014-12-21 | 5 | 2015-02-15 |
| ACC6 | 2014-12-21 | 0 | 2015-02-22 |
+-----------+------------------+---------------+------------+
It's essentially a visit log of sorts, as it holds all the necessary data for creating a cohort analysis.
Each registration week is a cohort.
To know how many people are part of the cohort I can use:
visit_log.groupby('RegistrationWeek').AccountID.nunique()
What I want to do is create a pivot table with the registration weeks as keys. The columns should be the visit_weeks and the values should be the count of unique account ids who have more than 0 weekly visits.
Together with the total accounts in each cohort, I will then be able to show percentages instead of absolute values.
The end product would look something like this:
+-------------------+-------------+-------------+-------------+
| Registration Week | Visit_week1 | Visit_Week2 | Visit_week3 |
+-------------------+-------------+-------------+-------------+
| week1 | 70% | 30% | 20% |
| week2 | 70% | 30% | |
| week3 | 40% | | |
+-------------------+-------------+-------------+-------------+
I tried pivoting the dataframe like this:
visit_log.pivot_table(index='RegistrationWeek', columns='Visit_Week')
But I haven't nailed down the value part. I'll need to somehow count account Id and divide the sum by the registration week aggregation from above.
I'm new to pandas so if this isn't the best way to do retention cohorts, please enlighten me!
Thanks

There are several aspects to your question.
What you can build with the data you have
There are several kinds of retention. For simplicity, we’ll mention only two :
Day-N retention : if a user registered on day 0, did she log in on day N ? (Logging on day N+1 does not affect this metric). To measure it, you need to keep track of all the logs of your users.
Rolling retention : if a user registered on day 0, did she log in on day N or any day after that ? (Logging in on day N+1 affects this metric). To measure it, you just need the last know logs of your users.
If I understand your table correctly, you have two relevant variables to build your cohort table : registration date, and last log (visit week). The number of weekly visits seems irrelevant.
So with this you can only go with option 2, rolling retention.
How to build the table
First, let's build a dummy data set so that we have enough to work on and you can reproduce it :
import pandas as pd
import numpy as np
import math
import datetime as dt
np.random.seed(0) # so that we all have the same results
def random_date(start, end,p=None):
# Return a date randomly chosen between two dates
if p is None:
p = np.random.random()
return start + dt.timedelta(seconds=math.ceil(p * (end - start).days*24*3600))
n_samples = 1000 # How many users do we want ?
index = range(1,n_samples+1)
# A range of signup dates, say, one year.
end = dt.datetime.today()
from dateutil.relativedelta import relativedelta
start = end - relativedelta(years=1)
# Create the dataframe
users = pd.DataFrame(np.random.rand(n_samples),
index=index, columns=['signup_date'])
users['signup_date'] = users['signup_date'].apply(lambda x : random_date(start, end,x))
# last logs randomly distributed within 10 weeks of singing up, so that we can see the retention drop in our table
users['last_log'] = users['signup_date'].apply(lambda x : random_date(x, x + relativedelta(weeks=10)))
So now we should have something that looks like this :
users.head()
Here is some code to build a cohort table :
### Some useful functions
def add_weeks(sourcedate,weeks):
return sourcedate + dt.timedelta(days=7*weeks)
def first_day_of_week(sourcedate):
return sourcedate - dt.timedelta(days = sourcedate.weekday())
def last_day_of_week(sourcedate):
return sourcedate + dt.timedelta(days=(6 - sourcedate.weekday()))
def retained_in_interval(users,signup_week,n_weeks,end_date):
'''
For a given list of users, returns the number of users
that signed up in the week of signup_week (the cohort)
and that are retained after n_weeks
end_date is just here to control that we do not un-necessarily fill the bottom right of the table
'''
# Define the span of the given week
cohort_start = first_day_of_week(signup_week)
cohort_end = last_day_of_week(signup_week)
if n_weeks == 0:
# If this is our first week, we just take the number of users that signed up on the given period of time
return len( users[(users['signup_date'] >= cohort_start)
& (users['signup_date'] <= cohort_end)])
elif pd.to_datetime(add_weeks(cohort_end,n_weeks)) > pd.to_datetime(end_date) :
# If adding n_weeks brings us later than the end date of the table (the bottom right of the table),
# We return some easily recognizable date (not 0 as it would cause confusion)
return float("Inf")
else:
# Otherwise, we count the number of users that signed up on the given period of time,
# and whose last known log was later than the number of weeks added (rolling retention)
return len( users[(users['signup_date'] >= cohort_start)
& (users['signup_date'] <= cohort_end)
& pd.to_datetime((users['last_log']) >= pd.to_datetime(users['signup_date'].map(lambda x: add_weeks(x,n_weeks))))
])
With this we can create the actual function :
def cohort_table(users,cohort_number=6,period_number=6,cohort_span='W',end_date=None):
'''
For a given dataframe of users, return a cohort table with the following parameters :
cohort_number : the number of lines of the table
period_number : the number of columns of the table
cohort_span : the span of every period of time between the cohort (D, W, M)
end_date = the date after which we stop counting the users
'''
# the last column of the table will end today :
if end_date is None:
end_date = dt.datetime.today()
# The index of the dataframe will be a list of dates ranging
dates = pd.date_range(add_weeks(end_date,-cohort_number), periods=cohort_number, freq=cohort_span)
cohort = pd.DataFrame(columns=['Sign up'])
cohort['Sign up'] = dates
# We will compute the number of retained users, column-by-column
# (There probably is a more pythonesque way of doing it)
range_dates = range(0,period_number+1)
for p in range_dates:
# Name of the column
s_p = 'Week '+str(p)
cohort[s_p] = cohort.apply(lambda row: retained_in_interval(users,row['Sign up'],p,end_date), axis=1)
cohort = cohort.set_index('Sign up')
# absolute values to percentage by dividing by the value of week 0 :
cohort = cohort.astype('float').div(cohort['Week 0'].astype('float'),axis='index')
return cohort
Now you can call it and see the result :
cohort_table(users)
Hope it helps

Using the same format of users data from rom_j's answer, this will be cleaner/faster, but only works assuming there is at least one signup/churn per week. Not a terrible assumption on large enough data.
users = users.applymap(lambda d: d.strftime('%Y-%m-%V') if pd.notnull(d) else d)
tab = pd.crosstab(signup_date, last_log)
totals = tab.T.sum()
retention_counts = ((tab.T.cumsum().T * -1)
.replace(0, pd.NaT)
.add(totals, axis=0)
)
retention = retention_counts.div(totals, axis=0)
realined = [retention.loc[a].dropna().values for a in retention.index]
realigned_retention = pd.DataFrame(realined, index=retention.index)

converting string to float (and removing decimal point) in python

I'm trying to create a function that imports a csv file which holds records for population per year (as strings). It imports a file which has the year in the 3rd column and the population count in the 4th.
It should remove the decimal point '.' and display the resulting population.
16122.83
16223.248
should become
1612283
16223248
When I try to do this I get: in print_population_list
year, population = row[2], float(row[3]) ValueError: could not convert string to float: POP.
This is my code:
import csv
file = csv.reader(open(filename))
year, population = 0, 0
for row in file:
year, population = row[2], float(row[3])
print year,":", population,
To do this I figured it should first be converted to a float and be multiplied by the highest number of decimal places, after which all zero's at the end should be removed (since the data doesn't all have the same number of decimal places). But I'm stuck at the float conversion.

Most direct route:
>>> s = '16122.83'
>>> int(s.replace('.', ''))
1612283
While performance is probably not a big concern in your use case,
a replace strategy is about 30% faster than the split-join strategy,
based on a simple benchmark.
Benchmark Report
================
Options
-------
name | rank | runs | mean | sd | timesBaseline
--------|------|------|----------|-----------|--------------
replace | 1 | 1000 | 0.009488 | 0.0006711 | 1.0
join | 2 | 1000 | 0.01258 | 0.0007729 | 1.32589602108
Each of the above 2000 runs were run in random, non-consecutive order by
`benchmark` v0.1.5 (http://jspi.es/benchmark)
For this problem, int is what you seem to need. But for related problems, using float instead of int would keep you in the floating point realm. The round(value, places) function also might be handy.

This should do the trick
def decToNum(s):
return int(''.join(s.split('.')))
>>> s = '16122.83'
>>> decToNum(s)
1612283

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

RuntimeWarning: invalid value encountered in longlong_scalars - python

I had the same RuntimeWarning, and after looking into the data, it was because of a null-division. I did not have the time to look into your sample, but you could look around id=0, or some other records, where null-division or such could occur.

Related

Get just one value of a nested list inside a dictionary to create a Dataframe Update #1

How can I get pandas to adjust my formula based on a specific value in a dataframe?

Compare multiple cells in openpyxl

Generating a retention cohort from a pandas dataframe

converting string to float (and removing decimal point) in python

Categories

Resources