I have an initial column in a dataframe that contains several bits of information (weight and count of items) I am trying to pull out and do some calculations with.
When I pull out my desired numbers everything looks fine if I print out the variable I store the series in.
Below is my code for how I am parsing out my numbers from the initial column. I just stacked a few methods and used regex to tease it out.
(Hopefully it is fairly easy to read: after some cleaning, my target weight numbers are always in the 3rd-to-last position after the split(), and my target count numbers are always in the 2nd-to-last position after the split.)
weight = df['Item'].str.replace('1.0gal','128oz').str.replace('YYY','').str.split().str[-3].str.extract(r'(\d+)', expand=False).astype(np.float64)
count = df['Item'].str.replace('NN','').str.split().str[-2].replace('XX','1ct').str.extract(r'(\d+)', expand=False).astype(np.float64)
Variable 'weight' returns a series like [32, 32, 0.44, 5.3, 64] and that is what I want to see.
However, when I try to set these values into a new column in the dataframe, it leaves off everything to the right of the decimal place; for example, my new column shows up as [32, 32, 0, 5, 64].
This is throwing off my calculated columns as well.
However, if I do the math in a separate variable and print that out, it shows up right (decimals and all). Something about assigning it to the dataframe truncates my weights and throws off every calculation thereafter.
Any and all help is greatly appreciated!
Cast the series values to string; then, after you insert the values into a DataFrame column, convert the column back to numeric. For example,
weight = weight.astype(str)
df['new_column'] = weight
df['new_column'] = pd.to_numeric(df['new_column'])
Check out: Change column type in pandas
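A quick round trip of that idea, with made-up numbers standing in for the extracted weights (just a sketch; the real df and weight come from the question's parsing code):
import pandas as pd

df = pd.DataFrame({'Item': ['a', 'b', 'c']})        # hypothetical frame
weight = pd.Series([32.0, 0.44, 5.3])               # stand-in for the extracted series

weight = weight.astype(str)                         # cast to string first
df['new_column'] = weight                           # insert into the frame
df['new_column'] = pd.to_numeric(df['new_column'])  # then convert the column back

print(df['new_column'].dtype)                       # float64 -- decimals preserved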
I wrote some code to perform interpolation based on two criteria: the amount of insurance and the deductible amount (%). I was struggling to do the interpolation all at once, so I split the filtering. The table hf contains the known data which I am basing my interpolation results on; table df contains the new data which needs the developed factors interpolated based on hf.
Right now my workaround is first filtering each table based on the Ded_amount percentage, then performing the interpolation into an empty data frame and appending after each loop.
I feel like this is inefficient and that there is a better way to perform this; I am looking to hear some feedback on improvements I can make. Thanks.
Test data provided below.
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
deduct_fact=pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
    deduct_table=hf[hf['Ded_amount']==deduct]
    aoi_table=df[df['Ded_amount']==deduct]
    x=deduct_table['AOI']
    y=deduct_table['factor']
    f=interpolate.interp1d(x,y,fill_value="extrapolate")
    xnew=aoi_table[['AOI']]
    ynew=f(xnew)
    append_frame=aoi_table
    append_frame['Factor']=ynew
    deduct_fact=deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and appending them. Have a look at this code:
import pandas as pd
from scipy import interpolate

known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
# Create this column now
df['Factor'] = None
# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())
for deduction_amount in deduction_amounts:
    # You can index a dataframe and call a column in one line
    x, y = hf[hf['Ded_amount']==deduction_amount]['AOI'], hf[hf['Ded_amount']==deduction_amount]['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    # This is the most important bit: a lambda function applied across the dataframe
    df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount']==deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through the column 'Factor' and gives each row a value based on conditions on the other columns.
It returns the interpolation of the AOI column of df (this is what you called xnew) if the deduction amount matches; otherwise it returns the existing value unchanged.
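For what it's worth, the same per-group interpolation can also be written with groupby, so you don't re-apply over the whole frame on every pass. This is a variant sketch, not the code above, and interp_group is just a name I picked:
import pandas as pd
from scipy import interpolate

def interp_group(g, known):
    # g.name is the group key, i.e. the Ded_amount shared by these rows
    sub = known[known['Ded_amount'] == g.name]
    f = interpolate.interp1d(sub['AOI'], sub['factor'], fill_value="extrapolate")
    g['Factor'] = f(g['AOI'])
    return g

# Interpolate each deductible group of df against the matching rows of hf
df = df.groupby('Ded_amount', group_keys=False).apply(interp_group, known=hf)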
I am producing a list of pandas series where the index is temperature and the data is the volume of a thermodynamic solid phase present at those temperatures. Each time I run a simulation I don't know how many phases will be returned, so I append them to a list of pandas series to be concatenated along columns. For example, let's say I am interested in phases present from 400 to 900 degrees and want to fill a data frame whose index is all the temperatures.
# data.values() is an object whose x's are temperatures and y's are amounts of some phase.
lst_phases = []
for d in data.values():
    idx = pd.Index(d.temp)
    idx.drop_duplicates(keep='first')
    # Sometimes there can be duplicate temperatures with an empty entry in between,
    # e.g. temp = [473, 478, 480, , 480, 483, ...] -- so I drop the first.
    # I am not sure what to do about the empty entry or whether that is my issue.
    s = pd.Series(d.phase, index=idx, name=d.name)
    lst_phases.append(s)
result = pd.concat(lst_phases, axis=1)
returns:
ValueError: cannot reindex from a duplicate axis
I have also tried to do a merge like so:
pd.merge(lst_phases[0], lst_phases[1], how='outer', left_index=True, right_index=True)
This returns a full outer join, so my index of temperatures contains all the temperatures in order, which is exactly what I am trying to achieve. The issue is that it's difficult to do a merge on a list of phases, especially when I don't know how many phases/pd.Series I will have for each simulation.
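For an unknown number of series, one option is to fold that same outer merge over the list with functools.reduce. A sketch with two stand-in series (the real ones come from the loop above):
import functools
import pandas as pd

# Hypothetical phase series, each indexed by temperature
s1 = pd.Series([1.0, 2.0], index=[400, 500], name='phase_a')
s2 = pd.Series([3.0, 4.0], index=[500, 600], name='phase_b')
lst_phases = [s1, s2]

# Fold a full outer join over however many series the simulation produced
result = functools.reduce(
    lambda left, right: pd.merge(left, right, how='outer',
                                 left_index=True, right_index=True),
    lst_phases)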
Let's say I have an hourly series in pandas; it is fine to assume the source is regular, but it is gappy. If I want to interpolate it to 15 min, the pandas API provides resample('15min').interpolate('cubic'). It interpolates to the new times and provides some control over the limits of interpolation. The spline helps to refine the series as well as fill small gaps. To be concrete:
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01",end="2019-01-10",freq="H")
tnum = np.arange(0.,len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan # too wide a gap
signal[160:168:2] = np.nan # these can be interpolated
df = pd.DataFrame({"signal":signal},index=tndx)
df1 = df.resample('15min').interpolate('cubic', limit=9)
Now let's say I have an irregular datetime index. In the example below, the first time is a regular time point, the second is in the big gap and the last is in the interspersed brief gaps.
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00','2019-01-04 10:17','2019-01-07 16:00'])
How do I interpolate from the original (hourly) series to this irregular series of times?
Is the only option to build a series that includes the original data and the destination data? How would I do this? What is the most economical way to achieve the goals of interpolating to an independent irregular index and imposing a gap limit?
In the case of irregular timestamps, first set the datetime as the index, and then you can use the 'index' interpolation method: df1 = df.resample('15min').interpolate('index')
You can find more information here: https://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html
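A minimal illustration of what the 'index' method buys you on an uneven DatetimeIndex (a sketch with made-up values, not the original data):
import numpy as np
import pandas as pd

# Unevenly spaced times with a NaN in the middle
s = pd.Series([1.0, np.nan, 3.0],
              index=pd.to_datetime(['2019-01-01 00:00',
                                    '2019-01-01 01:00',
                                    '2019-01-01 03:00']))
# method='index' (or 'time') weights by the actual time gaps, so the NaN at
# 01:00 becomes 1.0 + (1/3)*(3.0 - 1.0) ~= 1.667, rather than the positional
# midpoint 2.0 that method='linear' would give.
print(s.interpolate(method='index'))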
This is an example solution within the pandas interpolate API, which doesn't seem to have a way of using abscissa and values from the source series to interpolate to new times provided by a destination index held in a separate data structure. This method works around that by tacking the destination onto the source. It makes use of the limit argument of df.interpolate and it can use any interpolation algorithm from that API, but it isn't perfect: the limit is in terms of the number of values, and if there are a lot of destination points in a patch of NaNs, those get counted as well.
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01",end="2019-01-10",freq="H")
tnum = np.arange(0.,len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan
signal[160:168:2] = np.nan
df = pd.DataFrame({"signal":signal},index=tndx)
# Express the destination times as a DataFrame and concatenate them onto the source
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00','2019-01-04 10:17','2019-01-07 16:00'])
df2 = pd.DataFrame( {"signal": [np.nan,np.nan,np.nan]} , index = tndx2)
big_df = pd.concat([df, df2], sort=True)  # df.append(df2, sort=True) on older pandas
# At this point there are duplicates with NaN values at the bottom of the DataFrame
# representing the destination points. If these are surrounded by lots of NaNs in the source frame
# and we want the limit argument to work in the call to interpolate, the frame has to be sorted and duplicates removed.
big_df = big_df.loc[~big_df.index.duplicated(keep='first')].sort_index(axis=0,level=0)
# Extract at destination locations
interpolated = big_df.interpolate(method='cubic',limit=3).loc[tndx2]
I would like to color the cells of a (Python) pandas dataframe according to whether their value is in the top 5%, top 10%, ..., bottom 10%, bottom 5% of the data in that column.
According to this post Coloring Cells in Pandas, one can define a function and then apply it to the dataframe.
If you want to color cells when they fall in a fixed range, this works fine.
But if you want to color only the top 5%, you need information about the whole column, so you cannot apply a function that only evaluates a single cell at a time.
Hence my question:
Is there a smart way to color the top 5%, 10%,... of a dataframe in each column?
Try this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(20,-1))

def colorme(x):
    # Percentile rank of each value within its column
    c = x.rank(pct=True)
    # np.select takes the first matching condition; default='' leaves the rest unstyled
    c = np.select([c<=.05, c<=.10, c>=.95, c>=.90], ['red','orange','yellow','green'], default='')
    return [f'background-color: {i}' if i else '' for i in c]

df.style.apply(colorme)
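If you only want to style some of the columns, Styler.apply also takes a subset argument. A small usage sketch (assumes a pandas version recent enough to have Styler.to_html):
# Style only the first two columns; Styler.apply works column-wise by default (axis=0)
styled = df.style.apply(colorme, subset=[0, 1])
styled.to_html()  # render to an HTML string, or just display `styled` in a notebook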