How to iterate columns with a control statement? - python

I have the following code right now:
import pandas as pd

df_area = pd.DataFrame({"area": ["Coesfeld", "Recklinghausen"], "factor": [2, 5]})
df_timeseries = pd.DataFrame({"Coesfeld": [1000, 2000, 3000, 4000],
                              "Recklinghausen": [2000, 5000, 6000, 7000]})
columns_in_timeseries = list(df_timeseries)
columns_to_iterate = columns_in_timeseries[0:]
newlist = []
for i, k in enumerate(columns_to_iterate):
    new = df_area.loc[i, "factor"] * df_timeseries[k]
    newlist.append(new)
newframe = pd.DataFrame(newlist)
df1_transposed = newframe.T
The code multiplies each factor from an area with the time series for that area, stepping through the rows of df_area and the columns of df_timeseries in lockstep. In the next step I want to expand the df_area DataFrame like the following:
df_area=pd.DataFrame({"area":["Coesfeld","Coesfeld","Recklinghausen","Recklinghausen"],"factor":[2,3,5,6]})
As you can see, I have different factors for the same area. The goal is to advance to the next column in df_timeseries only when the area in df_area changes. My first thought was to use an if statement, but right now I have no idea how to combine that with the for loop.

I can't shake off the suspicion that there is something wrong about your whole approach. A first red flag is your use of wide format instead of long format – in my experience, that's probably going to cause you unnecessary trouble.
Be that as it may, here's a function that takes a data frame with time series data and a second data frame with multiplier values and area names as arguments. The two data frames use the same structure as your examples df_timeseries (area names as columns, time series values as cell values) and df_area (area names as values in the area column, multipliers as values in the factor column). I'm pretty sure that this is not a good way to organize your data, but that's up to you to decide.
What the function does is iterate through the rows of the second data frame (the df_area-like one). It uses the area value to select the matching column from the first data frame (the df_timeseries-like one) and multiplies that series by the factor value from the same row. The results are collected in a list comprehension.
def do_magic(df1, df2):
    return [df1[area] * factor for area, factor in zip(df2.area, df2.factor)]
You can insert this directly into your code to replace your loop:
df_area = pd.DataFrame({"area": ["Coesfeld", "Recklinghausen"],
                        "factor": [2, 5]})
df_timeseries = pd.DataFrame({"Coesfeld": [1000, 2000, 3000, 4000],
                              "Recklinghausen": [2000, 5000, 6000, 7000]})
newlist = do_magic(df_timeseries, df_area)
newframe = pd.DataFrame(newlist)
df1_transposed = newframe.T
It also works with your expanded df_area. The resulting list will consist of four series (two for Coesfeld, two for Recklinghausen).
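To make that concrete, here's a minimal sketch of the expanded case, reusing do_magic from above with the data from your examples:
import pandas as pd

df_area = pd.DataFrame({"area": ["Coesfeld", "Coesfeld", "Recklinghausen", "Recklinghausen"],
                        "factor": [2, 3, 5, 6]})
df_timeseries = pd.DataFrame({"Coesfeld": [1000, 2000, 3000, 4000],
                              "Recklinghausen": [2000, 5000, 6000, 7000]})

newlist = do_magic(df_timeseries, df_area)   # four series: Coesfeld x2, x3; Recklinghausen x5, x6
df1_transposed = pd.DataFrame(newlist).T     # one column per (area, factor) pair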

Related

Pulling Numerics Out Of Column Drops Numbers Right Of Decimal?

I have an initial column in a dataframe that contains several bits of information (weight and count of items) that I am trying to pull out and do some calculations with.
When I pull out my desired numbers, everything looks fine if I print the variable I store the series in.
Below is my code for how I parse the numbers out of the initial column. I just chained a few methods and used regex to tease them out.
(Hopefully it is fairly easy to read: after some cleaning, my target weight numbers are always in the third-to-last position after the split(), and my target count numbers are always in the second-to-last position after the split.)
import numpy as np

weight = (df['Item'].str.replace('1.0gal', '128oz')
          .str.replace('YYY', '')
          .str.split().str[-3]
          .str.extract(r'(\d+)', expand=False)
          .astype(np.float64))
count = (df['Item'].str.replace('NN', '')
         .str.split().str[-2]
         .replace('XX', '1ct')
         .str.extract(r'(\d+)', expand=False)
         .astype(np.float64))
The variable weight returns a series like [32, 32, 0.44, 5.3, 64], and that is what I want to see.
HOWEVER, when I try to set these values into a new column in the dataframe, it leaves off everything to the right of the decimal place; for example, my new column shows up as [32, 32, 0, 5, 64].
This is throwing off my calculated columns as well.
However, if I do the math in a separate variable and print that out, it shows up right (decimals and all). But something about assigning it to the dataframe zeroes out my weight and screws up any calculations thereafter.
Any and all help is greatly appreciated!
Cast the series values to string; then, after you insert the values into a DataFrame column, convert the column to numeric. For example,
weight = weight.astype(str)
df['new_column'] = weight
df['new_column'] = pd.to_numeric(df['new_column'])
Check out: Change column type in pandas
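For illustration, a minimal self-contained sketch of that round trip (the numbers are made up; weight here stands in for the parsed series):
import pandas as pd

weight = pd.Series([32.0, 32.0, 0.44, 5.3, 64.0])   # hypothetical parsed weights

df = pd.DataFrame(index=weight.index)
df['new_column'] = weight.astype(str)                # store as strings first
df['new_column'] = pd.to_numeric(df['new_column'])   # then convert the column back to floats
print(df['new_column'].tolist())                     # [32.0, 32.0, 0.44, 5.3, 64.0]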

Alternative method for two-way interpolation

I wrote some code to perform interpolation based on two criteria: the amount of insurance (AOI) and the deductible amount (%). I was struggling to do the interpolation all at once, so I split up the filtering. The table hf contains the known data on which I base my interpolation results; table df contains the new data whose factors need to be interpolated from hf.
Right now my workaround is to first filter each table on the Ded_amount percentage, then perform the interpolation into an empty data frame and append after each loop.
I feel like this is inefficient and there is a better way to do this; I'm looking to hear some feedback on improvements I can make. Thanks.
Test data provided below.
import pandas as pd
from scipy import interpolate

known_data = {'AOI': [80000, 100000, 150000, 200000, 300000, 80000, 100000, 150000, 200000, 300000],
              'Ded_amount': ['2%', '2%', '2%', '2%', '2%', '3%', '3%', '3%', '3%', '3%'],
              'factor': [0.797, 0.774, 0.739, 0.733, 0.719, 0.745, 0.737, 0.715, 0.711, 0.709]}
new_data = {'AOI': [85000, 120000, 130000, 250000, 310000, 85000, 120000, 130000, 250000, 310000],
            'Ded_amount': ['2%', '2%', '2%', '2%', '2%', '3%', '3%', '3%', '3%', '3%']}
hf = pd.DataFrame(known_data)
df = pd.DataFrame(new_data)
deduct_fact = pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
    deduct_table = hf[hf['Ded_amount'] == deduct]
    aoi_table = df[df['Ded_amount'] == deduct]
    x = deduct_table['AOI']
    y = deduct_table['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    xnew = aoi_table[['AOI']]
    ynew = f(xnew)
    append_frame = aoi_table
    append_frame['Factor'] = ynew
    deduct_fact = deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and append them. Have a look at this code:
import pandas as pd
from scipy import interpolate

known_data = {'AOI': [80000, 100000, 150000, 200000, 300000, 80000, 100000, 150000, 200000, 300000],
              'Ded_amount': ['2%', '2%', '2%', '2%', '2%', '3%', '3%', '3%', '3%', '3%'],
              'factor': [0.797, 0.774, 0.739, 0.733, 0.719, 0.745, 0.737, 0.715, 0.711, 0.709]}
new_data = {'AOI': [85000, 120000, 130000, 250000, 310000, 85000, 120000, 130000, 250000, 310000],
            'Ded_amount': ['2%', '2%', '2%', '2%', '2%', '3%', '3%', '3%', '3%', '3%']}
hf = pd.DataFrame(known_data)
df = pd.DataFrame(new_data)
# Create this column now
df['Factor'] = None
# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())
for deduction_amount in deduction_amounts:
    # You can index a dataframe and call a column in one line
    x, y = hf[hf['Ded_amount'] == deduction_amount]['AOI'], hf[hf['Ded_amount'] == deduction_amount]['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    # This is the most important bit: a lambda function applied over the dataframe
    df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount'] == deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through the 'Factor' column and assigns each row a value based on conditions on the other columns.
It returns the interpolation of the row's AOI value (this is what you called xnew) if the deduction amount matches; otherwise it just returns the existing value unchanged.
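As an aside, the same per-group assignment can also be done without apply, using a boolean mask with .loc, which is usually faster on large frames. A minimal sketch, reusing the names from the code above:
# Vectorized alternative to the row-wise lambda: assign each matching group in one shot
for deduction_amount in deduction_amounts:
    known = hf[hf['Ded_amount'] == deduction_amount]
    f = interpolate.interp1d(known['AOI'], known['factor'], fill_value="extrapolate")
    mask = df['Ded_amount'] == deduction_amount
    df.loc[mask, 'Factor'] = f(df.loc[mask, 'AOI'])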

Pandas merge or concat on a list of pd.Series along columns, where Series indexes may overlap

I am producing a list of pandas Series where the index is temperature and the data is the volume of a thermodynamic solid phase present at those temperatures. Each time I run a simulation I don't know how many phases will be returned, so I append them to a list of Series to be concatenated along columns. For example, let's say I am interested in phases present from 400 to 900 degrees and want to fill a data frame whose index is all of the temperatures.
# data.values() is an object whose values' x's are temperatures and y's are amounts of some phase.
lst_phases = []
for d in data.values():
    idx = pd.Index(d.temp)
    idx.drop_duplicates(keep='first')
    # Sometimes there can be duplicate temperatures with an empty index in between each,
    # e.g. temp = [473, 478, 480, , 480, 483, ...] -- so I drop the first. I am not sure
    # what to do about the empty index or whether that is my issue.
    s = pd.Series(d.phase, index=idx, name=d.name)
    lst_phases.append(s)
result = pd.concat(lst_phases, axis=1)
returns:
ValueError: cannot reindex from a duplicate axis
I have also tried doing a merge, like so:
pd.merge(lst_phases[0], lst_phases[1], how='outer', left_index=True, right_index=True)
This returns a full outer join, so my index of temperatures contains all the temperatures in order, which is exactly what I am trying to achieve. The issue is that it's difficult to do a merge on a list of phases, especially when I don't know how many phases/pd.Series each simulation will produce.
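One thing worth noting: Index.drop_duplicates returns a new index rather than modifying it in place, so the duplicates are still present when concat tries to align, which is what triggers the duplicate-axis error. A minimal sketch of one way around it, deduplicating each Series itself before concatenating (the names and data here are made up for illustration):
import pandas as pd

# Hypothetical phase series with overlapping, partially duplicated temperature indexes
s1 = pd.Series([0.1, 0.2, 0.3, 0.3], index=[473, 478, 480, 480], name='phase_a')
s2 = pd.Series([0.5, 0.6, 0.7], index=[478, 480, 483], name='phase_b')

lst_phases = []
for s in (s1, s2):
    s = s[~s.index.duplicated(keep='first')]  # drop rows whose temperature already appeared
    lst_phases.append(s)

# Outer-joins on the union of all temperatures; NaN where a phase has no value
result = pd.concat(lst_phases, axis=1)
print(result)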

Interpolating from a pandas DataFrame or Series to a new DatetimeIndex

Let's say I have an hourly series in pandas; it's fine to assume the source is regular, but it is gappy. If I want to interpolate it to 15 min, the pandas API provides resample('15min').interpolate('cubic'), which interpolates to the new times and provides some control over the limits of interpolation. The spline helps refine the series as well as fill small gaps. To be concrete:
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01", end="2019-01-10", freq="H")
tnum = np.arange(0., len(tndx))
signal = np.cos(tnum * 2. * np.pi / 24.)
signal[80:85] = np.nan        # too wide a gap
signal[160:168:2] = np.nan    # these can be interpolated
df = pd.DataFrame({"signal": signal}, index=tndx)
df1 = df.resample('15min').interpolate('cubic', limit=9)
Now let's say I have an irregular datetime index. In the example below, the first time is a regular time point, the second is in the big gap and the last is in the interspersed brief gaps.
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00', '2019-01-04 10:17', '2019-01-07 16:00'])
How do I interpolate from the original (hourly) series to this irregular series of times?
Is the only option to build a series that includes both the original data and the destination times? How would I do this? What is the most economical way to interpolate to an independent irregular index while imposing a gap limit?
In the case of irregular timestamps, first set the datetime as the index; then you can interpolate using the index values:
df1 = df.resample('15min').interpolate('index')
You can find more information here: https://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html
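A runnable variant of that idea, assuming df and tndx2 from the question are in scope: put the destination times into the index via a union, interpolate using the index values as the abscissa, then select the targets. Note this fills every gap and does not enforce a gap limit; the answer below adds that control.
# Reindex onto the union of source and destination times, interpolate, then select
union_idx = df.index.union(tndx2)
interp = df.reindex(union_idx).interpolate(method='index').loc[tndx2]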
This is an example solution within the pandas interpolate API, which doesn't seem to have a way of using the abscissa and values from a source series to interpolate to new times provided by a destination index as a separate data structure. This method solves that by tacking the destination onto the source. It makes use of the limit argument of df.interpolate and can use any interpolation algorithm from that API, but it isn't perfect: the limit is in terms of the number of values, and if there are a lot of destination points in a patch of NaNs, those get counted as well.
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01", end="2019-01-10", freq="H")
tnum = np.arange(0., len(tndx))
signal = np.cos(tnum * 2. * np.pi / 24.)
signal[80:85] = np.nan
signal[160:168:2] = np.nan
df = pd.DataFrame({"signal": signal}, index=tndx)

# Express the destination times as a dataframe and append to the source
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00', '2019-01-04 10:17', '2019-01-07 16:00'])
df2 = pd.DataFrame({"signal": [np.nan, np.nan, np.nan]}, index=tndx2)
big_df = df.append(df2, sort=True)

# At this point there are duplicates with NaN values at the bottom of the DataFrame
# representing the destination points. If these are surrounded by lots of NaNs in the
# source frame and we want the limit argument to work in the call to interpolate,
# the frame has to be sorted and duplicates removed.
big_df = big_df.loc[~big_df.index.duplicated(keep='first')].sort_index(axis=0, level=0)

# Extract at destination locations
interpolated = big_df.interpolate(method='cubic', limit=3).loc[tndx2]
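For what it's worth, a quick look at the result (treat the exact numbers as illustrative; they depend on the cubic fit):
print(interpolated)
# The 2019-01-04 10:17 destination sits inside the five-hour NaN gap, so with limit=3
# it is likely to remain NaN, while the other two destination times get values.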

Coloring cells in pandas according to their relative value

I would like to color the cells of a (Python) pandas dataframe according to whether their value is in the top 5%, top 10%, ..., bottom 10%, bottom 5% of the data in their column.
According to this post Coloring Cells in Pandas, one can define a function and then apply it to the dataframe.
If you want to color cells when they are in a fixed range, this works fine.
But if you want to color only the top 5%, you need information about the whole column, so you cannot apply a function that evaluates each cell in isolation.
Hence my question:
Is there a smart way to color the top 5%, 10%,... of a dataframe in each column?
Try this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(20, -1))

def colorme(x):
    c = x.rank(pct=True)
    c = np.select([c <= .05, c <= .10, c >= .95, c >= .90], ['red', 'orange', 'yellow', 'green'])
    return [f'background-color: {i}' for i in c]

df.style.apply(colorme)
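A note on how this works: x.rank(pct=True) turns each column into percentile ranks, and np.select takes the first condition that matches, so the <= .05 test must come before <= .10 (and likewise .95 before .90) for the extreme bands to win. Values matching no condition get np.select's default (the string '0' here), which is not a valid color and so renders unstyled. To add more bands, extend both lists in matching order; a hypothetical example that would replace the np.select line inside colorme:
c = np.select(
    [c <= .05, c <= .10, c <= .25, c >= .95, c >= .90, c >= .75],
    ['red', 'orange', 'gold', 'green', 'lime', 'lightgreen'],
    default='',  # empty string effectively leaves the cell unstyled
)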
