I want something like this:
df.groupby("A")["B"].diff()
But instead of diff(), I want be able to compute if the two rows are different or identical, and return 1 if the current row is different from the previous, and 0 if it is identical.
Moreover, I really would like to use a custom function instead of diff(), so that I can do general pairwise row operations.
I tried using .rolling(2) and .apply() at different places, but I just can not get it to work.
Edit:
Each row in the dataset is a packet.
The first row in the dataset is the first recorded packet, and the last row is the last recorded packet, i.e., they are ordered by time.
One of the features(columns) is called "ID", and several packets have the same ID.
Another column is called "data", its values are 64 bit binary values (strings), i.e., 001011010011001.....10010 (length 64).
I want to create two new features(columns):
Compare the "data" field of the current packet with the data field of the previous packet with the Same ID, and compute:
If they are different (1 or 0)
How different (a figure between 0 and 1)
Hi I think it is best if you forgo using the grouby and shift instead:
equal_index = (df == df.shift(1))[X].all(axis=1)
where X is a list of columns you want to be identic. Then you can create your own grouper by
my_grouper = (~equal_index).cumsum()
and use it together with agg to aggregate with whatever function you wish
df.groupby(my_grouper).agg({'B':f})
Use DataFrameGroupBy.shift with compare for not equal by Series.ne:
df["dc"] = df.groupby("ID")["data"].shift().ne(df['data']).astype(int)
EDIT: for correlation between 2 Series use:
df["dc"] = df['data'].corr(df.groupby("ID")["data"].shift())
Ok, I solved it myself with
def create_dc(df: pd.DataFrame):
dc = df.groupby("ID")["data"].apply(lambda x: x != x.shift(1)).astype(int)
dc.fillna(1, inplace=True)
df["dc"] = dc
this does what I want.
Thank you #Arnau for inspiring me to use .shift()!
I am playing with the famous Titanic dataset.
I have the train.csv and test.csv loaded into an array of size two called combined. and concatenated the Ticket column of each like this:
all_tickets = combined[0]['Ticket'].append(combined[1]['Ticket'])
Now, my goal is to apply the same categorical codes to each DataFrame. However, when I tried this for the first one:
combined[0]['TicketCode'] = all_tickets.astype('category').cat.codes
It complained: cannot reindex from a duplicate axis
It makes sense since both sets are from different sizes. How can I achieve my goal in this situation? Grouping the Tickets with a range enumerator?
Thanks
At the end, I just built a dictionary and applied to both DataFrames:
ticket_codes = {}
i = 0
grouped_series = combined[0]['Ticket'].append(combined[1]['Ticket'])
for g, _ in grouped_series.groupby(grouped_series):
ticket_codes[g] = i
i = i + 1
for x in positions:
combined[x]['TicketCode'] = combined[x]['Ticket'].apply(lambda y: ticket_codes[y])
I have a data frame in which one column 'F' has values from 0 to 100 and a second column 'E' has values from 0 to 500. I want to create a matrix in which frequencies fall within ranges in both 'F' and 'E'. For example, I want to know the frequency in range 20 to 30 for 'F' and range 400 to 500 for 'E'.
What I expect to have is the following matrix:
matrix of ranges
I have tried to group ranges using pd.cut() and groupby() but I don't know how to join data.
I really appreciate your help in creating the matrix with pandas.
you can use the cut function to create the bin "tag/name" for each column.
after you cat pivot the data frame.
df['rows'] = pd.cut(df['F'], 5)
df['cols'] = pd.cut(df['E'], 5)
df = df.groupby(['rows', 'cols']).agg('sum').reset_index([0,1], False) # your agg func here
df = df.pivot(columns='cols', index='rows')
So this is the way I found to create the matrix, that was obviously inspired by #usher's answer. I know it's more convoluted but wanted to share it. Thanks again #usher
E=df.E
F=df.F
bins_E=pd.cut(E, bins=(max(E)-min(E))/100)
bins_F=pd.cut(F, bins=(max(F)-min(F))/10)
bins_EF=bins_E.to_frame().join(bins_F)
freq_EF=bins_EF.groupby(['E', 'F']).size().reset_index(name="counts")
Mat_FE = freq_EF.pivot(columns='E', index='F')
I wrote some code to perform interpolation based on two criteria, the amount of insurance and the deductible amount %. I was struggling to do the interpolation all at once, so had split the filtering.The table hf contains the known data which I am using to base my interpolation results on.Table df contains the new data which needs the developed factors interpolated based on hf.
Right now my work around is first filtering each table based on the ded_amount percentage and then performing the interpolation into an empty data frame and appending after each loop.
I feel like this is inefficient, and there is a better way to perform this, looking to hear some feedback on some improvements I can make. Thanks
Test data provided below.
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
deduct_fact=pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
deduct_table=hf[hf['Ded_amount']==deduct]
aoi_table=df[df['Ded_amount']==deduct]
x=deduct_table['AOI']
y=deduct_table['factor']
f=interpolate.interp1d(x,y,fill_value="extrapolate")
xnew=aoi_table[['AOI']]
ynew=f(xnew)
append_frame=aoi_table
append_frame['Factor']=ynew
deduct_fact=deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and appending them. have a look at this code:
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
# Create this column now
df['Factor'] = None
# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())
for deduction_amount in deduction_amounts:
# You can index a dataframe and call a column in one line
x, y = hf[hf['Ded_amount']==deduction_amount]['AOI'], hf[hf['Ded_amount']==deduction_amount]['factor']
f = interpolate.interp1d(x, y, fill_value="extrapolate")
# This is the most important bit. Lambda function on the dataframe
df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount']==deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through the column 'Factor' and gives it a value based on conditions on the other columns.
It returns the interpolation of the AOI column of df (this is what you called xnew) if the deduction amount matches, otherwise it just returns the same thing back.
I am using a rather large dataset of ~37 million data points that are hierarchically indexed into three categories country, productcode, year. The country variable (which is the countryname) is rather messy data consisting of items such as: 'Austral' which represents 'Australia'. I have built a simple guess_country() that matches letters to words, and returns a best guess and confidence interval from a known list of country_names. Given the length of the data and the nature of hierarchy it is very inefficient to use .map() to the Series: country. [The guess_country function takes ~2ms / request]
My question is: Is there a more efficient .map() which takes the Series and performs map on only unique values? (Given there are a LOT of repeated countrynames)
There isn't, but if you want to only apply to unique values, just do that yourself. Get mySeries.unique(), then use your function to pre-calculate the mapped alternatives for those unique values and create a dictionary with the resulting mappings. Then use pandas map with the dictionary. This should be about as fast as you can expect.
On Solution is to make use of the Hierarchical Indexing in DataFrame!
data = data.set_index(keys=['COUNTRY', 'PRODUCTCODE', 'YEAR'])
data.index.levels[0] = pd.Index(data.index.levels[0].map(lambda x: guess_country(x, country_names)[0]))
This works well ... by replacing the data.index.levels[0] -> when COUNTRY is level 0 in the index, replacement then which propagates through the data model.
Call guess_country() on unique country names, and make a country_map Series object with the original name as the index, converted name as the value. Then you can use country_map[df.country] to do the conversion.
import pandas as pd
c = ["abc","abc","ade","ade","ccc","bdc","bxy","ccc","ccx","ccb","ccx"]
v = range(len(c))
df = pd.DataFrame({"country":c, "data":v})
def guess_country(c):
return c[0]
uc = df.country.unique()
country_map = pd.Series(list(map(guess_country, uc)), index=uc)
df["country_id"] = country_map[df.country].values
print(df)