Adding random values to a column depending on other columns with pandas - python

I have a dataframe with the columns "OfferID", "SiteID" and "CategoryID", which together represent an online ad on a website. I then want to add a new column called "NPS" for the net promoter score. The values should be given randomly between 1 and 10, but rows where the OfferID, the SiteID and the CategoryID are all the same need to have the same NPS value. I thought of using a dictionary where the NPS is the key and the pairs of different IDs are the values, but I haven't found a good way to do this.
Are there any recommendations?
Thanks in advance.
Alina

The easiest approach would be to first drop the duplicate ID combinations; you can do this using:
uniques = df[['OfferID', 'SiteID', 'CategoryID']].drop_duplicates(keep="first")
Afterwards, you can assign a random score to each unique combination (note that the random values themselves are not necessarily unique):
import random
uniques['NPS'] = [random.randint(1, 10) for _ in uniques.index]
And then merge the scores back onto the original dataframe:
df = df.merge(uniques, on=['OfferID', 'SiteID', 'CategoryID'], how='left')
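If you prefer the dictionary-style mapping mentioned in the question, a minimal groupby-based sketch along these lines should give an equivalent result (the example data below is made up; only the three ID column names are taken from the question):
import numpy as np
import pandas as pd

# hypothetical example data; replace with your own dataframe
df = pd.DataFrame({
    'OfferID':    [1, 1, 2, 2],
    'SiteID':     [10, 10, 20, 30],
    'CategoryID': [5, 5, 6, 6],
})

# one integer label per unique (OfferID, SiteID, CategoryID) combination
group_ids = df.groupby(['OfferID', 'SiteID', 'CategoryID']).ngroup()

# draw one NPS value per group and map it back to every row of that group
nps_per_group = {g: np.random.randint(1, 11) for g in group_ids.unique()}
df['NPS'] = group_ids.map(nps_per_group)
Rows that share all three IDs get the same group label and therefore the same NPS value.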

Related

pandas: calculate overlapping words between rows only if values in another column match

I have a dataframe that looks like the following, but with many rows:
import pandas as pd

data = {'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_taxi'],
        'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'i would like a new taxi'],
        'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'], ['call', '6'], ['call'], ['new', 'taxi']]}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])
I have calculated the Jaccard similarity using the code below (not my solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
and modified the code given by @Amit Amola to compare the overlapping words between every possible pair of rows, and created a dataframe out of it:
from itertools import combinations
from pandas import DataFrame

overlapping_word_list = []
for val in list(combinations(range(len(data_new)), 2)):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0], 0]} and {data_new.iloc[val[1], 0]} sentences are: {lexical_overlap(data_new.iloc[val[0], 1], data_new.iloc[val[1], 1])}")

# creating an overlap dataframe
banking_overlapping_words_per_sent = DataFrame(overlapping_word_list, columns=['overlapping_list'])
Since my dataset is huge, running this code to compare every pair of rows takes forever, so I would like to compare only the sentences that share the same intent and skip comparisons between sentences with different intents. I am not sure how to do only that.
IIUC you just need to iterate over the unique values in the intent column and then use loc to grab just the rows that correspond to that intent. If an intent has more than two rows, you will still need combinations to get the unique pairings within it.
from itertools import combinations

for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent, ["Sent"]].Sent.to_list()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = combo
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")

# Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
# Overlap for (i need a cab) and (i would like a new taxi) is 40.0
# Overlap for (call me at 6) and (she called me) is 54.54545454545454
OK, so based on @gold_cy's answer I figured out how to get the desired output mentioned in the comments:
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just these columns
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = combo
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")

Replace NaN values in one column with values from a sample of a different length, with an additional condition

I am working with the Titanic data set. This set has 891 rows. At the moment I am focusing on the column 'Age'.
import pandas as pd
import numpy as np
import os

titanic_df = pd.read_csv('titanic_data.csv')
titanic_df['Age']
The column 'Age' has 177 NaN values, and I want to replace those values with values from my sample. I have already made a sample for this column, as you can see in the code below.
age_sample = titanic_df['Age'][titanic_df['Age'].notnull()].sample(177)
So the next step should be replacing the NaN values in titanic_df['Age'] with the values from age_sample. In order to do this I tried these lines of code:
titanic_df['Age'] = age_sample
titanic_df['Age'].isna() = age_sample
But obviously I made some mistakes here. Can anybody help me replace only the NaN values in the original data set (891 rows) with the values from the sample (177 rows)?
A two-line solution:
age_sample = df['Age'][df['Age'].notnull()]
df['Age'] = list(age_sample) + list(age_sample.sample(177))
If the number of NaN values is not known in advance:
nan_len = len(df['Age'][df['Age'].isna()])
age_sample = df['Age'][df['Age'].notnull()]
df['Age'] = list(age_sample) + list(age_sample.sample(nan_len))
You need to select the subframe you want to update using loc. One caveat: age_sample keeps the index of the rows it was sampled from, so assign its underlying values rather than the Series itself; otherwise pandas aligns on the index and the NaNs stay NaN:
titanic_df.loc[titanic_df['Age'].isna(), 'Age'] = age_sample.to_numpy()
I will divide my answer into two parts: the solution you are looking for, and a way to make it more robust.
Solution you are looking for
We first have to count the missing values, then draw a sample of exactly that size from the non-missing ages, and then assign it to the missing positions. This ensures that the sample matches the number of missing values.
...
age_na_size = titanic_df['Age'].isna().sum()
# generate a sample of that size
age_sample = titanic_df['Age'][titanic_df['Age'].notnull()].sample(age_na_size)
# feed that to the missing values (use the raw values so pandas does not align on the sample's index)
titanic_df.loc[titanic_df['Age'].isna(), 'Age'] = age_sample.to_numpy()
Solutions to make it more robust
Find the group mean (or median) age and replace missing values accordingly. For example, group by gender, cabin or other features that make sense and use the group median age as the replacement (a sketch of this is shown after this list).
Use k-Nearest Neighbours as the age imputer; see scikit-learn's KNNImputer.
Use bins of age instead of actual ages. That way you can first build a classifier to predict the age bin and then use it as your imputer.
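A minimal sketch of the group-based and KNN-based ideas, assuming the usual Titanic column names ('Sex', 'Pclass', 'Fare', 'SibSp', 'Parch'); adjust the grouping and feature columns to whatever makes sense for your data:
# fill missing ages with the median age of each (Sex, Pclass) group
group_median = titanic_df.groupby(['Sex', 'Pclass'])['Age'].transform('median')
titanic_df['Age'] = titanic_df['Age'].fillna(group_median)

# or, with scikit-learn's KNNImputer over a few numeric features
from sklearn.impute import KNNImputer
num_cols = ['Age', 'Fare', 'Pclass', 'SibSp', 'Parch']  # assumed numeric Titanic columns
imputed = KNNImputer(n_neighbors=5).fit_transform(titanic_df[num_cols])
titanic_df['Age'] = imputed[:, 0]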

Is there a way to allocate sorted values in a dataframe to groups based on alternating elements?

I have a Pandas DataFrame like:
   COURSE BIB#  COURSE 1  COURSE 2  STRAIGHT-GLIDING     MEAN  PRESTASJON
1            2    20.220    22.535             19.91  21.3775    1.073707
0            1    21.235    23.345             20.69  22.2900    1.077332
This is from a pilot study, and the DataFrame may be much longer when we perform the real experiment. Now that I have calculated the performance (PRESTASJON) for each BIB#, I want to allocate them into two different groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I am not sure exactly what I am looking for. I have looked at the documentation for the random module in Python, but that is not quite what I need. I have seen some questions/posts pointing to a scikit-learn stratification function, but I don't know if that is a good choice. Alternatively, is there a way to write a loop that accomplishes this? I appreciate your help.
Here is a figure to illustrate what I want to accomplish:
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want labels for your groups:
df1['group'] = np.where(df1['PRESTASJON'] > threshold, 'A', 'B')
Here, 'A' will be assigned to the 'group' column if PRESTASJON exceeds the threshold, otherwise 'B'.
UPDATE: Per OP's update on the post, if you want to assign the rows alternately to two groups:
# sort your dataframe based on the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
# create a new column with default value 'A' and assign every other row to 'B'
df1['group'] = 'A'
df1.iloc[1::2, -1] = 'B'
Are you splitting the dataframe alternately? If so, you can do:
import numpy as np

df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i, d in df1.groupby(np.arange(len(df1)) % 2):
    print(f'group {i}')
    print(d)
Another way, without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) % 2
group1 = df1.loc[mask == 0]
group2 = df1.loc[mask == 1]

Create a matrix with a set of ranges in columns and a set of ranges in rows with Pandas

I have a data frame in which one column 'F' has values from 0 to 100 and a second column 'E' has values from 0 to 500. I want to create a matrix that counts how many rows fall within each pair of ranges of 'F' and 'E'. For example, I want to know the frequency of rows with 'F' in the range 20 to 30 and 'E' in the range 400 to 500.
What I expect to have is the following matrix:
matrix of ranges
I have tried to group the ranges using pd.cut() and groupby(), but I don't know how to join the data.
I really appreciate your help in creating the matrix with pandas.
You can use the cut function to create the bin "tag/name" for each column. After that, you can pivot the data frame.
df['rows'] = pd.cut(df['F'], 5)
df['cols'] = pd.cut(df['E'], 5)
df = df.groupby(['rows', 'cols']).agg('sum').reset_index()  # your agg func here
df = df.pivot(columns='cols', index='rows')
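Since the goal is a frequency matrix, a possible shortcut (a sketch of an alternative, not part of the approach above) is pd.crosstab on the two binned columns:
# counts of rows falling into each (F-bin, E-bin) combination
freq_matrix = pd.crosstab(pd.cut(df['F'], 5), pd.cut(df['E'], 5))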
So this is the way I found to create the matrix, obviously inspired by @usher's answer. I know it is more convoluted, but I wanted to share it. Thanks again @usher.
E = df.E
F = df.F
# cast to int: pd.cut expects an integer number of bins
bins_E = pd.cut(E, bins=int((max(E) - min(E)) / 100))
bins_F = pd.cut(F, bins=int((max(F) - min(F)) / 10))
bins_EF = bins_E.to_frame().join(bins_F)
freq_EF = bins_EF.groupby(['E', 'F']).size().reset_index(name="counts")
Mat_FE = freq_EF.pivot(columns='E', index='F')

Sort DataFrame on column of intervals

My output looks like this:
            binnedb   Proba-A   Proba-B  Esperance-A  Esperance-B
0  (0.0101, 0.0202]  0.547826  0.539130     0.007817     0.007693
1  (0.0302, 0.0402]  0.547826  0.539130     0.005963     0.005854
2  (0.0201, 0.0302]  0.547826  0.539130     0.008360     0.008227
What I would like to do is sort the df in ascending order based on the binnedb column (so that the intervals themselves end up in ascending order). Please let me know if you don't understand the question. This is what I have tried so far: df.sort_values(by=['binnedb'], ascending=False)
But it does not work... thanks!
Since it is an interval-type column, you can use .left to get the left bound of each interval and sort based on it:
df['sortkey'] = df.binnedb.map(lambda x: x.left)
df = df.sort_values('sortkey')
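On pandas 1.1 or newer, the same idea can be written without a helper column by passing a key to sort_values (a sketch; it assumes every binnedb value is an Interval):
df = df.sort_values('binnedb', key=lambda s: s.map(lambda iv: iv.left))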
Interval columns produced by pd.cut are actually categorical columns whose categories follow a specific ordering. If "binnedb" is a categorical column, you can access its category codes and use argsort:
df = df.iloc[df['binnedb'].cat.codes.argsort()]
