Setting DataFrame columns from current columns' data - python

I've stumbled upon some intricately structured data and I want to present it quite differently.
Currently, my dataframe has a default (numeric) index and 3 columns: sentence (which stores sentences), labels (a list of 20 different strings) and score (again a list, of length 20) that corresponds to the labels list, i.e. the ith element in the score list is the score of the ith element in the labels list.
The labels list is ordered by the score list: whichever label has the highest score in row i appears first in that row's labels list, the next-highest second, and so on.
I want to paint a different picture: use the labels as my new columns and fill them with the corresponding values from the score list.
For example, if this is how my current dataframe looks:
import pandas as pd

d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'],
     'labels': [['Happy', 'Sad'], ['Sad', 'Happy']],
     'score': [['0.9', '0.1'], ['0.8', '0.2']]}
df = pd.DataFrame(data=d)
df
I want to keep the first column, which is the sentence, but then use the labels as the rest of the columns and fill them with the values of the corresponding scores.
An example output would be then:
new_format_d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'], 'Happy': ['0.9', '0.2'], 'Sad': ['0.1', '0.8']}
new_format_df = pd.DataFrame(data=new_format_d)
new_format_df
Is there an "easy" way to execute that?

I was finally able to solve it using a NumPy array hack.
First, convert the list columns to NumPy arrays:
import numpy as np

df['labels'] = df['labels'].map(np.array)
df['score'] = df['score'].map(np.array)
Then loop over the labels and add each label, one at a time, with its corresponding score, using the boolean mask shown below:
for label in df['labels'][0]:
    df[label] = df[['labels', 'score']].apply(lambda x: x[1][x[0] == label][0], axis=1)
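For completeness, a more direct pandas alternative (a sketch of my own, not the asker's solution): explode the two parallel list columns, then pivot the labels back out as columns. It assumes the original df built from the example dictionary above (before the NumPy conversion), pandas >= 1.3 for multi-column explode, and that each sentence appears only once so the pivot is unambiguous.
# one row per (sentence, label, score) triple, then labels become columns
exploded = df.explode(['labels', 'score'])
new_format_df = exploded.pivot(index='sentence', columns='labels', values='score').reset_index()
new_format_df.columns.name = None  # drop the leftover 'labels' axis name
print(new_format_df)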

My suggestion is to change your dictionary if you can. First, find the indices of 'Happy' and 'Sad' in the labels lists:
happy_index = [internal_list.index('Happy') for internal_list in d['labels']]
sad_index = [internal_list.index('Sad') for internal_list in d['labels']]
Then add new keys named 'Happy' and 'Sad' to your dictionary:
d['Happy'] = [d['score'][cnt][index] for cnt, index in enumerate(happy_index)]
d['Sad'] = [d['score'][cnt][index] for cnt, index in enumerate(sad_index)]
Finally, delete the redundant keys and convert the dictionary to a dataframe:
del d['labels']
del d['score']
df = pd.DataFrame(d)
sentence Happy Sad
0 Hello, my name is... 0.9 0.1
1 I enjoy reading books 0.2 0.8

pandas: calculate overlapping words between rows only if values in another column match

I have a dataframe that looks like the following, but with many rows:
import pandas as pd

data = {'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_taxi'],
        'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'i would like a new taxi'],
        'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'], ['call', '6'], ['call'], ['new', 'taxi']]}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])
I calculated the Jaccard similarity using the code below (not my own solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
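Note that the function above returns the raw intersection set, while the numbers printed in the answer further down look like a Jaccard-style percentage. A variant that would produce figures of that kind (an assumption on my part, not the asker's original code) is:
def jaccard_overlap_pct(doc1, doc2):
    # percentage of shared elements relative to the union of both sets
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return 100 * len(intersection) / len(union)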
and modified the code given by @Amit Amola to compare overlapping words between every possible pair of rows, and created a dataframe out of it:
from itertools import combinations

overlapping_word_list = []
for val in list(combinations(range(len(df)), 2)):
    overlapping_word_list.append(f"the shared keywords between {df.iloc[val[0], 0]} and {df.iloc[val[1], 0]} sentences are: {lexical_overlap(df.iloc[val[0], 1], df.iloc[val[1], 1])}")
# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])
Since my dataset is huge, running this code to compare all rows takes forever, so I would like to compare only the sentences that have the same intent and not compare sentences with different intents. I am not sure how to do only that.
IIUC you just need to iterate over the unique values in the intent column and then use loc to grab just the rows that correspond to that. If you have more than two rows you will still need to use combinations to get the unique combinations between similar intents.
from itertools import combinations

for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent, ["Sent"]].Sent.to_list()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = combo
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")

# Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
# Overlap for (i need a cab) and (i would like a new taxi) is 40.0
# Overlap for (call me at 6) and (she called me) is 54.54545454545454
OK, so based on @gold_cy's answer, I figured out how to get the desired output I mentioned in the comments:
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the columns
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = combo
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")

Getting to know which value corresponds to a particular column value

I wish to find the value in the index column that corresponds to a given input value of the key in a dataframe. Below is the code I am using to build it:
data_who = pd.DataFrame({'index': data['index'],
                         'Publisher_Key': data['Key']})
If I give an input key value of, say, 100, I would like to get the corresponding index value, which is Goat. What should I change in my code?
PS: There are too many labels in the data after performing label encoding, so I want to know which category each label value corresponds to.
If index is a column, then you can do as follows:
data.loc[data['Key'] == 100, 'index'].iloc[0]
>>> 'Goat'
Or other option:
data[data['Key'] == 100]['index'].iloc[0]
>>> 'Goat'
If index is the index of the dataframe, replace ['index'] with .index.
As a side note: you shouldn't name a column index in pandas; the index is a core pandas concept itself, and naming a column that way can be misleading.
I'd suggest three ways of doing this:
Using pandas:
data_who.loc[data_who['Publisher_Key'] == 100, 'index'].values[0]
>>> 'Goat'
Using python dictionaries:
who_dict = dict(zip(data_who['Publisher_Key'], data_who['index']))
who_dict[100]
>>> 'Goat'
Finally, if you were using LabelEncoder from sklearn, it can inverse-transform encoded values back to the original categories:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(animals)  # fit on the original list of categories (e.g. animal names)
le.inverse_transform([100])  # maps the encoded label 100 back to its category

Adding random values in column depending on other columns with pandas

I have a dataframe with the columns "OfferID", "SiteID" and "CategoryID", which together represent an online ad on a website. I want to add a new column called "NPS" for the net promoter score. The values should be assigned randomly between 1 and 10, but rows where the OfferID, the SiteID and the CategoryID are the same need to have the same NPS value. I thought of using a dictionary where the NPS is the key and the ID combinations are the values, but I haven't found a good way to do this.
Are there any recommendations?
Thanks in advance.
Alina
The easiest approach would be to first remove all duplicates; you can do this using:
uniques = df[['OfferID', 'SiteID', 'CategoryID']].drop_duplicates(keep="first")
Afterwards, you can do something like this (note that the random values are not necessarily unique):
import random

uniques['NPS'] = [random.randint(1, 10) for _ in uniques.index]
And then:
df = df.merge(uniques, on=['OfferID', 'SiteID', 'CategoryID'], how='left')
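An alternative sketch that skips the merge step: number the unique (OfferID, SiteID, CategoryID) combinations with groupby().ngroup() and index into an array of random scores. The column names and the 1-10 range are taken from the question; this variant is my own assumption, not part of the answer above.
import numpy as np

group_ids = df.groupby(['OfferID', 'SiteID', 'CategoryID']).ngroup()
rng = np.random.default_rng()
nps_per_group = rng.integers(1, 11, size=group_ids.nunique())  # scores 1..10 inclusive
df['NPS'] = nps_per_group[group_ids.to_numpy()]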

Pandas Dataframe grouping / combining columns?

I'm new to Pandas, and I'm having a horrible time figuring out datasets.
I have a csv file I've read in using pandas.read_csv as dogData. The column names are dog breeds, the first row [0] gives the size of each dog, and beyond that there's a bunch of numerical values. The very first column holds a string description that I need to keep, but it isn't relevant to the question. The last column for each size category contains a separate "Average" value. (Note that pandas renamed the duplicate "Average" columns to "Average.1", "Average.2" and so on, since column names must be unique.)
Basically, I want to "group" by the first row, so all "Small" dog values are averaged except the "Small" average column, and so on.
The existing "Average" columns should not be included in the new averages being calculated, and they don't need to be altered at all. All "Small" breed values should be averaged, all "Medium" breed values should be averaged, and so on (the actual file is much larger than the sample shown here).
There's no guarantee the breeds won't change, and no guarantee the sizes will stay the same or always be present ("Small" could be left out, for example).
EDIT: After Joe Ferndz's comment, I've updated my code and have something slightly closer to working, but actually adding up the columns is still giving me trouble:
dogData = pd.read_csv("dogdata.csv", header=[0, 1])
dogData.columns = dogData.columns.map("_".join)
totalVal = ""
count = 0
for col in dogData:
    if "Unnamed" in col:
        continue  # skip the starting columns
    if "Average" not in col:
        totalVal += dogData[col]
        count += 1
    else:
        # this is where I'd calculate the average, then reset count and totalVal;
        # since the addition isn't working yet, I haven't figured that part out
        break
print(totalVal)
Now, this code is technically getting the correct values... but it won't let me add them numerically (hence why totalVal is a string right now). It gives me a string of concatenated numbers, the correct concatenated numbers, but won't let me convert them to floats to actually add them.
I've tried doing float(dogData[col]) in the totalVal addition line, but it gives me TypeError: cannot convert the series to <class 'float'>.
I've tried keeping it as a string, putting "," between the numbers, then doing totalVal.split(",") to separate them, then converting and adding... but obviously that doesn't work either, because AttributeError: 'Series' object has no attribute 'split'.
These errors make sense to me and I understand why they happen, but I don't know the correct way to do this. dogData[col] gives me all the values for every row at once, which is what I want, but I don't know how to store that and add to it in the next iteration of the loop.
Here's a copy/pastable sample of data:
,Corgi,Yorkie,Pug,Average,Average,Dalmation,German Shepherd,Average,Great Dane,Average
,Small,Small,Small,Small,Medium,Large,Large,Large,Very Large,Very Large
Words,1,3,3,3,2.4,3,5,7,7,7
Words1,2,2,4,4,2.2,4,4,6,8,8
Words2,2,1,5,3,2.5,5,3,8,9,6
Words3,1,4,4,2,2.7,6,6,5,6,9
You have to do a few tricks to get this to work.
Step 1: Read the csv file using the first two rows as the header. This creates a MultiIndex column list.
Step 2: Join the two header levels together with, say, an _.
Step 3: Rename the specific columns as per your requirement, e.g. S-Average, M-Average, ....
Step 4: Find out which columns are a dog name plus Small.
Step 5: Compute the value for Small. Per your requirement, that is sum(columns ending with Small) / count(columns ending with Small).
Steps 6-7: Do the same for Large.
Steps 8-9: Do the same for Very Large.
This will give you the final list. If you want the columns in a specific order, you can change the order.
Step 10: Change the column order of the dataframe.
import pandas as pd

df = pd.read_csv('abc.txt', header=[0, 1], index_col=0)
df.columns = df.columns.map('_'.join)
df.rename(columns={'Average_Small': 'S-Average',
                   'Average_Medium': 'M-Average',
                   'Average_Large': 'L-Average',
                   'Average_Very Large': 'Very L-Average'}, inplace=True)

idx = [i for i, x in enumerate(df.columns) if x.endswith('_Small')]
if idx:
    df['Small'] = ((df.iloc[:, idx].sum(axis=1)) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)

idx = [i for i, x in enumerate(df.columns) if x.endswith('_Large')]
if idx:
    df['Large'] = ((df.iloc[:, idx].sum(axis=1)) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)

idx = [i for i, x in enumerate(df.columns) if x.endswith('_Very Large')]
if idx:
    df['Very_Large'] = ((df.iloc[:, idx].sum(axis=1)) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)

df = df[['Small', 'S-Average', 'M-Average', 'L-Average', 'Very L-Average', 'Large', 'Very_Large']]
print(df)
The output of this will be:
Small S-Average M-Average ... Very L-Average Large Very_Large
Words 2.33 3 2.4 ... 7 4.0 7.0
Words1 2.67 4 2.2 ... 8 4.0 8.0
Words2 2.67 3 2.5 ... 6 4.0 9.0
Words3 3.00 2 2.7 ... 9 6.0 6.0
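For comparison, here is a shorter sketch that leans on the two-row header directly instead of joining and renaming the column names (my own variant, not the answer above; it assumes the same abc.txt sample, keeps the pre-computed Average columns untouched, and averages the remaining breed columns per size):
import pandas as pd

df = pd.read_csv('abc.txt', header=[0, 1], index_col=0)

# Split the pre-computed Average columns off from the breed columns
is_avg = df.columns.get_level_values(0).str.startswith('Average')
averages = df.loc[:, is_avg].copy()
averages.columns = [f"{size}-Average" for _, size in averages.columns]
breeds = df.loc[:, ~is_avg]

# Average the breed columns within each size group (the second header row)
size_means = breeds.T.groupby(level=1).mean().T.round(2)

result = pd.concat([size_means, averages], axis=1)
print(result)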

Python pandas dataframe and list merging

I currently have a pandas DataFrame df:
paper reference
2171686 p84 r51
3816503 p41 r95
4994553 p112 r3
2948201 p112 r61
2957375 p32 r41
2938471 p65 r41
...
Here, each row of df shows the relationship of citation between paper and reference (where paper cites reference).
I need the following numbers for my analysis:
Frequency of elements of paper in df
When two elements from paper are randomly selected, the number of references they cite in common
For number 1, I performed the following:
df_count = df.groupby(['paper'])['paper'].count()
For number 2, I performed the operation that returns pairs of elements in paper that cite the same element in reference:
from collections import defaultdict

pair = []
d = defaultdict(list)
for idx, row in df.iterrows():
    d[row['reference']].append(row['paper'])
for ref, lst in d.items():
    for i in range(len(lst)):
        for j in range(i + 1, len(lst)):
            pair.append([lst[i], lst[j], ref])
pair is a list that consists of three elements: first two elements are the pair of paper, and the third element is from reference that both paper elements cite. Below is what pair looks like:
[['p88','p7','r11'],
['p94','p33','r11'],
['p75','p33','r43'],
['p5','p12','r79'],
...]
I would like to retrieve a DataFrame in the following format:
paper1 freq1 paper2 freq2 common
p17 4 p45 3 2
p5 2 p8 5 2
...
where paper1 and paper2 represent the first two elements of each list in pair, freq1 and freq2 represent the frequency count of each paper computed by df_count, and common is the number of references both paper1 and paper2 cite in common.
How can I retrieve my desired dataset (in the desired format) from df, df_count, and pair?
I think this can be solved using only pandas.DataFrame.merge. I am not sure whether this is the most efficient way, though.
First, generate common reference counts:
# Merge the dataframe with itself to generate pairs
# Note that we merge only on reference, i.e. we generate each and every pair
df_pairs = df.merge(df, on=["reference"])
# Dataframe contains duplicate pairs of form (p1, p2) and (p2, p1), remove duplicates
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]
# Now group by pairs, and count the rows
# This will give you the number of common references per each paper pair
# reset_index is necessary to get each row separately
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]
Second, generate number of references per paper (you already got this):
df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]
Third, merge the two DataFrames:
# Note that we merge twice to get the count for both papers in each pair
df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")
# Get necessary columns and rename them
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]
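As a quick sanity check, here is a sketch that runs the three steps above on the excerpt from the question (in that excerpt only p32 and p65 share a reference, r41, so the result should contain a single pair):
import pandas as pd

df = pd.DataFrame({
    'paper': ['p84', 'p41', 'p112', 'p112', 'p32', 'p65'],
    'reference': ['r51', 'r95', 'r3', 'r61', 'r41', 'r41'],
})

df_pairs = df.merge(df, on=["reference"])
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]

df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]

df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]
print(df_all)
# -> one row: paper1=p32, freq1=1, paper2=p65, freq2=1, common=1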
