I currently have a pandas DataFrame df:
paper reference
2171686 p84 r51
3816503 p41 r95
4994553 p112 r3
2948201 p112 r61
2957375 p32 r41
2938471 p65 r41
...
Here, each row of df records a citation relationship between paper and reference (the paper cites the reference).
I need the following numbers for my analysis:
1. The frequency of each element of paper in df
2. For a randomly selected pair of elements from paper, the number of reference elements they cite in common
For number 1, I performed the following:
df_count = df.groupby(['paper'])['paper'].count()
For number 2, I performed the operation that returns pairs of elements in paper that cite the same element in reference:
from collections import defaultdict
pair = []
d = defaultdict(list)
for idx, row in df.iterrows():
    d[row['reference']].append(row['paper'])
for ref, lst in d.items():
    for i in range(len(lst)):
        for j in range(i+1, len(lst)):
            pair.append([lst[i], lst[j], ref])
pair is a list of lists; each inner list has three elements: the first two are a pair of papers, and the third is the reference that both papers cite. Below is what pair looks like:
[['p88','p7','r11'],
['p94','p33','r11'],
['p75','p33','r43'],
['p5','p12','r79'],
...]
I would like to retrieve a DataFrame in the following format:
paper1 freq1 paper2 freq2 common
p17 4 p45 3 2
p5 2 p8 5 2
...
where paper1 and paper2 are the first two elements of each list in pair, freq1 and freq2 are the frequency counts of each paper from df_count, and common is the number of references that paper1 and paper2 cite in common.
How can I retrieve my desired dataset (in the desired format) from df, df_count, and pair?
I think this can be solved using only pandas.DataFrame.merge. I am not sure whether this is the most efficient way, though.
First, generate common reference counts:
# Merge the dataframe with itself to generate pairs
# Note that we merge only on reference, i.e. we generate each and every pair
df_pairs = df.merge(df, on=["reference"])
# Dataframe contains duplicate pairs of form (p1, p2) and (p2, p1), remove duplicates
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]
# Now group by pairs, and count the rows
# This will give you the number of common references per each paper pair
# reset_index is necessary to get each row separately
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]
Second, generate number of references per paper (you already got this):
df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]
Third, merge the two DataFrames:
# Note that we merge twice to get the count for both papers in each pair
df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")
# Get necessary columns and rename them
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]
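As a quick sanity check, here is a minimal sketch of the whole pipeline on a tiny made-up dataset (the paper/reference IDs below are invented for illustration):
import pandas as pd

# Hypothetical toy data: p1 and p2 share two references, p3 shares one with each
df = pd.DataFrame({
    "paper":     ["p1", "p1", "p2", "p2", "p3"],
    "reference": ["r1", "r2", "r1", "r2", "r1"],
})

df_pairs = df.merge(df, on=["reference"])
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]

df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]

df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]
print(df_all)
# Expected result (formatting may differ slightly):
#   paper1  freq1 paper2  freq2  common
# 0     p1      2     p2      2       2
# 1     p1      2     p3      1       1
# 2     p2      2     p3      1       1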
I am writing a program in Python for a data analytics project involving advertisement performance data matched to advertisement characteristics, aimed at identifying high-performing groups of ads that share n similar characteristics. The dataset I am using has individual ads as rows, and characteristic, summary, and performance data as columns. Below is my current code; the actual dataset I am using has 51 columns, 4 of which are excluded, so the outer loop runs over 47 choose 4 = 178,365 combinations.
Currently, this code takes ~2 hours to execute. I know that nested for loops can be the source of such a problem, but I do not know why it is taking so long to run, and am not sure how I can modify the inner/outer for loops to improve performance. Any feedback on either of these topics would be greatly appreciated.
import itertools
import pandas as pd
import numpy as np
# Identify Clusters of Rows (Ads) that have a KPI value above a certain threshold
def set_groups(df, n):
    """This function takes a dataframe and a number n, and returns a list of lists. Each list is a group of n columns.
    The list of lists will hold all size n combinations of the columns in the dataframe.
    """
    # Create a list of all relevant column names
    columns = list(df.columns[4:])  # exclude first 4 summary columns
    # Create a list of lists, where each list is a group of n columns
    groups = []
    vals_lst = list(map(list, itertools.product([True, False], repeat=n)))  # Create a list of all possible combinations of 0s and 1s
    for comb in itertools.combinations(columns, n):  # itertools.combinations returns a list of tuples
        groups.append([comb, vals_lst])
    groups = np.array(groups, dtype=object)
    return groups  # len(groups) = len(columns(df)) choose n
def identify_clusters(df, KPI, KPI_threshhold, max_size, min_size, groups):
    """
    This function takes in a dataframe, a KPI, a threshhold value, a max and min size, and a list of lists of groupings.
    The function will identify groups of rows in the dataframe that have the same values for each column in each list of groupings.
    The function will return a list of lists with each list of groups, the values list, and the ad_ids in the cluster.
    """
    # Create a list to hold the results
    output = []
    # Iterate through each list of groups
    for group in groups:
        for vals_lst in group[1]:  # for each pair of groups and associated value matrices
            # Create a temporary dataframe to hold the group of rows with matching values for columns in group
            temp_df = df
            for i in range(len(group[0])):
                temp_df = temp_df[(temp_df[group[0][i]] == vals_lst[i])]  # reduce the temp_df to only rows that match the values in vals_lst for each combination of values
            if temp_df[KPI].mean() > KPI_threshhold:  # if the mean of the KPI for the temp_df is above the threshhold
                output.append([group, vals_lst, temp_df['ad_id'].values])  # append the group, vals_lst, and ad_ids to the output list
    print(output)
    return output
## Main
df = pd.read_excel('data.xlsx', sheet_name='name')
groups = set_groups(df, 4)
print(len(groups))
identify_clusters(df, 'KPI_var', 0.0015, 6, 4, groups)
Any insight into why the code is taking such a long time to run, and/or any advice on improving the performance of this code would be extremely helpful.
I think your biggest issue is the lines:
temp_df = df
for i in range(len(group[0])):
    temp_df = temp_df[(temp_df[group[0][i]] == vals_lst[i])]
You're filtering the entire dataframe while I think you're only actually interested in the KPI and ad_id columns. You could instead create a rolling mask, something like
mask = pd.Series(True, index=df.index)
for i in range(len(group[0])):
    mask = mask & (df[group[0][i]] == vals_lst[i])
You can then access your subsets something like df[mask][KPI].mean() and df[mask]['ad_id'].values. If you do this, you will avoid copying a huge amount of data on every iteration.
I would also be tempted to simplify the code a little. For example, I believe vals_lst = list(map(list, itertools.product([True, False], repeat=n))) is the same for each group, so I would calculate it once and keep it as a standalone variable rather than add it to every group; that would also clean up the group[0], group[1] and group[0][i] references, which were a little hard to track on a first reading of the code. A sketch of that refactor is shown below.
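Here is a minimal sketch of that refactor, combining the mask idea with a single vals_lst (the simplified function signatures are my assumption, not your original API):
import itertools
import pandas as pd

def set_groups(df, n):
    """Return all size-n combinations of the candidate columns."""
    columns = list(df.columns[4:])  # exclude the first 4 summary columns
    return list(itertools.combinations(columns, n))

def identify_clusters(df, KPI, KPI_threshhold, groups, n):
    vals_lst = list(itertools.product([True, False], repeat=n))  # computed once, not per group
    output = []
    for cols in groups:
        for vals in vals_lst:
            mask = pd.Series(True, index=df.index)
            for col, val in zip(cols, vals):
                mask &= (df[col] == val)  # rolling boolean mask instead of copying rows
            if df.loc[mask, KPI].mean() > KPI_threshhold:
                output.append([cols, vals, df.loc[mask, 'ad_id'].values])
    return output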
Looking at the change from iterative filtering to tracking a mask, the mask approach always seems to perform better, but the gap increases with data size. With 10,000 rows the gaps are:
| Method     | Time               | Relative           |
|------------|--------------------|--------------------|
| Original   | 2.900383699918166  | 2.8098094911581533 |
| Using Mask | 1.03223499993328   | 1.0                |
with the following test code:
import random, timeit
import pandas as pd
random.seed(1)
iterations = 1000
data = {hex(i): [random.randint(0, 1) for i in range(10000)] for i in range(52)}
df = pd.DataFrame(data)
kpi_col = hex(1)
# test group of columns with desired values
group = (
    (hex(5), 1),
    (hex(6), 1),
    (hex(7), 1),
    (hex(8), 1)
)

def method0():
    tmp = df
    for column, value in group:
        tmp = tmp[tmp[column] == value]
    return tmp[kpi_col].mean()

def method1():
    mask = pd.Series(True, df.index)
    for column, value in group:
        mask = mask & (df[column] == value)
    return df[mask][kpi_col].mean()
assert method0() == method1()
t0 = timeit.timeit(lambda: method0(), number=iterations)
t1 = timeit.timeit(lambda: method1(), number=iterations)
tmin = min((t0, t1))
print(f'| Method | Time | Relative |')
print(f'|------------|----------------------|----------------------|')
print(f'| Original | {t0} | {t0 / tmin} |')
print(f'| Using Mask | {t1} | {t1 / tmin} |')
I've stumbled upon intricate data and I want to present it totally differently.
Currently, my dataframe has a default (numeric) index and 3 columns: sentence (which stores sentences), labels (a list containing 20 different strings) and score (again a list of length 20 that corresponds to the labels list, so that the ith element of the score list is the score of the ith element of the labels list).
The labels list is sorted by the scores: if label j has the highest score in row i, then j shows up first in that row's labels list, whichever label that happens to be; so essentially labels is sorted in descending order of score.
I want to paint a different picture: use the labels list as my new columns and as value, use the corresponding values via the scores list.
For example, if this is how my current dataframe looks:
d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'], 'labels': [['Happy', 'Sad'],['Sad', 'Happy']],'score': [['0.9','0.1'],['0.8','0.2']]}
df = pd.DataFrame(data=d)
df
I want to keep the first column which is the sentence, but then use the labels like the rest of the columns and fill it with the value of the corresponding scores.
An example output would be then:
new_format_d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'], 'Happy': ['0.9', '0.2'], 'Sad': ['0.1', '0.8']}
new_format_df = pd.DataFrame(data=new_format_d)
new_format_df
Is there an "easy" way to execute that?
I was finally able to solve it using a NumPy array hack:
First you convert the lists to np arrays:
import numpy as np

df['labels'] = df['labels'].map(lambda x: np.array(x))
df['score'] = df['score'].map(lambda x: np.array(x))
Then, you loop over the labels and add each label, one at a time, with its corresponding score, using the boolean-indexing trick below:
for label in df['labels'][0]:
    df[label] = df[['labels', 'score']].apply(lambda x: x[1][x[0] == label][0], axis=1)
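For reference, here is a shorter sketch that avoids the per-label loop, assuming pandas 1.3+ (for multi-column explode) and that each sentence appears only once:
import pandas as pd

d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'],
     'labels': [['Happy', 'Sad'], ['Sad', 'Happy']],
     'score': [['0.9', '0.1'], ['0.8', '0.2']]}
df = pd.DataFrame(data=d)

# Explode the parallel lists into long format, then pivot the labels into columns
long_df = df.explode(['labels', 'score'])
wide_df = long_df.pivot(index='sentence', columns='labels', values='score').reset_index()
wide_df.columns.name = None
print(wide_df)
# (roughly)
#                 sentence Happy  Sad
# 0   Hello, my name is...   0.9  0.1
# 1  I enjoy reading books   0.2  0.8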
My suggestion is to change your dictionary if you can. First, find the indices of 'Happy' and 'Sad' in labels:
happy_index = [internal_list.index('Happy') for internal_list in d['labels']]
sad_index = [internal_list.index('Sad') for internal_list in d['labels']]
Then add new keys name Happy and Sad to your dictionary:
d['Happy'] = [d['score'][cnt][index] for cnt, index in enumerate(happy_index)]
d['Sad'] = [d['score'][cnt][index] for cnt, index in enumerate(sad_index)]
Finally, delete your redundant keys and convert it to dataframe:
del d['labels']
del d['score']
df = pd.DataFrame(d)
sentence Happy Sad
0 Hello, my name is... 0.9 0.1
1 I enjoy reading books 0.2 0.8
I have a dataframe that looks like the following, but with many rows:
import pandas as pd
data = {'intent': ['order_food', 'order_food','order_taxi','order_call','order_call','order_taxi'],
'Sent': ['i need hamburger','she wants sushi','i need a cab','call me at 6','she called me','i would like a new taxi' ],
'key_words': [['need','hamburger'], ['want','sushi'],['need','cab'],['call','6'],['call'],['new','taxi']]}
df = pd.DataFrame (data, columns = ['intent','Sent','key_words'])
I have calculated the jaccard similarity using the code below (not my solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
and modified the code given by @Amit Amola to compare overlapping words between every possible pair of rows, and created a dataframe out of it:
from itertools import combinations

overlapping_word_list = []
for val in list(combinations(range(len(data_new)), 2)):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1],data_new.iloc[val[1],1])}")
# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])
Since my dataset is huge, this code takes forever to run when comparing all rows, so I would like to compare only the sentences that have the same intent and not compare sentences that have different intents. I am not sure how to proceed to do only that.
IIUC you just need to iterate over the unique values in the intent column and then use loc to grab just the rows that correspond to that. If you have more than two rows you will still need to use combinations to get the unique combinations between similar intents.
from itertools import combinations
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent, ["Sent"]].Sent.to_list()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = combo
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")
# Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
# Overlap for (i need a cab) and (i would like a new taxi) is 40.0
# Overlap for (call me at 6) and (she called me) is 54.54545454545454
OK, so I figured out what to do to get my desired output mentioned in the comments, based on @gold_cy's answer:
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the columns
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = combo
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
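If you would rather collect the results in a DataFrame (as in the original overlapping_word_list approach) than print them, a small sketch along the same lines, reusing the df and lexical_overlap defined above, could be:
from itertools import combinations
import pandas as pd

records = []
for intent in df.intent.unique():
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    for x, y in combinations(rows, 2):
        records.append({
            'intent': x[0],
            'sent_1': x[2],
            'sent_2': y[2],
            'overlap': lexical_overlap(x[1], y[1]),
        })

overlap_df = pd.DataFrame(records)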
I have a big dataframe (~10 million rows). Each row has:
category
start position
end position
If two rows are in the same category and the start and end position overlap with a +-5 tolerance, I want to keep just one of the rows.
For example
1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25
I want to filter out 1 or 2.
What I'm doing right now isn't very efficient,
import pandas as pd

df = pd.read_csv('data.csv', sep='\t', header=None)

dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

discard = []
rows = []
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) & (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)
I've also tried a different approach making use of a sorted version of the dataframe.
my_index = 0
indexes = []
discard = []
count = 0
curr = 0
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[[my_index]]
    cond = True
    next_index = 1
    while cond:
        second_row = df.iloc[[my_index + next_index]]
        c1 = (row.iloc[0].category == second_row.iloc[0].category)
        c2 = (abs(second_row.iloc[0].sstart - row.iloc[0].sstart) <= params['min_distance'])
        c3 = (abs(second_row.iloc[0].send - row.iloc[0].send) <= params['min_distance'])
        cond = c1 and c2 and c3
        if cond and (c2 and c3):
            indexes.append(my_index)
            cond = True
            next_index += 1
    indexes.append(my_index)
    my_index += next_index
indexes.append(total_len - 1)
The problem is that this solution is not perfect; sometimes it misses a row because the overlap can be several rows ahead, not just in the next row.
I'm looking for ideas on how to approach this problem in a more pandas-friendly way, if one exists.
The approach here should be this:
pandas.groupby by categories
agg(Func) on groupby result
the Func should implement the logic of finding the best range inside categories (sorted search, balanced trees or anything else)
Do you want to merge all similar or only 2 consecutive?
If all similar, I suggest you first order the rows by category, then by the 2 other columns, and squash similar rows into a single row.
If only 2 consecutive, then check whether the next value is in the range you set and, if yes, merge it. Here you can see how:
merge rows pandas dataframe based on condition
I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only that category's rows. The outputs will be combined back into a single dataframe, which will have a MultiIndex with the value of "category" as an outer level; this can be removed with droplevel(0).
Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...
This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
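To make that concrete, here is a minimal sketch of how the pieces could fit together (the sstart/send column names and the +-5 tolerance are assumptions taken from the question, and the exact groupby/apply index behaviour can vary slightly between pandas versions):
import pandas as pd

TOL = 5  # assumed +-5 tolerance from the question

def drop_duplicates(group):
    """Keep one row per cluster of rows whose start/end are within TOL of an already-kept row."""
    group = group.sort_values("sstart")
    kept = []
    for _, row in group.iterrows():
        if any(abs(row.sstart - k.sstart) <= TOL and abs(row.send - k.send) <= TOL for k in kept):
            continue  # close to a row we already kept, so drop it
        kept.append(row)
    return pd.DataFrame(kept)

# Apply the filter per category, then drop the added "category" index level
result = df.groupby("category", group_keys=True).apply(drop_duplicates).droplevel(0)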
I have a dictionary (in python), where the keys are animal names, and the values are sets that contain gene names. Not all the animals have all the genes.
There are about 108 genes (of which I have a list) and 15 species. There are 28 genes common to all animals.
I would like to plot the presence of a gene in an animal for every animal and gene.
For example:
d = {'dog': {'tnfa', 'tlr1'}, 'cat': {'myd88', 'tnfa', 'map2k2'}}
The plot I'd like would look something like this:
        dog  cat
tnfa     x    x
myd88         x
tlr1     x
map2k2        x
It would be nice if I could also group together the animals with the most genes, but that's optional.
Do you have any suggestions for an approach I could take?
Let's try this:
d = {'dog': {'tnfa', 'tlr1'}, 'cat': {'myd88', 'tnfa'}}
df = pd.DataFrame.from_dict(d, orient='index')
df.stack().reset_index()\
.drop('level_1',axis=1).assign(Value='x')\
.set_index([0,'level_0'])['Value']\
.unstack().rename_axis('gene')\
.rename_axis('animal', axis=1)
Output:
animal cat dog
gene
myd88 x None
tlr1 None x
tnfa x x
Using pandas crosstab will get you the matrix you are looking for
d = {'dog': ['tnfa', 'tlr1'], 'cat': ['myd88', 'tnfa']}
#data munging
df = pd.DataFrame(d).stack()
df.index = df.index.droplevel(0)
#create and format crosstab
ct = pd.crosstab(df.index, df.values)
ct.index.name = "animal"
ct.columns.name= "gene"
ct = ct.replace([0, 1], ["" , "x"])
ct = ct.T
print(ct)
Results in
animal cat dog
gene
myd88 x
tlr1 x
tnfa x x
Not really sure about the grouping - do you mean by number of genes or common genes? Probably need some more examples as well for that one.
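If the grouping just means ordering the animal columns by how many genes each animal has, a small sketch on top of the crosstab above (that interpretation is an assumption on my part) could be:
# Count the "x" marks per animal column and reorder the columns by that count, descending
gene_counts = (ct == "x").sum()
ct = ct[gene_counts.sort_values(ascending=False).index]
print(ct)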
A pure python solution:
Instead of using pandas, my solution just uses some simple for-loops and the .ljust method to print a neat table.
I am not too used to working with dictionaries in Python, but using .keys() seemed the way to go. The code loops through each animal and gets that animal's genes. Then, for each row already in the table, if the first value of that row is one of those genes, an 'x' is appended to the end of that row to mark that this animal has that gene, and the gene is removed so it doesn't create its own row later. Otherwise, if the first element of that row is not one of the animal's genes, an empty string is appended to fill that cell of the table.
Then, for each remaining gene (one that was not removed because it was already in the table), a new row is created with cells of: that gene, one empty string per animal already seen (['']*index), and finally an 'x' to show that the current animal does have that gene.
Finally, the last step is to insert a row at the beginning that simply holds the animal names from the dict.
Here's the code:
d = {'dog': {'tnfa', 'tlr1'}, 'cat': {'myd88', 'tnfa', 'map2k2'}}

table = []
cellWidth = 0
for index, animal in enumerate(d.keys()):
    cellWidth = max(cellWidth, len(animal))
    genes = d[animal]
    for row in table:
        if row[0] in genes:
            row.append('x')
            genes.remove(row[0])
        else:
            row.append('')
    for gene in genes:
        cellWidth = max(cellWidth, len(gene))
        table.append([gene] + [''] * index + ['x'])
table.insert(0, [''] + list(d.keys()))
[print(''.join([c.ljust(cellWidth + 1) for c in r])) for r in table]
and the result is what is wanted:
       cat     dog
map2k2 x
tnfa   x       x
myd88  x
tlr1           x
Update:
I have added a variable, cellWidth, which stores the length of the longest animal or gene name. The max() function is used to keep the code short. In the final print, each cell is padded to one character more than that maximum so there is some breathing room.