Relation between dataframes - python

I have the following dataframes:
df:
points
0 a,b,c
1 a,e
2 d,f,c
and df1:
point relation
a [b,f,e]
b []
c [e]
and I want to keep, in each row of the first dataframe, only the points that have a relation between them (according to the second dataframe).
What I've tried:
I tried converting the second dataframe to a dictionary and searching for the points in the first dataframe, but had no luck because it was searching the whole dataframe.
I also tried combining the two dataframes into one containing only the common elements, but again had no luck.
Any suggestions? Thank you in advance.
Edit
Desired output:
  points
0    a,b
1    a,e
2     []
code:
my_dict = df1.set_index('point')['relation'].to_dict()  # create dict
my_dict = {k: v for k, v in my_dict.items() if v}  # drop points with an empty []
for key, values in my_dict.items():
    if df['points'].str.contains(key).any():  # the column is 'points'; contains takes a single pattern
        ...
but printing intermediate results showed that the code was not working.
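A minimal sketch of one way to get the desired output, assuming the sample frames above: a point is kept only if some other point in the same row is related to it (in either direction), and the literal '[]' from the desired output marks rows where nothing survives:

```python
import pandas as pd

df = pd.DataFrame({'points': ['a,b,c', 'a,e', 'd,f,c']})
df1 = pd.DataFrame({'point': ['a', 'b', 'c'],
                    'relation': [['b', 'f', 'e'], [], ['e']]})

rel = df1.set_index('point')['relation'].to_dict()

def related_subset(points):
    pts = points.split(',')
    # keep p if it is related to some other point q in the same row
    keep = [p for p in pts
            if any(q in rel.get(p, []) or p in rel.get(q, [])
                   for q in pts if q != p)]
    return ','.join(keep) if keep else '[]'

df['points'] = df['points'].apply(related_subset)
```

With the sample data this yields a,b / a,e / [], matching the desired output.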


How to find connections between each group (python, pandas)

I am having trouble finding connections between groups based on the associated data (groupby maybe?) in order to create a network.
Two groups are connected if they share an element.
For example, my data frame looks like this:
group_number data
1 a
2 a
2 b
2 c
2 a
3 c
4 a
4 c
So the output would be:
Source_group Target_group Frequency
2            1            1    (because a-a)
3            2            1    (because c-c)
4            2            2    (because a-a, c-c)
Of course, the "(because ...)" explanations will not be in the output; they are just here to explain.
Thank you very much
I thought about your problem. You could do something like the following:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'group_number': [1, 2, 2, 2, 2, 3, 4, 4],
                   'data': ['a', 'a', 'b', 'c', 'a', 'c', 'a', 'c']})

# group the data using a multiindex and convert it to a dictionary
d = defaultdict(dict)
for (group_number, data), group in df.groupby(['group_number', 'data']):
    d[group_number][data] = group.data.size

# iterate the groups twice to compare every group with every other group
relationships = []
for key, val in d.items():
    for k, v in d.items():
        if key > k:  # compare each unordered pair once
            # at this point you are basically taking the
            # intersection of two plain Python sets
            frequency = len(set(val).intersection(v))
            if frequency:  # keep only connected groups
                relationships.append({'Source_group': key,
                                      'Target_group': k,
                                      'Frequency': frequency})

# convert the result to a pandas DataFrame for further analysis
result = pd.DataFrame(relationships)
I am sure this could be done without converting to a list of dictionaries first; I find this solution, however, to be more straightforward.
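For comparison, a pandas-only sketch of the same idea using a self-merge on data instead of the double loop (duplicate rows are dropped first, so a repeated element within a group counts once, matching the set intersection above):

```python
import pandas as pd

df = pd.DataFrame({'group_number': [1, 2, 2, 2, 2, 3, 4, 4],
                   'data': ['a', 'a', 'b', 'c', 'a', 'c', 'a', 'c']})

# self-merge on data: every pair of groups sharing an element meets here
pairs = df.drop_duplicates().merge(df.drop_duplicates(), on='data')
# keep each unordered pair once, with the larger group number as source
pairs = pairs[pairs['group_number_x'] > pairs['group_number_y']]
# count shared elements per pair of groups
out = (pairs.groupby(['group_number_x', 'group_number_y'])
            .size()
            .reset_index(name='Frequency')
            .rename(columns={'group_number_x': 'Source_group',
                             'group_number_y': 'Target_group'}))
```

Note that this also surfaces pairs such as 4-1 and 4-3, which share an element but were omitted from the example output in the question.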

Pandas setting entries for selected values in column A in another column B

I am looking for an elegant way to make the following query:
# Given
original_df = pd.DataFrame({'A':[1,3,5],'B':[2,4,6]})
A_values_where = [1,3]
B_values_setTo = [10,11]
# Wished output
target_df = pd.DataFrame({'A':[1,3,5],'B':[10,11,6]})
This should be self-explanatory, but to be precise: wherever a value from 'A_values_where' is found in column 'A', set column 'B' in the same row to the corresponding value in 'B_values_setTo'. Most importantly, the 'B' values of unmatched rows shall not be touched.
Use Series.map with a dictionary created from the two lists; this matches by value, so it is always correct (it also works when only a subset of the values match):
d = dict(zip(A_values_where, B_values_setTo))
original_df['B'] = original_df['A'].map(d).fillna(original_df['B'])
print(original_df)
   A     B
0  1  10.0
1  3  11.0
2  5   6.0
If the order always matches, this alternative is possible, but it fails on general data, so the first solution is preferred:
original_df.loc[original_df['A'].isin(A_values_where), 'B'] = B_values_setTo
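One caveat with the map/fillna solution: unmatched rows become NaN during map, so 'B' is upcast to float (hence the 10.0/11.0 in the output above). If the integer dtype matters, a small sketch of casting back:

```python
import pandas as pd

original_df = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})
A_values_where = [1, 3]
B_values_setTo = [10, 11]

d = dict(zip(A_values_where, B_values_setTo))
# map + fillna upcasts 'B' to float (NaN for unmatched rows);
# cast back to the original integer dtype afterwards
original_df['B'] = (original_df['A'].map(d)
                    .fillna(original_df['B'])
                    .astype(original_df['B'].dtype))
```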

Python pandas dataframe and list merging

I currently have a pandas DataFrame df:
paper reference
2171686 p84 r51
3816503 p41 r95
4994553 p112 r3
2948201 p112 r61
2957375 p32 r41
2938471 p65 r41
...
Here, each row of df shows the relationship of citation between paper and reference (where paper cites reference).
I need the following numbers for my analysis:
1. The frequency of each element of paper in df
2. For any two elements of paper, the number of references they cite in common
For number 1, I performed the following:
df_count = df.groupby(['paper'])['paper'].count()
For number 2, I performed the operation that returns pairs of elements in paper that cite the same element in reference:
from collections import defaultdict

pair = []
d = defaultdict(list)
# key each reference by the papers that cite it
for idx, row in df.iterrows():
    d[row['reference']].append(row['paper'])
for ref, lst in d.items():
    for i in range(len(lst)):
        for j in range(i + 1, len(lst)):
            pair.append([lst[i], lst[j], ref])
pair is a list whose items each consist of three elements: the first two are a pair of papers, and the third is a reference that both papers cite. Below is what pair looks like:
[['p88','p7','r11'],
['p94','p33','r11'],
['p75','p33','r43'],
['p5','p12','r79'],
...]
I would like to retrieve a DataFrame in the following format:
paper1 freq1 paper2 freq2 common
p17 4 p45 3 2
p5 2 p8 5 2
...
where paper1 and paper2 represent the first two elements of each list in pair, freq1 and freq2 represent the frequency count of each paper from df_count, and common is the number of references both paper1 and paper2 cite in common.
How can I retrieve my desired dataset (in the desired format) from df, df_count, and pair?
I think this can be solved using only pandas.DataFrame.merge. I am not sure whether this is the most efficient way, though.
First, generate common reference counts:
# Merge the dataframe with itself to generate pairs
# Note that we merge only on reference, i.e. we generate each and every pair
df_pairs = df.merge(df, on=["reference"])
# Dataframe contains duplicate pairs of form (p1, p2) and (p2, p1), remove duplicates
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]
# Now group by pairs, and count the rows
# This will give you the number of common references per each paper pair
# reset_index is necessary to get each row separately
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]
Second, generate the number of references per paper (you already have this):
df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]
Third, merge the two DataFrames:
# Note that we merge twice to get the count for both papers in each pair
df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")
# Get necessary columns and rename them
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]
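Putting the three steps together on the sample rows from the question (a sketch; with this subset the only pair sharing a reference is p32/p65 via r41):

```python
import pandas as pd

df = pd.DataFrame({'paper': ['p84', 'p41', 'p112', 'p112', 'p32', 'p65'],
                   'reference': ['r51', 'r95', 'r3', 'r61', 'r41', 'r41']})

# pairs of papers sharing a reference, counted per pair
df_pairs = df.merge(df, on=['reference'])
df_pairs = df_pairs[df_pairs['paper_x'] < df_pairs['paper_y']]
df_pairs = df_pairs.groupby(['paper_x', 'paper_y']).count().reset_index()
df_pairs.columns = ['paper1', 'paper2', 'common']

# number of references per paper
df_refs = df.groupby(['paper']).count().reset_index()
df_refs.columns = ['paper', 'freq']

# attach the per-paper frequencies to both sides of each pair
df_all = (df_pairs
          .merge(df_refs, how='left', left_on='paper1', right_on='paper')
          .merge(df_refs, how='left', left_on='paper2', right_on='paper'))
df_all = df_all[['paper1', 'freq_x', 'paper2', 'freq_y', 'common']]
df_all.columns = ['paper1', 'freq1', 'paper2', 'freq2', 'common']
```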

Tuple key Dictionary to Table/Graph 3-dimensional

I have a dictionary like this:
dict = {(100,22,123):'55%',(110,24,123):'58%'}
Where, for example, the elements of the tuple are (x,y,z) and the value is the error rate of something. I want to print that dictionary, but I am not sure how to do it, or which format would make the information easiest to read (maybe: x - y - z - Rate).
I found this: Converting Dictionary to Dataframe with tuple as key, but I think it does not fit what I want, and I cannot understand it.
Thank you
You can use Series with reset_index; afterwards just set the new column names:
import pandas as pd
d = {(100,22,123):'55%',(110,24,123):'58%'}
df = pd.Series(d).reset_index()
df.columns = ['x', 'y', 'z', 'rate']
print(df)
     x   y    z rate
0  100  22  123  55%
1  110  24  123  58%
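Alternatively, if you prefer to name the columns up front, you can build the rows directly from the dictionary items (a sketch; the column names x/y/z/rate simply follow the format suggested in the question):

```python
import pandas as pd

d = {(100, 22, 123): '55%', (110, 24, 123): '58%'}
# unpack each (x, y, z) key next to its value, naming columns up front
df = pd.DataFrame([(x, y, z, rate) for (x, y, z), rate in d.items()],
                  columns=['x', 'y', 'z', 'rate'])
```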

python pandas: how to group by and count with a condition for every value in a column?

I have table like this:
d group
1 a
2 b
3 a
4 c
5 f
and I'd like to iterate over the values of d and count the number of rows that have group == 'a'.
Here is what I am doing now, but it does not work:
for index, row in df.iterrows():
    for x in (1, 5):
        if row['d'] > x:
            row['tp'] = df.groupby('group').agg(lambda x: x.manual_type == 'a')
Can anybody help?
Try:
df['group'].value_counts()['a']
In general, you should never use for loops in pandas: they are inefficient and usually recreate functionality that already exists in the package.
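A minimal sketch on the sample table, with an equivalent boolean-mask form:

```python
import pandas as pd

df = pd.DataFrame({'d': [1, 2, 3, 4, 5],
                   'group': ['a', 'b', 'a', 'c', 'f']})

# count rows per group, then select the count for group 'a'
count_a = df['group'].value_counts()['a']
# equivalent: sum the boolean mask directly
count_a_mask = (df['group'] == 'a').sum()
```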
