I'm trying to merge two dataframes with conditions and one-to-many relationships.
The dataframes do not have an overlapping column, as I am working with coordinate data.
df1 = pd.DataFrame({'Label': ['A', 'B', 'C', 'D'],
'x_low': [101940675,101947985,101941345,101948789],
'x_high': [101940777,101948555,101941577,101949111],
'y_low': [427429081, 427429000, 427429596, 427429466],
'y_high': [427429089, 427429001, 427429599, 427429467]})
df2 = pd.DataFrame({'Image': ['1', '2', '3', '4', '5'],
'X': [101948445, 101948467, 101948764, 101947896, 101941234],
'Y': [427429082, 427429001, 427429597, 427429467, 427430045]})
df1
Label x_low x_high y_low y_high
0 A 101940675 101940777 427429081 427429089
1 B 101947985 101948555 427429000 427429001
2 C 101941345 101941577 427429596 427429599
3 D 101948789 101949111 427429466 427429467
df2
Image X Y
0 1 101948445 427429082
1 2 101948467 427429001
2 3 101948764 427429597
3 4 101947896 427429467
4 5 101941234 427430045
I want to merge the datasets with the condition that the X and Y coordinates from df2 have to be between the corresponding _low and _high values of df1. I would also like to know whether multiple coordinate pairs form one-to-many relationships.
The desired output would look something like this:
Label x_low x_high y_low y_high Image X Y
0 B 101947985 101948555 427429000 427429002 2 101948467 427429001
1 B 101947985 101948555 427429000 427429002 3 101948477 427429008
2 D 101948789 101949111 427429466 427429467 5 101941234 427430045
I tried merging the two whole dataframes, but they have 3,000 and 30,000 rows, so I don't have the memory available. I've tried multiple other methods, but nothing seems to be working.
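Not a definitive answer, but here is a memory-friendlier sketch, assuming the frames above and that boolean masks of shape len(df2) x len(df1) fit in memory (roughly 90 MB each at 30,000 x 3,000; df2 can be processed in chunks if that is still too much). It matches with NumPy broadcasting instead of a full cross-merge, and any Label that appears more than once in the result is a one-to-many match:
import numpy as np
import pandas as pd

# Boolean masks: rows correspond to df2 points, columns to df1 ranges.
x_ok = (df2['X'].to_numpy()[:, None] >= df1['x_low'].to_numpy()) & \
       (df2['X'].to_numpy()[:, None] <= df1['x_high'].to_numpy())
y_ok = (df2['Y'].to_numpy()[:, None] >= df1['y_low'].to_numpy()) & \
       (df2['Y'].to_numpy()[:, None] <= df1['y_high'].to_numpy())

# Row indices of every (df2 point, df1 range) pair satisfying both conditions.
i2, i1 = np.nonzero(x_ok & y_ok)

# One output row per matching pair; a repeated Label here is a one-to-many match.
matches = df1.iloc[i1].reset_index(drop=True).join(df2.iloc[i2].reset_index(drop=True))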
Please advise how to get the following output:
df1 = pd.DataFrame([['1, 2', '2, 2','3, 2','1, 1', '2, 1','3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
import numpy as np

# Shift to 1-based row and column labels so the "row, col" addresses line up.
df22 = df2.rename(index=lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), inplace=False, axis=1)
# Look up each "row, col" address in df22.
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print(df)
Output:
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 holds 'addresses' into df2 in row, col format ('1, 2' is first row, second column, which is 2; '2, 2' is 4; '3, 2' is 6, etc.)
I need to combine these with the values from the 3rd and 4th columns to get something like (2*100x, 4*200y, 6*300x, 1*100x, 3*200y, 5*300x).
The output should be 5000 (the sum of the x's and y's) and 0.28 (1400/5000, the share of the y's).
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
Output:
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear from your question why df1 is a DataFrame, or whether it can have more than one row, so I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
    i, j = map(int, c.split(','))
    df22.loc[i, 'val'] += df22.loc[i, j] * df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.
I have a dataframe like this:
  r_id c_id
0    x    1
1    y    1
2    z    2
3    u    3
4    v    3
5    w    4
6    x    4
which you can reproduce like this:
import pandas as pd
r1 = ['x', 'y', 'z', 'u', 'v', 'w', 'x']
r2 = ['1', '1', '2', '3', '3', '4', '4']
df = pd.DataFrame([r1,r2]).T
df.columns = ['r_id', 'c_id']
Where a row has a duplicate r_id, I want to relabel all cases of that c_id with the first c_id value that was given for the duplicate r_id.
(Edit: maybe this is somewhat subtle, but I therefore want to relabel 'w's c_id as '1', as well as that belonging to the second case of 'x'. The duplication of 'x' shows me that all instances where c_id == '1' and c_id == '4' should have the same label.)
For a small dataframe, this works:
from collections import defaultdict
import networkx as nx
g = nx.from_pandas_edgelist(df, 'r_id', 'c_id')
subgraphs = [g.subgraph(c) for c in nx.connected_components(g)]
translator = {n: sorted(list(g.nodes))[0] for g in subgraphs for n in g.nodes if n in df.c_id.values}
df['simplified'] = df.c_id.apply(lambda x: translator[x])
so that I get this:
  r_id c_id simplified
0    x    1          1
1    y    1          1
2    z    2          2
3    u    3          3
4    v    3          3
5    w    4          1
6    x    4          1
But I'm trying to do this for a table with 2.5 million rows and my computer is struggling... There must be a more efficient way to do something like this.
Okay: if I optimize my initial answer by using the memory id() as a unique label for a connected set (or rather a subgraph, since I'm using networkx to find these), and don't check any condition while generating the dictionary but simply use .get() so that it passes gracefully over values that have no key, then this seems to work:
def simplify(original_df):
    df = original_df.copy()
    # Link every r_id to its c_id and split the graph into connected components.
    g = nx.from_pandas_edgelist(df, 'r_id', 'c_id')
    subgraphs = [g.subgraph(c) for c in nx.connected_components(g)]
    # The memory id() of each subgraph serves as a cheap, unique label for its component.
    translator = {n: id(g) for g in subgraphs for n in g.nodes}
    # .get() lets values without a key pass through unchanged.
    df['simplified'] = df.c_id.apply(lambda x: translator.get(x, x))
    return df
It manages to do what I want for 2,840,759 rows in 14.49 seconds on my laptop, which will do fine.
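For reference, a quick usage sketch on the small example frame from the question (the labels come from id(), so they are arbitrary but consistent within a single run):
out = simplify(df)
# 'w' and the second 'x' (both c_id '4') now carry the same 'simplified' label
# as the rows with c_id '1', while c_id '2' and '3' keep their own labels.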
I want to convert a dataframe which has tuples in cells into a dataframe with MultiIndex.
Here is an example of the table code:
d = {2:[(0,2),(0,4)], 3:[(826.0, 826.0),(4132.0, 4132.0)], 4:[(6019.0, 6019.0),(12037.0, 12037.0)], 6:[(18337.0, 18605.0),(36674.0, 37209.0)]}
test = pd.DataFrame(d)
This is what the dataframe looks like:
2 3 4 6
0 (0, 2) (826.0, 826.0) (6019.0, 6019.0) (18337.0, 18605.0)
1 (0, 4) (4132.0, 4132.0) (12037.0, 12037.0) (36674.0, 37209.0)
This is what I want it to look like:
2 3 4 6
0 A 0 826.0 6019.0 18337.0
B 2 826.0 6019.0 18605.0
1 A 0 4132.0 12037.0 36674.0
B 4 4132.0 12037.0 37209.0
Thanks for your help!
I'm unsure about the efficiency, because this relies on the apply method, but you could concat the dataframe with itself, adding an 'A' label to the first copy and a 'B' label to the second. Then you sort the resulting dataframe by its index and use apply to change even rows to the first value of the tuple and odd rows to the second:
df = pd.concat([test.assign(X='A'), test.assign(X='B')]).set_index(
'X', append=True).sort_index().rename_axis(index=(None, None))
df.iloc[0:len(df):2] = df.iloc[0:len(df):2].apply(lambda x: x.apply(lambda y: y[0]))
df.iloc[1:len(df):2] = df.iloc[1:len(df):2].apply(lambda x: x.apply(lambda y: y[1]))
It gives as expected:
2 3 4 6
0 A 0 826 6019 18337
B 2 826 6019 18605
1 A 0 4132 12037 36674
B 4 4132 12037 37209
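If it helps, here is an alternative sketch that avoids the row-wise apply, assuming every cell holds a 2-tuple: build one frame per tuple position and let concat create the extra index level.
import pandas as pd

df = (pd.concat({'A': test.applymap(lambda t: t[0]),   # first element of each tuple
                 'B': test.applymap(lambda t: t[1])})  # second element of each tuple
        .swaplevel()
        .sort_index())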
I have a pandas DataFrame:
import pandas as pd
e = [{'E1': 'A', 'E2': 'B', 'E3': 'C', 'EDAY1': 0, 'EDAY2': 1, 'EDAY3': 2}, {'E1': 'B', 'E2': '0', 'E3': '0', 'EDAY1': 2, 'EDAY2': -1, 'EDAY3': -1}, {'E1': 'F', 'E2': 'A', 'E3': 'D', 'EDAY1': 5, 'EDAY2': 5, 'EDAY3': 2}]
df = pd.DataFrame(e)
display(df)
Output:
E1 E2 E3 EDAY1 EDAY2 EDAY3
0 A B C 0 1 2
1 B 0 0 2 -1 -1
2 F A D 5 5 2
Where E1 through E3 are events, and EDAY1 through EDAY3 are the days that the corresponding events occurred on. Note that:
If no event occurred, it is labelled as '0' and the corresponding EDAY is set to -1
Event E1 has greater precedence than E2 and E2 than E3
Event precedence does not correspond to EDAY (see the last row)
Some events occurred on the same day
I would like to turn these events into 10-character strings based on the following criteria:
Each character position on the string roughly corresponds to the day that the event occurred
Days where there were no events will be represented by the character '0'
Events that occurred on the same day will be sorted by level of precedence and set immediately adjacent to one another (I understand that this is not a perfect representation, but it will do for now)
Therefore given the example above, I would like to have the following representation:
E1 E2 E3 EDAY1 EDAY2 EDAY3 E_STR
0 A B C 0 1 2 ABC0000000
1 B 0 0 2 -1 -1 00B0000000
2 F A D 5 5 2 00D00FA000
Please note that this is not homework but I am a Python and Pandas newbie, and this has me stumped.
Just to share my approach for this question: I use wide_to_long to flatten your original dataframe, then exclude the -1 rows, and zip all the values into a list of lists. It's not a great structure, but no worries, we just need it to create the pairs of values and positions (in my understanding, EDAY is the position of the character in E).
newdf = pd.wide_to_long(df.reset_index(), ['E', 'EDAY'], i='index', j='drop').loc[lambda x: x.EDAY != -1]
newdf.EDAY += newdf.groupby(['index', 'EDAY']).cumcount()  # bump the position when two events land on the same day
newdf = newdf.groupby(level=0).agg(list)
After the reshape above, we use a for loop to create the strings you need:
l = []
for x, y in zip(newdf.E, newdf.EDAY):
    xvar = list('0000000000')
    for idx, z in enumerate(y):
        xvar[z] = x[idx]
    l.append(''.join(xvar))
l
Out[111]: ['ABC0000000', '00B0000000', '00D00FA000']
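From there, to reproduce the E_STR column in the desired output, the list can simply be assigned back (the rows of newdf come out in the same order as df here, so the alignment holds):
df['E_STR'] = l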
I am trying to merge two pandas tables where I find all rows in df2 which have coordinates close to each row in df1. Example follows.
df1:
x y val
0 0 1 A
1 1 3 B
2 2 9 C
df2:
x y val
0 1.2 2.8 a
1 0.9 3.1 b
2 2.0 9.5 c
desired result:
x y val_x val_y
0 0 1 A NaN
1 1 3 B a
2 1 3 B b
3 2 9 C c
Each row in df1 can have 0, 1, or many corresponding entries in df2, and a match should be found using the (squared) Euclidean distance:
(x1 - x2)^2 + (y1 - y2)^2 < 1
The real input dataframes have different sizes, unlike in this example. I can get close by iterating over the rows in df1 and finding the close values in df2, but I am not sure what to do from there:
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    # ?? What now?
Any help would be very much appreciated. I made this example with an IPython notebook, which you can view here: http://nbviewer.ipython.org/gist/anonymous/49a3d821420c04169f02
I found an answer, though I am not really happy with having to loop over the rows in df1. In this case there are only a few hundred rows, so I can deal with it, but it won't scale as well as some other approach would. Solution:
df2_list = []
df1['merge_row'] = df1.index.values  # a column to merge on, taken from the index
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0].copy()
    df2_subset['merge_row'] = i  # tag the matches with the df1 row they belong to
    df2_list.append(df2_subset)
df2_found = pd.concat(df2_list)
result = pd.merge(df1, df2_found, on='merge_row', how='left')
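A possibly more scalable sketch, assuming SciPy is available: let a KD-tree find the neighbours instead of looping over df1, then reuse the same merge_row trick (note that query_ball_point uses distance <= r rather than the strict < 1 above):
from scipy.spatial import cKDTree
import pandas as pd

# For each df1 point, collect the indices of all df2 points within distance 1.
tree = cKDTree(df2[['x', 'y']].to_numpy())
hits = tree.query_ball_point(df1[['x', 'y']].to_numpy(), r=1.0)

# One (df1 index, df2 index) pair per neighbour found.
pairs = [(i1, i2) for i1, found in enumerate(hits) for i2 in found]
df2_found = df2.iloc[[i2 for _, i2 in pairs]].copy()
df2_found['merge_row'] = [i1 for i1, _ in pairs]

df1 = df1.copy()
df1['merge_row'] = df1.index.values
result = pd.merge(df1, df2_found, on='merge_row', how='left')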