Pandas - Merge one dataframe with itself only partially - python

This is a follow-up question to the following question:
Pandas Similarity Matching
The ultimate goal of the first question was to find a way to similarity-match each row against every other row that shares the same CountryId.
Here is the sample dataframe:
df = pd.DataFrame(
    [[1, 5, 'AADDEEEEIILMNORRTU'],
     [2, 5, 'AACEEEEGMMNNTT'],
     [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'],
     [4, 5, 'DEEEGINOOPRRSTY'],
     [5, 5, 'AACCDEEHHIIKMNNNNTTW'],
     [6, 5, 'ACEEHHIKMMNSSTUV'],
     [7, 5, 'ACELMNOOPPRRTU'],
     [8, 5, 'BIT'],
     [9, 5, 'APR'],
     [10, 5, 'CDEEEGHILLLNOOST'],
     [11, 5, 'ACCMNO'],
     [12, 5, 'AIK'],
     [13, 5, 'CCHHLLOORSSSTTUZ'],
     [14, 5, 'ANNOSXY'],
     [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],
    columns=['PartnerId', 'CountryId', 'Name'])
The answer in the other thread was good for that question, but I ran into computational problems. My real source contains more than 19,000 rows and will grow even bigger in the future.
The suggested answer merges the dataframe with itself to compare each row with every other row that has the same CountryId:
df = df.merge(df, on='CountryId', how='outer')
Even for the small example of 15 rows above, this produces 225 merged rows. For the whole dataset I ended up with 131,044,638 rows, which made my RAM refuse to work. Therefore I need to think of a better way to merge the two dataframes.
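You can see the quadratic blow-up directly on the sample frame above (a quick check, using the same merge):
pairs = df.merge(df, on='CountryId', how='outer')
print(pairs.shape)  # (225, 5): 15 rows sharing one CountryId give 15 * 15 pairs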
As I'm doing a similarity check, I was wondering if there is a possibility to:
Sort the dataframe based on the CountryId and the Name
Only merge each row with its +/- 3 neighbouring rows. E.g. after sorting, row 1 will only be merged with rows 2, 3 and 4 (as it is the first row), row 2 only with rows 1, 3, 4 and 5, and so on.
This way similar names end up almost next to each other, while names "further away" will not be similar anyway, so there is no need to check their similarity.

I found a workaround for my problem that pairs each row with the 3 rows before (if they exist) and the 3 rows after:
sorted_df = df.sort_values(by=['CountryId', 'Name']).reset_index(drop=True)
new_sorted = pd.Series(dtype=str)
low = -3
high = 3
for s in range(low, high + 1):
    if s == low:
        # start the string of neighbour ids; 'A' marks a missing neighbour at the edges
        new_sorted = sorted_df['PartnerId'].astype(str).shift(s, fill_value='A').rename('MatchingID')
    elif s != 0:
        new_sorted = new_sorted + '-' + sorted_df['PartnerId'].astype(str).shift(s, fill_value='A').rename('MatchingID')
match = sorted_df.merge(new_sorted, left_index=True, right_index=True)
matching_df = []
for index, row in match.iterrows():
    row_values = row.tolist()
    # split the id string and emit one row per (row, neighbour) pair,
    # skipping the 'A' placeholders at the edges
    matching_df += [row_values[0:-1] + [int(w)] for w in row_values[-1].split('-') if w != 'A']
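If it helps, the flat list can go straight back into a DataFrame for the similarity pass (column names assumed from the sample frame above):
matching_df = pd.DataFrame(matching_df, columns=['PartnerId', 'CountryId', 'Name', 'MatchingID'])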
If anyone can come up with a better idea I would be glad to hear it!

Related

How to input values x, y as array or dataframe and what is fastest way : np.where( (quote >= x ) & (quote <= y) )

This example of my code works fine:
quote = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
array = np.array([[1, 5], [7, 10], [3, 9]])
df = pd.DataFrame(np.array([[1, 5], [7, 10], [3, 9]]),columns= ('A','B'))
A B
0 1 5
1 7 10
2 3 9
list = quote[(quote >= 1 )&( quote <= 5 )]
-> array([1, 2, 3, 4, 5])
list = quote[(quote >= 7 )&( quote <= 10 )]
-> array([ 7, 8, 9, 10])
list = quote[(quote >= 3 )&( quote <= 9 )]
-> array([3, 4, 5, 6, 7, 8, 9])
Q1: How to put them as an array or dataframe?
I passed the arguments (1,5), (7,10), (3,9) respectively,
but how can I pass them as an array or dataframe?
Something like this (I know it does not work :( ):
list = quote[(quote >= df['A'] )&( quote <= df['B'] )]
so that I can apply it to my real calculation (a large array and dataframe).
Anyway, my best solution so far is:
[quote[(quote >= y['A']) & (quote <= y['B'])] for x, y in df.iterrows()]
Q2: Fastest way (vectorized?)
The code above works, but it took 3 s with a 'for loop' on just one of my real dataframes (I have 2,000 dataframes, which means 1,999 more to process).
A base case would be the same as my example code.
How can I apply vectorization instead of a 'for loop'?
Please give me some suggestions or advice.
My real quote: range 1 to 3,000,000, with increments that depend on the value.
My real data: 2,000 dataframes, each of dimension (20000, 5).
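Something like this broadcast form may be what I'm after (a sketch, assuming quote is a numpy array; the comparisons vectorize in one shot, though the ragged per-row extraction still needs a list step):
mask = (quote >= df['A'].values[:, None]) & (quote <= df['B'].values[:, None])
result = [quote[m] for m in mask]  # one array of matching quotes per (A, B) row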

How to match two dataframe columns and return matching values on a separate column in Python?

I am a newbie in Python and I need some help.
I have two data frames containing a list of users with a list of recommended friends, from two tables.
I would like to achieve the following:
Sort the list of recommended friends in ascending order in both data frames, for each user.
Match the recommended friends from dataframe2 against dataframe1 for each user, and return only the matched values.
I have tried the code below, but it didn't achieve the desired results.
import pandas as pd
import numpy as np
# load data from csv
df1 = pd.read_csv('CommonFriend.csv')
df2 = pd.read_csv('InfluenceFriend.csv')
print(df1)
print(df2)
# convert values to list to sort by recommended friends ID
df1.values.tolist()
df1.sort_values(by=['User','RecommendedFriends'])
df2.values.tolist()
df2.sort_values(by=['User','RecommendedFriends'])
# obtain only matched values from list of recommended friends from df1 and df2
df3 = df1.merge(df2, how='inner', on='User')
# return dataframe with user, matched recommendedfriends ID
print(df3)
Problems encountered:
The elements in each list are not sorted in ascending order.
When matching elements through a pandas merge with inner join, it seems that certain elements are not matched.
Update: below are the data frame headers which cause some error in the code.
This should be the solution to your problem. You might have to change a few variables, but you get the idea: you merge the two dataframes on the user, which gives a dataframe with both lists for each user. You then take the intersection of both lists and store it in a new column.
# dtype=object keeps the list cells intact (newer numpy rejects ragged arrays otherwise)
df1 = pd.DataFrame(np.array([[1, [5, 7, 10, 11]], [2, [3, 8, 5, 12]]], dtype=object),
                   columns=['User', 'Recommended friends'])
df2 = pd.DataFrame(np.array([[1, [5, 7, 9]], [2, [4, 7, 10]], [3, [15, 7, 9]]], dtype=object),
                   columns=['User', 'Recommended friends'])
df3 = pd.merge(df1, df2, on='User')
df3['intersection'] = [list(set(a).intersection(set(b)))
                       for a, b in zip(df3['Recommended friends_x'], df3['Recommended friends_y'])]
Output df3:
User Recommended friends_x Recommended friends_y intersection
0 1 [5, 7, 10, 11] [5, 7, 9] [5, 7]
1 2 [3, 8, 5, 12] [4, 7, 10] []
I do not quite understand what exactly your problem is, but in general you will have to assign the result back to the dataframe.
import pandas as pd
import numpy as np
df1 = pd.read_csv('CommonFriend.csv')
df2 = pd.read_csv('InfluenceFriend.csv')
print(df1)
print(df2)
# note: df.values.tolist() would replace the dataframe with a plain list,
# so sort the dataframes directly and assign the results back
df1 = df1.sort_values(by=['User','RecommendedFriends'])
df2 = df2.sort_values(by=['User','RecommendedFriends'])
df3 = df1.merge(df2, how='inner', on='User')
print(df3)

Using Array or Series to select from multiple columns

I have a counter column which contains an integer. Based on that integer I would like to pick one of several consecutive columns in my dataframe.
I tried using .apply(lambda x: ..., axis=1), but my solution there requires an extra if for each column I want to pick from.
df2 = pd.DataFrame(np.array([[1, 2, 3, 0 ], [4, 5, 6, 2 ], [7, 8, 9, 1]]),columns=['a', 'b', 'c','d'])
df2['e'] = df2.iloc[:, df2['d']]
This code doesn't work because iloc wants a single column per row, not three (df2['d'] = [0, 2, 1]).
What I would like is the 0th item in the first row, the 2nd item in the second row and the 1st item in the third row, so
df2['e'] = [1, 6, 8]
You are asking for something similar to fancy indexing in numpy. In pandas, it is lookup. Try this:
df2.lookup(df2.index, df2.columns[df2['d']])
Out[86]: array([1, 6, 8])
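A note beyond the original answer: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. The numpy fancy indexing the answer alludes to gives the same result:
import numpy as np
# pick, for each row, the column named by that row's 'd' value
df2['e'] = df2.to_numpy()[np.arange(len(df2)), df2['d'].to_numpy()]  # array([1, 6, 8])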

Eliminating Consecutive Numbers

If you have a range of numbers from 1-49 with 6 numbers to choose from, there are nearly 14 million combinations. Using my current script, I currently have only 7.2 million combinations remaining. Of those 7.2 million, I want to eliminate all 3, 4, 5 and 6 consecutive runs, as well as double and triple separate consecutive pairs.
Example:
3 consecutive: 1, 2, 3, x, x, x
4 consecutive: 3, 4, 5, 6, x, x
5 consecutive: 4, 5, 6, 7, 8, x
6 consecutive: 5, 6, 7, 8, 9, 10
double separate consecutive: 1, 2, 5, 6, 14, 18
triple separate consecutive: 1, 2, 9, 10, 22, 23
Note: combinations such as 1, 2, 12, 13, 14, 15 must also be eliminated, or else they conflict with the rule that double and triple consecutive combinations are to be eliminated.
I'm looking to find how many of the 7.2 million remaining combinations have zero consecutive numbers (all mixed) or only one consecutive pair.
Thank you!
import functools

_MIN_SUM = 120
_MAX_SUM = 180
_MIN_NUM = 1
_MAX_NUM = 49
_NUM_CHOICES = 6
_MIN_ODDS = 2
_MAX_ODDS = 4

@functools.lru_cache(maxsize=None)
def f(n, l, s=0, odds=0):
    # prune branches that already break the sum or odd-count limits
    if s > _MAX_SUM or odds > _MAX_ODDS:
        return 0
    if n == 0:
        return int(s >= _MIN_SUM and odds >= _MIN_ODDS)
    # pick the next number i >= l and recurse on the remaining slots
    return sum(f(n - 1, i + 1, s + i, odds + i % 2) for i in range(l, _MAX_NUM + 1))

result = f(_NUM_CHOICES, _MIN_NUM)
print('Number of choices = {}'.format(result))
While my answer should work, I think someone might be able to offer a faster solution.
Consider the following code:
not_allowed = []
for x in range(48):
    not_allowed.append([x, x + 1, x + 2])
# not_allowed = [ [0,1,2], [1,2,3], ... [11,12,13], ... [47,48,49] ]
my_numbers = [[1, 2, 5, 9, 11, 33], [1, 3, 7, 8, 9, 31], [12, 13, 14, 15, 23, 43]]
kept = []
for x in my_numbers:
    if not any(set(y) <= set(x) for y in not_allowed):  # e.g. is [1,2,3] a subset of x?
        kept.append(x)  # keep x only if it contains no consecutive triple ("drop x" otherwise)
This code will remove all instances that contain double consecutive numbers, which is all you really need to check for, because triple, quadruple, etc. all imply double consecutive. Try implementing this and let me know how it works.
The easiest approach is probably to generate and filter. I used numpy to try to vectorize as much of this as I could:
import numpy as np
from itertools import combinations
combos = np.array(list(combinations(range(1, 50), 6))) # build all combos
# combos is shape (13983816, 6)
filt = np.where(np.bincount(np.where(np.abs(
    np.subtract(combos[:, :-1], combos[:, 1:])) == 1)[0]) <= 1)[0]  # magic!
filtered = combos[filt]
# filtered is shape (12489092, 6)
Breaking down that "magic" line
First we subtract the first five items in the list from the last five items to get the differences between them. We do this for the entire set of combinations in one shot with np.subtract(combos[:, :-1], combos[:, 1:]). Note that itertools.combinations produces sorted combinations, on which this depends.
Next we take the absolute value of these differences to make sure we only look at positive distances between numbers with np.abs(...).
Next we grab the indices from this operation for the entire dataset that indicate a difference of 1 (consecutive numbers) with np.where(... == 1)[0]. Note that np.where returns a tuple whose first item holds all of the row indices and whose second item holds the corresponding column indices for our condition. This is important because any row index that shows up more than once tells us that we have more than one consecutive number in that row!
So we count how many times each row shows up in our results with np.bincount(...), which will return something like [5, 4, 4, 4, 3, 2, 1, 0] indicating how many consecutive pairs are in each row of our combinations dataset.
Finally we grab only the row numbers where there are 0 or 1 consecutive values with np.where(... <= 1)[0].
I am returning way more combinations than you seem to indicate, but I feel fairly confident that this is working. By all means, poke holes in it in the comments and I will see if I can find fixes!
Bonus, because it's all vectorized, it's super fast!
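To see the same pipeline on a toy case (a sketch; the minlength argument is an addition here, guarding against trailing rows that happen to have no consecutive pairs):
import numpy as np
from itertools import combinations
small = np.array(list(combinations(range(1, 6), 3)))  # all 10 combos of 1..5
diffs = np.abs(np.subtract(small[:, :-1], small[:, 1:]))
pair_rows = np.where(diffs == 1)[0]  # a row index appears once per consecutive pair
counts = np.bincount(pair_rows, minlength=len(small))
kept = small[counts <= 1]
# (1, 2, 3) has two consecutive pairs and is dropped; (1, 2, 4) has one and is kept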

How to sum a slice from a pandas dataframe

I'm trying to sum a portion of the sessions in my dictionary so I can get totals for the current and previous week.
I've converted the JSON into a pandas dataframe in one test. I'm summing the total of the sessions using the .sum() function in pandas. However, I also need to know the total sessions for this week and the week prior. I've tried a few methods to sum slices of values ((-1:-7) and (-8:-15)), but I'm pretty sure I need to use .iloc.
IN:
response = requests.get("url")
data = response.json()
df=pd.DataFrame(data['DailyUsage'])
total_sessions = df['Sessions'].sum()
current_week= df['Sessions'].iloc[-1:-7]
print(current_week)
total_sessions =['current_week'].sum
OUT:
Series([], Name: Sessions, dtype: int64)
AttributeError 'list' object has no attribute 'sum'
Note: I've tried this with and without pd.to_numeric and also with variations on the syntax of the slice and sum methods. Pandas doesn't feel very Pythonic and I'm out of ideas as to what to try next.
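As a side note (a diagnosis of the code above, separate from the answers below): the empty Series comes from the slice direction, since iloc[-1:-7] steps forward from the last element and selects nothing, and the AttributeError comes from ['current_week'].sum being called on a plain list literal. Positive-direction slices would give the two weekly totals directly:
current_week = df['Sessions'].iloc[-7:].sum()      # most recent 7 days
previous_week = df['Sessions'].iloc[-14:-7].sum()  # the 7 days before that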
Assuming that df['Sessions'] holds one value per day, and you are comparing the current and previous week only, you can use reshape to create weekly sums from the last 14 values.
weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))
Then, you can sum each row and get the weekly sum, most recent will be the first element.
import numpy as np
weekly_sum = np.sum(weekly_matrix, axis=1)
current_week = weekly_sum[0]
previous_week = weekly_sum[1]
EDIT: how the code works
Let's take the 1D array accessed through the values attribute of the pandas Series. It contains the last 14 days, ordered from most recent to oldest. I will call it x.
x = array([14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
The array's reshape function is then called on x to split this data into a 2D-array (matrix) with 2 rows and 7 columns.
The default behavior of reshape is to fill all columns of a row before moving to the next row. Therefore x[0] becomes element (1,1) of the reshaped array, x[1] becomes (1,2), and so on. After element (1,7) is filled with x[6] (ending the current week), the next element x[7] is placed at (2,1). This continues until the reshape finishes, with x[13] landing at (2,7).
This results in placing the first 7 elements of x (current week) in the first row, and the last 7 elements of x (previous week) in the second row. This was called weekly_matrix.
weekly_matrix = x.reshape((2, 7))
# weekly_matrix = array([[14, 13, 12, 11, 10, 9, 8],
# [ 7, 6, 5, 4, 3, 2, 1]])
Since we now have the values of each week organized in a matrix, we can use the numpy.sum function to finish our operation. numpy.sum takes an axis argument, which controls how the values are combined:
if axis=None, all elements are added in a grand total.
if axis=0, all rows in each column are added. For weekly_matrix this gives a 7-element 1D array ([21, 19, 17, 15, 13, 11, 9]), which is not the result we want, as we would be adding equivalent days of each week.
if axis=1 (as in the solution), all columns in each row are added, producing a 2-element 1D array for weekly_matrix. The order of this result follows the order of the rows in the matrix (element 0 is the total of the first row, element 1 the total of the second row). Since we know the first row is the current week and the second row is the previous week, we can extract the totals by index:
# weekly_sum = array([77, 28])
current_week = weekly_sum[0] # sum of [14, 13, 12, 11, 10, 9, 8] = 77
previous_week = weekly_sum[1] # sum of [ 7, 6, 5, 4, 3, 2, 1] = 28
To group and sum by a fixed number of values, for instance with daily data and weekly aggregation, consider groupby. You can do this forwards or backwards by slicing your series as appropriate:
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 21)})
print(df['col'].values)
# array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5])
# forwards groupby
res = df['col'].groupby(df.index // 7).sum()
# 0 30
# 1 40
# 2 35
# Name: col, dtype: int32
# backwards groupby
df['col'].iloc[::-1].reset_index(drop=True).groupby(df.index // 7).sum()
# 0 35
# 1 40
# 2 30
# Name: col, dtype: int32
