Join two dataframes on columns meeting a condition - python

Say I have two dataframes A and B, each containing two columns called x and y.
I want to join these two dataframes, but not on rows where the x and y columns are equal across the two frames; rather, on rows where A's x column is a substring of B's x column, and likewise for y.
For example, if A['x'][1] == 'mpla' and B['x'][1] == 'mplampla', I would want that pair to be captured.
In SQL it would be something like:
select *
from A
join B
on B.x like '%' || A.x || '%' and B.y like '%' || A.y || '%'
Can something like this be done in Python?

You can match a single string at a time against all the strings in one column, like this:
import numpy.core.defchararray as ca
ca.find(B.x.values.astype(str), 'mpla') >= 0
The problem with that is you'll have to loop over all elements of A. But if you can afford that, it should work.
See also: pandas + dataframe - select by partial string
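A minimal sketch of that loop, assuming small frames (the sample data and the result-building are my own additions):
import pandas as pd
import numpy.core.defchararray as ca
A = pd.DataFrame({'x': ['mpla'], 'y': ['foo']})
B = pd.DataFrame({'x': ['mplampla', 'other'], 'y': ['foobar', 'foo']})
matches = []
for _, arow in A.iterrows():
    # rows of B where A's x is a substring of B's x AND A's y is a substring of B's y
    mask = (ca.find(B['x'].values.astype(str), arow['x']) >= 0) & \
           (ca.find(B['y'].values.astype(str), arow['y']) >= 0)
    for _, brow in B[mask].iterrows():
        matches.append({'A_x': arow['x'], 'A_y': arow['y'], 'B_x': brow['x'], 'B_y': brow['y']})
result = pd.DataFrame(matches)
print(result)  # one row per (A row, B row) pair that satisfies the substring join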

You could try something like:
pattern = A.x[0]  # str.contains/str.match take a single string or regex pattern, not a whole Series
B.x.where(B.x.str.contains(pattern), B.index)  # non-matching rows get replaced by their index value
B.x.where(B.x.str.match(pattern), B.index)  # str.match anchors at the start of the string; you can use "^" in a regex with str.contains for the same effect
You could also maybe try:
np.where(B.x.str.contains(pattern), B.index, np.nan)
Also you can try:
matchingmask = B[B.x.str.contains(pattern)]
matchingframe = B.loc[matchingmask.index]  # the matching rows
matchingcolumn = B.loc[matchingmask.index].x  # just the x column
matchingindex = B.loc[matchingmask.index].index  # just the index
All of these assume you have the same index on both frames (I think).
You want to look at the string methods: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
And read up on regex and the pandas where method: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-where-method-and-masking
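If you want an actual join of the two frames rather than row-by-row masking, here is a sketch using a constant helper key for a cross join and then filtering on the substring condition (the helper key, sample data, and suffixes are my own choices):
import pandas as pd
A = pd.DataFrame({'x': ['mpla'], 'y': ['ab']})
B = pd.DataFrame({'x': ['mplampla', 'zzz'], 'y': ['abc', 'ab']})
# cross join via a constant key, then keep the pairs where A's value is a substring of B's
cross = A.assign(_k=1).merge(B.assign(_k=1), on='_k', suffixes=('_A', '_B')).drop(columns='_k')
joined = cross[cross.apply(lambda r: r['x_A'] in r['x_B'] and r['y_A'] in r['y_B'], axis=1)]
print(joined)  # the SQL-style substring join from the question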

Related

How Do I Count The Number of Times a Subset of Words Appear In My Pandas Dataframe?

I have a bunch of keywords stored in a 620x2 pandas dataframe seen below. I think I need to treat each entry as its own set, where semicolons separate elements. So, we end up with 1240 sets. Then I'd like to be able to search how many times keywords of my choosing appear together. For example, I'd like to figure out how many times 'computation theory' and 'critical infrastructure' appear together as a subset in these sets, in any order. Is there any straightforward way I can do this?
Use .loc to find if the keywords appear together.
Do this after you have split the data into 1240 sets. I don't understand whether you want to make new columns or just want to keep the columns as is.
# create a filter for keyword 1
filter_keyword_1 = (df['column_name'].str.contains('critical infrastructure'))
# create a filter for keyword 2
filter_keyword_2 = (df['column_name'].str.contains('computation theory'))
# you can create more filters with the same construction as above.
# To check the number of times both the keywords appear
len(df.loc[filter_keyword_1 & filter_keyword_2])
# To see the dataframe
subset_df = df.loc[filter_keyword_1 & filter_keyword_2]
.loc selects the rows matching the boolean condition. You can use subset_df = df[df['column_name'].str.contains('string')] if you have only one condition.
Do the column split or any other processing before you make the filters, or rerun the filters after processing.
Not sure if this is considered straightforward, but it works. keyword_list is the list of paired keywords you want to search.
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)
df.apply(lambda x: x.apply(lambda y: all(kw in y for kw in keyword_list))).sum().sum()
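A toy run of that, with a made-up two-row frame and a made-up keyword_list (the question's column names are kept):
import pandas as pd
df = pd.DataFrame({'Author Keywords': ['computation theory; critical infrastructure', 'networks'],
                   'Index Keywords': ['graphs; computation theory', None]})
keyword_list = ['computation theory', 'critical infrastructure']
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)
# counts the cells (across both columns) whose set contains every keyword in keyword_list
count = df.apply(lambda col: col.apply(lambda s: all(kw in s for kw in keyword_list))).sum().sum()
print(count)  # 1: only the first Author Keywords cell holds both keywords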

How to find a specific string inside a bunch of strings? Both inside different cells

How can I find whether one of the words from column A is inside one of the columns B and C?
column A) df['all_types'] = 'spray, protetor, toalha, esfoliante'
column B) df['make_sempre'] = 'limpador-facial,esfoliante,hidratante-labial,hidratante,serum'
column C) df['skin_sempre'] = 'corretivo,batom,produtos-sombrancelha,pinceis,mascara-cilios,iluminador,gloss,blush,delineador'
I've done it using a loop inside a loop inside a loop, and it worked. But with hundreds of thousands of rows, this was impossibly slow.
I've split these words into separate columns and then applied a loop to compare each column with the others.
I'm using Python and pandas.
You might want to try using set operations: treat each column as a set of words (or combine columns B and C into one set), then take the intersection with column A's set to get the words in common, as sketched below.
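A sketch of that set approach (the comma-splitting and the result column are my own assumptions; column names and values are from the question):
import pandas as pd
df = pd.DataFrame({'all_types': ['spray, protetor, toalha, esfoliante'],
                   'make_sempre': ['limpador-facial,esfoliante,hidratante-labial,hidratante,serum'],
                   'skin_sempre': ['corretivo,batom,produtos-sombrancelha,pinceis']})
def to_set(cell):
    # split on commas and strip whitespace so 'a, b' and 'a,b' compare equally
    return {w.strip() for w in cell.split(',')}
# for each row: words of column A that appear in B or C, without nested Python loops per word
df['common'] = df.apply(lambda r: to_set(r['all_types']) & (to_set(r['make_sempre']) | to_set(r['skin_sempre'])), axis=1)
print(df['common'])  # {'esfoliante'} for this row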

Pandas series string manipulation using Python - 1st two chars flip and append to end of string

I have a column (Series) of values in which I'm trying to move characters around, and I'm going nowhere fast! I found some snippets of code to get me where I am, but I need a closer. I'm working with one column of dtype str. Each string in the column is a series of numbers. Some are duplicated. These duplicate number strings have an (n-) in front of the number, where n changes based on how many duplicate number strings are listed. Some may have two duplicates, some eight. It doesn't matter; the order should stay the same.
I need to go down through each cell, pluck the (n-) from the left of the string, swap the two characters around, and append them to the end of the string. No number sorting needed. The column is 4-5k lines long and will look like the example given all the way down. No other special characters or letters. Also, the duplicate rows will always be together, no matter where they are in the column.
My problem: the code below actually works; it steps through each string, checks it for a dash, then rearranges the numbers the way I need. However, I have not learned how to get the changes back into my dataframe from a Python for-loop. I was really hoping that somebody had a nifty lambda fix or a pandas apply function to address the whole column at once, but I haven't found anything that I can tweak to work. I know there is a better way than slowly traversing down through a Series, and I would like to learn it.
Two possible fixes needed:
Is there a way to have the code below replace the old df.string value with the newly created df.string value? If so, please let me know.
I've been trying to read up on df.apply using the split feature so I can address the whole column at once. I understand it's the smarter play. Is there a couple of lines of code that would do what I need?
Please let me know what you think. I appreciate the help. Thank you for taking the time.
import re
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
df = pd.read_excel(r"E:\Book2.xlsx")  # raw string so the backslash is not treated as an escape
df.column1 = df.column1.astype(str)
for r in df['column1']:  # iterate over the column
    if not re.search('-', r):  # skip strings without '-'
        continue
    else:
        a = []  # holder for '-'
        b = []  # holder for digits
        for c in r:
            if c == '-':  # if '-' then hold in a
                a.append(c)
            else:
                b.append(c)  # if digit then hold in b
        t = ''.join(b + a)  # puts '-' at the end of the string
        z = t[1:] + t[:1]  # moves the first char to the end of the string
        r = z  # rebinds the loop variable only; the change never reaches df.column1
print(df)
Starting File: Ending File:
column1 column1
41887 41887
1-41845 41845-1
2-41845 41845-2
40905 40905
1-41323 41323-1
2-41323 41323-2
3-41323 41323-3
41778 41778
You can use Series.str.replace():
If we recreate your example with a file containing all your values and retain column1 as the column name:
import pandas as pd
df=pd.read_csv('file.txt')
df.columns=['column1']
df['column1'] = df['column1'].str.replace(r'(^\d)-(\d+)', r'\2-\1', regex=True)
print(df)
This gives the desired output, replacing the old column with the new one in a single vectorized operation, no loops.
#in
41887
1-41845
2-41845
40905
1-41323
2-41323
3-41323
41778
#out
column1
0 41887
1 41845-1
2 41845-2
3 40905
4 41323-1
5 41323-2
6 41323-3
7 41778
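For completeness, since the question asked about df.apply with a split, the same flip can be written that way (a sketch, equivalent to the regex above for these inputs):
# '1-41845' -> ['1', '41845'] -> '41845-1'; strings without '-' pass through unchanged
df['column1'] = df['column1'].apply(lambda s: s.split('-', 1)[1] + '-' + s.split('-', 1)[0] if '-' in s else s)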

Searching a list of dataframes for a specific value

I scraped a bunch of tables of financial data using pandas.read_excel. I am trying to search through the list of dataframes and select only the ones that contain a certain value/string. Is it possible to do that? I had thought I could do something like:
search = [x.isin('string') for x in df_list]
You might want this (for each frame):
(df == 'foo').any().any()
That will return True if 'foo' is anywhere in the frame (the first .any() reduces each column, the second reduces across columns).
search = [x for x in df_list if x.isin(['string']).any().any()]
isin expects a list-like, so wrap the string in a list. This checks whether the value exists in each column, then reduces across columns, so a frame is kept if the value appears in at least one of its columns.
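A toy demonstration of that filter (the frame contents are made up):
import pandas as pd
df_list = [pd.DataFrame({'a': ['string', 'x']}),
           pd.DataFrame({'a': ['y', 'z']})]
# keep only the frames that contain 'string' somewhere
search = [df for df in df_list if df.isin(['string']).any().any()]
print(len(search))  # 1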

Replacing multiple values within a pandas dataframe cell - python

My problem: I have a pandas dataframe, and one column in particular that I need to process contains values separated by ':'. In some cases one of those values between the colons can be value=value, and it can appear at the start, middle, or end of the string. The length of the string can differ in each cell as we iterate through the rows, e.g.
clickstream['events']
1:3:5:7=23
23=1:5:1:5:3
9:0:8:6=5:65:3:44:56
1:3:5:4
I have a file which contains the lookup values of these numbers, e.g.
event_no,description,event
1,xxxxxx,login
3,ffffff,logout
5,eeeeee,button_click
7,tttttt,interaction
23,ferfef,click1
output required:
clickstream['events']
login:logout:button_click:interaction=23
click1=1:button_click:login:button_click:logout
Is there a pythonic way of looking up these individual values and replacing them with the event column corresponding to the event_no row, as shown in the output? I have hundreds of events and am trying to work out a smart way of doing this. pd.merge would have done the trick if I had a single value, but I'm struggling to work out how to work across the values and ignore the "=value" part of the string.
Edit: to ignore missing keys in the dict:
import pandas as pd
EventsDict = {1:'1:3:5:7',2:'23:45:1:5:3',39:'0:8:46:65:3:44:56',4:'1:3:5:4'}
clickstream = pd.Series(EventsDict)
#Keep this as a dictionary
EventsLookup = {1:'login',3:'logout',5:'button_click',7:'interaction'}
def EventLookup(x):
    list1 = [EventsLookup.get(int(item), 'Missing') for item in x.split(':')]
    return ":".join(list1)
clickstream.apply(EventLookup)
Since you are using a full DF and not just a series, use:
clickstream['events'].apply(EventLookup)
Output:
1 login:logout:button_click:interaction
2 Missing:Missing:login:button_click:logout
4 login:logout:button_click:Missing
39 Missing:Missing:Missing:Missing:logout:Missing...
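The code above ignores the '=value' parts the question mentions. One reading of the desired output is that in a token like '7=23' the number before '=' is an event and the part after it is a literal value; under that assumption, a sketch:
EventsLookup = {1: 'login', 3: 'logout', 5: 'button_click', 7: 'interaction', 23: 'click1'}
def translate(token):
    # '7=23' -> translate the event number before '=', keep the value after it
    if '=' in token:
        left, right = token.split('=', 1)
        return EventsLookup.get(int(left), 'Missing') + '=' + right
    return EventsLookup.get(int(token), 'Missing')
def EventLookupWithValues(x):
    return ':'.join(translate(item) for item in x.split(':'))
print(EventLookupWithValues('1:3:5:7=23'))  # login:logout:button_click:interaction=23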
