I have two data frames: one with the rows to be processed as groups, and another with the groups to be looked up.
test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'city':['X','U','X','U']})
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'city':['X','U','Z','Y'] , 'ID' : ['1','3','4','8']})
gr1 = test.groupby('city')
gr2 = test2.groupby('city')
Currently I am applying my function to every row of the group,
gr1.apply(lambda x: custom_func(x.Address1, gr2.get_group(x.name)))
However I don't know how to do multiprocessing on this. Please advise.
EDIT: I tried to use Dask, but I can't pass the entire data frame to my function in Dask, as there is a limitation with its apply function. I also tried Dask's apply on my gr1 (group), but since I set an index inside my custom function, Dask throws the error "Too many indexers".
With Dask, the following gives me the error 'Pandas' object has no attribute 'city':
ddf1 = dd.from_pandas(test, 2)
ddf2 = dd.from_pandas(test2, 2)
dgr1 = ddf1.groupby('city')
dgr2 = ddf2.groupby('city')
meta = pd.DataFrame(columns=['Address1', 'score', 'idx','source_index'])
ddf1.map_partitions(custom_func, x.Address1, dgr2.get_group(x.city).Address1,meta=meta).compute()
Here is an alternative to using Dask:
import pandas as pd
from multiprocessing import Pool
test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'city':['X','U','X','U']})
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'city':['X','U','Z','Y'] , 'ID' : ['1','3','4','8']})
test = test.assign(dataset='test')
test2 = test2.assign(dataset='test2')
newdf = pd.concat([test2, test], keys=['test2', 'test'])
gpd = newdf.groupby('city')
def my_func(mygrp):
    test_data = mygrp.loc['test']
    test2_data = mygrp.loc['test2']
    # do something specific
    # if needed, print something
    return {'Address': test2_data.Address1.values[0], 'ID': test2_data.ID.values[0]}  # return some other stuff
mypool = Pool(processes=2)
ret_list = mypool.imap(my_func, (group for name, group in gpd))
pd.DataFrame(ret_list)
This returns something like:
   ID               Address
0   3         234 kookie Pl
1   1          123 chese wy
2   8  456 Pretzel Junktion
3   4         345 Pizzza DR
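One caveat: if you run this as a script on a platform that spawns worker processes rather than forking them (Windows, and macOS on recent Python versions), the Pool setup must sit under a main guard, roughly:
if __name__ == '__main__':
    mypool = Pool(processes=2)
    ret_list = mypool.imap(my_func, (group for name, group in gpd))
    print(pd.DataFrame(ret_list))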
PS: In the OP's question two similar datasets are compared in a specialized function; the solution here uses pandas.concat. One could also imagine a pd.merge, depending on the problem.
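For illustration, a minimal sketch of that pd.merge variant (reusing test and test2 from above; the suffixes are my own choice):
merged = pd.merge(test, test2, on='city', suffixes=('_test', '_test2'))
# one row per (test, test2) address pair within each city;
# a row-wise matching function could then be applied to the pairs
print(merged[['city', 'Address1_test', 'Address1_test2', 'ID']])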
Related
Hi, I have a data frame as below:
response                      ticket
so service reset performed    123
reboot done                   343
restart performed             223
no value                      444
ticket created                765
I'm trying something like this:
import pandas as pd
df = pd.read_excel (r'C:\Users\Downloads\response.xlsx')
print (df)
count_other = 0
othersvocab = ['Service reset' , 'Reboot' , 'restart']
if df.response = othersvocab
{
count_other = count_other + 1
}
What I'm trying to do is count how many responses contain any of the words in othersvocab and how many don't.
I'm really new to Python, and I'm not sure how to do this.
Expected Output:
other ticketed
3 2
Can you help me figure it out, hopefully with what's happening in your code?
I am doing this on a lunch break. I don't like the "for other in others" loop I have, and there are better pandas DataFrame methods you could use, but it will have to do.
import pandas as pd
df = pd.DataFrame({"response": ["so service reset performed", "reboot done",
"restart performed"],
"ticket": [123, 343, 223]})
others = ['service reset', 'reboot', 'restart']
count_other = 0
for row in df["response"].values:
    for other in others:
        if other in row:
            count_other += 1
First, note that if you want to perform this the way I have, you will need to lowercase the response column and the others variable; that's not very hard (look up pandas apply and the string method .lower), as sketched below.
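That lowercasing step might look like this (a quick sketch reusing the df and others from above):
df["response"] = df["response"].str.lower()
others = [other.lower() for other in others]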
What I have done here is loop first over the values in the response column.
Then, within this loop, I loop over the items in the others list.
Finally, I check whether each of these items appears in the row's text.
I hope my rushed response lends a hand.
Consider the df below:
In [744]: df = pd.DataFrame({'response':['so service reset performed', 'reboot done', 'restart performed', 'no value', 'ticket created'], 'ticket':[123, 343, 223, 444, 765]})
In [745]: df
Out[745]:
response ticket
0 so service reset performed 123
1 reboot done 343
2 restart performed 223
3 no value 444
4 ticket created 765
Below is your othersvocab:
In [727]: othersvocab = ['Service reset' , 'Reboot' , 'restart']
# Converting all elements to lowercase
In [729]: othersvocab = [i.lower() for i in othersvocab]
Use Series.str.contains:
# Converting response column to lowercase
In [733]: df.response = df.response.str.lower()
In [740]: count_in_vocab = len(df[df.response.str.contains('|'.join(othersvocab))])
In [742]: count_others = len(df) - count_in_vocab
In [752]: res = pd.DataFrame({'other': [count_in_vocab], 'ticketed': [count_others]})
In [753]: res
Out[753]:
other ticketed
0 3 2
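As an aside, since Series.str.contains returns a boolean Series, the matching count can also be taken directly with .sum():
In [754]: df.response.str.contains('|'.join(othersvocab)).sum()
Out[754]: 3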
I'm learning object-oriented programming in a data science context.
I want to understand what good practice is in terms of writing methods within a class that relate to one another.
When I run my code:
import pandas as pd
pd.options.mode.chained_assignment = None
class MyData:
    def __init__(self, file_path):
        self.file_path = file_path

    def prepper_fun(self):
        '''Reads in an Excel sheet, gets rid of missing values and converts the data to numeric.'''
        df = pd.read_excel(self.file_path)
        df = df.dropna()
        df = df.apply(pd.to_numeric)
        self.df = df
        return df

    def quality_fun(self):
        '''Checks if any value in any column is more than 10. If it is, the value is
        replaced with the warning 'check original data value'.'''
        for col in self.df.columns:
            for row in self.df.index:
                if self.df[col][row] > 10:
                    self.df[col][row] = 'check original data value'
        return self.df
data = MyData('https://archive.ics.uci.edu/ml/machine-learning-databases/00429/Cryotherapy.xlsx')
print(data.prepper_fun())
print(data.quality_fun())
I get the following output (only part of the output is shown due to space constraints):
sex age Time
0 1 35 12.00
1 1 29 7.00
2 1 50 8.00
3 1 32 11.75
4 1 67 9.25
.. ... ... ...
sex age Time
0 1 check original data value check original data value
1 1 check original data value 7
2 1 check original data value 8
3 1 check original data value check original data value
4 1 check original data value 9.25
.. ... ... ...
I am happy with the output generated by each method.
But if I try to call print(data.quality_fun()) without first calling print(data.prepper_fun()), I get the error AttributeError: 'MyData' object has no attribute 'df'.
Being new to object-oriented programming, I am wondering whether it is considered good practice to structure things like this, or whether there is some other way of doing it.
Thanks for any help!
Make sure you have the df before you use it.
class MyData:
    def __init__(self, file_path):
        self.file_path = file_path
        self.df = None

    def quality_fun(self):
        if self.df is None:
            self.prepper_fun()
        # rest of the code
If the file isn't changed during runtime, you should call self.prepper_fun() in __init__; calling it separately leads to a high chance of bugs.
If the file is changed at runtime, then the other answer works perfectly well.
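A minimal sketch of that eager-loading variant (same attributes as in the question; prepper_fun's body condensed):
class MyData:
    def __init__(self, file_path):
        self.file_path = file_path
        self.prepper_fun()  # populate self.df up front

    def prepper_fun(self):
        self.df = pd.read_excel(self.file_path).dropna().apply(pd.to_numeric)
        return self.df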
How to apply stemming to a pandas DataFrame column
I am using this function for stemming, and it works perfectly on a string:
xx='kenichan dived times ball managed save 50 rest'
def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma = str(token.lemma_)
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    print(" ".join(x_list))
make_to_base(xx)
But when I apply this function to my pandas DataFrame column, it does not work, nor does it give any error:
x = list(df['text'])  # my df column
x = str(x)  # converting to a string, otherwise it gives an error
make_to_base(x)
I've tried different things, but nothing works, e.g.:
df["texts"] = df.text.apply(lambda x: make_to_base(x))
make_to_base(df['text'])
my dataset looks like this:
df['text'].head()
Out[17]:
0 Hope you are having a good week. Just checking in
1 K..give back my thanks.
2 Am also doing in cbe only. But have to pay.
3 complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 okmail: Dear Dave this is your final notice to...
Name: text, dtype: object
You need to actually return the value built inside the make_to_base function; use:
def make_to_base(x):
    x_list = []
    for token in nlp(x):
        lemma = str(token.lemma_)
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    return " ".join(x_list)
Then, use
df['texts'] = df['text'].apply(lambda x: make_to_base(x))
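Note that this assumes nlp is an already-loaded spaCy pipeline, for example:
import spacy

nlp = spacy.load("en_core_web_sm")  # any installed English model works here
(The '-PRON-' lemma check implies spaCy 2.x; spaCy 3 no longer emits '-PRON-'.)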
I am trying to match an exact substring in a string column of a pandas DataFrame, but somehow str.contains doesn't seem to work here. I saw in the documentation that you can pass regex=False, but that isn't working either. Can anyone suggest a solution?
Current output:
Creative Name Revised Targeting Type
0 ff~tg~conbhv contextual
1 ff~tg~conbhv contextual
2 ff~tg~con contextual
Expected Output:
Creative Name Revised Targeting Type
0 ff~tg~conbhv contextual + behavioral
1 ff~tg~conbhv contextual + behavioral
2 ff~tg~con contextual
Approach:
import pandas as pd
import numpy as np
column = {'col_name': ['Revised Targeting Type']}
data = {"Creative Name":["ff~pd~q4-smartphones-note10-pdp-iphone7_mk~gb_ch~social_md~h_ad~ss1x1_dt~cross_fm~spost_pb~fcbk_sz~1x1_rt~cpm_tg~conbhv_sa~lo_vv~ia_it~soc_ts~lo-iphone7_ff~ukp q4 smartphones ukc q4 - smartphones - static ukt lo-iphone7 ukcdj buy_ct~fb_cs~1x1_lg~engb_cv~ge_ce~loc_mg~oth_ta~lrn_cw~na",
"ff~tg~conbhv",
"ff~tg~con"], "Revised Targeting Type":["ABC", "NA", "NA"]}
mapping = {"Code": ['con', 'conbhv'], "Actual": ['contextual', 'contextual + behavioral'], "OtherPV": [np.nan, np.nan],
"SheetName": ['tg', 'tg']}
# Creating the DataFrames
dataframe_data = pd.DataFrame(data)
mapping_data = pd.DataFrame(mapping)
column_data = pd.DataFrame(column)
print(dataframe_data)
print(mapping_data)
print(column_data)
# loop through the DataFrame columns listed in column_data
for i in column_data.iloc[:, 0]:
    print(i)
    # loop through the mapping DataFrame (mapping_data)
    for k, l, m in zip(mapping_data.iloc[:, 0], mapping_data.iloc[:, 1], mapping_data.iloc[:, 3]):
        # mask the rows of dataframe_data whose value is still the 'NA' placeholder
        mask_null_revised_new_col = (dataframe_data['{}'.format(i)].isin(['NA']))
        # apply the mapped values in the main DataFrame (dataframe_data)
        dataframe_data['{}'.format(i)] = np.select(
            [mask_null_revised_new_col &
             dataframe_data['Creative Name'].str.contains('{}~{}'.format(m, k))],
            [l], default=dataframe_data['{}'.format(i)])
print(dataframe_data)
Creative Name Revised Targeting Type
0 ff~tg~conbhv contextual
1 ff~tg~conbhv contextual
2 ff~tg~con contextual
To be honest, I'm a little confused by your question, but is this what you're looking for?
dataframe_data['Revised Targeting Type'] = np.where(dataframe_data['Creative Name'].str.contains('conbhv', regex=False), 'contextual + behavioral', 'contextual')
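If the codes have to match exactly (so that 'con' does not also fire on rows containing 'conbhv'), a hedged alternative is to extract the code that follows 'tg~' and map it through the mapping table; all names below come from the question's own frames:
# build a code -> label lookup from mapping_data
code_map = dict(zip(mapping_data['Code'], mapping_data['Actual']))
# pull out the letters following 'tg~' (the match stops at '_' or end of string)
codes = dataframe_data['Creative Name'].str.extract(r'tg~([a-z]+)')[0]
# fill only the rows still holding the 'NA' placeholder
mask = dataframe_data['Revised Targeting Type'].isin(['NA'])
dataframe_data.loc[mask, 'Revised Targeting Type'] = codes[mask].map(code_map)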
(I suck at titling these questions...)
So I've gotten 90% of the way through a very laborious learning process with pandas, but I have one thing left to figure out. Let me show an example (the actual original is a comma-delimited CSV with many more rows):
Name  Price  Rating  URL                Notes1     Notes2            Notes3
Foo   $450   9       a.com/x            NaN        NaN               NaN
Bar   $99    5       see over           www.b.com  Hilarious         Nifty
John  $551   2       www.c.com          Pretty     NaN               NaN
Jane  $999   8       See Over in Notes  Funky      http://www.d.com  Groovy
The URL column can say many different things, but they all include "see over," and do not indicate with consistency which column to the right includes the site.
I would like to do a few things here: first, move websites from any Notes column to URL; second, collapse all Notes columns into one column with a new line between entries. So this (NaNs replaced with empty strings, since pandas makes me do that to use df.loc on them):
Name Price Rating URL Notes1
Foo $450 9 a.com/x
Bar $99 5 www.b.com Hilarious
Nifty
John $551 2 www.c.com Pretty
Jane $999 8 http://www.d.com Funky
Groovy
I got partway there by doing this:
df['URL'] = df['URL'].fillna('')
df['Notes1'] = df['Notes1'].fillna('')
df['Notes2'] = df['Notes2'].fillna('')
df['Notes3'] = df['Notes3'].fillna('')
to_move = df['URL'].str.lower().str.contains('see over')
df.loc[to_move, 'URL'] = df['Notes1']
What I don't know is how to find the Notes column with either www or .com. If I, for example, try to use my above method as a condition, e.g.:
if df['Notes1'].str.lower().str.contains('www'):
    df.loc[to_move, 'URL'] = df['Notes1']
I get back ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() But adding .any() or .all() has the obvious flaw that they don't give me what I'm looking for: with any, e.g., every line that meets the to_move requirement in URL will get whatever's in Notes1. I need the check to occur row by row. For similar reasons, I can't even get started collapsing the Notes columns (and I don't know how to check for non-null empty string cells, either, a problem I created at this point).
Where it stands, I know I also have to move in Notes2 to Notes1, Notes3 to Notes2, and '' to Notes3 when the first condition is satisfied, because I don't want the leftover URLs in the Notes columns. I'm sure pandas has easier routes than what I'm doing, because it's pandas, and when I try to do anything with pandas, I find out that it can be done in one line instead of my 20...
(PS, I don't care if the empty columns Notes2 and Notes3 are left over, b/c I'm not using them in my CSV import in the next step, though I can always learn more than I need)
UPDATE: So I figured out a crummy verbose solution using my non-pandas python logic one step at a time. I came up with this (same first five lines above, minus the df.loc line):
url_in1 = df['Notes1'].str.contains(r'\.com')
url_in2 = df['Notes2'].str.contains(r'\.com')
to_move = df['URL'].str.lower().str.contains('see over')
to_move1 = to_move & url_in1
to_move2 = to_move & url_in2
df.loc[to_move1, 'URL'] = df.loc[url_in1, 'Notes1']
df.loc[url_in1, 'Notes1'] = df['Notes2']
df.loc[url_in1, 'Notes2'] = ''
df.loc[to_move2, 'URL'] = df.loc[url_in2, 'Notes2']
df.loc[url_in2, 'Notes2'] = ''
(Lines are moved around and to_move is repeated in the actual code.) I know there has to be a more efficient method... This also doesn't collapse the Notes columns, but that should be easy using the same method, except that I still don't know a good way to find the empty strings.
I'm still learning pandas, so some parts of this code may be not so elegant, but the general idea is: get all Notes columns, find all URLs in there, combine them with the URL column, and then concat the remaining notes into Notes1:
import pandas as pd
import numpy as np
# Just to get the first non-null occurrence in a row
def geturl(s):
    try:
        return next(e for e in s if not pd.isnull(e))
    except StopIteration:
        return np.NaN
df = pd.read_csv("d:/temp/data2.txt")
dfnotes = df[[e for e in df.columns if 'Notes' in e]]
# Notes1 Notes2 Notes3
# 0 NaN NaN NaN
# 1 www.b.com Hilarious Nifty
# 2 Pretty NaN NaN
# 3 Funky http://www.d.com Groovy
dfurls = dfnotes.apply(lambda x: x.str.contains(r'\.com'), axis=1)
dfurls = dfurls.fillna(False).astype(bool)
# Notes1 Notes2 Notes3
# 0 False False False
# 1 True False False
# 2 False False False
# 3 False True False
turl = dfnotes[dfurls].apply(geturl, axis=1)
df['URL'] = np.where(turl.isnull(), df['URL'], turl)
df['Notes1'] = dfnotes[~dfurls].apply(lambda x: ' '.join(x.dropna()), axis=1)
del df['Notes2']
del df['Notes3']
df
# Name Price Rating URL Notes1
# 0 Foo $450 9 a.com/x
# 1 Bar $99 5 www.b.com Hilarious Nifty
# 2 John $551 2 www.c.com Pretty
# 3 Jane $999 8 http://www.d.com Funky Groovy
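The key trick here is that indexing dfnotes with the boolean frame dfurls masks every non-URL cell to NaN, so geturl only has to pick the first non-null value per row; the complement dfnotes[~dfurls] does the same for the remaining notes before they are joined into Notes1.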