I have a pandas dataframe with 500k rows that contains paid-out expenses. It looks like so:
As you can see, the 'LastName' column can contain entries that should be the same but in practice contain minor differences. My ultimate goal is to see how much was paid to each entity by doing a simple .groupby() and .sum(). However, for that to work the entries under 'LastName' must be uniform.
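(For reference, the end goal is just a plain aggregation once the names are clean; 'Amount' is a hypothetical name for the expense column here.)

totals = expenditures_df.groupby('LastName')['Amount'].sum()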
I'm attempting to solve this problem using fuzzywuzzy.
First I take the unique values from 'LastName' and save them to a list for comparison:
choices = expenditures_df['LastName'].astype('str').unique()
This leaves me with 50k unique entries from 'LastName' that I now need to compare the full 500k against.
Then I run through every line in the dataframe and look at its similarity to each choice. If the similarity is high enough I overwrite the data in the dataframe with the entity name from choices.
for choice in choices:
    word = str(choice)
    for i in expenditures_df.index:
        if fuzz.ratio(word, str(expenditures_df.loc[i, 'LastName'])) > 79:
            expenditures_df.loc[i, 'LastName'] = word
The problem, of course, is this is incredibly slow. So, I'd love some thoughts on accomplishing the same thing in a more efficient manner.
See: How to group words whose Levenshtein distance is more than 80 percent in Python
Based on this you can do something like:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.DataFrame({'expenditure':[1000,500,250,11,456,755],'last_name':['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']})
choices = df['last_name'].unique()
grs = list()  # groups of names with similarity ratio > 80
for name in choices:
    for g in grs:
        if all(fuzz.ratio(name, w) > 80 for w in g):
            g.append(name)
            break
    else:
        grs.append([name, ])
name_map = []
for c, group in enumerate(grs):
    for name in group:
        name_map.append([c, name])
group_map = pd.DataFrame(name_map, columns=['group','name'])
df = df.merge(group_map, left_on='last_name',right_on='name')
df = df.groupby('group')['expenditure'].sum().reset_index()
df = df.merge(group_map.groupby('group')['name'].apply(list), on='group')
OUTPUT:
group expenditure name
0 0 1500 [rakesh, zakesh]
1 1 261 [bikash, zikash]
2 2 1211 [goldman LLC, oldman LLC]
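If you also need a uniform name column back on the original rows (the asker's stated goal), you can pick one representative spelling per group and merge it in before aggregating. A small sketch reusing group_map, where expenditures stands in for the original, un-aggregated dataframe:

canonical = group_map.groupby('group')['name'].first().rename('canonical_name').reset_index()
expenditures = expenditures.merge(group_map, left_on='last_name', right_on='name')\
                           .merge(canonical, on='group')
# every row now carries a 'canonical_name' that is uniform within its fuzzy group
expenditures.groupby('canonical_name')['expenditure'].sum()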
All credit to the Kaggle course on pandas.
Here is the dataset.head() (a screenshot of the wine reviews dataframe; the relevant columns are 'country' and 'points'):
Here is the task:
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.
Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.
Create a series star_ratings with the number of stars corresponding to each review in the dataset.
Here is the solution:
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
star_ratings = reviews.apply(stars, axis='columns')
So here is my question: when this solution applies the function to each row, is it checking the condition against every column in the row and applying the result to all of them? It doesn't specify that only the 'points' column should be used.
Creating a sample dataframe and then using the same function that you have created, you would only need to do an .apply() to get the required result.
Note: This is a sample dataset, you can use your own wine dataset instead of creating it in the second line of the code.
import pandas as pd
wine = pd.DataFrame({"country": ["Canada", "US", "Aus"], "points": [85,99,45]})
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
wine["stars"] = wine.apply(lambda x: stars(x), axis = 1)
Explanation:
The .apply() function applies a given function to each row or column of a pandas dataframe. Since here we want to apply it to each row, we pass an additional parameter, axis=1; axis defaults to 0 (column-wise).
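As a quick illustration of the axis argument (using the sample wine frame above):

# axis=0 (the default): the function receives one column (a Series) at a time
wine.apply(len)          # row count of each column
# axis=1: the function receives one row (a Series) at a time, which is what stars() expects
wine.apply(len, axis=1)  # number of fields in each row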
There are multiple conditions. One applies to the "country" column, while the other two apply to the "points" column. The "alternative result" with else is used when no condition is met. That said, it is better practice in pandas to use np.select, so that your solution is vectorized (faster run time):
import numpy as np
star_ratings = np.select([(reviews.country == 'Canada') | (reviews.points >= 95),
                          (reviews.points >= 85)],  # conditions
                         [3, 2],                    # results
                         1)                         # default result (like your else)
The three arguments are the conditions (a list of boolean conditions, checked in order), the results (a list of values matching the conditions), and the default result used when no condition is met. More in the numpy.select documentation.
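A self-contained sketch on the small sample frame from the answer above, just to show the three arguments together (the data is made up for illustration):

import numpy as np
import pandas as pd

wine = pd.DataFrame({"country": ["Canada", "US", "Aus"], "points": [85, 99, 45]})
conditions = [(wine.country == 'Canada') | (wine.points >= 95),  # 3 stars
              (wine.points >= 85)]                               # 2 stars
wine["stars"] = np.select(conditions, [3, 2], default=1)
print(wine)
#   country  points  stars
# 0  Canada      85      3
# 1      US      99      3
# 2     Aus      45      1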
I am working on NLP with SparkNLP and SparkML on Databricks.
I used LDA (from SparkML) for topic modeling and got the following topics.
It is a pyspark dataframe (df1):
df1:
t_id word_index weights
0 [0, 2, 3] [0.2105, 0.116, 0.18]
1 [1, 4, 6] [0.15, 0.05, 0.36]
"t_id" is topic id.
"weights" is the weight value of each word with index in "word_index"
The "word_index" in df1 corresponds to the location of each word in the list (lt).
df1 is small with not more than 100 rows.
I have a word list (lt); it is a Python list:
lt:
['like', 'book', 'music', 'bike', 'great', 'pen', 'laptop']
lt has about 20k words.
I have another large pyspark dataframe (df2) with more than 20 million rows.
Its size is 50+ GB.
df2:
u_id p_id reviews
sra tvs "I like this music" # some english tokens (each token can be found in "lt")
fbs dvr "The book is great"
I would like to assign the "t_id" (topics) in df1 to each row of df2 such that I can get a pyspark dataframe like:
u_id p_id reviews t_id the_highest_weights
sra tvs "I like this music" 1 # the highest of all tokens' weights among all "t_id"s
fbs dvr "The book is great" 4
But, one review may have multiple "t_id" (topics) because the review may have words covered by multiple "t_id".
So I have to calculate each "t_id"'s total weights such that the "t_id" with the highest total weights is assigned to the "reviews" in df2.
It is presented as "the_highest_weights" of the final result.
I do not want to use "for loops" to work on this row by row because it is not efficient for the large dataframe.
How can I use pyspark dataframe (not pandas) and vectorization (if needed) to get the result efficiently ?
thanks
I am not sure about the exact thing you want to compute but you will be able to tweak this answer to get what you need. Let's say you want to find for each sentence, the t_id with the maximum score (given by the sum of the weights of its tokens).
You can start by generating a dataframe that associates each word to its index.
df_lt = spark.createDataFrame([(i, lt[i]) for i in range(0, len(lt))],
                              ['word_index', 'w'])
Then, we will flatten df1 so that each line contains a t_id, a word index and the corresponding weight. For this, we can use a UDF. Note that in Spark >= 2.4 you could avoid the UDF with the built-in arrays_zip function, but since df1 is small, using a UDF won't be a problem.
from pyspark.sql.functions import broadcast, explode, split, struct, udf
from pyspark.sql.functions import sum, max  # Spark column aggregates; these shadow the Python builtins
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, StructField, StructType

def create_pairs(index, weights):
    return [(index[i], weights[i]) for i in range(0, len(index))]

create_pairs_udf = udf(create_pairs, ArrayType(StructType([
    StructField('word_index', IntegerType()),
    StructField('weight', DoubleType())
])))

df1_exp = df1\
    .select('t_id', explode(create_pairs_udf(df1['word_index'], df1['weights']))
            .alias('pair'))\
    .select('t_id', 'pair.word_index', 'pair.weight')
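For reference, a sketch of the UDF-free variant on Spark >= 2.4, using arrays_zip to pair up the two array columns (the struct field names follow the input column names, so the weight field comes out as 'weights' and is renamed below):

from pyspark.sql.functions import arrays_zip, col, explode

df1_exp = df1\
    .select('t_id', explode(arrays_zip('word_index', 'weights')).alias('pair'))\
    .select('t_id',
            col('pair.word_index').alias('word_index'),
            col('pair.weights').alias('weight'))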
Finally, the main work is done on df2, the large dataframe. We start by exploding each sentence to get one word per line (keeping the u_id and p_id). Then we join with df_lt to translate words into indices. Then, by joining with df1_exp, we associate each word index with its weight. Then we group by u_id, p_id and t_id to compute the sum of the weights, and group again to select the best t_id for each sentence.
To speed up things, we can hint spark to broadcast df_lt and df1_exp that are small to avoid shuffling df2 which is large.
The code looks like this:
df2\
    .select("u_id", "p_id", explode(split(df2['reviews'], "\\s+")).alias("w"))\
    .join(broadcast(df_lt), ['w'])\
    .drop('w')\
    .join(broadcast(df1_exp), ['word_index'])\
    .groupBy('u_id', 'p_id', 't_id')\
    .agg(sum('weight').alias('score'))\
    .withColumn('t_id', struct('score', 't_id'))\
    .groupBy('u_id', 'p_id')\
    .agg(max('t_id').alias('t_id'))\
    .select('u_id', 'p_id', 't_id.score', 't_id.t_id')\
    .show()
+----+----+------+----+
|u_id|p_id| score|t_id|
+----+----+------+----+
| fbs| dvr| 0.2| 1|
| sra| tvs|0.3265| 0|
+----+----+------+----+
I have data that looks like:
Identifier Category1 Category2 Category3 Category4 Category5
1000 foo bat 678 a.x ld
1000 foo bat 78 l.o op
1000 coo cat 678 p.o kt
1001 coo sat 89 a.x hd
1001 foo bat 78 l.o op
1002 foo bat 678 a.x ld
1002 foo bat 78 l.o op
1002 coo cat 678 p.o kt
What I am trying to do is compare 1000 to 1001 and to 1002, and so on. The output I want the code to give is: 1000 is the same as 1002. So, the approach I wanted to use was:
First, group all the identifier items into separate dataframes (maybe?). For example, df1 would be all rows pertaining to identifier 1000 and df2 would be all rows pertaining to identifier 1002. (Please note that I want the code to do this itself as there are millions of rows, as opposed to me writing code to manually compare identifiers.) I have tried using the groupby feature of pandas; it does the grouping part well, but then I do not know how to compare the groups.
Compare each of the groups/sub-data frames.
One method I was thinking of was reading each row of a particular identifier into an array/vector and comparing arrays/vectors using a comparison metric (Manhattan distance, cosine similarity etc).
Any help is appreciated, I am very new to Python. Thanks in advance!
You could do something like the following:
import pandas as pd
input_file = pd.read_csv("input.csv")
columns = ['Category1','Category2','Category3','Category4','Category5']
duplicate_entries = {}
for group in input_file.groupby('Identifier'):
    # transforming to tuples so that it can be used as keys on a dict
    lines = [tuple(y) for y in group[1].loc[:, columns].values.tolist()]
    key = tuple(lines)
    if key not in duplicate_entries:
        duplicate_entries[key] = []
    duplicate_entries[key].append(group[0])
Then the values of duplicate_entries will be the lists of Identifiers that share identical rows:
duplicate_entries.values()
> [[1000, 1002], [1001]]
EDIT:
To get only the entries that have duplicates, you could have something like:
all_dup = [ids for ids in duplicate_entries.values() if len(ids) > 1]
Explaining the indices (sorry I didn't explain it before): Iterating through the df.groupby outcome gives a tuple where the first entry is the key of the group (in this case the 'Identifier') and the second is the sub-DataFrame containing that group's rows. So to get the lines that contain the possibly duplicated entries we use [1], and the 'Identifier' for that group is found at [0]. Because in duplicate_entries we want the identifier of that entry, group[0] gives us that.
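As a tiny illustration of that unpacking (same input_file as above):

for identifier, rows in input_file.groupby('Identifier'):
    # identifier is group[0] (the key); rows is group[1] (the sub-DataFrame for that key)
    print(identifier, len(rows))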
We could separate the data into groups with groupby, then sort each group by all columns except "Identifier" (so we can detect equal groups even when their rows are in a different order) and compare the groups:
Suppose that columns = ["Identifier", "Category1", "Category2", "Category3", "Category4", "Category5"]
We can do:
groups = []
pure_groups = []
for name, group in df.groupby("Identifier"):
    pure_groups += [group]
    g_idfless = group[group.columns.difference(["Identifier"])]
    groups += [g_idfless.sort_values(columns[1:]).reset_index().drop("index", axis=1)]
And to compare them:
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        id1 = str(pure_groups[i]["Identifier"].iloc[0])
        id2 = str(pure_groups[j]["Identifier"].iloc[0])
        print(id1 + " and " + id2 + " equal?: " + str(groups[i].equals(groups[j])))
#-->1000 and 1001 equal?: False
#-->1000 and 1002 equal?: True
#-->1001 and 1002 equal?: False
EDIT: Added code to print the identifiers of the groups that match
Apologies for the messy title: Problem as follows:
I have some data frame of the form:
df1 =
Entries
0 "A Level"
1 "GCSE"
2 "BSC"
I also have a data frame of the form:
df2 =
Secondary Undergrad
0 "A Level" "BSC"
1 "GCSE" "BA"
2 "AS Level" "MSc"
I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match are saved (Words_Present):
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    return Attribute
I apply this function over all entries in df1, and all columns in df2, using the following iteration:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
This yields an output df which looks something like:
df1 =
Entries Secondary undergrad
0 "A Level" 1 0
1 "GCSE" 1 0
2 "AS Level" 1 0
I want to amend the word_search function to output the Words_Present list as well as the Attribute, and put these into new columns, so that my eventual df1 looks like:
Desired dataframe:
Entries Secondary Words Found undergrad Words Found
0 "A Level" 1 "A Level" 0
1 "GCSE" 1 "GCSE" 0
2 "AS Level" 1 "AS Level" 0
If I do:
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    if Yes == 0:
        Attribute = 0
    return Attribute, Words_Present
My function therefore now has multiple outputs. So applying the following:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
My Output Looks like this:
Entries Secondary undergrad
0 "A Level" [1,"A Level"] 0
1 "GCSE" [1, "GCSE"] 0
2 "AS Level" [1, "AS Level"] 0
The output of pd.apply() is always a pandas series, so it just shoves everything into the single cell of df[i] where i = secondary.
Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?
I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:
Applying function with multiple arguments to create a new pandas column
Create multiple columns in Pandas Dataframe from one function
Apply pandas function to column to create multiple new columns?
For example, I have also tried:
for i in df2:
    terms = df2[i].values.tolist()
    [df1[i], df1[i] + "Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x, i, terms))])
but this simply yields errors such as:
raise ValueError('Length of values does not match length of ' 'index')
Is there a way to use apply, but still extract the extra information directly into multiple columns?
Many thanks, apologies for the length.
The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().
However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.
It looks like you are trying to classify Entries into Secondary or Undergrad, and saving the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:
df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)')
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)')
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1
EDIT:
In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.
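A minimal sketch of that loop, assuming the categories you want are exactly the column names of df2, lower-cased to match the column names used above (re.escape guards against keywords containing regex metacharacters):

import re

df = df1.copy()
for category in df2.columns:  # e.g. 'Secondary', 'Undergrad'
    # build an alternation pattern like '(A Level|GCSE|AS Level)' from this column's keywords
    pattern = '(' + '|'.join(re.escape(w) for w in df2[category].values) + ')'
    df[category.lower() + '_words_found'] = df.Entries.str.extract(pattern, expand=False)
    df[category.lower()] = df[category.lower() + '_words_found'].notnull() * 1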
However, if you have exact matches, you can do everything by a combination of pivots and joins:
keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis = 1).rename(columns={'level_1':'category'})
df = df1.merge(keywords, how = 'left')
for colname in df.category.dropna().unique():
    df[colname] = (df.category == colname) * 1  # Your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']
The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. Your keyword column must be the same as the column in df1; in SQL, this would be called the foreign key that you are going to join these tables on.
Also, you generally want to avoid having duplicate indexes or columns, which in your case, was Words Found in the desired dataframe!
For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:
secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values

def classify(s):
    if s.Entries in secondary_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 1, 'secondary_words_found': s.Entries, 'Undergrad': 0, 'undergrad_words_found': ''})
    elif s.Entries in undergrad_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 1, 'undergrad_words_found': s.Entries})
    else:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 0, 'undergrad_words_found': ''})
This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.
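For completeness, you would run that version row-wise with apply; because classify returns a pd.Series, apply expands it into one column per key (classify is just the name given to the sketch above):

result = df1.apply(classify, axis=1)
# result is a DataFrame with columns Entries, Secondary, secondary_words_found, Undergrad, undergrad_words_found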
My data analysis repeatedly falls back on a simple but iffy motif, namely "groupby everything except". Take this multi-index example, df:
accuracy velocity
name condition trial
john a 1 -1.403105 0.419850
2 -0.879487 0.141615
b 1 0.880945 1.951347
2 0.103741 0.015548
hans a 1 1.425816 2.556959
2 -0.117703 0.595807
b 1 -1.136137 0.001417
2 0.082444 -1.184703
What I want to do now, for instance, is averaging over all available trials while retaining info about names and conditions. This is easily achieved:
average = df.groupby(level=('name', 'condition')).mean()
Under real-world conditions, however, there's a lot more metadata stored in the multi-index. The index easily spans 8-10 columns per row. So the pattern above becomes quite unwieldy. Ultimately, I'm looking for a "discard" operation; I want to perform an operation that throws out or reduces a single index column. In the case above, that's trial number.
Should I just bite the bullet or is there a more idiomatic way of going about this? This might well be an anti-pattern! I want to build a decent intuition when it comes to the "true pandas way"... Thanks in advance.
You could define a helper-function for this:
def allbut(*names):
    names = set(names)
    return [item for item in levels if item not in names]
Demo:
import numpy as np
import pandas as pd

levels = ('name', 'condition', 'trial')
names = ('john', 'hans')
conditions = list('ab')
trials = range(1, 3)

idx = pd.MultiIndex.from_product(
    [names, conditions, trials], names=levels)
df = pd.DataFrame(np.random.randn(len(idx), 2),
                  index=idx, columns=('accuracy', 'velocity'))
def allbut(*names):
    names = set(names)
    return [item for item in levels if item not in names]
In [40]: df.groupby(level=allbut('condition')).mean()
Out[40]:
accuracy velocity
trial name
1 hans 0.086303 0.131395
john 0.454824 -0.259495
2 hans -0.234961 -0.626495
john 0.614730 -0.144183
You can remove more than one level too:
In [53]: df.groupby(level=allbut('name', 'trial')).mean()
Out[53]:
accuracy velocity
condition
a -0.597178 -0.370377
b -0.126996 -0.037003
In the documentation of groupby, there is an example of how to group by all but one specified column of a multiindex. It uses the .difference method of the index names:
df.groupby(level=df.index.names.difference(['name']))
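Applied to the example above, dropping the trial level looks like this (.difference returns the remaining index names):

# group by every index level except 'trial', then aggregate
df.groupby(level=df.index.names.difference(['trial'])).mean()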