Currently working to implement some fuzzy matching logic to group together emails with similar patterns and I need to improve the efficiency of part of the code but not sure what the best path forward is. I use a package to output a pandas dataframe that looks like this:
I redacted the data, but it's just four columns with an ID #, the email associated with a given ID, a group ID number that identifies the cluster a given email falls into, and then the group rep which is the most mathematically central email of a given cluster.
What I want to do is count the number of occurrences of each distinct element in the group rep column and create a new dataframe that's just two columns with one column having the group rep email and then the second column having the corresponding count of that group rep in the original dataframe. It should look something like this:
As of now, I'm converting my group reps to a list and then using a for-loop to create a list of tuples (I think?), with each tuple containing a centroid email (the group identifier) and the number of times that identifier occurs in the original df (aka the number of emails in the original data that belong to that centroid email's group). The code looks like this:
groups = list(df['group rep'].unique())
# preparing list of tuples with group count
req_groups = []
for g in groups:
    count = (g, df['group rep'].value_counts()[g])
    #print(count)
    req_groups.append(count)
print(req_groups)
Unfortunately, this operation takes far too long. I'm sure there's a better solution, but could definitely use some help finding a path forward. Thanks in advance for your help!
You can use df.groupby('group rep').count().
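As a minimal sketch of that idea (the sample rows below are made up; only the 'group rep' column name comes from the question), grouping once and taking the size of each group gives the two-column frame directly, without the loop. Note that count() returns one count per remaining column, while size() returns a single count per group:

```python
import pandas as pd

# Made-up sample data; only the 'group rep' column name comes from the question
df = pd.DataFrame({
    "group rep": ["apple#gmail.com", "citrus#protonmail.com", "apple#gmail.com",
                  "pear#gmail.com", "apple#gmail.com"],
})

# One pass over the data: number of rows per distinct group rep
counts = (
    df.groupby("group rep")
      .size()
      .reset_index(name="count")
      .sort_values("count", ascending=False, ignore_index=True)
)
print(counts)
```

The original loop recomputes value_counts() for every group, which is most likely why it is slow; here the counting happens once.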
Let's consider the following dataframe :
email
0 zucchini#yahoo.fr
1 apple#gmail.com
2 citrus#protonmail.com
3 banana#gmail.com
4 pear#gmail.com
5 apple#gmail.com
6 citrus#protonmail.com
Proposed script
import pandas as pd
import operator
m = {'email':['zucchini#yahoo.fr','apple#gmail.com','citrus#protonmail.com','banana#gmail.com',
'pear#gmail.com','apple#gmail.com','citrus#protonmail.com']}
df = pd.DataFrame(m)
counter = pd.DataFrame.from_dict({c: [operator.countOf(df['email'], c)] for c in df['email'].unique()})
cnt_df = counter.T.rename(columns={0:'count'})
print(cnt_df)
Result
count
zucchini#yahoo.fr 1
apple#gmail.com 2
citrus#protonmail.com 2
banana#gmail.com 1
pear#gmail.com 1
Related
I have a data frame containing hyponym and hypernym pairs extracted from StackOverflow posts. You can see an excerpt from it in the following:
0 1 2 3 4
linq query asmx web service THH 10 a linq query as an asmx web service
application bolt THH 1 my application is a bolt on data visualization...
area r time THH 1 the area of the square is r times
sql query syntax HTH 3 sql like query syntax
...
7379596 rows × 5 columns
Columns 0 and 1 contain the hyponym and hypernym parts of the phrases in column 4. I would like to implement a filter based on statistical features, so I need to count the occurrences of each (0, 1) pair, as well as the occurrences of each hyponym and each hypernym on its own. Pandas has a method called value_counts(), so the counts can be obtained with:
df.value_counts([0])
df.value_counts([1])
df.value_counts([0, 1])
This is nice, but the method returns a Pandas Series that has far fewer records than the original DataFrame and is indexed by the counted values rather than by the original rows. Therefore, adding a new column like df[5] = df.value_counts([0, 1]) does not work.
I have found a workaround: I created three Pandas Series, one per occurrence type (pair, hyponym, hypernym), and wrote a small loop to calculate a confidence score for every pair. Because the original dataset is huge (more than 7 million records), this is not efficient (the calculation had not finished after 30 hours). A feasible and hopefully efficient solution would be to use the Pandas applymap() for this purpose, but that requires attaching columns containing the occurrence counts to the original DataFrame. So I would like a DataFrame like this one:
0 1 2 3 4 5 6 7
sql query anything anything a phrase 1000 800 500
sql query anything anything anotherphrase 1000 800 500
...
Column 5 is the number of occurrences of the hyponym part (sql), column 6 is the number of occurrences of the hypernym part (query), and column 7 is the number of occurrences of the pair (sql, query). As you can see, the pairs are the same but they are extracted from different phrases.
My question is how to do that? How can I attach occurrences as a new column to an existing DataFrame?
Here's a solution on how to map the value counts of the combination of two columns to a new column:
import pandas as pd
# Create an example DataFrame
df = pd.DataFrame({0: ["a", "a", "a", "b"], 1: ["c", "d", "d", "d"]})
# Count the paired occurrences in a new column
df["count"] = df.groupby([0,1])[0].transform('size')
Before editing, I had answered this question with a solution using value_counts and a merge. This original solution is slower and more complicated than the groupby:
# Put the value_counts in a new DataFrame, call them count
vcdf = pd.DataFrame(df[[0, 1]].value_counts(), columns=["count"])
# Merge the df with the vcs
merged = pd.merge(left=df, right=vcdf, left_on=[0, 1], right_index=True)
# Potentially sort index
merged = merged.sort_index()
The resulting DataFrame:
0 1 count
0 a c 1
1 a d 2
2 a d 2
3 b d 1
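The question also asks for the per-column counts (columns 5, 6 and 7). A sketch of how the same transform('size') idea might be extended to all three, using the small example frame from this answer rather than the real 7-million-row data:

```python
import pandas as pd

# Same toy frame as in the answer above
df = pd.DataFrame({0: ["a", "a", "a", "b"], 1: ["c", "d", "d", "d"]})

# transform('size') returns one value per original row, so it can be assigned directly
df[5] = df.groupby(0)[0].transform("size")       # occurrences of the value in column 0
df[6] = df.groupby(1)[1].transform("size")       # occurrences of the value in column 1
df[7] = df.groupby([0, 1])[0].transform("size")  # occurrences of the (0, 1) pair
print(df)
```

Because every result is aligned with the original index, no merge is needed, and a confidence score can then be computed with ordinary column arithmetic instead of a loop.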
I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, in reality, my real dataframe 2 has 5000 rows. So I cannot manually copy paste all of this. But basically I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([ y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but it gets the job done by iterating over both the 'TEXT' column and the 'INFO'. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
Updated to take into account string contains instead of exact match...
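If the nested iteration turns out to be too slow, one possible vectorized variant is to join all INFO values into a single regular expression and call str.contains once. This is only a sketch: the frames are rebuilt from the question's example, and it assumes the INFO values should be matched literally (hence re.escape):

```python
import re
import pandas as pd

# Example frames rebuilt from the question
dataframe1 = pd.DataFrame({
    "ITEM ID": [1, 2, 3, 4],
    "TEXT": ["some random words", "another word", "blah", "random words"],
})
dataframe2 = pd.DataFrame({"INDEX": [1, 3], "INFO": ["random", "blah"]})

# Build one alternation pattern; re.escape keeps special characters literal
pattern = "|".join(re.escape(s) for s in dataframe2["INFO"])

# True/False per row, converted to 1/0
dataframe1["MATCH"] = dataframe1["TEXT"].str.contains(pattern, regex=True).astype(int)
print(dataframe1)
```

With 5000 patterns the regex gets long, so it is worth timing both approaches on the real data.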
You could get the unique values from the column in the first dataframe, convert them to a list, and then use the eval method on the second dataframe with Column.str.contains on that list.
unique = df1['TEXT'].unique().tolist()
df2.eval("Match=Text.str.contains('|'.join(@unique))")
I'm working with the Bureau of Labor Statistics data which looks like this:
series_id year period value
CES0000000001 2006 M01 135446.0
Characters 3 and 4 of series_id (series_id[3] and series_id[4]) indicate the supersector. For example, CES10xxxxxx01 would be Mining & Logging. There are 15 supersectors that I'm concerned with, so I want to create 15 separate data frames, one per supersector, to perform time series analysis. So I'm trying to access each value as a list to achieve something like:
# *pseudocode*:
mining_and_logging = df[df.series_id[3]==1 and df.series_id[4]==0]
Can I avoid writing a for loop where I convert each value to a list then access by index and add the row to the new dataframe?
How can I achieve this?
One way to do what you want and store the dataframes one by one through a for loop could be:
First, create an auxiliary column to make your life easier:
df['id'] = df['series_id'].str[3:5]  # Extract characters 3 and 4 of every string (counting from zero)
Then, you create an empty dictionary and populate it:
dict_df = {}
for unique_id in df.id.unique():
    dict_df[unique_id] = df[df.id == unique_id]
Now you'll have a dictionary with 15 dataframes inside. For example, if you want to call the dataframe associated with id = 01, you just do:
dict_df['01']
Hope it helps !
Solved it by combining answers from Juan C and G. Anderson.
Select the 3rd and 4th character:
df['id'] = df.series_id.str.slice(start=3, stop=5)
And then the following to create dataframes:
dict_df = {}
for unique_id in df.id.unique():
    dict_df[unique_id] = df[df.id == unique_id]
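The same dictionary can probably be built in one step by grouping on the sliced characters, which avoids filtering the whole frame once per supersector. A sketch, where only the first row comes from the question and the other rows are made up for illustration:

```python
import pandas as pd

# First row is from the question; the remaining rows are invented for illustration
df = pd.DataFrame({
    "series_id": ["CES0000000001", "CES1000000001", "CES1000000001", "CES2000000001"],
    "year": [2006, 2006, 2007, 2006],
    "period": ["M01", "M01", "M01", "M01"],
    "value": [135446.0, 1.0, 2.0, 3.0],
})

# Characters 3 and 4 (counting from zero) identify the supersector
supersector = df["series_id"].str[3:5]

# Iterating over a groupby yields (key, sub-dataframe) pairs
dict_df = {code: group for code, group in df.groupby(supersector)}
print(dict_df["10"])
```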
I have a pandas DataFrame with columns patient_id, patient_sex, patient_dob (and other less relevant columns). Rows can have duplicate patient_ids, as each patient may have more than one entry in the data for multiple medical procedures. I discovered, however, that a great many of the patient_ids are overloaded, i.e. more than one patient has been assigned to the same id (evidenced by many instances of a single patient_id being associated with multiple sexes and multiple days of birth).
To refactor the ids so that each patient has a unique one, my plan was to group the data not only by patient_id, but by patient_sex and patient_dob as well. I figure this must be sufficient to separate the data into individual patients (and if two patients with the same sex and dob just happened to be assigned the same id, then so be it).
Here is the code I currently use:
# I just use first() here as a way to aggregate the groups into a DataFrame.
# Bonus points if you have a better solution!
indv_patients = patients.groupby(['patient_id', 'patient_sex', 'patient_dob']).first()
# Create unique ids
new_patient_id = 'new_patient_id'
for index, row in indv_patients.iterrows():
    # index is a tuple of the three column values, so this should get me a unique
    # patient id for each patient
    indv_patients.loc[index, new_patient_id] = str(hash(index))
# Merge new ids into original patients frame
patients_with_new_ids = patients.merge(indv_patients, left_on=['patient_id', 'patient_sex', 'patient_dob'], right_index=True)
# Remove byproduct columns, and original id column
drop_columns = [col for col in patients_with_new_ids.columns if col not in patients.columns or col == new_patient_id]
drop_columns.append('patient_id')
patients_with_new_ids = patients_with_new_ids.drop(columns=drop_columns)
patients = patients_with_new_ids.rename(columns={new_patient_id : 'patient_id'})
The problem is that with over 7 million patients, this solution is far too slow, the biggest bottleneck being the for-loop. So my question is: is there a better way to fix these overloaded ids? (The actual id doesn't matter, so long as it's unique for each patient.)
I don't know what the values for the columns are but have you tried something like this?
patients['new_patient_id'] = patients.apply(lambda x: x['patient_id'] + x['patient_sex'] + x['patient_dob'],axis=1)
This should create a new column and you can then use groupby with the new_patient_id
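Another option that might be faster than a row-wise apply is GroupBy.ngroup(), which numbers each distinct combination of the key columns. A sketch with made-up patient rows, assuming the three key columns contain no missing values (rows with NaN keys are assigned -1 by default):

```python
import pandas as pd

# Made-up rows: id 1 is "overloaded" (shared by two different patients)
patients = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "patient_sex": ["F", "F", "M", "M"],
    "patient_dob": ["1980-01-01", "1980-01-01", "1975-05-05", "1990-09-09"],
    "procedure": ["a", "b", "c", "d"],
})

keys = ["patient_id", "patient_sex", "patient_dob"]

# One integer per distinct (id, sex, dob) combination, aligned with the original rows
patients["new_patient_id"] = patients.groupby(keys, sort=False).ngroup()
print(patients)
```

Since the actual id value does not matter, the integer group number can replace the old patient_id directly, with no loop, hash, or merge.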
I have a large dataset consisting of thousands of groups. For each group, I am trying to do a linear regression and find values for 6 different parameters. Since it is a large dataset, I worked with only one group to begin with.
I read it in as a pandas dataframe and made a subset dataframe containing only one group.
#EXTRACTING STARTING PARAMETER VALUES for just 1 group
a = code to calculate parameter & returns a singular number...
b = ...
c = ...
d = ...
e = ...
f = ...
I found its parameters and added 6 new columns to store the values.
I used:
df = df.assign(log_B0=a, E=b, Eh=c, El=d, Th=e, Tl=f)
to create the new columns whilst storing the value which is repeated down the column for that particular group.
I am using a loop to calculate the parameters for each group using:
for i, g in df.groupby('uniqueID'):
But I am having trouble appending the output parameter values for each group to the original dataframe.
I think I need to use:
g.assign(log_B0=...)
to append the parameter values for each group to the columns.
But that only saves it for the last group and I also don't want to keep adding a new column header.
Do I need to increment?
I want the output like this:
parameter values log_B0, E..etc for the two groups 1 and 2
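One way this could be approached, as a sketch only: compute the parameters once per group into a small table indexed by uniqueID, then join that table back onto the original rows. The fit_group function, the x and y columns, and the two output names below are hypothetical stand-ins for the real six-parameter calculation:

```python
import numpy as np
import pandas as pd

# Made-up data: two groups with an x and a y column (hypothetical column names)
df = pd.DataFrame({
    "uniqueID": [1, 1, 1, 2, 2, 2],
    "x": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "y": [1.0, 3.0, 5.0, 0.0, -1.0, -2.0],
})

def fit_group(g):
    # Hypothetical stand-in for the real fit: a straight-line least-squares fit
    slope, intercept = np.polyfit(g["x"], g["y"], 1)
    return pd.Series({"slope": slope, "intercept": intercept})

# One row of parameters per group, indexed by uniqueID
params = df.groupby("uniqueID")[["x", "y"]].apply(fit_group)

# Repeat each group's parameters down its rows in the original frame
df = df.join(params, on="uniqueID")
print(df)
```

This creates the columns once and fills them for every group, so nothing is overwritten by the last group and no manual incrementing is needed.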