I have the following dataframe consisting of a UserId and the Name of the badge earned by that person on Stack Overflow. Each badge belongs to a particular category: Question, Answer, Participation, Moderation, or Tag. I want to create a column called Category to store the category of each badge.
The code I have written works well when the data has fewer than 1M users, but for anything larger it just keeps running. How can I fix this?
Dataframe (badges)
UserId | Name
1 | Altruist
2 | Autobiographer
3 | Enlightened
4 | Citizen Patrol
5 | python
Code
def category(df):
    questionCategory = ['Altruist', 'Benefactor', 'Curious', 'Inquisitive', 'Socratic', 'Favorite Question', 'Stellar Question', 'Investor', 'Nice Question', 'Good Question', 'Great Question', 'Popular Question', 'Notable Question', 'Famous Question', 'Promoter', 'Scholar', 'Student']
    answerCategory = ['Enlightened', 'Explainer', 'Refiner', 'Illuminator', 'Generalist', 'Guru', 'Lifejacket', 'Lifeboat', 'Nice Answer', 'Good Answer', 'Great Answer', 'Populist', 'Revival', 'Necromancer', 'Self-Learner', 'Teacher', 'Tenacious', 'Unsung Hero']
    participationCategory = ['Autobiographer', 'Caucus', 'Constituent', 'Commentator', 'Pundit', 'Enthusiast', 'Fanatic', 'Mortarboard', 'Epic', 'Legendary', 'Precognitive', 'Beta', 'Quorum', 'Convention', 'Talkative', 'Outspoken', 'Yearling']
    moderationCategory = ['Citizen Patrol', 'Deputy', 'Marshal', 'Civic Duty', 'Cleanup', 'Constable', 'Sheriff', 'Critic', 'Custodian', 'Reviewer', 'Steward', 'Disciplined', 'Editor', 'Strunk & White', 'Copy Editor', 'Electorate', 'Excavator', 'Archaelogist', 'Organizer', 'Peer Pressure', 'Proofreader', 'Sportsmanship', 'Suffrage', 'Supporter', 'Synonymizer', 'Tag Editor', 'Research Assistant', 'Taxonomist', 'Vox Populi']
    # Tag category will be represented as 0
    df['Category'] = 0
    for i in range(len(df)):
        if df.loc[i, "Name"] in questionCategory:
            df.loc[i, 'Category'] = 1
        elif df.loc[i, "Name"] in answerCategory:
            df.loc[i, 'Category'] = 2
        elif df.loc[i, "Name"] in participationCategory:
            df.loc[i, 'Category'] = 3
        elif df.loc[i, "Name"] in moderationCategory:
            df.loc[i, 'Category'] = 4
    return df
category(stackoverflow_badges)
Expected Output
UserId | Name | Category
1 | Altruist | 1
2 | Autobiographer | 3
3 | Enlightened | 2
4 | Citizen Patrol | 4
5 | python | 0
If you want to update a dataframe with more than 1M rows, then you definitely want to avoid for loops wherever possible. There is an easier way to update your 'Category' column.
In your case, you just need to convert your 4 lists of badge names into a dictionary mapping each badge name to its numerical category:
category_dict = {
    **{key: 1 for key in questionCategory},
    **{key: 2 for key in answerCategory},
    **{key: 3 for key in participationCategory},
    **{key: 4 for key in moderationCategory},
}
Then you can replace the entire for loop with this single line:
df['Category'] = df['Name'].map(category_dict).fillna(0).astype(int)
Unmatched names (the Tag badges) fall back to 0, and astype(int) keeps the column integer, since fillna(0) otherwise leaves it as float. This may not solve your whole issue, but it should at least save a good amount of time.
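Putting it together on the question's sample frame, a minimal self-contained sketch (the category lists are abbreviated here for space):

import pandas as pd

questionCategory = ['Altruist', 'Student']           # abbreviated
answerCategory = ['Enlightened', 'Teacher']          # abbreviated
participationCategory = ['Autobiographer']           # abbreviated
moderationCategory = ['Citizen Patrol', 'Deputy']    # abbreviated

category_dict = {
    **{key: 1 for key in questionCategory},
    **{key: 2 for key in answerCategory},
    **{key: 3 for key in participationCategory},
    **{key: 4 for key in moderationCategory},
}

badges = pd.DataFrame({
    'UserId': [1, 2, 3, 4, 5],
    'Name': ['Altruist', 'Autobiographer', 'Enlightened', 'Citizen Patrol', 'python']})
badges['Category'] = badges['Name'].map(category_dict).fillna(0).astype(int)
print(badges)
#    UserId            Name  Category
# 0       1        Altruist         1
# 1       2  Autobiographer         3
# 2       3     Enlightened         2
# 3       4  Citizen Patrol         4
# 4       5          python         0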
Related
I'm looking to increase the speed of the nested for loops.
VARIABLES:
'dataframe' - The dataframe I am attempting to modify in the second for loop. It consists of a multitude of training sessions for the same people. This is the attendance document that is changed if a match exists in the reporting dataframe.
'dictNewNames' - This is a dictionary of session title names. The key is the longer session title name and the value is a stripped session title name. For example {'Week 1: Training': 'Training'} etc. The key is equal to the 'Session Title' column in each row but the value is used for searching a substring in the second for loop.
'reporting' - A dataframe that includes information regarding session titles and attendance participation. The reporting dataframe is already filtered so everyone in the 'reporting' dataframe should get credit in 'dataframe'. The only caveat is that the 'search' name is nested within the pathway title.
dataframe = {
    'Session Title': ['Organization Week 1: Train', 'Organization Week 2: Train', 'Organization Week 3: Train'],
    'Attendee Email': ['name#gmail.com', 'name2#gmail.com', 'name3#gmail.com'],
    'Completed': ['No', 'No', 'No'],
    'Date Completed': ['', '', '']}

dictNewNames = {'Organization Week 1: Train': 'Train', 'Organization Week 2: Train': 'Train', 'Organization Week 3: Train': 'Train'}
Note that title formatting is not consistent (e.g. ':' vs '-' as seen in the pathway titles below); the data is completely all over the place in terms of format.
reporting = {
    'Pathway Title': ['Training 1 - Train', 'Training 2: Train', 'Training 3 - Train'],
    'Email': ['name#gmail.com', 'name2#gmail.com', 'name3#gmail.com'],
    'Date Completed': ['xx/yy/xx', 'yy/xx/zz', 'zz/xx/yy']}

expectedOutput = {
    'Session Title': ['Organization Week 1: Train', 'Organization Week 2: Train', 'Organization Week 3: Train'],
    'Attendee Email': ['name#gmail.com', 'name2#gmail.com', 'name3#gmail.com'],
    'Completed': ['Yes', 'Yes', 'Yes'],
    'Date Completed': ['xx/yy/xx', 'yy/xx/zz', 'zz/xx/yy']}
My code:
def giveCredit(dataframe, dictNewNames, reporting):
    for index, row in dataframe.iterrows():
        temp = row['Session Title']
        searchName = dictNewNames[temp]
        attendeeEmail = row['Attendee Email']
        for index1, row1 in reporting.iterrows():
            pathwayTitle = row1['Pathway Title']
            Email = row1['Email']
            dateCompleted = row1['Date Completed']
            if attendeeEmail == Email and searchName in pathwayTitle:
                dataframe.at[index, 'Completed'] = 'Yes'
                dataframe.at[index, 'Date Completed'] = dateCompleted
                break
    return dataframe
Your pattern looks like a merge:

for loop1 on first dataframe:
    for loop2 on second dataframe:
        if conditions match between both dataframes:
            ...

So:
import numpy as np
import pandas as pd

# Create a common key Name based on dictNewNames
pat = fr"({'|'.join(dictNewNames.values())})"
name1 = dataframe['Session Title'].map(dictNewNames)
name2 = reporting['Pathway Title'].str.extract(pat, expand=False)

# Merge dataframes based on this key and email
out = pd.merge(dataframe.assign(Name=name1),
               reporting.assign(Name=name2),
               left_on=['Name', 'Attendee Email'],
               right_on=['Name', 'Email'],
               how='left', suffixes=(None, '_'))

# Update the dataframe
out['Date Completed'] = out.pop('Date Completed_')
out['Completed'] = np.where(out['Date Completed'].notna(), 'Yes', 'No')
out = out[dataframe.columns]
Output:

>>> out
                Session Title   Attendee Email Completed Date Completed
0  Organization Week 1: Train   name#gmail.com       Yes       xx/yy/xx
1  Organization Week 2: Train  name2#gmail.com       Yes       yy/xx/zz
2  Organization Week 3: Train  name3#gmail.com       Yes       zz/xx/yy
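To try the snippet above end to end, the question's sample dicts can first be turned into DataFrames (a minimal sketch; the column names are taken directly from the sample data):

import pandas as pd

dataframe = pd.DataFrame({
    'Session Title': ['Organization Week 1: Train', 'Organization Week 2: Train', 'Organization Week 3: Train'],
    'Attendee Email': ['name#gmail.com', 'name2#gmail.com', 'name3#gmail.com'],
    'Completed': ['No', 'No', 'No'],
    'Date Completed': ['', '', '']})

reporting = pd.DataFrame({
    'Pathway Title': ['Training 1 - Train', 'Training 2: Train', 'Training 3 - Train'],
    'Email': ['name#gmail.com', 'name2#gmail.com', 'name3#gmail.com'],
    'Date Completed': ['xx/yy/xx', 'yy/xx/zz', 'zz/xx/yy']})

dictNewNames = {('Organization Week %i: Train' % i): 'Train' for i in (1, 2, 3)}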
This workaround cut my execution time from 460 seconds to under 10.
import re
import pandas as pd

def giveCredit(dataframe, dictNewNames, reporting):
    reporting['Date Completed'] = pd.to_datetime(reporting['Date Completed'])
    for index1, row in dataframe.iterrows():
        temp = row['Session Title']
        numberList = re.findall('[0-9]+', temp)
        finalNumber = str(numberList[0])
        searchName = dictNewNames[temp]
        attendeeEmail = row['Attendee Email']
        row = reporting.loc[(reporting['Pathway Title'].str.contains(searchName, case=False)) &
                            (reporting['Email'] == attendeeEmail)]
        if len(row.index) != 0:
            new_row = row.loc[reporting['Pathway Title'].str.contains(finalNumber, case=False)]
            if len(new_row.index) != 0:
                dataframe = modifyFrame(dataframe, new_row, index1)
            else:
                dataframe = modifyFrame(dataframe, row, index1)
    dataframe = dataframe.sort_values(["Completed", "Attendee Email"], ascending=[False, True])
    return dataframe

def modifyFrame(frame, row, index1):
    dateCompleted = row['Date Completed']
    dateCompleted = dateCompleted.to_string(header=False, index=False).strip()
    frame.at[index1, 'Completed'] = 'Yes'
    frame.at[index1, 'Date Completed'] = dateCompleted
    return frame
I'm dealing with a nested JSON in order to extract data about transactions from my database using pandas.
My JSON can have one of these contents:
{"Data":{"Parties":[{"ID":"JackyID","Role":12}],"NbIDs":1}} #One party identified
{"Data":{"Parties":[{"ID":"JackyID","Role":12},{"ID":"SamNumber","Role":10}],"NbIDs":2}} #Two Parties identified
{"Data":{"Parties":[],"NbIDs":0}} #No parties identified
{"Data": None} #No data
When looking to extract the values of ID (the ID of the party, a string) and Role (an int, where Role=12 refers to buyers and Role=10 to sellers) and write them to a pandas dataframe, I'm using the following code:
import json
import pandas as pd

for i, row in df.iterrows():
    json_data = json.dumps(row['Data'])
    data_json = json.loads(json_data)
    df['ID'] = pd.json_normalize(data_json, ['Data', 'Parties'])['ID']
    df['Role'] = pd.json_normalize(data_json, ['Data', 'Parties'])['Role']
Now when trying to check its values and give every Role its corresponding ID:

for i, row in df.iterrows():
    if row['Role'] == 12:
        df.at[i, 'Buyer'] = df.at[i, 'ID']
    elif row['Role'] == 10:
        df.at[i, 'Seller'] = df.at[i, 'ID']

df = df[['Buyer', 'Seller']]
The expected df result for the given scenario should be as below:
{"Data":{"Parties":[{"ID":"JackyID","Role":12}],"NbIDs":1}} #Transaction 1
{"Data":{"Parties":[{"ID":"JackyID","Role":12},{"ID":"SamNumber","Role":10}],"NbIDs":2}} #Transaction 2
{"Data":{"Parties":[],"NbIDs":0}} #Transaction 3
{"Data": None} #Transaction 4
>>> print(df)
Buyer | Seller
------------------
JackyID| #Transaction 1 we have info about the buyer
JackyID| SamNumber #Transaction 2 we have infos about the buyer and the seller
| #Transaction 3 we don't have any infos about the parties
| #Transaction 4 we don't have any infos about the parties
What is the correct way to do so?
You can treat case 4, where there is no Data, as a special case of empty Parties (here data is the parsed JSON of a single transaction):
df = pd.DataFrame(data['Data']['Parties'] if data['Data'] else [], columns=['ID', 'Role'])
df['Role'] = df['Role'].map({10: 'Seller', 12: 'Buyer'})
Then add possible missing values for Role
df = df.set_index('Role').reindex(['Seller', 'Buyer'], fill_value=pd.NA).T
print(df)
# Case 1
Role Seller Buyer
ID <NA> JackyID
# Case 2
Role Seller Buyer
ID SamNumber JackyID
# Case 3
Role Seller Buyer
ID <NA> <NA>
# Case 4
Role Seller Buyer
ID <NA> <NA>
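To cover a whole column of transactions rather than a single record, the same Role mapping can be applied per row and the results stacked. A minimal sketch, assuming each cell of df['Data'] already holds the full parsed object (matching the record path ['Data', 'Parties'] used in the question's code):

import pandas as pd

def parties_to_row(obj):
    # One transaction -> one {'Buyer': ..., 'Seller': ...} row.
    # Handles both empty Parties (case 3) and "Data": None (case 4).
    parties = obj['Data']['Parties'] if obj and obj.get('Data') else []
    row = {'Buyer': pd.NA, 'Seller': pd.NA}
    for party in parties:
        if party['Role'] == 12:
            row['Buyer'] = party['ID']
        elif party['Role'] == 10:
            row['Seller'] = party['ID']
    return row

out = pd.DataFrame([parties_to_row(obj) for obj in df['Data']], index=df.index)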
I am trying to extract a list from SharePoint. The thing is that if the column type is "Person or Group", Python shows me a KeyError, but for other column types I can get the value.
This is my code to get the values:

print("Item title: {0}, Id: {1}".format(item.properties["Title"], item.properties['AnalystName']))

And Title works but AnalystName does not; both are the internal names in SharePoint.
authcookie = Office365('https://xxxxxxxxx.sharepoint.com', username='xxxxxxxxx', password='xxxxxxxxx').GetCookies()
site = Site('https://xxxxxxxxxxxx.sharepoint.com/sites/qualityassuranceteam', authcookie=authcookie)
new_list = site.List('Process Review - Customer Service Opt In/Opt Out')
query = {'Where': [('Gt', 'Audit Date', '2020-02-16')]}
sp_data = new_list.GetListItems(fields=['App ID', 'Analyst Name', 'Team Member Name', "Team Member's Supervisor Name",
                                        'Audit Date', 'Event Date (E.g. Call date)', 'Product Type', 'Master Contact Id',
                                        'Location', 'Team member read the disclosure?', 'Team member withheld the disclosure?',
                                        'Did the team member take the correct action?', 'Did the team member notate the account?',
                                        'Did the team member add the correct phone number?', 'Comment (Required)',
                                        'Modified'], query=query)
#print(sp_data[0])

final_file = ''  # accumulate the output lines in a string
num = 0
for k in sp_data:
    values = sp_data[num].values()
    val = "|".join(str(v).replace('None', 'null') for v in values) + '\n'
    num += 1
    final_file += val

file_name = 'test.txt'
with open(file_name, 'a', encoding='utf-8') as file:
    file.write(final_file)
So right now I'm getting what I want, but there is a problem: when a column is empty it skips the column instead of leaving an empty space. For example:
col-1 | col-2 | col-3 |
HI | 10 | 8 |
Hello | | 7 |
So in this table row 1 is full, so it will bring me everything as:
HI|10|8
but the second row brings me
Hello|7
and I need Hello||7
Person fields are parsed with different names from other items. For example, UserName gets changed to UserNameId and UserNameString.
That is the reason for the KeyError: the items list does not contain a key with the original name.
Use the code below to get the person field values:
# Python code
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext

site_url = "enter sharepoint url"
sp_list = "enter list name"
ctx = ClientContext(site_url).with_credentials(UserCredential("username", "password"))
tasks_list = ctx.web.lists.get_by_title(sp_list)
items = tasks_list.items.get().select(["*", "UserName/Id", "UserName/Title"]).expand(["UserName"]).execute_query()
for item in items:  # type: ListItem
    print("{0}".format(item.properties.get('UserName').get("Title")))
I have an ORM model like this:

from django.db import models

class MyObject(models.Model):
    class Meta:
        db_table = 'myobject'

    id = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=48)
    status = models.CharField(max_length=48)
Imagine I have the following entries
1 | foo | completed
2 | foo | completed
3 | bar | completed
4 | foo | failed
What is the Django ORM query that I have to make in order to get a queryset somewhat like the following?
[{'name': 'foo', 'status_count': 'completed: 2, failed: 1'},
{'name': 'bar', 'status_count': 'completed: 1'}]
I started with the following but I don't know how to "merge" the two columns:
from django.db.models import Count

models.MyObject.objects.values(
    'name',
    'status'
).annotate(my_count=Count('id'))
The goal of all this to get a table where I can show something like the following:
Name | completed | failed
foo | 2 | 1
bar | 1 | 0
This should work as expected:
from django.db.models import Case, Count, DecimalField, When

test = MyObject.objects.values('name').annotate(
    total_completed=Count(
        Case(
            When(status='completed', then=1),
            output_field=DecimalField(),
        )
    ),
    total_failed=Count(
        Case(
            When(status='failed', then=1),
            output_field=DecimalField(),
        )
    ),
)
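With the question's four sample rows, iterating the queryset should then yield one dict per name with both counts (a hypothetical check; values are what the sample data implies):

for row in test:
    print(row)
# {'name': 'foo', 'total_completed': 2, 'total_failed': 1}
# {'name': 'bar', 'total_completed': 1, 'total_failed': 0}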
You need to include an order_by() at the end of your query to group the like items together.
Something like this should work:

from django.db.models import Count

models.MyObject.objects.values(
    'name',
    'status'
).annotate(my_count=Count('id')).order_by()
See https://docs.djangoproject.com/en/1.11/topics/db/aggregation/#interaction-with-default-ordering-or-order-by for details.
EDIT: Sorry, I realize this doesn't answer the question about merging the columns... I don't think you can actually do it in a single query, although you can then loop through the results pretty easily and build your output table, as sketched below.
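A small sketch of that post-processing loop, assuming qs is the values().annotate().order_by() queryset from above:

from collections import defaultdict

# Pivot the (name, status, my_count) rows into one row per name.
table = defaultdict(lambda: {'completed': 0, 'failed': 0})
for row in qs:
    table[row['name']][row['status']] = row['my_count']
# table -> {'foo': {'completed': 2, 'failed': 1}, 'bar': {'completed': 1, 'failed': 0}}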
Background: I have a dataframe with individuals' names and addresses. I'm trying to catalog people associated with each person in my dataframe, so I'm running each row/record in the dataframe through an external API that returns a list of people associated with the individual. The idea is to write a series of functions that calls the API, returns the list of relatives, and appends each name in the list to a distinct column in the original dataframe. The code will eventually be parallelized.
The dataframe:
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Kyle', 'Ted', 'Mary', 'Ron'],
    'last_name': ['Smith', 'Jones', 'Johnson', 'Reagan'],
    'address': ['123 Main Street', '456 Maple Street', '987 Tudor Place', '1600 Pennsylvania Avenue']},
    columns=['first_name', 'last_name', 'address'])
The first function, which calls the API and returns a list of names:
import requests
import json
import numpy as np
from multiprocessing import Pool
def API_call(row):
    api_key = '123samplekey'
    first_name = str(row['First_Name'])
    last_name = str(row['Last_Name'])
    address = str(row['Street_Address'])
    url = ('https://apiaddress.com/?first_name=' + first_name +
           '&last_name=' + last_name + '&address=' + address +
           '&api_key=' + api_key)
    response = requests.get(url)
    JSON = response.json()
    name_list = []
    for person in JSON['people']:
        name = person.get('name')
        name_list.append(name)
    return name_list
This function works well. For each person in the dataframe, a list of family/friends is returned. So, for Kyle Smith, the function returns [Heather Smith, Dan Smith], for Ted Jones the function returns [Al Jones, Karen Jones, Tiffany Jones, Natalie Jones], and so on for each row/record in the dataframe.
Problem: I'm struggling to write a subsequent function that will iterate through the returned list and append each value to a unique column that corresponds to the searched name in the dataframe. I want the function to return a database that looks like this:
First_Name | Last_Name | Street_Address | relative1_name | relative2_name | relative3_name | relative4_name
-----------------------------------------------------------------------------------------------------------------------------
Kyle | Smith | 123 Main Street | Heather Smith | Dan Smith | |
Ted | Jones | 456 Maple Street | Al Jones | Karen Jones | Tiffany Jones | Natalie Jones
Mary | Johnson | 987 Tudor Place | Kevin Johnson | | |
Ron | Reagan | 1600 Pennsylvania Avenue | Nancy Reagan | Patti Davis | Michael Reagan | Christine Reagan
NOTE: The goal is to vectorize everything, so that I can use the apply method and eventually run the whole thing in parallel. Something along the lines of the following code has worked for me in the past, when the "API_call" function was returning a single object instead of a list that needed to be iterated/mapped:
def API_call(row):
    # all API parameters
    url = 'https://api.com/parameters'
    response = requests.get(url)
    JSON = response.json()
    single_object = JSON['key1']['key2'].get('key3')
    return single_object

def second_function(data):
    data['single_object'] = data.apply(API_call, axis=1)
    return data

def parallelize(dataframe, function):
    df_splits = np.array_split(dataframe, 10)
    pool = Pool(4)
    df_whole = pd.concat(pool.map(function, df_splits))
    pool.close()
    pool.join()
    return df_whole

parallelize(df, second_function)
The problem is I just can't write a vectorizable function (second_function) that maps names from the list returned by the API to unique columns in the original dataframe. Thanks in advance for any help!
import pandas as pd

def make_relatives_frame(relatives):
    return pd.DataFrame(data=[relatives],
                        columns=["relative%i_name" % x for x in range(1, len(relatives) + 1)])

# example output from an API call
df_names = pd.DataFrame(data=[["Kyle", "Smith"]], columns=["First_Name", "Last_Name"])
relatives = ["Heather Smith", "Dan Smith"]
df_relatives = make_relatives_frame(relatives)
df_names[df_relatives.columns] = df_relatives

# example output from another API call with more relatives
df_names2 = pd.DataFrame(data=[["John", "Smith"]], columns=["First_Name", "Last_Name"])
relatives2 = ["Heath Smith", "Daryl Smith", "Scott Smith"]
df_relatives2 = make_relatives_frame(relatives2)
df_names2[df_relatives2.columns] = df_relatives2

# example of stacking the outputs (DataFrame.append was removed in pandas 2.0)
total_df = pd.concat([df_names, df_names2], ignore_index=True)
print(total_df)
The above code should get you started. Obviously it is just a representative example, but you should be able to refactor it to fit your specific use case.
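To connect this to the apply/parallelize pattern from the question, a second_function along these lines should work (a sketch under the stated assumptions: API_call returns a plain list per row, and pd.DataFrame pads shorter lists with NaN):

import pandas as pd

def second_function(data):
    # One list of relatives per row; shorter lists get NaN padding.
    relatives_lists = data.apply(API_call, axis=1)
    relatives_df = pd.DataFrame(relatives_lists.tolist(), index=data.index)
    relatives_df.columns = ["relative%i_name" % (i + 1) for i in range(relatives_df.shape[1])]
    return pd.concat([data, relatives_df], axis=1)

When parallelize() concatenates the splits, any split with fewer relative columns is padded with NaN automatically, since pd.concat aligns columns by name.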