Tallying number of times certain strings occur in Python

I'm working on a database of incidents affecting different sectors in different countries and want to create a table tallying the breakdown of incidents for each country.
The database currently looks like this:
Incident Name | Country Affected | Sector Affected
incident_1 | US,TW,CN | Engineering,Media
incident_2 | FR,RU,CN | Government
etc., etc.
My aim is to build a table which looks like this:
Country | Engineering | Media | Government
CN | 3 | 0 | 5
etc.
Right now my method is basically to use an if statement to check whether the country column contains a specific string (for example 'CN'), and if this returns True, to run Counter from collections to create a dictionary of the initial tally, then save this.
My issue is how to scale this up to a level where it can be run across the entire database, AND how to actually save the dictionary produced by Counter.
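For context, a rough sketch of the approach described above (illustrative only, using the column names from the sample table):
from collections import Counter

mask = df['Country Affected'].str.contains('CN')   # rows mentioning CN
tally = Counter()
for sectors in df.loc[mask, 'Sector Affected']:
    tally.update(sectors.split(','))               # tally each affected sector
# tally is now something like Counter({'Engineering': 1, 'Media': 1, 'Government': 1})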

Use pd.Series.str.get_dummies and pd.DataFrame.dot:
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
Engineering Government Media
CN 1 1 1
FR 0 1 0
RU 0 1 0
TW 1 0 1
US 1 0 1
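The reason this works: c.T is a country-by-incident 0/1 matrix and s is an incident-by-sector 0/1 matrix, so each entry of c.T.dot(s) counts the incidents in which that country and sector appear together. One cell can be sanity-checked directly (illustrative):
# number of incidents mentioning both CN and Government, computed by hand
((c['CN'] == 1) & (s['Government'] == 1)).sum()   # equals c.T.dot(s).loc['CN', 'Government']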
Bigger example:
import numpy as np
import pandas as pd

np.random.seed([3,1415])
countries = ['CN', 'FR', 'RU', 'TW', 'US', 'UK', 'JP', 'AU', 'HK']
sectors = ['Engineering', 'Government', 'Media', 'Commodidty']

def pick_rnd(x):
    # pick a random non-empty subset of x and join it with commas
    i = np.random.randint(1, len(x))
    j = np.random.choice(x, i, False)
    return ','.join(j)

df = pd.DataFrame({
    'Country Affected': [pick_rnd(countries) for _ in range(10)],
    'Sector Affected': [pick_rnd(sectors) for _ in range(10)]
})

df
df
Country Affected Sector Affected
0 CN Government,Media
1 FR,TW,JP,US,UK,CN,RU,AU Commodidty,Government
2 HK,AU,JP Commodidty
3 RU,CN,FR,JP,UK Media,Commodidty,Engineering
4 CN,RU,FR,JP,TW,HK,US,UK Government,Media,Commodidty
5 FR,CN Commodidty
6 FR,HK,JP,TW,US,AU,CN Commodidty
7 CN,HK,RU,TW,UK,US,FR,JP Media,Commodidty
8 JP,UK,AU Engineering,Media
9 RU,UK,FR Media
Then
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
Commodidty Engineering Government Media
AU 3 1 1 1
CN 6 1 3 4
FR 6 1 2 4
HK 4 0 1 2
JP 6 2 2 4
RU 4 1 2 4
TW 4 0 2 2
UK 4 2 2 5
US 4 0 2 2
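To cover the saving part of the question: the result of c.T.dot(s) is an ordinary DataFrame, so it can be written out or converted without going through Counter (a minimal sketch; the filename is just an example):
tally = c.T.dot(s)
tally.to_csv('country_sector_tally.csv')     # persist as CSV
tally_dict = tally.to_dict(orient='index')   # or as a nested dict, e.g. {'CN': {'Engineering': 1, ...}}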


NaN issue after merge two tables

I tried to merge two tables on person_skills, but received a merged table that has a lot of NaN values.
I'm sure the second table has no duplicate values, and I've tried to rule out possible issues caused by datatype or NA values, but I still get the same wrong result.
Please help me and have a look at the following code.
Table 1
lst_col = 'person_skills'
skills = skills.assign(**{lst_col:skills[lst_col].str.split(',')})
skills = skills.explode(['person_skills'])
skills['person_id'] = skills['person_id'].astype(int)
skills['person_skills'] = skills['person_skills'].astype(str)
skills.head(10)
person_id person_skills
0 1 Talent Management
0 1 Human Resources
0 1 Performance Management
0 1 Leadership
0 1 Business Analysis
0 1 Policy
0 1 Talent Acquisition
0 1 Interviews
0 1 Employee Relations
Table 2
standard_skills = df["person_skills"].str.split(',', expand=True)
series1 = pd.Series(standard_skills[0])
standard_skills = series1.unique()
standard_skills= pd.DataFrame(standard_skills, columns = ["person_skills"])
standard_skills.insert(0, 'skill_id', range(1, 1 + len(standard_skills)))
standard_skills['skill_id'] = standard_skills['skill_id'].astype(int)
standard_skills['person_skills'] = standard_skills['person_skills'].astype(str)
standard_skills = standard_skills.drop_duplicates(subset='person_skills').reset_index(drop=True)
standard_skills = standard_skills.dropna(axis=0)
standard_skills.head(10)
skill_id person_skills
0 1 Talent Management
1 2 SEM
2 3 Proficient with Microsoft Windows: Word
3 4 Recruiting
4 5 Employee Benefits
5 6 PowerPoint
6 7 Marketing
7 8 nan
8 9 Human Resources (HR)
9 10 Event Planning
Merged table
combine_skill = skills.merge(standard_skills,on='person_skills', how='left')
combine_skill.head(10)
person_id person_skills skill_id
0 1 Talent Management 1.0
1 1 Human Resources NaN
2 1 Performance Management NaN
3 1 Leadership NaN
4 1 Business Analysis NaN
5 1 Policy NaN
6 1 Talent Acquisition NaN
7 1 Interviews NaN
8 1 Employee Relations NaN
9 1 Staff Development NaN
Please let me know where I made mistakes, thanks!
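No answer is included above, but two plausible causes can be checked (stated as assumptions, not a confirmed diagnosis): splitting on ',' leaves a leading space on every skill after the first in Table 1, and Table 2 is built from only the first split column, so skills that never appear first in a row (likely including 'Human Resources') are missing from it entirely. A minimal sketch addressing both:
# sketch only: strip stray whitespace and build the skill list from all skills, not just column 0
skills['person_skills'] = skills['person_skills'].str.strip()
all_skills = df['person_skills'].str.split(',', expand=True).stack().str.strip().unique()
standard_skills = pd.DataFrame({'person_skills': all_skills})
standard_skills.insert(0, 'skill_id', range(1, len(standard_skills) + 1))
combine_skill = skills.merge(standard_skills, on='person_skills', how='left')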

How to generate label by sparse cumcount

Here's my master dataset
Id Data Category Code
1 tey Airport AIR_02
2 fg Hospital HEA_04
3 dffs Airport AIR_01
4 dsfs Hospital HEA_03
5 fdsf Airport AIR_04
Here's the data I want to merge
Id Data Category
1 tetyer Airport
2 fgdss Hospital
3 dffsdsa Airport
4 dsfsas Hospital
5 fdsfada Airport
My Expected Output
Id Data Category Code
1 tey Airport AIR_02
2 fg Hospital HEA_04
3 dffs Airport AIR_01
4 dsfs Hospital HEA_03
5 fdsf Airport AIR_04
6 tetyer Airport AIR_03
7 fgdss Hospital HEA_01
8 dffsdsa Airport AIR_05
9 dsfsas Hospital HEA_02
10 fdsfada Airport AIR_06
Note:
HEA_01 is not available in the existing dataset. Every Hospital code starts with HEA_ and every Airport code starts with AIR_; the numbers (01, 02, etc.) are assigned by availability.
Use:
# split Code by _
df1[['a','b']] = df1['Code'].str.split('_', expand=True)
# convert the number part to integers
df1['b'] = df1['b'].astype(int)
# aggregate 'b' to a list and take the first 'a' value for mapping
df11 = df1.groupby(['Category']).agg({'a':'first', 'b':list})
# get the unused numbers with np.arange against the already-used values
def f(x):
    L = df11['b'][x.name]
    a = np.arange(1, len(x) + len(L) + 1)
    # difference, filtered to the same number of values as the length of the group
    return np.setdiff1d(a, L)[:len(x)]
df2['Code'] = df2.groupby('Category')['Category'].transform(f)
# create Code by joining the prefix and the zero-padded number
df2['Code'] = df2['Category'].map(df11['a']) + '_' + df2['Code'].astype(str).str.zfill(2)
print(df2)
Id Data Category Code
0 1 tetyer Airport AIR_03
1 2 fgdss Hospital HEA_01
2 3 dffsdsa Airport AIR_05
3 4 dsfsas Hospital HEA_02
4 5 fdsfada Airport AIR_06
df = pd.concat([df1.drop(['a', 'b'], axis=1), df2], ignore_index=True)
print(df)
Id Data Category Code
0 1 tey Airport AIR_02
1 2 fg Hospital HEA_04
2 3 dffs Airport AIR_01
3 4 dsfs Hospital HEA_03
4 5 fdsf Airport AIR_04
5 1 tetyer Airport AIR_03
6 2 fgdss Hospital HEA_01
7 3 dffsdsa Airport AIR_05
8 4 dsfsas Hospital HEA_02
9 5 fdsfada Airport AIR_06
To solve this, I would define a class to act as a code filler. The advantage of this approach is that you can then easily add more data without needing to recompute everything:
from itertools import count

class CodeFiller():
    def __init__(self, df, col='Code', maps=None):
        # set of already-used numbers per prefix, e.g. {'AIR': {'01', '02', '04'}, 'HEA': {'03', '04'}}
        codes = df[col].str.split('_', expand=True).groupby(0)[1].agg(set).to_dict()
        self.maps = maps
        self.gens = {prefix: self.code_gen(prefix, codes[prefix]) for prefix in codes}

    def code_gen(self, prefix, codes):
        # yield PREFIX_NN codes, skipping numbers that are already taken
        for i in count(1):
            num = f'{i:02}'
            if num not in codes:
                yield f'{prefix}_{num}'

    def __call__(self, prefix):
        if self.maps:
            prefix = self.maps[prefix]
        return next(self.gens[prefix])

refs = {'Airport': 'AIR', 'Hospital': 'HEA'}
filler = CodeFiller(df1, maps=refs)

df3 = pd.concat([df1, df2.assign(Code=df2['Category'].map(filler))], ignore_index=True)
output:
Id Data Category Code
0 1 tey Airport AIR_02
1 2 fg Hospital HEA_04
2 3 dffs Airport AIR_01
3 4 dsfs Hospital HEA_03
4 5 fdsf Airport AIR_04
5 1 tetyer Airport AIR_03
6 2 fgdss Hospital HEA_01
7 3 dffsdsa Airport AIR_05
8 4 dsfsas Hospital HEA_02
9 5 fdsfada Airport AIR_06
Now imagine you have more data coming, you can just continue (reusing df2 here for the example):
pd.concat([df3, df2.assign(Code=df2['Category'].map(filler))], ignore_index=True)
output:
Id Data Category Code
[...]
10 1 tetyer Airport AIR_10
11 2 fgdss Hospital HEA_07
12 3 dffsdsa Airport AIR_11
13 4 dsfsas Hospital HEA_08
14 5 fdsfada Airport AIR_12
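One design point worth spelling out: each prefix owns its own generator, so the filler keeps its state between batches and simply continues from the next unused number on the following call. It can also be invoked directly (illustrative only):
filler('Airport')    # next unused AIR_ code
filler('Hospital')   # next unused HEA_ code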

Pandas - find max number of taxi stops before back to depot

I've a pandas dataframe as below:
driver_id | trip_id | pickup_from | drop_off_to | date
1 | 3 | stop 1 | city1 | 2018-02-04
1 | 7 | city 2 | city3 | 2018-02-04
1 | 4 | city 4 | stop1 | 2018-02-04
2 | 8 | stop 1 | city7 | 2018-02-06
2 | 9 | city 8 | stop1 | 2018-02-06
2 | 12 | stop 1 | city5 | 2018-02-07
2 | 10 | city 3 | city1 | 2018-02-07
2 | 1 | city 4 | city7 | 2018-02-07
2 | 6 | city 2 | stop1 | 2018-02-07
I want to calculate the longest trip for each driver between stop 1 in the pickup_from column and stop 1 in the drop_off_to column. For example, driver 1 started from stop 1, went to city 1, city 2, city 3 and city 4, and then returned to stop 1, so the max number of cities he visited is 4.
Driver 2 started from stop 1, went to city 7 and city 8, and returned to stop 1, so he visited 2 cities. He then started from stop 1 again and visited city 5, city 3, city 1, city 4, city 7 and city 2 before returning to stop 1, for a total of 6 cities. So for driver 2 the max number of cities he visited is 6. The date doesn't matter in this calculation.
How can I do this using Pandas?
Define the following function, which computes the longest trip for a driver:
def maxTrip(grp):
    trip = pd.DataFrame({'city': grp[['pickup_from', 'drop_off_to']]
                                 .values.reshape(1, -1).squeeze()})
    return trip.groupby(trip.city.str.match('stop').cumsum())\
               .apply(lambda grp2: grp2.drop_duplicates().city.size).max() - 1
Then apply it:
result = df.groupby('driver_id').apply(maxTrip)
The result, for your data sample, is:
driver_id
1 4
2 6
dtype: int64
Note: It is up to you whether you want to eliminate repeated cities within one sub-trip (from leaving the stop to returning to it). I assumed that they should be eliminated; if you don't want this, drop .drop_duplicates() from my code.
For your data sample this does not matter, since within each sub-trip the city names are unique. But it can happen that a driver visits a city, then goes to another city, and some time later (but before returning to the stop) visits the same city again.
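To make the grouping step concrete, here is the intermediate sequence for driver 1 (values copied from the sample data):
seq = pd.Series(['stop 1', 'city1', 'city 2', 'city3', 'city 4', 'stop1'])  # pickups and drop-offs interleaved
segments = seq.str.match('stop').cumsum()   # [1, 1, 1, 1, 1, 2]: a new segment starts at each stop
# segment 1 holds the stop plus 4 distinct cities (size 5), and the trailing "- 1" removes the stop, giving 4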
This would be best done using a function. Assuming the name of your dataframe above is df, consider this as a continuation:
drop_cities1 = list(df.loc[0:2, "drop_off_to"])   # list of all driver 1 drop-offs
pick_cities1 = list(df.loc[0:2, "pickup_from"])   # list of all driver 1 pickups
drop_cities2 = list(df.loc[3:, "drop_off_to"])
pick_cities2 = list(df.loc[3:, "pickup_from"])
no_of_c1 = 0  # number of cities visited by driver 1
no_of_c2 = 0  # number of cities visited by driver 2

def max_trips(pick_list, drop_list):
    no_of_cities = 0
    for pick in pick_list:
        if pick == "stop 1":
            for drop in drop_list:
                no_of_cities += 1
                if drop == "stop1":
                    break
    return no_of_cities

max_trips(pick_cities1, drop_cities1)
max_trips(pick_cities2, drop_cities2)

find rows that share values

I have a pandas dataframe that look like this:
df = pd.DataFrame({'name': ['bob', 'time', 'jane', 'john', 'andy'], 'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'], ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many other people's favefoods overlap with their own.
I.e., for each person I want to find out how many other people have a non-empty favefood intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be through a nested for-loop, i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Is there any way to do this more efficiently using pandas notation? Thanks!
Logic behind it:
s = df['favefood'].explode().str.get_dummies().groupby(level=0).sum()
s.dot(s.T).ne(0).sum(axis=1) - 1
Out[84]:
0 3
1 2
2 2
3 1
4 0
dtype: int64
df['overlap'] = s.dot(s.T).ne(0).sum(axis=1) - 1
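Spelled out step by step (a commented restatement of the same expression, not a different method):
shared = s.dot(s.T)         # entry (i, j) = number of foods person i and person j have in common
has_overlap = shared.ne(0)  # True where at least one food is shared (the diagonal is always True)
df['overlap'] = has_overlap.sum(axis=1) - 1   # count the other people, subtracting 1 for self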
Method from sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s=pd.DataFrame(mlb.fit_transform(df['favefood']),columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1)-1
0 3
1 2
2 2
3 1
4 0
dtype: int64

I am not able to make an accurate pivot table

I have one dataframe which contains many columns, and I am trying to make a pivot table like this.
Data sample:
program | InWappTable | InLeadExportTrack
VIC | True | 1
VIC | True | 1
VIC | True | 1
VIC | True | 1
Here is my code
rec.groupby(['InWappTable', 'InLeadExportTrack','program']).size()
And Expected Output is
IIUC, you can try this:
df_new = df.groupby(['program'])[['InWappTable', 'InLeadExportTrack']].count().reset_index()
total = df_new.sum()
total['program'] = 'Total'
df_new = pd.concat([df_new, total.to_frame().T], ignore_index=True)
print(df_new)
I do not believe that you require a pivot_table here, though a pivot_table approach with aggfunc can also be used effectively (a brief sketch of that alternative is at the end of this answer).
Here is how I approached this
Generate some data
a = [['program','InWappTable','InLeadExportTrack'],
['VIC',True,1],
['Mall',False,15],
['VIC',True,101],
['VIC',True,1],
['Mall',True,74],
['Mall',True,11],
['VIC',False,44]]
df = pd.DataFrame(a[1:], columns=a[0])
print(df)
program InWappTable InLeadExportTrack
0 VIC True 1
1 Mall False 15
2 VIC True 101
3 VIC True 1
4 Mall True 74
5 Mall True 11
6 VIC False 44
First do GROUP BY with count aggregation
df_grouped = df.groupby(['program']).count()
print(df_grouped)
InWappTable InLeadExportTrack
program
Mall 3 3
VIC 4 4
Then to get the sum of all columns
num_cols = ['InWappTable','InLeadExportTrack']
df_grouped[num_cols] = df_grouped[num_cols].astype(int)
df_grouped.loc['Total']= df_grouped.sum(axis=0)
df_grouped.reset_index(drop=False, inplace=True)
print(df_grouped)
program InWappTable InLeadExportTrack
0 Mall 3 3
1 VIC 4 4
2 Total 7 7
EDIT
Based on the comments in the OP, df_grouped = df.groupby(['program']).count() could be replaced by df_grouped = df.groupby(['program']).sum(). In this case, the output is shown below
program InWappTable InLeadExportTrack
0 Mall 2 100
1 VIC 3 147
2 Total 5 247
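As mentioned at the start of this answer, a pivot_table with aggfunc can produce the same table; a minimal sketch using margins to append the Total row (aggfunc='sum' mirrors the EDIT above; use 'count' for the original version):
pt = pd.pivot_table(df, index='program',
                    values=['InWappTable', 'InLeadExportTrack'],
                    aggfunc='sum', margins=True, margins_name='Total')
print(pt.reset_index())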
