NaN issue after merging two tables | Python

I tried to merge two tables on person_skills, but the merged table has a lot of NaN values.
I'm sure the second table has no duplicate values, and I tried to rule out issues caused by data types or NA values, but I still get the same wrong result.
Please have a look at the following code.
Table 1
lst_col = 'person_skills'
skills = skills.assign(**{lst_col:skills[lst_col].str.split(',')})
skills = skills.explode(['person_skills'])
skills['person_id'] = skills['person_id'].astype(int)
skills['person_skills'] = skills['person_skills'].astype(str)
skills.head(10)
person_id person_skills
0 1 Talent Management
0 1 Human Resources
0 1 Performance Management
0 1 Leadership
0 1 Business Analysis
0 1 Policy
0 1 Talent Acquisition
0 1 Interviews
0 1 Employee Relations
Table 2
standard_skills = df["person_skills"].str.split(',', expand=True)
series1 = pd.Series(standard_skills[0])
standard_skills = series1.unique()
standard_skills= pd.DataFrame(standard_skills, columns = ["person_skills"])
standard_skills.insert(0, 'skill_id', range(1, 1 + len(standard_skills)))
standard_skills['skill_id'] = standard_skills['skill_id'].astype(int)
standard_skills['person_skills'] = standard_skills['person_skills'].astype(str)
standard_skills = standard_skills.drop_duplicates(subset='person_skills').reset_index(drop=True)
standard_skills = standard_skills.dropna(axis=0)
standard_skills.head(10)
skill_id person_skills
0 1 Talent Management
1 2 SEM
2 3 Proficient with Microsoft Windows: Word
3 4 Recruiting
4 5 Employee Benefits
5 6 PowerPoint
6 7 Marketing
7 8 nan
8 9 Human Resources (HR)
9 10 Event Planning
Merged table
combine_skill = skills.merge(standard_skills,on='person_skills', how='left')
combine_skill.head(10)
person_id person_skills skill_id
0 1 Talent Management 1.0
1 1 Human Resources NaN
2 1 Performance Management NaN
3 1 Leadership NaN
4 1 Business Analysis NaN
5 1 Policy NaN
6 1 Talent Acquisition NaN
7 1 Interviews NaN
8 1 Employee Relations NaN
9 1 Staff Development NaN
Please let me know where I made mistakes, thanks!
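Two things in the code above are worth checking. First, splitting on ',' keeps any space that follows a comma, so an exploded value may really be ' Human Resources', which does not match 'Human Resources'. Second, standard_skills is built only from column 0 of the expanded split, so it contains only the first skill listed for each person. The sketch below uses a small hypothetical frame (the input string and the two-row lookup table are assumptions, not the asker's data) to show how str.strip() and merge(..., indicator=True) can reveal which keys fail to match.

```python
import pandas as pd

# Hypothetical input: one person, skills separated by ", " (comma + space)
skills = pd.DataFrame({
    "person_id": [1],
    "person_skills": ["Talent Management, Human Resources, Leadership"],
})
skills = skills.assign(
    person_skills=skills["person_skills"].str.split(",")
).explode("person_skills")

# Remove the whitespace left over from the split before merging
skills["person_skills"] = skills["person_skills"].str.strip()

# Hypothetical lookup table with only two known skills
standard_skills = pd.DataFrame({
    "skill_id": [1, 2],
    "person_skills": ["Talent Management", "Human Resources"],
})

# indicator=True adds a _merge column showing where each row came from
check = skills.merge(standard_skills, on="person_skills", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "person_skills"].tolist()
print(check)
print(unmatched)
```

Rows marked left_only are the ones that produce NaN in skill_id; inspecting them with repr() usually shows whether stray whitespace or a missing lookup entry is the cause.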

Related

Dataframe Insert Labels if filename starts with a 'b'

I want to create a dataframe and give a label to each file, based on the first letter of the filename:
This is where I created the dataframe, which works out fine:
[IN]
df = pd.read_csv('data.txt', sep="\t", names=['file', 'text', 'label'], header=None, engine='python')
texts = df['text'].values.astype("U")
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... NaN
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... NaN
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... NaN
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... NaN
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... NaN
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... NaN
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... NaN
2222 t_399.txt Be careful how you codeA new European directiv... NaN
2223 t_400.txt US cyber security chief resignsThe man making ... NaN
2224 t_401.txt Losing yourself in online gamingOnline role pl... NaN
Now I want to insert labels based on the filename
for index, row in df.iterrows():
    if row['file'].startswith('b'):
        row['label'] = 0
    elif row['file'].startswith('e'):
        row['label'] = 1
    elif row['file'].startswith('p'):
        row['label'] = 2
    elif row['file'].startswith('s'):
        row['label'] = 3
    else:
        row['label'] = 4
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 4
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 4
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 4
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 4
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 4
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
2222 t_399.txt Be careful how you codeA new European directiv... 4
2223 t_400.txt US cyber security chief resignsThe man making ... 4
2224 t_401.txt Losing yourself in online gamingOnline role pl... 4
As you can see, every row got the label 4. What did I do wrong?
Here is one way to do it.
Instead of a for loop, you can use map to assign the values to the label column.
# create a dictionary of key: value map
d={'b':0,'e':1,'p':2,'s':3}
else_val=4
#take the first character from the filename, and map using dictionary
# null values (else condition) will be 4
df['label'] = df['file'].str[:1].map(d).fillna(else_val).astype(int)
file text label
0 0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 0
1 1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 0
2 2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 0
3 3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 0
4 4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 0
5 2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
6 2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
7 2222 t_399.txt Be careful how you codeA new European directiv... 4
8 2223 t_400.txt US cyber security chief resignsThe man making ... 4
9 2224 t_401.txt Losing yourself in online gamingOnline role pl... 4
According to the documentation, using iterrows() to modify a data frame is not guaranteed to work in all cases, because it does not preserve dtypes across rows:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
Therefore, do the following instead.
def label(filename):
    # Map the first letter of the filename to a numeric label
    if filename.startswith('b'):
        return 0
    elif filename.startswith('e'):
        return 1
    elif filename.startswith('p'):
        return 2
    elif filename.startswith('s'):
        return 3
    else:
        return 4
# Apply the function to the 'file' column only
df['label'] = df['file'].apply(label)
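To illustrate the documentation note quoted above, here is a minimal sketch on a hypothetical two-row frame with mixed dtypes (the data is made up for illustration). In this setup the row yielded by iterrows() is a temporary copy, so writing to it leaves the frame unchanged; whether that happens in general depends on the dtypes and the pandas version. Writing through df.loc always targets the frame itself.

```python
import pandas as pd

# Hypothetical frame with mixed dtypes: object ('file') and float ('label')
df = pd.DataFrame({
    "file": ["b_001.txt", "t_397.txt"],
    "label": [float("nan")] * 2,
})

# Writing to the row from iterrows() modifies a temporary copy here
for index, row in df.iterrows():
    row["label"] = 0
untouched = df["label"].isna().all()

# Writing through the frame itself takes effect
df.loc[df["file"].str.startswith("b"), "label"] = 0
df["label"] = df["label"].fillna(4).astype(int)
print(untouched)
print(df)
```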

Defining Parent For a Dataset with Several Conditions in Pandas

I have a CSV file with more than 10,000,000 rows of data with below structures:
I have an ID as my uniqueID per group:
Data Format
ID Type Name
1 Head abc-001
1 Senior abc-002
1 Junior abc-003
1 Junior abc-004
2 Head abc-005
2 Senior abc-006
2 Junior abc-007
3 Head abc-008
3 Junior abc-009
...
For defining parent relationship below conditions exist:
Each group MUST have 1 Head.
It is OPTIONAL to have ONLY 1 Senior in each group.
Each group MUST have AT LEAST one Junior.
EXPECTED RESULT
ID Type Name Parent
1 Senior abc-002 abc-001
1 Junior abc-003 abc-002
1 Junior abc-004 abc-002
2 Senior abc-006 abc-005
2 Junior abc-007 abc-006
3 Junior abc-009 abc-008
The code below works when I have one Junior. I want to know if there is any way to define the parent for more than one Junior:
order = ['Head', 'Senior', 'Junior']
key = pd.Series({x: i for i,x in enumerate(order)})
df2 = df.sort_values(by='Type', key=key.get)
df4=df.join(df2.groupby('ID')['Type'].shift().dropna().rename('Parent'),how='right')
print(df4)
You could pivot the Type and Name columns, then forward fill within each ID group. Then take the right-most two non-NaN entries to get the Name and Parent.
Pivot and forward-fill:
dfn = pd.concat([df[['ID','Type']], df.pivot(columns='Type', values='Name')], axis=1) \
    .groupby('ID').apply(lambda x: x.ffill())[['ID','Type','Head','Senior','Junior']]
print(dfn)
ID Type Head Senior Junior
0 1 Head abc-001 NaN NaN
1 1 Senior abc-001 abc-002 NaN
2 1 Junior abc-001 abc-002 abc-003
3 1 Junior abc-001 abc-002 abc-004
4 2 Head abc-005 NaN NaN
5 2 Senior abc-005 abc-006 NaN
6 2 Junior abc-005 abc-006 abc-007
7 3 Head abc-008 NaN NaN
8 3 Junior abc-008 NaN abc-009
A function to pull the last two non-NaN entries:
def get_np(x):
    rc = [np.nan,np.nan]
    if x.isna().sum() != 2:
        if x.isna().sum() == 0:
            rc = [x['Junior'],x['Senior']]
        elif pd.isna(x['Junior']):
            rc = [x['Senior'],x['Head']]
        else:
            rc = [x['Junior'],x['Head']]
    return pd.concat([x[['ID','Type']], pd.Series(rc, index=['Name','Parent'])])
Apply it and drop the non-applicable rows:
dfn.apply(get_np, axis=1).dropna()
ID Type Name Parent
1 1 Senior abc-002 abc-001
2 1 Junior abc-003 abc-002
3 1 Junior abc-004 abc-002
5 2 Senior abc-006 abc-005
6 2 Junior abc-007 abc-006
8 3 Junior abc-009 abc-008
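With more than 10,000,000 rows, a row-wise apply may be slow. The following is an alternative sketch, not the method above: it looks up each group's Head and Senior by ID with map and picks the parent with where. It assumes the stated conditions hold (exactly one Head and at most one Senior per ID), and the sample frame simply reproduces the rows shown in the question.

```python
import pandas as pd

# Sample rows copied from the question's data format
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "Type": ["Head", "Senior", "Junior", "Junior",
             "Head", "Senior", "Junior", "Head", "Junior"],
    "Name": [f"abc-{i:03d}" for i in range(1, 10)],
})

# Lookup tables: ID -> Head name, ID -> Senior name (unique per ID by assumption)
head = df.loc[df["Type"] == "Head"].set_index("ID")["Name"]
senior = df.loc[df["Type"] == "Senior"].set_index("ID")["Name"]

out = df.loc[df["Type"] != "Head"].copy()
head_of_group = out["ID"].map(head)
senior_of_group = out["ID"].map(senior)  # NaN when the group has no Senior
is_junior = out["Type"] == "Junior"

# Senior -> Head; Junior -> Senior if the group has one, otherwise Head
out["Parent"] = head_of_group.where(~is_junior, senior_of_group.fillna(head_of_group))
print(out)
```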

Pandas, Dataframe, conditional sum of column for each row

I am new to python and trying to move some of my work from excel to python, and wanted an excel SUMIFS equivalent in pandas, for example something like:
SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0)
In my case, I have 6 columns: a unique Trade ID, an Issuer, a Trade date, a Release date, a Trader, and a Quantity. I want a column that shows the sum of the quantity available for release at each row, something like the below:
A B C D E F G
ID Issuer TradeDate ReleaseDate Trader Quantity SumOfAvailableRelease
1 Horse 1/1/2012 13/3/2012 Amy 7 0
2 Horse 2/2/2012 15/5/2012 Dave 2 0
3 Horse 14/3/2012 NaN Dave -3 7
4 Horse 16/5/2012 NaN John -4 9
5 Horse 20/5/2012 10/6/2012 John 2 9
6 Fish 6/6/2013 20/6/2013 John 11 0
7 Fish 25/6/2013 9/9/2013 Amy 4 11
8 Fish 8/8/2013 15/9/2013 Dave 5 11
9 Fish 25/9/2013 NaN Amy -3 20
Usually, in excel, I just pull the SUMIFS formulas down the whole column and it will work, I am not sure how I can do it in python.
Many thanks!
What you could do is use df.where.
For example, you could say
Qdf = df.where(df["Quantity"]>=5)
and then take the sum. I'm not sure exactly what you want to do since I don't know Excel, but I hope this helps.
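As a rough sketch of a SUMIFS equivalent, the formula in the question can be read as: for each row, sum Quantity over rows with the same Issuer, a ReleaseDate on or before this row's TradeDate, and a positive Quantity. The code below applies that reading row by row with a boolean mask. The frame reproduces the question's table, with dates rewritten in ISO format for unambiguous parsing; the helper name sum_available is made up for illustration.

```python
import pandas as pd

# The question's table, with day/month/year dates rewritten as ISO strings
df = pd.DataFrame({
    "ID": range(1, 10),
    "Issuer": ["Horse"] * 5 + ["Fish"] * 4,
    "TradeDate": pd.to_datetime([
        "2012-01-01", "2012-02-02", "2012-03-14", "2012-05-16", "2012-05-20",
        "2013-06-06", "2013-06-25", "2013-08-08", "2013-09-25"]),
    "ReleaseDate": pd.to_datetime([
        "2012-03-13", "2012-05-15", None, None, "2012-06-10",
        "2013-06-20", "2013-09-09", "2013-09-15", None]),
    "Quantity": [7, 2, -3, -4, 2, 11, 4, 5, -3],
})

def sum_available(row):
    # Same three criteria as the SUMIFS formula; NaT never satisfies <=
    mask = (
        (df["Issuer"] == row["Issuer"])
        & (df["ReleaseDate"] <= row["TradeDate"])
        & (df["Quantity"] > 0)
    )
    return df.loc[mask, "Quantity"].sum()

df["SumOfAvailableRelease"] = df.apply(sum_available, axis=1)
print(df)
```

This scans the whole frame once per row, which is fine for small tables; for large ones a sort plus grouped cumulative sum would likely be faster.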

create separate columns whose titles are based on values in a column

I am trying to create values for each location of data. I have:
Portafolio Zona Region COM PROV Type of Housing
654738 1 2 3 21 compuesto
65344 3 8 4 22 error
I want to make a new column for each type of housing, and for its values I want to count how many there are in total for each combination of Zona, Region, COM, and PROV. I have struggled with it for 2 days and I am new to Python pandas. It should look like this:
Zona Region COM PROV Compuesto Error
1 2 3 21 24 444
3 8 4 22 34 32
You want pd.pivot_table, specifying that the aggregation function is size:
df1 = pd.pivot_table(df, index=['Zona', 'Region', 'COM', 'PROV'],
                     columns='Type of Housing',
                     aggfunc='size').reset_index()
df1.columns.name=None
Output: df1
Zona Region COM PROV compuesto error
0 1 2 3 21 1.0 NaN
1 3 8 4 22 NaN 1.0
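The output above contains NaN (and therefore floats) wherever a combination has no rows. If integer counts with zeros are preferred, one alternative sketch is groupby(...).size() followed by unstack(fill_value=0). This is a different route from pivot_table, and the third row in the sample frame is a hypothetical addition so that one count is greater than 1.

```python
import pandas as pd

# First two rows are from the question; the third is hypothetical
df = pd.DataFrame({
    "Portafolio": [654738, 65344, 65345],
    "Zona": [1, 3, 1],
    "Region": [2, 8, 2],
    "COM": [3, 4, 3],
    "PROV": [21, 22, 21],
    "Type of Housing": ["compuesto", "error", "compuesto"],
})

# Count rows per location and housing type, then spread the types into columns
counts = (
    df.groupby(["Zona", "Region", "COM", "PROV", "Type of Housing"])
      .size()
      .unstack(fill_value=0)  # missing combinations become 0 instead of NaN
      .reset_index()
)
counts.columns.name = None
print(counts)
```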

Tallying number of times certain strings occur in Python

I'm working on a database of incidents affecting different sectors in different countries and want to create a table tallying the incident rate breakdown for each country.
The database currently looks like this:
Incident Name | Country Affected | Sector Affected
incident_1 | US,TW,CN | Engineering,Media
incident_2 | FR,RU,CN | Government
etc., etc.
My aim would be to build a table which looks like this:
Country | Engineering | Media | Government
CN | 3 | 0 | 5
etc.
Right now my method is basically to use an if statement to check whether the country column contains a specific string (for example 'CN') and, if this returns True, to run Counter from collections to create a dictionary of the initial tally, then save this.
My issue is how to scale this up to a level where it can be run across the entire database AND how to actually save the dictionary produced by Counter.
Use pd.Series.str.get_dummies and pd.DataFrame.dot:
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
Engineering Government Media
CN 1 1 1
FR 0 1 0
RU 0 1 0
TW 1 0 1
US 1 0 1
A bigger example:
np.random.seed([3,1415])
countries = ['CN', 'FR', 'RU', 'TW', 'US', 'UK', 'JP', 'AU', 'HK']
sectors = ['Engineering', 'Government', 'Media', 'Commodidty']
def pick_rnd(x):
    i = np.random.randint(1, len(x))
    j = np.random.choice(x, i, False)
    return ','.join(j)
df = pd.DataFrame({
    'Country Affected': [pick_rnd(countries) for _ in range(10)],
    'Sector Affected': [pick_rnd(sectors) for _ in range(10)]
})
df
Country Affected Sector Affected
0 CN Government,Media
1 FR,TW,JP,US,UK,CN,RU,AU Commodidty,Government
2 HK,AU,JP Commodidty
3 RU,CN,FR,JP,UK Media,Commodidty,Engineering
4 CN,RU,FR,JP,TW,HK,US,UK Government,Media,Commodidty
5 FR,CN Commodidty
6 FR,HK,JP,TW,US,AU,CN Commodidty
7 CN,HK,RU,TW,UK,US,FR,JP Media,Commodidty
8 JP,UK,AU Engineering,Media
9 RU,UK,FR Media
Then
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
Commodidty Engineering Government Media
AU 3 1 1 1
CN 6 1 3 4
FR 6 1 2 4
HK 4 0 1 2
JP 6 2 2 4
RU 4 1 2 4
TW 4 0 2 2
UK 4 2 2 5
US 4 0 2 2
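The question also asks how to save the tally. A minimal self-contained sketch, using the two incident rows from the question: each get_dummies frame holds a 0/1 indicator per incident, so c.T.dot(s) counts, for every country and sector pair, the incidents where both indicators are 1. The result is an ordinary DataFrame, so it can be turned into a nested dictionary with to_dict or written out with to_csv. The explicit astype(int) is a precaution in case the indicator dtype differs between pandas versions.

```python
import pandas as pd

# The two incidents shown in the question
df = pd.DataFrame({
    "Incident Name": ["incident_1", "incident_2"],
    "Country Affected": ["US,TW,CN", "FR,RU,CN"],
    "Sector Affected": ["Engineering,Media", "Government"],
})

# One 0/1 indicator column per country and per sector
c = df["Country Affected"].str.get_dummies(sep=",").astype(int)
s = df["Sector Affected"].str.get_dummies(sep=",").astype(int)

# (countries x incidents) dot (incidents x sectors) = co-occurrence counts
tally = c.T.dot(s)

# Nested dict: {country: {sector: count}}
tally_dict = tally.to_dict(orient="index")
print(tally)
print(tally_dict["CN"])
```

Calling tally.to_csv with a file path of your choice would persist the same table to disk.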
