Dataframe Insert Labels if filename starts with a 'b' - python

I want to create a dataframe and give a label to each file, based on the first letter of the filename.
This is where I created the dataframe, which works out fine:
[IN]
df = pd.read_csv('data.txt', sep="\t", names=['file', 'text', 'label'], header=None, engine='python')
texts = df['text'].values.astype("U")
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... NaN
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... NaN
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... NaN
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... NaN
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... NaN
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... NaN
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... NaN
2222 t_399.txt Be careful how you codeA new European directiv... NaN
2223 t_400.txt US cyber security chief resignsThe man making ... NaN
2224 t_401.txt Losing yourself in online gamingOnline role pl... NaN
Now I want to insert labels based on the filename
for index, row in df.iterrows():
    if row['file'].startswith('b'):
        row['label'] = 0
    elif row['file'].startswith('e'):
        row['label'] = 1
    elif row['file'].startswith('p'):
        row['label'] = 2
    elif row['file'].startswith('s'):
        row['label'] = 3
    else:
        row['label'] = 4
print(df)
[OUT]
file text label
0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 4
1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 4
2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 4
3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 4
4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 4
... ... ... ...
2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
2222 t_399.txt Be careful how you codeA new European directiv... 4
2223 t_400.txt US cyber security chief resignsThe man making ... 4
2224 t_401.txt Losing yourself in online gamingOnline role pl... 4
As you can see, every row got the label 4. What did I do wrong?

Here is one way to do it.
Instead of a for loop, you can use map to assign the values to the label column:
# create a dictionary of key: value mappings
d = {'b': 0, 'e': 1, 'p': 2, 's': 3}
else_val = 4
# take the first character from the filename and map it using the dictionary;
# null values (the else condition) become 4
df['label'] = df['file'].str[:1].map(d).fillna(else_val).astype(int)
df
file text label
0 0 b_001.txt Ad sales boost Time Warner profitQuarterly pro... 0
1 1 b_002.txt Dollar gains on Greenspan speechThe dollar has... 0
2 2 b_003.txt Yukos unit buyer faces loan claimThe owners of... 0
3 3 b_004.txt High fuel prices hit BA's profitsBritish Airwa... 0
4 4 b_005.txt Pernod takeover talk lifts DomecqShares in UK ... 0
5 2220 t_397.txt BT program to beat dialler scamsBT is introduc... 4
6 2221 t_398.txt Spam e-mails tempt net shoppersComputer users ... 4
7 2222 t_399.txt Be careful how you codeA new European directiv... 4
8 2223 t_400.txt US cyber security chief resignsThe man making ... 4
9 2224 t_401.txt Losing yourself in online gamingOnline role pl... 4

According to the documentation, using iterrows() to modify a dataframe is not guaranteed to work in all cases, because it does not preserve dtypes across rows, among other caveats:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
Therefore, do the following instead:
def label(file):
    if file.startswith('b'):
        return 0
    elif file.startswith('e'):
        return 1
    elif file.startswith('p'):
        return 2
    elif file.startswith('s'):
        return 3
    else:
        return 4

df['label'] = df.apply(lambda row: label(row['file']), axis=1)
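If you prefer to keep an explicit loop, write through the dataframe itself rather than the row copy that iterrows() yields; a minimal sketch:
# assign via df.at so the write hits the dataframe, not the temporary row object
for index, row in df.iterrows():
    if row['file'].startswith('b'):
        df.at[index, 'label'] = 0
    elif row['file'].startswith('e'):
        df.at[index, 'label'] = 1
    elif row['file'].startswith('p'):
        df.at[index, 'label'] = 2
    elif row['file'].startswith('s'):
        df.at[index, 'label'] = 3
    else:
        df.at[index, 'label'] = 4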

Related

NaN issue after merge two tables | Python

I tried to merge two tables on person_skills, but received a merged table with a lot of NaN values.
I'm sure the second table has no duplicate values, and I tried to rule out possible issues caused by datatypes or NA values, but I still get the same wrong result.
Please have a look at the following code.
Table 1
lst_col = 'person_skills'
skills = skills.assign(**{lst_col:skills[lst_col].str.split(',')})
skills = skills.explode(['person_skills'])
skills['person_id'] = skills['person_id'].astype(int)
skills['person_skills'] = skills['person_skills'].astype(str)
skills.head(10)
person_id person_skills
0 1 Talent Management
0 1 Human Resources
0 1 Performance Management
0 1 Leadership
0 1 Business Analysis
0 1 Policy
0 1 Talent Acquisition
0 1 Interviews
0 1 Employee Relations
Table 2
standard_skills = df["person_skills"].str.split(',', expand=True)
series1 = pd.Series(standard_skills[0])
standard_skills = series1.unique()
standard_skills= pd.DataFrame(standard_skills, columns = ["person_skills"])
standard_skills.insert(0, 'skill_id', range(1, 1 + len(standard_skills)))
standard_skills['skill_id'] = standard_skills['skill_id'].astype(int)
standard_skills['person_skills'] = standard_skills['person_skills'].astype(str)
standard_skills = standard_skills.drop_duplicates(subset='person_skills').reset_index(drop=True)
standard_skills = standard_skills.dropna(axis=0)
standard_skills.head(10)
skill_id person_skills
0 1 Talent Management
1 2 SEM
2 3 Proficient with Microsoft Windows: Word
3 4 Recruiting
4 5 Employee Benefits
5 6 PowerPoint
6 7 Marketing
7 8 nan
8 9 Human Resources (HR)
9 10 Event Planning
Merged table
combine_skill = skills.merge(standard_skills,on='person_skills', how='left')
combine_skill.head(10)
person_id person_skills skill_id
0 1 Talent Management 1.0
1 1 Human Resources NaN
2 1 Performance Management NaN
3 1 Leadership NaN
4 1 Business Analysis NaN
5 1 Policy NaN
6 1 Talent Acquisition NaN
7 1 Interviews NaN
8 1 Employee Relations NaN
9 1 Staff Development NaN
Please let me know where I made mistakes, thanks!
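A quick way to see where the NaN values come from is to check, before merging, which exploded skills actually have a counterpart in standard_skills; a left merge can only fill skill_id for keys that exist in the right table. A small diagnostic sketch, assuming the skills and standard_skills frames built above:
# skills with no matching row in the lookup table end up as NaN after the left merge
missing = skills.loc[~skills['person_skills'].isin(standard_skills['person_skills']), 'person_skills']
print(missing.nunique(), missing.unique()[:10])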

Get latest value looked up from other dataframe

My first data frame
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My 2nd data frame has transactions
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want Price from the 1st data frame to come into the merged dataframe, the common element being 'Product_ID'. Note that against Product_ID 101 there are 2 prices, 299.00 and 9898.00. I want the latter one to come into the merged data set, i.e. 9898.0 (since this is the latest price).
Currently my code is not giving the right answer; it is giving both:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep='last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for Product_ID 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
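Note that pd.merge defaults to an inner join, so customers whose Product_ID has no match (the 0 values) would be dropped. If you want to keep every customer, as in the original attempt, merge the deduplicated product table onto customer with how='left'; a sketch:
# keep all customers; unmatched Product_IDs simply get NaN for Price
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
customerpur = pd.merge(customer, result_product[['Product_ID', 'Price']], on='Product_ID', how='left')
print(customerpur)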

Dataframe Updating a column to category name based on a list of string values in that category

I have lists that are categorized by name, such as:
dining = ['CARLS', 'SUBWAY', 'PIZZA']
bank = ['TRANSFER', 'VENMO', 'SAVE AS YOU GO']
and I want to update a new column to the category name if any of those strings are found in the other column. An example from my other question here, I have the following data set (an example bank transactions list):
import pandas as pd
import numpy as np
dining = ['CARLS', 'SUBWAY', 'PIZZA']
bank = ['TRANSFER', 'VENMO', 'SAVE AS YOU GO']
data = [
[-68.23 , 'PAYPAL TRANSFER'],
[-12.46, 'RALPHS #0079'],
[-8.51, 'SAVE AS YOU GO'],
[25.34, 'VENMO CASHOUT'],
[-2.23 , 'PAYPAL TRANSFER'],
[-64.29 , 'PAYPAL TRANSFER'],
[-7.06, 'SUBWAY'],
[-7.03, 'CARLS JR'],
[-2.35, 'SHELL OIL'],
[-35.23, 'CHEVRON GAS']
]
df = pd.DataFrame(data, columns=['amount', 'details'])
df['category'] = np.nan
df
amount details category
0 -68.23 PAYPAL TRANSFER NaN
1 -12.46 RALPHS #0079 NaN
2 -8.51 SAVE AS YOU GO NaN
3 25.34 VENMO CASHOUT NaN
4 -2.23 PAYPAL TRANSFER NaN
5 -64.29 PAYPAL TRANSFER NaN
6 -7.06 SUBWAY NaN
7 -7.03 CARLS JR NaN
8 -2.35 SHELL OIL NaN
9 -35.23 CHEVRON GAS NaN
Is there an efficient way for me to update the category column to either 'dining' or 'bank' based on whether the strings in those lists are found in df['details']?
I.e. Desired Output:
amount details category
0 -68.23 PAYPAL TRANSFER bank
1 -12.46 RALPHS #0079 NaN
2 -8.51 SAVE AS YOU GO bank
3 25.34 VENMO CASHOUT bank
4 -2.23 PAYPAL TRANSFER bank
5 -64.29 PAYPAL TRANSFER bank
6 -7.06 SUBWAY dining
7 -7.03 CARLS JR dining
8 -2.35 SHELL OIL NaN
9 -35.23 CHEVRON GAS NaN
From my previous question, so far I'm assuming I need to work with a new list that I create by using str.extract.
We can do this with np.select since we have multiple conditions:
dining = '|'.join(dining)
bank = '|'.join(bank)
conditions = [
df['details'].str.contains(f'({dining})'),
df['details'].str.contains(f'({bank})')
]
choices = ['dining', 'bank']
df['category'] = np.select(conditions, choices, default=np.nan)
amount details category
0 -68.23 PAYPAL TRANSFER bank
1 -12.46 RALPHS #0079 nan
2 -8.51 SAVE AS YOU GO bank
3 25.34 VENMO CASHOUT bank
4 -2.23 PAYPAL TRANSFER bank
5 -64.29 PAYPAL TRANSFER bank
6 -7.06 SUBWAY dining
7 -7.03 CARLS JR dining
8 -2.35 SHELL OIL nan
9 -35.23 CHEVRON GAS nan
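One caveat: str.contains (and str.findall below) treat the joined string as a regular expression, so if any keyword contains regex metacharacters you may want to escape it first. A sketch of the joining step with escaping added (dining_pat and bank_pat are just illustrative names):
import re
dining_pat = '|'.join(map(re.escape, dining))
bank_pat = '|'.join(map(re.escape, bank))
# then use df['details'].str.contains(dining_pat) etc. in the conditions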
You can do it with findall + a dict map:
sub = {**dict.fromkeys(dining, 'dining'), **dict.fromkeys(bank, 'bank')}
df.details.str.findall('|'.join(sub)).str[0].map(sub)
Out[146]:
0 bank
1 NaN
2 bank
3 bank
4 bank
5 bank
6 dining
7 dining
8 NaN
9 NaN
Name: details, dtype: object
#df['category'] = df.details.str.findall('|'.join(sub)).str[0].map(sub)

Conditionally filling blank values in Pandas dataframes

I have a dataframe which looks as follows (more columns have been dropped off):
memberID shipping_country
264991
264991 Canada
100 USA
5000
5000 UK
I'm trying to fill the blank cells with the existing shipping_country value for each user:
memberID shipping_country
264991 Canada
264991 Canada
100 USA
5000 UK
5000 UK
However, I'm not sure of the most efficient way to do this on a large-scale dataset. Perhaps a vectorized groupby method?
You can use GroupBy + ffill / bfill:
def filler(x):
    return x.ffill().bfill()

res = df.groupby('memberID')['shipping_country'].apply(filler)
A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.
This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 54
This should work for you, and it also has the behavior that if a memberID group only has empty string values ('') in shipping_country, those will be retained in the output df:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')
Yields:
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 54
If you would like to leave the empty strings '' as NaN in the output df, then just remove the fillna(''), leaving:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
You can use chained groupbys, one with forward fill and one with backfill:
# replace blank values with `NaN` first:
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
This method will also allow a group made up of all NaN to remain NaN:
>>> df
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 1
6 1
df['shipping_country'].replace('', np.nan, inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 1 NaN
6 1 NaN

Selecting data based on number of occurences using Python / Pandas

My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
Here is how often each of the facilities appear in the dataset:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
I am trying to create a new set of data containing only those rows whose Facility Type has over 50 occurrences in the dataset. How would I approach this?
Please note that the full list of facility counts is MUCH LARGER; I have cut out most of it as it did not contribute to the question at hand (so simply removing occurrences of "Special Event", "Shared Kitchen User", and "Daycare" is not what I'm looking for).
IIUC then you want to filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
Not tested, but should work.
FT=df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT>50])]
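For reference, a sketch of the same idea applied to the small toy frame from the previous answer (threshold lowered to 1 to match that example):
counts = df['type'].value_counts()
df[df['type'].isin(counts.index[counts > 1])]
This keeps exactly the same rows as the groupby/filter version, but computes the group sizes once with value_counts instead of calling len on every group.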
