Problematic values - Python

I have a dataframe with a 'Trousers' column containing many different types of trousers. Most values start with the type, for instance: Jeans- Replay-blue, Chino- Uniqlo-~, or Smart-Next-~. Others have a type followed by a longer name (2 or 3 words).
What I want is to loop through that column and change each value to just Jeans if Jeans is in the cell, or Chino if Chino is in the cell, and so on, so I can easily group them.
How can I achieve that with a for loop?

It seems you need str.split and then select the first value of each list by str[0]:
df['type'] = df['Trousers'].str.split('-').str[0]
Sample:
df = pd.DataFrame({'Trousers':['Jeans- Replay-blue','Chino- Uniqlo-~','Smart-Next-~']})
print (df)
             Trousers
0  Jeans- Replay-blue
1     Chino- Uniqlo-~
2        Smart-Next-~
df['type'] = df['Trousers'].str.split('-').str[0]
print (df)
             Trousers   type
0  Jeans- Replay-blue  Jeans
1     Chino- Uniqlo-~  Chino
2        Smart-Next-~  Smart
df['Trousers'] = df['Trousers'].str.split('-').str[0]
print (df)
  Trousers
0    Jeans
1    Chino
2    Smart
Another solution with extract:
df['Trousers'] = df['Trousers'].str.extract('([a-zA-Z]+)-', expand=False)
print (df)
  Trousers
0    Jeans
1    Chino
2    Smart
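If some cells do not start with the type (the 'long name' cases mentioned in the question), a containment check is closer to the loop the question describes. A minimal sketch, assuming the list of known types is fixed:
import pandas as pd

df = pd.DataFrame({'Trousers': ['Jeans- Replay-blue', 'Chino- Uniqlo-~', 'Smart-Next-~']})

# hypothetical list of known types; extend as needed
types = ['Jeans', 'Chino', 'Smart']

df['type'] = df['Trousers']  # fall back to the original value when nothing matches
for t in types:
    # case-insensitive containment check, as the question describes
    df.loc[df['Trousers'].str.contains(t, case=False), 'type'] = t
print(df)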

Related

How to choose a certain number of characters from a column in Python?

For example, there is a column in a dataframe, 'ID'.
One of the entries is, for example, '13245993, 3004992'.
I only want to get '13245993'.
That also applies to every row in column 'ID'.
How do I change the data in each row of column 'ID'?
You can try it like this: apply slicing on the ID column to get the required result. Here I am using 3 as the number of characters to keep.
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'ID': [90877, 10909, 12223, 12334]}
df = pd.DataFrame(data)
print('Before change')
print(df)
df["ID"] = df["ID"].apply(lambda x: str(x)[:3])
print('After change')
print(df)
Output:
Before change
    Name     ID
0    Tom  90877
1   nick  10909
2  krish  12223
3   jack  12334
After change
    Name   ID
0    Tom  908
1   nick  109
2  krish  122
3   jack  123
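If the goal is to keep everything before the comma (as in '13245993, 3004992' → '13245993') rather than a fixed number of characters, str.split is a better fit. A minimal sketch, assuming the IDs are stored as strings (values are illustrative):
import pandas as pd

df = pd.DataFrame({'ID': ['13245993, 3004992', '55501234, 998']})
# keep only the part before the first comma
df['ID'] = df['ID'].str.split(',').str[0]
print(df)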
You could do something like
data[data['ID'] == '13245993']
This will give you the rows where ID is 13245993.
I hope this answers your question; if not, please let me know.
With best regards

How to copy a field based on a condition of another column in Python?

I need to copy a column's field into a variable, based on a specific condition, and then delete it.
This dataframe contains data about some kids, each with an associated favourite toy and colour:
import numpy as np
import pandas as pd

data = {'Kid': ['Richard', 'Daphne', 'Andy', 'May', 'Claire', 'Mozart', 'Jane'],
        'Toy': ['Ball', 'Doll', 'Car', 'Barbie', 'Frog', 'Bear', 'Doll'],
        'Colour': ['white', np.nan, 'red', 'pink', 'green', np.nan, np.nan]}
df = pd.DataFrame(data, columns=['Kid', 'Toy', 'Colour'])
print(df)
The dataframe looks like this:
       Kid     Toy Colour
0  Richard    Ball  white
1   Daphne    Doll    NaN
2     Andy     Car    red
3      May  Barbie   pink
4   Claire    Frog  green
5   Mozart    Bear    NaN
6     Jane    Doll    NaN
The condition is: If a kid does have a toy, but it does not have a colour, then save both the kid and the toy in a separate array as follows and maintain the order/matching:
toy_array = ["Doll", "Bear", "Doll"]
kid_array = ["Daphne", "Mozart", "Jane"]
And then delete the toy from the dataframe. So the final dataframe should look like this:
       Kid     Toy Colour
0  Richard    Ball  white
1   Daphne     NaN    NaN
2     Andy     Car    red
3      May  Barbie   pink
4   Claire    Frog  green
5   Mozart     NaN    NaN
6     Jane     NaN    NaN
I got inspired by many sources, along with this one, and I tried this:
kid_array.append(df.loc[(df['Toy'] != np.nan) & (df['Colour'] == np.nan)])
print(kid_array)
I am at the very beginning, so I would highly appreciate any help!
Test for missing and non-missing values with Series.isna and Series.notna, and then set missing values in the Toy column with DataFrame.loc:
mask = df['Toy'].notna() & df['Colour'].isna()
df.loc[mask, 'Toy'] = np.nan
Or with Series.mask:
df['Toy'] = df['Toy'].mask(mask)
Or with numpy.where:
df['Toy'] = np.where(mask, np.nan, df['Toy'])
print (df)
       Kid     Toy Colour
0  Richard    Ball  white
1   Daphne     NaN    NaN
2     Andy     Car    red
3      May  Barbie   pink
4   Claire    Frog  green
5   Mozart     NaN    NaN
6     Jane     NaN    NaN
If you also need the lists, extract them with the same mask; note this must happen before Toy is overwritten with NaN, otherwise the values are already gone:
toy_array = df.loc[mask, 'Toy'].tolist()
kid_array = df.loc[mask, 'Kid'].tolist()
print (toy_array)
['Doll', 'Bear', 'Doll']
print (kid_array)
['Daphne', 'Mozart', 'Jane']
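Putting the pieces together in a safe order (extract the lists first, then overwrite), a compact sketch of the whole flow:
import numpy as np
import pandas as pd

data = {'Kid': ['Richard', 'Daphne', 'Andy', 'May', 'Claire', 'Mozart', 'Jane'],
        'Toy': ['Ball', 'Doll', 'Car', 'Barbie', 'Frog', 'Bear', 'Doll'],
        'Colour': ['white', np.nan, 'red', 'pink', 'green', np.nan, np.nan]}
df = pd.DataFrame(data)

mask = df['Toy'].notna() & df['Colour'].isna()
toy_array = df.loc[mask, 'Toy'].tolist()  # extract before overwriting
kid_array = df.loc[mask, 'Kid'].tolist()
df.loc[mask, 'Toy'] = np.nan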
Your logic is correct; only the comparison needs the right function, because comparing directly against np.nan with == or != always returns False. Note also that Python negates boolean masks with ~ (not !), and numpy.isnan fails on object (string) columns, so the pandas equivalent pd.isna is used here instead.
Try the following code:
kid_array.append(df.loc[~pd.isna(df['Toy']) & pd.isna(df['Colour'])])

Use Excel sheet to create dictionary in order to replace values

I have an Excel file with product names. The first row holds the categories (A1: Water, A2: Sparkling, A3: Still, B1: Soft Drinks, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.), and each cell below a category is a different product. I want to keep this list in a viewable format (not comma-separated, etc.), as that makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps, I can also save the Excel file in CSV format, and I can also move the categories from the top row to the first column.
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If a product is not in the Excel file, it should not be replaced (e.g. Cookie).
print(df)
     Product  Quantity
0  Coca Cola      1234
1     Cookie         4
2      Still       333
3      Chips        88
Expected Outcome:
print (df1)
       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88
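For context, the lookup frame df1 used in the answer below would be read straight from the Excel sheet. A minimal sketch, where the file name products.xlsx is hypothetical and the Snacks column is inferred from the expected output:
import pandas as pd

# hypothetical file name; the first row of the sheet holds the categories
df1 = pd.read_excel('products.xlsx')
print(df1)
#        Water   Soft Drinks  Snacks
# 0  Sparkling     Coca Cola   Chips
# 1      Still  Orange Juice     NaN
# 2        NaN      Lemonade     NaN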
Use DataFrame.melt with DataFrame.dropna, or DataFrame.stack, to build a helper Series, and then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88

Counting the occurrences of a substring from one column within another column

I have two dataframes: one contains a list of players, and the other contains play-by-play data for those players. Portions of the relevant rows of the two dataframes are shown below.
0 Matt Carpenter
1 Jason Heyward
2 Peter Bourjos
3 Matt Holliday
4 Jhonny Peralta
5 Matt Adams
...
Name: Name, dtype: object
0 Matt Carpenter grounded out to second (Grounder).
1 Jason Heyward doubled to right (Liner).
2 Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
What I am trying to do is create a column in the first dataframe that counts the number of occurrences of the string (df['Name'] + ' scored') in the column in the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", etc. I know you can use str.contains to do this type of thing, but it only seems to work if you put in the explicit string. For example,
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
it returns the error 'Series' objects are mutable, thus they cannot be hashed. I have looked at various similar questions but cannot find the solution to this problem for the life of me. Any assistance on this would be greatly appreciated, thank you!
I think you need str.findall with a regex built by joining all values of Name, then create indicator columns with MultiLabelBinarizer, and add all missing columns with reindex:
s = df1['Name'] + ' scored'
# non-capturing group so the word boundaries apply to every alternative
pat = r'\b(?:{})\b'.format('|'.join(s))

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
                  columns=mlb.classes_,
                  index=df2.index).reindex(columns=s, fill_value=0)
print (df)
Name  Matt Carpenter scored  Jason Heyward scored  Peter Bourjos scored  \
0                         0                     0                     0
1                         0                     0                     0
2                         0                     1                     0

Name  Matt Holliday scored  Jhonny Peralta scored  Matt Adams scored
0                        0                      0                  0
1                        0                      0                  0
2                        0                      0                  0
Last, if necessary, join to df2:
df = df2.join(df)
print (df)
                                                Play  Matt Carpenter scored  \
0  Matt Carpenter grounded out to second (Grounder).                      0
1            Jason Heyward doubled to right (Liner).                      0
2  Matt Holliday singled to right (Liner). Jason...                       0

   Jason Heyward scored  Peter Bourjos scored  Matt Holliday scored  \
0                     0                     0                     0
1                     0                     0                     0
2                     1                     0                     0

   Jhonny Peralta scored  Matt Adams scored
0                      0                  0
1                      0                  0
2                      0                  0
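If only a per-player total is needed (the 'R vs SP' column the question builds), a simpler sketch that counts matches directly; the frame names are taken from the question:
# join all plays into one string, then count each player's 'scored' events
plays = play_by_play_SP_df['Play'].str.cat(sep=' ')
batter_game_logs_df['R vs SP'] = batter_game_logs_df['Name'].apply(
    lambda name: plays.count(name + ' scored'))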

How to 'explode' Pandas Column Value to unique row

So, what I mean by 'explode' is this: I want to transform a dataframe like
ID | Name    | Food          | Drink
1  | John    | Apple, Orange | Tea, Water
2  | Shawn   |               | Milk
3  | Patrick | Chichken      |
4  | Halley  | Fish Nugget   |
into this dataframe:
ID | Name    | Order Type | Items
1  | John    | Food       | Apple
2  | John    | Food       | Orange
3  | John    | Drink      | Tea
4  | John    | Drink      | Water
5  | Shawn   | Drink      | Milk
6  | Patrick | Food       | Chichken
I don't know how to make this happen. Any help would be appreciated!
IIUC, stack with an unnest process; here I would not change the ID, as I think keeping the original one is better:
s = df.set_index(['ID', 'Name']).stack()
pd.DataFrame(data=s.str.split(',').sum(),
             index=s.index.repeat(s.str.split(',').str.len())).reset_index()
Out[289]:
   ID     Name level_2            0
0   1     John    Food        Apple
1   1     John    Food       Orange
2   1     John   Drink          Tea
3   1     John   Drink        Water
4   2    Shawn   Drink         Milk
5   3  Patrick    Food     Chichken
6   4   Halley    Food  Fish Nugget
# if you need to rename the column to Item, try below
# pd.DataFrame(data=s.str.split(',').sum(), index=s.index.repeat(s.str.split(',').str.len())).rename(columns={0: 'Item'}).reset_index()
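On pandas 0.25 or newer, Series.explode makes the unnest step more direct; a sketch of the same idea:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Name': ['John', 'Shawn', 'Patrick', 'Halley'],
                   'Food': ['Apple, Orange', None, 'Chichken', 'Fish Nugget'],
                   'Drink': ['Tea, Water', 'Milk', None, None]})

out = (df.set_index(['ID', 'Name'])
         .stack()              # drops the missing Food/Drink cells
         .str.split(',')
         .explode()
         .str.strip()
         .reset_index())
out.columns = ['ID', 'Name', 'Order Type', 'Items']
print(out)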
You can use pd.melt to convert the data from wide to long format. I think this will be easier to understand step by step.
# first split into separate columns (assumes at most two items per cell)
df[['Food1', 'Food2']] = df.Food.str.split(',', expand=True)
df[['Drink1', 'Drink2']] = df.Drink.str.split(',', expand=True)
# now melt the df into long format
df = pd.melt(df, id_vars=['Name'], value_vars=['Food1', 'Food2', 'Drink1', 'Drink2'])
# remove unwanted rows and sort the data
df = df[df['value'].notnull()].sort_values('Name').reset_index(drop=True)
# rename the columns and strip the trailing digit from the Order Type values
df.rename(columns={'variable': 'Order Type', 'value': 'Items'}, inplace=True)
df['Order Type'] = df['Order Type'].str.replace(r'\d', '', regex=True)
# output
print(df)
      Name Order Type        Items
0   Halley       Food  Fish Nugget
1     John       Food        Apple
2     John       Food       Orange
3     John      Drink          Tea
4     John      Drink        Water
5  Patrick       Food     Chichken
6    Shawn      Drink         Milk
