Manipulating Values in Pandas DataFrames - python

I am trying to create and apply a function, def change(x):, that modifies a single column of values, grocery, in the grocery data frame built in the code below.
The result I want to achieve is shown under "Expected DataFrame" at the end of the code.
I am at the beginner level in Python, but I know I can use the map() or apply() functions to solve this. My main problem is using the split() method to achieve the result, because the values in the grocery column contain varying numbers of words. Or are there other string manipulation methods that can be used?
import pandas as pd
groceries = {
'grocery':["Tesco's wafers", "Asda's shortbread", "Aldi's lemon tea", "Sainsbury's croissant", "Morrison's doughnut", "Amazon fresh's peppermint tea", "Bar becan's pizza", "Pound savers' shower gel"],
'category':['biscuit', 'biscuit', 'tea', 'bakery', 'bakery', 'tea', 'bakery', 'hygiene'],
'price':[0.99, 1.24, 1.89, 0.75, 0.50, 2.5, 4.99, 2]
}
df = pd.DataFrame(groceries)
df
# function to modify a single column of values - grocery
def change(x):
    return df['grocery'].str.split(' ').str[1]
df = pd.DataFrame(groceries)
df['grocery'] = df['grocery'].map(change)
df
# Expected DataFrame
groceries = pd.DataFrame({
'grocery':['Wafers', 'Shortbread', 'Lemon Tea', 'Croissant', 'Doughnut', 'Peppermint Tea', 'Pizza', 'Shower Gel'],
'category':['biscuit', 'biscuit', 'tea', 'bakery', 'bakery', 'tea', 'bakery', 'hygiene'],
'price':[0.99, 1.24, 1.89, 0.75, 0.50, 2.5, 4.99, 2]
})

I hope this works for you. I split each string on the apostrophe ("'"), take the part after it, and skip its first character before title-casing the result. If your strings follow other patterns, you will need extra conditions.
import pandas as pd
groceries = {
'grocery': [
"Tesco's wafers", "Asda's shortbread", "Aldi's lemon tea",
"Sainsbury's croissant", "Morrison's doughnut",
"Amazon fresh's peppermint tea", "Bar becan's pizza",
"Pound savers' shower gel"
],
'category': [
'biscuit', 'biscuit', 'tea', 'bakery', 'bakery', 'tea', 'bakery',
'hygiene'
],
'price': [0.99, 1.24, 1.89, 0.75, 0.50, 2.5, 4.99, 2]
}
df = pd.DataFrame(groceries)
# Split on the apostrophe, take the part after it, drop its first character,
# strip any leftover whitespace, and title-case the result.
# If the grocery strings need several different rules, use a named function instead:
# def grocery_chng(x):
#     # specify multiple conditions to replace a string
#     return x
# df['grocery'] = df['grocery'].apply(grocery_chng)
df['grocery'] = df['grocery'].apply(lambda x: x.split("'")[1][1:].strip().title())
df

Assuming you have a dataframe df with your original data:
df['grocery'] = df['grocery'].apply(lambda x: (x.split("'s", 1) if "'s" in x else x.split("'", 1))[1].strip().title())
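For the "other string manipulation methods" part of the question, here is a minimal sketch using a regular expression instead of split(). It assumes every value has the form "<owner>'s <item>" or "<owners>' <item>", as in the sample data. The function name change comes from the question; the names and cleaned variables are only for illustration.

```python
import re

def change(x):
    """Drop everything up to the possessive ("'s " or "' ") and title-case the rest."""
    # ^.*?   : shortest possible run of characters from the start (the owner)
    # 's?    : the apostrophe, optionally followed by "s"
    # \s+    : the whitespace separating the owner from the item
    remainder = re.sub(r"^.*?'s?\s+", "", x)
    return remainder.title()

names = [
    "Tesco's wafers", "Asda's shortbread", "Aldi's lemon tea",
    "Sainsbury's croissant", "Morrison's doughnut",
    "Amazon fresh's peppermint tea", "Bar becan's pizza",
    "Pound savers' shower gel",
]
cleaned = [change(name) for name in names]
print(cleaned)
```

Because change takes a single string, it can be passed straight to the column, for example df['grocery'] = df['grocery'].map(change). A value with no apostrophe is left as-is apart from the title-casing.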

Related

If match found then add to dictionary, otherwise perform a process extract one fuzz match

I have two dataframes that I'm comparing, and I add the results to a dictionary.
I can get the first batch of results to work, but things go wrong once I add the else statement.
Right now, it appears to run forever. I'm new to dictionaries and to looping through dataframes.
Here's my code so far (please note that it doesn't work).
Also note that I'm using the tuple output of process.extract (address, score, index). I created a separate lookup, id_match, so that I can match on the index, take the value stored at that index, and put it as an item in my dictionary.
Here are my variables:
df1 = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=['Address_1','Type'])
df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=['Address_2','Type', 'ID'])  # I want to grab the ID from this DF
id_match = df2['ID'].to_dict()
resultstest = defaultdict(list)
matched = [process.extract(i, df2['Address_2'], limit=1)[0] for i in df1['Address_1']]
df2 has approximately 80k rows and df1 has ~50 rows.
Here's the code:
for lrow in df1.itertuples():
    for vrow in df2.itertuples():
        if lrow.Address_1 == vrow.Address_2:
            resultstest[lrow.Address_PSL].append({'ID' : vrow._1, 'Address_match': vrow.Address_2,
                                                  'Sales' : vrow._6, 'Calls': vrow._7, 'Target': vrow._8,
                                                  'Type': vrow._5})
            break  # match found, done
        else:
            for z, y in id_match.items():
                for m in matched:
                    if z == m[2]:  # matching on indexes
                        print(z)
                        resultstest[lrow.Address_1].append({' ID' : y, 'Address_match': m[0],
                                                            'Fuzz_Score': m[1]})
                        break
# m[0] is the address, m[1] is the score
My output would be something like this:
defaultdict(list,
            {'address xyz': [{' ID': '1111111',
                              'Address_match': 'address xyz',
                              'Sales': nan,
                              'Calls': nan,
                              'Target': 0.0,
                              'ID_Type': 'X'}],
             'address abc': [{' ID': '11112222',
                              'Address_match': 'address abc',
                              'Sales': nan,
                              'Calls': nan,
                              'Target': 0.0,
                              'ID_Type': 'Y'}],
             'address xyz12345': [{'ID': '1231569',
                                   'Address_match': 'address xyz12345',
                                   'Fuzz_Score': 97}]})
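One possible shape for the "exact match first, fuzzy match otherwise" logic is sketched below. It is not the asker's code: it uses plain lists instead of dataframes, and it swaps process.extract for the standard library's difflib.SequenceMatcher so that it runs without extra packages. The helper best_fuzzy_match and the sample addresses and IDs are made up for illustration. The point is that the exact lookup goes through a dictionary built once, and the fuzzy search runs at most once per address on the small side instead of inside the inner loop.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def best_fuzzy_match(query, candidates):
    """Return (candidate, score from 0 to 100, position) for the closest candidate."""
    scored = [
        (cand, round(100 * SequenceMatcher(None, query, cand).ratio()), pos)
        for pos, cand in enumerate(candidates)
    ]
    return max(scored, key=lambda item: item[1])

# Hypothetical stand-ins for df1['Address_1'], df2['Address_2'] and df2['ID'].
addresses_1 = ["address xyz", "address xyz12345"]
addresses_2 = ["address abc", "address xyz", "address xyz1234"]
ids = ["11112222", "1111111", "1231569"]

# Build the exact-match lookup once instead of rescanning the large side per row.
exact = {addr: pos for pos, addr in enumerate(addresses_2)}

results = defaultdict(list)
for addr in addresses_1:
    if addr in exact:
        pos = exact[addr]
        results[addr].append({"ID": ids[pos], "Address_match": addr})
    else:
        # Only addresses without an exact match pay for the fuzzy search.
        match, score, pos = best_fuzzy_match(addr, addresses_2)
        results[addr].append({"ID": ids[pos], "Address_match": match, "Fuzz_Score": score})

print(dict(results))
```

With roughly 50 rows on one side and 80k on the other, this does one dictionary lookup per row and a single pass over the large side only for rows that need the fuzzy fallback.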

How to plot bar graph for top five game for each genre using for loop

genre_game = group_game.groupby(["Genre","Name"])["Global_Sales"].sum().reset_index().sort_values(["Genre","Global_Sales"], ascending=(True, False))
genre_game
genre_s=df["Genre"].unique()
genre_sorted=sorted(genre_s)
print(f'List of : {genre_sorted}' )
List of : ['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle', 'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports', 'Strategy']
def f(genre):
    for genre in genre_sorted:
        plt.bar(genre_game[genre][:5], genre_game["Global_Sales"][genre:genre+5])
        plt.xticks(rotation=90)
f("Adventure")
You can do it without a loop. This example keeps the 2 largest sales in each genre, but you can change 2 to 5 or any other number.
import pandas as pd
import matplotlib.pyplot as plt
genre = ['Action', 'Action', 'Action', 'Strategy', 'Strategy', 'Strategy']
name =['GTAV', 'GTAIV', 'FIFA13', 'Worms2', 'Tropico', 'WarOverworld']
sales = [55, 13, 16, 1, 1, 0.5]
df = pd.DataFrame(list(zip(genre, name, sales)),
columns=['genre', 'name', 'sales'])
df2 = df.groupby('genre') \
.apply(lambda x: x.nlargest(2, 'sales')) \
.reset_index(drop=True)
plt.bar(df2.name, df2.sales)
plt.xlabel('x')
plt.ylabel('% global')
plt.show()
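If a loop over genres is still wanted, as the question title asks, a minimal sketch is to iterate over the groups and keep the top rows of each one. It reuses the sample data from this answer; top_n and top_by_genre are illustrative names, and plotting is left out so the snippet only needs pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["Action", "Action", "Action", "Strategy", "Strategy", "Strategy"],
    "name": ["GTAV", "GTAIV", "FIFA13", "Worms2", "Tropico", "WarOverworld"],
    "sales": [55, 13, 16, 1, 1, 0.5],
})

top_n = 2  # change to 5 for the top five games per genre
top_by_genre = {}
for genre, group in df.groupby("genre"):
    # nlargest returns the top_n rows of this genre, ordered by sales descending
    top_by_genre[genre] = group.nlargest(top_n, "sales")

for genre, top in top_by_genre.items():
    print(genre)
    print(top[["name", "sales"]])
```

Each entry of top_by_genre is a small DataFrame, so inside the same loop it could be handed to a plotting call such as plt.bar(top["name"], top["sales"]) to draw one chart per genre.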

Convert a dictionary of a list of dictionaries to pandas DataFrame

I pulled a list of historical option prices for AAPL using the Robinhood function robin_stocks.get_option_historicals(). The data was returned as a dictionary containing a list of dictionaries, as shown below.
I am having difficulty converting the object below (named historicalData) into a DataFrame. Can someone please help?
historicalData = {'data_points': [{'begins_at': '2020-10-05T13:30:00Z',
'open_price': '1.430000',
'close_price': '1.430000',
'high_price': '1.430000',
'low_price': '1.430000',
'volume': 0,
'session': 'reg',
'interpolated': False},
{'begins_at': '2020-10-05T13:40:00Z',
'open_price': '1.430000',
'close_price': '1.340000',
'high_price': '1.440000',
'low_price': '1.320000',
'volume': 0,
'session': 'reg',
'interpolated': False}],
'open_time': '0001-01-01T00:00:00Z',
'open_price': '0.000000',
'previous_close_time': '0001-01-01T00:00:00Z',
'previous_close_price': '0.000000',
'interval': '10minute',
'span': 'week',
'bounds': 'regular',
'id': '22b49380-8c50-4c76-8fb1-a4d06058f91e',
'instrument': 'https://api.robinhood.com/options/instruments/22b49380-8c50-4c76-8fb1-a4d06058f91e/'}
I tried the code below, but that didn't help:
import pandas as pd
df = pd.DataFrame(historicalData)
df
You didn't write that you want only data_points (as in the
other answer), so I assume that you want your whole dictionary
converted to a DataFrame.
To do it, start with your code:
df = pd.DataFrame(historicalData)
It creates a DataFrame, with data_points "exploded" to
consecutive rows, but they are still dictionaries.
Then rename open_price column to open_price_all:
df.rename(columns={'open_price': 'open_price_all'}, inplace=True)
The reason is to avoid duplicated column names after join
to be performed soon (data_points contain also open_price
attribute and I want the corresponding column from data_points
to "inherit" this name).
The next step is to create a temporary DataFrame - a split of
dictionaries in data_points to individual columns:
wrk = df.data_points.apply(pd.Series)
Print wrk to see the result.
And the last step is to join df with wrk and drop
data_points column (not needed any more, since it was
split into separate columns):
result = df.join(wrk).drop(columns=['data_points'])
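As an alternative to the rename, apply, and join steps above, pd.json_normalize can flatten the list under data_points in one call and repeat selected top-level fields on every row. The sketch below uses a trimmed copy of historicalData; the choice of interval, span, and bounds as metadata is only an example. The top-level open_price is deliberately not listed in meta, because it would clash with the open_price key inside each data point.

```python
import pandas as pd

# Trimmed copy of the object from the question, for illustration only.
historicalData = {
    "data_points": [
        {"begins_at": "2020-10-05T13:30:00Z", "open_price": "1.430000",
         "close_price": "1.430000", "high_price": "1.430000",
         "low_price": "1.430000", "volume": 0, "session": "reg",
         "interpolated": False},
        {"begins_at": "2020-10-05T13:40:00Z", "open_price": "1.430000",
         "close_price": "1.340000", "high_price": "1.440000",
         "low_price": "1.320000", "volume": 0, "session": "reg",
         "interpolated": False},
    ],
    "open_price": "0.000000",
    "interval": "10minute",
    "span": "week",
    "bounds": "regular",
}

# One row per data point; the meta fields are copied onto every row.
points = pd.json_normalize(
    historicalData,
    record_path="data_points",
    meta=["interval", "span", "bounds"],
)

# The API returns prices and timestamps as strings, so convert them.
price_cols = ["open_price", "close_price", "high_price", "low_price"]
points[price_cols] = points[price_cols].astype(float)
points["begins_at"] = pd.to_datetime(points["begins_at"])

print(points)
```

If the top-level open_price is also needed, it can be added afterwards under a different column name, in the same spirit as the open_price_all rename above.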
This is pretty easy to solve with the code below. I build a list of DataFrames, one per entry in data_points, using a list comprehension:
import pandas as pd
df_list = [pd.DataFrame(dic.items(), columns=['Parameters', 'Value']) for dic in historicalData['data_points']]
You then could do:
df_list[0]
which will yield
     Parameters                 Value
0     begins_at  2020-10-05T13:30:00Z
1    open_price              1.430000
2   close_price              1.430000
3    high_price              1.430000
4     low_price              1.430000
5        volume                     0
6       session                   reg
7  interpolated                 False

How do I convert nested list to dictionary?

I am currently working on an assignment where I need to convert a nested list to a dictionary, and I have to separate the codes from the nested list below.
data = [['ABC', "Tel", "12/07/2017", 1.5, 1000],['ACE', "S&P", "12/08/2017", 3.2, 2000],['AEB', "ENG", "04/03/2017", 1.4, 3000]]
to get this
Code  Name  Purchase Date  Price  Volume
ABC   Tel   12/07/2017     1.5    1000
ACE   S&P   12/08/2017     3.2    2000
AEB   ENG   04/03/2017     1.4    3000
so the remaining values are still in a list, but tagged to the codes as keys.
Could anyone advise on this, please? Thank you!
You can use a dictcomp:
keys = ['Code','Name','Purchase Date','Price','Volume']
{k: v for k, *v in zip(keys, *data)}
Result:
{'Code': ['ABC', 'ACE', 'AEB'],
'Name': ['Tel', 'S&P', 'ENG'],
'Purchase Date': ['12/07/2017', '12/08/2017', '04/03/2017'],
'Price': [1.5, 3.2, 1.4],
'Volume': [1000, 2000, 3000]}
You can use pandas dataframe for that:
import pandas as pd
data = [['ABC', "Tel", "12/07/2017", 1.5, 1000],['ACE', "S&P", "12/08/2017", 3.2, 2000],['AEB', "ENG", "04/03/2017", 1.4, 3000]]
columns = ["Code","Name","Purchase Date","Price","Volume"]
df = pd.DataFrame(data, columns=columns)
print(df)
I assume that by dictionaries you mean a list of dictionaries, each representing a row with the header as its keys.
You can do that like this:
keys = ['Code','Name','Purchase Date','Price','Volume']
dictionaries = [ dict(zip(keys,row)) for row in data ]
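A third reading of the question ("the remaining values are still in a list, but tagged to codes as keys") is a dictionary keyed by the code, with the rest of each row kept as a list. A minimal sketch, assuming the first element of every row is the code and codes are unique; by_code is an illustrative name:

```python
data = [
    ['ABC', "Tel", "12/07/2017", 1.5, 1000],
    ['ACE', "S&P", "12/08/2017", 3.2, 2000],
    ['AEB', "ENG", "04/03/2017", 1.4, 3000],
]

# Unpack each row into its first element (the code) and the remaining values.
by_code = {code: rest for code, *rest in data}
print(by_code)
```

If two rows shared a code, the later row would overwrite the earlier one, so this only fits data where each code appears once.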

Neatest ways to extract pairs from pandas DataFrame

Given the following pandas DataFrame:
mydf = pd.DataFrame([{'Campaign': 'Campaign X', 'Date': '24-09-2014', 'Spend': 1.34, 'Clicks': 241}, {'Campaign': 'Campaign Y', 'Date': '24-08-2014', 'Spend': 2.89, 'Clicks': 12}, {'Campaign': 'Campaign X', 'Date': '24-08-2014', 'Spend': 1.20, 'Clicks': 1}, {'Campaign': 'Campaign Z2', 'Date': '24-08-2014', 'Spend': 4.56, 'Clicks': 13}] )
I wish to first extract Campaign-Spend pairs, first summing where applicable when a campaign has multiple entries (as is the case for campaign X in this example). With minimal pandas knowledge, I find myself doing:
summed = mydf.groupby('Campaign', as_index=False).sum()
campaignspends = zip(summed['Campaign'], summed['Spend'])
campaignspends = dict(campaignspends)
I'm guessing pandas or python itself has a one-liner for this?
You can pull out the column of interest from a groupby object using ["Spend"]:
>>> campaignspends
{'Campaign Y': 2.8900000000000001, 'Campaign Z2': 4.5599999999999996, 'Campaign X': 2.54}
>>> mydf.groupby("Campaign")["Spend"].sum()
Campaign
Campaign X 2.54
Campaign Y 2.89
Campaign Z2 4.56
Name: Spend, dtype: float64
>>> mydf.groupby("Campaign")["Spend"].sum().to_dict()
{'Campaign Y': 2.8900000000000001, 'Campaign Z2': 4.5599999999999996, 'Campaign X': 2.54}
Here I've added the to_dict() call (dict(mydf..etc) will also work), although note that depending on what you're planning to do next, you might not need to convert from a Series to a dictionary at all. For example,
>>> s = mydf.groupby("Campaign")["Spend"].sum()
>>> s["Campaign Z2"]
4.5599999999999996
works as you'd expect.
