I have a problem. I want to plot the 5 most common names. Unfortunately the names contain not only Latin letters but also Chinese characters. As soon as I try to render the plot, I get:
C:\Users\user\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:240: RuntimeWarning: Glyph 32422 missing from current font.
How can I fix this warning?
import pandas as pd
import seaborn as sns
d = {'id': [1, 2, 3, 4, 5],
     'name': ['Max Power', 'Jessica', '约翰·多伊', '哈拉尔量杯', 'Frank High'],
     }
df = pd.DataFrame(data=d)
print(df)
df_count = df['name'].value_counts()[:5]
ax = sns.barplot(x=df_count.index, y=df_count)
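The warning means the font matplotlib uses by default (DejaVu Sans) has no glyphs for the Chinese characters, so one fix is to switch to a CJK-capable font before plotting. A minimal sketch, assuming a font such as Microsoft YaHei or SimHei is installed locally (substitute whatever CJK-capable font your system actually has):
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Assumption: one of these CJK-capable fonts is installed locally;
# replace with any font on your system that has Chinese glyphs.
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei', 'SimHei', 'sans-serif']
plt.rcParams['axes.unicode_minus'] = False  # keep minus signs rendering with CJK fonts

df_count = df['name'].value_counts()[:5]
ax = sns.barplot(x=df_count.index, y=df_count)
plt.show()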
I picked up part of the code from here and expanded it a bit. However, I am not able to convert the datatypes of the Basket and Count columns for further processing.
For example, the Basket and Count columns are int64; I would like to change them to float64.
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output

# creating a DataFrame
df = pd.DataFrame({'Basket': [1, 2, 3],
                   'Name': ['Apple', 'Orange', 'Count'],
                   'id': [111, 222, 333]})

vardict = df.columns

select_variable = widgets.Dropdown(
    options=vardict,
    value=vardict[0],
    description='Select variable:',
    disabled=False,
    button_style=''
)

def get_and_plot(b):
    clear_output()  # clear_output must be called, not just referenced
    s = select_variable.value
    col_dtype = df[s].dtypes
    print(col_dtype)

display(select_variable)
select_variable.observe(get_and_plot, names='value')
Thanks in advance.
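For the dtype change itself, a minimal sketch assuming you just want to cast columns with astype (the column names are taken from the example DataFrame above; adjust as needed):
# Cast int64 columns to float64; adjust the column list as needed.
df = df.astype({'Basket': 'float64', 'id': 'float64'})
print(df.dtypes)  # Basket and id are now float64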
First time posting here, but I'm having trouble with some code that I'm using to pull fantasy football data from ESPN. I pulled this from Steven Morse's blog (https://stmorse.github.io/journal/espn-fantasy-v3.html) and it appears to work EXCEPT for one error that I'm getting. The error is:
File "<ipython-input-65-56a5896c1c3c>", line 3, in <listcomp>
game['away']['teamId'], game['away']['totalPoints'],
KeyError: 'away'
I've looked in the dictionary and found that 'away' is in there. What I can't figure out is why 'home' works but not 'away'. Here is the code I'm using. Any help is appreciated:
import requests
import pandas as pd
url = 'https://fantasy.espn.com/apis/v3/games/ffl/seasons/2020/segments/0/leagues/721579?view=mMatchupScore'
r = requests.get(url,
                 cookies={"swid": "{1E653FDE-DA4A-4CC6-A53F-DEDA4A6CC663}",
                          "espn_s2": "AECpfE9Zsvwwsl7N%2BRt%2BAPhSAKmSs%2F2ZmQVuHJeKG8LGgLBDfRl0j88CvzRFsrRjLmjzASAdIUA9CyKpQJYBfn6avgXoPHJgDiCqfDPspruYqHNENjoeGuGfVqtPewVJGv3rBJPFMp1ugWiqlEzKiT9IXTFAIx3V%2Fp2GBuYjid2N%2FFcSUlRlr9idIL66tz2UevuH4F%2FP6ytdM7ABRCTEnrGXoqvbBPCVbtt6%2Fu69uBs6ut08ApLRQc4mffSYCONOqW1BKbAMPPMbwgCn1d5Ruubl"})
d = r.json()

df = [[
    game['matchupPeriodId'],
    game['away']['teamId'], game['away']['totalPoints'],
    game['home']['teamId'], game['home']['totalPoints']
] for game in d['schedule']]

df = pd.DataFrame(df, columns=['Week', 'Team1', 'Score1', 'Team2', 'Score2'])
df['Type'] = ['Regular' if w <= 14 else 'Playoff' for w in df['Week']]
Seems like some of the games in the schedule don't have an away team:
{'home': {'adjustment': 0.0,
'cumulativeScore': {'losses': 0, 'statBySlot': None, 'ties': 0, 'wins': 0},
'pointsByScoringPeriod': {'14': 102.7},
'teamId': 1,
'tiebreak': 0.0,
'totalPoints': 102.7},
'id': 78,
'matchupPeriodId': 14,
'playoffTierType': 'WINNERS_BRACKET',
'winner': 'UNDECIDED'}
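A minimal workaround, sketched against the list comprehension above: skip any matchup that has no 'away' side before indexing into it.
# Keep only matchups that actually have an away team (e.g. skip bye weeks).
df = [[
    game['matchupPeriodId'],
    game['away']['teamId'], game['away']['totalPoints'],
    game['home']['teamId'], game['home']['totalPoints']
] for game in d['schedule'] if 'away' in game]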
For nested json data like this, it's often easier to use pandas.json_normalize which flattens the data structure and gives you a data frame with lots of columns with names like home.cumulativeScore.losses etc.
df = pd.json_normalize(r.json()['schedule'])
Then you can reshape the dataframe by dropping columns you don't care about and so on.
df = pd.json_normalize(r.json()['schedule'])
column_names = {
'matchupPeriodId':'Week',
'away.teamId':'Team1',
'away.totalPoints':'Score1',
'home.teamId':'Team2',
'home.totalPoints':'Score2',
}
df = df.reindex(columns=column_names).rename(columns=column_names)
df['Type'] = ['Regular' if w<=14 else 'Playoff' for w in df['Week']]
For the games where there's no away team, pandas will populate those columns with NaN values.
df[df.Team1.isna()]
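If you only want matchups where both sides are present, a simple follow-up (a sketch using the renamed columns above) is to drop those rows:
# Drop matchups with no away side; their Team1/Score1 are NaN after the rename.
df = df.dropna(subset=['Team1', 'Score1'])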
If a city is mentioned in cities_specific, I would like to create a flag in the cities_all data. This is just a minimal example; in reality I would like to create multiple of these flags based on multiple data frames. That's why I tried to solve it with isin instead of a join.
However, I am running into ValueError: Length of values (3) does not match length of index (7).
# import packages
import pandas as pd
import numpy as np
# create minimal data
cities_specific = pd.DataFrame({'city': ['Melbourne', 'Cairns', 'Sydney'],
                                'n': [10, 4, 8]})
cities_all = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
                                    'Berlin', 'Sydney'],
                           'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000]})
# get value error
# how can this be solved differently?
cities_all.assign(in_cities_specific=np.where(cities_specific.city.isin(cities_all.city), '1', '0'))
# that's the solution I would like to get
expected_solution = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
                                           'Berlin', 'Sydney'],
                                  'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000],
                                  'in_cities': [0, 1, 0, 0, 1, 0, 1]})
I think you have the operands swapped in the condition: you need to check which rows of cities_all appear in cities_specific, not the other way around.
Here you have some alternatives:
cities_all.assign(
in_cities_specific=np.where(cities_all.city.isin(cities_specific.city), '1', '0')
)
or
cities_all["in_cities_specific"] =
cities_all["city"].isin(cities_specific["city"]).astype(int).astype(str)
or
condlist = [cities_all["city"].isin(cities_specific["city"])]
choicelist = ["1"]
cities_all["in_cities_specific"] = np.select(condlist, choicelist,default="0")
I am writing a parser of changes to a pseudo-table web application, to push a notification whenever any rows are added.
Mechanics of the pseudo-table: the table on the website changes from time to time and adds new rows. The page is highly dynamic and sometimes changes the existing rows. The pseudo-table automatically assigns IDs according to its sorting mechanic. To explain precisely, the sorting is alphabetical, so a guy named Adam would get ID 1, Bob = 2, Coul = 3. But if they add a person named Caul, he would get ID 3, while Coul would become 4. This ruins all the methods I have tried so far.
I am now trying to compare two pandas DataFrames to detect row additions and return the newly added rows. I do not want to return existing rows that were changed. I tried using concat and removing duplicates, but this returns rows where there was any minor change in the data, not just additions.
TL;DR EXAMPLE
Input
import pandas as pd

d1 = {'#': [1, 2, 3], 'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']}
d2 = {'#': [1, 2, 3, 4], 'Name': ['James Bourne', 'Steve Jobs', 'Great Guy', 'Steve Johns']}
df_1 = pd.DataFrame(data=d1)
df_2 = pd.DataFrame(data=d2)
# ... code
Output should be
3 Great Guy
You could try a simpler solution:
df_2[~df_2.Name.isin(df_1.Name)].dropna()
Output:
# Name
2 3 Great Guy
Merge the dfs with how='outer', then compare the merged df to the list of original Names:
>>> merged = pd.merge(df_1, df_2, on='Name', how='outer')
>>> [x for x in enumerate(merged.Name) if x[1] not in list(df_1.Name)]
Results in: [(3, 'Great Guy')]
I found the subset parameter of drop_duplicates.
d1 = {'#': [1, 2, 3], 'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']}
d2 = {'#': [1, 2, 3, 4], 'Name': ['James Bourne', 'Steve Jobs', 'Great Guy', 'Steve Johns']}
df_1 = pd.DataFrame(data=d1)
df_2 = pd.DataFrame(data=d2)
df_1 = df_1.set_index('#')
df_2 = df_2.set_index('#')
df = pd.concat([df_1,df_2]).drop_duplicates(subset=['Name'], keep=False)
df
results in
Name
#
3 Great Guy
This solves my question.
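One caveat with keep=False: the concat approach also keeps names that are present in df_1 but missing from df_2 (i.e. deleted rows), not only additions, so if rows can disappear from the table the isin-based answers above are the safer choice.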
I have a csv file that has a bunch of different columns. The columns that I am interested in are 'Items', 'OrderDate' and 'Units'.
In my IDE I am trying to generate a bar chart of the number of 'Pencil's sold on each individual 'OrderDate'. What I am trying to do is look down through the 'Items' column using pandas and check whether the item is a pencil, and add it to the graph if it is; if it is not, don't do anything.
I think I have made it a bit long-winded with the code.
I have the code going down through the 'Items' column and checking whether each entry is a pencil, but I can't figure out what to do next.
import pandas as pd
import matplotlib.pyplot as plt

d = {'item': pd.Series(['Pencil', 'Marker', 'Pencil', 'Headphones', 'Pencil',
                        'The moon', 'Wish you were here album']),
     'OrderDate': pd.Series(['5/15/2020', '5/16/2020', '5/16/2020', '5/15/2020',
                             '5/16/2020', '5/17/2020', '5/16/2020', '5/16/2020', '5/17/2020']),
     'Units': pd.Series([4, 3, 2, 1, 3, 2, 4, 2, 3])}
df = pd.DataFrame.from_dict(d)

df.plot(kind='bar', x='OrderDate', y='Units')

item_col = df['item']  # the column is named 'item', not 'Item'
pencil_binary = item_col.str.count('Pencil')

for entry in item_col:
    if entry == 'Pencil':
        print("i am a pencil")
    else:
        print("i am not a pencil")

print(df)
plt.show()
If I understood correctly, you want to plot the number of pencils sold per day. For that, you can just filter the dataframe to keep only the rows about pencils, and then use a bar chart.
Here's a reproducible example:
import pandas as pd
import matplotlib.pyplot as plt

d = {'item': pd.Series(['Pencil', 'Marker', 'Pencil', 'Headphones', 'Pencil',
                        'The moon', 'Wish you were here album']),
     'OrderDate': pd.Series(['5/15/2020', '5/16/2020', '5/16/2020', '5/15/2020',
                             '5/16/2020', '5/17/2020', '5/16/2020', '5/16/2020', '5/17/2020']),
     'Units': pd.Series([4, 3, 2, 1, 3, 2, 4, 2, 3])}
df = pd.DataFrame.from_dict(d)

# This dataframe only has pencils
df_pencils = df[df.item == 'Pencil']

# Group the pencil rows by date and add up the Units sold per day
df_pencils.groupby('OrderDate')['Units'].sum().plot(kind='bar')
plt.show()
The groupby groups all rows with the same date and, for each group, adds up the Units sold.
In fact, when you do this:
df_pencils.groupby('OrderDate')['Units'].sum()
this is the output:
OrderDate
5/15/2020 4
5/16/2020 5
Name: Units, dtype: int64
If you want a one-liner, it's:
df[df.item == 'Pencil'].groupby('OrderDate')['Units'].sum().plot(kind='bar')