If the city has been mentioned in cities_specific I would like to create a flag in the cities_all data. It's just a minimal example and in reality I would like to create multiple of these flags based on multiple data frames. That's why I tried to solve it with isin instead of a join.
However, I am running into ValueError: Length of values (3) does not match length of index (7).
# import packages
import pandas as pd
import numpy as np
# create minimal data
cities_specific = pd.DataFrame({'city': ['Melbourne', 'Cairns', 'Sydney'],
'n': [10, 4, 8]})
cities_all = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
'Berlin', 'Sydney'],
'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000]})
# get value error
# how can this be solved differently?
cities_all.assign(in_cities_specific=np.where(cities_specific.city.isin(cities_all.city), '1', '0'))
# that's the solution I would like to get
expected_solution = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
'Berlin', 'Sydney'],
'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000],
'in_cities': [0, 1, 0, 0, 1, 0, 1]})
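I think you have swapped the operands in the condition. isin has to be called on cities_all.city (7 rows), not on cities_specific.city (3 rows), so that the result has one value per row of cities_all.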
Here you have some alternatives:
cities_all.assign(
in_cities_specific=np.where(cities_all.city.isin(cities_specific.city), '1', '0')
)
or
cities_all["in_cities_specific"] = (
    cities_all["city"].isin(cities_specific["city"]).astype(int).astype(str)
)
or
condlist = [cities_all["city"].isin(cities_specific["city"])]
choicelist = ["1"]
cities_all["in_cities_specific"] = np.select(condlist, choicelist,default="0")
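Since the question mentions creating several such flags from several data frames, the isin approach above could be wrapped in a loop. This is only a minimal sketch: cities_visited and the lookups dictionary are hypothetical names invented for the illustration, and the flags are stored as integers, as in the expected output.

```python
import pandas as pd

cities_all = pd.DataFrame({
    'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns', 'Berlin', 'Sydney'],
    'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000],
})
cities_specific = pd.DataFrame({'city': ['Melbourne', 'Cairns', 'Sydney'], 'n': [10, 4, 8]})
# Hypothetical second lookup frame, invented for this illustration
cities_visited = pd.DataFrame({'city': ['Vienna', 'Berlin']})

# Map each new flag column name to the frame it is derived from
lookups = {
    'in_cities_specific': cities_specific,
    'in_cities_visited': cities_visited,
}

flagged = cities_all.copy()
for flag_name, lookup in lookups.items():
    # isin is called on the cities_all column, so each flag has one value per row of cities_all
    flagged[flag_name] = flagged['city'].isin(lookup['city']).astype(int)

print(flagged)
```

Adding another flag then only requires another entry in lookups.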
I have a problem. I want to plot the 5 most frequent names. Unfortunately, the names contain not only Latin letters but also Chinese characters. As soon as I try to draw the plot, I get:
C:\Users\user\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:240: RuntimeWarning: Glyph 32422 missing from current font.
How can I solve this error?
import pandas as pd
import seaborn as sns
d = {'id': [1, 2, 3, 4, 5],
'name': ['Max Power', 'Jessica', '约翰·多伊', '哈拉尔量杯', 'Frank High'],
}
df = pd.DataFrame(data=d)
print(df)
df_count = df['name'].value_counts()[:5]
ax = sns.barplot(x=df_count.index, y=df_count)
I have this dataframe df where
>>> df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
'Cost':[10000, 5000, 15000, 2000],
'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
'Age':['20', '10', '13', '17']})
I want to determine the column index with the corresponding name. I tried it with this:
>>> list(df.columns)
But the solution above only returns the column names without index numbers.
How can I code it so that it would return the column names and the corresponding index for that column? Like This:
0 Date
1 Event
2 Cost
3 Name
4 Age
The simplest way is to use the pd.Series constructor:
pd.Series(list(df.columns))
Or convert columns to Series and create default index:
df.columns.to_series().reset_index(drop=True)
Or, as a one-column DataFrame with a default index:
df.columns.to_frame(index=False)
You can use a loop like this:
myList = list(df.columns)
index = 0
for value in myList:
    print(index, value)
    index += 1
A nice short way to get a dictionary:
d = dict(enumerate(df))
output: {0: 'Date', 1: 'Event', 2: 'Cost', 3: 'Name', 4: 'Age'}
For a Series, pd.Series(list(df)) is sufficient as iteration occurs directly on the column names
Besides enumerate, you can also pair each column name with its position using zip, as follows:
import pandas as pd
df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
'Cost':[10000, 5000, 15000, 2000],
'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
'Age':['20', '10', '13', '17']})
result = list(zip(range(len(df.columns)), df.columns))
for r in result:
    print(r)
#(0, 'Date')
#(1, 'Event')
#(2, 'Cost')
#(3, 'Name')
#(4, 'Age')
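If only the position of one particular column is needed, rather than the full listing, a lookup by name may be enough. The following is a small sketch using Index.get_loc on the same df as in the question; it assumes the column names are unique, and the variable names cost_position and positions are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
                   'Event': ['Music', 'Poetry', 'Theatre', 'Comedy'],
                   'Cost': [10000, 5000, 15000, 2000],
                   'Name': ['Roy', 'Abraham', 'Blythe', 'Sophia'],
                   'Age': ['20', '10', '13', '17']})

# Position of a single column, looked up by its name
cost_position = df.columns.get_loc('Cost')

# Name -> position mapping for every column (assumes unique column names)
positions = {name: df.columns.get_loc(name) for name in df.columns}

print(cost_position)
print(positions)
```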
I want to convert this dict into a pandas dataframe where each key becomes a column and values in the list become the rows:
my_dict:
{'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
'Rank': [1, 3, 7, 4, 25],
}
The lists in my_dict can also have some missing values, which should appear as NaNs in dataframe.
This is how I'm currently trying to append it into my dataframe:
df = pd.DataFrame(columns=['Last updated',
                           'Symbol',
                           'Name',
                           'Rank'])
df = df.append(my_dict, ignore_index=True)
#print(df)
df.to_excel(r'\walletframe.xlsx', index = False, header = True)
But my output only has a single row containing all the values.
The answer was pretty simple. Instead of using
df = df.append(my_dict)
I used
df = pd.DataFrame.from_dict(my_dict, orient='index').T
With orient='index', each key becomes a row and shorter lists are padded with missing values; transposing with .T then turns the keys into columns.
Credit to @Ank, who helped me find the solution!
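For comparison, another way to build a frame from lists of unequal length is to wrap each list in a pd.Series, so that pandas aligns the values on the row index and fills the gaps with NaN. This is a sketch under the assumption that my_dict looks like the one in the question; it is not the approach the author settled on.

```python
import pandas as pd

my_dict = {
    'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
    'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
    'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
    'Rank': [1, 3, 7, 4, 25],
}

# Each list becomes a Series indexed 0..n-1; shorter Series are padded with NaN on alignment
df = pd.DataFrame({key: pd.Series(values) for key, values in my_dict.items()})

print(df)
```

One difference worth noting: columns without gaps, such as Rank, keep their numeric dtype here, whereas transposing a frame generally leaves every column with the object dtype.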
I have a dataframe which contains a variety of different values that indicate missingness. I modified it in a way that now they should be all specified as 'NaN' like this:
import numpy as np
import pandas as pd
data = {'Name':['Tom', 'nick', '-', 'jack'],
'Age':['20', '0', '19', ''],
'color':['yellow','Na','blue','red']}
df = pd.DataFrame(data)
def missing_values(x):
    missingness_indicators = ["NaN","NAN","NA","Na","n/a", "na", "--","-"," ", "-inf", "inf", "nan", "None", "0", "", np.nan]
    modified_df = df.replace(missingness_indicators,'NaN')
    modified_df["color"] = modified_df.loc[:,'color'].fillna(method='bfill', axis=0) #LOCF
    return modified_df
But pandas functions that rely on recognized missing values do not work on the result. I think this is because I did not import the dataframe with those values specified as missing (doing so would have led to other problems, as I'm working on a bigger dataset than this example).
I am looking now for a way to apply pandas functions like .fillna on this dataset.
Use np.nan, not the string 'NaN', as the replacement value, so that pandas recognizes the entries as missing. Replace:
modified_df = df.replace(missingness_indicators,'NaN')
with
modified_df = df.replace(missingness_indicators, np.nan)
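To illustrate the effect, here is a minimal sketch on the example data from the question. The indicator list is shortened for readability, and bfill is used as in the question's function; note that bfill fills a gap from the next valid value below it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'nick', '-', 'jack'],
                   'Age': ['20', '0', '19', ''],
                   'color': ['yellow', 'Na', 'blue', 'red']})

# Shortened indicator list, for illustration only
missingness_indicators = ['NaN', 'NA', 'Na', 'n/a', '--', '-', ' ', '0', '']

# Real NaN values instead of the string 'NaN'
modified_df = df.replace(missingness_indicators, np.nan)

# Functions that rely on recognized missing values now see the gaps
print(modified_df.isna().sum())

# Fill each gap in 'color' with the next valid value
modified_df['color'] = modified_df['color'].bfill()
print(modified_df)
```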
I have a CSV file with a bunch of different columns. The columns that I am interested in are 'item', 'OrderDate' and 'Units'.
In my IDE I am trying to generate a bar chart of the number of pencils sold on each individual 'OrderDate'. What I am trying to do is go down through the 'item' column using pandas and check whether the item is a pencil. If it is, it should be added to the graph; if it is not, nothing should happen.
I think I have made the code a bit long-winded.
I have the code going down through the 'item' column and checking whether each entry is a pencil, but I can't figure out what to do next.
import pandas as pd
import matplotlib.pyplot as plt
d = {'item' : pd.Series(['Pencil', 'Marker', 'Pencil', 'Headphones', 'Pencil', 'The moon', 'Wish you were here album']),
'OrderDate' : pd.Series(['5/15/2020', '5/16/2020', '5/16/2020','5/15/2020', \
'5/16/2020', '5/17/2020','5/16/2020','5/16/2020','5/17/2020']),
'Units' : pd.Series([4, 3, 2, 1, 3, 2, 4, 2, 3])}
df = pd.DataFrame.from_dict(d)
df.plot(kind='bar', x='OrderDate', y='Units')
item_col = df['item']
pencil_binary = item_col.str.count('Pencil')
for entry in item_col:
    if entry == 'Pencil':
        print("i am a pencil")
    else:
        print("i am not a pencil")
print(df)
plt.plot()
plt.show()
If I understood correctly you want to plot the number of pencils sold per day. For that, you can just filter the dataframe and keep only rows about pencils, and then use a barchart.
Here's a reproducible example that also handles several rows sharing the same date:
import pandas as pd
import matplotlib.pyplot as plt
d = {'item' : pd.Series(['Pencil', 'Marker', 'Pencil', 'Headphones', 'Pencil', 'The moon', 'Wish you were here album']),
'OrderDate' : pd.Series(['5/15/2020', '5/16/2020', '5/16/2020','5/15/2020', \
'5/16/2020', '5/17/2020','5/16/2020','5/16/2020','5/17/2020']),
'Units' : pd.Series([4, 3, 2, 1, 3, 2, 4, 2, 3])}
df = pd.DataFrame.from_dict(d)
# This dataframe only has pencils
df_pencils = df[df.item == 'Pencil']
# Sum the units per date, then plot the resulting Series as a bar chart
df_pencils.groupby('OrderDate')['Units'].sum().plot(kind='bar')
df.plot(kind='bar', x='OrderDate', y='Units')
The groupby groups all rows with the same date and, for each group, adds up the Units sold.
In fact, when you do this:
df_pencils.groupby('OrderDate')['Units'].sum()
this is the output:
OrderDate
5/15/2020 4
5/16/2020 5
Name: Units, dtype: int64
If you want a one-liner, it's:
df[df.item == 'Pencil'].groupby('OrderDate')['Units'].sum().plot(kind='bar')
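One possible refinement, sketched here as an assumption rather than as part of the answer: the dates are strings, so the bars are ordered alphabetically, which would no longer match calendar order once the data spans several months. Parsing them with pd.to_datetime first avoids that. The data below is a trimmed, equal-length version of the example, and pencils_per_day is a name made up for the illustration.

```python
import pandas as pd

# Trimmed, equal-length version of the example data
df = pd.DataFrame({
    'item': ['Pencil', 'Marker', 'Pencil', 'Headphones', 'Pencil'],
    'OrderDate': ['5/15/2020', '5/16/2020', '5/16/2020', '5/15/2020', '5/16/2020'],
    'Units': [4, 3, 2, 1, 3],
})

# Parse the month/day/year strings into real timestamps
df['OrderDate'] = pd.to_datetime(df['OrderDate'], format='%m/%d/%Y')

# Keep only pencils, add up the units per day, and order the days chronologically
pencils_per_day = (
    df[df['item'] == 'Pencil']
    .groupby('OrderDate')['Units']
    .sum()
    .sort_index()
)

print(pencils_per_day)
```

Calling pencils_per_day.plot(kind='bar') on the result would then draw the chart.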