Can't properly replace blank values using pandas - python

I'm a Python beginner, so I'm practicing some data analysis using pandas on a dataframe containing a list of Michelin-starred restaurants (restaurants_df).
When I show, for example, the first 5 rows, I notice that in the "price" column (object type) row 4 has a blank value:
In [ ]: restaurants_df.head()
Out[ ]:
name year latitude longitude city region zipCode cuisine price
0 Kilian Stuba 2019 47.348580 10.17114 Kleinwalsertal Austria 87568 Creative $
1 Pfefferschiff 2019 47.837870 13.07917 Hallwang Austria 5300 Classic cuisine $
2 Esszimmer 2019 47.806850 13.03409 Salzburg Austria 5020 Creative $
3 Carpe Diem 2019 47.800010 13.04006 Salzburg Austria 5020 Market cuisine $
4 Edvard 2019 48.216503 16.36852 Wien Austria 1010 Modern cuisine
Then I check how many NaN values are in each column. In the case of the price column there are 151 missing values:
In [ ]: restaurants_df.isnull().sum()
Out[ ]: name 0
year 0
latitude 0
longitude 0
city 2
region 0
zipCode 149
cuisine 0
price 151
dtype: int64
Afterwards, I replace those values with the string "No Price" and confirm that they have all been replaced.
In [ ]: restaurants_df["price"].fillna("No Price", inplace=True)
restaurants_df.isnull().sum()
Out[ ]: name 0
year 0
latitude 0
longitude 0
city 0
region 0
zipCode 0
cuisine 0
price 0
dtype: int64
However, when I show the first 5 rows, the problem persists.
In [ ]: restaurants_df.head()
Out[ ]:
name year latitude longitude city region zipCode cuisine price
0 Kilian Stuba 2019 47.348580 10.17114 Kleinwalsertal Austria 87568 Creative $
1 Pfefferschiff 2019 47.837870 13.07917 Hallwang Austria 5300 Classic cuisine $
2 Esszimmer 2019 47.806850 13.03409 Salzburg Austria 5020 Creative $
3 Carpe Diem 2019 47.800010 13.04006 Salzburg Austria 5020 Market cuisine $
4 Edvard 2019 48.216503 16.36852 Wien Austria 1010 Modern cuisine
Any idea why this is happening and how I can solve it? Thanks in advance!

Viewing the dataset over at Kaggle shows that the first four restaurants have five '$' while the fifth has four '$'. So I'm guessing that Jupyter Notebook is just not displaying all the '$' characters; internally the data is correct.
To double-check whether I'm correct, try running
df.price
and see what you get. I think this might have something to do with Jupyter's HTML handler when it tries to display four dollar signs. You can look at this issue, which is similar to yours.
If you're bothered by this, simply replace the '$' symbols with a number using something like
df.replace({'price': {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4, '$$$$$': 5}})
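For instance, printing through to_string() bypasses the notebook's HTML/MathJax rendering, so the raw values become visible (a minimal sketch, where df stands for the asker's restaurants_df):
# print() renders plain text, so all '$' characters show up as-is
print(df["price"].head().to_string())
# or map the '$' strings to numeric price levels
df = df.replace({'price': {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4, '$$$$$': 5}})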

What I understand is that you are dealing with both blank values and null values. These are handled differently. Check out this question to understand how to handle them.
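A minimal sketch of the difference, assuming the usual imports:
import numpy as np
import pandas as pd

s = pd.Series(["$", "", np.nan])
print(s.isnull())          # only the NaN is True; '' is an ordinary string
s = s.replace("", np.nan)  # treat blanks as missing too
s = s.fillna("No Price")   # now both kinds are filled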

I don't think pandas will recognize cells containing '' as null. For instance:
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.array([[1, 2, ''], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
then:
df2.isnull()
a b c
0 False False False
1 False False False
2 False False False
see here, and try (note that this option treats inf values as NA; it does not affect empty strings):
pd.options.mode.use_inf_as_na = True
EDIT:
you could also try replacing the blanks with:
df2.replace({'': 'No Price'}, inplace=True)
EDIT2: I believe @AKareem has the solution, but to expand on it, you can use this to escape the LaTeX:
restaurants_df.replace({'price': {
    '$': r'\$',
    '$$': r'\$$',
    '$$$': r'\$$$',
    '$$$$': r'\$$$$',
    '$$$$$': r'\$$$$$'}},
    inplace=True)


Using groupby() on an exploded pandas DataFrame returns a data frame where indices are repeated but have different attributes

I am working with a dataset found on Kaggle (https://www.kaggle.com/datasets/shivamb/netflix-shows) which has data regarding different productions on Netflix. I am looking to answer the following question: how many productions are produced by each country?
Because some productions are essentially co-productions, meaning that the column named 'country' can contain comma-separated strings of the form 'Country1, Country2', etc., I have split the question into 2 parts:
The first part is to filter the dataframe and keep only the rows which are single-country productions, so we can find out how many productions each country produced on its own. I encountered no problem during the first phase.
The second part is to take the co-productions into account: every time a country is found in a co-production, its total number of productions is incremented by 1.
The method I chose for the second part is to split the comma-separated country strings into lists, explode the 'country' column, then groupby('country') and take the .sum() of the attributes.
The problem I have encountered is that when I groupby('country'), there are countries which are repeated; for example, United States is found with two different 'count' values, as seen in the pics below.
(I should underline that I have added to the original dataframe a column named 'count' which equals 1 for every row, and that df_no_nan_country = df[df['country'].notna()].)
Why is that happening and what can I do to fix it?
This is my code:
import pandas as pd
df = pd.read_csv('netflix_titles.csv')
df['count'] = 1
df_no_nan_country = df[df['country'].notna()]
df_split_countries = df_no_nan_country.assign(country=df['country'].str.split(',')).explode('country')
df_split_countries.groupby('country').sum()
df_split_countries_max = df_split_countries.groupby('country').sum()[df_split_countries.groupby('country').sum()['count']>100]
df_split_countries_max.head(30)
Try cleaning the country column before the groupby:
df_split_countries['country'] = df_split_countries['country'].str.strip()
df_split_countries['country'] = df_split_countries['country'].map(lambda x: x.encode('ascii', errors='ignore').decode())
I think countries are repeated because they are not exactly the same string values; maybe one contains an extra space...
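To see why stripping matters: after exploding, each value keeps the space that followed the comma, so pandas groups them separately. A tiny illustration (assuming pandas is imported as pd):
s = pd.Series(["United States", " United States"])
print(s.nunique())              # 2: the strings differ by a leading space
print(s.str.strip().nunique())  # 1: after stripping they collapse into one group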
Have you considered using a flag for co-productions and then exploding?
import pandas as pd
df = pd.read_csv('netflix_titles.csv')
# drop null countries
df = df[df["country"].notnull()].reset_index(drop=True)
# Flag for coproductions
df["countries_coproduction"] = df['country']\
.str.split(',').apply(len).gt(1)
# Explode
df = df.assign(
country=df['country'].str.split(','))\
.explode('country')
# Clean coutries
# removing lead/tail whitespaces
df["country"] = df["country"].str.lstrip().str.rstrip()
Then you can easily extract the top 10 countries in each case as
grp[grp["countries_coproduction"]].nlargest(10, "n")\
    .reset_index(drop=True)
countries_coproduction country n
0 True United States 872
1 True United Kingdom 387
2 True France 269
3 True Canada 264
4 True Germany 159
5 True China 96
6 True Spain 87
7 True Belgium 81
8 True India 74
9 True Australia 73
and
grp[~grp["countries_coproduction"]].nlargest(10, "n")\
    .reset_index(drop=True)
countries_coproduction country n
0 False United States 2818
1 False India 972
2 False United Kingdom 419
3 False Japan 245
4 False South Korea 199
5 False Canada 181
6 False Spain 145
7 False France 124
8 False Mexico 110
9 False Egypt 106

Get Max Sum value of a column in Pandas

I have a csv like this:
Country Values Address
USA 1 AnyAddress
USA 2 AnyAddress
Brazil 1 AnyAddress
UK 3 AnyAddress
Australia 0 AnyAddress
Australia 0 AnyAddress
I need to group the data by Country and sum Values, then return a string with the country that has the maximum summed value. In this case that is USA, which ties with UK at 3 but is lexicographically greater, so the output looks like this:
"Country: USA, Value: 3"
When I use groupby in pandas I am not able to get a string with the country name and value. How can I do that?
Try:
max_values = df.groupby('Country').sum().reset_index().max().values
your_string = f"Country: {max_values[0]}, Value: {max_values[1]}"
Output:
>>> print(your_string)
Country: USA, Value: 3
You can do:
df.groupby("Country", as_index=False)["Values"].sum()\
.sort_values(["Values", "Country"], ascending=False).iloc[0]
Outputs:
Country USA
Values 3
Name: 3, dtype: object
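If you also want the formatted string the question asks for, a small follow-up (assuming the selected row is stored in top):
top = df.groupby("Country", as_index=False)["Values"].sum()\
    .sort_values(["Values", "Country"], ascending=False).iloc[0]
print(f"Country: {top['Country']}, Value: {top['Values']}")
# Country: USA, Value: 3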

DataFrame from variable and filtering data

I have a DataFrame and want to extract 3 columns from it, but one of them is an input from the user. I made a list, but I need it to be iterable so I can run a for loop over it.
So far I managed by making a dictionary from 2 of the columns, making a list of each and zipping them... but I really need all 3 columns...
My code:
Data=pd.read_csv(----------)
selec=input("What month would you want to show?")
NewData=[(Data['Country']),(Data['City']),(Data[selec].astype('int64'))]
#here I try to iterate:
iteration=[i for i in NewData if NewData[i]<=25]
print (iteration)
TypeError: list indices must be integers or slices, not Series
My CSV is the following:
I want to be able to choose the month with the variable "selec" and filter the results of the month I've chosen... so the output for selec="Feb" would be:
I also tried with loc/iloc, but no luck at all (unhashable type: 'list').
See the below example for how you can:
select specific columns from a DataFrame by providing a list of columns between the selection brackets (link to tutorial)
select specific rows from a DataFrame by providing a condition between the selection brackets (link to tutorial)
iterate rows of a DataFrame, although I don't suppose you need it - if you'd like to keep working with the DataFrame after filtering it, it's better to use the method mentioned above (you won't have to put the rows back together, and it will likely be more performant because pandas is optimized for bulk operations)
import pandas as pd
# this is just for testing, instead of pd.read_csv(...)
df = pd.DataFrame([
    dict(Country="Spain", City="Madrid", Jan="15", Feb="16", Mar="17", Apr="18", May=""),
    dict(Country="Spain", City="Galicia", Jan="1", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="France", City="Paris", Jan="0", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="Algeria", City="Argel", Jan="20", Feb="28", Mar="29", Apr="30", May=""),
])
print("---- Original df:")
print(df)
selec = "Feb" # let's pretend this comes from input()
print("\n---- Just the 3 columns:")
df = df[["Country", "City", selec]] # narrow down the df to just the 3 columns
df[selec] = df[selec].astype("int64") # convert the selec column to proper type
print(df)
print("\n---- Filtered dataframe:")
df1 = df[df[selec] <= 25]
print(df1)
print("\n---- Iterated & filtered rows:")
for row in df.itertuples():
    # we could also use row[3] instead of getattr(...)
    if getattr(row, selec) <= 25:
        print(row)
Output:
---- Original df:
Country City Jan Feb Mar Apr May
0 Spain Madrid 15 16 17 18
1 Spain Galicia 1 2 3 4
2 France Paris 0 2 3 4
3 Algeria Argel 20 28 29 30
---- Just the 3 columns:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
3 Algeria Argel 28
---- Filtered dataframe:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
---- Iterated & filtered rows:
Pandas(Index=0, Country='Spain', City='Madrid', Feb=16)
Pandas(Index=1, Country='Spain', City='Galicia', Feb=2)
Pandas(Index=2, Country='France', City='Paris', Feb=2)

How to output the top 5 of a specific column along with associated columns using python?

I've tried to use df2.nlargest(5, ['1960']), which gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it's outputting all the columns. I just want it to include the columns "Country Name" and "1960" only, sorted by "1960".
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000
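A minimal sketch of one way to get that output (assuming the frame is df2 and '1960' is a numeric column): select the two columns first, then let nlargest do the sorting:
top5 = df2[['Country Name', '1960']].nlargest(5, '1960')
print(top5.to_string(index=False))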

Problem with New Column in Pandas Dataframe

I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
    df = answer_one()
    columns_to_keep = ['Self-citations', 'Citations']
    df = df[columns_to_keep]
    df['ratio'] = df['Self-citations'] / df['Citations']
    return df
answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the suggested link and none of them worked. I've tried to convert the values to floats using a few different methods, including .astype('float'), float(df['A']), and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations'], but none have worked so far.
Without the exact dataframe it is difficult to say, but it is most likely a casting problem.
Let's build an MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce the type, you can cast the dataframe using the astype method:
df.astype(float)
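If the kernel is an old one where integer division truncates (e.g. Python 2), explicitly casting one operand to float before dividing side-steps the problem; a hedged sketch:
# force float division even under Python 2 semantics
df['ratio'] = df['Self_cite'].astype(float) / df['Citations']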
