Python dataframe loc

Why does df.loc[df['xx'] == 'yy', 'oo'].values return an empty array (array([], dtype=float64)) instead of returning 0? When I display df, that specific spot shows 0, which I changed myself from a NaN.
I want the loc lookup to return the 0 it is supposed to.

The df.loc[df['xx'] == 'yy', 'oo'].values result contains the filtered data that matches the condition. If you want the count of matched rows, use the count method, as below:
df.loc[df['xx'] == 'yy', 'oo'].count()
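In other words, an empty array means the condition matched no rows, which often happens when the stored value differs from what is displayed (stray whitespace, case, or dtype differences). A minimal sketch with a made-up frame:
import pandas as pd

df = pd.DataFrame({'xx': ['yy', 'zz'], 'oo': [0, 5]})

# a matching condition returns the selected values ...
print(df.loc[df['xx'] == 'yy', 'oo'].values)   # array([0])
# ... while a condition that matches nothing returns an empty array
print(df.loc[df['xx'] == 'YY', 'oo'].values)   # array([], dtype=int64)
# count() gives the number of matched (non-NA) rows
print(df.loc[df['xx'] == 'yy', 'oo'].count())  # 1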

Related

group by a dataframe by the max column value

I have this dataframe and I need to keep only the rows with the max value of the 'revisão' column for each value of the 'mesano' column:
groupede = dfgc.groupby(['mesano','description','paymentCategories.description','paymentCategories.type'])
result = groupede['revisao','paymentCategories.interval.totalPrice'].agg('max','sum')
and I also tried
grouped=dfgc.groupby(['mesano','description','paymentCategories.description','paymentCategories.type','paymentCategories.interval.totalPrice'], as_index=False)['revisao'].max()
but this code is wrong
You can sort the dataframe by 'revisão' in descending order and then drop duplicates of 'mesano', keeping only the first (i.e. highest) entry per value, which effectively filters by the max value:
df.sort_values(by=['revisão'], ascending=False).drop_duplicates(subset=['mesano'], keep='first')
We can keep only the rows for which the value of a particular column equals the maximum value of that column. This can be done using boolean index filtering, where True in the boolean index means keeping the row and False means dropping it. For your particular use case, you can use
df_max_revisão = df[df['revisão'] == df['revisão'].max()]
where df['revisão'] == df['revisão'].max() generates the boolean index, and df[boolean_index] gives you the rows marked True.
If you want only the values in the 'mesano' column, you can filter the dataset and choose those by using
df_mesano = df['mesano']
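Note that df['revisão'].max() is the global maximum; if you need the maximum within each 'mesano' group, a groupby/transform sketch (using a made-up two-column frame; your real frame has more columns):
import pandas as pd

df = pd.DataFrame({'mesano': ['2021-01', '2021-01', '2021-02'],
                   'revisão': [1, 3, 2]})

# keep only the rows holding the max 'revisão' within each 'mesano'
mask = df['revisão'] == df.groupby('mesano')['revisão'].transform('max')
print(df[mask])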

How do I search a pandas dataframe to get the row with a cell matching a specified value?

I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and searching for values within a row including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value in the 'name' column at row i using the at() accessor
    sel = df_selection_names.at[i, 'name']
    # this line finds the rows in df_everything whose 'name' matches sel
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be way off, but I'm feeling pretty confused about everything at this point!
You can use pandas.Series.isin to match against all the selected names at once:
https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
df_everything[df_everything['name'].isin(df_selection_names["name"])]
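To then print each name/suggestion pair, one option is to merge the two frames on name; a sketch with hypothetical stand-in frames (the message wording is illustrative):
import pandas as pd

# hypothetical stand-ins for df_selection_names and df_everything
df_selection_names = pd.DataFrame({'name': ['grilled']})
df_everything = pd.DataFrame({'name': ['grilled', 'raw'],
                              'suggestion': ['charcoal', 'tartare']})

matched = df_everything.merge(df_selection_names, on='name')
for _, row in matched.iterrows():
    print(f"{row['suggestion']} is suggested for {row['name']}")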

Get the index of a value passed to map() in pandas

I have a DataFrame that's read in from a csv. The data has various problems. The one I'm concerned about for this post is that some data is not in the column it should be. For example, '900' is in the zipcode column, or 'RQ' is in the language column when it should be in the nationality column. In some cases these "misinsertions" are just anomalies and can be converted to NaN. In other cases they indicate that the values have shifted one column to the right or the left, such that the whole row has misinserted data. I want to remove these shifted lines from the DataFrame and try to fix them separately.

My proposed solution has been to keep track of the number of bad values in each row as I am cleaning each column. Here is an example with the zipcode column:
import re
from numpy import nan

def is_zipcode(value: str, regx):
    # keep the value if it looks like a zipcode, otherwise replace it with NaN
    if regx.match(value):
        return value
    else:
        return nan

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['ZIPCODE'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
I'm doing something like this on every column in the dataframe, depending on the data in that column; e.g. for the 'Nationality' column I'll look up the values in a JSON file of nationality codes.
What I haven't been able to achieve is to keep count of the bad values in a row. I tried something like this:
def is_zipcode(value: str, regx):
    # return 0 for a good value and 1 for a bad one
    if regx.match(value):
        return 0
    else:
        return 1

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['badValues'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
# where is_nationality() similarly returns 1 if it is a bad value
df['badValues'] = df['badValues'] + df['Nationalities'].map(is_nationality, na_action='ignore')
And this can work to keep track of the bad values. What I'd like to do is somehow combine the process of cleaning the data and getting the bad values. I'd love to do something like this:
def is_zipcode(value: str, regx):
    if regx.match(value):
        return value
    else:
        # add 1 to the value of df['badValues'] at the corresponding index
        return nan
The problem is that I don't think it's possible to access the index of the value being passed to the map function. I looked at these two questions (one, two) but I didn't see a solution to my issue.
I guess this would do what you want ...
is_zipcode_mask = df['ZIPCODE'].str.match(regex_for_zipcode)
print(len(df[is_zipcode_mask]))
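Building on that mask idea, you could validate and count without touching the index at all: each column's inverted mask contributes 1 per bad value, and the masks sum per row. A sketch, assuming your df from above; is_valid_nat and valid_codes are hypothetical stand-ins for your JSON nationality lookup:
zip_regex = "^[0-9]{5}(?:-[0-9]{4})?$"

# True where the value looks like a zipcode; na=True treats NaN as not-bad,
# mirroring na_action='ignore' in the map() version
is_zip = df['ZIPCODE'].str.match(zip_regex, na=True)

# hypothetical second check: a boolean Series from your nationality lookup
# is_valid_nat = df['Nationalities'].isin(valid_codes)

# each inverted mask contributes 1 per bad value; sum the masks per row
df['badValues'] = (~is_zip).astype(int)  # + (~is_valid_nat).astype(int)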

Can someone help me understand what .index is doing in this code?

I have the following code:
print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))
I understand what the query is trying to do, but I'm lost on why you need to add the .index portion.
What is .index doing in this particular code?
For context here is what the dataframe looks like:
I looked at the pandas documentation for DataFrame.index:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html
but unfortunately it was too vague for me to make sense of it.
The DataFrame.index holds the index label of each record in your dataframe. With the default index it is unique to each row even if two rows have the same data in every column. DataFrame.drop takes "index : single label or list-like" and drops the rows whose labels match.
So from the code above,
df[df['Quantity'] == 0] gets the rows that have Quantity == 0,
df[df['Quantity'] == 0].index gets the index labels of all rows satisfying that predicate, and
df.drop(df[df['Quantity'] == 0].index) drops the rows with those labels.
Hope this helps!
I checked df.drop()'s documentation; it says that drop works by index. This code first finds the rows that have Quantity 0, but because drop() works with indexes, it takes those rows' index labels rather than the rows themselves. That's what .index is for.
https://pandas.pydata.org/pandas-docs/stable//reference/api/pandas.DataFrame.drop.html
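A tiny runnable demo of the drop-by-index pattern, using a made-up two-row frame:
import pandas as pd

df = pd.DataFrame({'Quantity': [0, 3], 'Weight': [8, 16]})

idx = df[df['Quantity'] == 0].index  # the index labels of the zero-quantity rows
print(df.drop(idx).rename(columns={'Weight': 'Weight (oz.)'}))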

Pandas groupby shifting and counting at the same time

Basically I am trying to take the previous row for the combination of ['dealer','State','city']. If there are multiple rows for this combination, I get the shifted value of the combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I am then taking this ShiftBY_D_S_C column and trying to take the count for the ['ShiftBY_D_S_C','State','city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
The table below shows what I am trying to do, and it works well. But when all the rows in the ShiftBY_D_S_C column are null this does not work, as it has all null values. Any suggestions?
I am trying to get the NewColumn values like below when all the values in ShiftBY_D_S_C are NaN.
You could simply handle the special case that you describe with an if/else case:
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = df.groupby(...)
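Spelled out with the column names from the question, the whole flow would look like this (a sketch, untested against your data):
df['ShiftBY_D_S_C'] = df.groupby(['dealer', 'State', 'city'])['dealer'].shift(1)

if df['ShiftBY_D_S_C'].isna().all():
    # nothing repeats, so every count would be 0; default the whole column to 1
    df['NewColumn'] = 1
else:
    df['NewColumn'] = df.groupby(['ShiftBY_D_S_C', 'State', 'city'])['ShiftBY_D_S_C'].transform('count') + 1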
