import pandas as pd
businesses = pd.read_json(businesses_filepath, lines=True, encoding='utf_8')
restaurantes = businesses['Restaurants' in businesses['categories']]
I would like to remove the rows that do not have 'Restaurants' in the categories column (this column contains lists), but the code above raised the error 'KeyError: False', and I would like to understand why and how to fix it.
The expression 'Restaurants' in businesses['categories'] evaluates to the single boolean value False: for a Series, the in operator checks membership in the index, not in the values. That False is then passed to the bracket indexing operator of the DataFrame businesses, which does not contain a column called False and thus raises a KeyError.
What you are looking to do is something called boolean indexing, which works like this:
businesses[businesses['categories'] == 'Restaurants']
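For illustration, here is a minimal, self-contained sketch of boolean indexing; the toy data is hypothetical, and categories holds plain strings here rather than the lists in your data:

import pandas as pd

# Hypothetical toy frame; 'categories' holds plain strings in this sketch
businesses = pd.DataFrame({'name': ['A', 'B', 'C'],
                           'categories': ['Restaurants', 'Bars', 'Restaurants']})

mask = businesses['categories'] == 'Restaurants'  # boolean Series: True, False, True
print(businesses[mask])                           # keeps only the rows where mask is True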
If you find that your data contains spelling variations or alternative restaurant-related terms, the following may be of benefit. Essentially you put your restaurant-related terms in restaurant_lst. The lambda function returns True if any of the items in restaurant_lst are contained within a given row of the categories column. The .loc indexer then filters out the rows which returned False.
restaurant_lst = ['Restaurant','restaurantes','diner','bistro']
restaurant = businesses.loc[businesses['categories'].apply(lambda x: any(restaurant_str in x for restaurant_str in restaurant_lst))]
The reason for this is that the Series class implements a custom in operator: it checks membership in the index and returns a single boolean, rather than operating element-wise the way == does. Here's a workaround:
businesses[['Restaurants' in c for c in list(businesses['categories'])]]
Hopefully this helps anyone who is looking for a substring match within the column rather than a full match.
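As a self-contained illustration of that workaround (with made-up list values, as described in the question):

import pandas as pd

# Made-up data; each 'categories' cell is a list, as in the question
businesses = pd.DataFrame({'name': ['A', 'B'],
                           'categories': [['Restaurants', 'Mexican'], ['Bars', 'Nightlife']]})

# One boolean per row: True when 'Restaurants' appears in that row's list
mask = ['Restaurants' in c for c in list(businesses['categories'])]
print(businesses[mask])  # keeps only row 'A'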
I think what you meant was:
businesses = businesses.loc[businesses['categories'] == 'Restaurants']
That will only keep rows whose category is 'Restaurants'.
None of the answers here actually worked for me,
businesses[businesses['categories'] == 'Restaurants']
obviously won't work since the value in 'categories' is not a string, it's a list, meaning the comparison will always fail.
What does, however, work, is converting the column into tuples instead of strings:
businesses['categories'] = businesses['categories'].apply(tuple)
That allows you to use the standard .loc thing:
businesses.loc[businesses['categories'] == ('Restaurants',)]
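A quick end-to-end sketch of this approach on made-up data (note that the comparison only matches rows whose whole category list was exactly ['Restaurants']):

import pandas as pd

# Made-up data with list-valued 'categories', as in the question
businesses = pd.DataFrame({'name': ['A', 'B'],
                           'categories': [['Restaurants'], ['Bars', 'Nightlife']]})

businesses['categories'] = businesses['categories'].apply(tuple)
print(businesses.loc[businesses['categories'] == ('Restaurants',)])  # row 'A' only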
How do I replace the cell values in a column when they contain a number in general, or contain something specific like a comma? I want to replace the whole cell value with something else.
Say, for example, a cell in the column contains a comma, meaning it holds more than one thing: I want it to be replaced by the text "ENM".
For a cell that has a number value, I want to replace it with 'UNM'.
As you have not provided examples of what your expected and current output look like, I'm making some assumptions below. It seems like you're trying to iterate through every value in a column and, if a value meets certain conditions, change it to something else.
Just a general pointer. Iterating through dataframes requires some important considerations for larger sizes. Read through this answer for more insight.
Start by defining a function you want to use to check the value:
def has_comma(value):
    if ',' in value:
        return True
    return False
Then use the pandas.Series.replace method on the column to make the change.
for i in df['column_name']:
    if has_comma(i):
        df['column_name'] = df['column_name'].replace([i], 'ENM')
    else:
        df['column_name'] = df['column_name'].replace([i], 'UNM')
Say you have a column, i.e. a pandas Series, called col.
The following code can be used to map values containing a comma to "ENM", as per your example:
col.mask(col.str.contains(','), "ENM")
You can overwrite your original column with this result if that's what you want to do. This approach will be much faster than looping through each element.
For mapping floats to "UNM", as per your example, the following would work:
col.mask(col.apply(isinstance, args=(float,)), "UNM")
Hopefully you get the idea.
See https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html for more info on masking
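Putting both masks together in one self-contained sketch (the column contents below are made up; order matters here, since replacing the floats first leaves only strings for .str.contains):

import pandas as pd

# Hypothetical mixed column: strings with and without commas, plus a float
col = pd.Series(['a,b', 'c', 1.5])

col = col.mask(col.apply(isinstance, args=(float,)), 'UNM')  # floats -> 'UNM'
col = col.mask(col.str.contains(',', na=False), 'ENM')       # commas -> 'ENM'
print(col.tolist())  # ['ENM', 'c', 'UNM']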
In a given DataFrame in pandas, is there a way to see all the Booleans present in filt in the code below?
filt = dataframe['tag1'] =='ABC'
filt
TLDR
It's possible. I think you should use indexing; it's extensively described here. To be more specific, you can use boolean indexing.
Code should look like this:
filt = df[df.loc[:, "tag1"] == 'ABC']
Now, what actually happens here:
df.loc[:, "tag1"] returns all rows (the : character) but limits the columns to just "tag1". Next, df.loc[:, "tag1"] == 'ABC' compares the returned values with "ABC", and as the result a grid of True/False values is created: a row is True where it was equal to "ABC", and so on. Now the grand finale: whenever you pass a grid of logical values to a DataFrame, they are treated as indicators of whether or not to include each row in the result. So if, say, the value at [0,0] in the passed grid is True, that row will be included in the result.
I understand it's hard to wrap one's head around this concept, but once you get it, it's super useful. The best approach is to just play around with the iloc[] and loc[] indexers.
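To make this concrete, here is a small sketch with made-up data. Note that the comparison by itself is already a boolean Series, so simply printing it shows every Boolean in filt:

import pandas as pd

df = pd.DataFrame({'tag1': ['ABC', 'XYZ', 'ABC']})

filt = df.loc[:, 'tag1'] == 'ABC'  # boolean Series: True, False, True
print(filt)                        # displays all the Booleans
print(df[filt])                    # rows 0 and 2 only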
New Python user here, so please pardon my ignorance if my approach seems completely off.
I am having troubles filtering rows of a column based off of their Character/Number format.
Here's an example of the DataFrame and Series
df = {'a': [1, 2, 4, 5, 6], 'b': [7, 8, 9, 10], 'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']}
The column I am looking to filter is the "target" column based on their character/number combos and I am essentially trying to make a dict like below
{'ABC1234': counts_of_format
'ABC123': counts_of_format
'123ABC': counts_of_format
'any_other_format': counts_of_format}
Here's my progress so far:
import re

col = df['target'].astype('string')
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
matches = re.findall(abc1234_pat, col)
I keep getting this error:
TypeError: expected string or bytes-like object
I've double-checked the dtype and it comes back as string. I've researched the TypeError, and the only solution I can find is converting it to a string.
Any insight or suggestion on what I might be doing wrong, or if this is simply the wrong approach to this problem, will be greatly appreciated!
Thanks in advance!
I am trying to create a dict that returns how many times the different character/number combos occur. For example, how many time does 3 characters followed by 4 numbers occur and so on.
(Your problem would have been understood earlier and more easily had you stated this in the question post itself rather than in a comment.)
By characters, you mean letters; by numbers, you mean digits.
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
Since you want to count occurrences of all character/number combos, this approach of using one concrete pattern would not lead very far. I suggest transforming the targets to a canonical form which serves as the key of your desired dict, e.g. substitute every letter with C and every digit with N (using your terms).
Of the many ways to tackle this, one is using str.translate together with a class which performs said transformation.
class classify():
    def __getitem__(self, key):
        return ord('C' if chr(key).isalpha() else 'N' if chr(key).isdigit() else None)

occ = df.target.str.translate(classify()).value_counts()  # .to_dict()
Note that this will purposely raise an exception if target contains non-alphanumeric characters.
You can convert the resulting Series to a dict with .to_dict() if you like.
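As a self-contained sketch using the sample targets from the question (the ordering of the resulting dict may vary):

import pandas as pd

df = pd.DataFrame({'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']})

class classify():
    def __getitem__(self, key):
        return ord('C' if chr(key).isalpha() else 'N' if chr(key).isdigit() else None)

# Canonical form of each target, then counts per form
occ = df.target.str.translate(classify()).value_counts()
print(occ.to_dict())  # e.g. {'CCCNNNN': 1, 'CCCNNN': 1, 'NNNCCC': 1, 'NCCCNN': 1}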
I want to loop over all the rows in a df, checking that two conditions hold and, if they do, replace the value in a column with something else. I've attempted to do this two different ways:
if (sales.iloc[idx]['shelf'] in ("DRY BLENDS", "LIQUID BLENDS")) & np.isnan(sales.iloc[idx]['traceable_blend']):
    sales.iloc[idx]['traceable_blend'] = False
and:
if (sales.iloc[idx]['shelf'] in ("DRY BLENDS", "LIQUID BLENDS")) & (sales.iloc[idx]['traceable_blend'] == np.NaN):
    sales.iloc[idx]['traceable_blend'] = False
By including print statements we've verified that the if statement is actually functional, but no assignment ever takes place. Once we've run the loop, there are True and NaN values in the 'traceable_blend' column, but never False. Somehow the assignment is failing.
It looks like this might've worked:
if (sales.iloc[idx]['shelf'] in ("DRY BLENDS", "LIQUID BLENDS")) & np.isnan(sales.iloc[idx]['traceable_blend']):
    sales.at[idx, 'traceable_blend'] = False
But I would still like to understand what's happening.
This, sales.iloc[idx]['traceable_blend'] = False, is chained indexing, and will almost never work: the assignment lands on a temporary copy rather than on the original DataFrame. In fact, you don't need to loop:
sales['traceable_blend'] = sales['traceable_blend'].fillna(sales['shelf'].isin(['DRY BLENDS', 'LIQUID BLENDS']))
Pandas offers two functions for checking for missing data (NaN or null): isnull() and notnull(). They return element-wise boolean Series. I suggest trying these instead of isnan().
You can also determine whether any value is missing in your Series by chaining .values.any().
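A self-contained sketch of the fillna approach (the sales data below is made up to mirror the question): NaNs are filled with the boolean isin result, while existing values are left untouched:

import numpy as np
import pandas as pd

# Made-up data mirroring the question's columns
sales = pd.DataFrame({'shelf': ['DRY BLENDS', 'LIQUID BLENDS', 'SNACKS'],
                      'traceable_blend': [np.nan, True, np.nan]})

print(sales['traceable_blend'].isnull())  # which values are missing
sales['traceable_blend'] = sales['traceable_blend'].fillna(
    sales['shelf'].isin(['DRY BLENDS', 'LIQUID BLENDS']))
print(sales)  # NaNs replaced by the isin booleans; the existing True is kept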
I'm working on the pandas tutorial at https://github.com/brandon-rhodes/pycon-pandas-tutorial/blob/master/Exercises-3.ipynb. It has exercises on the cast DataFrame (a sample is shown in the notebook).
There are two commands that are almost identical, except for one small difference, yet one outputs a Series while the other outputs a DataFrame. I don't understand why.
The first code is:
c1 = cast[cast.title == 'The Pink Panther']
c2 = c1.groupby('year')['n'].max()
type(c2)
and it makes c2 a Series. However, if I simply add another pair of square brackets around 'n', as in the following code, I get a DataFrame.
c1 = cast[cast.title == 'The Pink Panther']
c2 = c1.groupby('year')[['n']].max()
type(c2)
Can someone explain this? Thanks!
If you pass a list of columns, you get a DataFrame. It doesn't matter how many elements the list has. It would be confusing if it returned a Series just in the case of a one-item list, because sometimes your list might be programmatically generated. For instance, suppose you had:
columns_to_use = [column for blah in blahblah]
x = c1.groupby('year')[columns_to_use]
With the current behavior, you know that x will always be a DataFrame, because columns_to_use is a list. If this were not the case, you might get errors later because you wouldn't know ahead of time whether x would be a Series or DataFrame, so you wouldn't know, e.g., what methods you could call on it in later code.
Basically, if you pass a Series, np.ndarray, Index, or list to a DataFrame's __getitem__, you will get back a two-dimensional object (DataFrame).
Otherwise __getitem__ will attempt to retrieve a column (Series). This case includes string types, numbers, a custom class, etc.
DataFrameGroupBy behaves similarly to DataFrame in that if you pass any of the previously listed objects (plus tuples, apparently), you will get a two-dimensional object back (DataFrame); otherwise it will attempt to retrieve a one-dimensional object (Series).
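A quick check of that behavior on a toy frame (hypothetical data):

import pandas as pd

df = pd.DataFrame({'year': [1963, 1963, 2006], 'n': [1.0, 2.0, 3.0]})

print(type(df.groupby('year')['n']))    # SeriesGroupBy    -> .max() returns a Series
print(type(df.groupby('year')[['n']]))  # DataFrameGroupBy -> .max() returns a DataFrame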
In your first code block you are passing a string:
>>> type(c1['year'])
pandas.core.series.Series
In the second code block you pass a list containing a string to __getitem__
>>> type(c1[['year']])
pandas.core.frame.DataFrame
[] has multiple meanings in this case.
Passing a list of one element is generally not very useful, except that it nicely prints the column name at the top (though the Series still retains the column's name in its name attribute). The primary intent of passing a list to __getitem__ is to select multiple columns.
To see how brackets [] work on a class, check its __getitem__ method.
From pandas.core.frame.DataFrame:
if isinstance(key, (Series, np.ndarray, Index, list)):
    # either boolean or fancy integer index
    return self._getitem_array(key)
elif isinstance(key, DataFrame):
    return self._getitem_frame(key)
elif is_mi_columns:
    return self._getitem_multilevel(key)
else:
    return self._getitem_column(key)