How to fix the issue of a CategoricalIndex column in pandas? - python

I am working with Chicago crime data and want an aggregated count of the top 5 crimes for each region/community area. My code works, but I get an unwanted index and a CategoricalIndex-type column index in the resulting dataframe, which stops me from accessing particular columns for further data manipulation.
what I did:
crimes_2012 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', sep=',', error_bad_lines=False)
df = crimes_2012[['Primary Type', 'Location Description', 'Community Name']]
crime_catg = df.groupby(['Community Name', 'Primary Type'])['Primary Type'].count().unstack()
crime_catg = crime_catg[['THEFT', 'BATTERY', 'CRIMINAL DAMAGE', 'NARCOTICS', 'ASSAULT']]
crime_catg = crime_catg.dropna()
Here is my current output, which needs to be improved:
Here is my attempt:
When I tried the code below, I still didn't get a new index, and the index name displayed strangely in the output dataframe. Why does this happen, and how can I fix it?
Even when I tried to reindex the dataframe, it didn't get a new index:
crime_catg.reindex(inplace=True, drop=True)
Any idea how to fix this issue?

There are a couple of ways to handle this.
1) Keep the CategoricalIndex type and use the .add_categories method to update the valid categories, e.g. to fix your .reindex problem:
crime_catg.columns = crime_catg.columns.add_categories(['Community Name'])
2) Cast as pandas.Index:
crime_catg.columns = pd.Index(list(crime_catg.columns))
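For example, a minimal, self-contained sketch (with invented data standing in for the crime table) of why resetting the index fails on CategoricalIndex columns and how the cast fixes it:

import pandas as pd

df = pd.DataFrame({
    'Community Name': ['Austin', 'Austin', 'Loop'],
    'Primary Type': pd.Categorical(['THEFT', 'BATTERY', 'THEFT']),
})
crime_catg = df.groupby(['Community Name', 'Primary Type'])['Primary Type'].count().unstack()

# crime_catg.reset_index() can fail here with a TypeError: the columns are a
# CategoricalIndex, and 'Community Name' is not one of its categories, so it
# cannot be inserted as a new column label.
crime_catg.columns = pd.Index(list(crime_catg.columns))  # option 2: cast to a plain Index
# (option 1 instead: crime_catg.columns = crime_catg.columns.add_categories(['Community Name']))
crime_catg = crime_catg.reset_index()                    # now works
print(crime_catg)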

Related

Extracting top-N occurrences in a grouped dataframe using pandas

I've been trying to find the top-3 highest-frequency restaurant names under each type of restaurant.
The columns are:
rest_type - Column for the type of restaurant
name - Column for the name of the restaurant
url - Column used for counting occurrences
This was the code that ended up working for me after some searching:
df_1 = df.groupby(['rest_type', 'name']).agg('count')
datas = df_1.groupby(['rest_type'], as_index=False).apply(
    lambda x: x.sort_values(by='url', ascending=False).head(3)
)['url'].reset_index().rename(columns={'url': 'count'})
The final output was as follows:
I had a few questions pertaining to the above code:
How are we able to groupby using rest_type again for the datas variable after grouping by it earlier? Should it not give a missing-column error? The second groupby operation is a bit confusing to me.
What does the first generated column, level_0, signify? I tried the code with as_index=True and it created both an index and a column pertaining to rest_type, so I couldn't reset the index. Output below:
Thank you
You can use groupby a second time because rest_type is present in the index, which is recognized by groupby.
level_0 comes from the reset_index command because your index is unnamed.
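As a quick illustrative sketch (invented data): resetting a MultiIndex whose levels are unnamed materializes them as columns named level_0, level_1, and so on:

import pandas as pd

df = pd.DataFrame(
    {'x': [1, 2]},
    index=pd.MultiIndex.from_tuples([(0, 'a'), (1, 'b')]),  # unnamed levels
)
print(df.reset_index().columns)  # Index(['level_0', 'level_1', 'x'], dtype='object')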
That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:
import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20),  # looks like this is a unique identifier
                   })

def tops(s, n=3):
    return s.value_counts().sort_values(ascending=False).head(n)

df.groupby('rest_type')['name'].apply(tops, n=3)
Edit: here is an alternative that formats the result as a dataframe with informative column names:
(df.groupby('rest_type')
   .apply(lambda x: x['name'].value_counts().nlargest(3))
   .reset_index()
   .rename(columns={'name': 'counts', 'level_1': 'name'})
)
I have a similar case where the above query appears to work only partially. In my case the cooccurrence value always comes out as 1.
Here is my input data frame.
And my query is below:
top_five_family_cooccurence_df = (
    common_top25_cooccurance1_df.groupby('family')
    .apply(lambda x: x['related_family'].value_counts().nlargest(5))
    .reset_index()
    .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'})
)
I am getting the result shown, but the cooccurrence value is always 1.

Cannot load data from spreadsheet properly

I have a spreadsheet looking like this:
I'm trying to read it into a dataframe:
import pandas

def loading_nasdaq_info_from_spreadsheet():
    excel_file = 'nasdaq.xlsx'
    nasdaq_info_dataframe = pandas.read_excel(excel_file, index_col=0)
    # data cleaning
    nasdaq_info_dataframe.dropna()
    return nasdaq_info_dataframe

if __name__ == '__main__':
    df = loading_nasdaq_info_from_spreadsheet()
    print(df.loc['symbol'])
I constantly get
"raise KeyError(key) from err KeyError: 'Symbol'"
It doesn't matter which key I want to print or use; it is always the same error. What's even worse, even when I manually (in Excel) set everything to text, when I try
nasdaq_info_dataframe.applymap(lambda text: text.strip())
I get
'float' doesn't have strip()
I have been fighting with this for a few hours now, so please help me.
EDIT:
Printing
print(df.loc)
gives
<pandas.core.indexing._LocIndexer object at 0x1160e8778>
Printing
print(df.columns)
gives
Index(['Name', 'Sector', 'Industry'], dtype='object')
Furthermore, if I remove the multi-index by removing index_col=0, I still get the same KeyError when printing df.loc['Symbol'].
Printing df.head() gives
The problem is in df.loc['symbol']: with a single label, .loc selects by row label, not by column name.
Use df.loc[:, 'Symbol'] or df['Symbol'] instead.
If Symbol is the df's index, then apply df = df.reset_index() first.
You can find more detail in the pandas official guide, Indexing and selecting data.
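For instance, a small sketch (column names taken from the question, values invented) of the difference:

import pandas as pd

df = pd.DataFrame(
    {'Name': ['Apple Inc.'], 'Sector': ['Technology'], 'Industry': ['Consumer Electronics']},
    index=pd.Index(['AAPL'], name='Symbol'),
)

print(df.loc['AAPL'])              # .loc with one label selects a row by its index label
print(df['Name'])                  # a bare label in [] selects a column
print(df.reset_index()['Symbol'])  # move the 'Symbol' index back to a column first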

Sorting by columns after grouping generates an error

Could anyone please tell me why sorting generates an error here? I suspect it is related to indexing, but reset_index didn't solve the issue.
df['s'] = df.groupby(['ID', 'Date'], as_index=False)['Text_Data'] \
    .transform(lambda x: ' '.join(x)) \
    .sort_values(['ID', 'Date'])
KeyError: ('ID', 'Date')
What I was trying to do is sort the dataframe regardless of the grouping. In R you would do ungroup() first; I'm not sure whether anything similar is necessary in Python? Thanks
df.groupby(['ID','Date'],as_index=False)['Text_Data'].transform(lambda x : ' '.join(x))
The code above gives you a pandas Series that contains only the transformed Text_Data values. When you then apply sort_values(['ID', 'Date']), it raises an error because there are no ID and Date columns present in that Series.
You can sort your dataframe separately, then transform the column into a Series, delete the original column from the dataframe, and append the transformed column to it, like this:
df = df.sort_values(['ID', 'Date'])
df['s'] = df.groupby(['ID', 'Date'], as_index=False)['Text_Data'].transform(lambda x: ' '.join(x))
del df['Text_Data']
df['Text_Data'] = df['s'].values
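A self-contained sketch of this sort-then-transform approach, with invented data:

import pandas as pd

df = pd.DataFrame({
    'ID': [2, 1, 1],
    'Date': ['2020-01-02', '2020-01-01', '2020-01-01'],
    'Text_Data': ['c', 'a', 'b'],
})

df = df.sort_values(['ID', 'Date'])  # sort the whole frame first, ungrouped
df['s'] = df.groupby(['ID', 'Date'])['Text_Data'].transform(' '.join)
print(df)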

Pandas - Function to remove na

I'm trying to write a quick function but struggling, since I'm new to pandas/Python. I'm trying to remove NAs from two of my columns, but I keep getting this error. My code is the following:
def remove_na():
    df.dropna(subset=['Column 1', 'Column 2'])
    df.reset_index(drop=True)

df = remove_na()
df.head(3)
AttributeError: 'NoneType' object has no attribute 'dropna'
I want to use this function on different tables, hence why I thought it would make sense to create a function. However, I just don't understand why it's not working here when other, similar functions seem fine. Thank you.
I believe you can specify whether to remove NAs from columns or rows with the parameter axis, where 0 is the index (rows) and 1 is columns. This would drop every column that contains any NA:
df.dropna(axis =1, inplace=True )
I think you can use apply with dropna (note that this drops NAs per column independently, compacting each column upwards):
df = df.apply(lambda x: pd.Series(x.dropna().values))
print(df)
Or you can also try this:
df = df.dropna(axis=0, how='any')
You're getting an error because the dropna function here returns a new dataframe as its output rather than modifying df in place.
You can either save it to a dataframe:
df = df.dropna(subset=['Column 1', 'Column 2'])
or pass the argument inplace=True:
df.dropna(subset=['Column 1', 'Column 2'], inplace=True)
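Putting that together, a minimal corrected sketch of the function from the question (the column names are the question's placeholders):

import numpy as np
import pandas as pd

def remove_na(df):
    # return the cleaned frame instead of discarding dropna's result
    df = df.dropna(subset=['Column 1', 'Column 2'])
    return df.reset_index(drop=True)

df = pd.DataFrame({'Column 1': [1.0, np.nan], 'Column 2': [3.0, 4.0]})
df = remove_na(df)
print(df.head(3))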
In order to remove all the missing values from the data set at once using pandas, you can use the following. (Remember to specify the axis in the arguments so that the missing values are removed along the dimension you intend.)
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')

Unable to drop column, object has no attribute error

I have a csv file with the column titles: name, mfr, type, calories, protein, fat, sodium, fiber, carbo, sugars, vitamins, rating. When I try to drop the sodium column, I don't understand why I'm getting a 'NoneType' object has no attribute 'drop' error.
I've tried
df.drop(['sodium'],axis=1)
df = df.drop(['sodium'],axis=1)
df = df.drop (['sodium'], 1, inplace=True)
Here's your problem:
df = df.drop (['sodium'], 1, inplace=True)
This returns None (see the documentation) due to the inplace flag, so you no longer have a reference to your dataframe: df is now None, and None has no drop attribute.
My expectation is that you have done this (or something like it, perhaps dropping another column?) at some prior point in your code.
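In other words, pick exactly one of these two patterns (a minimal sketch with invented data):

import pandas as pd

df = pd.DataFrame({'name': ['x'], 'sodium': [100], 'rating': [50.0]})

df = df.drop(columns=['sodium'])             # returns a new dataframe: assign it
# or, equivalently:
# df.drop(columns=['sodium'], inplace=True)  # modifies df in place and returns None
print(df.columns)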
There is a similar question you should have a look at:
Delete column from pandas DataFrame using del df.column_name
According to the answer there,
df = df.drop(['sodium'], 1, inplace=True)
should rather be
df.drop(['sodium'], 1, inplace=True)
Although the first snippet,
df = df.drop(['sodium'], axis=1)
should work fine, if there is still an error, try
print(df.columns)
to make sure that the columns were actually read from the csv file.
Use pd.read_csv(r'File_Path_with_name'); if the column is missing, there may be an issue with how the csv file was read, and passing the full path as a raw string can sort that out.
