Pandas group events by year


I am very new to pandas but making progress...
I have the following dataframe:
I want to count the number of events that happened in each Month/Year, which I believe would produce something like the below
I have tried the following based on the article located here
group = df.groupby(['MonthYear', 'EventID']).count()
frequency = group['EventID'].groupby(level=0, group_keys=False)
print(frequency)
I then get an error (using VS Code) that states:
unable to open 'hashtable_class_helper.pxi'
I have had this before and it is usually when I have used the wrong case for my column names but I have verified they are correct.
Where am I going wrong?

You can use:
frequency = df.groupby('MonthYear')['EventID'].value_counts()
See documentation for more details

You could try an aggregation on top of the groupby:
df.groupby('MonthYear').agg({'EventID': 'count'})
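As a minimal sketch of the difference between the counting options, assuming a made-up frame (the question's actual data is not shown, so the values below are hypothetical): size() counts rows per MonthYear, while nunique() counts distinct EventID values per MonthYear.

```python
import pandas as pd

# Hypothetical sample data; the real column values are not shown in the question.
df = pd.DataFrame({
    "MonthYear": ["01/2020", "01/2020", "02/2020", "02/2020", "02/2020"],
    "EventID": [101, 102, 103, 104, 104],
})

# size() counts the rows in each group, one number per MonthYear.
events_per_month = df.groupby("MonthYear").size()

# nunique() counts the distinct EventID values in each group instead.
distinct_per_month = df.groupby("MonthYear")["EventID"].nunique()

print(events_per_month)
print(distinct_per_month)
```

Note that printing the object returned by a bare groupby(...) call, without an aggregation such as size() or count(), only shows the GroupBy object itself rather than any counts.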

Related

unable to sort excel values using pandas [duplicate]

New to Pandas, so maybe I'm missing a big idea?
I have a Pandas DataFrame of register transactions with shape like (500,4):
Time datetime64[ns]
Net Total float64
Tax float64
Total Due float64
I'm working through my code in a Python3 Jupyter notebook. I can't get past sorting any column. Working through the different code examples for sort, I'm not seeing the output reorder when I inspect the df. So, I've reduced the problem to trying to order just one column:
df.sort_values(by='Time')
# OR
df.sort_values(['Total Due'])
# OR
df.sort_values(['Time'], ascending=True)
No matter which column title, or which boolean argument I use, the displayed results never change order.
Thinking it could be a Jupyter thing, I've previewed the results using print(df), df.head(), and HTML(df.to_html()) (the last example is for Jupyter notebooks). I've also rerun the whole notebook from import CSV to this code. And, I'm also new to Python3 (from 2.7), so I get stuck with that sometimes, but I don't see how that's relevant in this case.
Another post has a similar problem, Python pandas dataframe sort_values does not work. In that instance, the ordering was on a column type string. But as you can see all of the columns here are unambiguously sortable.
Why does my Pandas DataFrame not display new order using sort_values?

df.sort_values(['Total Due']) returns a sorted DF, but it doesn't update DF in place.
So do it explicitly:
df = df.sort_values(['Total Due'])
or
df.sort_values(['Total Due'], inplace=True)
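A small sketch of this behaviour, assuming a made-up three-row frame with the same column names as the question: the first call discards its result, so df keeps its original order until the sorted frame is assigned back.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the question.
df = pd.DataFrame({
    "Time": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-02"]),
    "Total Due": [30.0, 10.0, 20.0],
})

# The sorted copy is returned and immediately thrown away.
df.sort_values("Total Due")
unchanged = df["Total Due"].tolist()

# Assigning the result back makes the new order stick.
df = df.sort_values("Total Due").reset_index(drop=True)
reordered = df["Total Due"].tolist()

print(unchanged)
print(reordered)
```

The reset_index(drop=True) call is optional; it only renumbers the rows after sorting.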

My problem, fyi, was that I wasn't returning the resulting dataframe, so PyCharm wasn't bothering to update said dataframe. Naming the dataframe after the return keyword fixed the issue.
Edit:
I had a bare return at the end of my method instead of
return df
which the debugger must have noticed, because df wasn't being updated in spite of my explicit, in-place sort.

How to access rows based on a condition in a grouped dataframe

I am new to Python and I want to access some rows for an already grouped dataframe (used groupby).
However, I am unable to select the row I want and would like your help.
The code I used for the groupby is shown below:
language_conversion = house_ads.groupby(['date_served','language_preferred']).agg({'user_id':'nunique',
'converted':'sum'})
language_conversion
Result shows:
For example, I want to access the number of Spanish-speaking users who received house ads. Using:
language_conversion[('user_id','Spanish')]
gives me KeyError('user_id','Spanish')
I get the same error when I try to create a new column.
Thanks for your help

Use this,
language_conversion.loc[(slice(None), 'Arabic'), 'user_id']
You can see the index entries (in this case tuples of length 2) using language_conversion.index

You should use this:
language_conversion.loc[(slice(None),'Spanish'), 'user_id']
slice(None) here selects all rows in the date level of the index.
If you have one particular date in mind, just replace slice(None) with that specific date.
The error occurs because language_conversion[...] looks the tuple up among the columns, while 'Spanish' is a value in the row index. With .loc, the row selection comes first and the column second. Follow the link to learn more about indexing.
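The selection above can be sketched end to end, assuming a made-up house_ads frame (the rows are hypothetical; only the column names come from the question). The xs call is an alternative that drops the language level from the result.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the question.
house_ads = pd.DataFrame({
    "date_served": ["2018-01-01", "2018-01-01", "2018-01-01",
                    "2018-01-02", "2018-01-02"],
    "language_preferred": ["Spanish", "Spanish", "English",
                           "Spanish", "English"],
    "user_id": ["a", "b", "c", "a", "d"],
    "converted": [1, 0, 1, 0, 0],
})

language_conversion = house_ads.groupby(
    ["date_served", "language_preferred"]
).agg({"user_id": "nunique", "converted": "sum"})

# Rows first (all dates, language == Spanish), then the column.
spanish_users = language_conversion.loc[(slice(None), "Spanish"), "user_id"]

# xs selects one value of a named index level and removes that level.
spanish_by_date = language_conversion.xs(
    "Spanish", level="language_preferred"
)["user_id"]

print(spanish_users)
print(spanish_by_date)
```

The .loc result keeps both index levels, while the xs result is indexed by date_served only, which is often easier to plot or join.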

How to filter df using str.extract

I am relatively new to coding and I tried to challenge myself with a personal project which is proving a bit more difficult than I have anticipated :-)
I am using a sales report that has both product info and product range information stored in the same column. I am trying to use the formatting of the column for filtering.
I filtered my data using:
range_df = df[df['Stroke_Range_Det'].str.contains('(^\s{12})+([a-zA-Z]+)')]
which returned the warning message "This pattern has match groups. To actually get the groups, use str.extract. return func(self, *args, **kwargs)". Instead of ignoring the warning, I tried to use str.extract but can't get the desired outcome.
I tried the code below, which didn't work.
range_df = df[df['Stroke_Range_Det'].str.extractall(pat = '(^\s{12})+([a-zA-Z]+)')]
Any suggestions?

extractall() already returns a DataFrame.
Use:
range_df = df['Stroke_Range_Det'].str.extractall(pat = '(^\s{12})+([a-zA-Z]+)')
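If the goal is still to filter rows rather than pull out the matched groups, one option is to keep str.contains and drop the capture groups, which is what triggers the warning. The sketch below assumes made-up column values and a simplified pattern (twelve leading whitespace characters followed by letters); both are assumptions, not the asker's actual data.

```python
import pandas as pd

# Hypothetical values: range rows are indented by 12 spaces, product rows are not.
df = pd.DataFrame({
    "Stroke_Range_Det": [
        " " * 12 + "RangeA",
        "ProductX 123",
        " " * 12 + "RangeB",
        " " * 4 + "Indented",
    ]
})

# No capture groups here, so str.contains has nothing to warn about.
pattern = r"^\s{12}[a-zA-Z]+"
mask = df["Stroke_Range_Det"].str.contains(pattern, regex=True, na=False)
range_df = df[mask]

# str.extract with a single group returns the captured name, NaN where no match.
names = df["Stroke_Range_Det"].str.extract(r"^\s{12}([a-zA-Z]+)", expand=False)

print(range_df)
print(names.dropna().tolist())
```

Using a boolean mask keeps the original rows and columns intact, whereas extractall returns a new frame of captured groups with its own index.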

How to get Full rows for groupby function for the following problem

Current Output
I am running the following code to group similar items, take the SUM, and then average it within each group. The statement works, but I need full rows in my output: the current output shows ORIGIN_AIRPORT only once for all of its DESTINATION_AIRPORT values. It should show ANC for SEA as well.
I hope my point is clear.
df=df.groupby(['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']).agg({'SUM': np.average})

IIUC try:
df=df.groupby(['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']).agg({'SUM': np.average}).reset_index()
add reset_index() at the end of the query.
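A minimal sketch of the effect, assuming a few made-up flights (the values are hypothetical): after reset_index() the group keys become ordinary columns, so ORIGIN_AIRPORT is repeated on every row. Passing as_index=False to groupby is an alternative that gives the same flat layout.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the question.
df = pd.DataFrame({
    "ORIGIN_AIRPORT": ["ANC", "ANC", "ANC", "SEA"],
    "DESTINATION_AIRPORT": ["SEA", "SEA", "PDX", "ANC"],
    "SUM": [10.0, 20.0, 5.0, 7.0],
})

# Grouped result: the two keys form a MultiIndex, which is displayed
# with repeated outer values left blank.
grouped = df.groupby(["ORIGIN_AIRPORT", "DESTINATION_AIRPORT"]).agg({"SUM": "mean"})

# reset_index() turns the keys back into regular columns.
flat = grouped.reset_index()

# Alternative: ask groupby not to build the index in the first place.
flat_alt = df.groupby(
    ["ORIGIN_AIRPORT", "DESTINATION_AIRPORT"], as_index=False
).agg({"SUM": "mean"})

print(flat)
```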

how to find difference in dates in pandas dataframe in Azure ML

Does Azure use a different syntax for finding the difference between dates and times, or is a package missing in Azure?
How do I find the difference between dates in a pandas dataframe in Azure ML?
I have two columns in a dataframe and need to store their difference in a third column. The problem is that this all runs fine in a Python IDE but not in Microsoft Azure.
My date format: 2015-09-25T01:45:34.372Z
I need to compute df['days'] = df['a'] - df['b']
I have tried almost all the syntax available on Stack Overflow.
Please help
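import pandas as pd

mylist = ['app_lastCommunicatedAt', 'app_installAt', 'installationId']

def finding_dates(df, mylist):
    # Convert every date column to datetime; installationId is not a date.
    for i in mylist:
        if i == 'installationId':
            continue
        df[i] = pd.to_datetime(df[i])
    # Absolute difference in whole days between the two date columns.
    df['days'] = abs((df[mylist[1]] - df[mylist[0]]).dt.days)
    return df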
When I call this function, it raises an error and does not accept the lines below continue.
I have also tried many other things, such as converting the dates to strings.

In my experience, this is caused by the function missing the dataframe_service decorator, which indicates that the function operates on a data frame; see https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python#dataframe_service. If you are not familiar with the decorator syntax (@), see https://www.python.org/dev/peps/pep-0318/.
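Independent of the Azure-specific setup, the date arithmetic itself can be sketched in plain pandas. The rows below are made up (only the column names and the timestamp format come from the question), and nothing here uses the Azure ML client library.

```python
import pandas as pd

# Hypothetical rows in the ISO 8601 format quoted in the question.
df = pd.DataFrame({
    "app_installAt": ["2015-09-20T01:45:34.372Z", "2015-09-01T10:00:00.000Z"],
    "app_lastCommunicatedAt": ["2015-09-25T01:45:34.372Z", "2015-09-01T22:00:00.000Z"],
    "installationId": ["x1", "x2"],
})

# Parse both date columns in a vectorized way; utc=True keeps them timezone-aware.
for col in ["app_installAt", "app_lastCommunicatedAt"]:
    df[col] = pd.to_datetime(df[col], utc=True)

# Subtracting two datetime columns gives a timedelta column;
# .dt.days takes the whole-day component of each difference.
df["days"] = (df["app_lastCommunicatedAt"] - df["app_installAt"]).abs().dt.days

print(df[["installationId", "days"]])
```

Calling pd.to_datetime on the whole column avoids the per-element list comprehension and is usually faster on large frames.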
