I am very new to pandas but making progress...
I have the following dataframe:
I want to count the number of events that have happened per Month/Year, which I believe would produce something like the below:
I have tried the following based on the article located here
group = df.groupby(['MonthYear', 'EventID']).count()
frequency = group['EventID'].groupby(level=0, group_keys=False)
print(frequency)
I then get an error (using VS Code) that states:
unable to open 'hashtable_class_helper.pxi'
I have had this before and it is usually when I have used the wrong case for my column names but I have verified they are correct.
Where am I going wrong?
You can use:
frequency = df.groupby('MonthYear')['EventID'].value_counts()
See documentation for more details
You could also try an aggregation on top of groupby:
df.groupby('MonthYear').agg({'EventID': 'count'})
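To make the difference between the two suggestions concrete, here is a minimal sketch on made-up data (the sample values are hypothetical; only the column names come from the question). count() gives one number per month, while value_counts() gives one number per month and event ID pair.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df = pd.DataFrame({
    "MonthYear": ["Jan-2020", "Jan-2020", "Feb-2020", "Feb-2020", "Feb-2020"],
    "EventID": [101, 102, 103, 103, 104],
})

# Number of event rows in each month.
events_per_month = df.groupby("MonthYear")["EventID"].count()

# Number of rows for each (month, event ID) pair.
per_event = df.groupby("MonthYear")["EventID"].value_counts()

print(events_per_month)
print(per_event)
```

If the aim is simply "how many events per month", the first form is probably the one wanted.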
New to Pandas, so maybe I'm missing a big idea?
I have a Pandas DataFrame of register transactions with shape like (500,4):
Time datetime64[ns]
Net Total float64
Tax float64
Total Due float64
I'm working through my code in a Python3 Jupyter notebook. I can't get past sorting any column. Working through the different code examples for sort, I'm not seeing the output reorder when I inspect the df. So, I've reduced the problem to trying to order just one column:
df.sort_values(by='Time')
# OR
df.sort_values(['Total Due'])
# OR
df.sort_values(['Time'], ascending=True)
No matter which column title, or which boolean argument I use, the displayed results never change order.
Thinking it could be a Jupyter thing, I've previewed the results using print(df), df.head(), and HTML(df.to_html()) (the last example is for Jupyter notebooks). I've also rerun the whole notebook from import CSV to this code. And, I'm also new to Python3 (from 2.7), so I get stuck with that sometimes, but I don't see how that's relevant in this case.
Another post has a similar problem: "Python pandas dataframe sort_values does not work". In that instance, the ordering was on a column of type string. But as you can see, all of the columns here are unambiguously sortable.
Why does my Pandas DataFrame not display new order using sort_values?
df.sort_values(['Total Due']) returns a sorted DF, but it doesn't update DF in place.
So do it explicitly:
df = df.sort_values(['Total Due'])
or
df.sort_values(['Total Due'], inplace=True)
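A small sketch of that behaviour, using a made-up three-row frame (the data is hypothetical): calling sort_values without keeping the result leaves df untouched, while assigning the result gives the sorted frame.

```python
import pandas as pd

# Hypothetical data; the column name follows the question.
df = pd.DataFrame({"Total Due": [30.0, 10.0, 20.0]})

# The sorted copy is returned and then discarded, so df keeps its order.
df.sort_values("Total Due")
unchanged = df["Total Due"].tolist()

# Keeping the returned frame gives the sorted result.
sorted_df = df.sort_values("Total Due")

print(unchanged)
print(sorted_df)
```

Note that the sorted frame keeps the original index labels; chaining reset_index(drop=True) renumbers them if that is wanted.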
My problem, fyi, was that I wasn't returning the resulting dataframe, so PyCharm wasn't bothering to update said dataframe. Naming the dataframe after the return keyword fixed the issue.
Edit:
I had a bare return at the end of my method instead of
return df,
which the debugger must have noticed, because df wasn't being updated in spite of my explicit, in-place sort.
I am new to Python and I want to access some rows for an already grouped dataframe (used groupby).
However, I am unable to select the row I want and would like your help.
The code I used for the groupby is shown below:
language_conversion = house_ads.groupby(['date_served','language_preferred']).agg({'user_id':'nunique',
'converted':'sum'})
language_conversion
Result shows:
For example, I want to access the number of Spanish-speaking users who received house ads using:
language_conversion[('user_id','Spanish')]
This gives me KeyError('user_id','Spanish').
The same error occurs when I try to create a new column.
Thanks for your help
Use this:
language_conversion.loc[(slice(None), 'Arabic'), 'user_id']
You can see the index entries (in this case tuples of length 2) using language_conversion.index
You should use this:
language_conversion.loc[(slice(None),'Spanish'), 'user_id']
slice(None) here includes all rows in the date level of the index.
If you have one particular date in mind, just replace slice(None) with that specific date.
The error you are getting occurs because you tried to select by column first and by index second, which is not how this kind of indexing works: language_preferred is an index level, not a column. Follow the link to learn more about indexing.
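The selection can be tried out on a small made-up frame (the rows below are hypothetical; the column names and the groupby follow the question). As an alternative to the .loc form, xs selects one value of an index level and drops that level from the result.

```python
import pandas as pd

# Hypothetical rows; column names follow the question.
house_ads = pd.DataFrame({
    "date_served": ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"],
    "language_preferred": ["English", "Spanish", "English", "Spanish"],
    "user_id": ["a1", "a2", "a3", "a4"],
    "converted": [True, False, True, True],
})

language_conversion = house_ads.groupby(
    ["date_served", "language_preferred"]
).agg({"user_id": "nunique", "converted": "sum"})

# All dates, Spanish only; the result keeps both index levels.
spanish_users = language_conversion.loc[(slice(None), "Spanish"), "user_id"]

# Same rows via xs; the language level is dropped, leaving the date as index.
spanish_by_date = language_conversion.xs("Spanish", level="language_preferred")["user_id"]

print(spanish_users)
print(spanish_by_date)
```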
I am relatively new to coding and I tried to challenge myself with a personal project which is proving a bit more difficult than I have anticipated :-)
I am using a sales report that has both product info and product range information stored in the same column. I am trying to utilise the formatting of the column for filtering.
I have filtered my data using:
range_df = df[df['Stroke_Range_Det'].str.contains('(^\s{12})+([a-zA-Z]+)')]
which returned the warning message "This pattern has match groups. To actually get the groups, use str.extract. return func(self, *args, **kwargs)". Instead of ignoring the warning message, I have tried to use str.extract but can't get the desired outcome.
I have tried the below code which didn't work.
range_df = df[df['Stroke_Range_Det'].str.extractall(pat = '(^\s{12})+([a-zA-Z]+)')]
Any suggestions?
extractall() already returns a DataFrame.
Use:
range_df = df['Stroke_Range_Det'].str.extractall(pat = '(^\s{12})+([a-zA-Z]+)')
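If the goal is to filter rows rather than pull out the matched text (an assumption based on the question), one option is to keep str.contains and drop the capture groups, since the warning is triggered by the parentheses in the pattern. The sketch below uses made-up values for Stroke_Range_Det and a simplified pattern that should match the same rows; a raw string avoids problems with the \s escape.

```python
import pandas as pd

# Hypothetical values: range rows are indented by twelve spaces.
df = pd.DataFrame({
    "Stroke_Range_Det": [
        "Product A",
        " " * 12 + "RangeOne",
        " " * 4 + "Product B",
        " " * 12 + "RangeTwo",
    ]
})

# No capture groups, so str.contains has nothing to warn about.
pattern = r"^\s{12}[a-zA-Z]+"
mask = df["Stroke_Range_Det"].str.contains(pattern)
range_df = df[mask]

# If the matched name itself is needed, a single capture group with
# str.extract returns it (NaN where the pattern does not match).
names = df["Stroke_Range_Det"].str.extract(r"^\s{12}([a-zA-Z]+)", expand=False)

print(range_df)
print(names)
```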
Current Output
I am executing the following code to group similar items, take the SUM, and then average it within each group. The statement works fine, but I need my output with full rows: my current output shows each ORIGIN_AIRPORT only once for all of its DESTINATION_AIRPORT values. It should show ANC for SEA as well.
I hope my point is clear.
df=df.groupby(['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']).agg({'SUM': np.average})
IIUC, try:
df=df.groupby(['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']).agg({'SUM': np.average}).reset_index()
Add reset_index() at the end of the query.
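A minimal sketch of what reset_index() changes, on made-up flight data (the values are hypothetical, and the string "mean" is used here in place of np.average as an equivalent unweighted average). Before the reset the group keys live in a MultiIndex, which pandas displays with repeated labels blanked out; afterwards they are ordinary columns, repeated on every row.

```python
import pandas as pd

# Hypothetical data; column names follow the question.
df = pd.DataFrame({
    "ORIGIN_AIRPORT": ["ANC", "ANC", "ANC", "LAX"],
    "DESTINATION_AIRPORT": ["SEA", "SEA", "PDX", "SEA"],
    "SUM": [10.0, 20.0, 5.0, 7.0],
})

# Group keys end up in a two-level index.
grouped = df.groupby(["ORIGIN_AIRPORT", "DESTINATION_AIRPORT"]).agg({"SUM": "mean"})

# reset_index moves the keys back into regular columns.
flat = grouped.reset_index()

print(grouped)
print(flat)
```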
Does Azure use some other syntax for finding the difference between dates and times?
Or
Is some package missing in Azure?
How do I find the difference between dates in a pandas data frame in Azure ML?
I have two columns in a dataframe and I have to compute the difference between them and store it in a third column. The problem is that all of this runs well in a Python IDE but not in Microsoft Azure.
My date format: 2015-09-25T01:45:34.372Z
I have to find df['days'] = df['a'] - df['b']
I have tried almost all the syntax available on Stack Overflow.
Please help.
import pandas as pd

mylist = ['app_lastCommunicatedAt', 'app_installAt', 'installationId']

def finding_dates(df, mylist):
    for i in mylist:
        # installationId is not a date column, so skip it
        if i == 'installationId':
            continue
        df[i] = pd.to_datetime(df[i])
    # absolute difference in whole days between the two date columns
    df['days'] = abs((df[mylist[1]] - df[mylist[0]]).dt.days)
    return df
When I call this function it gives an error and does not accept the lines below continue.
I have also tried many other things, such as converting the dates to strings.
In my experience, the issue seems to be that your code lacks the dataframe_service decorator, which indicates that the function operates on a data frame; please see https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python#dataframe_service. If you are not familiar with the decorator syntax (@), please see https://www.python.org/dev/peps/pep-0318/.
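Independent of the Azure-specific part, the date arithmetic itself can be checked in plain pandas. The following is only a sketch on made-up timestamps in the format given in the question; it does not use or demonstrate the Azure ML client library.

```python
import pandas as pd

# Hypothetical timestamps in the question's format; column names follow the question.
df = pd.DataFrame({
    "app_installAt": ["2015-09-20T01:45:34.372Z", "2015-09-01T12:00:00.000Z"],
    "app_lastCommunicatedAt": ["2015-09-25T01:45:34.372Z", "2015-09-03T06:00:00.000Z"],
})

# Parse the ISO 8601 strings into timezone-aware datetimes.
for col in ["app_installAt", "app_lastCommunicatedAt"]:
    df[col] = pd.to_datetime(df[col], utc=True)

# Subtracting two datetime columns gives timedeltas; .dt.days keeps whole days.
df["days"] = (df["app_lastCommunicatedAt"] - df["app_installAt"]).dt.days.abs()

print(df)
```

If this runs locally but not in the hosted environment, the difference is more likely in how the function is registered or called there than in the pandas calls themselves.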