Group by of dataframe with average of a column


I am really new to Python; I started learning it just a week ago. I have a question and hope you can help me solve it. Thanks in advance!
I have data in the format below.
Date Product Price Discount
1/1/2020 A 17,490 30
1/1/2020 B 34,990 21
1/1/2020 C 20,734 11
1/2/2020 A 16,884 26
1/2/2020 B 26,990 40
1/2/2020 C 17,936 10
1/3/2020 A 16,670 36
1/3/2020 B 12,990 13
1/3/2020 C 30,990 43
I want to take the average of the Discount column for each date and end up with just two columns, but it isn't working:
Date AVG_Discount
1/1/2020 x %
1/2/2020 y %
1/3/2020 z %
Below is what I have tried. As I said, I am a novice in Python, so the approach might be incorrect. I need some guidance. TIA
mean_col=df.groupby(df['time'])['discount'].mean()
df=df.set_index(['time'])
df['mean_col']=mean_col
df=df.reset_index()

df.groupby(df['time'])['discount'].mean() already returns a Series with time as its index.
All you need to do is call reset_index on the result:
grouped_df = df.groupby(df['time'])['discount'].mean().reset_index()
As Quang Hoang suggested in the comments, you can also pass as_index=False to groupby.
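
The two variants mentioned above can be compared side by side. This is only a minimal sketch: it assumes the columns are named Date and Discount, as in the sample data from the question (the question's own code uses time and discount), and it rebuilds a few of the sample rows by hand.

```python
import pandas as pd

# A few rows copied from the question; only the columns needed here.
df = pd.DataFrame({
    "Date": ["1/1/2020", "1/1/2020", "1/1/2020",
             "1/2/2020", "1/2/2020", "1/2/2020"],
    "Discount": [30, 21, 11, 26, 40, 10],
})

# Variant 1: aggregate, then move the group key from the index into a column.
via_reset = df.groupby("Date")["Discount"].mean().reset_index()

# Variant 2: ask groupby to keep the key as a regular column from the start.
via_flag = df.groupby("Date", as_index=False)["Discount"].mean()

print(via_flag)
```

Both variants should give a two-column DataFrame (Date and the mean Discount); which one to use is mostly a matter of taste.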

Apparently, you have read your DataFrame from a text file,
e.g. a CSV, but with a separator other than a comma.
Run df.info(); I assume you will get a result something like the one below:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null object
Product 9 non-null object
Price 9 non-null object
Discount 9 non-null int64
dtypes: int64(1), object(3)
Note that the Date, Product and Price columns are of object type
(actually, strings). This is especially important in the case of the
Price column, because to compute a mean the source column must be
a number (not a string).
So first you should convert the Date and Price columns to proper types
(datetime and float). To do so, run:
df.Date = pd.to_datetime(df.Date)
df.Price = df.Price.str.replace(',', '.').astype(float)
Run df.info() again and now the result should be:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null datetime64[ns]
Product 9 non-null object
Price 9 non-null float64
Discount 9 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
And now you can compute the mean discount, running:
df.groupby('Date').Discount.mean()
For your data I got:
Date
2020-01-01 20.666667
2020-01-02 25.333333
2020-01-03 30.666667
Name: Discount, dtype: float64
Note that your code sample contains the following errors:
The argument of groupby is a column name (or a list of column names), so:
df inside the parentheses is not needed,
instead of time you should write Date (you have no time column).
Your Discount column is written with a capital D.
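
Putting the corrections together, the whole task might look like the sketch below. It makes two assumptions that are not confirmed by the question: that the data is whitespace-separated text, and that the comma in Price is a thousands separator (so 17,490 is read as 17490, unlike the replace(',', '.') conversion shown earlier). The name AVG_Discount is taken from the desired output in the question.

```python
import io
import pandas as pd

# Sample data from the question, assumed here to be whitespace-separated.
raw = """Date Product Price Discount
1/1/2020 A 17,490 30
1/1/2020 B 34,990 21
1/1/2020 C 20,734 11
1/2/2020 A 16,884 26
1/2/2020 B 26,990 40
1/2/2020 C 17,936 10
1/3/2020 A 16,670 36
1/3/2020 B 12,990 13
1/3/2020 C 30,990 43
"""

# thousands="," is an assumption about what the comma in Price means.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", thousands=",",
                 parse_dates=["Date"])

# Mean discount per date, with the key kept as a column and the result renamed.
avg = (
    df.groupby("Date", as_index=False)["Discount"]
      .mean()
      .rename(columns={"Discount": "AVG_Discount"})
)
print(avg)
```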

Related

A merge in pandas is returning only NaN values

I'm trying to merge two dataframes: 'new_df' and 'df3'.
new_df contains years and months, and df3 contains years, months and other columns.
I've cast most of the columns as object, and tried to merge them both.
The merge 'works' in that it doesn't return an error, but my final DataFrame is all empty; only the year and month columns are correct.
new_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_test 119 non-null datetime64[ns]
1 year 119 non-null object
2 month 119 non-null object
dtypes: datetime64[ns](1), object(2)
df3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 53 to 1297
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_number 191 non-null object
1 date 191 non-null object
2 year 191 non-null object
3 country 191 non-null object
4 area 191 non-null object
5 location 191 non-null object
6 activity 191 non-null object
7 fatal_y_n 182 non-null object
8 time 172 non-null object
9 species 103 non-null object
10 month 190 non-null object
dtypes: object(11)
I've tried this line of code:
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year','month'])
I was expecting a table with values in all columns; instead, every column except year and month came back as NaN.

Your issue is with the data types of month and year in the two DataFrames: they're of type object, which can behave unexpectedly during the join.
There's a great answer that goes into depth about converting types to numbers, but here's what the code might look like before joining:
# convert column "year" and "month" of new_df
new_df["year"] = pd.to_numeric(new_df["year"])
new_df["month"] = pd.to_numeric(new_df["month"])
And make sure you do the same with df3 as well.
You may also have a data integrity problem: I'm not sure what you're doing before you get those DataFrames, but if a column ends up as object, you may have had a mix of ints, strings or other data types merged together. There's a good article that goes over pandas data types. Specifically, an object column can hold a mix of strings and other data, so two keys that look the same (such as the string '2020' and the integer 2020) will not match in the join.
Hope that helps!
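
To illustrate the suspected cause, here is a small sketch with made-up stand-ins for new_df and df3. The values, and the assumption that one side holds strings while the other holds integers, are hypothetical; the real frames may differ.

```python
import pandas as pd

# Hypothetical stand-ins: both key columns have dtype object, but one side
# stores strings and the other stores integers.
new_df = pd.DataFrame({
    "year": pd.Series(["2020", "2020"], dtype=object),
    "month": pd.Series(["1", "2"], dtype=object),
})
df3 = pd.DataFrame({
    "year": pd.Series([2020, 2020], dtype=object),
    "month": pd.Series([1, 2], dtype=object),
    "country": ["USA", "Australia"],
})

# No error is raised, but "2020" never equals 2020, so nothing matches.
before = pd.merge(left=new_df, right=df3, how="left", on=["year", "month"])

# Convert the keys to numbers on BOTH frames, then merge again.
for frame in (new_df, df3):
    frame["year"] = pd.to_numeric(frame["year"])
    frame["month"] = pd.to_numeric(frame["month"])

after = pd.merge(left=new_df, right=df3, how="left", on=["year", "month"])

print(before["country"].isna().all())
print(after["country"].tolist())
```

If the real columns contain values that cannot be parsed as numbers, pd.to_numeric raises by default; passing errors='coerce' turns such values into NaN instead.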

python pandas | replacing the date and time string with only time

price quantity high time
10.4 3 2021-11-08 14:26:00-05:00
The DataFrame is named ddg.
The dtype of the high time column is datetime64[ns, America/New_York].
I want high time to show only 14:26:00 (dropping 2021-11-08 and -05:00), but I got an error when using the code below:
ddg['high_time'] = ddg['high_time'].dt.strftime('%H:%M')

I think it's because that's not the right column name:
# Your code
>>> ddg['high_time'].dt.strftime('%H:%M')
...
KeyError: 'high_time'
# With right column name
>>> ddg['high time'].dt.strftime('%H:%M')
0 14:26
Name: high time, dtype: object
# My dataframe:
>>> ddg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 1 non-null float64
1 quantity 1 non-null int64
2 high time 1 non-null datetime64[ns, America/New_York]
dtypes: datetime64[ns, America/New_York](1), float64(1), int64(1)
memory usage: 152.0 bytes
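
Since the question asks for 14:26:00 rather than 14:26, the format string probably needs the seconds as well. A minimal sketch, rebuilding the one-row frame by hand (the fixed -05:00 offset stands in for the America/New_York time zone of the original data):

```python
import pandas as pd

# One-row frame modelled on the question; the column name contains a space.
ddg = pd.DataFrame({
    "price": [10.4],
    "quantity": [3],
    "high time": pd.to_datetime(["2021-11-08 14:26:00-05:00"]),
})

# %S adds the seconds; the result is a column of plain strings.
ddg["high time"] = ddg["high time"].dt.strftime("%H:%M:%S")
print(ddg)
```

Note that after this step the column holds strings, so the .dt accessor no longer works on it.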

Grouping DateTime by day of the week

I suppose this should be an easy question for experienced users. I want to group records by day of the week and get the number of records for each weekday.
Here is the output of rent_week.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1689 entries, 3 to 1832
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1689 non-null int64
1 createdAt 1689 non-null datetime64[ns]
2 updatedAt 1689 non-null datetime64[ns]
3 endAt 1689 non-null datetime64[ns]
4 timeoutAt 1689 non-null datetime64[ns]
5 powerBankId 1689 non-null int64
6 station_id 1689 non-null int64
7 endPlaceId 1686 non-null float64
8 endStatus 1689 non-null object
9 userId 1689 non-null int64
10 station_name 1689 non-null object
dtypes: datetime64[ns](4), float64(1), int64(4), object(2)
memory usage: 158.3+ KB
Data in the 'createdAt' column looks like "2020-07-19T18:00:27.190010000".
I am trying to add a new column:
rent_week['a_day'] = rent_week['createdAt'].strftime('%A')
and I get an error back: AttributeError: 'Series' object has no attribute 'strftime'.
Meanwhile, if I write:
a_day = datetime.today()
print(a_day.strftime('%A'))
it shows the expected result. In my understanding, a_day and rent_week['a_day'] have a similar datetime type.
Even calling:
rent_week['a_day'] = pd.to_datetime(rent_week['createdAt']).strftime('%A')
gives me the same error: no strftime attribute.
I haven't even started grouping my data. What I expect as a result is a DataFrame like:
a_day number_of_records
Monday 101
Tuesday 55
...

Try rent_week['createdAt'].dt.strftime('%A'); note the additional .dt on your DataFrame column (a Series object).
Background: the "similar" type assumption you make is almost correct. However, as a column could be of many types (numeric, string, datetime, geographic, ...), the methods specific to the underlying values are grouped under accessor namespaces so that they do not clutter the already broad API (method count) of the Series type itself. That's why string functions are available only through .str, and datetime functions only through .dt.
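
A short sketch of the full task, from weekday name to record count. The three timestamps are made up for illustration; number_of_records is the column name from the desired output in the question.

```python
import pandas as pd

# Made-up timestamps: one Sunday and two Mondays.
rent_week = pd.DataFrame({
    "createdAt": pd.to_datetime([
        "2020-07-19 18:00:27",
        "2020-07-20 09:15:00",
        "2020-07-20 21:40:10",
    ])
})

# .dt exposes datetime methods element-wise on the Series.
rent_week["a_day"] = rent_week["createdAt"].dt.strftime("%A")

# Count the rows per weekday name.
counts = (
    rent_week.groupby("a_day")
             .size()
             .reset_index(name="number_of_records")
)
print(counts)
```

The groups come out in alphabetical order of the weekday name, not in calendar order.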

You can write a lambda function for the conversion and apply it to the "createdAt" column. After this step you can group by as required. This code may help:
rent_week['a_day'] = rent_week['createdAt'].apply(lambda x: x.strftime('%A'))

Thank you Quamar and Ojdo for your contributions. I found the problem: it is in the index.
<ipython-input-41-a42a82727cdd>:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')
As soon as I reset the index
rent_week.reset_index()
both variants work as expected!
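
The warning text quoted above suggests another possible explanation: rent_week may have been created as a filtered slice of a larger DataFrame (its non-contiguous index, 3 to 1832, hints at this). Under that assumption, taking an explicit copy is a common way to avoid the warning. The parent frame rents and the filter condition below are hypothetical.

```python
import pandas as pd

# Hypothetical parent frame; `rents` and the filter are invented for this sketch.
rents = pd.DataFrame({
    "id": [1, 2, 3],
    "createdAt": pd.to_datetime([
        "2020-07-19 18:00:27",
        "2020-07-20 09:15:00",
        "2020-07-21 12:00:00",
    ]),
    "endStatus": ["ok", "timeout", "ok"],
})

# .copy() makes rent_week an independent DataFrame, so adding a column
# to it is unambiguous and the parent frame is left untouched.
rent_week = rents[rents["endStatus"] == "ok"].copy()
rent_week["a_day"] = rent_week["createdAt"].dt.strftime("%A")
print(rent_week)
```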

Sorting a Pandas dataframe

I have the following dataframe:
Join_Count 1
LSOA11CD
E01006512 15
E01006513 35
E01006514 11
E01006515 11
E01006518 11
...
But when I try to sort it:
BusStopList.sort("LSOA11CD",ascending=1)
I get the following:
Key Error: 'LSOA11CD'
How do I go about sorting this by either the LSOA column or the column full of numbers which doesn't have a heading?
The following is the information produced by Python about this dataframe:
<class 'pandas.core.frame.DataFrame'>
Index: 286 entries, E01006512 to E01033768
Data columns (total 1 columns):
1 286 non-null int64
dtypes: int64(1)
memory usage: 4.5+ KB

'LSOA11CD' is the name of the index, and 1 is the name of the column. So you must use sort_index (rather than sort_values):
BusStopList.sort_index(level="LSOA11CD", ascending=True)
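
The question also asks about sorting by the column of numbers. A minimal sketch covering both cases, assuming the column label is the integer 1 (if it is the string '1', pass by='1' instead); the three rows are taken from the question.

```python
import pandas as pd

# Small frame modelled on the question: a named index and one column labelled 1.
BusStopList = pd.DataFrame(
    {1: [35, 11, 15]},
    index=pd.Index(["E01006513", "E01006514", "E01006512"], name="LSOA11CD"),
)

# Sort by the index (the LSOA codes).
by_code = BusStopList.sort_index()

# Sort by the values in the column labelled 1.
by_count = BusStopList.sort_values(by=1)

print(by_code)
print(by_count)
```

Both methods return a new DataFrame, so the result has to be assigned to a name to be kept.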

Reading in data as float via a converter

I have a CSV file called 'filename' and want to read its data in as float64, except for the column 'hour'. I managed it with the pd.read_csv function and a converter.
df = pd.read_csv("../data/filename.csv",
                 delimiter=';',
                 date_parser=['hour'],
                 skiprows=1,
                 converters={'column1': lambda x: float(x.replace('.', '').replace(',', '.'))})
Now, I have two questions:
FIRST:
The delimiter works with ';', but when I look at my data in Notepad, the separators are ',', not ';'. Yet if I use ',' I get: 'pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 13, saw 9'
SECOND:
If I want to use the converter for all columns, how can I do that? What is the right syntax?
I tried to use dtype = float in the read function, but I get 'AttributeError: 'NoneType' object has no attribute 'dtype''. What happened? That's the reason why I want to handle it with the converter.
Data:
,hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"

This should work:
In [40]:
import io
import pandas as pd
# text data
temp=''',hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"'''
# so read the csv, pass params quotechar and the thousands character
df = pd.read_csv(io.StringIO(temp), quotechar='"', thousands=',')
df
Out[40]:
Unnamed: 0 hour PV Wind onshore Wind offshore PV.1 Wind onshore.1 \
0 0 1 0 12985.0 9614.0 0 32825.5
1 1 2 0 12908.9 9290.8 0 36052.3
2 2 3 0 12740.9 8886.9 0 38540.9
3 3 4 0 12485.3 8644.5 0 40734.0
4 4 5 0 11188.5 8079.0 0 42688.0
5 5 6 0 11219.0 7594.2 0 43333.5
Wind offshore.1 PV.2 Wind onshore.2 Wind offshore.2
0 9495.7 0 13110.3 10855.5
1 9589.1 0 13670.2 10828.6
2 10087.3 0 14610.8 10828.6
3 10087.3 0 15638.3 10343.7
4 10087.3 0 16809.4 10343.7
5 10025.0 0 18266.9 10343.7
In [41]:
# check the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 11 columns):
Unnamed: 0 6 non-null int64
hour 6 non-null int64
PV 6 non-null float64
Wind onshore 6 non-null float64
Wind offshore 6 non-null float64
PV.1 6 non-null float64
Wind onshore.1 6 non-null float64
Wind offshore.1 6 non-null float64
PV.2 6 non-null float64
Wind onshore.2 6 non-null float64
Wind offshore.2 6 non-null float64
dtypes: float64(9), int64(2)
memory usage: 576.0 bytes
So basically you need to pass the quotechar='"' and thousands=',' params to read_csv to achieve what you want, see the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
EDIT
If you want to convert after importing (which is a waste when you can do it upfront) then you can do this for each column of interest:
In [43]:
import numpy as np
# replace the comma separator
df['Wind onshore'] = df['Wind onshore'].str.replace(',','')
# convert the type
df['Wind onshore'] = df['Wind onshore'].astype(np.float64)
df['Wind onshore'].dtype
Out[43]:
dtype('float64')
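It would be faster to replace the comma separator on all the columns of interest first and then convert them in one go. This answer originally suggested df.convert_objects(convert_numeric=True) for that; convert_objects has since been removed from pandas, and pd.to_numeric is the current way to do the conversion.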
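
A rough sketch of converting several columns after import, using pd.to_numeric in place of the removed convert_objects. The shortened sample and the list of columns are assumptions made for illustration.

```python
import io
import pandas as pd

# Shortened version of the sample data, read WITHOUT the thousands parameter,
# so the quoted numbers arrive as strings such as "12,985.0".
temp = '''hour,PV,Wind onshore,Wind offshore
1,0.0,"12,985.0","9,614.0"
2,0.0,"12,908.9","9,290.8"
'''
df = pd.read_csv(io.StringIO(temp))

# Columns assumed to need conversion.
cols = ["Wind onshore", "Wind offshore"]

for col in cols:
    # Drop the thousands separator, then parse the remaining text as numbers.
    df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False))

print(df.dtypes)
```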
