Data Overview
Hello everyone
I need to get the two platforms with the most visits per day, over one year in total. So:
Group the data by day
Extract the two platforms with the most visits for each day
I tried this code:
df.groupby(pd.Grouper(key="Datum", freq="1D")).nlargest(2, 'Visits')
and got this error:
AttributeError: Cannot access callable attribute 'nlargest' of 'DataFrameGroupBy' objects, try using the 'apply' method
Thanks a lot for your help! :)
Why not just use apply, as the error message states:
import pandas as pd
# dataframe example
d = {'Platform': ['location', 'office', 'station'], 'Date': ['01.08.2019', '01.08.2019', '01.08.2019'], 'Visits': [4372, 48176, 2012]}
df = pd.DataFrame(data=d)
df.groupby(pd.Grouper(key="Date")).apply(lambda grp: grp.nlargest(2, 'Visits'))
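If you prefer to avoid apply, a minimal alternative sketch (assuming the same Date/Visits columns) sorts by visits first and then keeps the top two rows within each date:
# sort by Visits descending, then take the first two rows per date
top2 = (df.sort_values('Visits', ascending=False)
          .groupby('Date')
          .head(2))
print(top2)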
I am running
import pandas as pd
df= pd.read_csv("RELIANCE.csv",parse_dates=['Date'], index_col=['Date'])
df.head(2)
It gives the output below:
Open High Low Close Adj Close Volume
Date
2019-08-19 1281.050049 1296.800049 1280.000000 1292.599976 1287.764648 7459859.0
2019-08-20 1289.800049 1292.599976 1272.599976 1275.949951 1271.176880 6843460.0
but type(df.Date[0]) throws AttributeError: 'DataFrame' object has no attribute 'Date', and df['2019-08-19'] throws KeyError: '2019-08-19'.
Can anybody tell me how to resolve this error?
I think you can use .loc
df.loc['2019-08-19']
The AttributeError is raised because 'Date' was made the index, so it is no longer a column and is not accessible as an attribute of the DataFrame. Instead, you can do something like type(df.index[0]) or df.index.dtype to inspect the index type.
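Putting both together, a minimal sketch (assuming the same RELIANCE.csv file) that inspects the index and selects a row by its date label:
import pandas as pd

df = pd.read_csv("RELIANCE.csv", parse_dates=['Date'], index_col=['Date'])
print(type(df.index[0]))      # a pandas Timestamp -- the dates live in the index
print(df.index.dtype)         # datetime64[ns]
print(df.loc['2019-08-19'])   # select a row by its date label with .loc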
I am trying to build a new column in a dataframe by applying a function based on the values in the date column, but I get an error saying "Timestamp object has no attribute dt." However, if I run this outside of a function, the dt attribute works fine.
Any guidance would be appreciated.
This code runs with no issues:
sample = {'Date': ['2015-07-02 11:47:00', '2015-08-02 11:30:00']}
dftest = pd.DataFrame.from_dict(sample)
dftest['Date'] = pd.to_datetime(dftest['Date'])
display(dftest.info())
dftest['year'] = dftest['Date'].dt.year
dftest['month'] = dftest['Date'].dt.month
This code gives me the error message:
sample = {'Date': ['2015-07-02 11:47:00', '2015-08-02 11:30:00']}
dftest = pd.DataFrame.from_dict(sample)
dftest['Date'] = pd.to_datetime(dftest['Date'])
def CALLYMD(dftest):
    if dftest['Date'].dt.month > 9:
        return str(dftest['Date'].dt.year) + '1231'
    elif dftest['Date'].dt.month > 6:
        return str(dftest['Date'].dt.year) + '0930'
    elif dftest['Date'].dt.month > 3:
        return str(dftest['Date'].dt.year) + '0630'
    else:
        return str(dftest['Date'].dt.year) + '0331'
dftest['CALLYMD'] = dftest.apply(CALLYMD, axis=1)
Lastly, I'm open to any suggestions on how to make this code better as I'm still learning.
I'm guessing you should remove .dt in the second case. With apply(..., axis=1) the function receives one row at a time, so dftest['Date'] is a single Timestamp rather than a Series. The .dt accessor is only needed on a Series of datetimes; a single Timestamp exposes .year and .month directly, and calling .dt on it raises:
AttributeError: 'Timestamp' object has no attribute 'dt'
reference: https://stackoverflow.com/a/48967889/13720936
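Concretely, a corrected sketch of the function above, with .dt dropped because apply hands the function a plain Timestamp per row:
def CALLYMD(row):
    # row['Date'] is a single Timestamp here, so use .year/.month directly
    if row['Date'].month > 9:
        return str(row['Date'].year) + '1231'
    elif row['Date'].month > 6:
        return str(row['Date'].year) + '0930'
    elif row['Date'].month > 3:
        return str(row['Date'].year) + '0630'
    else:
        return str(row['Date'].year) + '0331'

dftest['CALLYMD'] = dftest.apply(CALLYMD, axis=1)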
After looking at the Timestamp documentation, I found that removing the .dt and just using .year and .month works. However, I'm still confused as to why it works in the first code but not in the second.
Here is how to create a year-month bucket using the year and month:
for key, item in df.iterrows():
    year = pd.to_datetime(item['Date']).year
    month = str(pd.to_datetime(item['Date']).month)
    df.loc[key, 'YearMonth'] = "{:.0f}{}".format(year, month.zfill(2))
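As a hedged aside, the same bucket can be built without the Python-level loop, assuming the Date column parses cleanly:
# vectorized equivalent: parse the column once, then format as YYYYMM
df['YearMonth'] = pd.to_datetime(df['Date']).dt.strftime('%Y%m')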
I would like to apply a function to each row of a dask dataframe.
Executing the operation with t_data.compute() gives me an error:
AttributeError: 'Series' object has no attribute 'encode'
This is my code:
# sid is assumed to be an nltk SentimentIntensityAnalyzer and scale a
# normalizing helper, both defined elsewhere in the script
def polar(data):
    data = scale(sid.polarity_scores(data.tweet)['compound'])
    return data
t_data['sentiment'] = t_data.map_partitions(polar, meta=('sentiment', int))
Using t_data.head() also results in the same error.
I have found the answer: you have to apply the function within each partition.
t_data['sentiment']=t_data.map_partitions(lambda df : df.apply(polar,axis=1))
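A hedged note on this pattern: without a meta hint, dask samples data to infer the output dtype and emits a warning. Assuming polar returns a float (the scaled compound score), you can declare it up front:
# same per-partition row-wise apply, with an explicit meta so dask can
# build its task graph without guessing the output dtype
t_data['sentiment'] = t_data.map_partitions(
    lambda df: df.apply(polar, axis=1),
    meta=('sentiment', 'f8'),
)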
You can use the following:
t_data.apply(polar, axis=1)
I want to convert all the items in the 'Time' column of my pandas dataframe from UTC to Eastern time. However, following the answer in this stackoverflow post, some of the keywords are not recognized in pandas 0.20.3. Overall, how should I do this task?
tweets_df = pd.read_csv('valid_tweets.csv')
tweets_df['Time'] = tweets_df.to_datetime(tweets_df['Time'])
tweets_df.set_index('Time', drop=False, inplace=True)
error is:
tweets_df['Time'] = tweets_df.to_datetime(tweets_df['Time'])
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3081, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'to_datetime'
items from the Time column look like this:
2016-10-20 03:43:11+00:00
Update:
using
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
tweets_df.set_index('Time', drop=False, inplace=True)
tweets_df.index = tweets_df.index.tz_localize('UTC').tz_convert('US/Eastern')
did not actually convert the times. Any idea what needs fixing?
Update 2:
So the following code does not convert in place, meaning that when I print row['Time'] using iterrows() it still shows the original values. Do you know how to do the conversion in place?
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
for index, row in tweets_df.iterrows():
    row['Time'].tz_localize('UTC').tz_convert('US/Eastern')
for index, row in tweets_df.iterrows():
    print(row['Time'])
to_datetime is a function defined in pandas, not a method on a DataFrame. Try:
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
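On the updates: iterrows() yields copies, so calling tz_localize/tz_convert inside the loop never writes anything back to the frame. A minimal in-place sketch, assuming the +00:00-style strings shown above:
# parse straight to tz-aware UTC, then convert the whole column and assign it back
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'], utc=True)
tweets_df['Time'] = tweets_df['Time'].dt.tz_convert('US/Eastern')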
I want to select only those rows that have a timestamp that belongs to last 36 hours. My PySpark DataFrame df has a column unix_timestamp that is a timestamp in seconds.
This is my current code, but it fails with the error AttributeError: 'DataFrame' object has no attribute 'timestamp'. I tried changing it to unix_timestamp, but it still fails.
import datetime
hours_36 = (datetime.datetime.now() - datetime.timedelta(hours = 36)).strftime("%Y-%m-%d %H:%M:%S")
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")).filter(df.timestamp > hours_36)
The timestamp column doesn't exist yet on df when you try to refer to it in the same expression. You can use pyspark.sql.functions.col to refer to the column in a dynamic way, without specifying which DataFrame object it belongs to:
import pyspark.sql.functions as F
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")).filter(F.col("unix_timestamp") > hours_36)
Or without creating the intermediate column:
df.filter(df.unix_timestamp.cast("timestamp") > hours_36)
The API doc shows that you can also use a string expression for filtering:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter
import pyspark.sql.functions as F
df = (df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp"))
        .filter("unix_timestamp > '%s'" % hours_36))
It may not be as efficient, though.
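As a hedged alternative, since unix_timestamp already holds epoch seconds, you could compare the raw numbers and skip the cast entirely:
import datetime
import pyspark.sql.functions as F

# cutoff as epoch seconds, 36 hours ago; compare numerically against the column
cutoff = (datetime.datetime.now() - datetime.timedelta(hours=36)).timestamp()
recent = df.filter(F.col("unix_timestamp") > cutoff)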