How can I calculate & plot data received from a survey using Python?

What I need to do is basically count the responses received per day over a period of time, i.e.
07/07/2019 | 6
08/07/2019 | 7
and plot the above as a graph.
But the current data is in the format below:
07/07/2019 17:33:07
07/07/2019 12:00:03
08/07/2019 21:10:05
08/07/2019 20:06:09
So far I have:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('survey_results_public.csv')
df.head()
df['Timestamp'].value_counts().plot(kind="bar")
plt.show()
But the above doesn't look good.

You are counting all values in the timestamp column so you will have 1 response per timestamp.
You should parse the timestamp column, check the unique dates and then count the number of timestamps that belong to each date.
Only then should you plot the data.
So do something like this:
import pandas as pd
import datetime

def parse_timestamps(timestamp):
    # Parse a 'dd/mm/yyyy HH:MM:SS' string into a datetime object
    return datetime.datetime.strptime(timestamp, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv('survey_results_public.csv')
# Keep only the date part of each timestamp, then count responses per date
df["Date"] = df["Timestamp"].map(lambda t: parse_timestamps(t).date())
df["Date"].value_counts().plot(kind="bar")

Related

Bar graph drawing using month from date in pandas

I need to draw a bar chart using the data set below. The x-axis needs to be Territory, the y-axis needs to be the average production in each territory, and the hue needs to contain the month from the date column.
I'm not exactly sure what you are asking. When you say average production, do you want to calculate the average production for a territory, or just display the value that is in the production column? If you clarify, I can update my answer. In my example I just display the data from the production column. First, export your spreadsheet to CSV. Then you can do the following:
import calendar
import datetime
import pandas as pd
import plotly.express as ex

df = pd.read_csv("data.csv")

def get_month_names(dataframe: pd.DataFrame):
    # Get all the dates
    dates = dataframe["Date"].to_list()
    # Convert date-string to datetime object
    # I assume month/day/year; if it is day/month/year, swap %m and %d
    date_objs = [datetime.datetime.strptime(date, "%m/%d/%Y %H:%M:%S") for date in dates]
    # Get all the months
    months = [date.month for date in date_objs]
    # Get the names of the months
    month_names = [calendar.month_name[month] for month in months]
    return month_names

fig = ex.bar(x=df["Territory"],
             y=df["Production"],
             color=get_month_names(df))
fig.show()
This produces a bar chart with Territory on the x-axis, Production on the y-axis, and bars colored by month name.
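If the goal really is average production rather than the raw values, a minimal sketch (assuming the same data.csv with Date, Territory and Production columns) would group before plotting:
import pandas as pd
import plotly.express as ex

df = pd.read_csv("data.csv")
# Parse the date column and keep the month name for the hue
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %H:%M:%S")
df["Month"] = df["Date"].dt.month_name()

# Average production per territory and month
avg = df.groupby(["Territory", "Month"], as_index=False)["Production"].mean()

fig = ex.bar(avg, x="Territory", y="Production", color="Month", barmode="group")
fig.show()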

Iterating through a range of dates in Python with missing dates

Here I have a pandas data frame with daily stock returns; the columns are date and return rate.
But if I only want to keep the last day of each week, and the data has some missing days, what can I do?
import pandas as pd

df = pd.read_csv('Daily_return.csv')
df.Date = pd.to_datetime(df.Date)
count = 300
# pseudocode - this is where I get stuck
for last_day in ('2017-01-01' + 7n for n in range(count)):
Actually my brain stops working at this point with my limited imagination... Maybe the biggest problem is that "+7n"-style arithmetic is meaningless when some dates are missing.
I'll create a sample dataset with 40 dates and 40 sample returns, then sample 90 percent of that randomly to simulate the missing dates.
The key here is that you need to convert your date column into datetime if it isn't already, and make sure your df is sorted by the date.
Then you can group by ISO year/week and take the last value. If you run this repeatedly, you'll see that the selected dates can change when the row that was dropped happened to be the last day of its week.
Based on that:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['date'] = pd.date_range(start='04-18-2022', periods=40, freq='D')
df['return'] = np.random.uniform(size=40)

# Keep 90 percent of the records so we can see what happens when some days are missing
df = df.sample(frac=.9)

# In case your dates are actually strings
df['date'] = pd.to_datetime(df['date'])

# Make sure they are sorted from oldest to newest
df = df.sort_values(by='date')

# Group by ISO year and week, keeping the last row of each week
df = df.groupby([df['date'].dt.isocalendar().year,
                 df['date'].dt.isocalendar().week], as_index=False).last()
print(df)
Output
date return
0 2022-04-24 0.299958
1 2022-05-01 0.248471
2 2022-05-08 0.506919
3 2022-05-15 0.541929
4 2022-05-22 0.588768
5 2022-05-27 0.504419
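As a hedged alternative (same sample frame as above), resample can also pick the last observation per week; note that it labels each row with the week-end date rather than the actual last observed date, so the groupby approach is preferable if you need the original dates:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['date'] = pd.date_range(start='04-18-2022', periods=40, freq='D')
df['return'] = np.random.uniform(size=40)
df = df.sample(frac=.9)

# 'W' bins end on Sunday; .last() takes the last non-missing value in each bin,
# and dropna() removes weeks in which every day happened to be missing.
weekly = df.set_index('date').sort_index().resample('W').last().dropna()
print(weekly)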

Python: How to filter a DataFrame of dates in Pandas by a particular date within a window of some days?

I have a DataFrame of dates and would like to filter for a particular date +- some days.
import pandas as pd
import numpy as np
import datetime
dates = pd.date_range(start="08/01/2009",end="08/01/2012",freq="D")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])
If I select, let's say, date 2010-08-03 and a window of 5 days, the output would be similar to:
>>>
Power
2010-07-29 713.108020
2010-07-30 1055.109543
2010-07-31 951.159099
2010-08-01 1350.638983
2010-08-02 453.166697
2010-08-03 1066.859386
2010-08-04 1381.900717
2010-08-05 107.489179
2010-08-06 1195.945723
2010-08-07 1209.762910
2010-08-08 349.554492
N.B.: The original problem I am trying to accomplish is under Python: Filter DataFrame in Pandas by hour, day and month grouped by year
The function I created to accomplish this is filterDaysWindow and can be used as follows:
import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="D")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filterDaysWindow(df, date, daysWindow):
    """
    Filter a DataFrame by a date within a window of days

    :type df: DataFrame
    :param df: DataFrame of dates
    :type date: datetime.date
    :param date: date to focus on
    :type daysWindow: int
    :param daysWindow: Number of days to perform the days window selection
    :rtype: DataFrame
    :return: Returns a DataFrame with dates within date +- daysWindow
    """
    dateStart = date - datetime.timedelta(days=daysWindow)
    dateEnd = date + datetime.timedelta(days=daysWindow)
    return df[dateStart:dateEnd]

df_filtered = filterDaysWindow(df, datetime.date(2010, 8, 3), 5)
print(df_filtered)
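A boolean-mask variant (a sketch under the same assumptions, with the centre date as a Timestamp and the window as a Timedelta) avoids relying on label slicing and also works on an unsorted index:
import numpy as np
import pandas as pd

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="D")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

centre = pd.Timestamp(2010, 8, 3)
window = pd.Timedelta(days=5)
# Keep rows whose index falls within +-5 days of the chosen date
df_filtered = df[(df.index >= centre - window) & (df.index <= centre + window)]
print(df_filtered)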

How can I sort a DataFrame by date ddMMMyyyy?

I have a sample dataframe as follows.
How can I sort it by the index chronologically (by year first) instead of by month?
test = pd.DataFrame([1,2,3], index=['28FEB1993','28FEB1994','30MAR1993'], columns=['value'])
I would like to have the following DataFrame as the result:
test = pd.DataFrame([1,2,3], index=['28FEB1993','30MAR1993','28FEB1994'], columns=['value'])
I think I'm stuck at how to parse the ddMMMyyyy date format into a datetime object.
Thanks a ton!
You can use strptime:
import numpy as np
from datetime import datetime

test.index = np.array([datetime.strptime(s, "%d%b%Y") for s in test.index.values])
test.sort_index()
#             value
# 1993-02-28      1
# 1993-03-30      3
# 1994-02-28      2
Or, as suggested by @chrisb:
test.index = pd.to_datetime(test.index, format="%d%b%Y")
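For completeness, a minimal end-to-end sketch of the pd.to_datetime route (assuming the same sample frame) could look like this:
import pandas as pd

test = pd.DataFrame([1, 2, 3],
                    index=['28FEB1993', '28FEB1994', '30MAR1993'],
                    columns=['value'])
# Parse the ddMMMyyyy strings into a DatetimeIndex, then sort chronologically
test.index = pd.to_datetime(test.index, format="%d%b%Y")
test = test.sort_index()
print(test)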

Date difference in hours (Excel data import)?

I need to calculate the hour difference between two dates (format: year-month-dayTHH:MM:SS; I could also potentially transform the data to format year-month-day HH:MM:SS) from a huge Excel file. What is the most efficient way to do it in Python? I have tried to use a datetime/time object (TypeError: expected string or buffer), Timestamp (ValueError) and DataFrame (does not give the hour result).
Excel File:
Order_Date Received_Customer Column3
2000-10-06T13:00:58 2000-11-06T13:00:58 1
2000-10-21T15:40:15 2000-12-27T10:09:29 2
2000-10-23T10:09:29 2000-10-26T10:09:29 3
..... ....
datetime/time object code (TypeError: expected string or buffer):
import pandas as pd
import time as t
data=pd.read_excel('/path/file.xlsx')
s1 = (data,['Order_Date'])
s2 = (data,['Received_Customer'])
s1Time = t.strptime(s1, "%Y:%m:%d:%H:%M:%S")
s2Time = t.strptime(s2, "%Y:%m:%d:%H:%M:%S")
deltaInHours = (t.mktime(s2Time) - t.mktime(s1Time))
print deltaInHours, "hours"
Timestamp (ValueError) code:
import pandas as pd
import datetime as dt
data=pd.read_excel('/path/file.xlsx')
df = pd.DataFrame(data,columns=['Order_Date','Received_Customer'])
df.to = [pd.Timestamp('Order_Date')]
df.fr = [pd.Timestamp('Received_Customer')]
(df.fr-df.to).astype('timedelta64[h]')
DataFrame (does not return the desired result)
import pandas as pd
data=pd.read_excel('/path/file.xlsx')
df = pd.DataFrame(data,columns=['Order_Date','Received_Customer'])
df['Order_Date'] = pd.to_datetime(df['Order_Date'])
df['Received_Customer'] = pd.to_datetime(df['Received_Customer'])
answer = df.dropna()['Order_Date'] - df.dropna()['Received_Customer']
answer.astype('timedelta64[h]')
print(answer)
Output:
0 24 days 16:38:07
1 0 days 00:00:00
2 20 days 12:39:52
dtype: timedelta64[ns]
Should be something like this:
0 592 hour
1 0 hour
2 492 hour
Is there another way to convert timedelta64[ns] into hours than answer.astype('timedelta64[h]')?
In each of your attempts you mixed up datatypes and methods. While I don't have the time to explain every mistake explicitly, I want to help you by providing a (probably non-optimal) solution.
I built the solution out of your previous tries and combined it with knowledge from other questions such as:
Convert a timedelta to days, hours and minutes
Get total number of hours from a Pandas Timedelta?
Note that I used Python 3. I hope that my solution guides your way. My solution is this:
import pandas as pd
import numpy as np

data = pd.read_excel('C:\\Users\\nrieble\\Desktop\\check.xlsx', header=0)
# Skip empty cells, which show up as short strings such as 'nan'
start = [pd.to_datetime(e) for e in data['Order_Date'] if len(str(e)) > 4]
end = [pd.to_datetime(e) for e in data['Received_Customer'] if len(str(e)) > 4]
# Element-wise difference, then convert each timedelta to a float number of hours
delta = np.asarray(end) - np.asarray(start)
deltainhours = [e / np.timedelta64(1, 'h') for e in delta]
print(deltainhours, "hours")
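To answer the last question directly: yes, a vectorized sketch (assuming the same two columns) keeps everything in the DataFrame and divides the timedelta column by a one-hour Timedelta to get float hours:
import pandas as pd

df = pd.read_excel('/path/file.xlsx', usecols=['Order_Date', 'Received_Customer'])
df['Order_Date'] = pd.to_datetime(df['Order_Date'])
df['Received_Customer'] = pd.to_datetime(df['Received_Customer'])

# Dividing a timedelta64 column by a 1-hour Timedelta yields hours as floats;
# .dt.total_seconds() / 3600 on the difference is an equivalent alternative.
df['Hours'] = (df['Received_Customer'] - df['Order_Date']) / pd.Timedelta(hours=1)
print(df['Hours'])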
