I have a CSV file (called "cameradata") with columns MeetingRoomID and Time (there are more columns, but they should not be needed).
I would like to get the number of occurrences of a certain MeetingRoomID ("14094020", from the column "MeetingRoomID") during one day. Luckily, the CSV file only contains timestamps from one day in the "Time" column. One problem is that the timestamps are in the datetime format %H:%M:%S and I want to categorize the occurrences by the hour they occurred (between 07:00-18:00).
The goal is to have the occurrences linked to the hours of the timestamps, so that I can plot a bar plot with x = "timestamps (hourly)" and y = "a DataFrame/Series that maps the given MeetingRoomID to the hour it was used".
How can I get a function for my y-axis that understands that the value count for ID 14094020 and the timestamps are connected?
So far I've come up with something like this:
y = cameradata.set_index('Time').resample('H')
cameradata['MeetingRoomID'].value_counts()[14094020]
My code seems to work if I run the two parts separately, but I do not know how to connect them in a syntax-friendly way.
Clarification:
The code: cameradata['MeetingRoomID'].value_counts().idxmax() revealed the ID with the most occurrences, so I think I'm onto something there.
Grateful for your help!
This is what the printed DataFrame looks like; 'Tid' is time and 'MätplatsID' is what I called MeetingRoomID.
For some reason the "Time" column got a made-up year and month added to it when I converted it to datetime. I converted it to datetime with: kameradata['Tid'] = pd.to_datetime(kameradata['Tid'], format='%H:%M:%S')
This is an example of what the output should look like in the end.
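A minimal sketch of one way to connect the two pieces, assuming the converted 'Tid' column and integer 'MätplatsID' values (both names taken from the DataFrame described above); the sample rows are hypothetical:

import pandas as pd

# Hypothetical data standing in for the real CSV; the parsed 'Tid' values
# carry a made-up date, but only the hour matters for the grouping
kameradata = pd.DataFrame({
    'MätplatsID': [14094020, 14094020, 99999999, 14094020],
    'Tid': pd.to_datetime(['08:15:00', '08:45:00', '09:10:00', '13:05:00'],
                          format='%H:%M:%S'),
})

# Filter to the one meeting room first, then count rows per hour
hourly_counts = (
    kameradata.loc[kameradata['MätplatsID'] == 14094020]
    .set_index('Tid')
    .resample('H')
    .size()
)

# Keep only the 07:00-18:00 window and plot
hourly_counts.between_time('07:00', '18:00').plot(kind='bar')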
I am analyzing some shared data sheets. I set some filters to help find rows that do not follow the criteria:
filter3 = df[(df['Currency'].isnull())]
filter1= df[(df["Date"] > '2021-06-16' ) & (df['Subtype'].isnull())]
However, I have been trying to add a filter so that, when running the script, I can find rows that do not follow this date format: %d/%m/%Y
How can I implement this filter? In the end, what I would like to do is inform the person adding rows to that shared report that he/she typed the incorrect format.
Thank you!
Here is an example of how to use the errors parameter of pd.to_datetime. Any date value that doesn't adhere to the format is returned as null (NaT). In this case we use .loc to select the null (invalid) dates.
import pandas as pd

df = pd.DataFrame({'dates': ['2021-06-16', '11/08/20']})
# Rows where the conversion produced NaT do not match the expected format
df.loc[pd.to_datetime(df['dates'], errors='coerce', format='%Y-%m-%d').isnull()]
Output
dates
1 11/08/20
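The same idea applied to the %d/%m/%Y format from the question, as a minimal sketch (the column name 'Date' is assumed from the filters above, and the rows are hypothetical):

import pandas as pd

# Hypothetical rows mimicking the shared sheet
df = pd.DataFrame({'Date': ['16/06/2021', '2021-06-17', '31/12/2021']})

# Any value that fails to parse as %d/%m/%Y becomes NaT and is flagged as invalid
invalid = df.loc[pd.to_datetime(df['Date'], errors='coerce', format='%d/%m/%Y').isnull()]
print(invalid)  # only the row with '2021-06-17' is reported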
I have a Dask data frame that has two columns, a date and a value.
I store it like so:
ddf.to_parquet('/some/folder', engine='pyarrow', overwrite=True)
I'm expecting Dask to store the date column as a date in Parquet, but when I query it with Apache Drill I get 16-digit numbers (I would say timestamps) instead of dates. For example I get:
1546300800000000 instead of 2019-01-01
1548979200000000 instead of 2019-02-01
Is there a way to tell Dask to store columns as dates? How can I run a select with Apache Drill and get the dates? I tried using SELECT CAST in Drill but it doesn't convert the numbers to dates.
Not sure if this is relevant for you, but it seems that you are interested only in the date value (ignoring hours, minutes, etc.). If so, you can explicitly convert the timestamp information into dates using .dt.date.
import pandas as pd
import dask.dataframe as dd
sample_dates = [
    '2019-01-01 00:01:00',
    '2019-01-02 05:04:02',
    '2019-01-02 15:04:02'
]
df = pd.DataFrame(zip(sample_dates, range(len(sample_dates))), columns=['datestring', 'value'])
ddf = dd.from_pandas(df, npartitions=2)
# convert to timestamp and calculate as unix time (relative to 1970)
ddf['unix_timestamp_seconds'] = (ddf['datestring'].astype('M8[s]') - pd.to_datetime('1970-01-01')).dt.total_seconds()
# convert to timestamp format and extract dates
ddf['datestring'] = ddf['datestring'].astype('M8[s]').dt.date
ddf.to_parquet('test.parquet', engine='pyarrow', write_index=False, coerce_timestamps='ms')
For the time conversion, you can use .astype or dd.to_datetime (see the answers to this question). There is also a very similar question and answer, which suggests that ensuring the timestamp is downcast to ms resolves the issue.
Playing around with the values you provided, it's possible to see that the core problem is a mismatch in the scaling of the variable:
# both yield: Timestamp('2019-01-01 00:00:00')
pd.to_datetime(1546300800000000*1000, unit='ns')
pd.to_datetime(1546300800000000/1000000, unit='s')
If memory serves, Drill uses the old non-standard INT96 timestamps, which were never properly supported by Parquet. A Parquet timestamp is essentially a UNIX timestamp, stored as an int64 with various precisions. Drill must have a function to correctly convert this to its internal format.
I am no expert on Drill, but it seems you need to first divide your integer by the appropriate power of 10 (see this answer). This syntax is probably wrong, but it might give you the idea:
SELECT TO_TIMESTAMP(CAST(mycol AS FLOAT) / 1000) FROM ...;
Here's a link to the Drill docs about the TO_TIMESTAMP() function (https://drill.apache.org/docs/data-type-conversion/#to_timestamp). I think @mdurant is correct in his approach.
I would try either:
SELECT TO_TIMESTAMP(<date_col>) FROM ...
or
SELECT TO_TIMESTAMP((<date_col> / 1000)) FROM ...
I have a pandas DataFrame with a column of order dates formatted like this: 2019-12-26.
However, when I take the max of this column it gives 2019-12-12, while the actual maximum is 2019-12-26. It makes sense because my date format is Dutch and the max() function uses the 'American' format (correct me if I'm wrong).
This means that my calculations aren't correct.
How can I change the way the function calculates? Or, if that's not possible, how can I change the format of my date column so the calculations are correct?
[In] df['orderdate'] = df['orderdate'].astype('datetime64[ns]')
print(df["orderdate"].max())
[Out] 2019-12-12 00:00:00
Thank you!
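A minimal sketch of one possible fix, assuming the underlying strings are actually in year-day-month order (which would explain why day and month end up swapped); the sample values are hypothetical:

import pandas as pd

# Hypothetical order dates in year-day-month order
df = pd.DataFrame({'orderdate': ['2019-26-12', '2019-12-12', '2019-20-11']})

# Parse with an explicit format instead of letting pandas guess the order
df['orderdate'] = pd.to_datetime(df['orderdate'], format='%Y-%d-%m')
print(df['orderdate'].max())  # 2019-12-26 00:00:00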
I'm reading a SQL query result into a DataFrame and using its columns.
query = "SELECT count(*) as numRecords, YEARWEEK(date) as weekNum FROM events GROUP BY YEARWEEK(date)"
df = pd.read_sql(query, connection)
date = df['weekNum']
records = df['numRecords']
The date column (weekNum), which holds int64 values, looks like this:
...
201850
201851
201852
201901
201902
...
How can I convert these values to real dates (instead of int64), so that when I plot them the axis does not break because of the year change?
I'm using matplotlib.
All you need to do is use:
pd.to_datetime(date, format='%Y%W')
Edited:
It gave an error saying that a day must be specified to convert to datetime. To tackle that, we attach a '-1' to the end (which corresponds to Monday; you can use any value from 0 to 6, where 0 is Sunday and 6 is Saturday).
Then grab the 'day of the week' using an additional %w in the format and it will work:
pd.to_datetime(date.apply(lambda x: str(x) + '-1'), format='%Y%W-%w')
Remember that to perform any of the above operations, the values in the date series must be string objects. If they are not, you can easily convert them using date.astype(str) and then perform these operations.
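A minimal runnable sketch of the whole conversion, using the sample YEARWEEK values from the question:

import pandas as pd

# Sample YEARWEEK values spanning a year boundary, as in the question
date = pd.Series([201850, 201851, 201852, 201901, 201902])

# Append '-1' (Monday) so each YEARWEEK maps to a concrete day, then parse
parsed = pd.to_datetime(date.astype(str) + '-1', format='%Y%W-%w')
print(parsed.dt.date.tolist())
# [datetime.date(2018, 12, 10), datetime.date(2018, 12, 17),
#  datetime.date(2018, 12, 24), datetime.date(2019, 1, 7),
#  datetime.date(2019, 1, 14)]

Plotting against the parsed datetime values then gives matplotlib a continuous axis, so it no longer breaks at the year change.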