I have a dask dataframe, dfs, with a date column, IR_START_DATE. I'd like to create a new dayofweek column using said date column.
I can achieve this using the following code:
ddf.to_datetime(dfs['IR_START_DATE']).dt.dayofweek.compute()
However, I'm having trouble storing this to its own column.
E.g., I've tried:
Assigning as column:
dfs['yeah'] = ddf.to_datetime(dfs['IR_START_DATE']).dt.dayofweek.compute()
Using map_partitions():
def compute_dow(df):
    date_time = ddf.to_datetime(df['IR_START_DATE']).dt
    dow = date_time.dayofweek
    return dow
dow = dfs.map_partitions(compute_dow)
Using map():
dfs['IR_START_DATE'].map(lambda x: ddf.to_datetime(x['IR_START_DATE']).dt.dayofweek, meta = ('time', 'datetime64[ns]')).compute()
Obviously I'm missing some fundamental piece of dask knowledge here. Please point me in the right direction!
Your first two methods were very close!
This should work:
dfs['yeah'] = ddf.to_datetime(dfs['IR_START_DATE']).dt.dayofweek
Note the lack of compute() - you do not want to make a pandas dataframe, you want the column to refer back to the original data in the normal dask lazy way.
For map_partitions, you could have done
def compute_dow(df):
    date_time = ddf.to_datetime(df['IR_START_DATE']).dt
    df['dow'] = date_time.dayofweek
    return df
Note that we are passing in a dataframe and getting back a dataframe. Also, it would help when calling map_partitions to provide the meta= argument, to reduce the inference dask needs to make (read the method docs).
We have some data in a Delta source which has nested structures. For this example we are focusing on a particular field from the Delta named status which has a number of sub-fields: commissionDate, decommissionDate, isDeactivated, isPreview, terminationDate.
In our transformation we currently read the Delta file in using PySpark, convert the DF to pandas using df.toPandas() and operate on this pandas DF using the pandas API. Once we have this pandas DF we would like to access its fields without using row iteration.
The data in Pandas looks like the following when queried using inventory_df["status"][0] (i.e. inventory_df["status"] is a list):
Row(commissionDate='2011-07-24T00:00:00+00:00', decommissionDate='2013-07-15T00:00:00+00:00', isDeactivated=True, isPreview=False, terminationDate=None)
We have found success using row iteration like:
unit_df["Active"] = [
    not row["isDeactivated"] for row in inventory_df["status"]
]
but we have to use a row iteration each time we want to access data from the inventory_df. This is more verbose and is less efficient.
We would love to be able to do something like:
unit_df["Active"] = [
    not inventory_df["status.isDeactivated"]
]
which is similar to the Spark destructuring approach, and allows accessing all of the rows at once but there doesn't seem to be equivalent pandas logic.
The data within PySpark has a format like status: struct<commissionDate:string,decommissionDate:string,isDeactivated:boolean,isPreview:boolean,terminationDate:string> and we can use the format mentioned above, selecting a subcolumn like df.select("status.isDeactivated").
How can this approach be done using pandas?
This should get you where you want to be. First convert each Row into a one-row DataFrame:
status_frames = inventory_df["status"].apply(lambda x: pd.DataFrame([x.asDict()]))
From here I would do:
inventory_df = pd.concat([pd.concat(status_frames.tolist(), ignore_index=True), inventory_df], axis=1)
This gives you a single pd.DataFrame, now with columns for commissionDate, decommissionDate, etc.
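A minimal sketch of the same flattening idea without building one DataFrame per row. It assumes each status value can be turned into a dict; plain dicts stand in for the PySpark Row objects here (with real Rows you would call row.asDict() first), and the unit column and sample values are made up.

```python
import pandas as pd

# Plain dicts standing in for pyspark Row objects (assumption for this sketch)
status_values = [
    {"commissionDate": "2011-07-24T00:00:00+00:00",
     "decommissionDate": "2013-07-15T00:00:00+00:00",
     "isDeactivated": True, "isPreview": False, "terminationDate": None},
    {"commissionDate": "2015-01-01T00:00:00+00:00",
     "decommissionDate": None,
     "isDeactivated": False, "isPreview": False, "terminationDate": None},
]
inventory_df = pd.DataFrame({"unit": ["a", "b"], "status": status_values})

# With real Rows: [row.asDict() for row in inventory_df["status"]]
status_df = pd.DataFrame(inventory_df["status"].tolist(), index=inventory_df.index)

# Replace the nested column with one flat column per sub-field
flat_df = pd.concat([inventory_df.drop(columns="status"), status_df], axis=1)

# Whole-column access, no per-row iteration in user code
active = ~flat_df["isDeactivated"]
print(active.tolist())
```

Once the sub-fields are ordinary columns, every later lookup is a normal vectorized column operation.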
Update:
I was able to perform the conversion. The next step is to put it back into the ddf.
Following the book's suggestion, this is what I did:
The dates were parsed and stored as a separate variable.
I dropped the original date column using
ddf2=ddf.drop('date',axis=1)
I appended the newly parsed dates using assign
ddf3=ddf2.assign(date=parsed_date)
The new date was added as a new column, in the last position.
Question 1: Is there a more efficient way to insert the parsed_date back into the ddf?
Question 2: What if I have three columns of string dates (date, startdate, enddate)? I could not work out whether a loop would let me avoid recoding each string date column (or my approach may be wrong).
Question 3: For a date in the format 11OCT2020:13:03:12.452, is "%d%b%Y:%H:%M:%S" the right format string? I feel I am missing something for the seconds, because the seconds value above is a decimal number.
Older:
I have the following column in a dask dataframe:
ddf = dd.DataFrame({'date': ['15JAN1955', '25DEC1990', '06MAY1962', '20SEPT1975']})
When it was initially loaded as a dask dataframe, the column was typed as object/string. The Data Science with Python and Dask book suggested loading it with the np.str datatype. However, I could not work out how to convert the column to a date datatype. I tried processing it with dd.to_datetime, and the output showed dtype: datetime64[ns], but when I ran ddf.dtypes, the frame still reported an object dtype.
I would like to change the object dtype to a date so that I can filter or apply a condition later on.
dask.dataframe supports the pandas API for handling datetimes, so this should work:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({"date": ["15JAN1955", "25DEC1990", "06MAY1962", "20SEPT1975"]})
print(pd.to_datetime(df["date"]))
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20
# Name: date, dtype: datetime64[ns]
ddf = dd.from_pandas(df, npartitions=2)
ddf["date"] = dd.to_datetime(ddf["date"])
print(ddf.compute())
# date
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20
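For the several-columns and fractional-seconds cases raised in the update, a hedged pandas-only sketch follows; the frame and its values are made up. The same loop should carry over to a dask dataframe by swapping in dd.to_datetime and assigning back as above, though that is an assumption not exercised here.

```python
import pandas as pd

# Hypothetical frame with three string date columns
df = pd.DataFrame({
    "date": ["15JAN1955", "25DEC1990"],
    "startdate": ["06MAY1962", "01FEB2000"],
    "enddate": ["20SEP1975", "11OCT2020"],
})

# One loop instead of repeating the conversion per column.
# Note: %b expects three-letter month abbreviations, so "SEPT" would not match.
for col in ["date", "startdate", "enddate"]:
    df[col] = pd.to_datetime(df[col], format="%d%b%Y")

# %f consumes the fractional part of the seconds
stamp = pd.to_datetime("11OCT2020:13:03:12.452", format="%d%b%Y:%H:%M:%S.%f")

print(df.dtypes)
print(stamp)
```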
Usually when I am having a hard time computing or parsing, I use apply with a lambda. Some say it is not the best way, but it works. Give it a try.
Sorry, this might be a very simple question, but I am new to Python, JSON and everything. I am trying to filter my Twitter JSON data set on user_location/country_code/gb, but I have no idea how to do this. I have tried several approaches without success. I have attached my data set and some of the code I have used here. I would appreciate any help.
Here is what I did to get the best result so far; however, I do not know how to make it run over the whole data set and print out the matching tweet_id values:
import json
import pandas as pd
df = pd.read_json('example.json', lines=True)
if df['user_location'][4]['country_code'] == 'th':
    print(df.tweet_id[4])
else:
    print('false')
This code shows me the tweet_id: 1223489829817577472
However, I couldn't extend it to the whole data set.
I have tried this code as well, still with no luck:
dataset = df[df['user_location'].isin([ "gb" ])].copy()
print (dataset)
that is what my data set looks like:
I would break the user_location column into multiple columns using the following
df = pd.concat([df, df.pop('user_location').apply(pd.Series)], axis=1)
Running this should give you a column each for the keys contained within the user_location json. Then it should be easy to print out tweet_ids based on country_code using:
df[df['country_code']=='th']['tweet_id']
An explanation of what is actually happening here:
df.pop('user_location') removes the 'user_location' column from df and returns it at the same time
With the returned column, we use the .apply method to apply a function to the column
pd.Series converts each JSON object/dictionary into a Series, and .apply assembles those Series into a DataFrame with one column per key
pd.concat concatenates the original df (now without the 'user_location' column) with the new columns created from the 'user_location' data
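If only one key is needed, a lighter alternative is to pull that key out directly and filter on it. This is a sketch under the assumption that user_location holds dicts with a country_code key; the tweet ids and locations below are made up.

```python
import pandas as pd

# Hypothetical stand-in for the data loaded with pd.read_json(..., lines=True)
df = pd.DataFrame({
    "tweet_id": [101, 102, 103],
    "user_location": [
        {"country_code": "th"},
        {"country_code": "gb"},
        {"country_code": "gb"},
    ],
})

# Extract one key per row, tolerating rows where the location is missing
country = df["user_location"].apply(
    lambda loc: loc.get("country_code") if isinstance(loc, dict) else None
)

# Boolean mask over the whole data set instead of a single row index
gb_ids = df.loc[country == "gb", "tweet_id"].tolist()
print(gb_ids)
```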
Actually, I am having trouble adding data to a subcolumn in a specific format. I have created "Polypoints" as the main column and I want
df["Polypoints"] = [{"__type":"Polygon","coordinates":Row_list}]
where Row_list is the column of dataframe which contains the data in the below format
df["Row_list"] = [[x1,y1],[x2,y2],[x3,y3]
[x1,y1],[x2,y2],[x3,y3]
[x1,y1],[x2,y2],[x3,y3]]
I want to convert the dataframe into json in the format
"Polypoints" :{"__type":"Polygon" ,"coordinates":Row_list}
There are various ways to do that.
One can create a function create_polygon that takes as input the dataframe (df) and the column name (columnname). That would look like the following
def create_polygon(df, columnname):
    return {"__type":"Polygon", "coordinates":df[columnname]}
Considering that the column name will be Row_list, the following will already be enough
def create_polygon(df):
    return {"__type":"Polygon", "coordinates":df['Row_list']}
Then with pandas.DataFrame.apply one can apply it to the column Polypoints as follows
df['Polypoints'] = df.apply(create_polygon, axis=1)
As Itamar Mushkin mentions, one can also do it with a Lambda function as follows
df['Polypoints'] = df.apply(lambda row: {"__type":"Polygon", "coordinates":row['Row_list']} ,axis=1)
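To check that the result serializes into the requested JSON shape, here is a small end-to-end sketch. The coordinates are made-up sample values, and going through to_dict plus json.dumps is just one way to produce the JSON.

```python
import json
import pandas as pd

# Hypothetical data: each row holds a list of [x, y] points
df = pd.DataFrame({
    "Row_list": [
        [[0, 0], [1, 0], [1, 1]],
        [[2, 2], [3, 2], [3, 3]],
    ]
})

# Wrap each row's point list in the polygon dictionary
df["Polypoints"] = df["Row_list"].apply(
    lambda points: {"__type": "Polygon", "coordinates": points}
)

# One JSON object per row, keeping only the Polypoints field
records = df[["Polypoints"]].to_dict(orient="records")
payload = json.dumps(records)
print(payload)
```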
I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a veeeery long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM-DD-YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage of the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
from datetime import datetime  # pd.datetime is no longer available in recent pandas
date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')
df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)
df = df.set_index([ 'date_index' ])
df['New Date'] = dates
df = df.reset_index()
df.head()
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
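A variation on the same convert-each-date-once idea, sketched under the assumption that the Date column holds MM/DD/YYYY strings: parse only the unique values and map them back, which avoids the groupby and the index juggling. The sample frame is made up.

```python
import pandas as pd

# Hypothetical sample with repeated dates
df = pd.DataFrame({"Date": ["01/02/2008", "01/02/2008", "12/31/2009"]})

# Parse each distinct string exactly once
unique_dates = df["Date"].unique()
lookup = pd.Series(
    pd.to_datetime(unique_dates, format="%m/%d/%Y"),
    index=unique_dates,
)

# Map the parsed values back onto every row
df["New Date"] = df["Date"].map(lookup)
print(df)
```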
I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory, this should help. Pandas has a built in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd columns (zero-based positions 0 and 2) as dates.
The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.
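A self-contained sketch of parse_dates, using inline CSV text in place of test.csv; the column names and values are made up for illustration.

```python
import io
import pandas as pd

# Inline stand-in for a CSV file on disk
csv_text = """start,value,end
2008-01-02,10,2008-01-05
2009-12-31,20,2010-01-03
"""

# Columns at positions 0 and 2 are parsed as dates while loading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=[0, 2])
print(df.dtypes)
```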
Following a lead from @pygo's comment, I found that my mistake was to try to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This is slow because, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
With any of the date parsers from the answers above, the parsing still goes through the slow row-by-row for loop. The same happens when the format passed to pd.to_datetime is the one we want (instead of the one the data currently has).
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the current format of the data, it is read really fast into datetime format. Then, using .dt.date, it is fast to change it to the new format without the parser.
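A short sketch of that final approach on made-up sample values. The strftime line is an addition not in the original post, included in case the goal is YYYY-MM-DD text rather than date objects.

```python
import datetime
import pandas as pd

# Hypothetical sample in the current MM/DD/YYYY format
df = pd.DataFrame({"Date": ["01/02/2008", "12/31/2009"]})

# Tell pandas the format the data already has
parsed = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Python date objects, as in the post
df["as_date"] = parsed.dt.date

# Or YYYY-MM-DD strings, if text output is what is needed
df["as_text"] = parsed.dt.strftime("%Y-%m-%d")
print(df)
```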
Thank you to everyone who helped!