Joining two one-column dataframes generated with Spark - Python

I'm working with pyspark and pandas in Databricks. I'm generating the following two dataframes:
from datetime import datetime, timedelta
import pandas as pd

start_date = datetime.today() - timedelta(days=60)
end_date = datetime.today()
date_list = pd.date_range(start=start_date, end=end_date).strftime('%Y-%m-%d').tolist()
date_df = spark.createDataFrame(date_list, 'string').toDF("date")
and
import numpy as np

random_list = np.random.normal(loc=50, scale=10, size=61)
random_list = [round(i) for i in random_list]
integer_df = spark.createDataFrame(random_list, 'integer').toDF("value")
So I have two dataframes of one column each ("date" and "value"), of the same length, and I'd like to "merge" them into one dataframe.
I've tried this:
integer_df = pd.concat(date_df)
which returns the following error: first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"
and this:
test_df = pd.concat([integer_df, date_df], axis=1, join='inner')
which returns the following error: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
Mostly I'd like to understand these errors.

From what I can see, you are not converting the objects correctly; for example, you are trying to concatenate a Spark DataFrame to a pandas DataFrame object.
first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"
This error was raised because you passed the wrong type of object to concatenate. You should use pandas-on-Spark objects, or plain pandas objects if you are going to work in pandas.
So to fix your first error, just follow that convention and work with the objects of the given library.
Something like this (or maybe just use pd.Series() or pd.DataFrame()):
date_df = spark.createDataFrame(date_list, 'string').toPandas()
# .toDF("date") is dropped here: it only renames the column, and you can rename in pandas after .toPandas()
integer_df = spark.createDataFrame(random_list, 'integer').toPandas()
After that, try pd.concat([...]) with the resulting pandas objects.
Your second error was raised because pd.concat only accepts pandas Series and DataFrame objects in the list you pass it; since you passed a PySpark DataFrame, pandas does not recognize it and reads it as a plain list.
So again, use the correct objects for the library, or convert to NumPy arrays if you want something more efficient.
Hope this helps.
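For reference, here is a minimal end-to-end sketch of that approach, assuming the date_list and random_list from the question and a live spark session (as Databricks provides):

import pandas as pd

# Convert both Spark DataFrames to pandas (fine here, the data is tiny).
# A single-column DataFrame built from a list gets the default column
# name 'value', so the date column is renamed afterwards.
date_pdf = spark.createDataFrame(date_list, 'string').toPandas().rename(columns={'value': 'date'})
integer_pdf = spark.createDataFrame(random_list, 'integer').toPandas()

# Both operands are now pandas objects, so pd.concat works as expected.
test_df = pd.concat([date_pdf, integer_pdf], axis=1)

Both inputs have the same length (61 rows) and a default RangeIndex, so the axis=1 concatenation lines them up row by row.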

Related

Creating an empty dataframe in pandas with column of type datetime64[ns, Europe/Paris]

I want to create an empty dataframe in pandas with a single column 'time'. I also want it to be of type datetime64[ns, 'Europe/Paris'], i.e. to be able to store timezone-aware timestamps.
I actually need to return an empty dataframe under certain conditions, but I still want to be able to perform some basic operations that require the type to be defined (for instance, merging it with other similar dataframes, performing a group by on the column, and so on...).
For now, the simple pd.DataFrame(columns=['time']) creates a column of type object.
I tried to use pd.DataFrame({'time': pd.Series(dtype=np.datetime64)}), but I get ValueError: The 'datetime64' dtype has no unit. Please pass in 'datetime64[ns]' instead. (which I cannot pass by the way). Plus, it would not provide me the appropriate timezone.
Any idea how to do that?
You can try creating an empty DataFrame with one column called time with the default object type, and then parsing the column into the correct datetime type and timezone. See my code below; I hope this answers your question:
import pandas as pd

df = pd.DataFrame({"time": []})
df["time"] = pd.to_datetime(df["time"]).dt.tz_localize('UTC').dt.tz_convert('Europe/Paris')

DataFrame with DateTimeIndex from Django model

My goal is to create a pandas dataframe with a datetimeindex from a django model. I am using the django-pandas package for this purpose, specifically, the 'to_timeseries()' method.
First, I used the .values() method on my qs. This still returns a qs, but it contains dictionaries. I then used to_timeseries() to create my dataframe. Everything here worked as expected: the pivot, the values, etc. But my index is just a list of strings. I don't know why.
I have been able to find a great many manipulations in the pandas documentation, including how to turn a column or series into datetime objects. However, my index is not a Series, it is an Index, and none of these methods work. How do I make this happen? Thank you.
df = mclv.to_timeseries(index='day',
                        pivot_columns='medicine',
                        values='takentoday',
                        storage='wide')

df = df['day'].astype(Timestamp)
    raise TypeError(f"dtype '{dtype}' not understood")
TypeError: dtype '<class 'pandas._libs.tslibs.timestamps.Timestamp'>' not understood

AttributeError: 'DataFrame' object has no attribute 'DateTimeIndex'

df = pd.DatetimeIndex(df, inplace=True)
TypeError: __new__() got an unexpected keyword argument 'inplace'

TypeError: Cannot cast DatetimeIndex to dtype
etc...
Correction & update: django-pandas did work as its authors expected. The problem was my misunderstanding of what it was doing, and how.

Pandas merge not working due to a wrong type

I'm trying to merge two dataframes using
grouped_data = pd.merge(grouped_data, df['Pattern'].str[7:11],
                        how='left', left_on='Calc_DRILLING_Holes',
                        right_on='Calc_DRILLING_Holes')
But I get an error saying: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
What could be the issue here? The original dataframe that I'm trying to merge into was created from a much larger dataset with the following code:
import pandas as pd

raw_data = pd.read_csv(r"C:\Users\cherp2\Desktop\test.csv")
data_drill = raw_data.query('Activity == "DRILL"')
grouped_data = (data_drill
                .groupby([data_drill['PeriodStartDate'].str[:10], 'Blast'])
                ['Calc_DRILLING_Holes']
                .sum()
                .reset_index()
                .sort_values('PeriodStartDate'))
What do I need to change here to make it a regular dataframe?
If I try to convert either of them to a dataframe using .to_frame(), I get an error saying that 'DataFrame' object has no attribute 'to_frame'.
I'm so confused as to what kind of data type it is.
Both objects in a call to pd.merge need to be DataFrame objects. Here the right-hand operand, df['Pattern'].str[7:11], is a Series, which is exactly what the error message complains about. If grouped_data is a Series too, promote it to a DataFrame by passing pd.DataFrame(grouped_data) instead of just grouped_data.
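A self-contained sketch of the fix on the Series side (the sample 'Pattern' values and the renamed key column are invented for illustration):

import pandas as pd

grouped_data = pd.DataFrame({'Calc_DRILLING_Holes': ['1234', '5678']})
df = pd.DataFrame({'Pattern': ['PATTERN1234X', 'PATTERN5678X']})

# Slicing a column yields a Series; .to_frame() promotes it to a
# one-column DataFrame, renamed here to match the merge key.
pattern = df['Pattern'].str[7:11].to_frame('Calc_DRILLING_Holes')

merged = pd.merge(grouped_data, pattern, how='left', on='Calc_DRILLING_Holes')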

Pandas datatype change within a function

General background
I've written a function which incorporates a MySQL query, with some munging on the returned data (pulled into a pandas df).
from sqlalchemy import create_engine
import pandas as pd

enginedb = create_engine("mysql+mysqlconnector://user:pswd@10.0.10.26:3306/db",
                         encoding='latin1')
query = """Select blah blah"""
df = pd.read_sql(query, enginedb)
This works fine - the query is a significant one with multiple joins etc. However, it transpired that for a certain lot within the db the datatypes were off: for almost all 'normal' lots the column datatypes were int64, some object, a datetime64[ns]... but for one lot (so far), all but the datetime column were returning as object.
Issue
I need to do a stack - one of the columns holds lists, and I've got some code to take each item of the list and stack them down, row by row:
cols = list(df)
cols = cols[:-1]  # every column except the last one ('data')
df_stack = df.set_index(cols)['data'].apply(pd.Series).stack()
The problem is this doesn't work for the 'odd' lot with the non-standard datatypes (the reason for the non-standard datatypes is an upstream ETL process, which I can't affect).
The exact error is:
'Series' object has no attribute 'stack'
Consequently I incorporated an if/else statement, checking whether the dtype of one of the cols is incorrect and, if so, changing it:
if df['id'].dtype == 'int64':
    df_stack = df.set_index(cols)['data'].apply(pd.Series).stack()
    df_stack = df_stack.reset_index()
else:
    df_stack = df.apply(pd.to_numeric, errors='coerce')
    # it can't be more specific than all the columns, as there are a LOT
But this has no effect - I've included print-out statements of df.dtypes and df_stack.dtypes in the function (the one containing the query and subsequent munging), and the dtypes are unchanged.
Why is this?
EDIT
I've added a picture to show the code (at right) which attempts to catch the incorrectly-dtyped lot (12384), and the print-outs before and after the pd.to_numeric call (both show only object columns, no numeric ones).
My underlying question has two parts:
What would cause 'Series' object has no attribute 'stack'? (more fundamentally than a wrong datatype - or at least, why is the datatype an issue?)
Why would pd.to_numeric not cause any change here?
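Without seeing the data this is only a guess, but one cause that fits both symptoms is the 'data' column arriving as strings that merely look like lists: the expand/stack step then has nothing list-like to expand, and object columns need converting and assigning back column by column. A small sketch of the idea, with invented sample values:

import ast
import pandas as pd

# Two frames that differ the way the 'normal' and 'odd' lots might:
# real lists in one, string representations of lists in the other.
good = pd.DataFrame({'id': [1, 2], 'data': [[10, 20], [30, 40]]})
odd = pd.DataFrame({'id': ['1', '2'], 'data': ['[10, 20]', '[30, 40]']})

# Inspect what the cells actually contain before stacking.
print(good['data'].map(type).unique())  # [<class 'list'>]
print(odd['data'].map(type).unique())   # [<class 'str'>]

# If the lists arrived as strings, parse them before the expand/stack step
# (assumes the strings are valid Python literals).
odd['data'] = odd['data'].map(ast.literal_eval)

# pd.to_numeric returns a new object; assign it back column by column,
# otherwise the original frame's dtypes never change.
odd['id'] = pd.to_numeric(odd['id'], errors='coerce')
print(odd.dtypes)  # id is now numeric; data still holds lists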

Equivalent of R's rbindlist function in Python

First I am creating empty lists based on the length of num_vars and storing the output of each loop iteration in one list.
After that I want to combine all the outputs and convert them into a pandas data frame.
In R we can simply use rbindlist to combine the list objects.
I used the following Python code for that:
ests_list = [[] for i in range(num_vars)]
for i in range(num_vars):
    for j in range(1, num_vars + 1):
        ests_list[i] = pd.merge(df1,
                                df2,
                                how='left',
                                on=eval('combine%s' % j + '_lvl'))
pd.concat(ests_list)
When I tried the above syntax it threw the following error:
TypeError: cannot concatenate object of type "<class 'list'>"; only
pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
Can anyone please help me solve this issue?
Thanks in advance.
I found a solution for my problem:
ests_list = []
for i in range(1, num_vars):
    ests_list.append(df1.merge(df2, how='left', on=eval("combine%s" % i + "_lvl")))
pd.concat(ests_list)
I created an empty list and appended each loop's output to it.
Then I combined all the DataFrames in the list with pd.concat, which gives the output as a pandas data frame.
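Stripped to its essentials, the Python counterpart of R's rbindlist(list_of_dfs) is pd.concat(list_of_dfs); a minimal self-contained sketch with invented columns:

import pandas as pd

# Build one DataFrame per iteration and collect them in a list...
pieces = []
for i in range(3):
    pieces.append(pd.DataFrame({'var': ['combine%s' % i], 'est': [i * 0.5]}))

# ...then bind the rows together, like rbindlist() in R.
result = pd.concat(pieces, ignore_index=True)
print(result)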
