Converting a Pandas Series to Dataframe - python

I tried to do this:
get_sent_score_neut_df = pandas.Series(get_sent_score_neut).to_frame(name='sentiment-neutral').reset_index().apply(lambda x: float(x))
When I try to merge/join it with another DataFrame created the same way, the error I get is:
AttributeError: 'Series' object has no attribute '_join_compat'
Is there a way to fix that?
That's the line of code I used to merge/join them:
sentMerge = pandas.DataFrame.join(get_sent_score_pos_df, get_sent_score_neut_df)
Besides: I have tried to rename the index with `.reset_index(name='xyz')` (assigning column names to a pandas Series), which causes my IDE to respond with "unexpected argument".
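A likely cause: .apply(lambda x: float(x)) collapses the frame back into a Series, so one of the join operands is no longer a DataFrame. A minimal sketch of a fix, with hypothetical stand-in data (the real get_sent_score_neut isn't shown), is to cast the column with astype(float) so the object stays a DataFrame:
import pandas as pd

# Hypothetical stand-in for get_sent_score_neut
get_sent_score_neut = {'sent_1': 0.12, 'sent_2': 0.48}

# Keep the result a DataFrame: cast the score column with astype(float)
# instead of .apply(lambda x: float(x)), which returns a Series.
get_sent_score_neut_df = (
    pd.Series(get_sent_score_neut)
      .to_frame(name='sentiment-neutral')
      .reset_index()
)
get_sent_score_neut_df['sentiment-neutral'] = get_sent_score_neut_df['sentiment-neutral'].astype(float)

# With both operands built this way, DataFrame.join works again, e.g.:
# sentMerge = get_sent_score_pos_df.join(get_sent_score_neut_df, lsuffix='_pos')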

Related

Converting series from pandas to pyspark: need to use "groupby" and "size", but pyspark yields error

I am converting some code from Pandas to pyspark. In pandas, let's imagine I have the following mock dataframe, df:
And in pandas, I define a certain variable the following way:
value = df.groupby(["Age", "Siblings"]).size()
And the output is a series as follows:
However, when trying to convert this to pyspark, an error comes up: AttributeError: 'GroupedData' object has no attribute 'size'. Can anyone help me solve this?
The equivalent of size in pyspark is count:
df.groupby(["Age", "Siblings"]).count()
You can also use the agg method, which is more flexible as it allows you to set column alias or add other types of aggregations:
import pyspark.sql.functions as F
df.groupby('Age', 'Siblings').agg(F.count('*').alias('count'))
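For reference, a minimal runnable sketch, assuming an active Spark session and made-up rows for df (the question's mock data isn't shown):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows standing in for the question's mock dataframe
df = spark.createDataFrame([(20, 1), (20, 1), (25, 0)], ["Age", "Siblings"])

# count() is the pyspark equivalent of pandas' size()
df.groupby("Age", "Siblings").agg(F.count("*").alias("count")).show()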

Pandas merge not working due to a wrong type

I'm trying to merge two dataframes using
grouped_data = pd.merge(grouped_data, df['Pattern'].str[7:11],
                        how='left', left_on='Calc_DRILLING_Holes',
                        right_on='Calc_DRILLING_Holes')
But I get an error saying can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
What could be the issue here? The original dataframe that I'm trying to merge to was created from a much larger dataset with the following code:
import pandas as pd
raw_data = pd.read_csv(r"C:\Users\cherp2\Desktop\test.csv")
data_drill = raw_data.query('Activity =="DRILL"')
grouped_data = data_drill.groupby(
    [data_drill['PeriodStartDate'].str[:10], 'Blast']
)['Calc_DRILLING_Holes'].sum().reset_index().sort_values('PeriodStartDate')
What do I need to change here to make it a regular DataFrame?
If I try to convert either of them to a dataframe using .to_frame(), I get an error saying that 'DataFrame' object has no attribute 'to_frame'.
I'm so confused as to what kind of data type it is.
Both objects in a call to pd.merge need to be DataFrame objects. Is grouped_data a Series? If so, try promoting it to a DataFrame by passing pd.DataFrame(grouped_data) instead of just grouped_data.
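A minimal sketch of the fix with made-up data (in older pandas versions, merge rejects a bare Series on either side):
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})
s = pd.Series(['p', 'q'], name='b')  # a bare Series, as df['Pattern'].str[7:11] is

# pd.merge(left, s, ...) raises the "can not merge DataFrame with ... Series" error;
# promoting the Series to a one-column DataFrame first fixes it:
merged = pd.merge(left, s.to_frame(), how='left', left_index=True, right_index=True)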

How to properly use .loc with a multilevel index in a Pandas Dataframe?

I have this Pandas dataframe with a multilevel index.
I want to access a particular row of this dataframe so I tried
df.loc[(0, 0, '2015-07-01'),:]
But it gives a KeyError for '2015-07-01'. For any of the combination of the three levels of index, it throws the same error. What's wrong with my line of code?
I don't think the datatype of the date is an issue here; it's object, so accessing it as a string should work.
df.index.get_level_values(2)
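Inspecting that level usually reveals the mismatch. A minimal sketch with a hypothetical three-level index like the question's:
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(0, 0, pd.Timestamp('2015-07-01')), (0, 1, pd.Timestamp('2015-07-02'))],
    names=['level0', 'level1', 'date'],
)
df = pd.DataFrame({'value': [1, 2]}, index=idx)

# Check what the third level actually holds; if these are Timestamps,
# a plain string key can raise a KeyError:
print(df.index.get_level_values(2))

# Passing a key whose type matches the level works:
row = df.loc[(0, 0, pd.Timestamp('2015-07-01')), :]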

Getting 'DataFrameGroupBy' object is not callable in Jupyter

I have this csv file from https://www.data.gov.au/dataset/airport-traffic-data/resource/f0fbdc3d-1a82-4671-956f-7fee3bf9d7f2
I'm trying to aggregate with
airportdata = Airports.groupby(['Year_Ended_December'])('Dom_Pax_in','Dom_Pax_Out')
airportdata.sum()
However, I keep getting 'DataFrameGroupBy' object is not callable
and it won't print the data I want.
How to fix the issue?
You need to execute the sum aggregation before extracting the columns:
airportdata_agg = Airports.groupby(['Year_Ended_December']).sum()[['Dom_Pax_in','Dom_Pax_Out']]
Alternatively, if you'd like to ensure you're not aggregating columns you are not going to use:
airportdata_agg = Airports[['Dom_Pax_in','Dom_Pax_Out', 'Year_Ended_December']].groupby(['Year_Ended_December']).sum()
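A third option, as a sketch assuming the same column names: select the columns on the grouped object itself, so only those columns are aggregated:
airportdata_agg = Airports.groupby('Year_Ended_December')[['Dom_Pax_in', 'Dom_Pax_Out']].sum()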

Get result of value_counts() to Excel from Pandas

I have a data frame "df" with a column called "column1". By running the below code:
df.column1.value_counts()
I get output that contains the values in column1 and their frequencies. I want this result in Excel. When I try to do this by running the code below:
df.column1.value_counts().to_excel("result.xlsx",index=None)
I get the below error:
AttributeError: 'Series' object has no attribute 'to_excel'
How can I accomplish the above task?
You are using index=None, but you need the index: it holds the names of the values.
pd.DataFrame(df.column1.value_counts()).to_excel("result.xlsx")
If you go through the documentation, Series has no to_excel method; it applies only to DataFrame.
So you can either save it to another frame first and create the Excel file as:
a = df.column1.value_counts().to_frame()
a.to_excel("result.xlsx")
Or look at Merlin's comment; I think it is the best way:
pd.DataFrame(df.column1.value_counts()).to_excel("result.xlsx")
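Depending on the pandas version, another option (a sketch, assuming the same column name) is to turn the counts into a two-column frame with named columns before exporting:
counts = df.column1.value_counts().rename_axis('column1').reset_index(name='count')
counts.to_excel("result.xlsx", index=False)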
