I had a big table which I sliced into many smaller tables based on their dates:
dfs = {}
for fecha in fechas:
    dfs[fecha] = df[df['date'] == fecha].set_index('Hour')
# now I can access the tables like this:
dfs['2019-06-23'].head()
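As an aside, the same split can be written with a groupby (a sketch assuming the same df and columns):
# one sub-table per distinct date, equivalent to the loop above
dfs = {fecha: g.set_index('Hour') for fecha, g in df.groupby('date')}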
I have made some modifications to the specific table dfs['2019-06-23'] and now I would like to save it on my computer. I have tried to do this in two ways:
#first try:
dfs['2019-06-23'].to_csv('specific/path/file.csv')
#second try:
test=dfs['2019-06-23']
test.to_csv('test.csv')
Both of them raised this error:
TypeError: get_handle() got an unexpected keyword argument 'errors'
I don't know why I get this error and haven't found any reason for it. I have saved many files this way but never had this problem before.
My goal: to be able to save this DataFrame, after my modifications, as a CSV.
If you are getting this error, there are two things to check:
Whether the DataFrame is actually a Series - see (Pandas : to_csv() got an unexpected keyword argument)
Your numpy version. For me, updating to numpy==1.20.1 with pandas==1.2.2 fixed the problem. If you are using Jupyter notebooks, remember to restart the kernel afterwards.
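A quick sanity check for both points (a minimal sketch, assuming the dfs dict from the question):
import pandas as pd
import numpy as np
print(type(dfs['2019-06-23']))         # should be a DataFrame, not a Series
print(pd.__version__, np.__version__)  # mismatched versions can break to_csv()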
In the end, what worked was to wrap the table in pd.DataFrame and then export it as follows:
to_export=pd.DataFrame(dfs['2019-06-23'])
to_export.to_csv('my_table.csv')
That surprised me, because when I checked the type of the table when I got the error, it was a DataFrame. However, this way it works.
I want to find out how a column of data in a matrix correlates with the other columns in the matrix.
The data looks like this:
I use the following code:
selected_product = "5002.0"
df_single_product = df_recommender[selected_product]
df_similar_to_selected_product = df_recommender.corrwith(df_single_product)
df_similar_to_selected_product.head()
The output from the head command is not produced. Instead I get a message saying "Outputs are collapsed". Why is this happening? Is this an error I can trap, or is the code wrong?
Maybe there are too many rows? I am using Visual Studio Code.
OK, I found the answer. The df_single_product variable, which I thought would be a DataFrame, is in fact a Series. To correct that, I made the following change to the code:
df_single_product = df_recommender[selected_product].to_frame()
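For context, selecting a single column of a DataFrame returns a Series, and to_frame() wraps it back into a one-column DataFrame (a minimal sketch with made-up data):
import pandas as pd
demo = pd.DataFrame({'5002.0': [1, 2, 3]})  # hypothetical recommender column
s = demo['5002.0']
print(type(s))             # <class 'pandas.core.series.Series'>
print(type(s.to_frame()))  # <class 'pandas.core.frame.DataFrame'>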
I am writing a script for a daily incremental load process using PySpark and a Hive table which has already been initially loaded with data. Each morning a job will run the script against that table.
I've been trying to use PySpark to create a timestamp filter that compares two timestamps, mod_date_ts and max(mod_date_ts), to show updated records added since the last load, and to save the result to the same or another DataFrame.
I've tried the following syntax:
dw_mod_ts = base_df.select('dw_mod_ts')
max_dw_load_ts = base_df.select(max_('dw_mod_ts'))
inc_df = inc_df.filter(dw_mod_ts >= max_dw_load_ts)
But I keep getting type errors and syntax errors stating that a DataFrame cannot be compared to str, even though I've cast both variables and columns as TimestampType.
inc_df = inc_df.filter(inc_df("dw_mod_ts").cast(DataTypes.DateType) >= max_('dw_mod_ts').cast(DataTypes.DateType))
Also, I keep getting an error stating that the >= operator cannot be used with the current syntax.
I don't have much experience working with PySpark, so any help or suggestions are appreciated.
Assuming the timestamps compare correctly in string form: first build the max_dw_mod_ts variable, and then pass its value to filter to get the final result.
from pyspark.sql import functions as F
# collect the max timestamp to the driver, then filter with a SQL string
max_dw_mod_ts = df.groupBy().agg(F.max('dw_mod_ts')).collect()[0][0]
df = df.filter(f'dw_mod_ts >= "{max_dw_mod_ts}"')
df.show()
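An equivalent using column expressions instead of a SQL string (a sketch under the same assumptions):
from pyspark.sql import functions as F
max_ts = df.agg(F.max('dw_mod_ts')).collect()[0][0]
df = df.filter(F.col('dw_mod_ts') >= F.lit(max_ts))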
I have a dataframe that's the result of importing a csv and then performing a few operations and adding a column that's the difference between two other columns (column 10 - column 9 let's say). I am trying to sort the dataframe by the absolute value of that difference column, without changing its value or adding another column.
I have seen this syntax over and over all over the internet, with indications that it was a success (accepted answers, comments saying "thanks, that worked", etc.). However, I get the error you see below:
df.sort_values(by='Difference', ascending=False, inplace=True, key=abs)
Error:
TypeError: sort_values() got an unexpected keyword argument 'key'
I'm not sure why syntax that I see working for other people is not working for me. There is a lot more going on in the code with other dataframes, so I don't think it's a pandas import problem.
I have moved on and just made a new column that is the absolute value of the difference column, sorted by that, and excluded that column from my export to the worksheet, but I would really like to know how to get it to work the other way. Any help is appreciated.
I'm using Python 3
df.loc[(df.c - df.b).sort_values(ascending=False).index]
This sorts by the difference between "c" and "b" without creating a new column.
I hope this is what you were looking for.
key is an optional argument; it receives the Series being sorted. Note that key was only added to sort_values in pandas 1.1.0, so older versions raise exactly this TypeError. Check your pandas version.
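On pandas versions that predate the key argument, a reindexing workaround gives the same absolute-value sort (a sketch assuming the question's 'Difference' column):
# sort rows by |Difference| descending, without adding a column
df = df.loc[df['Difference'].abs().sort_values(ascending=False).index]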
I'm trying to use the MICE implementation in the statsmodels package to impute values for my columns. I'm unable to figure out exactly how to use it. Whatever I run, it throws the error: ValueError: variable to be imputed has no observed values
Code:
df=pd.read_csv('contacts.csv', engine='c',low_memory=False)
from statsmodels.imputation.mice import MICEData as md
md(df)
What am I doing wrong?
At least one of the columns in the generated DataFrame (and hence the CSV) is empty.
Check the DataFrame; you may have to clean it up/normalize it (see the snippet below).
Also, don't be afraid to look into the code base.
What you are looking for is the _split_indices method of MICEData.
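A quick way to spot the offending all-NaN columns before handing the frame to MICEData (a minimal sketch, assuming the same contacts.csv):
import pandas as pd
df = pd.read_csv('contacts.csv')
print(df.columns[df.isna().all()])  # columns with no observed values
df = df.dropna(axis=1, how='all')   # drop them before imputing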
I have a huge data frame with about 1041507 rows.
I wanted to calculate a rolling median for certain values, under certain categories in my data frame. For this I used a groupby followed by apply:
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
However, this gives me MemoryError: skiplist_insert failed. I will attach the full traceback if needed, but I came across a similar post which says this is a pandas issue that occurs for very large sizes (>~35000), as can be seen here: https://github.com/pydata/pandas/issues/11697
After this I tried a bit of manipulation to get the rolling median by iterating over each group separately:
for index, group in df.groupby(['Category','Subcategory']):
    print pd.rolling_median(group['value'], 7, min_periods=7)
Each group is about 20-25 rows long only. Yet this function fails and shows the same MemoryError after a few iterations.
I ran the code several times, and every time it showed the memory error for different items.
I created some dummy values for anyone to test, here:
import numpy as np
import pandas as pd

index = [x for y in range(25) for x in np.arange(34000)]
sample = pd.DataFrame(np.arange(34000*25), index=index)
for index, group in sample.groupby(level=0):
    try:
        pd.rolling_median(group[0], 7, 7)
    except MemoryError:
        print index  # note which group raised the MemoryError
        print pd.rolling_median(group[0], 7, 7)
If I run rolling_median again after encountering the MemoryError (as you can see in the above code), it runs fine without another exception.
I am not sure how I can calculate my rolling median if it keeps throwing the MemoryError.
Can anyone tell me a better way to calculate the rolling_median, or help me understand the issue here?
The groupby doesn't look right; you should change
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
to
df['rolling_median']=df.groupby(['Category','Subcategory'])['value'].apply(pd.rolling_median,7,min_periods=7)
Otherwise the groupby won't work: df['value'] is a Series containing only the 'value' column, so it can't be grouped by ['Category','Subcategory'], which aren't present.
Also, the groupby result can be shorter than the length of the dataframe, and creating df['rolling_median'] from it will cause a length mismatch.
Hope that helps.
The bug has been fixed in pandas 0.18.0, and the methods rolling_mean() and rolling_median() have since been deprecated.
This was the bug: https://github.com/pydata/pandas/issues/11697
Can be viewed here: http://pandas.pydata.org/pandas-docs/stable/computation.html
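For reference, a sketch of the modern equivalent using the rolling() API (assuming the question's column names; transform keeps the original index, so the assignment aligns):
df['rolling_median'] = (
    df.groupby(['Category', 'Subcategory'])['value']
      .transform(lambda s: s.rolling(7, min_periods=7).median())
)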