Python Pandas: Using corrwith and getting "outputs are collapsed"

I want to find out how one column of data in a matrix correlates with the other columns in the matrix.
The data looks like this:
I use the following code:
selected_product = "5002.0"
df_single_product = df_recommender[selected_product]
df_similar_to_selected_product=df_recommender.corrwith(df_single_product)
df_similar_to_selected_product.head()
The output from the head command is not produced. Instead I get a message saying "Outputs are collapsed". Why is this happening? Is this an error I can trap, or is the code wrong?
Maybe there are too many rows? I am using Visual Studio Code.

OK, I found the answer. The df_single_product variable, which I thought would be a DataFrame, is in fact a Series. To correct that I made the following change to the code:
df_single_product = df_recommender[selected_product].to_frame()
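For reference, corrwith also accepts a plain Series, in which case every column of the frame is correlated against it. A minimal runnable sketch with made-up ratings data (the product IDs here are hypothetical):

import numpy as np
import pandas as pd

# Made-up user-by-product ratings matrix; columns are product IDs
rng = np.random.default_rng(0)
df_recommender = pd.DataFrame(
    rng.integers(1, 6, size=(100, 3)).astype(float),
    columns=["5002.0", "5003.0", "5004.0"],
)

selected_product = "5002.0"
# Correlate the selected product's column with every column of the frame
df_similar = df_recommender.corrwith(df_recommender[selected_product])
print(df_similar.head())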


pandas dataframe to_csv() with get_handle() error [duplicate]

I had a big table which I sliced into many smaller tables based on their dates:
dfs = {}
for fecha in fechas:
    dfs[fecha] = df[df['date'] == fecha].set_index('Hour')

# now I can access the tables like this:
dfs['2019-06-23'].head()
I have made some modifications to the specific table dfs['2019-06-23'] and now I would like to save it on my computer. I have tried to do this in two ways:
#first try:
dfs['2019-06-23'].to_csv('specific/path/file.csv')
#second try:
test=dfs['2019-06-23']
test.to_csv('test.csv')
both of them raised this error:
TypeError: get_handle() got an unexpected keyword argument 'errors'
I don't know why I get this error and haven't found any reason for it. I have saved many files this way but have never had this happen before.
My goal: to be able to save this dataframe as a CSV after my modifications.
If you are getting this error, there are two things to check:
Whether your "DataFrame" is actually a Series (see Pandas : to_csv() got an unexpected keyword argument)
Your numpy version. For me, updating to numpy==1.20.1 with pandas==1.2.2 fixed the problem. If you are using Jupyter notebooks, remember to restart the kernel afterwards.
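A quick way to check which versions are actually installed (a minimal sketch):

import numpy as np
import pandas as pd

print(np.__version__)
print(pd.__version__)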
In the end what worked was to wrap the table in pd.DataFrame and then export it as follows:
to_export=pd.DataFrame(dfs['2019-06-23'])
to_export.to_csv('my_table.csv')
That surprised me, because when I checked the type of the table when I got the error, it was already a DataFrame. However, this way it works.

Showing integer columns as categorical and throwing error in sweetviz compare

If I analyze these two datasets individually, I don't get any error, and I also get the viz for all the integer columns.
But when I try to compare these dataframes, I get the below error.
Cannot convert series 'Web Visit' in COMPARED from its TYPE_CATEGORICAL
to the desired type TYPE_BOOL.
I also tried FeatureConfig to skip it, but to no avail.
pid_compare = sweetviz.compare([pdf,"234_7551009"],[pdf_2,"215_220941058"])
Maintainer of the lib here; this question was also asked on GitHub, but it will be useful to detail the answer here.
After looking at your data provided in the link above, it looks like the first dataframe (pdf) only contains 0 & 1, so it is classified as boolean and cannot be compared against the second one, which is categorical (that one has 0, 1, 2, 3, as you probably know!).
The system will be able to handle it if you use FeatureConfig to force the first dataframe to be considered CATEGORICAL.
I just tried the following and it seems to work, let me know if it helps!
feature_config = sweetviz.FeatureConfig(force_cat = ["Web Desktop Interaction"])
report = sweetviz.compare(pdf, pdf_2, None, feature_config)

Issue with python pandas crosstab

I'm trying to run a crosstab on my dataframe called "d_recent" using the following line of code:
pd.crosstab(d_recent['BinnedAge'], d_recent['APBI'])
The output I am getting is this:
|Age Bin|Brachytherapy|EBRT|IORT|
|-------|-------------|----|----|
|51-60|1|1|0|
|71-80|86|62|11|
|61-70|2578|723|276|
|41-50|9386|2049|1188|
|81-90|13860|3257|2449|
|31-40|7725|2078|1628|
|21-30|1958|615|425|
This is wrong. What it should look like is:
|Age Bin|Brachytherapy|EBRT|IORT|
|-------|-------------|----|----|
|21-30|1|1|0|
|31-40|86|62|11|
|41-50|2578|723|276|
|51-60|9386|2049|1188|
|61-70|13860|3257|2449|
|71-80|7725|2078|1628|
|81-90|1958|615|425|
Any idea what is going on here and how I can fix it? I can tell that the order of the rows in the first table is related to the order in which the specific bins are encountered in my dataframe. I can get the correct output if I sort by age prior to running the crosstab, but this isn't a preferable solution because I need to do this with multiple variables. Thanks!
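One way to get a stable row order without pre-sorting the dataframe is to make the bin column an ordered categorical, so crosstab emits the rows in category order. A sketch, assuming the bin labels from the tables above:

# Bin labels in the desired display order (assumed from the tables above)
age_order = ['21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90']
d_recent['BinnedAge'] = pd.Categorical(d_recent['BinnedAge'],
                                       categories=age_order, ordered=True)
pd.crosstab(d_recent['BinnedAge'], d_recent['APBI'])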

How to perform a rolling_median on a large data frame in Pandas without encountering the skiplist_insert failed error?

I have a huge data frame with about 1041507 rows.
I wanted to calculate a rolling median for certain values, under certain categories in my data frame. For this I used a groupby followed by apply:
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
However, this gives me MemoryError: skiplist_insert failed. I will attach the full traceback if needed, but I came across a similar post which specifies that this is an issue in pandas for very large group sizes (>~ 35000), as can be seen here: https://github.com/pydata/pandas/issues/11697
After this I tried a bit of manipulation to simply get the rolling median by iterating over each group separately:
for index, group in df.groupby(['Category','Subcategory']):
    print pd.rolling_median(group['value'], 7, min_periods=7)
Each group is about 20-25 rows long only. Yet this function fails and shows the same MemoryError after a few iterations.
I ran the code several times, and every time it showed the memory error for different items.
I created some dummy values for anyone to test:
import numpy as np
import pandas as pd

index = []
[index.append(x) for y in range(25) for x in np.arange(34000)]
sample = pd.DataFrame(np.arange(34000*25), index=index)
for index, group in sample.groupby(level=0):
    try:
        pd.rolling_median(group[0], 7, 7)
    except MemoryError:
        # report the failure, then retry the same call
        print 'MemoryError'
        print pd.rolling_median(group[0], 7, 7)
If I run rolling_median again after encountering the MemoryError (as you can see in the above code), it runs fine without another exception.
I am not sure how I can calculate my rolling median if it keeps throwing the MemoryError.
Can anyone tell me a better way to calculate the rolling_median, or help me understand the issue here?
The groupby doesn't look right; you should change
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
to
df['rolling_median']=df.groupby(['Category','Subcategory'])['value'].apply(pd.rolling_median,7,min_periods=7)
Otherwise the groupby won't work: df['value'] is a Series, so the 'Category' and 'Subcategory' columns are not present to group by.
Also, the groupby result is going to be shorter than the dataframe, so creating df['rolling_median'] from it will cause a length mismatch.
Hope that helps.
The bug has been fixed in pandas 0.18.0, and the functions rolling_mean() and rolling_median() are now deprecated.
This was the bug: https://github.com/pydata/pandas/issues/11697
The replacement rolling API is documented here: http://pandas.pydata.org/pandas-docs/stable/computation.html
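For reference, on pandas >= 0.18 the deprecated rolling_median call above can be rewritten with the rolling() method; a sketch assuming the same column names:

# Groupwise rolling median with the modern rolling API;
# transform keeps the original index, so the assignment aligns correctly
df['rolling_median'] = (
    df.groupby(['Category', 'Subcategory'])['value']
      .transform(lambda s: s.rolling(7, min_periods=7).median())
)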

ValueError: Cannot shift with no freq

I have tried to write the following code but I get the error message: "ValueError: Cannot shift with no freq."
I have no idea how to fix it. I tried to google the error message but couldn't find any case similar to mine.
df is a pandas dataframe for which I want to create new columns showing the daily change. The code is shown below. How can I fix the code to avoid the ValueError?
for column_names in df:
    df[column_names + '%-daily'] = df[column_names].pct_change(freq=1).fillna(0)
The problem was that I had dates as the index. Since only weekdays were present, the frequency-based delta did not work. I changed to periods instead:
for column_names in list(df.columns.values):
    df[column_names + '%-daily'] = df[column_names].pct_change(periods=1).fillna(0)
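For completeness, the freq argument does work when the index is a proper DatetimeIndex and the offset matches the data; a sketch with made-up business-day prices:

import pandas as pd

# Made-up prices indexed by business days (weekdays only)
idx = pd.bdate_range('2020-01-01', periods=5)
df = pd.DataFrame({'price': [100.0, 102.0, 101.0, 103.0, 104.0]}, index=idx)

# periods=1 compares consecutive rows regardless of the index
print(df['price'].pct_change(periods=1).fillna(0))

# freq='B' also works here because each row is one business day apart
print(df['price'].pct_change(freq='B').fillna(0))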
