error concatenating multiindex pandas dataframes (categorical) - python

L is a list of dataframes with a multiindex on the rows.
pd.concat(L,axis=1)
I get the following error (from the Categorical constructor in categorical.py):
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.
It clearly has something to do with the values in my dataframe, as I can get it to work if I restrict the data in some way.
E.g. all of these work
a=pd.concat(L[0:6],axis=1)
b=pd.concat(L[6:11],axis=1)
c=pd.concat(L[3:9],axis=1)
but
d=pd.concat(L[0:11],axis=1)
fails.
pd.concat([x.iloc[0:1000,:] for x in L[0:11]],axis=1)
also works. I've gone through the edge cases at which it breaks, and for the life of me, I don't see anything that could be offensive in those rows. Does anyone have some ideas on what I should be looking for?
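One way to narrow this down: concat has to union the row indexes, and a stray non-tuple entry hiding in an object-dtype index is exactly the kind of value that trips up the Categorical machinery. A diagnostic sketch (the helper name is hypothetical, not from the question):

```python
# Scan each frame's row index for entries that are not uniform tuples
# before concatenating; a stray scalar in a MultiIndex is suspicious.
import pandas as pd

def index_tuple_widths(df):
    """Return the set of tuple widths seen in df's row index
    (None marks a non-tuple entry)."""
    return {len(t) if isinstance(t, tuple) else None for t in df.index}

idx = pd.MultiIndex.from_tuples([("2020-01-01", "A"), ("2020-01-02", "B")])
clean = pd.DataFrame({"x": [1, 2]}, index=idx)
print(index_tuple_widths(clean))  # {2} -> uniform two-level index, looks fine
```

Running this over every frame in L (anything other than a single width, or a None in the set, points at the culprit) should localize the offending rows faster than bisecting slices.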

I just had this issue too when I did a df.groupby(...).apply(...) with a custom apply function. The error seemed to appear when the results were merged back together after the groupby-apply (so I must have returned something from my custom apply function that it didn't like).
After inspecting the extensive stack trace provided by pytest, I found a mysterious third value had appeared in my index values:
values = Index([(2018-09-01 00:00:00, 'SE0011527613'),
(2018-09-25 00:00:00, 'SE0011527613'),
1535760000000000000], dtype='object')
I have absolutely no idea how it appeared there, but I managed to work around it somehow by avoiding multi-indexed stuff in that particular part of the code (extensive use of reset_index and set_index).
Not sure if this will be of help to anyone, but there you have it. If someone could attempt a minimal reproducible example that would be helpful (I didn't manage to).
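For what it's worth, the workaround described above can be sketched with made-up data: do the fragile step on a flat frame, then rebuild the MultiIndex afterwards.

```python
import pandas as pd

df = (pd.DataFrame({"date": pd.to_datetime(["2018-09-01", "2018-09-25"]),
                    "isin": ["SE0011527613"] * 2,
                    "value": [1.0, 2.0]})
        .set_index(["date", "isin"]))

flat = df.reset_index()                    # operate on a plain RangeIndex
# ... groupby/apply/concat happens here, on the flat frame ...
result = flat.set_index(["date", "isin"])  # restore the MultiIndex
```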

I came across the same error:
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.
However, there is not much material around it. Have a look at what the error log states a bit further up. I have:
TypeError: unorderable types: range() < range()
During handling of the above exception, another exception occurred:
The clue was 'range() < range()', because I had previously had a problem here with pandas interpreting '(1,2)' or '(30,31)' not as strings but as 'range(1,3)' or 'range(30,32)' respectively. Very annoying, as the dtype is still object.
I had to change the column content to lists and/or drop the 'range(x,y)' column.
Hope this helps anybody else who comes across this problem. Cheers!
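A minimal sketch of why range objects in an object-dtype column cause trouble: in Python 3, ranges cannot be ordered, so any code path that tries to sort them (as the Categorical machinery does) raises a TypeError.

```python
r1, r2 = range(1, 3), range(30, 32)
try:
    r1 < r2
except TypeError:
    print("ranges are unorderable")

# Converting to lists (one of the fixes mentioned above) makes them orderable:
print(list(r1) < list(r2))  # True, because [1, 2] < [30, 31]
```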

Related

Populate float_answer for Tapas Weak supervision for aggregation (WTQ). TypeError

I am trying to fine-tune Tapas following the instructions here: https://huggingface.co/transformers/v4.3.0/model_doc/tapas.html#usage-fine-tuning (Weak supervision for aggregation, WTQ), using the dataset from https://www.microsoft.com/en-us/download/details.aspx?id=54253 , which follows the required SQA format: TSV files with most of the named columns. But there is no float_answer column, and as mentioned,
float_answer: the float answer to the question, if there is one (np.nan if there isn’t). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)
Since I am using WTQ, I need the float_answer column. I tried populating float_answer based on answer_text as suggested here, using the parse_question(table, question, mode) function from https://github.com/google-research/tapas/blob/master/tapas/utils/interaction_utils_parser.py . However, I am getting errors.
I copied everything from here and put these args:
.
But, I get this error: TypeError: Parameter to CopyFrom() must be instance of same class: expected language.tapas.Question got str.
1) Can you please help me understand what args I should use, or how else I can populate float_answer?
I am using table_csv and a question whose answer is in the given table.
2) We also tried simply adding a float_answer column with every value set to np.nan. That crashed, too.
Is there a tutorial for WTQ fine-tuning? Thanks!

Showing integer columns as categorical and throwing error in sweetviz compare

If I analyze these two datasets individually, I don't get any error, and I also get the viz of all the integer columns.
But when I try to compare these dataframe, I get the below error.
Cannot convert series 'Web Visit' in COMPARED from its TYPE_CATEGORICAL
to the desired type TYPE_BOOL.
I also tried the FeatureConfig to skip it, but to no avail.
pid_compare = sweetviz.compare([pdf,"234_7551009"],[pdf_2,"215_220941058"])
Maintainer of the lib here; this question was also asked on GitHub, but it will be useful to detail the answer here.
After looking at the data you provided in the link above, it looks like the first dataframe (pdf) only contains 0 & 1, so it is classified as boolean, which means it cannot be compared against the second one, which is categorical (that one has 0, 1, 2, 3, as you probably know!).
The system will be able to handle it if you use FeatureConfig to force the first dataframe to be considered CATEGORICAL.
I just tried the following and it seems to work, let me know if it helps!
feature_config = sweetviz.FeatureConfig(force_cat = ["Web Desktop Interaction"])
report = sweetviz.compare(pdf, pdf_2, None, feature_config)

Sort dataframe by absolute value without changing value or adding column

I have a dataframe that's the result of importing a csv and then performing a few operations and adding a column that's the difference between two other columns (column 10 - column 9 let's say). I am trying to sort the dataframe by the absolute value of that difference column, without changing its value or adding another column.
I have seen this syntax over and over all over the internet, with indications that it was a success (accepted answers, comments saying "thanks, that worked", etc.). However, I get the error you see below:
df.sort_values(by='Difference', ascending=False, inplace=True, key=abs)
Error:
TypeError: sort_values() got an unexpected keyword argument 'key'
I'm not sure why the syntax that I see working for other people is not working for me. I have a lot more going on with the code and other dataframes, so I don't think it's a pandas import problem.
I have moved on and just made a new column that is the absolute value of the difference column and sorted by that, and exclude that column from my export to worksheet, but I really would like to know how to get it to work the other way. Any help is appreciated.
I'm using Python 3
df.loc[(df.c - df.b).abs().sort_values(ascending = False).index]
This sorts by the absolute difference between "c" and "b" without creating a new column.
I hope this is what you were looking for.
key is an optional argument that was only added to sort_values in pandas 1.1.0; on older versions it raises exactly this TypeError. It accepts a function that is applied to the Series before sorting. Check your pandas version.
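A sketch with toy data showing both routes (the column name 'Difference' is taken from the question): `key=` needs pandas 1.1.0 or newer, while sorting by the index of the absolute values works on any version, without adding a column.

```python
import pandas as pd

df = pd.DataFrame({"Difference": [3, -10, 5, -1]})

# pandas >= 1.1.0: key receives the whole Series
new_way = df.sort_values(by="Difference", ascending=False, key=abs)

# any pandas version: sort the absolute values, then reuse their index order
old_way = df.loc[df["Difference"].abs().sort_values(ascending=False).index]

print(old_way["Difference"].tolist())  # [-10, 5, 3, -1]
```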

Why does pandas need to reshape my boolean index, and how can I fix it to avoid the warning?

Background
I've got two DataFrames of timestamped ids (the index is the id). I want to get all of the ids where the timestamps differ by more than, say, 5 minutes.
Code
time_delta = abs(df2.time - df1.time).dt.total_seconds()
ids_out_of_range = df1[time_delta > 300].index
This gives me the ids I want, so it is working code.
Problem
Like many, I face this warning:
file.py:33: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
ids_out_of_range = df1[time_delta > 300].index
Most explanations center on the "length" of the index differing from the "length" of the dataframe. But:
+(Pdb) time_delta.shape
(176,)
+(Pdb) df1.shape
(176, 1)
+(Pdb) sorted(time_delta.index.values.tolist()) == sorted(df1.index.values.tolist())
True
The shapes are the same, except that one is a Series and the other is a DataFrame. The indices (appear) to be the same; perhaps the ordering is the issue? They did not compare equal without sorted.
(I've tried wrapping time_delta in a DataFrame, to no avail.)
Long-term, I would like this warning to go away (and not with 2>/dev/null, thank you). It's visual clutter in the output of my script, and, well, it is a warning—so theoretically I should pay attention to it.
Question
What am I doing "wrong" that I get this warning, since the sizes seem to be right?
How do I fix (1) so I can avoid this warning?
The warning is saying that your time_delta index is different from the df1 index.
But when I tried to reproduce the warning, it didn't show up. I'm using pandas 0.25.1, so if you are using a different version, that might explain the warning.
Please refer to this page for suppressing warnings.
The following fixed my issue:
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
time_delta.sort_index(inplace=True)
This allowed the indices to align perfectly, so they must not have been in the same order with respect to each other.
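A minimal sketch of the alignment behaviour behind the warning, with made-up data: boolean *Series* indexing aligns on labels, so the same labels in a different order are enough to trigger the reindex warning. Reindexing (or sorting) the mask to match the frame first keeps pandas quiet.

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "c"])
mask = pd.Series([True, False, True], index=["c", "b", "a"])  # reversed order

aligned = mask.reindex(df.index)  # now the mask's index matches df exactly
print(df[aligned]["v"].tolist())  # [1, 3]
```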

Filtering a dataset on values not in another dataset

I am looking to filter a dataset based off of whether a certain ID does not appear in a different dataframe.
While I'm not super attached to this approach if there's a better way that I'm not familiar with, my plan is to apply a Boolean function to my dataset, put the results in a new column, and then filter the entire dataset on that True/False result.
My main dataframe is df, and my other dataframe with the IDs in it is called ID:
def groups():
    if df['owner_id'] not in ID['owner_id']:
        return True
    return False
This ends up being accepted (no syntax problems), so I then go to apply it to my dataframe, which fails:
df['ID Groups?'] = df.apply(lambda row: groups(), axis=1)
Result:
TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')
It seems that somewhere the data I'm trying to use (the IDs are both letters and numbers, so strings) is incorrectly formatted.
I have two questions:
Is my proposed method the best way of going about this?
How can I fix the error that I'm seeing?
My apologies if it's something super obvious, I have very limited exposure to Python and coding as a whole, but I wasn't able to find anywhere where this type of question had already been addressed.
Expression to keep only those rows in df whose owner_id does not appear in ID (note the ~ negation):
df = df[~df['owner_id'].isin(ID['owner_id'])]
A lambda expression is going to be way slower than this.
isin is the Pandas way. not in is the Python collections way.
The reason you are getting this error is that df['owner_id'] not in ID['owner_id'] hashes the left-hand side to figure out if it is present in the right-hand side. df['owner_id'] is of type Series and is not hashable, as reported. Luckily, hashing is not needed here.
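A sketch with toy data (the column name owner_id is from the question; the values are invented): isin builds the boolean mask in one vectorised step, and ~ negates it to keep only the owner_ids that do NOT appear in ID.

```python
import pandas as pd

df = pd.DataFrame({"owner_id": ["a1", "b2", "c3"], "x": [1, 2, 3]})
ID = pd.DataFrame({"owner_id": ["b2"]})

mask = df["owner_id"].isin(ID["owner_id"])  # True where the id is in ID
absent = df[~mask]                          # rows whose id is NOT in ID

print(absent["owner_id"].tolist())  # ['a1', 'c3']
```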
