I have recently started learning pandas and I was trying to analyze the Stack Overflow developer survey. I am trying to learn the groupby function:
country_grp=df.groupby(['Country'])
country_grp.get_group('China')
ed=country_grp['EdLevel'].value_counts()
salary=country_grp['ConvertedComp'].value_counts()
response=country_grp['Country'].value_counts()
combine=pd.concat([ed,response,salary],axis='columns',sort=False)
combine
After this line it's giving me this warning:
RuntimeWarning: The values in the array are unorderable. Pass
`sort=False` to suppress this warning. uniq_tuples =
lib.fast_unique_multiple([self._values, other._values], sort=sort)
It gives me the data frame, but all the rows of the ['Country'] column are NaN. Can someone please guide me on how I can solve this?
I am not sure if the problem is the same as mine since there is no underlying data.
I got a similar error, which was due to the index data types of the dataframes passed to concat being different, even though they looked the same. After making all the index data types the same, the warning and the NaNs disappeared.
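For illustration, a minimal sketch of that kind of mismatch, with made-up numbers standing in for the survey data:

import pandas as pd

# Two Series whose indexes print identically but have different dtypes
# (strings vs. integers), so concat cannot align them and fills with NaN.
s1 = pd.Series([10, 20], index=pd.Index(["1", "2"], dtype=object))
s2 = pd.Series([30, 40], index=pd.Index([1, 2], dtype="int64"))
print(pd.concat([s1, s2], axis="columns", sort=False))  # NaN everywhere

# Converting both indexes to the same dtype makes the rows line up again.
s1.index = s1.index.astype("int64")
print(pd.concat([s1, s2], axis="columns", sort=False))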
This is a very strange problem; I tried a lot of things but I can't find a way to solve it.
I have a DataFrame with data collected from an API: no problem with that. Then I'm using a library, pandas-ta (https://github.com/twopirllc/pandas-ta), which adds new columns to the DataFrame.
Of course, sometimes there are NaN values in the new columns (there are a lot of reasons, but the main one is that some indicators are length-based).
Basic problem, so basic solution: just type df.fillna(0, inplace=True) and it works!
But when I check df.values (or the conversion to_numpy()) there are still NaN values.
Properties of the problem:
- NaN not found with np.where() in the array, both with np.nan and pandas-ta.npNaN
- df.isna().any().any() returns False
- the NaNs are float values, not strings
- the array has dtype object
- I tried various methods to replace the NaNs, not only fillna, but since they are not recognized, nothing works
- I also thought it was because of large numbers, but using to_numpy(dtype='float64') gives the same problem
So these values are there only after conversion to a numpy array, and they are not recognized.
These values are also there when I apply PCA to my dataset, where I get an error message because of the NaNs.
Thanks a lot for your time, and sorry for the mistakes; I'm not a native speaker.
Have a good day y'all.
Edit:
Here is a screenshot of the operations I'm doing and the printed result; you can see one NaN value.
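For what it's worth, one thing that can be tried here is forcing every column to a real numeric dtype before filling. A minimal sketch, with hypothetical column names standing in for the pandas-ta output:

import numpy as np
import pandas as pd

# Hypothetical frame standing in for the API data plus indicator columns.
df = pd.DataFrame({"close": [1.0, 2.0, 3.0],
                   "rsi": [None, 55.0, 60.0]}, dtype=object)

# Coerce everything to numeric: anything unparseable (including stray
# NaN-like objects hiding in object-dtype columns) becomes NaN, which
# fillna can then replace for good.
clean = df.apply(pd.to_numeric, errors="coerce").fillna(0)
print(np.isnan(clean.to_numpy(dtype="float64")).any())  # should be False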
If I analyze these two datasets individually, I don't get any error, and I also get the visualization of all the integer columns.
But when I try to compare these dataframes, I get the error below.
Cannot convert series 'Web Visit' in COMPARED from its TYPE_CATEGORICAL
to the desired type TYPE_BOOL.
I also tried the FeatureConfig to skip it, but to no avail.
pid_compare = sweetviz.compare([pdf,"234_7551009"],[pdf_2,"215_220941058"])
Maintainer of the lib here; this question was also asked on GitHub, but it will be useful to detail the answer here.
After looking at your data provided in the link above, it looks like the first dataframe (pdf) only contains 0 & 1, so it is classified as boolean and cannot be compared against the second one, which is categorical (that one has 0, 1, 2, 3, as you probably know!).
The system will be able to handle it if you use FeatureConfig to force the first dataframe to be considered CATEGORICAL.
I just tried the following and it seems to work, let me know if it helps!
feature_config = sweetviz.FeatureConfig(force_cat = ["Web Desktop Interaction"])
report = sweetviz.compare(pdf, pdf_2, None, feature_config)
I'm new to programming. I'm trying to use scipy minimize; I had several issues and got through most of them.
Right now this is the code, but I don't understand why I'm getting this error:
par_opt = so.minimize(fun=fun_obj, x0=par_ini, method='Nelder-Mead', args=[series_pt_cal, dt, series_caudal_cal])
Not enough info is given by the OP, but basically somewhere in the code it's specified to operate by data frame column (axis=1) on an object that is a Pandas Series. If the code typically works but occasionally gives errors, check for degenerate cases where a data frame may have only 1 row. Pandas has a nasty habit of guessing what you want -- it may decide to reduce a 1-row data frame to a Series (e.g., the apply() function; you can disable that by passing reduce=False there).
Add a line of code to check that the object is a DataFrame with isinstance(df, pd.DataFrame), or else convert the offending pandas Series to a data frame, something like s.to_frame().T for the problems I had to deal with.
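A minimal sketch of that guard, assuming the object may arrive either as a DataFrame or as a Series that was silently reduced from a 1-row frame (the helper name is made up):

import pandas as pd

def ensure_dataframe(obj):
    # Return obj unchanged if it is already a DataFrame.
    if isinstance(obj, pd.DataFrame):
        return obj
    # A 1-row frame reduced to a Series: to_frame() gives a 1-column
    # frame, and .T restores the original row orientation.
    if isinstance(obj, pd.Series):
        return obj.to_frame().T
    raise TypeError("expected a DataFrame or Series, got %r" % type(obj))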
Use pd.DataFrame(df) before your so.minimize function.
Pandas wants to run on a DataFrame for that function.
L is a list of dataframes with a multiindex on the rows.
pd.concat(L,axis=1)
I get the following error (from the Categorical constructor in categorical.py):
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.
It clearly has something to do with the values in my dataframe, as I can get it to work if I restrict the data in some way.
E.g. all of these work
a=pd.concat(L[0:6],axis=1)
b=pd.concat(L[6:11],axis=1)
c=pd.concat(L[3:9],axis=1)
but
d=pd.concat(L[0:11],axis=1)
fails.
pd.concat([x.iloc[0:1000,:] for x in L[0:11]],axis=1)
also works. I've gone through the edge cases at which it breaks, and for the life of me, I don't see anything that could be offensive in those rows. Does anyone have some ideas on what I should be looking for?
I just had this issue too when I did a df.groupby(...).apply(...) with a custom apply function. The error seemed to appear when the results were being merged back together after the groupby-apply (so I must have returned something from my custom apply function that it didn't like).
After inspecting the extensive stacktrace provided by pytest, I found that a mysterious third value had appeared in my index values:
values = Index([(2018-09-01 00:00:00, 'SE0011527613'),
(2018-09-25 00:00:00, 'SE0011527613'),
1535760000000000000], dtype='object')
I have absolutely no idea how it appeared there, but I managed to work around it somehow by avoiding multi-indexed stuff in that particular part of the code (extensive use of reset_index and set_index).
Not sure if this will be of help to anyone, but there you have it. If someone could attempt a minimal reproducible example that would be helpful (I didn't manage to).
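If it helps, a rough sketch of that workaround applied to the original list L from the question: flatten the MultiIndex, merge on its columns instead of concatenating on the index, then restore it. The level names 'date' and 'isin' are guesses based on the values shown above:

import pandas as pd
from functools import reduce

# Turn the MultiIndex into ordinary columns, merge the frames on those
# columns, then rebuild the MultiIndex afterwards.
flat = [x.reset_index() for x in L]
merged = reduce(lambda a, b: a.merge(b, on=["date", "isin"], how="outer"), flat)
merged = merged.set_index(["date", "isin"])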
I came across the same error:
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.
However, there is not much material around it. Have a look at what the error log states a bit further up. In my case I had:
TypeError: unorderable types: range() < range()
During handling of the above exception, another exception occurred:
The clue was 'range() < range()', because I had a previous problem here with Pandas interpreting '(1,2)' or '(30,31)' not as strings but as 'range(1,3)' or 'range(30,32)' respectively. Very annoying, as the dtype is still object.
I had to change the column content to lists and/or drop the 'range(x,y)' column.
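A minimal sketch of that cleanup, with a hypothetical column name 'window' holding the range objects:

import pandas as pd

df = pd.DataFrame({"window": [range(1, 3), range(30, 32)], "value": [10, 20]})

# range objects cannot be ordered against each other, which is what
# trips up the index handling; turn them into plain lists (or drop the
# column entirely) before concatenating.
df["window"] = df["window"].apply(list)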
Hope this helps anybody else who comes across this problem. Cheers!
I have a huge data frame with about 1041507 rows.
I wanted to calculate a rolling_median for certain values, under certain categories in my data frame. For this I used a groupby followed by apply:
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
However, this gives me a MemoryError: skiplist_insert failed. I will attach the full traceback if needed, but I came across another similar post which says this is an issue in pandas for very large sizes (>~ 35000), as can be seen here: https://github.com/pydata/pandas/issues/11697
After this I tried a bit of manipulation to simply get the rolling median by iterating over each group separately:
for index, group in df.groupby(['Category','Subcategory']):
    print pd.rolling_median(group['value'], 7, min_periods=7)
Each group is about 20-25 rows long only. Yet this function fails and shows the same MemoryError after a few iterations.
I ran the code several times, and every time it showed the memory error for different items.
I created some dummy values for anyone to test, here:
import numpy as np
import pandas as pd

# 34000 groups of 25 rows each, to mimic the real data.
index = []
[index.append(x) for y in range(25) for x in np.arange(34000)]
sample = pd.DataFrame(np.arange(34000 * 25), index=index)

for index, group in sample.groupby(level=0):
    try:
        pd.rolling_median(group[0], 7, 7)
    except MemoryError:
        print index
        print pd.rolling_median(group[0], 7, 7)
If I run the rolling_median again after encountering the MemoryError (as you can see in the above code), it runs fine without another exception.
I am not sure how I can calculate my rolling_median if it keeps throwing the MemoryError.
Can anyone tell me a better way to calculate the rolling_median, or help me understand the issue here?
The groupby doesn't look right; you should change
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
to
df['rolling_median']=df.groupby(['Category','Subcategory'])['value'].apply(pd.rolling_median,7,min_periods=7)
Otherwise the groupby won't work: df['value'] is just a Series, so 'Category' and 'Subcategory' are not present to group by.
Also, the groupby result is going to be shorter than the dataframe, so creating df['rolling_median'] would cause a length mismatch.
Hope that helps.
The bug has been fixed in Pandas 0.18.0, and the methods rolling_mean() and rolling_median() are now deprecated.
This was the bug: https://github.com/pydata/pandas/issues/11697
Can be viewed here: http://pandas.pydata.org/pandas-docs/stable/computation.html
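For reference, a rough sketch of the same calculation on pandas >= 0.18, where the deprecated rolling_median() is replaced by the .rolling() accessor (using transform so the result keeps the original row alignment; column names taken from the question):

import pandas as pd

df['rolling_median'] = (
    df.groupby(['Category', 'Subcategory'])['value']
      .transform(lambda s: s.rolling(7, min_periods=7).median())
)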