I am trying to create a dummy variable in pandas that takes the value 1 when a row has the largest democracy value in a given year. I have tried numerous iterations of code, but none of them accomplish what I am trying to do. I will provide some code below and discuss the errors I am running into. For transparency: I am replicating an R (tidyverse) script in Python. Here is my R code, which generates the variable just fine:
merged_data <- merged_data %>% group_by(year) %>% mutate(biggest_democ = if_else(v2x_regime == max(v2x_regime), 1, 0)) %>% ungroup()
As stated before, this works just fine in R, but I cannot replicate this in Python. Here are some of the lines of code that I am running into issues with:
merged_data = merged_data.assign(biggest_democ = np.where(merged_data['v2x_regime'].max(), 1, 0).groupby('year'))
This just comes up with the error:
"AttributeError: 'numpy.ndarray' object has no attribute 'groupby'"
I have tried other iterations as well but they result in the same error.
I would appreciate any and all help!
Here's one approach using groupby, transform, and a lambda function. I'm not sure if the example data I made matches your situation:
import pandas as pd
merged_data = pd.DataFrame({
'country':['A','B','C','A','B','C'],
'v2x_regime':[10,20,30,70,40,50],
'year':[2010,2010,2010,2020,2020,2020],
})
merged_data['biggest_democ'] = (
merged_data
.groupby('year')['v2x_regime']
.transform(
lambda v: v.eq(v.max())
)
.astype(int)
)
merged_data
Output:
  country  v2x_regime  year  biggest_democ
0       A          10  2010              0
1       B          20  2010              0
2       C          30  2010              1
3       A          70  2020              1
4       B          40  2020              0
5       C          50  2020              0
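An equivalent and typically faster variant skips the lambda entirely: pass the built-in aggregation name 'max' to transform and compare the broadcast result against the original column. A sketch on the same toy data as above:

```python
import pandas as pd

merged_data = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'v2x_regime': [10, 20, 30, 70, 40, 50],
    'year': [2010, 2010, 2010, 2020, 2020, 2020],
})

# transform('max') broadcasts each year's maximum back onto its rows.
year_max = merged_data.groupby('year')['v2x_regime'].transform('max')
merged_data['biggest_democ'] = merged_data['v2x_regime'].eq(year_max).astype(int)
print(merged_data)
```

This mirrors the R `mutate` line closely: the grouped max is aligned row-by-row, just like `max(v2x_regime)` inside a grouped `mutate`.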
I have been trying to find a way to split text data (the separator is a space) in a single column into multiple columns. I can do it in pandas using the following code, but I would like to do the same with Vaex.
I was looking at the Vaex API documentation, but I can't see an rsplit-equivalent method:
https://vaex.readthedocs.io/en/latest/api.html
df_data = df_data.iloc[:,0].apply(lambda x: pd.Series(x.rsplit(" ")))
I have also referred to the page below, where a similar question was asked, and tried to run the same code, but in my environment I get this error: Error evaluating: ValueError('No memory tracker found with name default').
vaex extract one column of str.split()
df = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
df_vaex = vaex.from_pandas(df)
df_vaex["ticker"].str.split(" ").apply(lambda x: x[-1])
Are you using the latest version of vaex?
I just tried out your code example and it works fine.
After I restarted my laptop, the following split method worked and I got the result I expected.
df_data = df_data[:,0].str.split(' ').apply(lambda x: x[3])
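For comparison, the pure-pandas version of this split can be written without apply at all by passing expand=True to str.split, which spreads the pieces into separate columns. A minimal sketch on the ticker example from above:

```python
import pandas as pd

df = pd.DataFrame({'ticker': ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})

# expand=True returns a DataFrame with one column per split piece.
parts = df['ticker'].str.split(' ', expand=True)
print(parts)     # columns 0, 1, 2
print(parts[2])  # the last piece of each ticker
```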
In my data frame I have this data
df_first_year = df['FIRST_YEAR']
df_last_year = df['LAST_YEAR']
df_span = df['span']
I want to use the span column as the bins in a histogram, but when I run this part of the code (below), it shows the error ValueError: bins must increase monotonically, when an array:
plt.hist(df_first_year, bins=df_span, edgecolor='black')
plt.legend()
That's why I tried to sort the dataframe by the span column, like this:
df = df.sort_values(by=["span"], inplace=True)
After running this code, when I try to look at my dataframe's data, it shows None, which I think means there is no data.
Is there another option, or what have I done wrong in my simple code?
This is the problem.
df = df.sort_values(by=["span"], inplace=True)
Reason: with inplace=True, sort_values modifies the dataframe in place and returns None, so assigning the result back to df overwrites your dataframe with None.
If you're using the inplace argument for sort_values, use it as either
df.sort_values(by=["span"], inplace=True)
OR
df = df.sort_values(by=["span"], inplace=False)
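To see the difference concretely (toy data, not your real frame):

```python
import pandas as pd

df = pd.DataFrame({'span': [3, 1, 2]})

# inplace=True mutates df and returns None ...
result = df.sort_values(by=['span'], inplace=True)
print(result)                 # None
print(df['span'].tolist())    # [1, 2, 3] -- df itself was sorted

# ... while the default (inplace=False) leaves df alone and returns a new frame.
df2 = pd.DataFrame({'span': [3, 1, 2]}).sort_values(by=['span'])
print(df2['span'].tolist())   # [1, 2, 3]
```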
I want to use Pandas + Uncertainties. I am getting a strange error, below a MWE:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df.loc[0,'b'] = ufloat(3,1) # This line fails.
I have noticed that if I try to add the ufloats "on the fly", as I usually do with a float or some other stuff, it fails. If I first create a Series then it works:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df['b'] = pandas.Series([ufloat(3,1)]) # Now it works.
print(df)
This makes it more cumbersome when calculating values on the fly within a loop as I have to create a temporary Series and after the loop add it as a column into my data frame.
Is this a problem of Pandas, a problem of Uncertainties, or am I doing something that is not supposed to be done?
The problem arises because when pandas tries to create a new column it checks the dtype of the new value so that it knows what dtype to assign to that column. For some reason, the dtype check on the ufloat value fails. I believe this is a bug that will have to be fixed in uncertainties.
A workaround in the interim is to manually create the new column with dtype set to object, for example in your case above:
from uncertainties import ufloat
import pandas
import numpy
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
# create a new column with the correct dtype
df.loc[:, 'b'] = numpy.zeros(len(df), dtype=object)
df.loc[0,'b'] = ufloat(3,1) # This line now works.
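The same workaround applies to any non-numeric Python object, not just ufloat. A self-contained sketch using a plain stand-in class (so it runs without the uncertainties package installed):

```python
import numpy
import pandas

class Measurement:
    """Hypothetical stand-in for ufloat: a value with an uncertainty."""
    def __init__(self, nominal, std):
        self.nominal = nominal
        self.std = std

df = pandas.DataFrame({'a': [Measurement(2, 1)]})

# Pre-create the column with dtype=object; cell-wise assignment then works.
df['b'] = numpy.zeros(len(df), dtype=object)
df.loc[0, 'b'] = Measurement(3, 1)
print(df.loc[0, 'b'].nominal)  # 3
```

Because the column's dtype is already object, pandas never needs to infer a dtype from the assigned value, which is the step that fails with ufloat.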
First of all, I want to note that my question is very similar to other questions asked before, but I tried their answers and nothing worked for me.
I'm trying to aggregate some info grouping by more than one variable. I can use pivot_table or groupby, both are fine for this, but I get the same error every time.
My code is:
import numpy as np
vars_agrup = ['num_vars', 'list_vars', 'Modelo']
metricas = ["MAE", "MAE_perc", "MSE", "R2"]
g = pd.pivot_table(df, index=vars_agrup, aggfunc=np.sum, values=metricas)
or
df.groupby(vars_agrup, as_index=False).agg(Count=('MAE','sum'))
Also, I tried using () instead of [] to avoid making it a list, but then pandas searches for a single column called "'num_vars', 'list_vars', 'Modelo'", which doesn't exist. I tried ([]) and [()], and index instead of columns. It's always the same: grouping by one variable is fine, but with multiple variables I get the error TypeError: unhashable type: 'list'.
For sure, all these variables are columns in df.
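One common cause of this exact error, offered only as a guess since the df isn't shown: every grouping column must hold hashable values, so if any of the three columns (for example list_vars, going by its name) contains Python list objects, pandas raises TypeError: unhashable type: 'list' even though the groupby call itself is written correctly. A minimal reproduction with made-up data:

```python
import pandas as pd

# Hypothetical data: 'list_vars' holds actual list objects.
df = pd.DataFrame({
    'num_vars': [1, 1, 2],
    'list_vars': [['a'], ['a'], ['b']],
    'Modelo': ['m1', 'm1', 'm2'],
    'MAE': [0.1, 0.2, 0.3],
})

try:
    df.groupby(['num_vars', 'list_vars', 'Modelo'])['MAE'].sum()
    err = None
except TypeError as exc:
    err = str(exc)
print(err)  # unhashable type: 'list'

# Converting the lists to strings (or tuples) makes the column hashable again.
df['list_vars'] = df['list_vars'].astype(str)
grouped = df.groupby(['num_vars', 'list_vars', 'Modelo'])['MAE'].sum()
print(grouped)
```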
Edit: My df looks like this:
I'm trying to decode html characters within a pandas dataframe.
I don't know why but my apply function won't work.
# requirements
import html
import pandas as pd
# This code works fine.
df = df.apply(lambda x: x + "TESTSTRING")
print(df) # "TESTSTRING" is appended to all values.
# This code also works fine. html.unescape() is working well.
fn = lambda x: html.unescape(x)
s = "Something wrong with &lt;b&gt;E&amp;S&lt;/b&gt;"
print(fn(s)) # returns "Something wrong with <b>E&S</b>"
# However, the code below doesn't work. The "&amp;" entities within the values don't get decoded.
df2 = df.apply(fn)
print(df2) # The html characters aren't decoded!
It's really frustrating that the apply function and html.unescape() work well separately, but I don't know why they don't work together.
I've also tried axis=1.
I'd really appreciate your help. Thanks in advance.
The problem is that html.unescape() is unvectorized, i.e. it accepts only a single string, while DataFrame.apply passes each whole column (a Series) to the function.
If your df is not very large, applymap, which applies the function element-wise instead, should still be sufficiently fast:
df2 = df.applymap(lambda x: html.unescape(x))
print(df2)
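An alternative that works across pandas versions is to decode one column at a time with Series.map inside apply, which makes explicit why plain apply failed: apply hands the function a whole Series, while map applies it per element. A sketch on hypothetical data (note: on pandas 2.1+, DataFrame.map is the non-deprecated name for applymap):

```python
import html
import pandas as pd

df = pd.DataFrame({'text': ['Something wrong with &lt;b&gt;E&amp;S&lt;/b&gt;']})

# apply gives the lambda each column (a Series); map then decodes each element.
df2 = df.apply(lambda col: col.map(html.unescape))
print(df2.loc[0, 'text'])  # Something wrong with <b>E&S</b>
```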