pandas df.apply() not working with html.unescape() - python

I'm trying to decode html characters within a pandas dataframe.
I don't know why but my apply function won't work.
# requirements
import html
import pandas as pd
# This code works fine.
df = df.apply(lambda x: x + "TESTSTRING")
print(df) # "TESTSTRING" is appended to all values.
# This code also works fine. html.unescape() is working well.
fn = lambda x: html.unescape(x)
s = "Something wrong with &lt;b&gt;E&amp;S&lt;/b&gt;"
print(fn(s)) # returns "Something wrong with <b>E&S</b>"
# However, the code below doesn't work. The HTML entities within the values don't get decoded.
df2 = df.apply(fn)
print(df2) # The html characters aren't decoded!
It's really frustrating that the apply function and html.unescape() work fine separately, but they don't work together, and I don't know why.
I've also tried axis=1.
I'd really appreciate your help. Thanks in advance.

The problem is that html.unescape() is not vectorized, i.e. it accepts only a single string, while DataFrame.apply passes it a whole column (a Series) at a time.
If your df is not really large, using applymap should still be sufficiently fast:
df2 = df.applymap(lambda x: html.unescape(x))
print(df2)
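A likely reason there was no error: html.unescape first checks `'&' not in s`, and on a Series the `in` operator tests the index, so each column is returned unchanged rather than raising. A minimal sketch (the sample strings are made up) showing that applying per element on a single column works:

```python
import html
import pandas as pd

# Hypothetical sample data containing HTML entities
df = pd.DataFrame({'text': ['Tom &amp; Jerry', '&lt;b&gt;E&amp;S&lt;/b&gt;']})

# Series.apply calls the function once per element, so unescaping
# a single column works directly:
df['text'] = df['text'].apply(html.unescape)
print(df['text'].tolist())  # ['Tom & Jerry', '<b>E&S</b>']
```

Note that applymap was renamed to DataFrame.map in pandas 2.1, so on recent versions df.map(html.unescape) is the element-wise spelling.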

Related

Creating a Dummy Variable Using Groupby and Max Functions With Pandas

I am trying to create a dummy variable that takes on the value of "1" if a country has the largest democracy value in a given year, using pandas. I have tried numerous iterations of code, but none of them accomplish what I am trying to do. I will provide some code and discuss the errors I am running into. For further transparency, I am trying to replicate an R document that uses tidyverse in Python. Here is what my R code looks like (which generates the variable just fine):
merged_data <- merged_data %>% group_by(year) %>% mutate(biggest_democ = if_else(v2x_regime == max(v2x_regime), 1, 0)) %>% ungroup()
As stated before, this works just fine in R, but I cannot replicate this in Python. Here are some of the lines of code that I am running into issues with:
merged_data = merged_data.assign(biggest_democ = np.where(merged_data['v2x_regime'].max(), 1, 0).groupby('year'))
This just comes up with the error:
"AttributeError: 'numpy.ndarray' object has no attribute 'groupby'"
I have tried other iterations as well but they result in the same error.
I would appreciate any and all help!
Here's one approach using groupby, transform, and a custom lambda function. I'm not sure if the example data I made matches your situation.
import pandas as pd
merged_data = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'v2x_regime': [10, 20, 30, 70, 40, 50],
    'year': [2010, 2010, 2010, 2020, 2020, 2020],
})
merged_data['biggest_democ'] = (
    merged_data
    .groupby('year')['v2x_regime']
    .transform(lambda v: v.eq(v.max()))
    .astype(int)
)
merged_data
Output
  country  v2x_regime  year  biggest_democ
0       A          10  2010              0
1       B          20  2010              0
2       C          30  2010              1
3       A          70  2020              1
4       B          40  2020              0
5       C          50  2020              0
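If you'd rather stay close to the R if_else(v2x_regime == max(v2x_regime), 1, 0) pattern, np.where combined with transform('max') gives the same result. A sketch reusing the same toy data (which may not match the real dataset):

```python
import numpy as np
import pandas as pd

merged_data = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'v2x_regime': [10, 20, 30, 70, 40, 50],
    'year': [2010, 2010, 2010, 2020, 2020, 2020],
})

# transform('max') broadcasts each year's maximum back to every row,
# so the comparison lines up element-wise, just like R's grouped mutate:
year_max = merged_data.groupby('year')['v2x_regime'].transform('max')
merged_data['biggest_democ'] = np.where(merged_data['v2x_regime'].eq(year_max), 1, 0)
print(merged_data['biggest_democ'].tolist())  # [0, 0, 1, 1, 0, 0]
```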

Using np.where inside of df.eval string in python

Long time listener, first time caller...
I need to be able to call functions (custom-made and otherwise, like numpy.where) as a string in a pandas DataFrame eval statement. See example:
import pandas as pd
import numpy as np
data = [[0,10,10],[1,20,20],[0,30,30]]
df = pd.DataFrame(data,columns=['X','A','B'])
df['C'] = np.where(df.X==0,df.A+df.B,0) #This works
df['C'] = df.eval('np.where(X==0,A+B,0)') #But this is how I need to implement it
df['C'] = df.eval('#np.where(X==0,A+B,0)') #pinata swings starting here
Please, Stack Overflow, help!
Update:
I found a way to solve by changing the string and the way I evaluated to this:
df['C'] = eval("np.where(df['X']==0,df['A']+df['B'],0)")
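For context: DataFrame.eval only parses a restricted expression grammar (arithmetic, comparisons, boolean logic), and the @ prefix only resolves local variables, not function calls, which is why np.where can't appear inside the eval string. The builtin-eval workaround can be made a bit more explicit by passing eval a namespace, so the string can only see the names you hand it. A sketch using the question's data:

```python
import numpy as np
import pandas as pd

data = [[0, 10, 10], [1, 20, 20], [0, 30, 30]]
df = pd.DataFrame(data, columns=['X', 'A', 'B'])

expr = "np.where(df['X'] == 0, df['A'] + df['B'], 0)"
# Passing an explicit globals dict limits what the string can reference:
df['C'] = eval(expr, {'np': np, 'df': df})
print(df['C'].tolist())  # [20, 0, 60]
```

Keep in mind that eval will execute arbitrary Python, so the expression string should never come from untrusted input.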

How can I split out this list containing a dictionary into separate columns?

The top table is what I have and the bottom is what I want. I'm doing this in a Pandas dataframe. Any help would be appreciated.
Thanks!
It would have been nice if you had provided a code snippet, since we can't easily test your case.
The following lines should do the job:
df['label'] = df['sentiment'].apply(lambda x: x[0]['label'])
df['score'] = df['sentiment'].apply(lambda x: x[0]['score'])
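Since no snippet was provided, here is a sketch with made-up data shaped the way the question describes (each cell a one-element list holding a dict with label and score keys):

```python
import pandas as pd

# Hypothetical data: one-element lists of dicts, as produced by e.g.
# a sentiment-analysis pipeline
df = pd.DataFrame({'sentiment': [
    [{'label': 'POSITIVE', 'score': 0.98}],
    [{'label': 'NEGATIVE', 'score': 0.12}],
]})

# Pull each key out of the first (only) dict in the list:
df['label'] = df['sentiment'].apply(lambda x: x[0]['label'])
df['score'] = df['sentiment'].apply(lambda x: x[0]['score'])
print(df[['label', 'score']].values.tolist())  # [['POSITIVE', 0.98], ['NEGATIVE', 0.12]]
```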

Pandas df.apply does not modify DataFrame

I am just starting pandas so please forgive if this is something stupid.
I am trying to apply a function to a column, but it's not working, and I don't see any errors either.
capitalizer = lambda x: x.upper()
for df in pd.read_csv(downloaded_file, chunksize=2, compression='gzip', low_memory=False):
    df['level1'].apply(capitalizer)
    print df
    exit(1)
This print shows the level1 column values the same as in the original csv, not uppercased. Am I missing something here?
Thanks
apply is not an in-place function - it does not modify values in the original object, so you need to assign the result back:
df['level1'] = df['level1'].apply(capitalizer)
Alternatively, you can use str.upper; it should be much faster:
df['level1'] = df['level1'].str.upper()
df['level1'] = list(map(lambda x: x.upper(), df['level1']))
You can also use the code above to make your column uppercase.
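For comparison, a minimal sketch of the str.upper route on made-up data; it is vectorized and, unlike x.upper(), leaves missing values as NaN instead of raising:

```python
import pandas as pd

df = pd.DataFrame({'level1': ['foo', 'bar']})

# Vectorized string method; reassignment is still required,
# since .str.upper() also returns a new Series:
df['level1'] = df['level1'].str.upper()
print(df['level1'].tolist())  # ['FOO', 'BAR']
```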

How to use xlwings or pandas to get all the non-null cell?

Recently I need to write a python script to find out how many times the specific string occurs in the excel sheet.
I noted that we can use xlwings.Range('A1').table.formula to achieve this task only if the cells are contiguous. If the cells are not contiguous, how can I accomplish that?
It's a little hacky, but why not.
By the way, I'm assuming you are using python 3.x.
First we'll create a new boolean dataframe that marks where cells match the value you are looking for.
import pandas as pd
import numpy as np
df = pd.read_excel('path_to_your_excel..')
b = df.applymap(lambda x: x == 'value_you_want_to_find' if isinstance(x, str) else False)
and then simply sum all occurrences.
print(np.count_nonzero(b.values))
As clarified in the comments, if you already have a dataframe, you can simply use count (Note: there must be a better way of doing it):
df = pd.DataFrame({'col_a': ['a'], 'col_b': ['ab'], 'col_c': ['c']})
string_to_search = '^a$' # should actually be a regex, in this example searching for 'a'
print(sum(df[col].str.count(string_to_search).sum() for col in df.columns))
>> 1
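If the goal is counting exact matches rather than regex matches, DataFrame.eq avoids regex entirely. A sketch on the same example frame:

```python
import pandas as pd

df = pd.DataFrame({'col_a': ['a'], 'col_b': ['ab'], 'col_c': ['c']})

# eq() produces a boolean frame of exact cell matches;
# summing twice (columns, then total) counts them all:
count = int(df.eq('a').sum().sum())
print(count)  # 1
```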
