pandas cleaning 1+1 values in a column - python

I have a column that has the following data
column
------
1+1
2+3
4+5
How do I get pandas to sum these values so that the output is 2, 5, 9 instead of the above?
Many thanks

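Your column obviously contains strings, so you must evaluate them somehow. Use the pd.eval function, e.g.
frame['column'].apply(pd.eval)
If performance matters, consider an alternative method such as ast.literal_eval (thanks to user Serge Ballesta for mentioning it). Note that on Python 3.7 and later, ast.literal_eval rejects arithmetic such as '1+1', so there you would split each string on '+' and sum the integers instead.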
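For the simple "a+b" strings in the question, one option that avoids evaluating arbitrary expressions is to split each string and add the parts. This is only a sketch: the helper name sum_terms and the new column name total are made up for illustration, and it assumes every cell holds integers joined by '+'.

```python
import pandas as pd

# Sample data copied from the question.
frame = pd.DataFrame({"column": ["1+1", "2+3", "4+5"]})


def sum_terms(expr: str) -> int:
    """Add up the integers in a string such as '2+3' (hypothetical helper)."""
    return sum(int(term) for term in expr.split("+"))


# Store the result in a new column so the original strings stay available.
frame["total"] = frame["column"].apply(sum_terms)
print(frame["total"].tolist())  # [2, 5, 9]
```

Unlike an eval-based approach, this raises a ValueError on anything that is not an integer, which can be useful when the column comes from untrusted input.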

Related

.strip() with in-place solution not working

I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for
the strip function, as seen in the
documentation
for str.strip.
To add to that: I've found the str functions for pandas Series
usually used when selecting specific rows. Like
df[df['Name'].str.contains('69')]. I'd say this is a possible reason
that it doesn't have an inplace parameter -- it's not meant to be
completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative
indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and/or
we'll consistently get "last 5 characters" instead!
So, I have a list of DataFrames called 'dataframes'. On the first DataFrame (which is dataframes[0]), I have a column named 'cnj' with string values, some of them with a blank space at the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given in the other topic worked, what am I doing wrong to get this error? It seemed to me it shouldn't work because it's a Series, but it should give the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:]), right?
Use
dataframes[0]['cnj']=dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0=dataframes[0]
df0['cnj']=df0['cnj'].str.strip()
The code in the solution you posted uses .str. :
data['Name'] = data['Name'].str.strip().str[-5:]
The Pandas Series object has no string or date manipulation methods of its own. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new Series, so .str is needed again before [-5:] to retrieve the last 5 characters of each value. That result is again a new Series. The expression is equivalent to:
temp_series=data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
You could just apply a transformation function on the column values like this.
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])
As I understand your question, you want the strings without the trailing spaces. For that, use str.rstrip(), which pandas exposes on a Series through the .str accessor (it is not available on a whole DataFrame).
Note: strip() is a method of Python strings, not of a Series, so the error you are getting is expected.
Refer to the pandas documentation for Series.str.rstrip and try it on your column.
For Series.str.strip() you can refer to its documentation as well; it works for me.
In your case, assuming the dataframe column to be s, you can use the below code:
df[s].str.strip()

Pandas Groupby Count Partial Strings

I want to count how many rows within a column contain a partial string, based on an imported DataFrame. In the sample data below, I want to group by Trans_type and then get a count of how many rows contain a value.
So I would expect to see:
First, is this possible generically without passing a list to get each type's expected brand? If not, how could I pass, say, Car a list such as .str.contains('Audi|BMW')?
Thanks for any help!
Try this one:
df.groupby([df["Trans_type"], df["Brand"].str.extract("([a-zA-Z]+)", expand=False)]).count()
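The question's sample data did not survive, so the following is only a sketch on invented rows. It assumes the brand text lives in a column called Brand, as in the answer above, and counts per Trans_type how many rows match a regular-expression pattern.

```python
import pandas as pd

# Hypothetical data; the original sample table is not available.
df = pd.DataFrame(
    {
        "Trans_type": ["Car", "Car", "Car", "Bike", "Bike"],
        "Brand": ["Audi A4", "BMW 320", "Ford Focus", "BMW R1200", "Honda CB500"],
    }
)

# str.contains treats the pattern as a regex by default, so "|" means "or".
pattern = "Audi|BMW"
matches = df["Brand"].str.contains(pattern)

# Summing booleans per group counts the matching rows in each Trans_type.
counts = matches.groupby(df["Trans_type"]).sum()
print(counts.to_dict())  # {'Bike': 1, 'Car': 2}
```

To use a different brand list per type, one could repeat this with a separate pattern for each Trans_type value.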

What is the best way to modify (e.g., perform math functions) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds another column, using the index number as a string-type name for the new column. Is there something akin to the Pandas method of, say, df.iloc[:, -1] = df.iloc[:, -1]*-1 that I can use with a Dask DataFrame?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function across a dask dataframe that is stored in many pieces. However, methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
In the future, we recommend asking separate questions separately on Stack Overflow.

Python - Use column name with int and string

I have imported an NBA statistical dataset. But some of my column names have 2 data types, as in "3PP" or "2FG". Therefore, the following code won't work.
for team in nba.3PP
Because when it runs, it gives an "invalid syntax" error. Is there a special way I can use 3PP like .\3PP or something to get it to work? Thanks!
EDIT: Using a Pandas DataFrame
You don't say what you've imported into. If Pandas:
for team in nba['3PP']:
...
This uses the item-oriented indexing, rather than attribute-oriented indexing. In Python in general, they are not equivalent, but in Pandas they can often be used interchangeably.
Use the .get method:
nba.get("3PP")
Or:
nba['3PP']
Which one applies depends on whether the dataset is in Pandas or something else.
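A short sketch of both access styles on a Pandas DataFrame. The team names and percentages are invented; only the column name 3PP comes from the question.

```python
import pandas as pd

# Hypothetical data; only the "3PP" column name is from the question.
nba = pd.DataFrame({"team": ["A", "B"], "3PP": [0.36, 0.38]})

# nba.3PP is a syntax error because an attribute name cannot start with
# a digit. Bracket indexing accepts any column label.
three_point = [value for value in nba["3PP"]]

# DataFrame.get returns the column, or None if the label is missing.
same_column = nba.get("3PP")
missing = nba.get("no_such_column")
print(three_point, missing)
```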

Using .apply in python to apply a mapper

This should be very simple but I can't figure it out.
I have a 'mapper' DataFrame that looks something like this:
mapper={'old_values':[105,312,269],'new_values':[849,383,628]}
df=pd.DataFrame(mapper)
I then have another dataframe with a column that contains old values. I simply want to convert them all to their new values (e.g. all 105's should become 849's). I think I need to use df.apply but I can't find an example of how to do this.
Thanks in advance.
Rather than a slow apply, it's better to use the Series.map method here, which works like a Python dictionary lookup to map values from one Series to another.
df['old_values'].map(df.set_index('old_values')['new_values'])
Out[12]:
0 849
1 383
2 628
Name: old_values, dtype: int64
The only modification you need to make here is:
new_df['old_values'].map(old_df.set_index('old_values')['new_values'])
But do note that this introduces NaN for keys not found in the original DF (any unseen value encountered in the new DF is coerced to NaN).
If this is the behavior you'd expect then map is an ideal method.
Although, if your intention is to simply replace the values and leave the missing keys as they were before, you can opt for Series.replace method.
new_df['old_values'].replace(old_df.set_index('old_values')['new_values'])
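The difference between map and replace shows up when the second DataFrame contains a value that the mapper does not know. In this sketch the mapper values come from the question, while new_df and the value 999 are invented. The lookup is converted to a plain dict before calling replace, which is a choice made here for explicitness rather than something the answer requires.

```python
import pandas as pd

# Mapper from the question.
old_df = pd.DataFrame(
    {"old_values": [105, 312, 269], "new_values": [849, 383, 628]}
)
# Hypothetical second DataFrame; 999 has no entry in the mapper.
new_df = pd.DataFrame({"old_values": [105, 269, 999]})

# Series indexed by old value, holding the new value.
lookup = old_df.set_index("old_values")["new_values"]

# map: unknown keys become NaN (so the result is float).
mapped = new_df["old_values"].map(lookup)

# replace: unknown keys are left unchanged.
replaced = new_df["old_values"].replace(lookup.to_dict())

print(mapped.tolist())
print(replaced.tolist())
```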
