Find length of pd.Series and strip the last two characters in pandas - Python

I am aware that I can find the length of a pd.Series by using pd.Series.str.len(), but is there a method to strip the last two characters? I know I could accomplish this in plain Python, but I was curious to see whether it could be done in pandas.
For example:
$1000.0000
1..0009
456.2233
Would end up as:
$1000.00
1..00
456.22
Any insight would be greatly appreciated.

Just do:
import pandas as pd
s = pd.Series(['$1000.0000', '1..0009', '456.2233'])
res = s.str[:-2]
print(res)
Output
0 $1000.00
1 1..00
2 456.22
dtype: object
Pandas exposes the built-in string methods through the str accessor; from the documentation:
These are accessed via the str attribute and generally have names
matching the equivalent (scalar) built-in string methods
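For instance, slicing via .str behaves like ordinary Python string slicing applied element-wise. A minimal sketch using the example data from the question:
import pandas as pd
s = pd.Series(['$1000.0000', '1..0009', '456.2233'])
# .str[:-2] mirrors Python's x[:-2], applied to each element
assert list(s.str[:-2]) == [x[:-2] for x in s]
# .str.len() likewise mirrors len(x) per element
print(s.str.len().tolist())  # [10, 7, 8]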

Try with:
df_new = df.astype(str).applymap(lambda x: x[:-2])
Or, for a single column:
df_new = df['col'].astype(str).str[:-2]
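A quick runnable sketch of the DataFrame-wide version (the frame below is made up; note that applymap was deprecated in favor of DataFrame.map in pandas 2.1+, so the method name may differ by version):
import pandas as pd
df = pd.DataFrame({'a': ['$1000.0000', '1..0009'], 'b': [456.2233, 7.89]})
# cast everything to string first so slicing works on numeric columns too
df_new = df.astype(str).applymap(lambda x: x[:-2])
print(df_new)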

Related

How to put pandas df column values into extract regular expression

I am wondering how to pass pandas DataFrame column values into a regular expression. I have tried the below but get "TypeError: 'Series' objects are mutable, thus they cannot be hashed".
I'm after the result below. (I could just use a different regex, but was wondering how this might be done dynamically.)
Thoughts appreciated :)
to_search search_string search_result
ABC-T3-123 ABC ABC-T3
ABC-T2-123 ABC ABC-T2
DEF-T1-123 DEF DEF-T1
import pandas as pd
# create a list for the data frame
data = [['ABC-T3-123', 'ABC'], ['ABC-T2-123', 'ABC'], ['DEF-T1-123', 'DEF']]
# create the pandas DataFrame
df = pd.DataFrame(data, columns=['to_search', 'search_string'])
df['search_results'] = df['to_search'].str.extract("(" + df['search_string'] + "-T[0-9])")
I know that you want an efficient solution, but these pandas string methods typically do not accept Series as arguments. Here is an apply-based solution, which, besides simplifying the regular expression, is about the only viable approach here:
import re
searched = df.apply(lambda row: re.search("(" + row['search_string'] + "-T[0-9])", row['to_search']).group(1), axis=1)
Output:
>>> searched
0 ABC-T3
1 ABC-T2
2 DEF-T1
dtype: object
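If apply feels heavyweight, a plain list comprehension over the two columns works just as well. A minimal sketch (re.escape is my addition, guarding against regex metacharacters in search_string):
import re
import pandas as pd
data = [['ABC-T3-123', 'ABC'], ['ABC-T2-123', 'ABC'], ['DEF-T1-123', 'DEF']]
df = pd.DataFrame(data, columns=['to_search', 'search_string'])
# build a per-row pattern and search the corresponding to_search value
df['search_results'] = [
    re.search("(" + re.escape(s) + "-T[0-9])", t).group(1)
    for t, s in zip(df['to_search'], df['search_string'])
]
print(df['search_results'].tolist())  # ['ABC-T3', 'ABC-T2', 'DEF-T1']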

How do I convert DataFrame column of type "string" to "float" using .replace?

In my DataFrame, the "Value_String" column consists of strings that are either:
number-like strings starting with dollar sign, and thousands are separated by a comma, [e.g. $1,000]
"None"
Therefore, I tried to create a new column and convert the string to float with the following lambda function:
to_replace = '$,'
df['Value_Float'] = df[df['Value_String'].apply(lambda x: 0 if x == 'None'
else float(x.replace(y, '')) for y in to_replace)]
This actually generates a "TypeError: 'generator' object is not callable".
How can I solve this?
The numpy where function is very helpful for conditionally updating values. In this case, where the value is not 'None', we use replace. Since we want str.replace to treat the pattern as a regex (pass regex=True explicitly; the default changed to False in pandas 2.0), the pattern needs to match a literal dollar sign OR a comma:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Value_String':["$1,000","None"]})
df['Value_String'] = np.where(df['Value_String'] != 'None', df['Value_String'].str.replace(r'\$|,', '', regex=True), df['Value_String'])
print(df)
Output
Value_String
0 1000
1 None
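The question also asks for a float column; a minimal sketch to finish the job (Value_Float is the target column named in the question), using pd.to_numeric so the 'None' strings become NaN:
import pandas as pd
df = pd.DataFrame({'Value_String': ["$1,000", "None"]})
cleaned = df['Value_String'].str.replace(r'\$|,', '', regex=True)
# 'None' is not numeric, so errors='coerce' maps it to NaN
df['Value_Float'] = pd.to_numeric(cleaned, errors='coerce')
print(df)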

Replacing a string in a DataFrame in Python

I have a (7, 11000) DataFrame. Some of these 7 columns contain strings.
In column 2, row 1000, there is the string 'London'. I want to change it to 'Paris'.
How can I do this? I searched all over the web but I couldn't find a way. I used these commands, but none of them works:
df['column2'].replace('London','Paris')
df['column2'].str.replace('London','Paris')
re.sub('London','Paris',df['column2'])
I usually receive this error:
TypeError: expected string or bytes-like object
If you want to replace a single row (you mention row 1000), you can do it with .loc. If you want to replace all occurrences of 'London', you could do this:
import pandas as pd
df = pd.DataFrame({'country': ['New York', 'London'],})
df.country = df.country.str.replace('London', 'Paris')
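For the single-row case mentioned first, a minimal sketch (assuming the row label is 1000 and the column is named 'column2'):
# replace a single cell by row label and column name
df.loc[1000, 'column2'] = 'Paris'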
Alternatively, you could write your own replacement function, and then use .apply:
def replace_country(string):
    if string == 'London':
        return 'Paris'
    return string
df.country = df.country.apply(replace_country)
The second method is a bit overkill, but is a good example that generalizes better for more complex tasks.
Before replacing, you can check for patterns with re. (Note: re_map was not defined in the original answer; it is shown here as a hypothetical mapping of regex patterns to replacements.)
import re
re_map = {'London': 'Paris'}  # hypothetical pattern -> replacement mapping
for pattern, repl in re_map.items():
    df['column2'] = [re.sub(pattern, repl, x) for x in df['column2']]
These are all great answers, but many are not vectorized: they operate on every item in the series individually rather than on the entire series at once.
A very reliable filter + replace strategy is to create a mask or subset True/False series and then use loc with that series to replace:
mask = df.country == 'London'
df.loc[mask, 'country'] = 'Paris'
# On 10m records:
# this method < 1 second
# @Charles method 1 < 10 seconds
# @Charles method 2 < 3.5 seconds
# @jose method: didn't bother, because it would be 30 seconds or more
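For reference, a minimal setup to reproduce that comparison (the sizes are illustrative assumptions, not the original benchmark code):
import numpy as np
import pandas as pd
# 10m rows of two city names
df = pd.DataFrame({'country': np.random.choice(['London', 'New York'], size=10_000_000)})
# boolean mask + loc assignment, done once over the whole column
mask = df.country == 'London'
df.loc[mask, 'country'] = 'Paris'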

Cleaning data frames with rogue elements using split()

Given the following data in an Excel sheet (taken in as a DataFrame):
Name Number Date
AA '9988779911' '01-JAN-18'
'BB' '8779912044' '01-FEB-18'
I have used the following code to clean the DataFrame and remove the unnecessary apostrophes:
for name in list(df):
    df[name] = df[name].str.split("'").str[1]
And I want the following output :
Name Number Date
AA 9988779911 01-JAN-18
BB 8779912044 01-FEB-18
I am getting the following error :
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Thanks in advance for your help. :)
Try this:
for name in list(df):
    df[name] = df[name].str.replace("'", "")
This replaces ' with an empty string.
A simpler approach:
df.applymap(lambda x: x.replace("'", "") if isinstance(x, str) else x)
(The isinstance guard skips non-string cells, which would otherwise raise an AttributeError.)
The strip function is probably the shortest way here; the other answers are elegant too:
df[name] = df[name].str.strip("'")
Moshevi has said the same in one of the comments.

Pandas ".convert_objects(convert_numeric=True)" deprecated [duplicate]

This question already has answers here:
Convert pandas.Series from dtype object to float, and errors to nans
(3 answers)
Closed 3 years ago.
I have this line in my code which converts my data to numeric...
data["S1Q2I"] = data["S1Q2I"].convert_objects(convert_numeric=True)
The thing is that the new pandas release (0.17.0) says that this function is deprecated.
This is the error:
FutureWarning: convert_objects is deprecated.
Use the data-type specific converters pd.to_datetime,
pd.to_timedelta and pd.to_numeric.
data["S3BD5Q2A"] = data["S3BD5Q2A"].convert_objects(convert_numeric=True)
So, I went to the new documentation and I couldn't find any examples of how to use the new function to convert my data...
It only says this:
"DataFrame.convert_objects has been deprecated in favor of type-specific functions pd.to_datetime, pd.to_timestamp and pd.to_numeric (new in 0.17.0) (GH11133)."
Any help would be nice!
As explained by @EvanWright in the comments,
data['S1Q2I'] = pd.to_numeric(data['S1Q2I'])
is now the preferred way of converting types. A detailed explanation of the change can be found in the GitHub PR GH11133.
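A quick sketch of the errors parameter, which controls what happens to unparseable values (the series here is made up for illustration):
import pandas as pd
s = pd.Series(['1', '2', 'three'])
# errors='coerce' turns invalid values into NaN instead of raising
print(pd.to_numeric(s, errors='coerce'))
# 0    1.0
# 1    2.0
# 2    NaN
# dtype: float64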
You can effect a replacement using apply as done here. An example would be:
>>> import pandas as pd
>>> a = pd.DataFrame([{"letter":"a", "number":"1"},{"letter":"b", "number":"2"}])
>>> a.dtypes
letter object
number object
dtype: object
>>> b = a.apply(pd.to_numeric, errors="ignore")
>>> b.dtypes
letter object
number int64
dtype: object
>>>
But it sucks in two ways:
You have to use apply rather than a non-native dataframe method
You have to copy to another dataframe; it can't be done in place. So much for use with "big data."
I'm not really loving the direction pandas is going. I haven't used R data.table much, but so far it seems superior.
I think a data table with native, in-place type conversion is pretty basic for a competitive data analysis framework.
It depends on which version of pandas you have. If you have pandas 0.18.0, this will work:
df['col name'] = df['col name'].apply(pd.to_numeric, errors='coerce')
For other versions:
df['col name'] = df['col name'].astype(float)
To convert all columns to numeric at once, this may work:
data = data.apply(pd.to_numeric, axis=0)
You can get it to apply correctly to a particular variable name in a dataframe without having to copy into a different dataframe like this:
>>> import pandas as pd
>>> a = pd.DataFrame([{"letter":"a", "number":"1"},{"letter":"b", "number":"2"}])
>>> a.dtypes
letter object
number object
dtype: object
>>> a['number'] = a['number'].apply(pd.to_numeric, errors='coerce')
>>> a.dtypes
letter object
number int64
dtype: object
An example based on the original question above would be something like:
data['S1Q2I'] = data['S1Q2I'].apply(pd.to_numeric, errors='coerce')
This works the same way as your original:
data['S1Q2I'] = data['S1Q2I'].convert_objects(convert_numeric=True)
in my hands, anyway....
This doesn't address the point abalter made about inferring datatypes, which is a little above my head, I'm afraid!
