removing character from string value in dataframe column - python

I hope you can help me with this question. I have a column with numeric values stored as strings. Since the data come from different countries, some values have different formats, such as "," and "$". I'm trying to convert the series to numbers, but I'm having trouble with the "," and "$" values.
data={"valores":[1,1,3,"4","5.00","1,000","$5,700"]}
df=pd.DataFrame(data)
df
  valores
0       1
1       1
2       3
3       4
4    5.00
5   1,000
6  $5,700
I've tried the following:
df["valores"].replace(",","")
but it does not change anything, since the "," is part of the string, not the whole cell value itself.
pd.to_numeric(df["valores"])
But I receive the "ValueError: Unable to parse string "1,000" at position 5" error.
valores=[i.replace(",","") for i in df["valores"].values]
But I receive the "AttributeError: 'int' object has no attribute 'replace' error.
So, at last, I tried with this:
valores=[i.replace(",","") for i in df["valores"].values if type(i)==str]
valores
['4', '5.00', '1000', '$5700']
But it skipped the first three values, since they are not strings.
I think that with a regex I would be able to manage it, but I simply don't understand how to work with it.
I hope you can help me, since I've been struggling with this for about 7 hours.

You should first convert each value to a string, so something like this:
valores=[str(i).replace(",","") for i in df["valores"].values]
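A minimal sketch building on that, assuming you also strip the "$" so that pd.to_numeric succeeds afterwards:
import pandas as pd

data = {"valores": [1, 1, 3, "4", "5.00", "1,000", "$5,700"]}
df = pd.DataFrame(data)

# Stringify every cell, drop "," and "$", then convert the result to numbers
valores = [str(i).replace(",", "").replace("$", "") for i in df["valores"].values]
df["valores"] = pd.to_numeric(valores)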

You can try this:
df['valores'] = df['valores'].replace(to_replace=r'[,$]', value='', regex=True).astype(float)
Note the raw string: the "," and "$" do not need escaping inside a character class, and "\," or "\$" in a non-raw string triggers an invalid-escape warning in newer Python.

.replace by default matches whole cell values. Since you want to replace a part of the string, you need .str.replace or replace(..., regex=True):
df['valores'] = df["valores"].replace(",","", regex=True)
Or:
df['valores'] = df["valores"].str.replace(",","")
(Note that this only removes the commas; you would still need to strip the "$" before converting to numeric.)

You need to cast the values in the valores column to string using .astype(str), then remove all $ and , using .str.replace('[,$]', '', regex=True) (regex=True is required in pandas >= 2.0, where str.replace matches literally by default), and then you may convert all data to numeric using pd.to_numeric:
>>> pd.to_numeric(df["valores"].astype(str).str.replace("[,$]", "", regex=True))
0       1.0
1       1.0
2       3.0
3       4.0
4       5.0
5    1000.0
6    5700.0
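To keep the cleaned numbers, you can assign the result back; a minimal sketch, assuming the same df as above:
import pandas as pd

data = {"valores": [1, 1, 3, "4", "5.00", "1,000", "$5,700"]}
df = pd.DataFrame(data)

# Strip "," and "$" from the string form of every cell, then convert to floats
df["valores"] = pd.to_numeric(
    df["valores"].astype(str).str.replace("[,$]", "", regex=True)
)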

Related

Filter on a pandas string column as numeric without creating a new column

This is quite an easy task; however, I am stuck. I have a dataframe with a column of type string, i.e. it contains characters:
Category
AB00
CD01
EF02
GH03
RF04
Now I want to treat these values as numeric, filter on them, and create a subset dataframe. However, I do not want to change the dataframe in any way. I tried:
df_subset=df[df['Category'].str[2:4]<=3]
of course this does not work, as the first part is a string and cannot be evaluated as numeric and compared to an integer.
I tried
df_subset=df[int(df['Category'].str[2:4])<=3]
but I am not sure about this; I think it is wrong, or not the way it should be done.
Add type conversion to your expression:
df[df['Category'].str[2:].astype(int) <= 3]
  Category
0     AB00
1     CD01
2     EF02
3     GH03
As you have leading zeros, you can directly use string comparison:
df_subset = df.loc[df['Category'].str[2:4] <= '03']
Output:
  Category
0     AB00
1     CD01
2     EF02
3     GH03
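For completeness, a minimal runnable sketch of both approaches, assuming the sample data above:
import pandas as pd

df = pd.DataFrame({"Category": ["AB00", "CD01", "EF02", "GH03", "RF04"]})

# Numeric comparison: cast the two-digit suffix to int
subset_numeric = df[df["Category"].str[2:].astype(int) <= 3]

# String comparison: valid here because the suffix is fixed-width and zero-padded
subset_string = df.loc[df["Category"].str[2:4] <= "03"]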

Trying to compare two values in a pandas dataframe for max value

I've got a pandas dataframe, and I'm trying to fill a new column in the dataframe, which iteratively takes the maximum of two values situated in another column of the dataframe. I'm trying to build a loop to do this and save computation time, as I realise I could probably do it with more lines of code.
for x in jac_input.index:
    jac_output['Max Load'][x] = jac_input[['load'][x], ['load'][x+1]].max()
However, I keep getting this error during the comparison
IndexError: list index out of range
Any ideas as to where I'm going wrong here? Any help would be appreciated!
Many things are wrong with your current code.
When you do ['abc'][x], x can only take the value 0, and this will return 'abc', as you are indexing a list. That is not at all what you expect it to do (I imagine you meant to slice the Series).
For your code to be valid, you should do something like:
jac_input = pd.DataFrame({'load': [1, 0, 3, 2, 5, 4]})
for x in jac_input.index:
    print(jac_input['load'].loc[x:x+1].max())
output:
1
3
3
5
5
4
Also, when assigning, if you use jac_output['Max Load'][x] = ... you will likely encounter a SettingWithCopyWarning. You should rather use loc: jac_output.loc[x, 'Max Load'] = ....
But you do not need all that, use vectorial code instead!
You can perform rolling on the reversed dataframe:
jac_output['Max Load'] = jac_input['load'][::-1].rolling(2, min_periods=1).max()[::-1]
Or using concat:
jac_output['Max Load'] = pd.concat([jac_input['load'], jac_input['load'].shift(-1)], axis=1).max(axis=1)
output (without assignment):
0    1.0
1    3.0
2    3.0
3    5.0
4    5.0
5    4.0
dtype: float64
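A minimal runnable sketch tying this together, assuming the sample jac_input above and an output frame of the same length:
import pandas as pd

jac_input = pd.DataFrame({"load": [1, 0, 3, 2, 5, 4]})
jac_output = pd.DataFrame(index=jac_input.index)

# Pairwise max of each value and its successor, vectorised via a shifted copy
jac_output["Max Load"] = pd.concat(
    [jac_input["load"], jac_input["load"].shift(-1)], axis=1
).max(axis=1)
print(jac_output)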

Python: replace NaN values of a field holding numeric values with blank, not single quotes which would later be treated as strings

I am uploading some data frames into the Snowflake cloud. I had to use the following to transform all field values into strings:
data = data.applymap(str)
The only reason I am doing that is that without it, I will get the following error:
TypeError: not all arguments converted during string formatting
The reason for that is that some fields contain numeric values, but not all rows have them; some rows have 'NA' instead. And for data integrity, we cannot replace them with 0s, as in our domain 0 might mean something, and in our work a blank is different from the value 0.
At the beginning, I tried to replace NaNs with single quotes '', but then all fields holding numbers were transformed into floats. So if a value is 123, it became 123.0.
How can I replace NA values in a numeric field with a completely blank value, and not '', so the field can still be considered of type INT?
For example, I don't want the empty cell to be treated as a string, as the other fields will be transformed by applymap() into floats if they are int:
Detect NaNs using np.isnan() and put only non-NaN numbers into str().
If you don't want float-typed integers, just change the mapping from str() to str(int()).
Data
Note that column B contains nan which is actually a float number, so its dtype is automatically float.
df = pd.DataFrame({"A": [1 ,2], "B":[3, np.nan]})
print(df)
   A    B
0  1  3.0
1  2  NaN
Code
import numpy as np
df.applymap(lambda el: "" if np.isnan(el) else str(el))
Out[12]:
   A    B
0  1  3.0
1  2
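A hedged sketch of the str(int()) variant mentioned above, assuming every non-NaN value is a whole number (note that in pandas >= 2.1, DataFrame.map is the preferred name for applymap):
# Render integers without the trailing ".0"; NaN becomes an empty string
df.applymap(lambda el: "" if np.isnan(el) else str(int(el)))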

How to remove percent before converting it to integer

I have a dataframe of 2 columns. I want to convert the COUNT column to int. It keeps giving me ValueError: Unable to parse string "0.58%" at position 0.
METRIC      COUNT
Scans      125487
No Reads     2541
Diverts     54710
No Code%    0.58%
No Read%    1.25%
df['COUNT'] = df['COUNT'].apply(pd.to_numeric)
How can I remove the % before conversion?
You can use str.strip:
pd.to_numeric(df.col1.str.strip('%'))
0    1
1    2
2    3
Name: col1, dtype: int64
Try this. I'm assuming that the 0.58% is read in as a string, meaning that the replace function will work to replace '%' with nothing, at which point it can be converted to a number:
import pandas as pd
df = pd.DataFrame({'col1':['1','2','3%']})
df.col1.str.replace('%','').astype(float)
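Applied to the question's own COUNT column, a minimal sketch, assuming the values are read in as strings:
import pandas as pd

df = pd.DataFrame({
    "METRIC": ["Scans", "No Reads", "Diverts", "No Code%", "No Read%"],
    "COUNT": ["125487", "2541", "54710", "0.58%", "1.25%"],
})

# Strip a trailing "%" where present, then convert; percentages stay as 0.58 etc.
df["COUNT"] = pd.to_numeric(df["COUNT"].str.rstrip("%"))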

Iterating over dataframe and using replace method based on condtions

I am attempting to iterate over a specific column in my dataframe.
The column is:
df['column'] = ['1.4million', '1,235,000', '100million', np.nan, '14million', '2.5mill']
I am trying to clean this column and eventually get it all to integers to do more work with. I am stuck on the step to clean out "million". I would like to replace the "million" with five zeros when there is a decimal (i.e. 1.4million becomes 1.400000) and with six zeros when there is no decimal (i.e. 100million becomes 100000000).
To simplify, the first step I'm trying is to just focus on filtering out the values with a decimal and replace those with 5 zeros. I have attempted to use np.where for this, however I cannot use the replace method with numpy.
I also attempted to use pd.DataFrame.where, but am getting an error:
for i, row in df.iterrows():
    df.at[i, 'column'] = pd.DataFrame.where('.' in df.at[i, 'column'], df.at[i, 'column'].replace('million', ''), df.at[i, 'column'])
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I'm sure there is something I'm missing here. (I'm also sure I'll be told that I don't need to use iterrows here, so I am open to suggestions on that as well.)
Given your sample data, it looks like you can strip out commas and then take all digits (and . characters) until the string mill or end of string, and split those out, e.g.:
x = df['column'].str.replace(',', '').str.extract('(.*?)(mill.*)?$')
This'll give you:
         0        1
0      1.4  million
1  1235000      NaN
2      100  million
3      NaN      NaN
4       14  million
5      2.5     mill
Then take the number part and multiply it by a million where there's something in column 1, else multiply it by 1, e.g.:
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)
That'll give you:
0      1400000.0
1      1235000.0
2    100000000.0
3            NaN
4     14000000.0
5      2500000.0
Try this:
df['column'].apply(lambda x : x.replace('million','00000'))
Make sure your dtype is string before applying this
For the given data:
df['column'].apply(lambda x: float(str(x).split('m')[0]) * 10**6
                   if 'million' in str(x) or 'mill' in str(x) else x)
If there may be many forms of "million" in the column, then use a regex search.
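A hedged regex-based sketch along those lines, assuming only "mill..."-style suffixes denote millions:
import numpy as np
import pandas as pd

df = pd.DataFrame({"column": ["1.4million", "1,235,000", "100million",
                              np.nan, "14million", "2.5mill"]})

# Drop thousands separators, then split the number from an optional "mill..." suffix
parts = (df["column"].str.replace(",", "", regex=False)
         .str.extract(r"(?P<num>[\d.]+)(?P<suffix>mill\w*)?"))

# Multiply by one million wherever a "mill..." suffix was present
df["numeric"] = pd.to_numeric(parts["num"]) * np.where(parts["suffix"].notna(), 1_000_000, 1)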
