Convert object column with spaces and a minus sign to float in pandas - python

Time Floating P/L
2 2019.09.30 -16.60
3 2019.10.01 -4.40
4 2019.10.02 -1 162.04
5 2019.10.03 -82.88
I have a DataFrame where the column Floating P/L has dtype object and I need to convert it to float. From googling, this is the usual approach:
df['Floating P/L'] = df['Floating P/L'].replace('[^\d.]', '', regex=True).astype(float)
However, after doing this, the minus sign is deleted as well:
2 16.60
3 4.40
4 1162.04
5 82.88
How can I keep the minus sign? I assume others may have the same problem, so I am posting it here.

You might use
df['Floating P/L'] = df['Floating P/L'].replace('[^-\d.]', '', regex=True).astype(float)
# ^^^
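
For reference, here is a minimal reproduction of that fix on data shaped like the sample above (the column name and values are assumed from the question):

import pandas as pd

# Sample data assumed from the question: spaces inside numbers, leading minus signs
df = pd.DataFrame({'Floating P/L': ['-16.60', '-4.40', '-1 162.04', '-82.88']})

# Keeping '-' inside the negated character class preserves the sign, while
# everything that is not a digit, dot or minus (including spaces) is stripped
df['Floating P/L'] = df['Floating P/L'].replace(r'[^-\d.]', '', regex=True).astype(float)
print(df['Floating P/L'].tolist())  # [-16.6, -4.4, -1162.04, -82.88]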

I would remove all whitespace (\s) and then convert to float, that is:
import pandas as pd
df = pd.DataFrame({'Floating':['-16.60','-4.40','-1 162.04','-82.88']})
df['Floating'] = df['Floating'].replace(r'\s', '', regex=True).astype(float)
print(df)
print(df['Floating'].dtype)
Output:
Floating
0 -16.60
1 -4.40
2 -1162.04
3 -82.88
float64
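
If the space is really a thousands separator coming straight from a file, another option is to let pandas strip it while reading. This is only a sketch with a hypothetical file name; the thousands parameter of read_csv does the cleanup at parse time:

import pandas as pd

# Hypothetical file; ' ' is treated as the thousands separator while parsing,
# so 'Floating P/L' comes out as float64 if every value is numeric
df = pd.read_csv('trades.csv', thousands=' ')
print(df['Floating P/L'].dtype)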

Related

Convert the string into a float value

I have copied a table with three columns from a PDF file (a screenshot of the table was attached to the original question).
The values in the padj column are exponential values; however, after copying from the PDF into Excel and then loading the file with pandas, they come in as strings (object dtype), so they cannot be parsed as floats or numeric values directly. I need these values as floats, not as strings. Can someone help me with some suggestions?
So far this is what I have tried.
The Excel/CSV file is then opened in Python using the unicode_escape encoding in order to circumvent a UnicodeDecodeError:
## open the file
df = pd.read_csv("S2_GSE184956.csv",header=0,sep=',',encoding='unicode_escape')[["DEGs","LFC","padj"]]
df.head()
DEGs padj LFC
0 JUNB 1.5 ×10-8 -1.273329
1 HOOK2 2.39×10-7 -1.109320
2 EGR1 3.17×10-6 -4.187828
3 DUSP1 3.95×10-6 -3.251030
4 IL6 3.95×10-6 -3.415500
5 ARL4C 5.06×10-6 -2.147519
6 NR4A2 2.94×10-4 -3.001167
7 CCL3L1 4.026×10-4 -5.293694
# Convert the string to float by replacing the x10- with an exponent sign
from unidecode import unidecode
df['padj'] = df['padj'].apply(lambda x: (unidecode(x).replace('x10-','x10-e'))).astype(float)
That threw an error,
ValueError: could not convert string to float: '1.5 x10-e8'
Any suggestions would be appreciated. Thanks
With the dataframe shared in the latest edit of the question, the following, using pandas.Series.str.replace and pandas.Series.astype, will do the job:
df['padj'] = df['padj'].str.replace('×10','e').str.replace(' ', '').astype(float)
The goal is to get the cells to look like the following: 1.500000e-08.
Notes:
Depending on the rest of the dataframe, additional adjustments might still be required, such as removing stray quote characters ' that might exist in some of the cells. For that, one can use pandas.Series.str.replace as follows:
df['padj'] = df['padj'].str.replace("'", '')
Considering your sample (column padj), the code below should work:
f_value = eval(str_float.replace('x10', 'e').replace(' ', ''))
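
Applied to the whole column, that per-string conversion could look like the sketch below. It assumes the padj sample shown above, uses float() rather than eval(), and uses the multiplication sign × that actually appears in the data:

# Turn '1.5 ×10-8' into '1.5e-8' and parse it as a float
df['padj'] = df['padj'].apply(lambda s: float(s.replace('×10', 'e').replace(' ', '')))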
Updated based on the data you provided above. The most significant point is that the x is actually a multiplication sign (×):
import pandas as pd
DEGs = ["JUNB", "HOOK2", "EGR1", "DUSP1", "IL6", "ARL4C", "NR4A2", "CCL3L1"]
padj = ["1.5 ×10-8", "2.39×10-7", "3.17×10-6", "3.95×10-6", "3.95×10-6", "5.06×10-6", "2.94×10-4", "4.026×10-4"]
LFC = ["-1.273329", "-1.109320", "-4.187828", "-3.251030", "-3.415500", "-2.147519", "-3.001167", "-5.293694"]
df = pd.DataFrame({'DEGs': DEGs, 'padj': padj, 'LFC': LFC})
# change to python-friendly float format
df['padj'] = df['padj'].str.replace(' ×10-', 'e-', regex=False)
df['padj'] = df['padj'].str.replace('×10-', 'e-', regex=False)
# convert padj from string to float
df['padj'] = df['padj'].astype(float)
will give you a dataframe whose padj column is float64.
If you want a numeric, vectorized solution, you can use:
df['float'] = (df['padj'].str.extract(r'(\d+(?:\.\d+))\s*×10(.?\d+)')
.apply(pd.to_numeric).pipe(lambda d: d[0].mul(10.**d[1]))
)
output:
DEGs padj LFC float
0 JUNB 1.5 ×10-8 -1.273329 1.500000e-08
1 HOOK2 2.39×10-7 -1.109320 2.390000e-07
2 EGR1 3.17×10-6 -4.187828 3.170000e-06
3 DUSP1 3.95×10-6 -3.251030 3.950000e-06
4 IL6 3.95×10-6 -3.415500 3.950000e-06
5 ARL4C 5.06×10-6 -2.147519 5.060000e-06
6 NR4A2 2.94×10-4 -3.001167 2.940000e-04
7 CCL3L1 4.026×10-4 -5.293694 4.026000e-04
Intermediate:
df['padj'].str.extract(r'(\d+(?:\.\d+))\s*×10(.?\d+)')
0 1
0 1.5 -8
1 2.39 -7
2 3.17 -6
3 3.95 -6
4 3.95 -6
5 5.06 -6
6 2.94 -4
7 4.026 -4
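
Note that the fractional group in the pattern is mandatory, so a value without a decimal part (say 3×10-5, which does not occur in the sample but might elsewhere) would not be extracted; making the group optional is a small, hedged variation:

# (?:\.\d+)? instead of (?:\.\d+) so integer mantissas are matched too
df['padj'].str.extract(r'(\d+(?:\.\d+)?)\s*×10(.?\d+)')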

pandas: convert column with multiple datatypes to int, ignore errors

I have a column with data that needs some massaging. The column may contain strings or floats, and some strings are in exponential form. I'd like to format all data in this column as a whole number where possible, expanding any exponential notation to an integer. Here is an example:
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].astype(int, errors = 'ignore')
The above code does not seem to do a thing. I know I can convert the exponential notation and decimals simply by using the int function, and I would think the above astype would do the same, but it does not. For example, the following works in plain Python:
int(1170E1), int(1.17E+04), int(11700.0)
> (11700, 11700, 11700)
Any help in solving this would be appreciated. What I'm expecting the output to look like is:
0 '11700'
1 '11700'
2 '11700
3 '24477G'
4 '124601'
5 '247602'
You can check with pd.to_numeric:
df.code = pd.to_numeric(df.code,errors='coerce').fillna(df.code)
Out[800]:
0 11700.0
1 11700.0
2 11700.0
3 24477G
4 124601.0
5 247602.0
Name: code, dtype: object
Update
df['code'] = df['code'].astype(object)
s = pd.to_numeric(df['code'],errors='coerce')
df.loc[s.notna(),'code'] = s.dropna().astype(int)
df
Out[829]:
code
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
BENY's answer should work, although you potentially leave yourself open to coercing and filling values that you don't want to. This will also do the integer conversion you are looking for:
def convert(x):
    try:
        return str(int(float(x)))
    except ValueError:
        return x
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].apply(convert)
outputs
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
where each element is a string.
I will be the first to say, I'm not proud of that triple cast.

Unable to convert comma separated integers and non-integer values to float in a series column in Python

Loading in the data:
In: import pandas as pd
In: df = pd.read_csv('name', sep=';', encoding='unicode_escape')
In: df.dtypes
Out: amount    object
I have an object column with amounts like 150,01 and 43,69. There are about 5,000 rows.
df['amount']
0 31
1 150,01
2 50
3 54,4
4 32,79
...
4950 25,5
4951 39,5
4952 75,56
4953 5,9
4954 43,69
Name: amount, Length: 4955, dtype: object
Naturally, I tried to convert the series using the locale, which is supposed to turn it into a float format. I came back with the following error:
In: import locale
In: locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
Out: 'en_US.UTF-8'
In: df['amount'].apply(locale.atof)
Out: ValueError: could not convert string to float: ' - '
Now that I'm aware there are non-numeric values in the column, I tried to use isnumeric methods to turn the non-numeric values into NaN. Unfortunately, due to the comma-separated structure, all the values turned into -1:
0 -1
1 -1
2 -1
3 -1
4 -1
..
4950 -1
4951 -1
4952 -1
4953 -1
4954 -1
Name: amount, Length: 4955, dtype: int64
How do I turn the "," into "." after first removing the "-" values? I tried .drop() and .truncate(), but they do not help. If I replace the "," with " ", it also causes trouble, since there are non-integer values.
Please help!
Documentation that I came across:
- https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas
- https://stackoverflow.com/questions/56315468/replace-comma-and-dot-in-pandas
p.s. This is my first post, please be kind
Sounds like you have a European-style CSV similar to the following. Provide actual sample data as many comments asked for if your format is different:
data.csv
thing;amount
thing1;31
thing2;150,01
thing3;50
thing4;54,4
thing5;1.500,22
To read it, specify the column, decimal and thousands separator as needed:
import pandas as pd
df = pd.read_csv('data.csv',sep=';',decimal=',',thousands='.')
print(df)
Output:
thing amount
0 thing1 31.00
1 thing2 150.01
2 thing3 50.00
3 thing4 54.40
4 thing5 1500.22
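
If the data has already been read as strings and re-reading the file is not an option, a roughly equivalent cleanup (a sketch, assuming the amount column shown in the question) is to normalize the separators on the Series itself and let pd.to_numeric coerce the stray ' - ' entries to NaN:

import pandas as pd

# Drop '.' thousands separators, turn ',' decimals into '.', then coerce non-numbers to NaN
cleaned = (df['amount'].str.replace('.', '', regex=False)
                       .str.replace(',', '.', regex=False))
df['amount'] = pd.to_numeric(cleaned, errors='coerce')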
Posting as an answer since it contains multi-line code, despite not truly answering your question (yet):
Try using chardet. pip install chardet to get the package, then in your import block, add import chardet.
When importing the file, do something like:
with open("C:/path/to/file.csv", 'r') as f:
data = f.read()
result = chardet.detect(data.encode())
charencode = result['encoding']
# now re-set the handler to the beginning and re-read the file:
f.seek(0, 0)
data = pd.read_csv(f, delimiter=';', encoding=charencode)
Alternatively, for reasons I cannot fathom, passing engine='python' as a parameter works often. You'd just do
data = pd.read_csv('C:/path/to/file.csv', engine='python')
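
For what it is worth, chardet works on raw bytes, so a variant of the sketch above that reads the file in binary mode first avoids the decode-then-re-encode round trip (the path and delimiter are the same assumptions as before):

import chardet
import pandas as pd

# Detect the encoding from the raw bytes, then let pandas read the file with it
with open("C:/path/to/file.csv", 'rb') as f:
    raw = f.read()
charencode = chardet.detect(raw)['encoding']
data = pd.read_csv("C:/path/to/file.csv", delimiter=';', encoding=charencode)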
@Mark Tolonen has a more elegant approach to standardizing the actual data, but my (hacky) way of doing it was to just write a function:
def stripThousands(self, df_column):
    df_column.replace(',', '', regex=True, inplace=True)
    df_column = df_column.apply(pd.to_numeric, errors='coerce')
    return df_column
If you don't care about the entries that are just hyphens, you could use a function like
def screw_hyphens(self, column):
    column.replace(['-'], np.nan, inplace=True)
or if np.nan values will be a problem, you can just replace it with column.replace('-', '', inplace=True)
EDIT: there was a typo in the block outlining the usage of chardet. It should be correct now (previously the end of the last line was encoding=charenc).

Trying to format values in a pandas DataFrame showing as 1,008.59- and 1008.59 (with a trailing space after positive values) into -1008.59 and 1008.59

Amtindoccurre
1008.59  (there is a trailing space after this value)
1008.59-
I need to format those values properly, like 1008.59 (without any trailing space) and -1008.59, in a pandas DataFrame.
Can anyone tell me how to do this?
Thanks
DS
You can use .str.replace() with a regex to parse the float numbers with a trailing sign or space and move that character to the front. Then use .str.strip() to remove any remaining space:
import pandas as pd

data = {'val': ['1008.59 ', '1008.59-', '57,039.54 ', '4,232.49 ', '4,191.59-', '1,257,039.54 ', '2,257,039.54-']}
df = pd.DataFrame(data)
df['val'] = df['val'].str.replace(r'(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)', r'\2\1', regex=True).str.strip()
Result:
print(df)
val
0 1008.59
1 -1008.59
2 57,039.54
3 4,232.49
4 -4,191.59
5 1,257,039.54
6 -2,257,039.54
If you want to remove the thousand separators , also, you can use:
df['val'] = df['val'].str.replace(r'(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)', r'\2\1', regex=True).str.strip().str.replace(',', '', regex=True)
Result:
print(df)
val
0 1008.59
1 -1008.59
2 57039.54
3 4232.49
4 -4191.59
5 1257039.54
6 -2257039.54
If you want to further convert the numbers from string to numeric values (float type), you can use:
df['val'] = df['val'].str.replace(r'(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)', r'\2\1', regex=True).str.strip().str.replace(',', '', regex=True).astype(float)

Output of column in Pandas dataframe from float to currency (negative values)

I have the following data frame (consisting of both negative and positive numbers):
df.head()
Out[39]:
Prices
0 -445.0
1 -2058.0
2 -954.0
3 -520.0
4 -730.0
I am trying to change the 'Prices' column to display as currency when I export it to an Excel spreadsheet. The following command I use works well:
df['Prices'] = df['Prices'].map("${:,.0f}".format)
df.head()
Out[42]:
Prices
0 $-445
1 $-2,058
2 $-954
3 $-520
4 $-730
Now my question is: what would I do if I wanted the output to have the negative sign BEFORE the dollar sign? In the output above, the dollar signs come before the negative signs. I am looking for something like this:
-$445
-$2,058
-$954
-$520
-$730
Please note there are also positive numbers as well.
You can use np.where and test whether the values are negative and if so prepend a negative sign in front of the dollar and cast the series to a string using astype:
In [153]:
import numpy as np
df['Prices'] = np.where(df['Prices'] < 0, '-$' + df['Prices'].astype(str).str[1:], '$' + df['Prices'].astype(str))
df['Prices']
Out[153]:
0 -$445.0
1 -$2058.0
2 -$954.0
3 -$520.0
4 -$730.0
Name: Prices, dtype: object
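
Another option, if you format straight from the still-numeric column, is a single .map with a lambda that writes the sign first, then the dollar sign, then the absolute value (a sketch along the lines of the .map(format) call in the question):

# Sign first, then '$', then the absolute value with thousands separators
df['Prices'] = df['Prices'].map(lambda x: f"{'-' if x < 0 else ''}${abs(x):,.0f}")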
You can use the locale module and the _override_localeconv dict. It's not well documented, but it's a trick I found in another answer that has helped me before.
import pandas as pd
import locale
locale.setlocale( locale.LC_ALL, 'English_United States.1252')
# Made an assumption with that locale. Adjust as appropriate.
locale._override_localeconv = {'n_sign_posn':1}
# Load dataframe into df
df['Prices'] = df['Prices'].map(locale.currency)
This creates a dataframe that looks like this:
Prices
0 -$445.00
1 -$2058.00
2 -$954.00
3 -$520.00
4 -$730.00
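
If you also want thousands separators in the locale-formatted output, locale.currency accepts a grouping flag; a small sketch, used in place of the .map call above:

# grouping=True inserts the locale's thousands separator, e.g. -$2,058.00
df['Prices'] = df['Prices'].map(lambda v: locale.currency(v, grouping=True))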
