How could I convert this column to numeric? - python

I have a problem converting this column to numeric. I tried this:
pricesquare['PRICE_SQUARE'] = pd.to_numeric(pricesquare['PRICE_SQUARE'])
ValueError: Unable to parse string "13 312 " at position 0
and this:
df["PRICE_SQUARE"] = df["PRICE_SQUARE"].astype(str).astype(int)
ValueError: invalid literal for int() with base 10: '13\xa0312'
I don't know what to do.

You can replace the \xa0 (non-breaking space) unicode character with an empty string before converting to int.
import pandas as pd
data = ["13\xa0312", "14\xa01234"]
pd.Series(data).str.replace("\xa0", "").astype(int)
0 13312
1 141234
dtype: int64
You can also use unicodedata.normalize to normalize the unicode character to a regular space, then replace that space with an empty string, and finally convert to int.
import unicodedata
import pandas as pd
data = ["13\xa0312", "14\xa01234"]
pd.Series(data).apply(lambda s: unicodedata.normalize("NFKC", s)).str.replace(" ", "").astype(int)
0 13312
1 141234
dtype: int64

Use apply (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html?highlight=apply#pandas.DataFrame.apply) and, in the applied function, remove the space before you convert to int; a sketch follows below.
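A minimal sketch of that suggestion, using the column name from the question (the sample values here are made up):
import pandas as pd

pricesquare = pd.DataFrame({"PRICE_SQUARE": ["13\xa0312", "14\xa0100"]})
# Remove non-breaking and regular spaces inside each value, then convert to int
pricesquare["PRICE_SQUARE"] = pricesquare["PRICE_SQUARE"].apply(
    lambda s: int(str(s).replace("\xa0", "").replace(" ", ""))
)
print(pricesquare["PRICE_SQUARE"])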

Related

How to check the pattern of a column in a dataframe

I have a dataframe which has some IDs. I want to check the pattern of those column values.
Here is what the column looks like:
id: {ASDH12HK, GHST67KH, AGSH90IL, THKI86LK}
I want to write code that can distinguish characters from numerics in the pattern above and display an output like 'SSSS99SS', where 'S' represents a character and '9' represents a numeric. This is a large dataset, so I can't predefine the positions of the characters and numerics; I want the code to calculate them. I am new to Python, so any leads will be helpful!
You can try something like:
my_string = "ASDH12HK"
def decode_pattern(my_string):
my_string = ''.join(str(9) if s.isdigit() else s for s in my_string)
my_string = ''.join('S' if s.isalpha() else s for s in my_string)
return my_string
decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well, as shown below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
id pattern
0 ASDH12HK SSSS99SS
1 GHST67KH SSSS99SS
2 AGSH90IL SSSS99SS
3 THKI86LK SSSS99SS
4 SOMEPATTERN123 SSSSSSSSSSS999
You can use regular expressions:
import re
st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It will return a match if the string starts with 4 characters, followed by 2 numerics, followed by another 4 characters.
So you can use this on your df to filter the df, as in the sketch below.
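A minimal sketch of that filtering idea, assuming a df with an 'id' column as in the question (the trailing group uses {2} here so it matches IDs like 'ASDH12HK'):
import pandas as pd

df = pd.DataFrame({'id': ['ASDH12HK', 'GHST67KH', 'BAD_ID', 'THKI86LK']})
# Keep only rows whose id is 4 letters, 2 digits, then 2 letters
mask = df['id'].str.match(r'^[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}$')
print(df[mask])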
You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)

ValueError: could not convert string to float:?

I have a dataset with object type columns, which was imported as a txt file into Jupyter Notebook. Now I am trying to plot some auto-correlation for an individual column and it is not working.
My first attempt was to convert the object columns to float, but I get the error message:
could not convert string to float: ?
How do I fix this?
Okay this is my script:
book = pd.read_csv('Book1.csv', parse_dates=True)
t= str(book.Global_active_power)
t
'0 4.216\n1 5.36\n2 5.374\n3 5.388\n4 3.666\n5 3.52\n6 3.702\n7 3.7\n8 3.668\n9 3.662\n10 4.448\n11 5.412\n12 5.224\n13 5.268\n14 4.054\n15 3.384\n16 3.27\n17 3.43\n18 3.266\n19 3.728\n20 5.894\n21 7.706\n22 7.026\n23 5.174\n24 4.474\n25 3.248\n26 3.236\n27 3.228\n28 3.258\n29 3.178\n ... \n1048545 0.324\n1048546 0.324\n1048547 0.324\n1048548 0.322\n1048549 0.322\n1048550 0.322\n1048551 0.324\n1048552 0.324\n1048553 0.326\n1048554 0.326\n1048555 0.324\n1048556 0.324\n1048557 0.322\n1048558 0.322\n1048559 0.324\n1048560 0.322\n1048561 0.322\n1048562 0.324\n1048563 0.388\n1048564 0.424\n1048565 0.42\n1048566 0.418\n1048567 0.418\n1048568 0.42\n1048569 0.422\n1048570 0.426\n1048571 0.424\n1048572 0.422\n1048573 0.422\n1048574 0.422\nName: Global_active_power, Length: 1048575, dtype: object'
I believe the reason is that I have to format my column first so all values have an equal number of decimal places, and then I can convert to float, but trying to format it like this is not working for me:
print("{:0<4s}".format(book.Global_active_power))
The column contains a ? entry. Clean this up (along with any other extraneous entries) and you should not see this error.
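A minimal sketch of one way to do that cleanup, assuming the Global_active_power column from the question; errors='coerce' turns entries such as '?' into NaN so the rest can still be converted:
import pandas as pd

book = pd.DataFrame({'Global_active_power': ['4.216', '5.36', '?', '3.666']})
# Non-numeric entries such as '?' become NaN instead of raising an error
book['Global_active_power'] = pd.to_numeric(book['Global_active_power'], errors='coerce')
print(book['Global_active_power'])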

Pandas string was read in as string (object) but in numeric notation

I read in a csv file.
govselldata = pd.read_csv('govselldata.csv', dtype={'BUS_LOC_ID': str})
#or by this
#govselldata = pd.read_csv('govselldata.csv')
I have values in string format.
govselldata.dtypes
a int64
BUS_LOC_ID object
But they are not like '255048925478501030'; rather, they are in scientific notation, like 2.55048925478501e+17.
How do I convert it to '255048925478501030'?
Edit: Using float() did not work. This could be due to some white space.
govselldata['BUS_LOC'] = govselldata['BUS_LOC_ID'].map(lambda x: float(x))
ValueError: could not convert string to float:
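A minimal sketch of one possible approach, assuming the problem is stray whitespace plus scientific notation; note that once a value has already passed through a float, digits beyond roughly 15-16 significant figures cannot be recovered exactly:
import pandas as pd

govselldata = pd.DataFrame({'BUS_LOC_ID': [' 2.55048925478501e+17 ', '1.2e+3']})
# Strip whitespace, parse as float, then render without scientific notation
govselldata['BUS_LOC'] = (
    govselldata['BUS_LOC_ID']
    .str.strip()
    .astype(float)
    .map(lambda x: format(x, '.0f'))
)
print(govselldata['BUS_LOC'])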

How to convert numbers represented in shorthand (like '3.456B') into numeric in Python

I have a column in my data frame which has values like '3.456B', which actually stands for 3.456 billion (with similar notation for millions). How can I convert this string form to the correct numeric representation?
This shows the data frame:
import pandas as pd
data_csv = pd.read_csv('https://biz.yahoo.com/p/csv/422conameu.csv')
data_csv
This is a sample value:
data_csv['Market Cap'][0]
type(data_csv['Market Cap'][0])
I tried this:
data_csv.loc[data_csv['Market Cap'].str.contains('B'), 'Market Cap'] = data_csv['Market Cap'].str.replace('B', '').astype(float).fillna(0.0)
data_csv
But unfortunately there are also values with 'M' at the end, which denotes millions. It returns an error as follows:
ValueError: invalid literal for float(): 6.46M
How can I replace both B and M with appropriate values in this column? Is there a better way to do it?
I'd use a dictionary to replace the strings then evaluate as float.
mapping = dict(K='E3', M='E6', B='E9')
df['Market Cap'] = pd.to_numeric(df['Market Cap'].replace(mapping, regex=True))
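For example, a short run of that mapping approach on a couple of made-up sample values:
import pandas as pd

df = pd.DataFrame({'Market Cap': ['6.46M', '2.25B', '980.5K']})
mapping = dict(K='E3', M='E6', B='E9')
# 'K'/'M'/'B' become exponents, e.g. '6.46M' -> '6.46E6', which to_numeric can parse
df['Market Cap'] = pd.to_numeric(df['Market Cap'].replace(mapping, regex=True))
# The column now holds the floats 6.46e6, 2.25e9 and 9.805e5
print(df['Market Cap'])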
Assuming all entries have a letter at the end, you can do this:
d = {'K': 1000, 'M': 1000000, 'B': 1000000000}
df.loc[:, 'Market Cap'] = pd.to_numeric(df['Market Cap'].str[:-1]) * \
df['Market Cap'].str[-1].replace(d)
This converts everything but the last character into a numeric value, then multiplies it by the number equivalent to the letter in the last character.
First extract the units as the last character of each string. Then convert the values without units to floats and multiply where needed:
df = pd.DataFrame({'Market Cap':['6.46M','2.25B','0.23B']})
units = df['Market Cap'].str[-1]
df['Market Cap'] = df['Market Cap'].str[:-1].astype(float)
df.loc[units=='M','Market Cap'] *= 0.001
# Market Cap
# 0 0.00646
# 1 2.25000
# 2 0.23000
Now everything is in billions.

Can you format pandas integers for display, like `pd.options.display.float_format` for floats?

I've seen this and this on formatting floating-point numbers for display in pandas, but I'm interested in doing the same thing for integers.
Right now, I have:
pd.options.display.float_format = '{:,.2f}'.format
That works on the floats in my data, but will either leave annoying trailing zeroes on integers that are cast to floats, or I'll have plain integers that don't get formatted with commas.
The pandas docs mention a SeriesFormatter class about which I haven't been able to find any information.
Alternatively, if there's a way to write a single string formatter that will format floats as '{:,.2f}' and floats with zero trailing decimal as '{:,d}', that'd work too.
You could monkey-patch pandas.io.formats.format.IntArrayFormatter:
import contextlib
import numpy as np
import pandas as pd
import pandas.io.formats.format as pf

np.random.seed(2015)

@contextlib.contextmanager
def custom_formatting():
    orig_float_format = pd.options.display.float_format
    orig_int_format = pf.IntArrayFormatter
    pd.options.display.float_format = '{:0,.2f}'.format

    class IntArrayFormatter(pf.GenericArrayFormatter):
        def _format_strings(self):
            formatter = self.formatter or '{:,d}'.format
            fmt_values = [formatter(x) for x in self.values]
            return fmt_values

    pf.IntArrayFormatter = IntArrayFormatter
    yield
    pd.options.display.float_format = orig_float_format
    pf.IntArrayFormatter = orig_int_format

df = pd.DataFrame(np.random.randint(10000, size=(5,3)), columns=list('ABC'))
df['D'] = np.random.random(df.shape[0])*10000

with custom_formatting():
    print(df)
yields
A B C D
0 2,658 2,828 4,540 8,961.77
1 9,506 2,734 9,805 2,221.86
2 3,765 4,152 4,583 2,011.82
3 5,244 5,395 7,485 8,656.08
4 9,107 6,033 5,998 2,942.53
while outside of the with-statement:
print(df)
yields
A B C D
0 2658 2828 4540 8961.765260
1 9506 2734 9805 2221.864779
2 3765 4152 4583 2011.823701
3 5244 5395 7485 8656.075610
4 9107 6033 5998 2942.530551
Another option for Jupyter notebooks is to use df.style.format('{:,}'), but it only works on a single dataframe as far as I know, so you would have to call this every time:
table.style.format('{:,}')
col1 col2
0s 9,246,452 6,669,310
>0 2,513,002 5,090,144
table
col1 col2
0s 9246452 6669310
>0 2513002 5090144
Styling — pandas 1.1.2 documentation
Starting with Pandas 1.3.0, you can specify df.style.format(thousands=',') to use commas to separate thousands in floats, complex numbers, and integers.
See: https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.format.html.
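For instance, a minimal sketch of that option on a made-up DataFrame:
import pandas as pd

df = pd.DataFrame({'col1': [9246452, 2513002], 'col2': [6669310, 5090144]})
# Requires pandas >= 1.3.0; returns a Styler object for notebook display
df.style.format(thousands=',')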
Although it's been years since the question was asked:
Even though I set the format at the beginning, the format changes after using add.
We can use astype to convert the format back, as in the sketch below.
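A minimal sketch of what that answer seems to describe (an assumption, since the original example is not included here): adding a float to an integer column promotes the dtype to float, and astype converts it back so integer formatting applies again:
import pandas as pd

df = pd.DataFrame({'A': [2658, 9506, 3765]})
df['A'] = df['A'].add(0.5)     # dtype becomes float64, so values display as floats
df['A'] = df['A'].astype(int)  # truncates back to int64, displayed as integers again
print(df.dtypes)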
