Pandas string was read in as string (object) but in numeric notation - python

I read in a csv file.
govselldata = pd.read_csv('govselldata.csv', dtype={'BUS_LOC_ID': str})
#or by this
#govselldata = pd.read_csv('govselldata.csv')
I have values in string format.
govselldata.dtypes
a int64
BUS_LOC_ID object
But the values do not look like '255048925478501030'; they are in scientific notation, like 2.55048925478501e+17.
How do I convert it to '255048925478501030'?
Edit: Using float() did not work, possibly because of whitespace in some values.
govselldata['BUS_LOC'] = govselldata['BUS_LOC_ID'].map(lambda x: float(x))
ValueError: could not convert string to float:
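The empty message after the colon suggests that at least one value is blank or whitespace only. One possible approach, shown here as a minimal sketch rather than a definitive fix, is to strip each value and expand it with decimal.Decimal, leaving anything unparseable untouched. The helper name expand_scientific and the sample values are made up for illustration. Note that digits which were already dropped when the file was written in scientific notation cannot be recovered, so the example yields '255048925478501000', not '255048925478501030'.

```python
from decimal import Decimal, InvalidOperation

import pandas as pd


def expand_scientific(value):
    """Return the plain-digit form of a numeric string, or the input unchanged.

    Hypothetical helper for illustration; int() truncates any fractional part.
    """
    if not isinstance(value, str):
        return value
    text = value.strip()  # drop leading/trailing whitespace
    try:
        return str(int(Decimal(text)))
    except (InvalidOperation, ValueError, OverflowError):
        return value  # blanks and non-numeric text are left as they were


# Illustrative values only, not taken from the real file
ids = pd.Series(["2.55048925478501e+17", " 1.5e+3 ", "", "abc"])
expanded = ids.map(expand_scientific)
print(expanded.tolist())
```

Decimal is used instead of float so that the expansion is exact for the digits that are present in the string.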

Related

How could I convert this column to numeric?

I am having trouble converting this column to numeric. I tried this:
pricesquare['PRICE_SQUARE'] = pd.to_numeric(pricesquare['PRICE_SQUARE'])
ValueError: Unable to parse string "13 312 " at position 0
df["PRICE_SQUARE"] = df["PRICE_SQUARE"].astype(str).astype(int)
ValueError: invalid literal for int() with base 10: '13\xa0312.
I don't know what to do.
You can remove the \xa0 character (a non-breaking space) by replacing it with an empty string before converting to int.
import pandas as pd
data = ["13\xa0312", "14\xa01234"]
pd.Series(data).str.replace("\xa0", "").astype(int)
0 13312
1 141234
dtype: int64
You can also use unicodedata.normalize to turn the non-breaking space into a regular space, then replace the spaces with an empty string, and finally convert to int.
import unicodedata
import pandas as pd
data = ["13\xa0312", "14\xa01234"]
pd.Series(data).apply(lambda s: unicodedata.normalize("NFKC", s)).str.replace(" ", "").astype(int)
0 13312
1 141234
dtype: int64
Use apply (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html?highlight=apply#pandas.DataFrame.apply) and, in the function, remove the space before you convert to int.
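A minimal sketch of the apply approach, assuming every entry is a string of digits separated by whitespace. The function name to_int and the sample values are illustrative. It relies on str.split() with no arguments, which splits on any Unicode whitespace, including the non-breaking space \xa0.

```python
import pandas as pd


def to_int(text):
    # Splitting on whitespace and re-joining removes regular spaces
    # as well as non-breaking spaces (\xa0)
    return int("".join(text.split()))


# Illustrative values only
prices = pd.Series(["13\xa0312", "14 1234 "])
result = prices.apply(to_int)
print(result)
```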

pandas from mix data type convert exponential or scientific numbers to integers

I have been looking for a solution and tried a few suggestions, but could not find an answer.
I have a column that contains both strings and long numbers in exponent form, and I need to get the full number out of the exponent values for further processing.
pandas exponential scientific numbers to integers.
Click here for Sample Data
IT looks like
import pandas as pd
# tried adding this
pd.options.display.float_format = "{:.0f}".format
df = pd.read_csv('Detail Statement.csv')
# tried converting to int by ignoring other types after commenting display.format
df['Ref Number'] = df['Ref Number'].astype(int, errors='ignore')
# tried map
df['Ref Number'] = df['Ref Number'].map(int)
You can define a custom function and use .apply
def convert(x):
    try:
        return int(float(x))
    except Exception:
        return x
df['Ref Number'] = df['Ref Number'].apply(convert)
df['Ref Number'].iloc[0], type(df['Ref Number'].iloc[0])
'HSB345678', str
df['Ref Number'].iloc[-1] , type(df['Ref Number'].iloc[-1])
201498000000, int
If floats are acceptable, you can use pd.to_numeric with the errors parameter set to 'coerce', then use .fillna to put back the strings that could not be converted.
df['Ref Number'] = pd.to_numeric(df['Ref Number'], errors='coerce').fillna(df['Ref Number'])
df['Ref Number'].dtype
dtype('O')
df['Ref Number'].iloc[0], type(df['Ref Number'].iloc[0])
'HSB345678', str
df['Ref Number'].iloc[-1] , type(df['Ref Number'].iloc[-1])
201498000000.0, float

ValueError: could not convert string to float:?

I have a dataset with object type, which was imported as a txt file into Jupyter Notebook. But now I am trying to plot some auto-correlation for an individual column and it is not working.
My first attempt was to convert the object columns to float but I get the error message:
could not convert string to float: ?
How do I fix this?
Okay this is my script:
book = pd.read_csv('Book1.csv', parse_dates=True)
t= str(book.Global_active_power)
t
'0 4.216\n1 5.36\n2 5.374\n3 5.388\n4 3.666\n5 3.52\n6 3.702\n7 3.7\n8 3.668\n9 3.662\n10 4.448\n11 5.412\n12 5.224\n13 5.268\n14 4.054\n15 3.384\n16 3.27\n17 3.43\n18 3.266\n19 3.728\n20 5.894\n21 7.706\n22 7.026\n23 5.174\n24 4.474\n25 3.248\n26 3.236\n27 3.228\n28 3.258\n29 3.178\n ... \n1048545 0.324\n1048546 0.324\n1048547 0.324\n1048548 0.322\n1048549 0.322\n1048550 0.322\n1048551 0.324\n1048552 0.324\n1048553 0.326\n1048554 0.326\n1048555 0.324\n1048556 0.324\n1048557 0.322\n1048558 0.322\n1048559 0.324\n1048560 0.322\n1048561 0.322\n1048562 0.324\n1048563 0.388\n1048564 0.424\n1048565 0.42\n1048566 0.418\n1048567 0.418\n1048568 0.42\n1048569 0.422\n1048570 0.426\n1048571 0.424\n1048572 0.422\n1048573 0.422\n1048574 0.422\nName: Global_active_power, Length: 1048575, dtype: object'
I believe the reason is that I first have to format my column so that every value has the same number of decimal places before I can convert to float, but trying to format it like this is not working for me:
print("{:0<4s}".format(book.Global_active_power))
The column contains a ? entry. Clean this up (along with any other extraneous entries) and you should not see this error.
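Two common ways to do that clean-up, sketched under the assumption that "?" is the only stray marker. The inline CSV text is made-up sample data standing in for Book1.csv.

```python
import io

import pandas as pd

# Made-up sample standing in for the real file
csv_text = "Global_active_power\n4.216\n?\n5.360\n"

# Option 1: tell read_csv to treat "?" as missing, so the column parses as float
book = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# Option 2: read as-is, then coerce anything non-numeric to NaN afterwards
raw = pd.read_csv(io.StringIO(csv_text))
power = pd.to_numeric(raw["Global_active_power"], errors="coerce")

print(book.dtypes)
print(power)
```

Option 2 also turns any other unparseable entries into NaN, which is convenient but can hide data problems, so it may be worth inspecting the affected rows first.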

How to convert numbers represented as characters for short into numeric in Python

I have a column in my data frame which has values like '3.456B', which actually stands for 3.456 billion (with similar notation for million). How can I convert this string form to the correct numeric representation?
This shows the data frame:
import pandas as pd
data_csv = pd.read_csv('https://biz.yahoo.com/p/csv/422conameu.csv')
data_csv
This is a sample value:
data_csv['Market Cap'][0]
type(data_csv['Market Cap'][0])
I tried this:
data_csv.loc[data_csv['Market Cap'].str.contains('B'), 'Market Cap'] = data_csv['Market Cap'].str.replace('B', '').astype(float).fillna(0.0)
data_csv
But unfortunately there are also values with 'M' at the end, which denotes millions. It returns the following error:
ValueError: invalid literal for float(): 6.46M
How can I replace both B and M with appropriate values in this column? Is there a better way to do it?
I'd use a dictionary to replace the strings then evaluate as float.
mapping = dict(K='E3', M='E6', B='E9')
df['Market Cap'] = pd.to_numeric(df['Market Cap'].replace(mapping, regex=True))
Assuming all entries have a letter at the end, you can do this:
d = {'K': 1000, 'M': 1000000, 'B': 1000000000}
df.loc[:, 'Market Cap'] = pd.to_numeric(df['Market Cap'].str[:-1]) * \
    df['Market Cap'].str[-1].replace(d)
This converts everything but the last character into a numeric value, then multiplies it by the number equivalent to the letter in the last character.
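If some entries might have no letter at the end, a variation on the same idea is to split the number and the optional suffix with str.extract and treat a missing suffix as a factor of 1. This is only a sketch: the sample values, the regular expression, and the variable names are assumptions for illustration.

```python
import pandas as pd

multipliers = {"K": 1e3, "M": 1e6, "B": 1e9}

# Illustrative values, including one entry without a suffix
caps = pd.Series(["6.46M", "2.25B", "512", "7K"])

# Column 0 holds the numeric part, column 1 the optional unit letter
parts = caps.str.extract(r"^\s*([\d.]+)\s*([KMB])?\s*$")
numbers = pd.to_numeric(parts[0], errors="coerce")

# Entries with no suffix get a factor of 1
factors = parts[1].map(multipliers).fillna(1.0)
values = numbers * factors
print(values)
```

Strings that do not match the pattern at all end up as NaN instead of raising an error.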
First extract units as last character in strings. Then convert values without units to floats and multiply where needed:
df = pd.DataFrame({'Market Cap':['6.46M','2.25B','0.23B']})
units = df['Market Cap'].str[-1]
df['Market Cap'] = df['Market Cap'].str[:-1].astype(float)
df.loc[units=='M','Market Cap'] *= 0.001
# Market Cap
# 0 0.00646
# 1 2.25000
# 2 0.23000
Now everything is in billions.

Return unknown string in dataframe (extract unknown string from

I have a large dataset, imported with read_csv as shown below, which should contain only float measurements and NaN.
df = pd.read_csv(file_,parse_dates=[['Date','Time']],na_values = ['No Data','Bad Data','','No Sample'],low_memory=False)
When I check df.dtypes, most of the columns are reported as object, which indicates that the dataframe contains other values that I am not aware of. I am looking for a way to identify those strings and replace them with NA values.
The first thing I wanted to do was convert everything to dtype = np.float, but I couldn't. Then I tried to read each (column, index) cell and return the identified strings.
I tried something very inefficient and time-consuming (I am a beginner). It worked for another dataframe, but here it returns an error:
TypeError: argument of type 'float' is not iterable
from isstring import *
list_string = []
for i in range(0,len(df)):
    for j in range(0,len(df.columns)):
        x = test.ix[i,j]
        if isstring(x) and '.' not in x:
            list_string.append(x)
list_string = pd.DataFrame(list_string, columns=["list_string"])
g = list_string.groupby('list_string').size()
Is there a simple way to detect unknown strings in a large dataset? Thanks
You could try:
string_list = []
for col, series in df.items():  # iterating over all columns - perhaps only select `object` types
    string_list += [s for s in series.unique() if isinstance(s, str)]
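Since the columns are supposed to hold floats, another option is to look for values that are present but fail numeric parsing, and count them per column. The sketch below uses a small made-up frame; the column names and the stray labels ("Calm", "Offline") are hypothetical.

```python
import pandas as pd

# Made-up frame standing in for the real data
df = pd.DataFrame({
    "temp": ["1.5", "Calm", "2.0", None],
    "wind": [3.2, 4.1, 5.0, 6.3],
    "rain": ["0.1", "Offline", "Calm", "0.4"],
})

stray = {}
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        continue  # already numeric, nothing to look for
    as_number = pd.to_numeric(df[col], errors="coerce")
    # Present in the data but not parseable as a number
    bad = df[col][as_number.isna() & df[col].notna()]
    stray[col] = bad.value_counts().to_dict()

print(stray)
```

Once the stray labels are known, they can be added to na_values in read_csv, or the columns can be converted with pd.to_numeric(..., errors='coerce').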
