How do I convert a DataFrame column containing strings and "-" values to floats?
I have tried pd.to_numeric and pd.Series.astype(int), but I haven't had any success.
What do you recommend?
If I understand correctly, you want pandas to convert the string 7.629.352 to the float value 7629352.0 and the string (21.808.956) to the value -21808956.0.
The first part is directly possible with the thousands parameter, and it is even possible to treat - as NaN:
m = pd.read_csv(..., thousands='.', na_values='-')
The real problem is the parentheses marking negative values.
You could use a Python function to convert the values (see the sketch after the code below). A possible alternative would be to post-process the dataframe column-wise:
m = pd.read_csv(..., thousands='.', na_values='-')
for col in m.columns:
    if m[col].dtype == np.dtype('object'):
        # strip the remaining thousands separators, then turn "(x)" into "-x"
        m[col] = (m[col].str.replace(r'\.', '', regex=True)
                        .str.replace(r'\((.*)\)', r'-\1', regex=True)
                        .astype('float64'))
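For the converter route mentioned above, a minimal sketch (the file name data.csv and the column name Value are placeholders, not from the question):

import pandas as pd

def parse_amount(s):
    """Parse strings like '7.629.352' or '(21.808.956)'; '-' becomes NaN."""
    s = s.strip()
    if s == '-':
        return float('nan')
    # parentheses mark negative values in this format
    negative = s.startswith('(') and s.endswith(')')
    s = s.strip('()').replace('.', '')  # drop parens and thousands separators
    value = float(s)
    return -value if negative else value

# converters applies the function to each raw cell before type inference
m = pd.read_csv('data.csv', converters={'Value': parse_amount})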
Here is my code:
dfFact = pd.read_sql_query("select * from MusicFact;", conn)
dfFact['artist'].fillna(0)
dfFact['artist'].astype('Int64')
dfFact['artist'] = dfFact['artist'].astype(str)
I am trying to convert this dataframe column into a string column.
My desired output in this case would be '3','3','3','3','2','1','0','0'.
I am stuck, so I thought I'd come to Stack Overflow - thank you in advance!
Series.fillna, if used without inplace=True, returns a new series with the missing values filled with the replacement value. Similarly, Series.astype returns a copy of the cast series. In your case you are not assigning the results of these operations back to the series, hence the changes are not propagating to the output.
Simply, use:
dfFact['artist'] = dfFact['artist'].fillna(0).astype(int).astype(str)
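A quick demonstration with made-up data (the values here are illustrative, not from the question):

import pandas as pd
import numpy as np

# a column like the question's, with trailing missing values
s = pd.Series([3.0, 3.0, 3.0, 3.0, 2.0, 1.0, np.nan, np.nan])
print(s.fillna(0).astype(int).astype(str).tolist())
# ['3', '3', '3', '3', '2', '1', '0', '0']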
I'm trying to sum an entire column by country, but when I use
my_df.groupby('COUNTRY').VALUES.mean()
It throws a
DataError: No numeric types to aggregate
And when I use
my_df.groupby('COUNTRY').VALUES.sum()
It produces really big values that are far from realistic (maybe by concatenating them as strings?)
Could it be that it interprets the values in the column as strings, or am I using the function the wrong way?
I'm trying to accomplish exactly what this guy is doing at 1:45 https://www.youtube.com/watch?v=qy0fDqoMJx8
i.e. the VALUES column contains integers that I want to sum by each country.
The values in the column were interpreted as strings; this question explains how to convert the datatype:
Change data type of columns in Pandas
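As a quick illustration of why the sums looked inflated (made-up values, not from the question's data): summing a string column concatenates instead of adding.

import pandas as pd

s = pd.Series(['10', '20', '30'])
print(s.sum())                 # '102030' -- strings are concatenated
print(pd.to_numeric(s).sum())  # 60 -- after conversion, a real sum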
I understand you are trying to get a count by country, but it is not clear whether you want to count the countries themselves or aggregate another variable.
Try:
my_df['COUNTRY'].value_counts()
to count within the same column, or, if the sum is based on another variable:
my_df[['COUNTRY','other_variable']].groupby(['COUNTRY']).sum()
Your question is not clear; you should show your dataframe.
You need to convert your VALUES series to a numeric type before performing any computations. For example, for conversion to integers:
# convert to integers, non-convertible values will become NaN
my_df['VALUES'] = pd.to_numeric(my_df['VALUES'], downcast='integer', errors='coerce')
# perform groupby as normal
grouped = my_df.groupby('COUNTRY')['VALUES'].mean()
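The same pattern works for a sum: my_df.groupby('COUNTRY')['VALUES'].sum() will now add numbers instead of concatenating strings.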
I have a dataframe where one of the columns holds strings and floats.
The column, named 'Value', has values like "AAA", "Korea, Republic of", "123,456.78" and "5000.00".
The first two values are obviously strings, and the last is obviously a float. The third value should be a float as well, but due to the commas, the next step of my code sees it as a string.
Is there an easy way for me to remove the commas for those values that are really floats but keep them for values that are really strings? So "Korea, Republic of" stays, but "123,456.78" converts to "123456.78".
Thanks.
To begin with, your Pandas column does not contain a mix of strings and floats: when a column is parsed from text, every entry in it is a string. You can verify this by doing something like (assuming the DataFrame is df and the column is c):
>>> df.dtypes
and noticing that the dtype is object.
Having said that, you can convert the string column to a different string column where the strings representing numbers have the commas removed. This might be useful for further operations, e.g., when you wish to see which entries can be converted to floats. It can be done as follows.
First, write a function like:
import re

def remove_commas_from_numbers(n):
    # matches strings that look like numbers: digits, optional comma groups,
    # and an optional decimal part
    r = re.compile(r'^(\d+(?:,\d+)?.+)*$')
    m = r.match(n)
    if not m:
        # not number-like, so leave the string untouched
        return n
    return n.replace(',', '')

remove_commas_from_numbers('1,1.')
Then, you can do something like:
>>> df.c = df.c.apply(remove_commas_from_numbers)
Again, it's important to note that df.c's dtype will still be object, i.e., the entries remain strings.
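As a follow-up, here is one way to check which entries are now convertible (a sketch assuming the same df and column c; errors='coerce' turns non-numbers into NaN):

import pandas as pd

numeric = pd.to_numeric(df.c, errors='coerce')
convertible = numeric.notna()  # boolean mask: True where the entry is numeric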
To avoid the following error, I would like to replace any integer in my DataFrame with Unix Time:
ValueError: mixed datetimes and integers in passed array
In a small subset of the Excel files I'm reading in, I know the integers that appear are 0. However, what if there were multiple distinct integers? Or what if there are multiple dtypes? How can I easily replace any non-datetimes with the epoch represented datetime?
This works for the simple case of replacing 0s:
for col_name in time_columns:
    time_col = data[col_name]
    if time_col.dtypes == np.dtype('object'):
        time_col.replace(to_replace=0, value=epoch, inplace=True)
    time_col = pd.DatetimeIndex(time_col).astype(np.int64) / 10**6
    data[col_name] = time_col
where
epoch = datetime.datetime.utcfromtimestamp(0)
Use Python's isinstance() or issubclass()
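That is terse, so here is a minimal sketch of what it could look like, reusing the question's epoch, data, and time_columns names (the helper function name is mine):

import datetime

epoch = datetime.datetime.utcfromtimestamp(0)

def to_datetime_or_epoch(value):
    # keep anything that is already a datetime (pandas Timestamps subclass
    # datetime.datetime), and replace every other type with the epoch
    if isinstance(value, datetime.datetime):
        return value
    return epoch

for col_name in time_columns:
    data[col_name] = data[col_name].map(to_datetime_or_epoch)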
I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is that the values come from an Excel file and the numbers are interpreted as strings, e.g. "1.111,52". Do you know how I can convert such a string to float / int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks the type stored in Excel. If the column was numeric in Excel, the conversion will work correctly. If the column was stored as text, try:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a CSV file instead, you can solve this by specifying both the thousands and the decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
You can look at df.dtypes to see the type of each column. Then, if a column's type is not what you want, you can change it with df['GM'].astype(float), and new_df = df.loc[df['GM'].astype(float) > 8000] should work as expected.
You can convert the entire column's data type to numeric:
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
You can see the data type of your column via its dtype attribute. To convert it to float, use the astype function as follows:
df['GM'].astype(float)
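Note that astype(float) alone will fail on European-formatted strings like the question's "1.111,52"; a sketch of the full cleanup for that case:

df['GM'] = (df['GM'].astype(str)
                    .str.replace('.', '', regex=False)   # drop thousands separators
                    .str.replace(',', '.', regex=False)  # comma becomes the decimal point
                    .astype(float))
new_df = df.loc[df['GM'] > 8000]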