Converting a column of strings to numbers in Pandas - python

How do I get the Units column to numeric?
I have a Google spreadsheet that I am reading in; the date column gets converted fine, but I'm not having much luck getting the Unit Sales column to convert to numeric. I'm including all the code, which uses requests to get the data:
from StringIO import StringIO
import requests
import pandas as pd
act = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak_wF7ZGeMmHdFZtQjI1a1hhUWR2UExCa2E4MFhiWWc&output=csv&gid=1')
dataact = act.content
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=['date'])
actdf.rename(columns={'Unit Sales': 'Units'}, inplace=True)  # in case the space in the name is messing me up
The different methods I have tried to convert Units to numeric:
actdf=actdf['Units'].convert_objects(convert_numeric=True)
#actdf=actdf['Units'].astype('float32')
Then I want to resample, but I'm getting strange string concatenations since the numbers are still strings:
#actdfq=actdf.resample('Q',sum)
#actdfq.head()
actdf.head()
#actdf
So the df looks like this, with just Units and the date index:
date
2013-09-01 3,533
2013-08-01 4,226
2013-07-01 4,281
Name: Units, Length: 161, dtype: object

You have to specify the thousands separator:
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=['date'], thousands=',')

This will work:
In [13]: s
Out[13]:
0 4,223
1 3,123
dtype: object
In [14]: s.str.replace(',','').convert_objects(convert_numeric=True)
Out[14]:
0 4223
1 3123
dtype: int64
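Note that convert_objects was deprecated and later removed from pandas. On modern versions the same cleanup can be done with pd.to_numeric; a minimal sketch:
import pandas as pd
s = pd.Series(['4,223', '3,123'])
# strip the thousands separator, then convert; errors='coerce' turns unparseable values into NaN
pd.to_numeric(s.str.replace(',', '', regex=False), errors='coerce')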

Python script to find the number of date columns in a csv file and update the date format to MM-DD-YYYY

I get a file every day with around 15 columns. Some days there are 2 date columns and some days one date column. Also, the date format on some days is YYYY-MM-DD and on some it's DD-MM-YYYY. The task is to convert the 2 (or 1) date columns to MM-DD-YYYY. Sample data in the csv file, for a few columns:
Execution_date,Extract_date,Requestor_Name,Count
2023-01-15,2023-01-15,John Smith,7
Sometimes we don't get the second column above, Extract_date:
Execution_date,Requestor_Name,Count
17-01-2023,Andrew Mill,3
The task is to find all the date columns in the file and change the date format to MM-DD-YYYY. So the sample output of the above 2 files will be:
Execution_date,Extract_date,Requestor_Name,Count
01-15-2023,01-15-2023,John Smith,7
Execution_date,Requestor_Name,Count
01-17-2023,Andrew Mill,3
I am using pandas and can't figure out how to deal with the missing second column on some days, and with the change in the date value format.
I can hardcode the 2 column names and change the format with:
df['Execution_Date'] = pd.to_datetime(df['Execution_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_Date'] = pd.to_datetime(df['Extract_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
This will only work when the file has 2 columns and the values are in DD-MM-YYYY format.
Looking for guidance on how to dynamically find the number of date columns and the date value format so that I can use them in my above 2 lines of code. If not, then any other solution would also work for me. I can use PowerShell if it can't be done in Python, but I am guessing there will be a lot more avenues in Python to do this than in PowerShell.
The following loads a CSV file into a dataframe, checks each value that is a str to see if it matches one of the date formats, and if it does, rearranges the date into the format you're looking for. Other values are untouched.
import pandas as pd
import re
df = pd.read_csv("today.csv")
# compiling the patterns ahead of time saves a lot of processing power later
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")
d_m_y_replace = r"\2-\1-\3"
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")
y_m_d_replace = r"\2-\3-\1"
def change_dt(value):
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value
new_df = df.applymap(change_dt)
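Note that on pandas 2.1+, DataFrame.applymap is deprecated in favour of the element-wise DataFrame.map, so the last line becomes:
new_df = df.map(change_dt)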
However, if there are other columns containing dates that you don't want to change, and you just want to specify the columns to be altered, use this instead of the last line above:
cols = ["Execution_date", "Extract_date"]
for col in cols:
    if col in df.columns:
        df[col] = df[col].apply(change_dt)
You can convert the columns to datetimes if you wish.
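For example, once change_dt has rewritten the strings to MM-DD-YYYY (a sketch using the question's column names):
df["Execution_date"] = pd.to_datetime(df["Execution_date"], format="%m-%d-%Y")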
You can use a function that checks all column names containing "date" and uses .fillna to try other formats (add all the formats you expect).
import pandas as pd
def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")
    return df
data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])
data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])
final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)
Output:
Execution_date Extract_date Requestor_Name Count
0 01-15-2023 01-15-2023 John Smith 7
Execution_date Requestor_Name Count
0 01-17-2023 Andrew Mill 3

Unable to convert comma separated integers and non-integer values to float in a series column in Python

Loading in the data:
In: import pandas as pd
In: df = pd.read_csv('name', sep=';', encoding='unicode_escape')
In: df.dtypes
Out: amount    object
I have an object column with amounts like 150,01 and 43,69. There are about 5,000 rows.
df['amount']
0 31
1 150,01
2 50
3 54,4
4 32,79
...
4950 25,5
4951 39,5
4952 75,56
4953 5,9
4954 43,69
Name: amount, Length: 4955, dtype: object
Naturally, I tried to convert the series using the locale module, which is supposed to turn it into a float format. I came back with the following error:
In: import locale
    locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
Out: 'en_US.UTF-8'
In: df['amount'].apply(locale.atof)
Out: ValueError: could not convert string to float: ' - '
Now that I'm aware there are non-numeric values in the list, I tried to use isnumeric methods to turn the non-numeric values into NaN.
Unfortunately, due to the comma-separated structure, all the values turned into -1.
0 -1
1 -1
2 -1
3 -1
4 -1
..
4950 -1
4951 -1
4952 -1
4953 -1
4954 -1
Name: amount, Length: 4955, dtype: int64
How do I turn the "," values into "." after first removing the "-" values? I tried .drop() and .truncate(), but they don't help. If I replace the "," with " ", that also causes trouble, since there are non-integer values.
Please help!
Documentation that I came across:
https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas
https://stackoverflow.com/questions/56315468/replace-comma-and-dot-in-pandas
p.s. This is my first post, please be kind
Sounds like you have a European-style CSV similar to the following. If your format is different, provide actual sample data, as many of the comments asked:
data.csv
thing;amount
thing1;31
thing2;150,01
thing3;50
thing4;54,4
thing5;1.500,22
To read it, specify the column, decimal and thousands separator as needed:
import pandas as pd
df = pd.read_csv('data.csv', sep=';', decimal=',', thousands='.')
print(df)
Output:
thing amount
0 thing1 31.00
1 thing2 150.01
2 thing3 50.00
3 thing4 54.40
4 thing5 1500.22
Posting as an answer since it contains multi-line code, despite not truly answering your question (yet):
Try using chardet. pip install chardet to get the package, then in your import block, add import chardet.
When importing the file, do something like:
with open("C:/path/to/file.csv", 'r') as f:
    data = f.read()
    result = chardet.detect(data.encode())
    charencode = result['encoding']
    # now re-set the handler to the beginning and re-read the file:
    f.seek(0, 0)
    data = pd.read_csv(f, delimiter=';', encoding=charencode)
Alternatively, for reasons I cannot fathom, passing engine='python' as a parameter works often. You'd just do
data = pd.read_csv('C:/path/to/file.csv', engine='python')
@Mark Tolonen has a more elegant approach to standardizing the actual data, but my (hacky) way of doing it was to just write a function:
def stripThousands(self, df_column):
    df_column.replace(',', '', regex=True, inplace=True)
    df_column = df_column.apply(pd.to_numeric, errors='coerce')
    return df_column
If you don't care about the entries that are just hyphens, you could use a function like
def screw_hyphens(self, column):
    column.replace(['-'], np.nan, inplace=True)
or if np.nan values will be a problem, you can just replace it with column.replace('-', '', inplace=True)
**EDIT: there was a typo in the block outlining the usage of chardet. It should be correct now (previously the end of the last line was encoding=charenc).
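Putting the pieces together, a minimal end-to-end cleanup sketch for the amount column (my own sketch, assuming comma decimals and stray ' - ' entries as shown in the question):
import numpy as np
import pandas as pd
s = pd.Series(['31', '150,01', ' - ', '54,4'])
# treat hyphen-only entries as missing, swap the decimal comma for a dot, then convert
cleaned = pd.to_numeric(
    s.str.strip().replace('-', np.nan).str.replace(',', '.', regex=False),
    errors='coerce'
)
# 0     31.00
# 1    150.01
# 2       NaN
# 3     54.40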

Output of column in Pandas dataframe from float to currency (negative values)

I have the following data frame (consisting of both negative and positive numbers):
df.head()
Out[39]:
Prices
0 -445.0
1 -2058.0
2 -954.0
3 -520.0
4 -730.0
I am trying to change the 'Prices' column to display as currency when I export it to an Excel spreadsheet. The following command I use works well:
df['Prices'] = df['Prices'].map("${:,.0f}".format)
df.head()
Out[42]:
Prices
0 $-445
1 $-2,058
2 $-954
3 $-520
4 $-730
Now my question is: what would I do if I wanted the output to have the negative sign BEFORE the dollar sign? In the output above, the dollar signs come before the negative signs. I am looking for something like this:
-$445
-$2,058
-$954
-$520
-$730
Please note there are also positive numbers as well.
You can use np.where to test whether the values are negative and, if so, prepend the negative sign in front of the dollar sign, casting the series to string using astype:
In [153]:
df['Prices'] = np.where( df['Prices'] < 0, '-$' + df['Prices'].astype(str).str[1:], '$' + df['Prices'].astype(str))
df['Prices']
Out[153]:
0 -$445.0
1 -$2058.0
2 -$954.0
3 -$520.0
4 -$730.0
Name: Prices, dtype: object
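Note that the astype(str) route above drops the thousands formatting from the original map. A plain formatting function (my own sketch, not part of the original answer) keeps it and handles the sign explicitly:
import pandas as pd
df = pd.DataFrame({'Prices': [-445.0, -2058.0, 954.0]})
# put the sign first, then format the absolute value with a thousands separator
df['Prices'] = df['Prices'].map(lambda x: ('-$' if x < 0 else '$') + format(abs(x), ',.0f'))
# 0      -$445
# 1    -$2,058
# 2       $954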
You can use the locale module and the _override_localeconv dict. It's not well documented, but it's a trick I found in another answer that has helped me before.
import pandas as pd
import locale
locale.setlocale( locale.LC_ALL, 'English_United States.1252')
# Made an assumption with that locale. Adjust as appropriate.
locale._override_localeconv = {'n_sign_posn': 1}  # 1 = sign precedes the value and currency symbol
# Load dataframe into df
df['Prices'] = df['Prices'].map(locale.currency)
This creates a dataframe that looks like this:
Prices
0 -$445.00
1 -$2058.00
2 -$954.00
3 -$520.00
4 -$730.00

Pandas DataFrame casting to timedelta fails with loc

I've got a little bit of a weird situation, and I don't understand why it works in one situation and not the other.
I'm trying to cast a column on a multiindex from timedelta64[ns] to timedelta64[s], and I also have a multiindex for rows.
If tuple is the column I want, (level_0, level_1):
it works with df[tuple] = df[tuple].astype('timedelta64[s]')
it doesn't work with df.loc[:, tuple] = df.loc[:, tuple].astype('timedelta64[s]')
Here is some sample data (csv):
Level_0,,,Respondent,Respondent,Respondent,OtherCat,OtherCat
Level_1,,,Something,StartDate,EndDate,Yes/No,SomethingElse
Region,Site,RespondentID,,,,,
Region_1,Site_1,3987227376,A,5/25/2015 10:59,5/25/2015 11:22,Yes,
Region_1,Site_1,3980680971,A,5/21/2015 9:40,5/21/2015 9:52,Yes,Yes
Region_1,Site_2,3977723249,A,5/20/2015 8:27,5/20/2015 8:41,Yes,
Region_1,Site_2,3977723089,A,5/20/2015 8:33,5/20/2015 9:09,Yes,No
Load it with:
In [1]: df = pd.read_csv('sample.csv', header=[0,1], index_col=[0,1,2])  # 'sample.csv' holds the CSV above
df
Out[1]:
I want to create a column "Duration" (and then one called "DurationMinutes" dividing Duration by 60).
I start by casting the dates to datetime:
In [2]:
df.loc[:,('Respondent','StartDate')] = pd.to_datetime(df.loc[:,('Respondent','StartDate')])
df.loc[:,('Respondent','EndDate')] = pd.to_datetime(df.loc[:,('Respondent','EndDate')])
df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','EndDate')] - df.loc[:,('Respondent','StartDate')]
This is where I no longer understand what's going on. I want to convert it to timedelta64[s] because I need that unit.
If I simply display the result of astype('timedelta64[s]'), it works like a charm:
In [3]: df.loc[:,('Respondent','Duration')].astype('timedelta64[s]')
Out[3]:
Region Site RespondentID
Region_1 Site_1 3987227376 1380
3980680971 720
Site_2 3977723249 840
3977723089 2160
Name: (Respondent, Duration), dtype: float64
But if I assign, then show the column, it fails:
In [4]: df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','Duration')].astype('timedelta64[s]')
df.loc[:,('Respondent','Duration')]
Out[4]:
Region Site RespondentID
Region_1 Site_1 3987227376 00:00:00.000001
3980680971 00:00:00.000000
Site_2 3977723249 00:00:00.000000
3977723089 00:00:00.000002
Name: (Respondent, Duration), dtype: timedelta64[ns]
Weirdly enough, if I do this, it will work:
In [5]: df[('Respondent','Duration')] = df[('Respondent','Duration')].astype('timedelta64[s]')
df.loc[:,('Respondent','Duration')]
Out[5]:
Region Site RespondentID
Region_1 Site_1 3987227376 1380
3980680971 720
Site_2 3977723249 840
3977723089 2160
Name: (Respondent, Duration), dtype: float64
Another strange thing, if I filter for one site, and drop the Region so that I end up with a single-level index, it works...:
In [6]:
Survey = 'Site_1'
df = df.xs(Survey, level='Site').copy()

# Drop the 'Region' from index
df.index = df.index.droplevel(level='Region')
df.loc[:,('Respondent','StartDate')] = pd.to_datetime(df.loc[:,('Respondent','StartDate')])
df.loc[:,('Respondent','EndDate')] = pd.to_datetime(df.loc[:,('Respondent','EndDate')])
df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','EndDate')] - df.loc[:,('Respondent','StartDate')]

# This works fine
df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','Duration')].astype('timedelta64[s]')

# Display
df.loc[:,('Respondent','Duration')]
Out[6]:
Out[6]:
RespondentID
3987227376 1380
3980680971 720
Name: (Respondent, Duration), dtype: float64
Clearly I'm missing something as to why df.loc[:,tuple] is different than df[tuple].
Can someone shed some light please?
Python 2.7.9, pandas 0.16.2
This was a bug, I just fixed it here, will be in 0.17.0.
The gist is this. When you do something like df.loc[:,column] = value, this is treated exactly the same as df[[column]] = value. This means that type coercion is independent of what the column WAS. Contrast this with df.loc[indexer,column], i.e. partially setting a column: here the new value AND the existing dtype of the column matter.
The bug was that when the frame has a multi-index, even though the multi-index was a full index (e.g. it encompassed the full length of values in the frame) it wasn't taking the correct path.
So the bottom line is that these cases should be (and will be) the same.
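A small illustration of the distinction (my own sketch on a recent pandas, not from the original answer):
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
df['a'] = df['a'].astype('float64')  # whole-column assignment: the column simply takes the new dtype
df.loc[0, 'a'] = 10                  # partial assignment: the value is coerced to the existing float64 dtype
print(df['a'].dtype)                 # float64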

Reformat a column containing dates in Pandas

Python newbie here who's switching from R to Python for statistical modeling and analysis.
I am working with a Pandas data structure and am trying to restructure a column that contains 'date' values. In the data below, you'll notice that some values take the 'Mar-10' format while others take a '12/1/13' format. How can I restructure a column in a Pandas data structure that contains 'dates' (technically not a date structure) so that they are uniform? I'd prefer that they all follow the 'Mar-10' format. Can anyone help?
In [34]: dat["Date"].unique()
Out[34]:
array(['Jan-10', 'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10',
'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10',
'Jan-11', 'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11',
'Jul-11', 'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11',
'Jan-12', 'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12',
'Jul-12', 'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12',
'Jan-13', 'Feb-13', 'Mar-13', 'Apr-13', 'May-13', '6/1/13',
'7/1/13', '8/1/13', '9/1/13', '10/1/13', '11/1/13', '12/1/13',
'1/1/14', '2/1/14', '3/1/14', '4/1/14', '5/1/14', '6/1/14',
'7/1/14', '8/1/14'], dtype=object)
In [35]: isinstance(dat["Date"], basestring) # not a string?
Out[35]: False
In [36]: type(dat["Date"]).__name__
Out[36]: 'Series'
I think your dates are already strings, try:
import numpy as np
import pandas as pd
date = pd.Series(np.array(['Jan-10', 'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10',
'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10',
'Jan-11', 'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11',
'Jul-11', 'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11',
'Jan-12', 'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12',
'Jul-12', 'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12',
'Jan-13', 'Feb-13', 'Mar-13', 'Apr-13', 'May-13', '6/1/13',
'7/1/13', '8/1/13', '9/1/13', '10/1/13', '11/1/13', '12/1/13',
'1/1/14', '2/1/14', '3/1/14', '4/1/14', '5/1/14', '6/1/14',
'7/1/14', '8/1/14'], dtype=object))
date.map(type).value_counts()
# date contains 56 strings
# <type 'str'>    56
# dtype: int64
This shows the type of each individual element, rather than the type of the column that contains them.
Your best bet for dealing sensibly with them is to convert them into pandas DateTime objects:
pd.to_datetime(date)
Out[18]:
0 2014-01-10
1 2014-02-10
2 2014-03-10
3 2014-04-10
4 2014-05-10
5 2014-06-10
6 2014-07-10
7 2014-08-10
8 2014-09-10
...
You may have to play around with the formats somewhat, e.g. creating two separate arrays for each format and then merging them back together:
# Convert the Aug-10 style strings
pd.to_datetime(date, format='%b-%y', coerce=True)
# Convert the 9/1/13 style strings
pd.to_datetime(date, format='%m/%d/%y', coerce=True)
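On newer pandas, coerce=True is spelled errors='coerce', and the two passes can then be merged back together with fillna; a sketch using the date series defined above:
monthly = pd.to_datetime(date, format='%b-%y', errors='coerce')   # parses the Aug-10 style, NaT elsewhere
daily = pd.to_datetime(date, format='%m/%d/%y', errors='coerce')  # parses the 9/1/13 style, NaT elsewhere
combined = monthly.fillna(daily)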
I can never remember these time formatting codes off the top of my head but there's a good rundown of them here.
