I have a dataframe with a column called 'CBG' that contains numbers stored as strings.
CBG acs_total_persons acs_total_housing_units
0 010010211001 1925 1013
1 010030114011 2668 1303
2 010070100043 930 532
When I write it to a CSV file, the leading '0's are removed:
combine_acs_merge.to_csv(new_out_csv, sep=',')
>>> CBG: [0: 10010221101, ...]
It's already a string; how can I keep the leading zeros from being removed in the .csv file?
Let's take an example:
Below is your example DataFrame:
>>> df
col1 num
0 One 011
1 two 0123
2 three 0122
3 four 0333
Suppose num is an int column, which you can convert to str:
>>> df["num"] = df["num"].astype(str)
>>> df.to_csv("datasheet.csv")
Output: you will find the leading zeros are intact.
$ cat datasheet.csv
,col1,num
0,One,011
1,two,0123
2,three,0122
3,four,0333
Or, if you are reading the data from a CSV first, then use the below:
pd.read_csv('test.csv', dtype=str)
However, if your CBG column is already str, then it should be straightforward:
>>> df = pd.DataFrame({'CBG': ["010010211001", "010030114011", "010070100043"],
... 'acs_total_persons': [1925, 2668, 930],
... 'acs_total_housing_units': [1013, 1303, 532]})
>>>
>>> df
CBG acs_total_housing_units acs_total_persons
0 010010211001 1013 1925
1 010030114011 1303 2668
2 010070100043 532 930
>>> df.to_csv("CBG.csv")
result:
$ cat CBG.csv
,CBG,acs_total_housing_units,acs_total_persons
0,010010211001,1013,1925
1,010030114011,1303,2668
2,010070100043,532,930
Pandas doesn't strip padded zeros. You're likely seeing this when opening in Excel. Open the csv in a text editor like notepad++ and you'll see they're still zero padded.
When reading a CSV file pandas tries to convert values in every column to some data type as it sees fit. If it sees a column which contains only digits it will set the dtype of this column to int64. This converts "010010211001" to 10010211001.
If you don't want any data type conversions to happen specify dtype=str when reading in the CSV file.
As per pandas documentation for read_csv https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html:
dtype : Type name or dict of column -> type, optional
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object
    together with suitable na_values settings to preserve and not interpret dtype. If
    converters are specified, they will be applied INSTEAD of dtype conversion.
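As a minimal sketch of the difference (CSV text below stands in for a file on disk; values are taken from the question):

```python
import io

import pandas as pd

csv_data = "CBG,acs_total_persons\n010010211001,1925\n010030114011,2668\n"

# Default inference turns the all-digit CBG column into int64, dropping zeros
df_int = pd.read_csv(io.StringIO(csv_data))
print(df_int["CBG"].iloc[0])   # 10010211001

# dtype=str (or a per-column dict) keeps the values exactly as written
df_str = pd.read_csv(io.StringIO(csv_data), dtype={"CBG": str})
print(df_str["CBG"].iloc[0])   # 010010211001
```

Passing a dict like `dtype={"CBG": str}` lets the other columns still be inferred as numeric.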
Loading in the data
in: import pandas as pd
in: df = pd.read_csv('name', sep = ';', encoding='unicode_escape')
in: df.dtypes
out: amount object
I have an object column with amounts like 150,01 and 43,69. There are about 5,000 rows.
df['amount']
0 31
1 150,01
2 50
3 54,4
4 32,79
...
4950 25,5
4951 39,5
4952 75,56
4953 5,9
4954 43,69
Name: amount, Length: 4955, dtype: object
Naturally, I tried to convert the series using the locale format, which is supposed to turn it into a float. I came back with the following error:
In: import locale
In: locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
Out: 'en_US.UTF-8'
In: df['amount'].apply(locale.atof)
Out: ValueError: could not convert string to float: ' - '
Now that I'm aware that there are non-numeric values in the list, I tried to use isnumeric methods to turn the non-numeric values to become NaN.
Unfortunately, due to the comma separated structure, all the values would turn into -1.
0 -1
1 -1
2 -1
3 -1
4 -1
..
4950 -1
4951 -1
4952 -1
4953 -1
4954 -1
Name: amount, Length: 4955, dtype: int64
How do I turn the "," values into "." after first removing the "-" values? I tried .drop() and .truncate(), but they do not help. If I replace "," with " ", it also causes trouble, since there are non-integer values.
Please help!
Documentation that I came across:
- https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas
- https://stackoverflow.com/questions/56315468/replace-comma-and-dot-in-pandas
p.s. This is my first post, please be kind
Sounds like you have a European-style CSV similar to the following. If your format is different, provide actual sample data, as many comments asked:
data.csv
thing;amount
thing1;31
thing2;150,01
thing3;50
thing4;54,4
thing5;1.500,22
To read it, specify the column, decimal and thousands separator as needed:
import pandas as pd
df = pd.read_csv('data.csv',sep=';',decimal=',',thousands='.')
print(df)
Output:
thing amount
0 thing1 31.00
1 thing2 150.01
2 thing3 50.00
3 thing4 54.40
4 thing5 1500.22
Posting as an answer since it contains multi-line code, despite not truly answering your question (yet):
Try using chardet. pip install chardet to get the package, then in your import block, add import chardet.
When importing the file, do something like:
with open("C:/path/to/file.csv", 'rb') as f:
    data = f.read()
    result = chardet.detect(data)
    charencode = result['encoding']
    # now re-set the handle to the beginning and re-read the file:
    f.seek(0, 0)
    data = pd.read_csv(f, delimiter=';', encoding=charencode)
Alternatively, for reasons I cannot fathom, passing engine='python' as a parameter often works. You'd just do:
data = pd.read_csv('C:/path/to/file.csv', engine='python')
@Mark Tolonen has a more elegant approach to standardizing the actual data, but my (hacky) way of doing it was to just write a function:
def stripThousands(df_column):
    df_column.replace(',', '', regex=True, inplace=True)
    df_column = df_column.apply(pd.to_numeric, errors='coerce')
    return df_column
If you don't care about the entries that are just hyphens, you could use a function like
def screw_hyphens(column):
    column.replace(['-'], np.nan, inplace=True)
or if np.nan values will be a problem, you can just replace it with column.replace('-', '', inplace=True)
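Putting the two steps together, here is a sketch on sample values from the question (the " - " entry stands in for the non-numeric rows; treating the decimal comma with str.replace is a substitute for locale.atof):

```python
import numpy as np
import pandas as pd

s = pd.Series(["31", "150,01", " - ", "54,4"])

# Strip whitespace, turn hyphen-only entries into NaN,
# then swap the decimal comma for a dot before converting
cleaned = s.str.strip().replace("-", np.nan)
amounts = pd.to_numeric(cleaned.str.replace(",", ".", regex=False),
                        errors="coerce")
print(amounts.tolist())  # [31.0, 150.01, nan, 54.4]
```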
**EDIT:** there was a typo in the block outlining the usage of chardet; it should be correct now (previously the end of the last line was encoding=charenc).
I would like to convert negative value strings and strings with commas in a df to float, but I am struggling to do both operations at the same time:
customer_id Revenue
332 1,293.00
293 -485
4284 1,373.80
284 -327
Desired output:
332 1293.00
293 485
4284 1373.80
284 327
Convert to numeric and then take the absolute value:
df["Revenue"] = pd.to_numeric(df["Revenue"]).abs()
If the above doesn't work, then try:
df["Revenue"] = pd.to_numeric(df["Revenue"].str.strip().str.replace(",", "")).abs()
Here I first make a call to str.strip() to remove any whitespace in your float. Then, I remove commas using str.replace().
Does using .str.replace() help?
df["Revenue"] = pd.to_numeric(df["Revenue"].str.replace(',', '')).abs()
If you are getting the DataFrame from a csv file, you can use the following at import to address the commas, and then deal with the - later:
df = pd.read_csv('foo.csv', thousands=',')
df["Revenue"] = pd.to_numeric(df["Revenue"]).abs()
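A runnable sketch of that idea, with io.StringIO standing in for the hypothetical foo.csv:

```python
import io

import pandas as pd

csv_data = 'customer_id,Revenue\n332,"1,293.00"\n293,-485\n4284,"1,373.80"\n284,-327\n'

# thousands=',' strips the grouping commas during parsing,
# so Revenue comes out numeric straight away
df = pd.read_csv(io.StringIO(csv_data), thousands=",")
df["Revenue"] = pd.to_numeric(df["Revenue"]).abs()
print(df["Revenue"].tolist())  # [1293.0, 485.0, 1373.8, 327.0]
```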
Given the following dataframe:
DF = pd.DataFrame({'COL1': ['A', 'B', 'C', 'D','D','D'],
'mixed': [2016.0, 2017.0, 'sweatervest', 20, 209, 21]})
DF
COL1 mixed
0 A 2016.0
1 B 2017.0
2 C sweatervest
3 D 20
4 D 209
5 D 21
I want to convert 'mixed' to an object such that all numbers are integers as strings and all strings remain, of course, strings.
The desired output is as follows:
COL1 mixed
0 A 2016
1 B 2017
2 C sweatervest
3 D 20
4 D 209
5 D 21
Background info:
Originally, 'mixed' was part of a data frame taken from a CSV that mainly consisted of numbers, with some strings here and there. When I tried converting it to string, some numbers ended up with '.0' at the end.
Try:
DF['mixed'] = DF.mixed.astype(object)
this results in:
DF['mixed']
0 2016
1 2017
2 sweatervest
3 20
4 209
5 21
Name: mixed, dtype: object
df.mixed = df.mixed.apply(lambda elt: str(int(elt)) if isinstance(elt, float) else str(elt))
This calls the lambda elt: str(int(elt)) if isinstance(elt, float) else str(elt) function over each element of the 'mixed' column.
Note: This assumes that all of your floats are convertible to integers, as you implied in your comments on your question.
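For instance, with the example frame from the question (trimmed to four rows), the lambda would behave like this:

```python
import pandas as pd

df = pd.DataFrame({"COL1": ["A", "B", "C", "D"],
                   "mixed": [2016.0, 2017.0, "sweatervest", 20]})

# Floats become zero-padded-free integer strings; everything else is str()'d
df["mixed"] = df["mixed"].apply(
    lambda elt: str(int(elt)) if isinstance(elt, float) else str(elt)
)
print(df["mixed"].tolist())  # ['2016', '2017', 'sweatervest', '20']
```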
This approach builds upon the answer by gbrener. It iterates over a dataframe looking for mixed dtype columns. For each such mixed column, it first replaces all nan values with pd.NA. It then safely converts its values to strings. It is usable in-place as unmix_dtypes(df). It was tested with Pandas 1 under Python 3.8.
Note that this answer uses assignment expressions which work only with Python 3.8 or newer. It can however trivially be modified to not use them.
from typing import Union

import pandas as pd


def _to_str(val: Union[type(pd.NA), float, int, str]) -> Union[type(pd.NA), str]:
    """Return a string representation of the given integer, rounded float, or otherwise a string.

    `pd.NA` values are returned as is.

    It can be useful to call `df[col].fillna(value=pd.NA, inplace=True)` before calling this function.
    """
    if val is pd.NA:
        return val
    if isinstance(val, float) and (val % 1 == 0.0):
        return str(int(val))
    if isinstance(val, int):
        return str(val)
    assert isinstance(val, str)
    return val


def unmix_dtypes(df: pd.DataFrame) -> None:
    """Convert mixed dtype columns in the given dataframe to strings.

    Ref: https://stackoverflow.com/a/61826020/
    """
    for col in df.columns:
        if not (orig_dtype := pd.api.types.infer_dtype(df[col])).startswith("mixed"):
            continue
        df[col].fillna(value=pd.NA, inplace=True)
        df[col] = df[col].apply(_to_str)
        if (new_dtype := pd.api.types.infer_dtype(df[col])).startswith("mixed"):
            raise TypeError(f"Unable to convert {col} to a non-mixed dtype. Its previous dtype was {orig_dtype} and new dtype is {new_dtype}.")
Caution: One of the dangers of not specifying an explicit dtype, however, is that a column such as ["012", "0034", "4"] can be read by pd.read_csv as an integer column, thereby irrecoverably losing the leading zeros. What's worse is that if dataframes are concatenated, such a loss of the leading zeros can happen inconsistently, leading to column values such as ["012", "12", "34", "0034"].
Python newbie here who's switching from R to Python for statistical modeling and analysis.
I am working with a Pandas data structure and am trying to restructure a column that contains 'date' values. In the data below, you'll notice that some values take the 'Mar-10' format which others take a '12/1/13' format. How can I restructure a column in a Pandas data structure that contains 'dates' (technically not a date structure) so that they are uniform (contain the same structure). I'd prefer that they all follow the 'Mar-10' format. Can anyone help?
In [34]: dat["Date"].unique()
Out[34]:
array(['Jan-10', 'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10',
'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10',
'Jan-11', 'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11',
'Jul-11', 'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11',
'Jan-12', 'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12',
'Jul-12', 'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12',
'Jan-13', 'Feb-13', 'Mar-13', 'Apr-13', 'May-13', '6/1/13',
'7/1/13', '8/1/13', '9/1/13', '10/1/13', '11/1/13', '12/1/13',
'1/1/14', '2/1/14', '3/1/14', '4/1/14', '5/1/14', '6/1/14',
'7/1/14', '8/1/14'], dtype=object)
In [35]: isinstance(dat["Date"], basestring) # not a string?
Out[35]: False
In [36]: type(dat["Date"]).__name__
Out[36]: 'Series'
I think your dates are already strings, try:
import numpy as np
import pandas as pd
date = pd.Series(np.array(['Jan-10', 'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10',
'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10',
'Jan-11', 'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11',
'Jul-11', 'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11',
'Jan-12', 'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12',
'Jul-12', 'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12',
'Jan-13', 'Feb-13', 'Mar-13', 'Apr-13', 'May-13', '6/1/13',
'7/1/13', '8/1/13', '9/1/13', '10/1/13', '11/1/13', '12/1/13',
'1/1/14', '2/1/14', '3/1/14', '4/1/14', '5/1/14', '6/1/14',
'7/1/14', '8/1/14'], dtype=object))
date.map(type).value_counts()
# date contains 56 strings
# <type 'str'> 56
# dtype: int64
This shows the type of each individual element, rather than the type of the column they're contained in.
Your best bet for dealing sensibly with them is to convert them into pandas DateTime objects:
pd.to_datetime(date)
Out[18]:
0 2014-01-10
1 2014-02-10
2 2014-03-10
3 2014-04-10
4 2014-05-10
5 2014-06-10
6 2014-07-10
7 2014-08-10
8 2014-09-10
...
You may have to play around with the formats somewhat, e.g. creating two separate arrays
for each format and then merging them back together:
# Convert the Aug-10 style strings
pd.to_datetime(date, format='%b-%y', errors='coerce')
# Convert the 9/1/13 style strings
pd.to_datetime(date, format='%m/%d/%y', errors='coerce')
I can never remember these time formatting codes off the top of my head but there's a good rundown of them here.
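One way to merge the two passes, sketched on a few values from the question: parse each format separately with errors='coerce', fill the NaT gaps from one result with the other, then render everything back in the 'Mar-10' style the asker prefers:

```python
import pandas as pd

date = pd.Series(["Mar-10", "May-13", "6/1/13", "12/1/13"])

# Parse each format separately; entries that don't match become NaT
a = pd.to_datetime(date, format="%b-%y", errors="coerce")
b = pd.to_datetime(date, format="%m/%d/%y", errors="coerce")

# Take a where it parsed, fall back to b elsewhere
combined = a.fillna(b)

# Render everything uniformly in the 'Mar-10' style
uniform = combined.dt.strftime("%b-%y")
print(uniform.tolist())  # ['Mar-10', 'May-13', 'Jun-13', 'Dec-13']
```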
How do I get the Units column to numeric?
I have a Google spreadsheet that I am reading in. The date column gets converted fine, but I'm not having much luck getting the Unit Sales column to convert to numeric. I'm including all the code, which uses requests to get the data:
from StringIO import StringIO
import requests
#act = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak_wF7ZGeMmHdFZtQjI1a1hhUWR2UExCa2E4MFhiWWc&output=csv&gid=1')
dataact = act.content
actdf = pd.read_csv(StringIO(dataact),index_col=0,parse_dates=['date'])
actdf.rename(columns={'Unit Sales': 'Units'}, inplace=True)  # in case the space in the name is messing me up
The different methods I have tried to get Units to numeric:
actdf=actdf['Units'].convert_objects(convert_numeric=True)
#actdf=actdf['Units'].astype('float32')
Then I want to resample, but I'm getting strange string concatenations since the numbers are still strings:
#actdfq=actdf.resample('Q',sum)
#actdfq.head()
actdf.head()
#actdf
So the df looks like this, with just Units and the date index:
date
2013-09-01 3,533
2013-08-01 4,226
2013-07-01 4,281
Name: Units, Length: 161, dtype: object
You have to specify the thousands separator:
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=['date'], thousands=',')
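A self-contained sketch of the same call, with io.StringIO standing in for the requests response (sample rows taken from the question):

```python
import io

import pandas as pd

csv_data = 'date,Units\n2013-09-01,"3,533"\n2013-08-01,"4,226"\n2013-07-01,"4,281"\n'

# thousands=',' makes pandas parse "3,533" as the integer 3533
actdf = pd.read_csv(io.StringIO(csv_data), index_col=0,
                    parse_dates=["date"], thousands=",")
print(actdf["Units"].tolist())  # [3533, 4226, 4281]
```

With the commas handled at parse time, the later resample('Q') call sums numbers instead of concatenating strings.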
This will work
In [13]: s
Out[13]:
0 4,223
1 3,123
dtype: object
In [14]: pd.to_numeric(s.str.replace(',', ''))
Out[14]:
0 4223
1 3123
dtype: int64