I would like to convert negative value strings and strings with commas to float
df. But I am struggling to do both operations at the same time
customer_id Revenue
332 1,293.00
293 -485
4284 1,373.80
284 -327
Output_df
332 1293.00
293 485
4284 1373.80
284 327
Convert to numeric and then take the absolute value:
df["Revenue"] = pd.to_numeric(df["Revenue"]).abs()
If the above doesn't work, then try:
df["Revenue"] = pd.to_numeric(df["Revenue"].str.strip().str.replace(",", "")).abs()
Here I first make a call to str.strip() to remove any whitespace in your float. Then, I remove commas using str.replace().
Does using .str.replace() help?
df["Revenue"] = pd.to_numeric(df["Revenue"].str.replace(',','').abs()
If you are getting the DataFrame from a csv file, you can use the following at import to address the commas, and then deal with the - later:
df.read_csv ('foo.csv', thousands=',')
df["Revenue"] = pd.to_numeric(df["Revenue"]).abs()
Related
Loading in the data
in: import pandas as pd
in: df = pd.read_csv('name', sep = ';', encoding='unicode_escape')
in : df.dtypes
out: amount object
I have an object column with amounts like 150,01 and 43,69. Thee are about 5,000 rows.
df['amount']
0 31
1 150,01
2 50
3 54,4
4 32,79
...
4950 25,5
4951 39,5
4952 75,56
4953 5,9
4954 43,69
Name: amount, Length: 4955, dtype: object
Naturally, I tried to convert the series into the locale format, which suppose to turn it into a float format. I came back with the following error:
In: import locale
setlocale(LC_NUMERIC, 'en_US.UTF-8')
Out: 'en_US.UTF-8'
In: df['amount'].apply(locale.atof)
Out: ValueError: could not convert string to float: ' - '
Now that I'm aware that there are non-numeric values in the list, I tried to use isnumeric methods to turn the non-numeric values to become NaN.
Unfortunately, due to the comma separated structure, all the values would turn into -1.
0 -1
1 -1
2 -1
3 -1
4 -1
..
4950 -1
4951 -1
4952 -1
4953 -1
4954 -1
Name: amount, Length: 4955, dtype: int64
How do I turn the "," values to "." by first removing the "-" values? I tried .drop() or .truncate it does not help. If I replace the str",", " ", it would also cause trouble since there is a non-integer value.
Please help!
Documentation that I came across
-https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas
-https://stackoverflow.com/questions/56315468/replace-comma-and-dot-in-pandas
p.s. This is my first post, please be kind
Sounds like you have a European-style CSV similar to the following. Provide actual sample data as many comments asked for if your format is different:
data.csv
thing;amount
thing1;31
thing2;150,01
thing3;50
thing4;54,4
thing5;1.500,22
To read it, specify the column, decimal and thousands separator as needed:
import pandas as pd
df = pd.read_csv('data.csv',sep=';',decimal=',',thousands='.')
print(df)
Output:
thing amount
0 thing1 31.00
1 thing2 150.01
2 thing3 50.00
3 thing4 54.40
4 thing5 1500.22
Posting as an answer since it contains multi-line code, despite not truly answering your question (yet):
Try using chardet. pip install chardet to get the package, then in your import block, add import chardet.
When importing the file, do something like:
with open("C:/path/to/file.csv", 'r') as f:
data = f.read()
result = chardet.detect(data.encode())
charencode = result['encoding']
# now re-set the handler to the beginning and re-read the file:
f.seek(0, 0)
data = pd.read_csv(f, delimiter=';', encoding=charencode)
Alternatively, for reasons I cannot fathom, passing engine='python' as a parameter works often. You'd just do
data = pd.read_csv('C:/path/to/file.csv', engine='python')
#Mark Tolonen has a more elegant approach to standardizing the actual data, but my (hacky) way of doing it was to just write a function:
def stripThousands(self, df_column):
df_column.replace(',', '', regex=True, inplace=True)
df_column = df_column.apply(pd.to_numeric, errors='coerce')
return df_column
If you don't care about the entries that are just hyphens, you could use a function like
def screw_hyphens(self, column):
column.replace(['-'], np.nan, inplace=True)
or if np.nan values will be a problem, you can just replace it with column.replace('-', '', inplace=True)
**EDIT: there was a typo in the block outlining the usage of chardet. it should be correct now (previously the end of the last line was encoding=charenc)
I am trying to split the following column using Pandas: (df name is count)
Location count
POINT (-118.05425 34.1341) 355
POINT (-118.244512 34.072581) 337
POINT (-118.265586 34.043271) 284
POINT (-118.360102 34.071338) 269
POINT (-118.40816 33.943626) 241
to this desired outcome:
X-Axis Y-Axis count
-118.05425 34.1341 355
-118.244512 34.072581 337
-118.265586 34.043271 284
-118.360102 34.071338 269
-118.40816 33.943626 241
I have tried removing the word 'POINT', and both the brackets. But then I am met with an extra white space at the beginning of the column. I tried using:
count.columns = count.columns.str.lstrip()
But it was not removing the white space.
I was hoping to use this code to split the column:
count = pd.DataFrame(count.Location.str.split(' ',1).tolist(),
columns = ['x-axis','y-axis'])
Since the space between both x and y axis could be used as the separator, but the white space.
You can use .str.extract with regex pattern having capture groups:
df[['x-axis', 'y-axis']] = df.pop('Location').str.extract(r'\((\S+) (\S+)\)')
print(df)
count x-axis y-axis
0 355 -118.05425 34.1341
1 337 -118.244512 34.072581
2 284 -118.265586 34.043271
3 269 -118.360102 34.071338
4 241 -118.40816 33.943626
a quick solution can be:
(df['Location']
.str.split(' ', 1) # like what you did,
.str[-1] # select only lat,lon
.str.strip('(') # remove open curly bracket
.str.strip(')') # remove close curly bracket
.str.split(' ', expand=True)) # expand to two columns
then you may rename column names using .rename or df.columns = colnames
I'm trying to covert "Quantity" column to int.
The quantity column has a string(,) divider or separator for the numerical values
using code
data['Quantity'] = data['Quantity'].astype('int')
data['Quantity'] = data['Quantity'].astype('float')
I am getting this error:
ValueError: could not convert string to float: '16,000'
ValueError: invalid literal for int() with base 10: '16,000'
Data
Date Quantity
2019-06-25 200
2019-03-30 100
2019-11-02 250
2018-10-23 100
2018-07-17 150
2018-05-31 150
2018-07-05 100
2018-10-04 100
2018-02-23 100
2019-09-16 204
2019-09-16 315
2019-11-09 113
2019-08-29 5
2019-08-23 4
2019-06-18 78
2019-12-06 4
2019-12-06 2
2019-10-03 16,000
2019-07-03 8,000
2018-12-12 32
Name: Quantity, dtype: object
It's a pandas dataframe with 124964 rows. I added the head and tail of the data
What can I do to fix this problem?
Solution
# Replace string "," with ""
data["Quantity"] = data["Quantity"].apply(lambda x: str(x.replace(',','')))
data['Quantity'] = data['Quantity'].astype('float')
'16,000' is neither a valid representation of an int or float, and actually the format is ambiguous - depending on locale standard, it could mean either 16.0 (float) or 16000 (int).
You first need to specify how this data should be interpreted, then fix the string so that it's a valid representation of either a float or int, then apply asType() with the correct type.
To make '16,000' a valid float representation, you just have to replace the comma with a dot:
val = '16,000'
val = val.replace(",", ".")
To make it an int (with value 16000) you just remove the comma:
val = '16,000'
val = val.replace(",", "")
I don't use panda so I can't tell how to best do this with a dataframe, but this is surely documented.
As a general rule: when working on data coming from the wild outside world (anything outside your own code), never trust the data, always make sure you validate and sanitize it before use.
number = '16,000'
act_num = ''
for char in number:
try:
character = int(char)
act_num+=(char)
except:
if char == '-' or char == '.':
act_num+= (char)
print(float(act_num))
data.Quantity = data.Quantity.astype(str).astype(int)
I have a dataframe that has a column called 'CBG' with numbers as a string value.
CBG acs_total_persons acs_total_housing_units
0 010010211001 1925 1013
1 010030114011 2668 1303
2 010070100043 930 532
When I write it to a csv file, the leading 'O' are removed:
combine_acs_merge.to_csv(new_out_csv, sep=',')
>>> CBG: [0: 10010221101, ...]
It's already a string; how can I keep the leading zero from being removed in the .csv file
Lets take an example:
Below is your example DataFrame:
>>> df
col1 num
0 One 011
1 two 0123
2 three 0122
3 four 0333
Considering the num as an int which you can convert to str().
>>> df["num"] = df["num"].astype(str)
>>> df.to_csv("datasheet.csv")
Output:
$ cat datasheet.csv
You will find the leading zeros are intacted..
,col1,num
0,One,011
1,two,0123
2,three,0122
3,four,0333
OR, if you reading the data from csv first then use belwo..
pd.read_csv('test.csv', dtype=str)
However, if your column CBG already str then it should be straight forward..
>>> df = pd.DataFrame({'CBG': ["010010211001", "010030114011", "010070100043"],
... 'acs_total_persons': [1925, 2668, 930],
... 'acs_total_housing_units': [1013, 1303, 532]})
>>>
>>> df
CBG acs_total_housing_units acs_total_persons
0 010010211001 1013 1925
1 010030114011 1303 2668
2 010070100043 532 930
>>> df.to_csv("CBG.csv")
result:
$ cat CBG.csv
,CBG,acs_total_housing_units,acs_total_persons
0,010010211001,1013,1925
1,010030114011,1303,2668
2,010070100043,532,930
Pandas doesn't strip padded zeros. You're liking seeing this when opening in Excel. Open the csv in a text editor like notepad++ and you'll see they're still zero padded.
When reading a CSV file pandas tries to convert values in every column to some data type as it sees fit. If it sees a column which contains only digits it will set the dtype of this column to int64. This converts "010010211001" to 10010211001.
If you don't want any data type conversions to happen specify dtype=str when reading in the CSV file.
As per pandas documentation for read_csv https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object
together with suitable na_values settings to preserve and not interpret dtype. If
converters are specified, they will be applied INSTEAD of dtype conversion.
I have a weird interaction that I would need help with. Basically :
1)
I have created a pandas dataframe that containts 1179 rows x 6 columns. One column is street names and the same value will have several duplicates (because each line represents a point, and each point is associated with a street).
2)
I also have a list of all the streets in this panda dataframe.
3)If I run this line, I get an output of all the rows matching that street name:
print(sub_df[sub_df.AQROUTES_3=='AvenueMermoz'])
Result :
FID AQROUTES_3 ... BEARING E_ID
983 983 AvenueMermoz ... 288.058014
984 984 AvenueMermoz ... 288.058014
992 992 AvenueMermoz ... 288.058014
1005 1005 AvenueMermoz ... 288.058014
1038 1038 AvenueMermoz ... 288.058014
1019 1019 AvenueMermoz ... 288.058014
However, if I run this command in a loop with the string of my list as the street name, it returns an empty dataframe :
x=()
for names in pd_streetlist:
print(names)
x=names
print(sub_df[sub_df.AQROUTES_3 =="'"+str(x)+"'"])
x=()
Returns :
RangSaint_Joseph
Empty DataFrame
Columns: [FID, AQROUTES_3, X, Y, BEARING, E_ID]
Index: []
AvenueAugustin
Empty DataFrame
Columns: [FID, AQROUTES_3, X, Y, BEARING, E_ID]
Index: []
and so on...
I can't figure out why. Anybody has an idea?
Thanks
I believe the issue is in this line:
print(sub_df[sub_df.AQROUTES_3 =="'"+str(x)+"'"])
To each names you add unnecessarily quote characters at the beginning and at the end so that each valid name of the street (in your example 'AvenueMermoz' turns into "'AvenueMermoz'" where we had to use double quotes to enclose single-quoted string).
As #busybear has commented - there is no need to cast to str either. So, the corrected line would be:
print(sub_df[sub_df.AQROUTES_3 == x])
So youre adding quotation marks to the filter which you shouldnt. now youre filtering on 'AvenueMermoz' while you just want to filter on AvenueMermoz .
so
print(sub_df[sub_df.AQROUTES_3 =="'"+str(x)+"'"])
should become
print(sub_df[sub_df.AQROUTES_3 ==str(x)])