Why will this string not convert to float? - python

Column 'Amount' is a string. I want to change it to float so that I can input these rows into a later calculation.
In [1] import pandas as pd
data = pd.read_csv('input.csv')
data
Out [1]
ID Amount Cost
0 A 9,596,249.09 1000000
1 B 38,385,668.57 50000
2 C 351,740.00 100
3 D - 23
4 E 178,255.96 999
Note that 'D' has an Amount of ' - ' rather than zero.
First I clean up the bad data:
In [2]
data['Amount'] = data['Amount'].replace(' - ', 0)
data
Out [2]
ID Amount Cost
0 A 9,596,249.09 1000000
1 B 38,385,668.57 50000
2 C 351,740.00 100
3 D 0 23
4 E 178,255.96 999
Then I try to convert to float using 2 methods. Both unsuccessful:
In [3]
pd.Series(data['Amount']).astype(float)
Out [3]
ValueError: could not convert string to float: '9,596,249.09'
and:
In [4]
pd.to_numeric(data['Amount'])
Out [4]
ValueError: Unable to parse string "9,596,249.09" at position 0
In my desperation I attempt to loop through the rows:
In [5]
def cleandata(x):
    return float(x)
data['Amount'] = data['Amount'].apply(cleandata)
Out [5]
ValueError: could not convert string to float: '9,596,249.09'
Appreciate any advice you could give. I have tried for hours. Thank you.

Try letting read_csv strip the thousands separators as it parses:
data = pd.read_csv('input.csv', thousands=',', decimal='.')
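A fuller sketch of this approach (assuming the missing-amount placeholder is exactly the string ' - '; adjust na_values to match your file):
import pandas as pd
# thousands=',' removes the comma separators during parsing;
# na_values turns the placeholder into NaN so 'Amount' parses as float
data = pd.read_csv('input.csv', thousands=',', na_values=[' - ', '-'])
data['Amount'] = data['Amount'].fillna(0)  # treat missing amounts as zero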

You should get rid of the commas; that should fix the problem. Try this (replace the ' - ' placeholder first, so astype does not choke on it):
data['Amount'] = data['Amount'].replace(' - ', '0').str.replace(",", "")  # take the commas away
data['Amount'] = data['Amount'].astype(float)

Creating a list (y) seems to work.
In [1]:
import pandas as pd
data = pd.read_csv('input.csv')
y = list(data["Amount"])
y = [item.replace(" - " , '0') for item in y]
y = [item.replace("," , '') for item in y]
data["Amount"] = y
data["Amount"] = pd.to_numeric(data['Amount'], errors='coerce')
data['Result'] = data["Amount"] - data["Cost"]
data
Out [1]:
ID Amount Cost Result
0 A 9596249.09 1000000 8596249.09
1 B 38385668.57 50000 38335668.57
2 C 351740.00 100 351640.00
3 D 0.00 23 -23.00
4 E 178255.96 999 177256.96
There is certainly a better and more Pythonic way to write this, I'm sure.
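For the record, a more vectorised sketch of the same cleanup (assuming the same column names as above):
import pandas as pd
data = pd.read_csv('input.csv')
# replace the placeholder, strip the commas, then let to_numeric convert
amount = data['Amount'].replace(' - ', '0').str.replace(',', '')
data['Amount'] = pd.to_numeric(amount, errors='coerce')
data['Result'] = data['Amount'] - data['Cost']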

Related

How to convert a dataframe containing 1's and 0's and add a new column to the same dataframe that represents the hex value of the entire row in python

I have a dataframe of 51 rows and 464 columns; the columns contain 1's and 0's. I want to have an encoded hex value of each row, as you see in the attached picture.
I was trying to use numpy to make the hex conversion, but it would fail:
df = pd.DataFrame(np.random.randint(0,2,size=(51, 464)))
#converting into numpy for easier shifting
a = df.values
b = a.dot(2**np.arange(a.size)[::-1])
I want every 4 columns grouped together to produce one hexadecimal digit, and if the number of columns is not a multiple of 4 (for example 463 instead of 464), the trailing hexadecimal digit should be padded with as many zeroes as needed to make a full hexadecimal value.
This code only works up to 64 bits in length and then fails.
I was following this example:
binary0|1 to hex string
Any suggestions on how to do this?
Doesn't this do what you want?
df.apply(lambda row: hex(int(''.join(map(str, row)), base=2)), axis=1)
Convert every number in the row to a string
Join them to create one big number in a string
Convert it to an integer with base 2 (since the row is in binary format)
Convert it to hex
Edit: To convert every 4-bit piece in the same manner:
def hexize(row):
    hexes = '0x'
    row = ''.join(map(str, row))
    for i in range(0, len(row), 4):
        value = row[i:i+4]
        value = value.ljust(4, '0')  # right-fill with 0
        value = hex(int(value, base=2))
        hexes += value[2:]
    return hexes
df.apply(hexize, axis=1)
hexize('011101100')  # returns '0x760'
Given input data:
ECID,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20,T21,T22,T23,T24,T25,T26,T27,T28,T29,T30,T31,T32,T33,T34,T35,T36,T37,T38,T39,T40,T41,T42,T43,T44,T45,T46,T47,T48,T49,T50,T51
ABC123,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
XYZ345,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DEF789,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
434thECID,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
This adds an "Encoded" column similar to what was asked. The first row example in the original question seems to have the wrong number of Fs:
import pandas as pd

def encode(row):
    s = ''.join(str(x) for x in row[1:])  # create the binary string
    s += '0' * (4 - len(row[1:]) % 4)     # make the length a multiple of 4 by adding zeros
    i = int(s, 2)                         # convert to integer, base 2
    h = hex(i).rstrip('0')                # strip trailing zeros
    return h if h != '0x' else '0x0'      # handle the all-zero case, where '0x0' strips to '0x'

df = pd.read_csv('input.csv')
df['Encoded'] = df.apply(encode, axis=1)
print(df)
Output:
ECID T1 T2 T3 T4 T5 ... T47 T48 T49 T50 T51 Encoded
0 ABC123 1 1 1 1 1 ... 1 1 1 1 1 0xffffffffffffe
1 XYZ345 1 0 0 0 0 ... 0 0 0 0 0 0x8
2 DEF789 1 0 1 0 1 ... 0 0 0 0 0 0xaa
3 434thECID 0 0 0 0 0 ... 0 0 0 0 0 0x0
[4 rows x 53 columns]
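For larger frames, a vectorised sketch of the same 4-bit grouping (hypothetical code, not from the answers above; it selects the T1..T51 columns explicitly and, unlike encode(), keeps trailing zero digits):
import numpy as np
bits = df.loc[:, 'T1':'T51'].to_numpy()   # just the 0/1 columns
pad = (-bits.shape[1]) % 4                # zeros needed to reach a multiple of 4
bits = np.pad(bits, ((0, 0), (0, pad)))   # right-pad each row with zeros
nibbles = bits.reshape(len(bits), -1, 4) @ (2 ** np.arange(4)[::-1])  # value of each 4-bit group
df['Encoded2'] = ['0x' + ''.join('0123456789abcdef'[n] for n in row) for row in nibbles]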

Converting a 1D list into a 2D DataFrame

I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know if there is a more pythonic/efficient way to achieve this? NB. I don't necessarily know how many columns there are in my table. Here's an example:
import pandas as pd
input = ['A',1,'B',5,'C',9,
         'A',2,'B',6,'C',10,
         'A',3,'B',7,'C',11,
         'A',4,'B',8,'C',12]
output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it, None)
df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd
data = ['A',1,'B',5,'C',9,
        'A',2,'B',6,'C',10,
        'A',3,'B',7,'C',11,
        'A',4,'B',8,'C',12]
result = {}
for k, v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)
df = pd.DataFrame(result)
You can also use numpy reshape (here data is the same flat input list):
import numpy as np
cols = sorted(set(data[::2]))
df = pd.DataFrame(np.reshape(data, (len(data) // (len(cols) * 2), len(cols) * 2)).T[1::2].T, columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get the columns
cols = sorted(set(data[::2]))
# reshape the list into a list of lists
shape = (len(data) // (len(cols) * 2), len(cols) * 2)
np.reshape(data, shape)
# get only the values of the data
.T[1::2].T
# this transposes the data and slices every second row
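Putting those pieces together as one runnable sketch (note numpy reads the mixed list back as strings, so cast at the end if you need numbers):
import numpy as np
import pandas as pd
data = ['A',1,'B',5,'C',9,
        'A',2,'B',6,'C',10,
        'A',3,'B',7,'C',11,
        'A',4,'B',8,'C',12]
cols = sorted(set(data[::2]))
shape = (len(data) // (len(cols) * 2), len(cols) * 2)
df = pd.DataFrame(np.reshape(data, shape).T[1::2].T, columns=cols).astype(int)
print(df)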

Multiplying values from a string column in Pandas

I have a column with land dimensions in Pandas. It looks like this:
df.LotSizeDimensions.value_counts(dropna=False)
40.00X150.00 2
57.00X130.00 2
27.00X117.00 2
63.00X135.00 2
37.00X108.00 2
65.00X134.00 2
57.00X116.00 2
33x124x67x31x20x118 1
55.00X160.00 1
63.00X126.00 1
36.00X105.50 1
In rows where there is only one X, I would like to create a separate column that multiplies the two values. In rows where there is more than one X, I would like to return a zero. This is the code I came up with:
def dimensions_split(df: pd.DataFrame):
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip()
    df.LotSizeDimensions = df.LotSizeDimensions.str.upper()
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip('`"M')
    if df.LotSizeDimensions.count('X') > 1:
        return 0
    df['LotSize'] = map(int(df.LotSizeDimensions.str.split("X", 1).str[0])*int(df.LotSizeDimensions.str.split("X", 1).str[1]))
This is coming back with the following error:
TypeError: cannot convert the series to <class 'int'>
I would also like to add a line where if there are any non-numeric characters other than X, return a zero.
The idea is to first strip and uppercase the LotSizeDimensions column, then use Series.str.split with expand=True to split it into a DataFrame, and then multiply the two columns where there is exactly one X, returning 0 otherwise:
s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
df1 = s.str.split('X', expand=True).astype(float)
# for general data, where some parts may be non-numeric:
#df1 = s.str.split('X', expand=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['LotSize'] = np.where(s.str.count('X').eq(1), df1[0] * df1[1], 0)
print (df)
LotSizeDimensions LotSize
0 40.00X150.00 6000.0
1 57.00X130.00 7410.0
2 27.00X117.00 3159.0
3 37.00X108.00 3996.0
4 63.00X135.00 8505.0
5 65.00X134.00 8710.0
6 57.00X116.00 6612.0
7 33x124x67x31x20x118 0.0
8 55.00X160.00 8800.0
9 63.00X126.00 7938.0
10 36.00X105.50 3798.0
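If you also want a zero whenever a part is non-numeric, the question's extra requirement, here is a small extension of the same idea (a sketch, not part of the original answer; it reuses s and np from the snippet above):
parts = s.str.split('X', expand=True).apply(pd.to_numeric, errors='coerce')
valid = s.str.count('X').eq(1) & parts[0].notna() & parts[1].notna()
df['LotSize'] = np.where(valid, parts[0] * parts[1], 0)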
I get this using a list comprehension:
import pandas as pd
df = pd.DataFrame(['40.00X150.00','57.00X130.00',
'27.00X117.00',
'37.00X108.00',
'63.00X135.00' ,
'65.00X134.00' ,
'57.00X116.00' ,
'33x124x67x31x20x118',
'55.00X160.00',
'63.00X126.00',
'36.00X105.50'])
df[1] = [float(str_data.strip().split("X")[0]) * float(str_data.strip().split("X")[1])
         if len(str_data.strip().split("X")) == 2 else None
         for str_data in df[0]]

Python: How to change same numbers in a Series/Column to other values?

I am trying to change the values of a very long column (about 1 million entries) in a data frame. I have something like
####ID_Orig
3452
3452
3452
6543
6543
...
I want something like
####ID_new
0
0
0
1
1
...
At the moment I'm doing this:
j = 0
for i in range(0, 1199531):
    if data.ID_orig[i] == data.ID_orig[i+1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j = j + 1
This takes ages... Is there a faster way to do it?
I don't know what values ID_orig has and how often a single value comes up.
Use factorize; note that if a group of values repeats later, the repeated group gets the same number.
Another solution, comparing each value to its shifted neighbour with ne (!=) and taking the cumsum, is more general: it always creates a new number when the value changes, even for repeating groups:
df['ID_new1'] = pd.factorize(df['ID_Orig'])[0]
df['ID_new2'] = df['ID_Orig'].ne(df['ID_Orig'].shift()).cumsum() - 1
print (df)
ID_Orig ID_new1 ID_new2
0 3452 0 0
1 3452 0 0
2 3452 0 0
3 6543 1 1
4 6543 1 1
5 100 2 2
6 100 2 2
7 6543 1 3 <-repeating group
8 6543 1 3 <-repeating group
You can do this …
import collections
l1 = [3452, 3452, 3452, 6543, 6543]
c = collections.Counter(l1)
l2 = list(c.items())
l3 = []
for i, t in enumerate(l2):
    for x in range(t[1]):
        l3.append(i)
for x in l3:
    print(x)
This is the output:
0
0
0
1
1
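Note this approach assumes equal IDs are always contiguous (it also relies on Counter preserving insertion order). A sketch of the same idea with itertools.groupby, which makes that assumption explicit:
from itertools import groupby
l1 = [3452, 3452, 3452, 6543, 6543]
l3 = [i for i, (key, group) in enumerate(groupby(l1)) for _ in group]
print(l3)  # [0, 0, 0, 1, 1]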
You can use the following. In this implementation, duplicate ids in the original column get the same new id. It drops duplicates from the column, assigns a different number to each unique id to form the new ids, and then merges those new ids back into the original dataset:
import numpy as np
import pandas as pd
from time import time

num_rows = 119953
input_data = np.random.randint(1199531, size=(num_rows, 1))
data = pd.DataFrame(input_data)
data.columns = ["ID_orig"]
data2 = pd.DataFrame(input_data)
data2.columns = ["ID_orig"]

t0 = time()
j = 0
for i in range(0, num_rows - 1):
    if data.ID_orig[i] == data.ID_orig[i + 1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j = j + 1
t1 = time()

id_new = data2.loc[:, "ID_orig"].drop_duplicates().reset_index().drop("index", axis=1)
id_new.reset_index(inplace=True)
id_new.columns = ["id_new"] + id_new.columns[1:].values.tolist()
data2 = data2.merge(id_new, on="ID_orig")
t2 = time()

print("Previous: ", round(t1 - t0, 2), " seconds")
print("Current : ", round(t2 - t1, 2), " seconds")
The output of the above program using only 119k rows is
Previous: 12.16 seconds
Current : 0.06 seconds
The runtime difference increases even more as the number of rows increases.
EDIT
Using the same number of rows:
>>> print("Previous: ", round(t1-t0, 2))
Previous: 11.7
>>> print("Current : ", round(t2-t1, 2))
Current : 0.06
>>> print("jezrael's answer : ", round(t3-t2, 2))
jezrael's answer : 0.02

How to add a new column to a table formed from conditional statements?

I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without a loop by using ge, which means greater than or equal to, and casting the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
it wouldn't work. Firstly, you're using Data, not data. Secondly, even with that fixed, you'd be comparing a scalar against an array, which raises an error because such a comparison is ambiguous. Thirdly, you're assigning to the entire column on every iteration, overwriting it each time.
You need to access the index label, which your loop doesn't; you can use iteritems to do this:
In [125]:
for idx, x in df["X"].iteritems():
    if x >= df['Y'].loc[idx]:
        df.loc[idx, 'Z'] = 1
    else:
        df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method here.
Firstly, one bug in your code is simply that you capitalized your dataframe name as 'Data' instead of keeping it 'data'.
However, for efficient code, EdChum has a great answer above. Here is another method, similar in efficiency but with easier-to-remember code:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)
