I have a dataframe that looks like this:
df_all_data:
   everything                    file_names
0                                v_merged.sql
1  CREATE VIEW [dbo].[v_merged]  v_merged.sql
2  AS                            v_merged.sql
3  WITH [stage] AS               v_merged.sql
4  (                             v_merged.sql
5  SELECT --[row]                v_merged.sql
6  [fssa_legacysystemid]         v_merged.sql
7  ,[A_ID]                       v_merged.sql
8  ,[vendorcode]                 v_merged.sql
9  ,NULL AS [lpinumber]          v_merged.sql
I am receiving the following error:
TypeError: ("descriptor 'startswith' requires a 'str' object but received a 'float'", 'occurred at index everything')
I am not sure what I am doing wrong; I thought my everything column was of str or object type?
Edit #1:
This is the code that caused this error:
df_all_data = df_all_data[~df_all_data.applymap(lambda x : str.startswith(x,'--')).any(1)]
Since Pandas has found float values, there's a good chance they really are there. Most likely those values are nulls, i.e. NaN / np.nan, which Pandas stores as floats. One simple workaround is to convert to str in your lambda function:
df = df[~df.applymap(lambda x: str.startswith(str(x), '--')).any(axis=1)]
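To confirm that NaN really is the culprit, you can inspect the Python types actually stored in the column. A minimal sketch, with hypothetical data standing in for the real file:

import numpy as np
import pandas as pd

# hypothetical stand-in for df_all_data: one NaN hiding among the strings
df = pd.DataFrame({'everything': ['CREATE VIEW [dbo].[v_merged]', np.nan, '--[row]'],
                   'file_names': ['v_merged.sql'] * 3})
print(df['everything'].map(type).value_counts())
# <class 'str'>      2
# <class 'float'>    1    <- the NaN that str.startswith chokes on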
A better idea would be to convert to str via pd.DataFrame.astype and use the pd.Series.str methods, which mimic Python string methods. Note that .str only exists on a Series, so it has to be applied column by column, and the negation from the original code needs to be kept:
df = df[~df.astype(str).apply(lambda col: col.str.startswith('--')).any(axis=1)]
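Run against the hypothetical frame from the sketch above, this keeps the NaN row (its string form 'nan' does not start with '--') and drops only the comment row:

mask = df.astype(str).apply(lambda col: col.str.startswith('--')).any(axis=1)
print(df[~mask])
#                      everything    file_names
# 0  CREATE VIEW [dbo].[v_merged]  v_merged.sql
# 1                           NaN  v_merged.sql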
Related
I have a column with data that needs some massaging. The column may contain strings or floats, and some strings are in exponential form. I'd like to format all data in this column as a whole number where possible, expanding any exponential notation to an integer. Here is an example:
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].astype(int, errors = 'ignore')
The above code does not seem to do a thing. I know I can convert the exponential notation and decimals simply by using the int function, and I would have thought the above astype would do the same, but it does not. For example, the following code works in plain Python:
int(1170E1), int(1.17E+04), int(11700.0)
> (11700, 11700, 11700)
Any help in solving this would be appreciated. What i'm expecting the output to look like is:
0 '11700'
1 '11700'
2 '11700'
3 '24477G'
4 '124601'
5 '247602'
You may try pd.to_numeric:
df.code = pd.to_numeric(df.code,errors='coerce').fillna(df.code)
Out[800]:
0 11700.0
1 11700.0
2 11700.0
3 24477G
4 124601.0
5 247602.0
Name: code, dtype: object
Update
df['code'] = df['code'].astype(object)
s = pd.to_numeric(df['code'],errors='coerce')
df.loc[s.notna(),'code'] = s.dropna().astype(int)
df
Out[829]:
code
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
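If you want every entry to end up as a string, as in the expected output, a final astype(str) should finish the job; a small follow-up sketch, assuming the column now mixes Python ints and leftover strings:

df['code'] = df['code'].astype(str)
print(df['code'].tolist())
# ['11700', '11700', '11700', '24477G', '124601', '247602']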
BENY's answer should work, although coercing errors potentially leaves you open to filling values you don't want filled. The following will also do the integer conversion you are looking for.
def convert(x):
    try:
        return str(int(float(x)))
    except ValueError:
        return x
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].apply(convert)
outputs
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
where each element is a string.
I will be the first to say, I'm not proud of that triple cast.
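For context on why the triple cast is needed: int() refuses scientific-notation strings, while float() accepts them, so the string has to pass through float first. A quick illustration:

try:
    int('1.17E+04')
except ValueError as e:
    print(e)                           # invalid literal for int() with base 10: '1.17E+04'
print(int(float('1.17E+04')))          # 11700: float() parses the exponent first
print(str(int(float('1.17E+04'))))     # '11700': back to a string for a uniform column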
I have a dataframe like so:
time 0 1 2 3 4 5
0 3.477110 3.475698 3.475874 3.478345 3.476757 3.478169
1 3.422223 3.419752 3.417987 3.421341 3.418693 3.418340
2 3.474110 3.474816 3.477463 3.479757 3.479581 3.476757
3 3.504995 3.507112 3.504995 3.505877 3.507112 3.508171
4 3.426106 3.424870 3.422399 3.421517 3.419046 3.417105
6 3.364336 3.362571 3.360453 3.358335 3.357806 3.356924
7 3.364336 3.362571 3.360453 3.358335 3.357806 3.356924
8 3.364336 3.362571 3.360453 3.358335 3.357806 3.356924
but sktime requires the data to be in a format where each dataframe entry is a separate time series:
3.477110,3.475698,3.475874,3.478345,3.476757,3.478169
3.422223,3.419752,3.417987,3.421341,3.418693,3.418340
3.474110,3.474816,3.477463,3.479757,3.479581,3.476757
3.504995,3.507112,3.504995,3.505877,3.507112,3.508171
3.426106,3.424870,3.422399,3.421517,3.419046,3.417105
3.364336,3.362571,3.360453,3.358335,3.357806,3.356924
Essentially, as I have 6 cols of data, each row should become a separate series (of length 6) and the final shape should be (9, 1) (for this example) instead of the (9, 6) it is right now.
I have tried iterating over the rows and using various transform techniques, but to no avail. I am looking for something similar to the .squeeze() method but that works for multiple datapoints. How does one go about it?
I think you want something like this:
import numpy as np

result = df.set_index('time').apply(np.array, axis=1)
print(result)
print(type(result))
print(result.shape)
time
0 [3.47711, 3.475698, 3.475874, 3.478345, 3.4767...
1 [3.422223, 3.419752, 3.417987, 3.421341, 3.418...
2 [3.47411, 3.474816, 3.477463, 3.479757, 3.4795...
3 [3.504995, 3.507112, 3.504995, 3.505877, 3.507...
4 [3.426106, 3.42487, 3.422399, 3.421517, 3.4190...
6 [3.364336, 3.362571, 3.360453, 3.358335, 3.357...
7 [3.364336, 3.362571, 3.360453, 3.358335, 3.357...
8 [3.364336, 3.362571, 3.360453, 3.358335, 3.357...
dtype: object
<class 'pandas.core.series.Series'>
(8,)
This is one pd.Series of length 8 (in your example data, index 5 is missing ;)) and each value of the Series is a np.array. You can also go with list (in the apply statement) if you want.
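If you specifically need the (n, 1) DataFrame shape mentioned in the question rather than a Series, you can wrap the result in a one-column frame; a minimal sketch (the column name dim_0 is just an assumption, check what your sktime version expects):

nested = result.to_frame(name='dim_0')
print(nested.shape)   # (8, 1): one column whose cells each hold an entire series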
Convert all columns to str, because the join method only accepts strings. Then join all columns with a "," delimiter:
df.astype(str).agg(','.join,axis=1)
df.astype(str).agg(','.join,axis=1).shape
(9,)
I have a data set represented in a Pandas object, see below:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
1/1/2011 0:00 1 0 0 1 9.84 14.395 81 0 3 13 16
1/1/2011 1:00 1 0 0 2 9.02 13.635 80 0 8 32 40
1/1/2011 2:00 1 0 0 3 9.02 13.635 80 0 5 27 32
p_type_1 = pd.read_csv("Bike Share Demand.csv")
p_type_1 = (p_type_1 >>
            rename(date=X.datetime))
p_type_1.date.str.split(expand=True,)
p_type_1[['Date','Hour']] = p_type_1.date.str.split(" ",expand=True,)
p_type_1['date'] = pd.to_datetime(p_type_1['date'])
p_hour = p_type_1["Hour"]
p_hour
Now I am trying to take the sum of my column Hour that I created (p_hour)
p_hours = p_type_1["Hour"].sum()
p_hours
and get this error:
TypeError: must be str, not int
so I then tried:
p_hours = p_type_1(str["Hour"].sum())
p_hours
and get this error:
TypeError: 'type' object is not subscriptable
I just want the sum. What gives?
Your dataframe's datatypes are the problem.
Take a closer look at this question:
Convert DataFrame column type from string to datetime, dd/mm/yyyy format
Sample code that should solve your problem (I simplified the CSV):
'''
CSV
datetime,season
1/1/2011 0:00,1
1/1/2011 1:00,1
1/1/2011 2:00,1
'''
import pandas as pd
p_type_1 = pd.read_csv("Bike Share Demand.csv")
p_type_1['datetime'] = p_type_1['datetime'].astype('datetime64[ns]')
p_type_1['hour'] = p_type_1['datetime'].dt.hour  # iteritems() is deprecated; .dt.hour is the idiomatic accessor
print(p_type_1['hour'].sum())
There's quite a bit going on in here that's not correct. So I'll try to break down the issues and offer alternatives.
Here:
p_hours = p_type_1(str["Hour"].sum())
p_hours
The issue is that str["Hour"] does not call str on "Hour"; it tries to subscript the str type itself, which is why Python complains that a 'type' object is not subscriptable. That's not what you were trying to do. This crash is unrelated to your core problem; it is just a syntax error.
The actual problem is that your dataframe column mixes string and integer values. The sum operation will concatenate strings or add numeric types, but on a mixed-type column it fails.
In order to verify that this is the issue however, we would need to see your actual dataframe, as I have a feeling the one you gave may not be the correct one.
As a proof of concept, I created the following example:
import pandas as pd
dta = [str(x) for x in range(20)]
dta.append(12)
frame = pd.DataFrame.from_dict({"data": dta})
print(frame["data"].sum())
>>> TypeError: can only concatenate str (not "int") to str
Note that the newer editions of pandas have more clear error messages.
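Once the mixed column is diagnosed, one hedged way to get a numeric total out of it is to coerce every entry to a number first. Note this works here because each entry is a parseable number; the asker's Hour column holds strings like '0:00', so the datetime route from the first answer is the right fix there:

print(pd.to_numeric(frame["data"]).sum())   # 202: all entries coerced to numbers before summing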
Say I have a pandas dataframe and want to print rows where two particular columns (Score and Score1) have different values.
I am running on python 3.6
I tried
print(Data[round(Data['Score'],4)!=round(Data['Score1'],4)])
and got this error:
unsupported operand type(s) for *: 'decimal.Decimal' and 'float'
I also tried
from decimal import *
print(Data[Decimal(round(Data['Score'],4))!=round(Data['Score1'],4)])
and get:
conversion from Series to Decimal is not supported
Here is some sample data
Score Score1
0 0.00187718 0.001877000000000
1 0.000184217 0.000184000000000
2 0.000502648 0.000503000000000
3 0.185124 0.185124000000000
4 3.3589e-05 0.000034000000000
5 0.00156229 0.001562000000000
6 6.4937e-05 0.000065000000000
7 4.87503e-05 0.000049000000000
8 0.00215561 0.002156000000000
9 3.22308e-05 0.000044000000000
10 3.70668e-05 0.000037000000000
11 0.000100837 0.000101000000000
12 7.91073e-05 0.000079000000000
13 0.00424232 0.004232000000000
14 6.80564e-06 0.000007000000000
15 0.00928687 0.009287000000000
My solution for now is to output the dataframe to csv and reload the csv into Python. It looks good to me. Knowing that it's definitely not a smart way, I am going with it given my tight timeline.
Here are some other common approaches for comparing floating-point values. They are not equivalent to what you implemented, but should still work well in many scenarios.
Using native pandas:
selected = data[(data["Score"]-data["Score1"]).abs() >= 1e-4]
print(selected)
Using numpy.isclose:
import numpy as np
selected = data[~np.isclose(data["Score"], data["Score1"], rtol=0, atol=1e-4)]
print(selected)
It seems that you are comparing two numbers (floating-point numbers rounded to a certain decimal place).
I think you can try this (compare a with b):
a = 0.00542
b = 0.00534
decimal_place = 4 #or any place you want
round(a-b, decimal_place) # if this is zero, consider a and b as the same
Since I don't know what kind of data you have, I cannot implement things in pandas for you. This is what I come up with when seeing your question. Hope that it will help you out.
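Translated to pandas, the same idea becomes a vectorized mask; a minimal sketch, assuming the two columns already sit as plain floats in a frame named df:

decimal_place = 4
mask = (df["Score"] - df["Score1"]).round(decimal_place) != 0
print(df[mask])   # rows where the scores differ at the chosen decimal place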
Update after getting the data file:
import pandas as pd
filename = "datafile"
df = pd.read_csv(filename, delim_whitespace = True)
print(df)
print(df.columns)
df["Compare"] = (round(df["Score"] - df["Score1"], 6) == 0)
print(df)
Somehow, my code runs smoothly (after copying your data into a file named "datafile"). I planned to run your code next to find out why yours fails.
Unfortunately, after plugging your code into mine, I still cannot see why it did not work for you. It seems to run just fine:
import pandas as pd
filename = "datafile"
df = pd.read_csv(filename, delim_whitespace = True)
print(df)
print(df.columns)
print(round(df['Score'],4)!=round(df['Score1'],4))
#print(df[round(df['Score'],4)!=round(df['Score1'],4)])
# I deleted the df[] around round(...,4)
Here is the output
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
dtype: bool
My guess is that you did not read the file in correctly. I would suggest printing out the data frame to see why.
Good luck with that!
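One more possibility worth noting, since the original error mentions 'decimal.Decimal' and 'float': the Score columns may hold decimal.Decimal objects rather than floats, and round() on such a column ends up multiplying a Decimal by a float, which Python forbids. A hedged sketch of the fix, assuming that is indeed the cause:

Data["Score"] = Data["Score"].astype(float)    # assumes the cells are Decimal objects
Data["Score1"] = Data["Score1"].astype(float)
print(Data[round(Data["Score"], 4) != round(Data["Score1"], 4)])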
I use Pandas 'ver 0.12.0' with Python 2.7 and have a dataframe as below:
df = pd.DataFrame({'id': [123, 512, 'zhub1', 12354.3, 129, 753, 295, 610],
                   'colour': ['black', 'white', 'white', 'white',
                              'black', 'black', 'white', 'white'],
                   'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
                             'triangular', 'round', 'triangular']
                   }, columns=['id', 'colour', 'shape'])
The id Series consists of some integers and strings. Its dtype by default is object. I want to convert all contents of id to strings. I tried astype(str), which produces the output below.
df['id'].astype(str)
0 1
1 5
2 z
3 1
4 1
5 7
6 2
7 6
1) How can I convert all elements of id to String?
2) I will eventually use id for indexing for dataframes. Would having String indices in a dataframe slow things down, compared to having an integer index?
A new answer to reflect the most current practices: as of now (v1.2.4), neither astype('str') nor astype(str) work.
As per the documentation, a Series can be converted to the string datatype in the following ways:
df['id'] = df['id'].astype("string")
df['id'] = pandas.Series(df['id'], dtype="string")
df['id'] = pandas.Series(df['id'], dtype=pandas.StringDtype)
You can convert all elements of id to str using apply
df.id.apply(str)
0 123
1 512
2 zhub1
3 12354.3
4 129
5 753
6 295
7 610
Edit by OP:
I think the issue was related to the Python version (2.7); this worked:
df['id'].astype(basestring)
0 123
1 512
2 zhub1
3 12354.3
4 129
5 753
6 295
7 610
Name: id, dtype: object
You must assign it back, like this:
df['id']= df['id'].astype(str)
Personally, none of the above worked for me.
What did:
new_str = [str(x) for x in old_obj][0]
You can use:
df.loc[:,'id'] = df.loc[:, 'id'].astype(str)
This is the form the Pandas documentation recommends: Pandas doc
TL;DR
To reflect some of the answers:
df['id'] = df['id'].astype("string")
This will break on the given example because it will try to convert to a StringArray, which cannot handle numbers mixed in with the strings.
df['id'] = df['id'].astype(str)
For me this solution threw a warning:
> SettingWithCopyWarning:
> A value is trying to be set on a copy of a
> slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
There are two possibilities:
Use .astype("str").astype("string"). As seen here
Use .astype(pd.StringDtype()). From the official documentation
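A minimal sketch of the first option against the example frame:

df['id'] = df['id'].astype(str).astype("string")
print(df['id'].dtype)     # string
print(df['id'].tolist())  # ['123', '512', 'zhub1', '12354.3', '129', '753', '295', '610']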
For me, this worked (note that convert_dtypes returns a new Series, so it has to be assigned back):
df['id'] = df['id'].convert_dtypes()
see the documentation here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html
Use pandas string methods, i.e. df['id'].str.cat()
If you want to do it dynamically:
df_obj = df.select_dtypes(include='object')
df[df_obj.columns] = df_obj.astype(str)
Your problem can easily be solved by converting the column to object first. After it is converted to object, just use astype to convert it to str.
obj = lambda x: x[1:]
df['id'] = df['id'].apply(obj).astype('str')
For me, .to_string() worked:
df['id'] = df['id'].to_string()
(Be aware that Series.to_string() returns one formatted string for the entire column, not a per-element conversion, so this puts the same blob in every row.)