How to sum an object column in Python

I have a data set represented in a Pandas object, see below:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
1/1/2011 0:00 1 0 0 1 9.84 14.395 81 0 3 13 16
1/1/2011 1:00 1 0 0 2 9.02 13.635 80 0 8 32 40
1/1/2011 2:00 1 0 0 3 9.02 13.635 80 0 5 27 32
p_type_1 = pd.read_csv("Bike Share Demand.csv")
p_type_1 = (p_type_1 >>
rename(date = X.datetime))
p_type_1.date.str.split(expand=True,)
p_type_1[['Date','Hour']] = p_type_1.date.str.split(" ",expand=True,)
p_type_1['date'] = pd.to_datetime(p_type_1['date'])
p_hour = p_type_1["Hour"]
p_hour
Now I am trying to take the sum of my column Hour that I created (p_hour)
p_hours = p_type_1["Hour"].sum()
p_hours
and get this error:
TypeError: must be str, not int
so I then tried:
p_hours = p_type_1(str["Hour"].sum())
p_hours
and get this error:
TypeError: 'type' object is not subscriptable
I just want the sum. What gives?

The problem is your DataFrame's data types.
Take a closer look at this question:
Convert DataFrame column type from string to datetime, dd/mm/yyyy format
Here is sample code that should solve your problem; I simplified the CSV:
'''
CSV
datetime,season
1/1/2011 0:00,1
1/1/2011 1:00,1
1/1/2011 2:00,1
'''
import pandas as pd
p_type_1 = pd.read_csv("Bike Share Demand.csv")
p_type_1['datetime'] = p_type_1['datetime'].astype('datetime64[ns]')
# .dt.hour replaces the Series.iteritems() loop; iteritems() was removed in pandas 2.0
p_type_1['hour'] = p_type_1['datetime'].dt.hour
print(p_type_1['hour'].sum())
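For a version that runs without the CSV file on disk, here is a minimal sketch that reads the three sample rows from an in-memory string and sums the hour component. The `io.StringIO` wrapper, the explicit month/day/year format string, and the `hour_total` name are assumptions made for illustration; they are not taken from the question.

```python
import io
import pandas as pd

# Sample rows from the simplified CSV, inlined so no file is needed.
csv_text = """datetime,season
1/1/2011 0:00,1
1/1/2011 1:00,1
1/1/2011 2:00,1
"""

df = pd.read_csv(io.StringIO(csv_text))

# Parse the text into real datetimes; the format assumes month/day/year order.
df["datetime"] = pd.to_datetime(df["datetime"], format="%m/%d/%Y %H:%M")

# .dt.hour yields integers, so sum() adds numbers instead of joining strings.
df["hour"] = df["datetime"].dt.hour
hour_total = int(df["hour"].sum())
print(hour_total)
```

If your dates are day-first, the format string would need to change accordingly.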

There's quite a bit going on in here that's not correct. So I'll try to break down the issues and offer alternatives.
Here:
p_hours = p_type_1(str["Hour"].sum())
p_hours
The issue is that you are actually trying to do this:
p_hours = p_type_1([str("Hour")].sum())
p_hours
Instead, your code as written asks for the item named 'Hour' in the str type itself (str["Hour"]), which is not what you are trying to do. This crash is unrelated to your core problem; it comes from misplaced brackets and raises a TypeError at runtime.
The actual problem is that your DataFrame column holds string and integer values together. The sum operation concatenates strings or adds numeric values, but on a column of mixed types it fails.
In order to verify that this is the issue however, we would need to see your actual dataframe, as I have a feeling the one you gave may not be the correct one.
As a proof of concept, I created the following example:
import pandas as pd
dta = [str(x) for x in range(20)]
dta.append(12)
frame = pd.DataFrame.from_dict({
"data": dta})
print(frame["data"].sum())
>>> TypeError: can only concatenate str (not "int") to str
Note that newer versions of pandas have clearer error messages.
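To confirm a mixed-type column and still get a total, something along these lines may help. This is only a sketch built on the proof-of-concept data above; `type_counts`, `numeric` and `total` are illustrative names, and coercing with `pd.to_numeric` is one option among several.

```python
import pandas as pd

# Same shape as the proof of concept: twenty digit strings plus one stray int.
values = [str(x) for x in range(20)]
values.append(12)
frame = pd.DataFrame({"data": values})

# Count the Python types stored in the object column to confirm the mix.
type_counts = frame["data"].map(type).value_counts()
print(type_counts)

# Coerce everything to numbers; entries that cannot be parsed become NaN.
numeric = pd.to_numeric(frame["data"], errors="coerce")
total = numeric.sum()
print(total)
```

With `errors="coerce"`, unparseable entries are silently skipped by `sum()`, so it is worth checking `numeric.isna().sum()` before trusting the total.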

Related

pandas: convert column with multiple datatypes to int, ignore errors

I have a column with data that needs some massaging. The column may contain strings or floats, and some strings are in exponential form. I'd like to format all data in this column as whole numbers where possible, expanding any exponential notation to integers. Here is an example:
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].astype(int, errors = 'ignore')
The above code does not seem to do anything. I know I can convert the exponential notation and decimals simply by using the int function, and I would think the above astype would do the same, but it does not. For example, the following code works in Python:
int(1170E1), int(1.17E+04), int(11700.0)
> (11700, 11700, 11700)
Any help in solving this would be appreciated. What I'm expecting the output to look like is:
0 '11700'
1 '11700'
2 '11700'
3 '24477G'
4 '124601'
5 '247602'
You may check with pd.to_numeric
df.code = pd.to_numeric(df.code,errors='coerce').fillna(df.code)
Out[800]:
0 11700.0
1 11700.0
2 11700.0
3 24477G
4 124601.0
5 247602.0
Name: code, dtype: object
Update
df['code'] = df['code'].astype(object)
s = pd.to_numeric(df['code'],errors='coerce')
df.loc[s.notna(),'code'] = s.dropna().astype(int)
df
Out[829]:
code
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
BENY's answer should work, although with errors='coerce' you may end up silently swallowing errors and filling in values you did not intend to. The following also does the integer conversion you are looking for.
def convert(x):
    try:
        return str(int(float(x)))
    except ValueError:
        return x
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].apply(convert)
outputs
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
where each element is a string.
I will be the first to say, I'm not proud of that triple cast.
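One caveat with the `convert` approach: `int(float(x))` raises `TypeError` for `None` and `OverflowError` for infinities, neither of which `except ValueError` catches. A hedged variant that also passes those values through unchanged might look like this (`convert_safe` is a hypothetical name, and the extra `'1e999'` row is added only to exercise the overflow path):

```python
import pandas as pd

def convert_safe(x):
    """Return x as a whole-number string when it parses as a finite number."""
    try:
        return str(int(float(x)))
    except (ValueError, TypeError, OverflowError):
        # ValueError: non-numeric text such as '24477G' (and NaN, which int() rejects)
        # TypeError: None or other objects float() cannot handle
        # OverflowError: infinities, e.g. float('1e999') evaluates to inf
        return x

df = pd.DataFrame({"code": ["1170E1", "1.17E+04", 11700.0, "24477G",
                            "124601", 247602.0, "1e999"]})
df["code"] = df["code"].apply(convert_safe)
print(df["code"].tolist())
```

Note that `int()` truncates, so a value like `'3.7'` would come back as `'3'`; whether that is acceptable depends on your data.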

Unable to convert comma separated integers and non-integer values to float in a series column in Python

Loading in the data
in: import pandas as pd
in: df = pd.read_csv('name', sep = ';', encoding='unicode_escape')
in : df.dtypes
out: amount object
I have an object column with amounts like 150,01 and 43,69. There are about 5,000 rows.
df['amount']
0 31
1 150,01
2 50
3 54,4
4 32,79
...
4950 25,5
4951 39,5
4952 75,56
4953 5,9
4954 43,69
Name: amount, Length: 4955, dtype: object
Naturally, I tried to convert the series using the locale module, which is supposed to turn it into floats. I got the following error:
In: import locale
locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
Out: 'en_US.UTF-8'
In: df['amount'].apply(locale.atof)
Out: ValueError: could not convert string to float: ' - '
Now that I'm aware that there are non-numeric values in the column, I tried to use the isnumeric methods to turn the non-numeric values into NaN.
Unfortunately, due to the comma-separated structure, all the values turned into -1.
0 -1
1 -1
2 -1
3 -1
4 -1
..
4950 -1
4951 -1
4952 -1
4953 -1
4954 -1
Name: amount, Length: 4955, dtype: int64
How do I turn the "," values into "." after first removing the "-" values? I tried .drop() and .truncate(), but neither helps. Replacing "," with " " via str.replace also causes trouble, since there are non-numeric values.
Please help!
Documentation that I came across
-https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas
-https://stackoverflow.com/questions/56315468/replace-comma-and-dot-in-pandas
p.s. This is my first post, please be kind
Sounds like you have a European-style CSV similar to the following. Provide actual sample data as many comments asked for if your format is different:
data.csv
thing;amount
thing1;31
thing2;150,01
thing3;50
thing4;54,4
thing5;1.500,22
To read it, specify the column, decimal and thousands separator as needed:
import pandas as pd
df = pd.read_csv('data.csv',sep=';',decimal=',',thousands='.')
print(df)
Output:
thing amount
0 thing1 31.00
1 thing2 150.01
2 thing3 50.00
3 thing4 54.40
4 thing5 1500.22
Posting as an answer since it contains multi-line code, despite not truly answering your question (yet):
Try using chardet. pip install chardet to get the package, then in your import block, add import chardet.
When importing the file, do something like:
with open("C:/path/to/file.csv", 'rb') as f:
    # read raw bytes so chardet sees the file's real encoding
    data = f.read()
    result = chardet.detect(data)
    charencode = result['encoding']
    # now reset the handle to the beginning and re-read the file:
    f.seek(0, 0)
    data = pd.read_csv(f, delimiter=';', encoding=charencode)
Alternatively, for reasons I cannot fathom, passing engine='python' as a parameter works often. You'd just do
data = pd.read_csv('C:/path/to/file.csv', engine='python')
@Mark Tolonen has a more elegant approach to standardizing the actual data, but my (hacky) way of doing it was to just write a function:
def stripThousands(df_column):
    # remove every comma, then convert to numbers; unparseable entries become NaN
    df_column = df_column.replace(',', '', regex=True)
    return pd.to_numeric(df_column, errors='coerce')
If you don't care about the entries that are just hyphens, you could use a function like
def screw_hyphens(column):
    # needs `import numpy as np`; cells that are exactly '-' become NaN, in place
    column.replace(['-'], np.nan, inplace=True)
or if np.nan values will be a problem, you can just replace it with column.replace('-', '', inplace=True)
**EDIT: there was a typo in the block outlining the usage of chardet. it should be correct now (previously the end of the last line was encoding=charenc)
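If the column has already been loaded as strings, the clean-up can also be done after the fact. Below is a rough sketch that assumes "." is a thousands separator, "," is the decimal mark, and placeholder cells look like ' - '. The sample values and the names `amount`, `cleaned` and `amount_float` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: European-style amounts plus a ' - ' placeholder row.
amount = pd.Series(["31", "150,01", " - ", "54,4", "1.500,22"], name="amount")

cleaned = (
    amount.str.strip()                          # drop surrounding spaces
          .str.replace(".", "", regex=False)    # remove thousands separators
          .str.replace(",", ".", regex=False)   # decimal comma -> decimal point
)

# Anything still non-numeric (the lone '-') becomes NaN instead of raising.
amount_float = pd.to_numeric(cleaned, errors="coerce")
print(amount_float)
```

Passing `regex=False` matters here: with regex matching, "." would match every character.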

while filtering the data from dataframe TypeError: must be real number, not str

we have a dataframe as
print(df)
Empld EmpName Date
1234 Ram 2020-01-01 01:01:01
2332 Andy 2010-11-11 01:01:01
2233 Jim 2009-01-11 01:01:01
when i try to filter the data in the dataframe
dfemp = df[df['Empld'] == '1234']
print(dfemp)
Empld EmpName Date
1234 Ram 2020-01-01 01:01:01
My code is below. I am trying to assign only the date value to a variable, as the dataframe will always have only one record with id '1234':
if dfemp.empty:
    EmDt = "2000-01-01"
else:
    EmDt = dfemp['Date'].values[0].replace("[","").replace("]","")[:10]
I am getting the error below:
Error: TypeError: must be real number, not str
Is there any way to overcome this error? I am trying to get the final output into a variable:
EmDt = "2020-01-01" (the record's date if one exists, otherwise the static value "2000-01-01")
I assume that all columns in df are of string type.
When you create dfemp, it is a DataFrame, and you want to read
Date column from the first row, also as a string.
To do it run:
if dfemp.empty:
    EmDt = "2000-01-01"
else:
    EmDt = dfemp.iloc[0].Date[:10]
replace is not needed here.
Another detail to check:
print(type(dfemp.iloc[0].Date).__name__)
The result should be "str". If the result is anything else, there is
something wrong or unexpected in your source data.
According to what I understood from your question, this should work.
dfemp['Date'] = pd.to_datetime(dfemp['Date'])
if dfemp.empty:
    EmDt = "2000-01-01"
else:
    # use .iloc[0]: after filtering, the first row's index label is not necessarily 0
    EmDt = dfemp['Date'].dt.date.iloc[0]
print(EmDt)
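Both answers can be folded into one small helper that works whether the Date column holds strings or datetimes. This is only a sketch: `first_date_or_default` is a hypothetical name, and the sample frame is rebuilt from the question's printout with every column assumed to be a string.

```python
import pandas as pd

def first_date_or_default(df, emp_id, default="2000-01-01"):
    """Return the YYYY-MM-DD date of the first row matching emp_id, else default."""
    match = df.loc[df["Empld"] == emp_id, "Date"]
    if match.empty:
        return default
    # pd.to_datetime accepts strings and datetime values alike,
    # so the column's dtype does not matter here.
    return pd.to_datetime(match.iloc[0]).strftime("%Y-%m-%d")

df = pd.DataFrame({
    "Empld": ["1234", "2332", "2233"],
    "EmpName": ["Ram", "Andy", "Jim"],
    "Date": ["2020-01-01 01:01:01", "2010-11-11 01:01:01", "2009-01-11 01:01:01"],
})

found = first_date_or_default(df, "1234")
missing = first_date_or_default(df, "9999")
print(found, missing)
```

Selecting with `.loc` and reading with `.iloc[0]` avoids both the chained `replace` calls and any dependence on the row's index label.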

How to convert string into datetime?

I'm quite new to Python and I'm encountering a problem.
I have a dataframe where one of the columns is the departure time of flights. These hours are given in the following format : 1100.0, 525.0, 1640.0, etc.
This is a pandas series which I want to transform into a datetime series such as : S = [11.00, 5.25, 16.40,...]
What I have tried already :
Transforming my objects into string :
S = [str(x) for x in S]
Using datetime.strptime :
S = [datetime.strptime(x,'%H%M.%S') for x in S]
But since they are not all the same format it doesn't work
Using parser from dateutil :
S = [parser.parse(x) for x in S]
I got the error :
'Unknown string format'
Using the panda datetime :
S= pd.to_datetime(S)
Doesn't give me the expected result
Thanks for your answers !
Since it's a column within a DataFrame (a Series), keeping it that way while transforming should work just fine.
S = [1100.0, 525.0, 1640.0]
se = pd.Series(S) # Your column
# se:
0 1100.0
1 525.0
2 1640.0
dtype: float64
setime = se.astype(int).astype(str).apply(lambda x: x[:-2] + ":" + x[-2:])
This transforms the floats into correctly formatted strings:
0 11:00
1 5:25
2 16:40
dtype: object
And then you can simply do:
df["your_new_col"] = pd.to_datetime(setime)
How about this?
(Added an if statement since some entries have 4 digits before decimal and some have 3. Added the use case of 125.0 to account for this)
from datetime import datetime
S = [1100.0, 525.0, 1640.0, 125.0]
for x in S:
    if str(x).find(".")==3:
        x="0"+str(x)
    print(datetime.strftime(datetime.strptime(str(x),"%H%M.%S"),"%H:%M:%S"))
You might give it a go as follows:
# Just initialising a state in line with your requirements
st = ["1100.0", "525.0", "1640.0"]
dfObj = pd.DataFrame(st)
# Casting the string column to float
dfObj_num = dfObj[0].astype(float)
# Getting the hour representation out of the number
df1 = dfObj_num.floordiv(100)
# Getting the minutes
df2 = dfObj_num.mod(100)
# Moving the minutes on the right-hand side of the decimal point
df3 = df2.mul(0.01)
# Combining the two dataframes
df4 = df1.add(df3)
# At this point can cast to other types
Result:
0 11.00
1 5.25
2 16.40
You can run this example to verify the steps for yourself, also you can make it into a function. Make slight variations if needed in order to tweak it according to your precise requirements.
Might be useful to go through this article about Pandas Series.
https://www.geeksforgeeks.org/python-pandas-series/
There must be a better way to do this, but this works for me.
df=pd.DataFrame([1100.0, 525.0, 1640.0], columns=['hour'])
df['hour_dt']=((df['hour']/100).apply(str).str.split('.').str[0]+'.'+
df['hour'].apply((lambda x: '{:.2f}'.format(x/100).split('.')[1])).apply(str))
print(df)
hour hour_dt
0 1100.0 11.00
1 525.0 5.25
2 1640.0 16.40
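If actual time objects are wanted rather than floats like 11.00, one more option is to zero-pad the digits and parse them with an explicit format. This is a sketch under the assumption that every value is a valid HHMM number with no missing entries; `hhmm` and `times` are illustrative names.

```python
import pandas as pd

# Departure times encoded as HHMM floats, as in the question.
se = pd.Series([1100.0, 525.0, 1640.0, 125.0])

# Pad to four digits so '525' becomes '0525' and parses unambiguously.
hhmm = se.astype(int).astype(str).str.zfill(4)

# Parse with an explicit format, then keep only the time-of-day part.
times = pd.to_datetime(hhmm, format="%H%M").dt.time
print(times.tolist())
```

The `astype(int)` step fails on NaN, and a value such as 2400.0 is not a valid hour, so real data may need filtering first.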

Pandas Requires a Str object but received a float

I have a dataframe that looks like this:
df_all_data:
everything file_names
0 v_merged.sql
1 CREATE VIEW [dbo].[v_merged] v_merged.sql
2 AS v_merged.sql
3 WITH [stage] AS v_merged.sql
4 ( v_merged.sql
5 SELECT --[row] v_merged.sql
6 [fssa_legacysystemid] v_merged.sql
7 ,[A_ID] v_merged.sql
8 ,[vendorcode] v_merged.sql
9 ,NULL AS [lpinumber] v_merged.sql
I am receiving the following error:
TypeError: ("descriptor 'startswith' requires a 'str' object but received a 'float'", 'occurred at index everything')
I am not sure what I am doing wrong? I thought my everything column is a str or object type?
Edit #1:
This is the code that caused this error:
df_all_data = df_all_data[~df_all_data.applymap(lambda x : str.startswith(x,'--')).any(1)]
Since pandas reports float values, there's a good chance the column really does contain some. Those values could be nulls, i.e. NaN / np.nan. One simple workaround is to convert to str in your lambda function:
df = df[~df.applymap(lambda x: str.startswith(str(x), '--')).any(axis=1)]
A better idea would be to convert to str via pd.DataFrame.astype and use the pd.Series.str methods, which mirror Python's string methods, on each column (a DataFrame has no .str accessor of its own):
df = df[~df.astype(str).apply(lambda col: col.str.startswith('--')).any(axis=1)]
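If only the `everything` column needs checking, the `na` argument of `Series.str.startswith` handles the NaN rows without converting them to the string 'nan'. Here is a small sketch with made-up rows; `is_comment` and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one SQL comment line and one missing value (NaN is a float).
df = pd.DataFrame({
    "everything": ["SELECT --[row]", "--drop me", np.nan, "AS"],
    "file_names": ["v_merged.sql"] * 4,
})

# startswith returns NaN for missing values by default; na=False counts them as non-matches.
is_comment = df["everything"].str.startswith("--", na=False)
kept = df[~is_comment]
print(kept)
```

Only row 1 is dropped: row 0 contains "--" but does not start with it, and the NaN row is kept.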
