"I have to change the format of a object list ['Dev'] to float64, but python send me this error. 'Dev'its a column of objects from a csv table ('prueba.csv') and I need to translate it for create statistics/metrics, could you help me?"
# Start the code
import pandas as pd
import numpy as np

# Read the data from the .csv file
df = pd.read_csv('prueba.csv', sep=';', decimal=',', encoding='ISO-8859-1')
#Data description
df.dtypes
Element     object
Property    object
Nominal    float64
Actual      object
Tol -      float64
Tol +      float64
Dev         object
Check      float64
Out        float64
dtype: object
# The column to convert to float
df['Dev']
0 0,69
1 0,62
2 0,54
3 0,47
4 0,19
5 -0,26
6 0,11
7 0,1
8 0,2
9 0,29
10 -1,54
11 -2
12 -2,06
13 -2,02
14 -2,08
15 -1,39
16 -1,68
17 -1,91
18 -1,78
19 -1,8
20 -1,21
21 -1,07
22 -0,97
23 -1,47
24 -1,35
25 -0,91
26 -1,17
27 -0,67
28 -1,12
29 -1,13
...
1962 0
1963 -0,37
1964 0,02
1965 0,32
1966 0,04
1967 0
1968 0,39
1969 0,25
1970 0,38
1971 0,15
1972 0
1973 1,11
1974 -1,13
1975 0,15
1976 0,12
1977 0
1978 -0,47
1979 -0,85
1980 0,08
1981 0,23
1982 0
1983 1,03
1984 -0,76
1985 -0,03
1986 0,02
1987 0
1988 0,36
1989 -1,45
1990 0,12
1991 0,09
Name: Dev, Length: 1992, dtype: object
# Function definition for the conversion
def convert_currency(val):
    """
    - Remove commas and u'
    - Convert to float type
    """
    new_val = val.replace("u'", "'")
    return float(new_val)

df['Dev'].apply(convert_currency)
# Python error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-93-9ac4450806f7> in <module>()
----> 1 df['Dev'].apply(convert_currency)
C:\Users\kaosb\miniconda2\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
3589 else:
3590 values = self.astype(object).values
-> 3591 mapped = lib.map_infer(values, f, convert=convert_dtype)
3592
3593 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-89-0c43557e37eb> in convert_currency(val)
5 """
6 new_val = val.replace("u'","'")
----> 7 return float(new_val)
ValueError: invalid literal for float(): 0,69
Looks like you are not replacing the commas:
def convert_currency(val):
    """
    - Remove the u'' prefix and swap the decimal comma for a dot
    - Convert to float type
    """
    new_val = val.replace("u'", "").replace(",", ".")
    return float(new_val)

would do the trick. As you said, your floats need a dot instead of a comma: you have to replace the commas with dots before applying float(), otherwise float() cannot parse a literal like '0,69' and raises a ValueError.
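If you'd rather avoid a Python-level apply, here is a minimal vectorized sketch, assuming the column holds only comma-decimal strings like the ones shown above:

import pandas as pd

# Swap the decimal comma for a dot, then let pandas cast the whole column;
# errors='coerce' turns anything unparseable into NaN instead of raising.
df['Dev'] = pd.to_numeric(df['Dev'].str.replace(',', '.', regex=False),
                          errors='coerce')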
Just change your encoding to latin_1. I tried it locally and it works great: since the read_csv call already passes decimal=',', pandas parses the column on import and automatically assigns it the dtype float64.
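For completeness, a sketch of that read call ('prueba.csv' and the separator settings are taken from the question); the decimal=',' argument is what lets read_csv parse the commas itself:

import pandas as pd

# With decimal=',' read_csv converts '0,69'-style values on the way in,
# so 'Dev' should arrive as float64 with no post-processing.
df = pd.read_csv('prueba.csv', sep=';', decimal=',', encoding='latin_1')
print(df['Dev'].dtype)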
Related
I'm trying to build a linear regression model with just one predictor variable for a Retail Dataset. The predictor variable I'm trying to use is known as Description.
The Description column contains numerical values for the product category, and the column's data type was initially int64.
InvoiceNo CustomerID StockCode Quantity Description Country Year Month Day dayofWeek HourofDay
4501 14730.0 3171 2 1324.0 35 2011 3 28 0 12
2220 14442.0 1483 2 368.0 6 2011 1 27 3 9
5736 15145.0 2498 12 2799.0 35 2011 4 26 1 16
3347 12678.0 1809 12 48.0 13 2011 2 28 0 11
14246 14179.0 1510 1 953.0 35 2011 10 18 1 1
When I tried to fit the model with just that, it worked perfectly.
X_train_1 = X_train['Description'].values.reshape(-1, 1)
X_train_1.columns = ['Description']
X_train_1.shape
mod = LinearRegression()
mod.fit(X_train_1, y_train)
However, after I changed the data type of the Description variable from int64 to category, it throws this error:
ValueError Traceback (most recent call last)
<ipython-input-48-7243890f1de1> in <module>()
4
5 mod = LinearRegression()
----> 6 mod.fit(X_train_1, y_train)
5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in take_nd(arr, indexer, axis, out, fill_value, allow_fill)
1711 arr.ndim, arr.dtype, out.dtype, axis=axis, mask_info=mask_info
1712 )
-> 1713 func(arr, indexer, out, fill_value)
1714
1715 if flip_order:
pandas/_libs/algos_take_helper.pxi in pandas._libs.algos.take_1d_int64_int64()
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Any idea why this is happening? Any help would be much appreciated.
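One possible workaround, sketched under the assumption that the goal is simply to hand scikit-learn a plain 2-D numeric array: pull the integer codes out of the category column before reshaping.

from sklearn.linear_model import LinearRegression

# .cat.codes returns an int64 Series of category codes; to_numpy() then
# gives an ordinary ndarray that reshape() and fit() can work with.
X_train_1 = X_train['Description'].cat.codes.to_numpy().reshape(-1, 1)
mod = LinearRegression()
mod.fit(X_train_1, y_train)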
I have a dataframe as shown in the picture:
problem dataframe: attdf
I would like to group the data by Source class and Destination class, count the number of rows in each group and sum up Attention values.
While trying to achieve that, I am unable to get past this type error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-100-6f2c8b3de8f2> in <module>()
----> 1 attdf.groupby(['Source Class', 'Destination Class']).count()
8 frames
pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
458 table = hash_klass(size_hint or len(values))
459 uniques, labels = table.factorize(values, na_sentinel=na_sentinel,
--> 460 na_value=na_value)
461
462 labels = ensure_platform_int(labels)
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
attdf.groupby(['Source Class', 'Destination Class'])
gives me a <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f1e720f2080> which I'm not sure how to use to get what I want.
Dataframe attdf can be imported from : https://drive.google.com/open?id=1t_h4b8FQd9soVgYeiXQasY-EbnhfOEYi
Please advise.
@Adam.Er8 and @jezarael helped me with their inputs. The unhashable type error in my case was caused by the datatypes of the columns in my dataframe.
Original df and df imported from csv
It turned out that the original dataframe had two object columns which I was trying to use in the groupby, hence the unhashable type error. But importing the data into a new dataframe straight from a csv fixed the datatypes; consequently, no more type errors.
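A quick way to reproduce that diagnosis (a sketch; the names follow the question) is to inspect what the cells actually contain before grouping:

import pandas as pd

# Cells holding numpy arrays are unhashable, so groupby cannot factorize them.
print(attdf.dtypes)
print(attdf['Source Class'].map(type).value_counts())

# The fix described above: round-trip through csv so every cell comes back
# as a plain, hashable scalar.
attdf.to_csv('attdf.csv', index=False)
attdf = pd.read_csv('attdf.csv')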
Try using .agg as follows:
import pandas as pd
attdf = pd.read_csv("attdf.csv")
print(attdf.groupby(['Source Class', 'Destination Class']).agg({"Attention": ['sum', 'count']}))
Output:
Attention
sum count
Source Class Destination Class
0 0 282.368908 1419
1 7.251101 32
2 3.361009 23
3 22.482438 161
4 14.020189 88
5 10.138409 75
6 11.377947 80
1 0 6.172269 32
1 181.582437 1035
2 9.440956 62
3 12.007303 67
4 3.025752 20
5 4.491725 28
6 0.279559 2
2 0 3.349921 23
1 8.521828 62
2 391.116034 2072
3 9.937170 53
4 0.412747 2
5 4.441985 30
6 0.220316 2
3 0 33.156251 161
1 11.944373 67
2 9.176584 53
3 722.685180 3168
4 29.776050 137
5 8.827215 54
6 2.434347 16
4 0 17.431855 88
1 4.195519 20
2 0.457089 2
3 20.401789 137
4 378.802604 1746
5 3.616083 19
6 1.095061 6
5 0 13.525333 75
1 4.289306 28
2 6.424412 30
3 10.911705 54
4 3.896328 19
5 250.309764 1132
6 8.643153 46
6 0 15.249959 80
1 0.150240 2
2 0.413639 2
3 3.108417 16
4 0.850280 6
5 8.655959 46
6 151.571505 686
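In pandas 0.25 or later, named aggregation gives the same numbers with flat column names instead of a MultiIndex; a sketch using the same columns:

# Named aggregation: each keyword becomes an output column name.
out = (attdf.groupby(['Source Class', 'Destination Class'])['Attention']
            .agg(attention_sum='sum', row_count='count'))
print(out)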
I am working on this problem for my coding class; the task is outlined in the docstring. I would appreciate any help optimizing my code, as well as an explanation of why I am receiving the following error despite resetting the index.
import pandas as pd

def beds_top_ten(df, facility_id):
    '''
    INPUT: DataFrame, int
    OUTPUT: date

    Write a pandas query that returns the ten census dates with the highest
    number of available beds for the nursing home with the specified facility id

    REQUIREMENTS:
    Do a filter followed by a sort rather than a sort followed by a merge.
    '''
    df = pd.read_csv('beds.csv', low_memory=False)
    df['Bed Census Date'] = pd.to_datetime(df['Bed Census Date'])
    df = df.filter(items=['Facility ID', 'Bed Census Date', 'Available Residential Beds'])
    df = df.sort_values(by=['Facility ID', 'Available Residential Beds'], ascending=False)
    df_group_by_ten = df.groupby('Facility ID').head(10).reset_index(drop=True)
    dates = df_group_by_ten.loc[df_group_by_ten['Facility ID'] == facility_id, 'Bed Census Date']
    return dates
This is what the table looks like after the first groupby:
Facility ID Bed Census Date Available Residential Beds
336 19 2011-01-05 29
339 19 2010-12-15 28
330 19 2011-02-23 27
332 19 2011-02-02 27
333 19 2011-01-26 27
334 19 2011-01-19 27
335 19 2011-01-12 27
338 19 2010-12-22 27
341 19 2010-12-01 27
331 19 2011-02-09 26
16 17 2013-04-10 22
87 17 2011-11-09 19
30 17 2013-01-02 17
37 17 2012-11-07 17
47 17 2012-08-29 17
31 17 2012-12-26 16
56 17 2012-06-20 16
10 17 2013-05-22 15
27 17 2013-01-23 15
61 17 2012-05-16 15
And when I run from my command_line:
In [15]: beds_top_ten('beds.csv',17)
Out[15]:
16 2013-04-10
87 2011-11-09
30 2013-01-02
37 2012-11-07
47 2012-08-29
31 2012-12-26
56 2012-06-20
10 2013-05-22
27 2013-01-23
61 2012-05-16
Name: Bed Census Date, dtype: datetime64[ns]
Yet when I run the same code on the online environment, I get the following error:
/usr/local/lib/python2.7/unittest/suite.py:108: DtypeWarning: Columns (10,45) have mixed types. Specify dtype option on import or set low_memory=False.
test(result)
E
======================================================================
ERROR: test_fourth_pandas (test_methods.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/src/app/test_methods.py", line 25, in test_fourth_pandas
all_equal = np.all(result == answer)
File "/usr/local/lib/python2.7/site-packages/pandas/core/ops.py", line 812, in wrapper
raise ValueError(msg)
ValueError: Can only compare identically-labeled Series objects
----------------------------------------------------------------------
Ran 1 test in 19.743s
FAILED (errors=1)
There's nothing wrong with pd.to_datetime. It's possible you have erroneous dates. Try specifying a format, and errors='coerce' so invalid formats are converted to NaT.
df['Bed Census Date'] = pd.to_datetime(df['Bed Census Date'].str.strip(),
                                       format='%Y-%m-%d', errors='coerce')
Now, expanding on my comment, filter, sort, and get the first 10 items using head:
x = df[df['Facility ID'] == facility_id]\
        .sort_values('Available Residential Beds', ascending=False).head(10)
return x['Bed Census Date']
Removing the date formatting line resolved the above error.
df = pd.read_csv('beds.csv', low_memory= False)
#df['Bed Census Date'] = pd.to_datetime(df['Bed Census Date'])
df = df.filter(items=['Facility ID', 'Bed Census Date','Available Residential Beds'])
x = df[df['Facility ID'] == facility_id].sort_values('Available Residential Beds', ascending=False).head(10)
return x['Bed Census Date']
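Putting the two answers together, a consolidated sketch of the whole function (column names follow the question's beds.csv):

import pandas as pd

def beds_top_ten(df, facility_id):
    # Filter first, then sort, then take the ten largest, per the requirement.
    sub = df[df['Facility ID'] == facility_id]
    top = sub.sort_values('Available Residential Beds', ascending=False).head(10)
    return top['Bed Census Date']

df = pd.read_csv('beds.csv', low_memory=False)
print(beds_top_ten(df, 17))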
total_val_count = dataset[attr].value_counts()
for i in range(len(total_val_count.index)):
    print total_val_count[i]
I have written this piece of code, which counts the occurrences of all distinct values of an attribute in a dataframe. The problem I am facing is that I am unable to access the first value using index 0; I get a KeyError: 0 on the very first iteration of the loop.
The total_val_count contains proper values as shown below:
34 2887
4 2708
13 2523
35 2507
33 2407
3 2404
36 2382
26 2378
16 2282
22 2187
21 2141
12 2104
25 2073
5 2052
15 2044
17 2040
14 2027
28 1984
27 1980
23 1979
24 1960
30 1953
29 1936
31 1884
18 1877
7 1858
37 1767
20 1762
11 1740
8 1722
6 1693
32 1692
10 1662
9 1576
19 1308
2 1266
1 175
38 63
dtype: int64
total_val_count is a Series. The index of the Series holds the values from dataset[attr], and the values of the Series are the number of times each associated value appears in dataset[attr].
When you index a Series with total_val_count[i], Pandas looks for i in the index and returns the associated value. In other words, total_val_count[i] indexes by index value, not by ordinal.
Think of a Series as a mapping from the index to the values. When using plain indexing, e.g. total_val_count[i], it behaves more like a dict than a list.
You are getting a KeyError because 0 is not a value in the index.
To index by ordinal, use total_val_count.iloc[i].
Having said that, using for i in range(len(total_val_count.index)) -- or, what amounts to the same thing, for i in range(len(total_val_count)) -- is not recommended. Instead of
for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
you could use
for value in total_val_count.values:
    print(value)
This is more readable, and allows you to access the desired value as a variable, value, instead of the more cumbersome total_val_count.iloc[i].
Here is an example which shows how to iterate over the values, the keys, and both the keys and values:
import pandas as pd

s = pd.Series([1, 2, 3, 2, 2])
total_val_count = s.value_counts()
print(total_val_count)
# 2    3
# 3    1
# 1    1
# dtype: int64

for value in total_val_count.values:
    print(value)
# 3
# 1
# 1

for key in total_val_count.keys():
    print(key)
# 2
# 3
# 1

for key, value in total_val_count.iteritems():
    print(key, value)
# (2, 3)
# (3, 1)
# (1, 1)

for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
# 3
# 1
# 1
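One caveat on newer pandas: Series.iteritems() has since been deprecated and removed, so the key/value loop above becomes:

# In pandas >= 2.0 use items() instead of the removed iteritems().
for key, value in total_val_count.items():
    print(key, value)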
I have been looking for how to order columns of a pandas crosstab, to no avail. I specifically need to order my columns, which are formatted dates (mmm yy), by the underlying date values rather than alphabetically by the 3-letter month name (mmm).
Here are the details of my code:
python 3.3
pandas 0.12.0
f_dtflt is a pandas dataframe.
f_dtflt.COLLECTION_DATE is dtype datetime64[ns]
My crosstab statement is:
pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, f_dtflt.COLLECTION_DATE.apply(lambda x: x.strftime("%b %y")), margins=True)
The output is:
COLLECTION_DATE Apr 13 Aug 13 Dec 12 Feb 13 Jan 13 Jul 13 Jun 13
EW_REGIONCOLLSITE
EAST 1964 2092 2280 2272 2757 2113 1902
WEST 2579 2011 1003 2351 2216 1506 1823
All 4543 4103 3283 4623 4973 3619 3725
COLLECTION_DATE Mar 13 May 13 Nov 12 Oct 12 Sep 13 All
EW_REGIONCOLLSITE
EAST 1682 1981 2108 825 975 22951
WEST 2770 3014 407 42 888 20610
All 4452 4995 2515 867 1863 43561
I want the columns to be ordered by ascending date...Oct 12, Nov 12, ... Jan 13, ...Sep 13.
I recognize that I could format the dates so that they are yy-mm (e.g. 13-01) but these labels will be used in a report and that is a compromise I hope not to make.
I'm new to python and pandas so please help the newbie by connecting any dots in your responses! Thanks a bunch.
METHOD 1
Edit in response to the first part of @Andy's answer. There is an issue with step 3:
I have tried to implement Andy's suggestion and here is more info on this effort.
1) I ran the following line to see what the dates look like. The following line creates values such as '2012-10' for collection date. ("beautified" by print?)
print(pd.DatetimeIndex(f_dtflt['COLLECTION_DATE']).to_period('M'))
2) When the above statement is entered into the crosstab, it changes the month values to digits such as 513, 514, etc. (actual values in field?)
table1=pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, pd.DatetimeIndex(f_dtflt['COLLECTION_DATE']).to_period('M'), margins=True)
Here is the output:
col_0 513 514 515 516 517 518 519 520 521 522
EW_REGIONCOLLSITE
EAST 825 2108 2280 2757 2272 1682 1964 1981 1902 2113
WEST 42 407 1003 2216 2351 2770 2579 3014 1823 1506
All 867 2515 3283 4973 4623 4452 4543 4995 3725 3619
col_0 523 524 All
EW_REGIONCOLLSITE
EAST 2092 975 22951
WEST 2011 888 20610
All 4103 1863 43561
3) When I run the following code, it throws an error that 'int' object has no attribute 'strftime'
table1.columns = table1.columns.map(lambda x: x.strftime("%b %y"))
I played around with this quite a bit and here are some of my notes:
# This runs and creates an array of strings: '513' etc.
pd.to_datetime(table1.columns.map(str), unit='M')
# The last entry in table1.columns is "All" and needs to be removed. Hence [:-1] slice.
# This also runs but seems to give years in 1630's.
pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M')
# This does not run because it says object is immutable
table1.columns[:-1]=pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M')
# This also runs but the output is weird. It seems to give an array of both dates and -1
table1.columns.reindex(pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M'))
# Does not run: DatetimeIndex() must be called with a collection of some kind, '513' was passed
table1.columns = table1.columns.map(lambda x: pd.DatetimeIndex(str(x)).strftime("%b %y"))
# Does not run: DatetimeIndex object is not callable
table1.rename(columns=pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M'))
4) This does work for labeling the columns in the crosstab:
table1.columns.name = 'COLLECTION_DATE'
METHOD 2
@Andy gave a second suggestion, and I played around with it but couldn't get it to work. A big part of the issue is my lack of familiarity with python, pandas, and numpy. I made notes for myself as I tried to sort it out. Here are my notes:
# Working with a new concept
# This creates column labels of 12 10, 12 11, etc.
table1=pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, f_dtflt.COLLECTION_DATE.apply(lambda x: x.strftime("%y %m")), margins=True)
# This throws an error that yb is not defined
table1.columns.map(lambda yb: "%s %s" % (y, b) for y, b in yb.split())
# Tried to simplify and see what happens. Runs and creates an array of lists such as [['12', '10'], ['12', '11'], ...]
table1.columns.map(lambda x: x.split())
# Trying a different approach. This creates a numpy array of datetimes.
tempholder=table1.columns[:-1].map(lambda x: datetime.datetime(year=int(x[0:2]), month=int(x[3:]), day=1))
# Noted that f_dtflt['COLLECTION_DATE'] was a dtype of datetime64[ns] but tempholder was dtype object. So had issue.
# Convert to datetime64
# Get error: Out of bounds nanosecond timestamp: 12-10-01 00:00:00
tempholder=pd.to_datetime(tempholder)
# Tempholder is an array of datetimes from the datetime module. I used the pandas date function above.
# Need to change that and use python datetime module function.
# Does not work: 'numpy.ndarray' object has no attribute 'apply'...
# this is a pandas function which does not work on a numpy array.
tempholder.apply(lambda x: x.strftime('%b %y'))
# This works for numpy array but I can't tell what it contains.
# print(tempholder) gives <map object at 0x0000000026C04F28>
# tempholder gives Out[169]: <builtins.map at 0x26c04f28>
tempholder=map(lambda x: x.strftime('%b %y'), tempholder)
I approached this problem from a slightly different angle and created a function that can be used as a general method of ordering columns in a crosstab in pandas. It may also work for a pivot table but I didn't test that nor did I look at the details. I suppose it can also be used to order row labels too but I didn't try for that.
This creates a crosstab with column labels such as "12 10_Oct 12" and "12 11_Nov 12". The label effectively forces the alphabetical ordering of the crosstab to work in my favor: the alphabetizing part of the label is concatenated with "_" and the label that I actually want to display.
table_1=pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, f_dtflt.COLLECTION_DATE.apply(lambda x: x.strftime("%y %m_%b %y")), margins=True)
Output:
"COLLECTION_DATE 12 10_Oct 12 12 11_Nov 12 12 12_Dec 12 13 01_Jan 13
EW_REGIONCOLLSITE
EAST 825 2108 2280 2757
WEST 42 407 1003 2216
All 867 2515 3283 4973
COLLECTION_DATE 13 02_Feb 13 13 03_Mar 13 13 04_Apr 13 13 05_May 13
EW_REGIONCOLLSITE
EAST 2272 1682 1964 1981
WEST 2351 2770 2579 3014
All 4623 4452 4543 4995
COLLECTION_DATE 13 06_Jun 13 13 07_Jul 13 13 08_Aug 13 13 09_Sep 13
EW_REGIONCOLLSITE
EAST 1902 2113 2092 975
WEST 1823 1506 2011 888
All 3725 3619 4103 1863
COLLECTION_DATE All
EW_REGIONCOLLSITE
EAST 22951
WEST 20610
All 43561 "
The function and calls:
def clean_label(label_list, margins=False):
    ''' This function takes the column index list from a crosstab (or pivot table?) in pandas and removes the
    part of the label before and including the "_". This allows the user to order the columns manually by creating
    an alphabetical index followed by "_" and then the label that they would like to use. For example, a label such as
    ['a_Positive', 'b_Negative'] will be converted to ['Positive', 'Negative']. Another example would be to order dates
    in a table from ['12 10_Oct 12', '12 11_Nov 12'] to ['Oct 12', 'Nov 12'].

    margins = False if the crosstab was created without margins and therefore does not have an "All" at the end of the list
    margins = True if the crosstab was created with margins and therefore has an "All" at the end of the list
    '''
    corrected_list = list()
    # If one creates margins in pivot/crosstab, the last column will be "All".
    # It has to be dropped before the split below or it will throw an error.
    if margins:
        convert_list = label_list[:-1]
    else:
        convert_list = label_list
    for l in convert_list:
        x, y = l.split('_')
        corrected_list.append(y)
    if margins:
        corrected_list.append('Total')  # Renames "All" to "Total"
    return corrected_list
# Change the labels on the crosstab table
table_1.columns=clean_label(table_1.columns, margins=True)
# Change name of columns
table_1.columns.name = 'Month of Collection'
# Change name of rows
table_1.index.name = 'Region'
Output (final table):
"Month of Collection Oct 12 Nov 12 Dec 12 Jan 13 Feb 13 Mar 13 Apr 13
Region
EAST 825 2108 2280 2757 2272 1682 1964
WEST 42 407 1003 2216 2351 2770 2579
All 867 2515 3283 4973 4623 4452 4543
Month of Collection May 13 Jun 13 Jul 13 Aug 13 Sep 13 Total
Region
EAST 1981 1902 2113 2092 975 22951
WEST 3014 1823 1506 2011 888 20610
All 4995 3725 3619 4103 1863 43561 "
If you've formatted the labels as year-month strings (so that they're already in the correct order), you could simply reverse each one:
In [1]: df = pd.DataFrame([['a', 'b']], columns=['12 Mar', '12 Jun'])
In [2]: df.columns.map(lambda yb: ' '.join(reversed(yb.split())))
Out[2]: array(['Mar 12', 'Jun 12'], dtype=object)
In [3]: df.columns = df.columns.map(lambda yb: ' '.join(reversed(yb.split())))
I had suggested you could do this with periods:
pd.DatetimeIndex(f_dtflt['COLLECTION_DATE']).to_period('M')
Then after you can clean the column to the format you require:
df.columns = df.columns.map(lambda x: x.strftime("%b %y"))
df.columns.name = 'COLLECTION_DATE'
but this appears to change period index into int (possibly a bug?).
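For what it's worth, on a modern pandas the period approach does work end to end. A sketch, assuming f_dtflt as in the question: cross-tabulate on month periods so the columns sort chronologically, and relabel only after the table is built.

import pandas as pd

# Periods sort chronologically, so the crosstab columns come out in date order.
months = f_dtflt['COLLECTION_DATE'].dt.to_period('M')
table = pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, months, margins=True)
# The margins column is the string 'All', not a Period, hence the guard.
table.columns = [c.strftime('%b %y') if isinstance(c, pd.Period) else c
                 for c in table.columns]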