I'm trying to build a linear regression model with just one predictor variable for a retail dataset. The predictor I'm trying to use is called Description.
The Description column contains numeric codes for the product category, and its data type was initially int64.
| InvoiceNo | CustomerID | StockCode | Quantity | Description | Country | Year | Month | Day | dayofWeek | HourofDay |
|---|---|---|---|---|---|---|---|---|---|---|
| 4501 | 14730.0 | 3171 | 2 | 1324.0 | 35 | 2011 | 3 | 28 | 0 | 12 |
| 2220 | 14442.0 | 1483 | 2 | 368.0 | 6 | 2011 | 1 | 27 | 3 | 9 |
| 5736 | 15145.0 | 2498 | 12 | 2799.0 | 35 | 2011 | 4 | 26 | 1 | 16 |
| 3347 | 12678.0 | 1809 | 12 | 48.0 | 13 | 2011 | 2 | 28 | 0 | 11 |
| 14246 | 14179.0 | 1510 | 1 | 953.0 | 35 | 2011 | 10 | 18 | 1 | 1 |
When I tried to fit the model with just that, it worked perfectly.
X_train_1 = X_train['Description'].values.reshape(-1, 1)
X_train_1.columns = ['Description']
X_train_1.shape
mod = LinearRegression()
mod.fit(X_train_1, y_train)
However, when I changed the data type of the Description column from int64 to category, it throws this error:
ValueError Traceback (most recent call last)
<ipython-input-48-7243890f1de1> in <module>()
4
5 mod = LinearRegression()
----> 6 mod.fit(X_train_1, y_train)
5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in take_nd(arr, indexer, axis, out, fill_value, allow_fill)
1711 arr.ndim, arr.dtype, out.dtype, axis=axis, mask_info=mask_info
1712 )
-> 1713 func(arr, indexer, out, fill_value)
1714
1715 if flip_order:
pandas/_libs/algos_take_helper.pxi in pandas._libs.algos.take_1d_int64_int64()
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Any idea why this is happening? Any help would be much appreciated.
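For what it's worth, one thing that usually sidesteps this kind of dimension error is handing scikit-learn plain numeric data instead of a reshaped Categorical. A minimal sketch, assuming X_train still holds the category-typed Description column:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Option 1: fall back to the underlying integer codes (treats the codes as numeric)
X_train_1 = X_train['Description'].cat.codes.to_numpy().reshape(-1, 1)

# Option 2: one-hot encode the categories (the usual choice for nominal data)
# X_train_1 = pd.get_dummies(X_train['Description'], prefix='Description')

mod = LinearRegression()
mod.fit(X_train_1, y_train)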
Related
"I have to change the format of a object list ['Dev'] to float64, but python send me this error. 'Dev'its a column of objects from a csv table ('prueba.csv') and I need to translate it for create statistics/metrics, could you help me?"
# Start the code
import pandas as pd
import numpy as np

# Read the data from the .csv file
df = pd.read_csv('prueba.csv', sep=';', decimal=',', encoding='ISO-8859-1')

# Data description
df.dtypes
Element object
Property object
Nominal float64
Actual object
Tol - float64
Tol + float64
Dev object
Check float64
Out float64
dtype: object
# The column that needs converting to float
df['Dev']
0 0,69
1 0,62
2 0,54
3 0,47
4 0,19
5 -0,26
6 0,11
7 0,1
8 0,2
9 0,29
10 -1,54
11 -2
12 -2,06
13 -2,02
14 -2,08
15 -1,39
16 -1,68
17 -1,91
18 -1,78
19 -1,8
20 -1,21
21 -1,07
22 -0,97
23 -1,47
24 -1,35
25 -0,91
26 -1,17
27 -0,67
28 -1,12
29 -1,13
...
1962 0
1963 -0,37
1964 0,02
1965 0,32
1966 0,04
1967 0
1968 0,39
1969 0,25
1970 0,38
1971 0,15
1972 0
1973 1,11
1974 -1,13
1975 0,15
1976 0,12
1977 0
1978 -0,47
1979 -0,85
1980 0,08
1981 0,23
1982 0
1983 1,03
1984 -0,76
1985 -0,03
1986 0,02
1987 0
1988 0,36
1989 -1,45
1990 0,12
1991 0,09
Name: Dev, Length: 1992, dtype: object
# Function definition for conversion
def convert_currency(val):
    """
    - Remove commas and u'
    - Convert to float type
    """
    new_val = val.replace("u'", "'")
    return float(new_val)

df['Dev'].apply(convert_currency)
#Python error;
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-93-9ac4450806f7> in <module>()
----> 1 df['Dev'].apply(convert_currency)
C:\Users\kaosb\miniconda2\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
3589 else:
3590 values = self.astype(object).values
-> 3591 mapped = lib.map_infer(values, f, convert=convert_dtype)
3592
3593 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-89-0c43557e37eb> in convert_currency(val)
5 """
6 new_val = val.replace("u'","'")
----> 7 return float(new_val)
ValueError: invalid literal for float(): 0,69
Looks like you are not replacing the commas:

def convert_currency(val):
    """
    - Remove commas and u'
    - Convert to float type
    """
    new_val = val.replace("u'", "").replace(",", ".")
    return float(new_val)

would do the trick.
Your floats need a dot instead of a comma, as you said. You have to replace the commas with dots before calling float(); otherwise float() rejects the value as an invalid literal.
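A vectorised alternative along the same lines (a sketch, doing the replacement and cast on the whole column at once rather than via apply):

# Swap the decimal comma for a dot, then cast the whole column to float
df['Dev'] = df['Dev'].str.replace(',', '.', regex=False).astype(float)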
Just change your encoding to latin_1. I tried it locally and it works great; the column's dtype is automatically inferred as float64.
I have the following dataframe (cusid is the customer id, product is the product id bought by the customer, and count is the purchase count of that product):
| cusid | product | count |
|---|---|---|
| 1521 | 30 | 2 |
| 18984 | 99 | 1 |
| 25094 | 1 | 1 |
| 2363 | 36 | 1 |
| 3316 | 21 | 1 |
| 19249 | 228 | 1 |
| 13220 | 78 | 1 |
| 1226 | 79 | 4 |
| 1117 | 112 | 2 |
I want to calculate the average number of each product that a customer buys.
It seems I need to group by cusid and product first, then group by product, and then take the mean of count.
My expected output:
product mean(count)
30
99
1
36
Here is my code:
(df.groupby(['product','cusid']).mean().groupby('product')['count'].mean())
got the error:
TypeError Traceback (most recent call last)
<ipython-input-43-0fac990bbd61> in <module>()
----> 1 (df.groupby(['product','cusid']).mean().groupby('product')['count'].mean())
TypeError: groupby() takes at least 3 arguments (2 given)
I have no idea how to fix it.
Resetting the index after the first groupby, so that product becomes a regular column again before the second groupby, avoids the error:
df.groupby(['cusid', 'product']).mean().reset_index().groupby('product')['count'].mean()
OUTPUT:
product
1 1
21 1
30 2
36 1
78 1
79 4
99 1
112 2
228 1
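An equivalent way to write the same computation, grouping the MultiIndexed result by its product level instead of resetting the index (a sketch):

# Mean count per (customer, product), then averaged over customers for each product
per_customer = df.groupby(['cusid', 'product'])['count'].mean()
per_customer.groupby(level='product').mean()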
python version: 3.7.4
pandas version: 0.25.0
I have a dataframe, attdf, as shown in the picture:
[picture: problem dataframe attdf]
I would like to group the data by Source Class and Destination Class, count the number of rows in each group, and sum the Attention values. While trying to do that, I am unable to get past this type error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-100-6f2c8b3de8f2> in <module>()
----> 1 attdf.groupby(['Source Class', 'Destination Class']).count()
8 frames
pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
458 table = hash_klass(size_hint or len(values))
459 uniques, labels = table.factorize(values, na_sentinel=na_sentinel,
--> 460 na_value=na_value)
461
462 labels = ensure_platform_int(labels)
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
attdf.groupby(['Source Class', 'Destination Class'])
gives me a <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f1e720f2080> which I'm not sure how to use to get what I want.
Dataframe attdf can be imported from : https://drive.google.com/open?id=1t_h4b8FQd9soVgYeiXQasY-EbnhfOEYi
Please advise.
@Adam.Er8 and @jezarael helped me with their inputs. The unhashable type error in my case was caused by the datatypes of the columns in my dataframe.
[picture: original df vs. df imported from csv]
It turned out that the original dataframe had two object columns that I was trying to use in the groupby, hence the unhashable type error. Importing the data into a new dataframe straight from the csv fixed the datatypes, and consequently no type errors were raised anymore.
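A small check along those lines (a sketch): list the object-dtype columns whose cells hold numpy arrays, since array-valued cells are unhashable and cannot serve as groupby keys.

import numpy as np

# Columns whose cells contain numpy arrays would raise "unhashable type: 'numpy.ndarray'"
bad_cols = [col for col in attdf.columns
            if attdf[col].map(lambda v: isinstance(v, np.ndarray)).any()]
print(bad_cols)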
Try using .agg as follows:
import pandas as pd
attdf = pd.read_csv("attdf.csv")
print(attdf.groupby(['Source Class', 'Destination Class']).agg({"Attention": ['sum', 'count']}))
Output:
Attention
sum count
Source Class Destination Class
0 0 282.368908 1419
1 7.251101 32
2 3.361009 23
3 22.482438 161
4 14.020189 88
5 10.138409 75
6 11.377947 80
1 0 6.172269 32
1 181.582437 1035
2 9.440956 62
3 12.007303 67
4 3.025752 20
5 4.491725 28
6 0.279559 2
2 0 3.349921 23
1 8.521828 62
2 391.116034 2072
3 9.937170 53
4 0.412747 2
5 4.441985 30
6 0.220316 2
3 0 33.156251 161
1 11.944373 67
2 9.176584 53
3 722.685180 3168
4 29.776050 137
5 8.827215 54
6 2.434347 16
4 0 17.431855 88
1 4.195519 20
2 0.457089 2
3 20.401789 137
4 378.802604 1746
5 3.616083 19
6 1.095061 6
5 0 13.525333 75
1 4.289306 28
2 6.424412 30
3 10.911705 54
4 3.896328 19
5 250.309764 1132
6 8.643153 46
6 0 15.249959 80
1 0.150240 2
2 0.413639 2
3 3.108417 16
4 0.850280 6
5 8.655959 46
6 151.571505 686
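If you are on pandas 0.25 or newer, named aggregation is another option and gives flat column names (a sketch; the attention_sum/attention_count names are just illustrative):

out = attdf.groupby(['Source Class', 'Destination Class']).agg(
    attention_sum=('Attention', 'sum'),
    attention_count=('Attention', 'count'),
)
print(out)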
I am trying to access the labels (i.e. positional indicator) after binning my data by decile:
q = pd.qcut(df["revenue"], 10)
q.head():
7 (317.942, 500.424]
81 (317.942, 500.424]
83 (150.65, 317.942]
84 [0.19, 150.65]
85 (317.942, 500.424]
Name: revenue, dtype: category
Categories (10, object): [[0.19, 150.65] < (150.65, 317.942] < (317.942, 500.424] < (500.424, 734.916] ... (1268.306, 1648.35]
< (1648.35, 1968.758] < (1968.758, 2527.675] < (2527.675, 18690.2]]
This post shows that you can do the following to access the labels:
>>> q.labels
But when I do that I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-246-e806c96b1ab2> in <module>()
----> 1 q.labels
C:\Users\blah\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
2666 if (name in self._internal_names_set or name in self._metadata or
2667 name in self._accessors):
-> 2668 return object.__getattribute__(self, name)
2669 else:
2670 if name in self._info_axis:
AttributeError: 'Series' object has no attribute 'labels'
In any case, what I want to do is use the labels to filter my data - likely by adding a new column in df which represents the positional label of the result of the decile (or quantile).
I personally like using the labels parameter in pd.qcut to specify clean looking and consistent labels.
np.random.seed([3,1415])
df = pd.DataFrame(dict(revenue=np.random.randint(1000000, 99999999, 100)))
df['decile'] = pd.qcut(df.revenue, 10, labels=range(10))
print(df.head())
As @jeremycg pointed out, you can access the category information via the .cat accessor:
df.decile.cat.categories
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
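If you already have a plain pd.qcut result (without the labels argument), the positional label per row is also available as the categorical's integer codes; a minimal sketch:

# Integers 0-9, equivalent to what labels=range(10) would produce
df['decile_code'] = pd.qcut(df.revenue, 10).cat.codes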
You can quickly describe each bin
df.groupby('decile').describe().unstack()
You can filter
df.query('decile >= 8')
revenue decile
4 98274570 9
6 99418302 9
19 89598752 8
20 88877661 8
22 90789485 9
29 83126518 8
31 90700517 9
33 96816407 9
40 89937348 8
54 83041116 8
65 83399066 8
66 97055576 9
79 87700403 8
81 88592657 8
82 91963755 9
83 82443566 8
84 84880509 8
88 98603752 9
95 92548497 9
98 98963891 9
You can compute z-scores within deciles:
df = df.join(df.groupby('decile').revenue.agg(dict(Mean='mean', Std='std')), on='decile')
df['revenue_zscore_by_decile'] = df.revenue.sub(df.Mean).div(df.Std)
df.head()
revenue decile Std Mean revenue_zscore_by_decile
0 32951600 2 2.503325e+06 29649669 1.319018
1 70565451 6 9.639336e+05 71677761 -1.153928
2 6602402 0 5.395453e+06 11166286 -0.845876
3 82040251 7 2.976992e+06 78299299 1.256621
4 98274570 9 3.578865e+06 95513475 0.771500
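An equivalent z-score computation using transform, which avoids the renaming dict in .agg (a sketch):

# Per-decile mean and std broadcast back to each row via transform
grp = df.groupby('decile')['revenue']
df['revenue_zscore_by_decile'] = (df['revenue'] - grp.transform('mean')) / grp.transform('std')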
total_val_count = dataset[attr].value_counts()
for i in range(len(total_val_count.index)):
    print total_val_count[i]
I have written this piece of code, which counts the occurrences of all distinct values of an attribute in a dataframe. The problem I am facing is that I am unable to access the first value using index 0; I get a KeyError: 0 on the very first iteration of the loop.
The total_val_count contains proper values as shown below:
34 2887
4 2708
13 2523
35 2507
33 2407
3 2404
36 2382
26 2378
16 2282
22 2187
21 2141
12 2104
25 2073
5 2052
15 2044
17 2040
14 2027
28 1984
27 1980
23 1979
24 1960
30 1953
29 1936
31 1884
18 1877
7 1858
37 1767
20 1762
11 1740
8 1722
6 1693
32 1692
10 1662
9 1576
19 1308
2 1266
1 175
38 63
dtype: int64
total_val_count is a Series. The index of the Series holds the values from dataset[attr],
and the values of the Series are the number of times each of those values appears in dataset[attr].
When you index a Series with total_val_count[i], pandas looks for i in the index and returns the associated value. In other words, total_val_count[i] indexes by index label, not by ordinal position.
Think of a Series as a mapping from the index to the values. When using plain indexing, e.g. total_val_count[i], it behaves more like a dict than a list.
You are getting a KeyError because 0 is not a value in the index.
To index by ordinal, use total_val_count.iloc[i].
Having said that, using for i in range(len(total_val_count.index)) -- or, what amounts to the same thing, for i in range(len(total_val_count)) -- is not recommended. Instead of
for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
you could use
for value in total_val_count.values:
    print(value)
This is more readable, and allows you to access the desired value as a variable, value, instead of the more cumbersome total_val_count.iloc[i].
Here is an example which shows how to iterate over the values, over the keys, and over both keys and values together:
import pandas as pd
s = pd.Series([1, 2, 3, 2, 2])
total_val_count = s.value_counts()
print(total_val_count)
# 2 3
# 3 1
# 1 1
# dtype: int64
for value in total_val_count.values:
    print(value)
# 3
# 1
# 1
for key in total_val_count.keys():
    print(key)
# 2
# 3
# 1
for key, value in total_val_count.iteritems():
    print(key, value)
# (2, 3)
# (3, 1)
# (1, 1)
for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
# 3
# 1
# 1