SFrame KMeans - Convert to Int, Float, Dict - Python

I'm preparing data to run KMeans from GraphLab, and am running into the following error:
tmp = data.select_columns(['a.item_id'])
tmp['sku'] = tmp['a.item_id'].apply(lambda x: x.split(','))
tmp = tmp.unpack('sku')
kmeans_model = gl.kmeans.create(tmp, num_clusters=K)
Feature 'sku.0' excluded because of its type. Kmeans features must be int, float, dict, or array.array type.
Feature 'sku.1' excluded because of its type. Kmeans features must be int, float, dict, or array.array type.
Here are the current datatypes of each column:
a.item_id str
sku.0 str
sku.1 str
If I can get the datatype from str to int, I think it should work. However, SFrames are a bit trickier to work with than standard Python libraries. Any help getting there is appreciated.

The kmeans model does allow features in dictionary form, just not in list form. This is slightly different from what you've got now, because the dictionary loses the order of your SKUs, but in terms of model quality I suspect it actually makes more sense. The key function is count_words, in the text analytics toolkit.
https://dato.com/products/create/docs/generated/graphlab.text_analytics.count_words.html
import graphlab as gl
sf = gl.SFrame({'item_id': ['abc,xyz,cat', 'rst', 'abc,dog']})
sf['sku_count'] = gl.text_analytics.count_words(sf['item_id'], delimiters=[','])
model = gl.kmeans.create(sf, num_clusters=2, features=['sku_count'])
print model.cluster_id
+--------+------------+----------------+
| row_id | cluster_id |    distance    |
+--------+------------+----------------+
|   0    |     1      | 0.866025388241 |
|   1    |     0      |      0.0       |
|   2    |     1      | 0.866025388241 |
+--------+------------+----------------+
[3 rows x 3 columns]
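If you do still want numeric columns instead of the dictionary approach, SArray has an astype method. A minimal sketch, assuming every unpacked SKU value is actually a numeric string (non-numeric values would raise):

# Hedged sketch: cast the unpacked string columns so kmeans will accept them.
# Assumes each sku value parses as an integer.
tmp['sku.0'] = tmp['sku.0'].astype(int)
tmp['sku.1'] = tmp['sku.1'].astype(int)
kmeans_model = gl.kmeans.create(tmp, num_clusters=K, features=['sku.0', 'sku.1'])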


Get just one value of a nested list inside a dictionary to create a Dataframe

I am using an API that returns a dictionary with a nested list inside; let's name it coins_best. The result looks like this:
{'bitcoin': [[1603782192402, 13089.646908288987],
             [1603865643028, 13712.070136258053]],
 'ethereum': [[1603782053064, 393.6741989091851],
              [1603865024078, 404.86117057956386]]}
The first value in each list is a timestamp, while the second is a price in dollars. I want to create a DataFrame with the prices, using the timestamps as the index. I tried this code to do it in just one step:
d = pd.DataFrame()
for id, obj in coins_best.items():
    for i in range(0, len(obj)):
        temp = pd.DataFrame({
            obj[i][1]
        })
        d = pd.concat([d, temp])
d
This attempt gave me a DataFrame with just one column, not the two required, because passing id to the columns argument threw an error (TypeError: Index(...) must be called with a collection of some kind, 'bitcoin' was passed).
Then I tried with comprehensions to preprocess the dictionary and their lists:
for k in coins_best.keys():
    inner_lists = (coins_best[k] for inner_dict in coins_best.values())
    items = (item[1] for ls in inner_lists for item in ls)
I could not obtain both elements of the dictionary this way, only the last one.
I know it is possible to try:
df = pd.DataFrame(coins_best, columns=coins_best.keys())
Which gives me:
bitcoin ethereum
0 [1603782192402, 13089.646908288987] [1603782053064, 393.6741989091851]
1 [1603785693143, 13146.275972229188] [1603785731599, 394.6174435303511]
and then try to remove the first element in every list of every row, but that was even harder for me. The required output is:
               bitcoin             ethereum
1603782192402  13089.646908288987  393.6741989091851
1603785693143  13146.275972229188  394.6174435303511
Do you know how to process the dictionary before creating the DataFrame in order to get this result?
This is my first question; I tried to be as clear as possible. Thank you very much.
Update #1
The answer by Sander van den Oord also solved the problem of the timestamps and is useful for its purpose. However, the sample code, while correct (it used the info provided), was limited to those two keys. This is the final code that solved the problem for every key in the dictionary.
df_coins = pd.DataFrame()  # start empty so the first concat has something to append to
for k in coins_best:
    df_coins1 = pd.DataFrame(data=coins_best[k], columns=['timestamp', k])
    df_coins1['timestamp'] = pd.to_datetime(df_coins1['timestamp'], unit='ms')
    df_coins = pd.concat([df_coins1, df_coins], sort=False)
df_coins_resampled = df_coins.set_index('timestamp').resample('d').mean()
Thank you very much for your answers.
I think you shouldn't ignore the fact that values of coins are taken at different times. You could do something like this:
import pandas as pd
import hvplot.pandas
coins_best = {
    'bitcoin': [[1603782192402, 13089.646908288987],
                [1603865643028, 13712.070136258053]],
    'ethereum': [[1603782053064, 393.6741989091851],
                 [1603865024078, 404.86117057956386]],
}
df_bitcoin = pd.DataFrame(data=coins_best['bitcoin'], columns=['timestamp', 'bitcoin'])
df_bitcoin['timestamp'] = pd.to_datetime(df_bitcoin['timestamp'], unit='ms')
df_ethereum = pd.DataFrame(data=coins_best['ethereum'], columns=['timestamp', 'ethereum'])
df_ethereum['timestamp'] = pd.to_datetime(df_ethereum['timestamp'], unit='ms')
df_coins = pd.concat([df_ethereum, df_bitcoin], sort=False)
Your df_coins will now look like this:
+----+----------------------------+------------+-----------+
| | timestamp | ethereum | bitcoin |
|----+----------------------------+------------+-----------|
| 0 | 2020-10-27 07:00:53.064000 | 393.674 | nan |
| 1 | 2020-10-28 06:03:44.078000 | 404.861 | nan |
| 0 | 2020-10-27 07:03:12.402000 | nan | 13089.6 |
| 1 | 2020-10-28 06:14:03.028000 | nan | 13712.1 |
+----+----------------------------+------------+-----------+
Now if you want values to be on the same line, you could use resampling. Here I do it per day: all values of the same day for a coin type are averaged:
df_coins_resampled = df_coins.set_index('timestamp').resample('d').mean()
df_coins_resampled will look like this:
+---------------------+------------+-----------+
| timestamp | ethereum | bitcoin |
|---------------------+------------+-----------|
| 2020-10-27 00:00:00 | 393.674 | 13089.6 |
| 2020-10-28 00:00:00 | 404.861 | 13712.1 |
+---------------------+------------+-----------+
I like to use hvplot to get an interactive plot of the result:
df_coins_resampled.hvplot.scatter(
x='timestamp',
y=['bitcoin', 'ethereum'],
s=20, padding=0.1
)
Resulting plot:
There are different timestamps, so the correct output looks different from what you presented, but other than that, it's a one-liner (where d is your input dictionary):
pd.concat([pd.DataFrame(val, columns=['timestamp', key]).set_index('timestamp') for key, val in d.items()], axis=1)
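Put together as a minimal, self-contained sketch (the pd.to_datetime conversion of the millisecond index is an optional extra I've added, not part of the original one-liner):

import pandas as pd

d = {'bitcoin': [[1603782192402, 13089.646908288987],
                 [1603865643028, 13712.070136258053]],
     'ethereum': [[1603782053064, 393.6741989091851],
                  [1603865024078, 404.86117057956386]]}

# One DataFrame per coin, indexed by timestamp, then aligned column-wise on that index.
df = pd.concat([pd.DataFrame(val, columns=['timestamp', key]).set_index('timestamp')
                for key, val in d.items()], axis=1)

# Optional: turn the epoch-millisecond index into datetimes.
df.index = pd.to_datetime(df.index, unit='ms')
print(df)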

Print Data Frame columns and modified types

I work with data frames that are large in both length and width. Frequently I want to move a data frame into a SQL database table, which means writing out each column's type in my CREATE statement. As the number of columns grows past 200, doing this by hand becomes cumbersome.
This seems like a good opportunity for a function: check the column types of a data frame and return the appropriate Postgres column types for me to copy and paste.
R -> PG12 Translation https://www.postgresql.org/docs/12/datatype-numeric.html
| R class | PG12 datatype | Note |
|--------------|---------------------------|----------------------------------------------------------------------------------------------------------------------|
| factor, char | text | text can handle varying length strings |
| integer | smallint | if abs(x) <= 32767 then smallint |
| integer | integer | if abs(x) <= 2147483647 then integer |
| integer | bigint | if abs(x) <= 9223372036854775807 then bigint |
| numeric | smallint, integer, bigint | if there is nothing in the decimal places `4.0` coerce it to integer to save space |
| numeric | numeric(precision,scale) | precision = nchar(unlist(strsplit(x = as.character(10.045), split = ".", fixed = T))[2]); scale = max(nchar(10.045)) |
| Date | date | |
For example, head(Orange) would give a small table of column names and Postgres column types that I can copy and paste into a CREATE statement.
Are there any solutions for this that people have found or could someone help with the logic for this function? Thanks in advance for any help!
No one has answered yet, so here's a functional (not optimal) solution I worked up.
x <- data.frame(smallint = 0:10,
                smallint2 = 0.0:10.0,
                integer = 1e6:1.00001e6,
                bigint1 = 1e11:1.0000000001e11,
                bigint2 = 1e10:1.000000001e10,
                bigint3 = 1e9:1.00000001e9,
                bigint4 = 1e8:1.0000001e8,
                text1 = LETTERS[1:11],
                text2 = factor(LETTERS[1:11]),
                date = as.Date("1985-01-21"),
                timestamp = Sys.time(),
                stringsAsFactors = F)
foo <- function(df){
  # Map each column to its Postgres type via RPostgreSQL::dbDataType (requires an open connection `con`)
  col.info <- sapply(X = df, FUN = function(z){RPostgreSQL::dbDataType(dbObj = con, obj = z)})
  # Print "column  type," pairs, ready to paste into a CREATE TABLE statement
  print(data.frame(col = row.names(data.frame(col.info)), col.type = paste0(col.info, ",")), row.names = F)
}
foo(df = x)
foo(df = Orange)

How to access elements of the confusion matrix in H2O for Python?

I made a grid search that contains 36 models.
For each model the confusion matrix is available with:
grid_search.get_grid(sort_by='a_metrics', decreasing=True)[index].confusion_matrix(valid=valid_set)
The problem is that I only want to access certain parts of this confusion matrix in order to build my own ranking, which is not natively available in h2o.
Let's say we have the confusion_matrix of the first model of the grid_search below:
+---+-------+--------+--------+--------+------------------+
|   |       |      0 |      1 | Error  | Rate             |
+---+-------+--------+--------+--------+------------------+
| 0 | 0     |  766.0 | 2718.0 | 0.7801 | (2718.0/3484.0)  |
| 1 | 1     |  351.0 | 6412.0 | 0.0519 | (351.0/6763.0)   |
| 2 | Total | 1117.0 | 9130.0 | 0.2995 | (3069.0/10247.0) |
+---+-------+--------+--------+--------+------------------+
Actually, the only thing that really interests me is the precision of class 0, i.e. 766/1117 = 0.685765443, while h2o computes its precision metric over all classes, which works against what I am looking for.
I tried to convert it in dataframe with:
model = grid_search.get_grid(sort_by='a_metrics', decreasing=True)[0]
model.confusion_matrix(valid=valid_set).as_data_frame()
Even though some posts on the internet suggest this works, it actually does not (or does not anymore):
AttributeError: 'ConfusionMatrix' object has no attribute 'as_data_frame'
I have searched for a way to list the attributes of the confusion_matrix, without success.
According to the H2O documentation there is no as_data_frame method: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/model/confusion_matrix.html
I assume the easiest way is to call to_list().
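For example, a hedged sketch of pulling out the class-0 precision that way, assuming to_list() returns the raw 2x2 counts with rows as actual classes and columns as predicted classes (matching the table above):

counts = model.confusion_matrix(valid=valid_set).to_list()
# counts[actual][predicted]: everything predicted as class 0 is the first column
predicted_as_0 = counts[0][0] + counts[1][0]
precision_0 = counts[0][0] / float(predicted_as_0)  # e.g. 766.0 / 1117.0 = 0.6858 from the table above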
The .table attribute gives an object with an as_data_frame method.
model.confusion_matrix(valid=valid_set).table.as_data_frame()
If you need to access the table header, you can do
model.confusion_matrix(valid=valid_set).table._table_header
Hint: You can use dir() to check the valid attributes of a python object.

Apply method in Pandas cannot handle a function

I am new to pandas. The following is a subset of a dataframe named news.
Id is the id of the news item and the text column contains the news text:
Id  text
1   the news is really bad.
2   I do not have any courses.
3   Asthma is very prevalent.
4   depression causes disability.
I am going to calculate the sentiment of each news item in the "text" column, and I need to create a column to hold the result of the sentiment analysis.
This is my code:
from textblob import TextBlob
review = TextBlob(news.loc[0,'text'])
print (review.sentiment.polarity)
This code works for just one news item in the text column.
I also wrote this function:
def detect_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity
news['sentiment'] = news.text.apply(detect_sentiment)
But it has the following error:
The `text` argument passed to `__init__(text)` must be a string, not <class 'float'>
Any solution?
I cannot reproduce your bug: your exact code works perfectly fine for me using pandas==0.24.2 and Python 3.4.3:
import pandas as pd
from textblob import TextBlob
news = pd.DataFrame(["the news is really bad.",
                     "I do not have any courses.",
                     "Asthma is very prevalent.",
                     "depression causes disability."], columns=["text"])

def detect_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity
news['sentiment'] = news.text.apply(detect_sentiment)
display(news)
Result:
+----+-------------------------------+-------------+
| | text | sentiment |
|----+-------------------------------+-------------|
| 0 | the news is really bad. | -0.7 |
| 1 | I do not have any courses. | 0 |
| 2 | Asthma is very prevalent. | 0.2 |
| 3 | depression causes disability. | 0 |
+----+-------------------------------+-------------+
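That error message (<class 'float'>) usually means the real news frame contains NaN values in the text column, which pandas stores as floats. A hedged sketch of a guard for that case (the neutral fallback value of 0.0 is my assumption, not part of the original code):

def detect_sentiment(text):
    # NaN cells arrive as floats; treat them as neutral instead of crashing TextBlob
    if not isinstance(text, str):
        return 0.0
    return TextBlob(text).sentiment.polarity

news['sentiment'] = news['text'].apply(detect_sentiment)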

Pairing two Pandas data frames with an ID value

I am trying to put together a usable set of data about glaciers. Our original data comes from an ArcGIS dataset, and latitude/longitude values were stored in a separate file, now detached from the CSV with all of our data. I am attempting to merge the latitude/longitude file with our data set. Here's a preview of what the files look like.
This is my main dataset file, glims (columns dropped for clarity)
| ANLYS_ID | GLAC_ID | AREA |
|----------|----------------|-------|
| 101215 | G286929E46788S | 2.401 |
| 101146 | G286929E46788S | 1.318 |
| 101162 | G286929E46788S | 0.061 |
This is the latitude-longitude file, coordinates
| lat | long | glacier_id |
|-------|---------|----------------|
| 1.187 | -70.166 | G001187E70166S |
| 2.050 | -70.629 | G002050E70629S |
| 3.299 | -54.407 | G002939E70509S |
The problem is, the coordinates data frame has one row per glacier id with its latitude/longitude, whereas my glims data frame has multiple rows for each glacier id, with varying data in each entry.
I need every single entry in my main data file to have a latitude-longitude value added to it, based on the matching glacier_id between the two data frames.
Here's what I've tried so far.
glims = pd.read_csv('glims_clean.csv')
coordinates = pd.read_csv('LatLong_GLIMS.csv')

df['que'] = np.where((coordinates['glacier_id'] ==
                      glims['GLAC_ID']))
error returns: 'int' object is not subscriptable
and:
glims.merge(coordinates, how='right', on=('glacier_id', 'GLAC_ID'))
error returns: 'int' object has no attribute 'merge'
I have no idea how to tackle a merge this big. I am also afraid of making mistakes, because they are nearly impossible to catch since the data carries no other identifying factors.
Any guidance would be awesome, thank you.
This should work
glims = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
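As an illustration, a minimal sketch using the sample rows from the question (note that the sample GLAC_ID does not appear among the sample glacier_id values, so these particular rows get NaN coordinates; real data with matching ids gets the lat/long filled in):

import pandas as pd

glims = pd.DataFrame({'ANLYS_ID': [101215, 101146, 101162],
                      'GLAC_ID': ['G286929E46788S'] * 3,
                      'AREA': [2.401, 1.318, 0.061]})
coordinates = pd.DataFrame({'lat': [1.187, 2.050, 3.299],
                            'long': [-70.166, -70.629, -54.407],
                            'glacier_id': ['G001187E70166S', 'G002050E70629S', 'G002939E70509S']})

# Left merge keeps every analysis row and attaches lat/long wherever the glacier ids match.
merged = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
print(merged)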
This is a classic merging problem. One way to solve it is with straight loc and index matching:
glims = glims.set_index('GLAC_ID')
glims.loc[:, 'lat'] = coord.set_index('glacier_id').lat
glims.loc[:, 'long'] = coord.set_index('glacier_id').long
glims = glims.reset_index()
You can also use pd.merge
pd.merge(glims,
         coord.rename(columns={'glacier_id': 'GLAC_ID'}),
         on='GLAC_ID')
