How to multiply two data frame columns, column by column? - python

I have to compute similarity between columns within one data frame. The expected result has the same shape as a correlation matrix, but the calculation function is different (I wrote my own calc function). The calc function should be called as calc_func(column1, column2). The idea is to get similarities between columns; the number of rows is not important. As output I expect a (937, 937) matrix.
Sample data
                  0011      0012          0013      0014      0015      0019  ...
Reporter ISO
AFG           0.149474  0.699753  0.000000e+00  0.000000  0.000000  6.084805  ...
AGO           0.000233  0.000000  2.169950e-05  0.000436  0.000021  0.206904  ...
ALB           0.004472  0.093826  0.000000e+00  0.000000  0.000000  4.959089  ...

[3 rows x 937 columns; remaining columns truncated]

Using exactly the example in the docs for df.corr():
import numpy as np
import pandas as pd

def histogram_intersection(a, b):
    v = np.minimum(a, b).sum().round(decimals=1)
    return v

df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
                  columns=['dogs', 'cats'])
df.corr(method=histogram_intersection)
# output:
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
So just pass your function in as the method parameter:
df.corr(method=calc_func)
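For reference, the callable passed to method must take two 1-D ndarrays and return a scalar; pandas fills the diagonal with 1 and mirrors the matrix for you. A minimal sketch with a hypothetical calc_func (cosine similarity here, not the asker's actual function):

import numpy as np

def calc_func(a, b):
    # hypothetical similarity: cosine similarity between two columns
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

sim = df.corr(method=calc_func)   # shape (n_columns, n_columns), e.g. (937, 937)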

Related

Eliminate unigrams in char-level TF-IDF

I would like to extract unique combinations of letters within words using the scikit-learn TF-IDF vectorizer for an NLP problem. However, I'm not interested in individual letters but in letter combinations, so that, e.g., "the" should produce "th" and "he" but not "t", "h" or "e". My understanding is that I should be able to use ngram_range. However, ngram_range=(2,3) is still returning unigrams.
Example:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

examples = ['The cat on the mat',
            'Fast and bulbous']
tfidf = TfidfVectorizer(max_features=None,
                        analyzer='char_wb',
                        ngram_range=(2, 3))
data = tfidf.fit_transform(examples)
print(pd.DataFrame(data=data.todense(),
                   index=examples,
                   columns=tfidf.get_feature_names_out()))
gives me the 2- and 3-gram results as expected but also unigrams (i.e. I don't want "a", "b", etc.):
                           a        an         b        bu         c  \
The cat on the mat  0.000000  0.000000  0.000000  0.000000  0.139994
Fast and bulbous    0.181053  0.181053  0.181053  0.181053  0.000000

                          ca         f        fa         m        ma  ...  \
The cat on the mat  0.139994  0.000000  0.000000  0.139994  0.139994  ...
Fast and bulbous    0.000000  0.181053  0.181053  0.000000  0.000000  ...

                           s        st        st         t        th  \
The cat on the mat  0.000000  0.000000  0.000000  0.199213  0.279987
Fast and bulbous    0.181053  0.181053  0.181053  0.128821  0.000000

                         the        ul       ulb        us        us
The cat on the mat  0.279987  0.000000  0.000000  0.000000  0.000000
Fast and bulbous    0.000000  0.000000  0.181053  0.181053  0.181053

[2 rows x 53 columns]
I would've expected this output with ngram_range=(1,3) but not ngram_range=(2,3).
Edit:
I just noticed that "a" is extracted from "Fast and bulbous", presumably as it occurs as " a", i.e. with a space before the "a", but not in "The cat on the mat" as the "a" in "cat" is surrounded by "c" and "t". Likewise, "u" is not extracted as there is no space surrounding it in either text.
It seems like TfidfVectorizer is extracting bigrams including spaces. Is there a way to turn this off? (I thought using analyzer='char_wb' searched within words rather than across words.)
I constructed a callable to pass to analyzer. It is based on the function from the source code of TfidfVectorizer used when analyzer is set to 'char_wb':
def char_wb_ngrams(text_document, ngram_range):
    """Callable for TfidfVectorizer analyzer, based on _char_wb_ngrams from TfidfVectorizer source code at
    https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/feature_extraction/text.py"""
    ngrams = []
    min_n, max_n = ngram_range
    for w in text_document.lower().split():
        # This line in _char_wb_ngrams pads words with spaces and needs to be removed:
        # w = " " + w + " "
        w_len = len(w)
        for n in range(min_n, max_n + 1):
            offset = 0
            ngrams.append(w[offset : offset + n])
            while offset + n < w_len:
                offset += 1
                ngrams.append(w[offset : offset + n])
            if offset == 0:  # count a short word (w_len < n) only once
                break
    return ngrams
This works on the example data from above:
from functools import partial

# Note: TfidfVectorizer ignores ngram_range when analyzer is a callable,
# so the range is fixed via partial instead.
tfidf_no_space = TfidfVectorizer(max_features=None,
                                 analyzer=partial(char_wb_ngrams, ngram_range=(2, 3)))
data = tfidf_no_space.fit_transform(examples)
print(pd.DataFrame(data=data.todense(),
                   index=examples,
                   columns=tfidf_no_space.get_feature_names_out()))
which yields
                          an       and        as       ast        at  \
The cat on the mat  0.000000  0.000000  0.000000  0.000000  0.436436
Fast and bulbous    0.229416  0.229416  0.229416  0.229416  0.000000

                          bo       bou        bu       bul        ca  ...  \
The cat on the mat  0.000000  0.000000  0.000000  0.000000  0.218218  ...
Fast and bulbous    0.229416  0.229416  0.229416  0.229416  0.000000  ...

                          nd        on        ou       ous        st  \
The cat on the mat  0.000000  0.218218  0.000000  0.000000  0.000000
Fast and bulbous    0.229416  0.000000  0.229416  0.229416  0.229416

                          th       the        ul       ulb        us
The cat on the mat  0.436436  0.436436  0.000000  0.000000  0.000000
Fast and bulbous    0.000000  0.000000  0.229416  0.229416  0.229416

[2 rows x 28 columns]
I'm not sure this would work with punctuation, though. It would be good to have a version that strips punctuation and also doesn't require the call to partial (which fixes ngram_range in the function).
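For what it's worth, one way to avoid partial and strip punctuation first is a small factory that closes over ngram_range (a sketch under those assumptions; make_char_wb_analyzer is a hypothetical helper, not from the original post):

import string

def make_char_wb_analyzer(ngram_range):
    # Build an analyzer with ngram_range baked in; punctuation is removed
    # before splitting (a simple heuristic, not from the original answer).
    table = str.maketrans('', '', string.punctuation)
    def analyzer(text_document):
        return char_wb_ngrams(text_document.translate(table), ngram_range)
    return analyzer

tfidf_no_punct = TfidfVectorizer(analyzer=make_char_wb_analyzer((2, 3)))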

How to use Prophet's make_future_dataframe with multiple regressors?

make_future_dataframe seems to produce a dataframe with only date (ds) values, which in turn results in ValueError: Regressor 'var' missing from dataframe when attempting to generate forecasts with the code below.
from prophet import Prophet  # fbprophet in older versions

m = Prophet()
m.add_country_holidays(country_name='US')
m.add_regressor('var')
m.fit(df)
forecasts = m.predict(m.make_future_dataframe(periods=7))
Looking through the python docs, there doesn't seem to be any mention of how to combat this issue using Prophet. Is my only option to write additional code to lag all regressors by the period for which I want to generate forecasts (ex. take var at t-7 to produce a 7 day daily forecast)?
The issue here is that future = m.make_future_dataframe(...) creates a dataframe future whose only column is the ds date column. In order to predict using a model with regressors, you also need a column for each regressor in the future dataset.
Using my original training data which I called regression_data, I solved this by predicting the values for the regressor variables and then filling those into a future_w_regressors dataset which was a merge of future and regression_data.
Assume you have a trained model (model) ready.
# List of regressors
regressors = ['Total Minutes','Sent Emails','Banner Active']
# My data is weekly so I project out 1 year (52 weeks), this is what I want to forecast
future = model.make_future_dataframe(52, freq='W')
At this point, if you run model.predict(future) you will get the error you've been getting. What we need to do is incorporate the regressors. I merge regression_data with future so that the observations from the past are filled in. As you can see, the observations looking forward are empty (towards the end of the table):
# regression_data is the dataframe I used to train the model (include all covariates)
# merge the data you used to train the model
future_w_regressors = regression_data[regressors+['ds']].merge(future, how='outer', on='ds')
future_w_regressors
     Total Minutes   Sent Emails  Banner Active          ds
0         7.129552  9.241493e-03            0.0  2018-01-07
1         7.157242  8.629305e-14            0.0  2018-01-14
2         7.155367  8.629305e-14            0.0  2018-01-21
3         7.164352  8.629305e-14            0.0  2018-01-28
4         7.165526  8.629305e-14            0.0  2018-02-04
..             ...           ...            ...         ...
283            NaN           NaN            NaN  2023-06-11
284            NaN           NaN            NaN  2023-06-18
285            NaN           NaN            NaN  2023-06-25
286            NaN           NaN            NaN  2023-07-02
287            NaN           NaN            NaN  2023-07-09
Solution 1: Predict Regressors
For the next step I create a dataset containing only the rows with empty regressor values, loop through each regressor, train a naive Prophet model on it, predict its values for the future dates, fill those predictions into the empty-regressors dataset, and place the values back into the future_w_regressors dataset.
# Get the segment for which we have no regressor values
empty_future = future_w_regressors[future_w_regressors[regressors[0]].isnull()]
only_future = empty_future[['ds']]

# Forecast each regressor independently and fill in its future values
for regressor in regressors:
    # Prep a new training dataset
    train = regression_data[['ds', regressor]]
    train.columns = ['ds', 'y']  # rename the variables so they can be submitted to the prophet model
    # Train a model for this regressor
    rmodel = Prophet()
    rmodel.weekly_seasonality = False  # this is specific to my case
    rmodel.fit(train)
    regressor_predictions = rmodel.predict(only_future)
    # Replace the empty values in the empty dataset with the predicted values from the regressor model
    empty_future[regressor] = regressor_predictions['yhat'].values

# Fill in the values for all regressors in the future_w_regressors dataset
future_w_regressors.loc[future_w_regressors[regressors[0]].isnull(), regressors] = empty_future[regressors].values
Now the future_w_regressors table no longer has missing values
future_w_regressors
     Total Minutes   Sent Emails  Banner Active          ds
0         7.129552  9.241493e-03       0.000000  2018-01-07
1         7.157242  8.629305e-14       0.000000  2018-01-14
2         7.155367  8.629305e-14       0.000000  2018-01-21
3         7.164352  8.629305e-14       0.000000  2018-01-28
4         7.165526  8.629305e-14       0.000000  2018-02-04
..             ...           ...            ...         ...
283       7.161023 -1.114906e-02       0.548577  2023-06-11
284       7.156832 -1.138025e-02       0.404318  2023-06-18
285       7.150829 -5.642398e-03       0.465311  2023-06-25
286       7.146200 -2.989316e-04       0.699624  2023-07-02
287       7.145258  1.568782e-03       0.962070  2023-07-09
And I can run the predict command to get my forecasts which now extend into 2023 (original data ended in 2022):
model.predict(future_w_regressors)
ds trend yhat_lower yhat_upper trend_lower trend_upper Banner Active Banner Active_lower Banner Active_upper Sent Emails Sent Emails_lower Sent Emails_upper Total Minutes Total Minutes_lower Total Minutes_upper additive_terms additive_terms_lower additive_terms_upper extra_regressors_additive extra_regressors_additive_lower extra_regressors_additive_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
0 2018-01-07 2.118724 2.159304 2.373065 2.118724 2.118724 0.000000 0.000000 0.000000 3.681765e-04 3.681765e-04 3.681765e-04 0.076736 0.076736 0.076736 0.152302 0.152302 0.152302 0.077104 0.077104 0.077104 0.075198 0.075198 0.075198 0.0 0.0 0.0 2.271026
1 2018-01-14 2.119545 2.109899 2.327498 2.119545 2.119545 0.000000 0.000000 0.000000 3.437872e-15 3.437872e-15 3.437872e-15 0.077034 0.077034 0.077034 0.098945 0.098945 0.098945 0.077034 0.077034 0.077034 0.021911 0.021911 0.021911 0.0 0.0 0.0 2.218490
2 2018-01-21 2.120366 2.074524 2.293829 2.120366 2.120366 0.000000 0.000000 0.000000 3.437872e-15 3.437872e-15 3.437872e-15 0.077014 0.077014 0.077014 0.064139 0.064139 0.064139 0.077014 0.077014 0.077014 -0.012874 -0.012874 -0.012874 0.0 0.0 0.0 2.184506
3 2018-01-28 2.121187 2.069461 2.279815 2.121187 2.121187 0.000000 0.000000 0.000000 3.437872e-15 3.437872e-15 3.437872e-15 0.077110 0.077110 0.077110 0.050180 0.050180 0.050180 0.077110 0.077110 0.077110 -0.026931 -0.026931 -0.026931 0.0 0.0 0.0 2.171367
4 2018-02-04 2.122009 2.063122 2.271638 2.122009 2.122009 0.000000 0.000000 0.000000 3.437872e-15 3.437872e-15 3.437872e-15 0.077123 0.077123 0.077123 0.046624 0.046624 0.046624 0.077123 0.077123 0.077123 -0.030498 -0.030498 -0.030498 0.0 0.0 0.0 2.168633
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
283 2023-06-11 2.062645 2.022276 2.238241 2.045284 2.078576 0.025237 0.025237 0.025237 -4.441732e-04 -4.441732e-04 -4.441732e-04 0.077074 0.077074 0.077074 0.070976 0.070976 0.070976 0.101867 0.101867 0.101867 -0.030891 -0.030891 -0.030891 0.0 0.0 0.0 2.133621
284 2023-06-18 2.061211 1.975744 2.199376 2.043279 2.077973 0.018600 0.018600 0.018600 -4.533835e-04 -4.533835e-04 -4.533835e-04 0.077029 0.077029 0.077029 0.025293 0.025293 0.025293 0.095176 0.095176 0.095176 -0.069883 -0.069883 -0.069883 0.0 0.0 0.0 2.086504
285 2023-06-25 2.059778 1.951075 2.162531 2.041192 2.077091 0.021406 0.021406 0.021406 -2.247903e-04 -2.247903e-04 -2.247903e-04 0.076965 0.076965 0.076965 0.002630 0.002630 0.002630 0.098146 0.098146 0.098146 -0.095516 -0.095516 -0.095516 0.0 0.0 0.0 2.062408
286 2023-07-02 2.058344 1.953027 2.177666 2.039228 2.076373 0.032185 0.032185 0.032185 -1.190929e-05 -1.190929e-05 -1.190929e-05 0.076915 0.076915 0.076915 0.006746 0.006746 0.006746 0.109088 0.109088 0.109088 -0.102342 -0.102342 -0.102342 0.0 0.0 0.0 2.065090
287 2023-07-09 2.056911 1.987989 2.206830 2.037272 2.075110 0.044259 0.044259 0.044259 6.249949e-05 6.249949e-05 6.249949e-05 0.076905 0.076905 0.076905 0.039813 0.039813 0.039813 0.121226 0.121226 0.121226 -0.081414 -0.081414 -0.081414 0.0 0.0 0.0 2.096724
288 rows × 28 columns
Note that I trained the model for each regressor naively. However, you could optimize prediction for those independent variables if you wanted to.
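If you do want to tune those per-regressor models rather than fit them naively, one option (a sketch, not part of the original answer) is Prophet's built-in diagnostics:

from prophet.diagnostics import cross_validation, performance_metrics

# Evaluate one regressor model (rmodel from the loop above); the window
# sizes here are assumptions and should match your data's span.
df_cv = cross_validation(rmodel, initial='730 days', period='90 days', horizon='365 days')
print(performance_metrics(df_cv)[['horizon', 'rmse', 'mape']].head())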
Solution 2: Use last year's regressor values
Alternatively, you might decide you don't want to compound the uncertainty of the regressor forecasts onto your main forecast and only want an idea of how the forecast changes for different regressor values. In that case you can simply copy the regressor values from the last year into the missing part of the future_w_regressors dataset. This has the added benefit of easily simulating drops or increases relative to current regressor levels:
from datetime import timedelta

last_date = regression_data.iloc[-1]['ds']
one_year_ago = last_date - timedelta(days=365)  # works with data at any scale
last_year_of_regressors = regression_data.loc[regression_data['ds'] > one_year_ago, regressors]

# If you want to simulate a 10% drop in levels compared to this year
last_year_of_regressors = last_year_of_regressors * 0.9

missing = future_w_regressors[regressors[0]].isnull()
future_w_regressors.loc[missing, regressors] = last_year_of_regressors.iloc[:missing.sum()].values

Efficient way to create column referencing its own previous value

I am trying to generate a few columns in a dataframe with a datetime index, based on a rule which references their own previous values. I have tried a for loop over the length of df as per below, but am looking for a cleaner solution if possible.
What I want to do in the end is get the stats of the generated columns (C, D, E in the example below) over a large number of A, B, ... columns.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(30, 2), columns=list('AB'))
reset_level = 0.5
df['diff'] = df['A'].diff()
df['C'], df['D'], df['E'] = [0.0, 0.0, 0.0]
for i in range(1, len(df)):
    # column positions: 3 = 'C', 4 = 'D', 5 = 'E'
    if abs(df.iloc[i-1]['C'] + df.iloc[i]['diff']) > reset_level:
        df.iat[i, 3] = 0.000
        df.iat[i, 4] = df.iloc[i-1]['C'] + df.iloc[i]['diff']
    else:
        df.iat[i, 3] = df.iloc[i-1]['C'] + df.iloc[i]['diff']
        df.iat[i, 4] = 0.000
    df.iat[i, 5] = 0.5 * df.iloc[i]['D'] * df.iloc[i]['D']
Edit: Adding expected output below.
A B diff C D E
0 -0.352725 1.429037 NaN 0.000000 0.000000 0.000000
1 -1.024418 -0.644302 -0.671693 0.000000 -0.671693 0.225585
2 0.401065 0.419555 1.425483 0.000000 1.425483 1.016001
3 -1.302484 0.724320 -1.703549 0.000000 -1.703549 1.451039
4 0.427035 0.835221 1.729518 0.000000 1.729518 1.495617
5 0.158694 -0.416741 -0.268340 -0.268340 0.000000 0.000000
6 0.921985 -0.490635 0.763291 0.494951 0.000000 0.000000
7 -0.835297 -1.036580 -1.757282 0.000000 -1.262331 0.796740
8 0.752060 -0.279206 1.587356 0.000000 1.587356 1.259850
9 1.795306 -1.554886 1.043246 0.000000 1.043246 0.544181
10 -0.405100 -0.361454 -2.200406 0.000000 -2.200406 2.420893
11 -0.253629 -0.627245 0.151471 0.151471 0.000000 0.000000
12 -0.820573 -0.212886 -0.566944 -0.415473 0.000000 0.000000
13 0.473439 2.532487 1.294012 0.000000 0.878539 0.385916
14 -1.395435 1.016338 -1.868875 0.000000 -1.868875 1.746346
15 -0.244269 -0.337820 1.151166 0.000000 1.151166 0.662592
16 -2.084977 -1.262249 -1.840708 0.000000 -1.840708 1.694103
17 0.666323 -1.696245 2.751300 0.000000 2.751300 3.784825
18 0.235207 -0.513903 -0.431115 -0.431115 0.000000 0.000000
19 1.386456 -0.149153 1.151249 0.000000 0.720134 0.259296
20 0.093456 -0.298154 -1.293000 0.000000 -1.293000 0.835925
21 0.690499 -1.687416 0.597043 0.000000 0.597043 0.178230
22 1.287530 -1.390260 0.597031 0.000000 0.597031 0.178223
23 1.828138 -0.288829 0.540608 0.000000 0.540608 0.146128
24 0.209666 -0.903385 -1.618472 0.000000 -1.618472 1.309727
25 -1.010678 0.615569 -1.220344 0.000000 -1.220344 0.744619
26 -1.799800 1.536332 -0.789122 0.000000 -0.789122 0.311357
27 0.611096 -1.033066 2.410896 0.000000 2.410896 2.906209
28 -0.532675 -0.091541 -1.143770 0.000000 -1.143770 0.654105
29 2.468137 -1.046117 3.000811 0.000000 3.000811 4.502435
I converted your for loop by using a numpy array to hold the condition and then np.where to replace the values according to that condition:
Define condition array
condition = np.abs(df.C.shift() + df["diff"]) > reset_level
Replace the values according to condition
df.iloc[:, 3] = np.where(condition, np.zeros((df.shape[0])), (df['C'].shift() + df['diff']))
df.iloc[:, 4] = np.where(~condition, np.zeros((df.shape[0])), (df['C'].shift() + df['diff']))
df.iloc[:, 5] = 0.5 * df['D'] * df['D']
Output:
A B diff C D E
0 -0.432513 -0.259526 NaN NaN 0.000000 0.000000
1 -1.120872 -1.572850 -0.688360 0.000000 NaN NaN
2 -0.917555 -2.251316 0.203317 0.203317 0.000000 0.000000
3 -1.869781 -1.284524 -0.952225 0.000000 -0.748908 0.280432
4 -2.041950 -0.091837 -0.172169 -0.172169 0.000000 0.000000
5 -0.142499 0.207746 1.899451 0.000000 1.727282 1.491751
6 1.432833 0.085211 1.575332 0.000000 1.575332 1.240835
7 -2.500191 -0.009907 -3.933025 0.000000 -3.933025 7.734341
8 0.154460 -1.859954 2.654651 0.000000 2.654651 3.523587
9 -0.565057 -0.516736 -0.719517 0.000000 -0.719517 0.258853
10 0.329845 0.127978 0.894902 0.000000 0.894902 0.400425
11 -0.920558 1.254617 -1.250402 0.000000 -1.250402 0.781753
12 -1.396913 0.262378 -0.476355 -0.476355 0.000000 0.000000
13 0.117336 -0.439932 1.514249 0.000000 1.037894 0.538612
14 -0.227066 2.565831 -0.344402 -0.344402 0.000000 0.000000
15 0.077750 0.195277 0.304816 0.304816 0.000000 0.000000
16 1.470611 -0.357213 1.392861 0.000000 1.697677 1.441053
17 -0.553844 0.339270 -2.024455 0.000000 -2.024455 2.049209
18 -0.259603 0.212839 0.294242 0.294242 0.000000 0.000000
19 0.605961 0.279599 0.865564 0.000000 1.159805 0.672574
20 -0.326706 -0.774350 -0.932667 0.000000 -0.932667 0.434934
21 -0.927601 -2.360751 -0.600895 0.000000 -0.600895 0.180537
22 -0.372085 0.986228 0.555516 0.000000 0.555516 0.154299
23 -0.687731 -2.966817 -0.315647 -0.315647 0.000000 0.000000
24 -0.041028 -0.328898 0.646703 0.000000 0.331057 0.054799
25 0.099489 0.275983 0.140517 0.140517 0.000000 0.000000
26 0.468274 -0.287097 0.368785 0.368785 0.000000 0.000000
27 0.497417 -0.588481 0.029143 0.029143 0.000000 0.000000
28 0.603178 2.243163 0.105761 0.105761 0.000000 0.000000
29 -0.643283 -1.051491 -1.246461 0.000000 -1.140700 0.650598
Is this what you were looking for? You didn't provide expected output at the time.
Documentation:
np.where
Try this one; instead of iterating over all rows, it will process the whole column at once for you:
df["C_prev"] = df["C"].shift(1)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(30, 2), columns=list('AB'))
reset_level = 0.5
df['diff'] = df['A'].diff()
df['C'], df['D'], df['E'] = [0.0, 0.0, 0.0]
Then apply a function to each row:
def f(row):
    if abs(df.loc[row.name - 1, 'C'] + row['diff']) > reset_level:
        C = 0.0
        D = df.loc[row.name - 1, 'C'] + row['diff']
    else:
        C = df.loc[row.name - 1, 'C'] + row['diff']
        D = 0.0
    E = 0.5 * row['D'] * row['D']
    # name the Series entries so the assignment below aligns on columns
    return pd.Series([C, D, E], index=['C', 'D', 'E'])

df.loc[1:, ['C', 'D', 'E']] = df[1:].apply(f, axis=1)
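Because each C value depends on the previous one, the recurrence is inherently sequential, so fully vectorised versions are hard to get right. A compromise sketch (not from either answer) keeps the loop but runs it over plain NumPy arrays, which is much faster than row-wise DataFrame access and can be JIT-compiled with numba if available:

import numpy as np

def compute_cde(diff, reset_level=0.5):
    # C[i] depends on C[i-1], so iterate once over the array
    n = len(diff)
    C = np.zeros(n)
    D = np.zeros(n)
    for i in range(1, n):
        s = C[i - 1] + diff[i]
        if abs(s) > reset_level:
            D[i] = s
        else:
            C[i] = s
    E = 0.5 * D * D
    return C, D, E

df['C'], df['D'], df['E'] = compute_cde(df['diff'].to_numpy())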

ValueError: ndarray is not contiguous

When I build a matrix using the last row of my dataframe:
x = w.iloc[-1, :]
a = np.mat(x).T
it goes:
ValueError: ndarray is not contiguous
Printing x shows (I have 61 columns in my dataframe):
print(x)
cdl2crows 0.000000
cdl3blackcrows 0.000000
cdl3inside 0.000000
cdl3linestrike 0.000000
cdl3outside 0.191465
cdl3starsinsouth 0.000000
cdl3whitesoldiers_x 0.000000
cdl3whitesoldiers_y 0.000000
cdladvanceblock 0.000000
cdlhighwave 0.233690
cdlhikkake 0.218209
cdlhikkakemod 0.000000
...
cdlidentical3crows 0.000000
cdlinneck 0.000000
cdlinvertedhammer 0.351235
cdlkicking 0.000000
cdlkickingbylength 0.000000
cdlladderbottom 0.002259
cdllongleggeddoji 0.629053
cdllongline 0.588480
cdlmarubozu 0.065362
cdlmatchinglow 0.032838
cdlmathold 0.000000
cdlmorningdojistar 0.000000
cdlmorningstar 0.327749
cdlonneck 0.000000
cdlpiercing 0.251690
cdlrickshawman 0.471466
cdlrisefall3methods 0.000000
Name: 2010-01-04, Length: 61, dtype: float64
How do I solve this? Many thanks.
np.mat expects array-form input; refer to the numpy.mat doc.
So your code should be:
x = w.iloc[-1, :].values
a = np.mat(x).T
.values gives the numpy array form of the dataframe values, so np.mat will work.
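Another general fix for this error, assuming (as seems likely here) that the row slice is simply not laid out contiguously in memory, is to force a contiguous copy before building the matrix:

import numpy as np

a = np.mat(np.ascontiguousarray(x)).T  # C-contiguous copy, then transpose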
Use np.array instead of np.mat:
a = np.array(x).T
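Note that, unlike np.mat, transposing a 1-D np.array is a no-op, so if you need an actual column vector from a plain array you have to add the axis yourself (np.matrix is discouraged in recent NumPy anyway):

x = w.iloc[-1, :].values   # 1-D array; x.T would still be 1-D
a = x.reshape(-1, 1)       # explicit column vector, shape (61, 1)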

TfIDf Vectorizer weights

Hi, I have lemmatized text in the format shown by lemma below. I want to get the TfIdf score for each word; this is the function that I wrote:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

lemma = ["'Ah", 'yes', u'say', 'softly', 'Harry',
         'Potter', 'Our', 'new', 'celebrity', 'You',
         'learn', 'subtle', 'science', 'exact', 'art',
         'potion-making', u'begin', 'He', u'speak', 'barely',
         'whisper', 'caught', 'every', 'word', 'like',
         'Professor', 'McGonagall', 'Snape', 'gift',
         u'keep', 'class', 'silent', 'without', 'effort',
         'As', 'little', 'foolish', 'wand-waving', 'many',
         'hardly', 'believe', 'magic', 'I', 'dont', 'expect', 'really',
         'understand', 'beauty']

def Tfidf_Vectorize(lemmas_name):
    vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    vect_transform = vect.fit_transform(lemmas_name)
    # First approach of creating a dataframe of weight & feature names
    vect_score = np.asarray(vect_transform.mean(axis=0)).ravel().tolist()
    vect_array = pd.DataFrame({'term': vect.get_feature_names(), 'weight': vect_score})
    vect_array.sort_values(by='weight', ascending=False, inplace=True)
    # Second approach of getting the feature names
    vect_fn = np.array(vect.get_feature_names())
    sorted_tfidf_index = vect_transform.max(0).toarray()[0].argsort()
    print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))
    return vect_array

tf_dataframe = Tfidf_Vectorize(lemma)
print(tf_dataframe.iloc[:5, :])
The output I am getting by:
print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))
is
Largest Tfidf:
[u'yes' u'fools' u'fury' u'gale' u'ghosts' u'gift' u'glory' u'glow' u'good'
u'granger']
The result of tf_dataframe
term weight
261 snape 0.027875
238 say 0.022648
211 potter 0.013937
181 mind 0.010453
123 harry 0.010453
60 dark 0.006969
75 dumbledore 0.006969
311 voice 0.005226
125 head 0.005226
231 ron 0.005226
Shouldn't both approaches lead to the same top features? I just want to calculate the tf-idf scores and get the top 5 features/weights. What am I doing wrong?
I am not sure what I am looking at here, but I have the feeling that you're using TfidfVectorizer incorrectly. However, please correct me in case I got the wrong idea of what you're trying to do.
So, what you need is a list of documents which you feed to fit_transform(). From that you can construct a matrix where, for example, each column represents a document and each row a word. One cell in that matrix is the tf-idf score of word i in document j.
Here's an example:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is a document.",
    "This is another document with slightly more text.",
    "Whereas this is yet another document with even more text than the other ones.",
    "This document is awesome and also rather long.",
    "The car he drove was red."
]
document_names = ['Doc {:d}'.format(i) for i in range(len(documents))]

def get_tfidf(docs, ngram_range=(1, 1), index=None):
    vect = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
    tfidf = vect.fit_transform(docs).todense()  # use the docs argument, not the global
    return pd.DataFrame(tfidf, columns=vect.get_feature_names(), index=index).T

print(get_tfidf(documents, ngram_range=(1, 2), index=document_names))
Which will give you:
Doc 0 Doc 1 Doc 2 Doc 3 Doc 4
awesome 0.0 0.000000 0.000000 0.481270 0.000000
awesome long 0.0 0.000000 0.000000 0.481270 0.000000
car 0.0 0.000000 0.000000 0.000000 0.447214
car drove 0.0 0.000000 0.000000 0.000000 0.447214
document 1.0 0.282814 0.282814 0.271139 0.000000
document awesome 0.0 0.000000 0.000000 0.481270 0.000000
document slightly 0.0 0.501992 0.000000 0.000000 0.000000
document text 0.0 0.000000 0.501992 0.000000 0.000000
drove 0.0 0.000000 0.000000 0.000000 0.447214
drove red 0.0 0.000000 0.000000 0.000000 0.447214
long 0.0 0.000000 0.000000 0.481270 0.000000
ones 0.0 0.000000 0.501992 0.000000 0.000000
red 0.0 0.000000 0.000000 0.000000 0.447214
slightly 0.0 0.501992 0.000000 0.000000 0.000000
slightly text 0.0 0.501992 0.000000 0.000000 0.000000
text 0.0 0.405004 0.405004 0.000000 0.000000
text ones 0.0 0.000000 0.501992 0.000000 0.000000
The two methods you show for getting words and their respective scores calculate, respectively, the mean over all documents and the max score of each word.
So let's do this and compare the two methods:
df = get_tfidf(documents, ngram_range=(1, 2), index=document_names)
print(pd.DataFrame([df.mean(1), df.max(1)], index=['score_mean', 'score_max']).T)
We can see that the scores are of course different.
score_mean score_max
awesome 0.096254 0.481270
awesome long 0.096254 0.481270
car 0.089443 0.447214
car drove 0.089443 0.447214
document 0.367353 1.000000
document awesome 0.096254 0.481270
document slightly 0.100398 0.501992
document text 0.100398 0.501992
drove 0.089443 0.447214
drove red 0.089443 0.447214
long 0.096254 0.481270
ones 0.100398 0.501992
red 0.089443 0.447214
slightly 0.100398 0.501992
slightly text 0.100398 0.501992
text 0.162002 0.405004
text ones 0.100398 0.501992
Note:
You can convince yourself that this does the same as calling mean/max on the TfidfVectorizer output:
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
tfidf = vect.fit_transform(documents)
print(tfidf.max(0))
print(tfidf.mean(0))
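And to answer the original question directly, the top 5 terms then fall out of the df returned by get_tfidf above, e.g. by mean score:

top5 = df.mean(1).sort_values(ascending=False).head(5)
print(top5)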
