Append/Merge Dataframe with LDA output

Append/Merge Dataframe with LDA output - python

I'm working on an LDA model using Gensim and spacy.
Generically:
ldamodel = Lda(doc_term_matrix, num_topics=4, random_state = 100, update_every=3, chunksize = 50, id2word = dictionary, passes=100, alpha='auto')
ldamodel.print_topics(num_topics=4, num_words=6)
I'm at the point where I have some output and I'd like to append my original Dataframe (from which the text came from) with the topics and a percent contribution for each document.
The original df looks like this
id group text
234 1 here is some text
837 7 here is some text
494 2 here is some text
223 1 here is some text
I do some standard preprocessing including lemmatization, removing stop words, etc. and then compute percent contributions for each document.
my output looks like this
Document_No Dominant_Topic ... Keywords Text
0 0 1.0 ... RT, new, work, amp, year, today, people, look,... 0
1 1 0.0 ... like, time, good, know, day, find, research, a... 1
2 2 1.0 ... RT, new, work, amp, year, today, people, look,... 2
3 3 3.0 ... study, t, change, use, want, Trump, love, stud... 3
4 4 3.0 ... study, t, change, use, want, Trump, love, stud... 4
I thought I could just concat the 2 dfs back together like so:
results = pd.concat([df, results])
but when I do that the indices don't match and I'm left with a sort of Frankenstein df that looks like this
id group text Document_No Dominant_Topic ...
NaN NaN NaN 0 1.0 ...
NaN NaN NaN 1 0.0 ...
494 2 here is some text NaN NaN ...
223 1 here is some text NaN NaN ...
Happy to post fuller code if that would be helpful, but I'm hoping someone just knows a better way to do this from same point as I might print topics.

Related

How to trim down a Pandas data frame rows?

I'm trying so hard to shorten this awful lot of rows from an XML sitemap but I can't find a solution to trim it down.
import advertools as adv
import pandas as pd
site = "https://www.halfords.com/sitemap_index.xml"
sitemap = adv.sitemap_to_df(site)
sitemap = sitemap.dropna(subset=["loc"]).reset_index(drop=True)
# Some sitemaps keeps urls with "/" on the end, some is with no "/"
# If there is "/" on the end, we take the second last column as slugs
# Else, the last column is the slug column
slugs = sitemap['loc'].dropna()[~sitemap['loc'].dropna().str.endswith('/')].str.split('/').str[-2].str.replace('-', ' ')
slugs2 = sitemap['loc'].dropna()[~sitemap['loc'].dropna().str.endswith('/')].str.split('/').str[-1].str.replace('-', ' ')
# Merge two series
slugs = list(slugs) + list(slugs2)
# adv.word_frequency automatically removes the stop words
word_counts_onegram = adv.word_frequency(slugs)
word_counts_twogram = adv.word_frequency(slugs, phrase_len=2)
competitor = pd.concat([word_counts_onegram, word_counts_twogram])\
.rename({'abs_freq':'Count','word':'Ngram'}, axis=1)\
.sort_values('Count', ascending=False)
competitor.to_csv('competitor.csv',index=False)
competitor
competitor.shape
(67758, 2)
(67758, 2)
I've been raveling around several blogs included resources on Stack Overflow but nothing seemed to work.
This is definitely something going on with my zero expertise in coding I suppose

Two things:
You can use adv.url_to_df to split URLs and get the slugs (there should be a column called last_dir:
urldf = adv.url_to_df(sitemap['loc'].dropna())
urldf
url
scheme
netloc
path
query
fragment
dir_1
dir_2
dir_3
dir_4
dir_5
dir_6
dir_7
dir_8
dir_9
last_dir
0
https://www.halfords.com/cycling/cycling-technology/helmet-cameras/removu-k1-4k-camera-and-stabiliser-694977.html
https
www.halfords.com
/cycling/cycling-technology/helmet-cameras/removu-k1-4k-camera-and-stabiliser-694977.html
nan
nan
cycling
cycling-technology
helmet-cameras
removu-k1-4k-camera-and-stabiliser-694977.html
nan
nan
nan
nan
nan
removu-k1-4k-camera-and-stabiliser-694977.html
1
https://www.halfords.com/technology/bluetooth-car-kits/jabra-drive-bluetooth-speakerphone---white-695094.html
https
www.halfords.com
/technology/bluetooth-car-kits/jabra-drive-bluetooth-speakerphone---white-695094.html
nan
nan
technology
bluetooth-car-kits
jabra-drive-bluetooth-speakerphone---white-695094.html
nan
nan
nan
nan
nan
nan
jabra-drive-bluetooth-speakerphone---white-695094.html
2
https://www.halfords.com/tools/power-tools-and-accessories/power-tools/stanley-fatmax-v20-18v-combi-drill-kit-695102.html
https
www.halfords.com
/tools/power-tools-and-accessories/power-tools/stanley-fatmax-v20-18v-combi-drill-kit-695102.html
nan
nan
tools
power-tools-and-accessories
power-tools
stanley-fatmax-v20-18v-combi-drill-kit-695102.html
nan
nan
nan
nan
nan
stanley-fatmax-v20-18v-combi-drill-kit-695102.html
3
https://www.halfords.com/technology/dash-cams/mio-mivue-c450-695262.html
https
www.halfords.com
/technology/dash-cams/mio-mivue-c450-695262.html
nan
nan
technology
dash-cams
mio-mivue-c450-695262.html
nan
nan
nan
nan
nan
nan
mio-mivue-c450-695262.html
4
https://www.halfords.com/technology/dash-cams/mio-mivue-818-695270.html
https
www.halfords.com
/technology/dash-cams/mio-mivue-818-695270.html
nan
nan
technology
dash-cams
mio-mivue-818-695270.html
nan
nan
nan
nan
nan
nan
mio-mivue-818-695270.html
There are options that pandas provides, which you can change. For example:
pd.options.display.max_rows
60
# change it to display more/fewer rows:
pd.options.display.max_rows = 100
As you did, you can easily create onegrams and bigrams, combine them, and display them:
text_list = urldf['last_dir'].str.replace('-', ' ').dropna()
one_grams = adv.word_frequency(text_list, phrase_len=1)
bigrams = adv.word_frequency(text_list, phrase_len=2)
print(pd.concat([one_grams, bigrams])
.sort_values('abs_freq', ascending=False)
.head(15) # <-- change this to 100 for example
.reset_index(drop=True))
word
abs_freq
0
halfords
2985
1
car
1430
2
bike
922
3
kit
829
4
black
777
5
laser
686
6
set
614
7
wheel
540
8
pack
524
9
mats
511
10
car mats
478
11
thule
453
12
paint
419
13
4
413
14
spray
382
Hope that helps?

Add column to dataframe based on intervals from another dataframe

I have a gpx file that I am manipulating. I would like to add a column to it that describes the terrain based on another dataframe that lists the terrain by distance. Here are the dataframes:
GPS_df
lat lon alt time dist total_dist
0 44.565335 -123.312517 85.314 2020-09-07 14:00:01 0.000000 0.000000
1 44.565336 -123.312528 85.311 2020-09-07 14:00:02 0.000547 0.000547
2 44.565335 -123.312551 85.302 2020-09-07 14:00:03 0.001137 0.001685
3 44.565332 -123.312591 85.287 2020-09-07 14:00:04 0.001985 0.003670
4 44.565331 -123.312637 85.270 2020-09-07 14:00:05 0.002272 0.005942
... ... ... ... ... ... ...
12481 44.565576 -123.316116 85.517 2020-09-07 17:28:14 0.002318 26.091324
12482 44.565559 -123.316072 85.587 2020-09-07 17:28:15 0.002469 26.093793
12483 44.565554 -123.316003 85.637 2020-09-07 17:28:16 0.003423 26.097217
12484 44.565535 -123.315966 85.697 2020-09-07 17:28:17 0.002249 26.099465
12485 44.565521 -123.315929 85.700 2020-09-07 17:28:18 0.002066 26.101532
terrain_df:
dist terrain
0 0.0 Start
1 3.0 Road
2 5.0 Gravel
3 8.0 Trail-hard
4 12.0 Gravel
5 16.0 Trail-med
6 18.0 Road
7 22.0 Gravel
8 23.0 Trail-easy
9 26.2 Road
I have come up with the following code, that works, but I would like to make it more efficient by eliminating the looping:
GPS_df['terrain']=""
i=0
for j in range(0,len(GPS_df)):
if GPS_df.total_dist[j]<= terrain_df.dist[i]:
GPS_df.terrain[j]=terrain_df.terrain[i]
else:
i=i+1
GPS_df.terrain[j]=terrain_df.terrain[i]
I have tried a half a dozen different ways, but none seem to work correctly. I am sure there is an easy way to do it, but I just don't have the skills and experience to figure it out so far, so I am looking for some help. I tried using cut and add the labels, but cut requires unique labels. I could use cut and then replace the generated intervals with labels in another way, but that doesn't seems like the best approach either. I also tried this approach that I found from another question, but it filled the column with the first label only (I also am having trouble understanding how it works, so it makes it tough to troubleshoot).
bins = terrain_df['dist']
names = terrain_df['terrain']
d = dict(enumerate(names, 1))
GPS_df['terrain2'] = np.vectorize(d.get)(np.digitize(GPS_df['dist'], bins))
Appreciate any guidance that you can give me.

I believe pandas.merge_asof should do the trick. Try:
result = pd.merge_asof(left=GPS_df, right=terrain_df, left_on='total_dist', right_on='dist', direction='backward')

How can I build a faster decaying average? comparing a data frame's rows date field to other rows dates

I am clumsy but adequate with python. I have referenced stack often, but this is my first question. I have built a decaying average function to act on a pandas data frame with about 10000 rows, but it takes 40 minutes to run. I would appreciate any thoughts on how to speed it up. Here is a sample of actual data, simplified a bit.
sub = pd.DataFrame({
'user_id':[101,101,101,101,101,102,101],
'class_section':['Modern Biology - B','Spanish Novice 1 - D', 'Modern Biology - B','Spanish Novice 1 - D','Spanish Novice 1 - D','Modern Biology - B','Spanish Novice 1 - D'],
'sub_skill':['A','A','B','B','B','B','B'],
'rating' :[2.0,3.0,3.0,2.0,3.0,2.0,2.0],
'date' :['2019-10-16','2019-09-04','2019-09-04', '2019-09-04','2019-09-13','2019-10-16','2019-09-05']})
For this data frame:
sub
Out[716]:
user_id class_section sub_skill rating date
0 101 Modern Biology - B A 2.0 2019-10-16
1 101 Spanish Novice 1 - D A 3.0 2019-09-04
2 101 Modern Biology - B B 3.0 2019-09-04
3 101 Spanish Novice 1 - D B 2.0 2019-09-04
4 101 Spanish Novice 1 - D B 3.0 2019-09-13
5 102 Modern Biology - B B 2.0 2019-10-16
6 101 Spanish Novice 1 - D B 2.0 2019-09-05
A decaying average weights the most recent event that meets conditions at full weight and weights each previous event with a multiplier less than one. In this case, the multiplier is 0.667. previously weighted events are weighted again.
So the decaying average for user 101's rating in Spanish sub_skill B is:
(2.0*0.667^2 + 2.0*0.667^1 + 3.0*0.667^0)/((0.667^2 + 0.667^1 + 0.667^0) = 2.4735
Here is what I tried, after reading a helpful post on weighted averages
sub['date'] = pd.to_datetime(sub.date_due)
def func(date, user_id, class_section, sub_skill):
return sub.apply(lambda row: row['date'] > date
and row['user_id']==user_id
and row['class_section']== class_section
and row['sub_skill']==sub_skill,axis=1).sum()
# for some reason this next line of code took about 40 minutes to run on 9000 rows:
sub['decay_count']=sub.apply(lambda row: func(row['date'],row['user_id'], row['class_section'], row['sub_skill']), axis=1)
# calculate decay factor:
sub['decay_weight']=sub.apply(lambda row: 0.667**row['decay_count'], axis=1)
# calcuate decay average contributors (still needs to be summed):
g = sub.groupby(['user_id','class_section','sub_skill'])
sub['decay_avg'] = sub.decay_weight / g.decay_weight.transform("sum") * sub.rating
# new dataframe with indicator/course summaries as decaying average (note the sum):
indicator_summary = g.decay_avg.sum().to_frame(name = 'DAvg').reset_index()
I frequently work in pandas and I am used to iterating through large datasets. I would have expected this to take rows-squared time, but it is taking much longer. A more elegant solution or some advice to speed it up would be really appreciated!
Some background on this project: I am trying to automate the conversion from proficiency-based grading into a classic course grade for my school. I have the process of data extraction from our Learning Management System into a spreadsheet that does the decaying average and then posts the information to teachers, but I would like to automate the whole process and extract myself from it. The LMS is slow to implement a proficiency-based system and is reluctant to provide a conversion - for good reason. However, we have to communicate both student proficiencies and our conversion to a traditional grade to parents and colleges since that is a language they speak.

Why not use groupby? The idea here is that you rank the dates within the group in descending order and subtract 1 (because rank starts with 1). That seems to mirror your logic in func above, without having to try to call apply with a nested apply.
sub['decay_count'] = sub.groupby(['user_id', 'class_section', 'sub_skill'])['date'].rank(method='first', ascending=False) - 1
sub['decay_weight'] = sub['decay_count'].apply(lambda x: 0.667 ** x)
Output:
sub.sort_values(['user_id', 'class_section', 'sub_skill', 'decay_count'])
user_id class_section sub_skill rating date decay_count decay_weight
0 101 Modern Biology - B A 2.0 2019-10-16 0.0 1.000000
2 101 Modern Biology - B B 3.0 2019-09-04 0.0 1.000000
1 101 Spanish Novice 1 - D A 3.0 2019-09-04 0.0 1.000000
3 101 Spanish Novice 1 - D B 2.0 2019-09-04 0.0 1.000000
6 101 Spanish Novice 1 - D B 2.0 2019-09-05 1.0 0.667000
4 101 Spanish Novice 1 - D B 3.0 2019-09-13 2.0 0.444889
5 102 Modern Biology - B B 2.0 2019-10-16 0.0 1.000000

One-way Anova loop through pandas dataframe - results in a single table

I have a pandas dataframe containing 16 columns, of which 14 represent variables where i perform a looped Anova test using statsmodels. My dataframe looks something like this (simplified):
ID Cycle_duration Average_support_phase Average_swing_phase Label
1 23.1 34.3 47.2 1
2 27.3 38.4 49.5 1
3 25.8 31.1 45.7 1
4 24.5 35.6 41.9 1
...
So far this is what i'm doing:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('features_total.csv')
for variable in df.columns:
model = ols('{} ~ Label'.format(variable), data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
Which yields:
sum_sq df F PR(>F)
Label 0.124927 2.0 2.561424 0.084312
Residual 1.731424 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I'm getting an individual table print for each variable where the Anova is performed. Basically what i want is to print one single table with the summarized results, or something like this:
sum_sq df F PR(>F)
Cycle_duration 0.1249270 2.0 2.561424 0.084312
Residual 1.7314240 71.0 NaN NaN
Average_support_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
Average_swing_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I can already see a problem because this method always outputs the 'Label' nomenclature before the actual values, and not the variable name in question (like i've shown above, i would like to have the variable name above each 'residual'). Is this even possible with the statsmodels approach?
I'm fairly new to python and excuse me if this has nothing to do with statsmodels - in that case, please do elucidate me on what i should be trying.

You can collect the tables and concatenate them at the end of your loop. This method will create a hierarchical index, but I think that makes it a bit more clear. Something like this:
keys = []
tables = []
for variable in df.columns:
model = ols('{} ~ Label'.format(variable), data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
keys.append(variable)
tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
Somewhat related, I would also suggest correcting for multiple comparisons. This is more a statistical suggestion than a coding suggestion, but considering you are performing numerous statistical tests, it would make sense to account for the probability that one of the test would result in a false positive.

Have a dataframe restart it's count for a particular column

The Overview:
In our project, we are working with a CSV file that contains some data. We will call it smal.csv It is a bit of a chunky file that will be later used for some other algorithms. (Here is the gist in case the link to smal.csv is too badly formatted for your browser.)
The file will be loaded like this
filename = "smal.csv"
keyname = "someKeyname"
self.data[keyname] = spectral_data(pd.read_csv(filename, header=[0, 1], verbose=True))
The spectral class looks like this. As you can see, we do not actually keep the dataframe as is.
class spectral_data(object):
def __init__(self, df):
try:
uppercols = df.columns.levels[0]
lowercols = list(df.columns.levels[1].values)
except:
df.columns = pd.MultiIndex.from_tuples(list(df.columns))
uppercols = df.columns.levels[0]
lowercols = list(df.columns.levels[1].values)
for i, val in enumerate(lowercols):
try:
lowercols[i] = float(val)
except:
lowercols[i] = val
levels = [uppercols, lowercols]
df.columns.set_levels(levels, inplace=True)
self.df = df
After we've loaded it we'd like to concatenate it with another set of data, also loaded like smal.csv was.
Our concatenation is done like this.
new_df = pd.concat([self.data[dataSet1].df, self.data[dataSet2].df], ignore_index=True)
However, the ignore_index=True does not work, because the actual row that we are concatenating is not the index. However, we cannot simply remove the column, it is necessary for other parts of our program.
The Objective:
I'm trying to concatenate a couple of data frames together, however, what I thought was the index is not actually the index for the data frame. Thus the command
pd.concat([df1.df, df2.df], ignore_index=True)
will not work. I thought maybe using iloc to change each individual cell would work but I feel like this is not the most intuitive way to approach this.
How can I get a data frame that looks like this
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 1 NaN ... 43.28
5 2 NaN ... 41.33 47.33
6 3 NaN ... -21.94 12.06
7 4 NaN ... -30.94 -1.94
8 5 NaN ... -24.78 40.22
Turn into this.
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 5 NaN ... 43.28
5 6 NaN ... 41.33 47.33
6 7 NaN ... -21.94 12.06
7 8 NaN ... -30.94 -1.94
8 9 NaN ... -24.78 40.22

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Append/Merge Dataframe with LDA output - python

Related

How to trim down a Pandas data frame rows?

Add column to dataframe based on intervals from another dataframe

How can I build a faster decaying average? comparing a data frame's rows date field to other rows dates

One-way Anova loop through pandas dataframe - results in a single table

Have a dataframe restart it's count for a particular column

Categories

Resources