I have a pandas dataframe that contains a list of articles: the outlet, publish date, link, etc. One of the columns in this dataframe is a list of keywords. For example, each cell in the keyword column contains a list like [drop, right, states, laws].
My ultimate goal is to count the number of occurrences of each unique word on each day. The challenge that I'm having is breaking the keywords out of their lists and then matching them to the date on which they occurred. ...assuming this is even the most logical first step.
At present I have a solution in the code below; however, I'm new to Python, and in thinking through these things I still have an Excel mindset. The code below works, but it's very slow. Is there a faster way to do this?
# Create a list of the keywords for articles in the last 30 days to determine their quantity
keyword_list = stories_full_recent_df['Keywords'].tolist()
keyword_list = [item for sublist in keyword_list for item in sublist]

# Create a blank dataframe and new iterator to write the keyword appearances to
wordtrends_df = pd.DataFrame(columns=['Captured_Date', 'Brand', 'Coverage', 'Keyword'])
r = 0

print("Creating table on keywords: {:,}".format(len(keyword_list)))
print(time.strftime("%H:%M:%S"))

# Write the keywords out into their own rows with the dates and origins in which they occur
while r <= len(keyword_list):
    for i in stories_full_recent_df.index:
        words = stories_full_recent_df.loc[i]['Keywords']
        for word in words:
            wordtrends_df.loc[r] = [stories_full_recent_df.loc[i]['Captured_Date'],
                                    stories_full_recent_df.loc[i]['Brand'],
                                    stories_full_recent_df.loc[i]['Coverage'],
                                    word]
            r += 1

print(time.strftime("%H:%M:%S"))
print("Keyword compilation complete.")
Once I have each word on its own row I'm simply using .groupby() to figure out the number of occurrences each day.
# Group and count the keywords and days to find the day with the least of each word
test_min = wordtrends_df.groupby(['Keyword', 'Captured_Date'], as_index=False).count() \
                        .sort_values(by=['Keyword', 'Brand'], ascending=True)
keyword_min = test_min.groupby(['Keyword'], as_index=False).first()
At present there are around 100,000 words in this list and it takes about an hour to run through it. I'd love thoughts on a faster way to do it.
I think you can get the expected result by doing this:
wordtrends_df = pd.melt(pd.concat((stories_full_recent_df[['Brand', 'Captured_Date', 'Coverage']],
stories_full_recent_df.Keywords.apply(pd.Series)),axis=1),
id_vars=['Brand','Captured_Date','Coverage'],value_name='Keyword')\
.drop(['variable'],axis=1).dropna(subset=['Keyword'])
An explanation with a small example below.
Consider an example dataframe:
df = pd.DataFrame({'Brand': ['X', 'Y'],
'Captured_Date': ['2017-04-01', '2017-04-02'],
'Coverage': [10, 20],
'Keywords': [['a', 'b', 'c'], ['c', 'd']]})
# Brand Captured_Date Coverage Keywords
# 0 X 2017-04-01 10 [a, b, c]
# 1 Y 2017-04-02 20 [c, d]
First thing you can do is expand the keywords column so that each keyword occupies its own column:
a = df.Keywords.apply(pd.Series)
# 0 1 2
# 0 a b c
# 1 c d NaN
Concatenate this with the original df, minus the Keywords column:
b = pd.concat((df[['Captured_Date','Brand','Coverage']],a),axis=1)
# Captured_Date Brand Coverage 0 1 2
# 0 2017-04-01 X 10 a b c
# 1 2017-04-02 Y 20 c d NaN
Melt this last result to create a row per keyword:
c = pd.melt(b,id_vars=['Captured_Date','Brand','Coverage'],value_name='Keyword')
# Captured_Date Brand Coverage variable Keyword
# 0 2017-04-01 X 10 0 a
# 1 2017-04-02 Y 20 0 c
# 2 2017-04-01 X 10 1 b
# 3 2017-04-02 Y 20 1 d
# 4 2017-04-01 X 10 2 c
# 5 2017-04-02 Y 20 2 NaN
Finally, drop the useless variable column and drop rows where Keyword is missing:
d = c.drop(['variable'],axis=1).dropna(subset=['Keyword'])
# Captured_Date Brand Coverage Keyword
# 0 2017-04-01 X 10 a
# 1 2017-04-02 Y 20 c
# 2 2017-04-01 X 10 b
# 3 2017-04-02 Y 20 d
# 4 2017-04-01 X 10 c
Now you're ready to count by keywords and dates.
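For instance, reusing the names from the question (a minimal sketch), the per-day counts can then be obtained with a plain groupby:

daily_counts = (wordtrends_df
                .groupby(['Keyword', 'Captured_Date'])
                .size()
                .reset_index(name='Count'))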
Problem
I want to pick a fixed-size subset from a list of items such that the count of the most frequently occurring label among the selected items is minimized. In plain English: I have a DataFrame consisting of a list of 10000 items, generated as follows.
import random
import pandas as pd
def RandLet():
    alphabet = "ABCDEFG"
    return alphabet[random.randint(0, len(alphabet) - 1)]
items = pd.DataFrame([{"ID": i, "Label1": RandLet(), "Label2": RandLet(), "Label3": RandLet()} for i in range(0, 10000)])
items.head(3)
Each item has 3 labels. The labels are letters within ABCDEFG, and the order of the labels doesn't matter. An item may be tagged multiple times with the same label.
[Example of the first 3 rows]
ID Label1 Label2 Label3
0 0 G B D
1 1 C B C
2 2 C A B
From this list, I want to pick 1000 items in a way that minimizes the number of occurrences of the most frequently appearing label within those items.
For example, if my DataFrame only consisted of the above 3 items, and I only wanted to pick 2 items, and I picked items with ID #1 and #2, the label 'C' appears 3 times, 'B' appears 2 times, 'A' appears 1 time, and all other labels appear 0 times - The maximum of these is 3. However, I could have done better by picking items #0 and #2, in which label 'B' appears the most frequently, coming in as a count of 2. Since 2 is less than 3, picking items #0 and #2 is better than picking items #1 and #2.
In the case where there are multiple ways to pick 1000 items such that the count of the maximum label occurrence is minimized, returning any of those selections is fine.
What I've got
To me, this feels similar to a knapsack problem in len("ABCDEFG") = 7 dimensions. I want to put 1000 items in the knapsack, and each item's size in the relevant dimension is the sum of the occurrences of the label for that particular item. To that extent, I've built this function to convert my list of items into a list of sizes for the knapsack.
def ReshapeItems(items):
    alphabet = "ABCDEFG"
    item_rebuilder = []
    for i, row in items.iterrows():
        letter_counter = {}
        for letter in alphabet:
            letter_count = sum(row[[c for c in items.columns if "Label" in c]].apply(lambda x: 1 if x == letter else 0))
            letter_counter[letter] = letter_count
        letter_counter["ID"] = row["ID"]
        item_rebuilder.append(letter_counter)
    items2 = pd.DataFrame(item_rebuilder)
    return items2
items2 = ReshapeItems(items)
items2.head(3)
[Example of the first 3 rows of items2]
A B C D E F G ID
0 0 1 0 1 0 0 1 0
1 0 1 2 0 0 0 0 1
2 1 1 1 0 0 0 0 2
Unfortunately, at that point, I am completely stuck. I think that the point of knapsack problems is to maximize some sort of value while keeping the sum of the selected items' sizes under some limit. Here, however, my problem is the opposite: I want to minimize the sum of the selected sizes such that my value is at least some amount.
What I'm looking for
Although a function that takes in items or items2 and returns a subset of these items that meets my specifications would be ideal, I'd be happy to accept any sufficiently detailed answer that points me in the right direction.
Using a different approach, here is my take on your interesting question.
def get_best_subset(
    df: pd.DataFrame, n_rows: int, key_cols: list[str], iterations: int = 50_000
) -> tuple[int, pd.DataFrame]:
    """Subset df in such a way that the frequency
    of most frequent values in key columns is minimum.

    Args:
        df: input dataframe.
        n_rows: number of rows in subset.
        key_cols: columns to consider.
        iterations: max number of tries. Defaults to 50_000.

    Returns:
        Minimum frequency, subset of n rows of input dataframe.
    """
    lowest_frequency: int = df.shape[0] * df.shape[1]
    best_df = pd.DataFrame([])
    # Iterate through possible subsets
    for _ in range(iterations):
        sample_df = df.sample(n=n_rows)
        # Count values in each column, concat and sum counts, get max count
        frequency = (
            pd.concat([sample_df[col].value_counts() for col in key_cols])
            .pipe(lambda df_: df_.groupby(df_.index).sum())
            .max()
        )
        if frequency < lowest_frequency:
            lowest_frequency = frequency
            best_df = sample_df
    return lowest_frequency, best_df.sort_values(by=["ID"]).reset_index(drop=True)
And so, with the toy dataframe constructor you provided:
lowest_frequency, best_df = get_best_subset(
items, 1_000, ["Label1", "Label2", "Label3"]
)
print(lowest_frequency)
# 431
print(best_df)
# Output
ID Label1 Label2 Label3
0 39 F D A
1 46 B G E
2 52 D D B
3 56 D F B
4 72 C D E
.. ... ... ... ...
995 9958 A F E
996 9961 E E E
997 9966 G E C
998 9970 B C B
999 9979 A C G
[1000 rows x 4 columns]
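If an exact optimum is wanted rather than the best of many random samples, the minimax objective from the question can also be written as a small integer program. Below is a rough sketch of that idea (my own addition, not part of the answer above), assuming the pulp package is available and reusing items2 from the question; any MILP solver would do equally well.

import pulp  # assumption: the PuLP package is installed

alphabet = "ABCDEFG"
n = len(items2)

# one binary variable per item: 1 if the item is picked, 0 otherwise
pick = pulp.LpVariable.dicts("pick", range(n), cat="Binary")
# t bounds the count of the most frequent label in the picked subset
t = pulp.LpVariable("max_label_count", lowBound=0)

prob = pulp.LpProblem("min_max_label_count", pulp.LpMinimize)
prob += t  # objective: minimise the largest label count
prob += pulp.lpSum(pick[i] for i in range(n)) == 1000  # pick exactly 1000 items
for letter in alphabet:
    # occurrences of this letter among the picked items must stay at or below t
    prob += pulp.lpSum(int(items2.loc[i, letter]) * pick[i] for i in range(n)) <= t

prob.solve()
chosen_ids = [int(items2.loc[i, "ID"]) for i in range(n) if pick[i].value() == 1]
print(int(pulp.value(prob.objective)), len(chosen_ids))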
I need a way to extract all words that start with 'A' followed immediately by a 6-digit numeric string (e.g. A112233, A000023).
Each cell contains sentences and there could potentially be a user error where they forget to put a space, so if you could account for that as well it would be greatly appreciated.
I've done research into using Python regex and Pandas, but I just don't know enough yet and am kind of on a time crunch.
Suppose your df is constructed from the following code:
import pandas as pd
df1=pd.DataFrame(
{
"columnA":["A194533","A4A556633 system01A484666","A4A556633","a987654A948323a882332A484666","A238B004867","pageA000023lol","a089923","something lol a484876A48466 emoji","A906633 A556633a556633"]
}
)
print(df1)
Output:
columnA
0 A194533
1 A4A556633 system01A484666
2 A4A556633
3 a987654A948323a882332A484666
4 A238B004867
5 pageA000023lol
6 a089923
7 something lol a484876A48466 emoji
8 A906633 A556633a556633
Now let's fetch the matches corresponding to the regex pattern:
result = df1['columnA'].str.extractall(r'([A]\d{6})')
Output:
0
match
0 0 A194533
1 0 A556633
1 A484666
2 0 A556633
3 0 A948323
1 A484666
5 0 A000023
8 0 A906633
1 A556633
And count them:
result.value_counts()
Output:
A556633 3
A484666 2
A000023 1
A194533 1
A906633 1
A948323 1
dtype: int64
Put the unique values from the index into a list:
unique_list = [i[0] for i in result.value_counts().index.tolist()]
Output:
['A556633', 'A484666', 'A000023', 'A194533', 'A906633', 'A948323']
Value counts into a list:
unique_count_list = result.value_counts().values.tolist()
Output:
[3, 2, 1, 1, 1, 1]
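Since result has only one column, taking that column first gives a plain Series, which avoids the tuple index when building the two lists:

counts = result[0].value_counts()
unique_list = counts.index.tolist()         # e.g. ['A556633', 'A484666', ...]
unique_count_list = counts.values.tolist()  # e.g. [3, 2, 1, 1, 1, 1]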
I have the following table:
and for each cell, I'd like to obtain the number of values different from 0.
As an example for the first 2 rows:
denovoLocus10 9 C 0 1 0
denovoLocus12 7 G 3 3 4
After creating a simple test data frame, as the data itself is in a screenshot rather than something copyable:
df = pd.DataFrame({'A': ['0/0/0/0', '0/245/42/0']})
Just extract all integers as strings using a regular expression, replace all strings '0' with np.nan. Then count, within each original-index-level group (note that count excludes NaN automatically):
>>> import numpy as np
>>> df['A_count'] = df['A'].str.extractall(r'(\d+)').replace('0', np.nan) \
...                        .groupby(level=0).count()
>>> df
A A_count
0 0/0/0/0 0
1 0/245/42/0 2
If you want to do it with multiple columns, filter your columns and loop over them with a for loop. (This also could be done with an apply over those columns.) E.g.:
for c in df.filter(regex=r'RA\d{2}_R1_2'):
    df[c + '_count'] = df[c].str.extractall(r'(\d+)').replace('0', np.nan) \
                            .groupby(level=0).count()
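As noted, the extractall/replace/groupby chain can also be swapped for an apply; a minimal split-and-count sketch (assuming the cells are plain '/'-separated strings with no missing values):

for c in df.filter(regex=r'RA\d{2}_R1_2'):
    # split each cell on '/' and count the pieces that are not '0'
    df[c + '_count'] = df[c].str.split('/').apply(lambda parts: sum(p != '0' for p in parts))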
Here is how I would do it in R.
#load package
library(tidyverse)
#here is the data you gave us
test_data <- tibble(Tag = paste0("denovoLocus", c(10, 12, 14, 16, 17)),
Locus = c(9,7,37,5,4),
ref = c("C", "G", "C", "T", "C"),
RA02_R1_2 = c("0/0/0/0", "22/0/262/1", "0/0/0/0", "0/0/0/0", "0/7/0/0"),
RA03_R1_2 = c("0/223/0/0", "22/0/989/15", "0/5/0/0", "0/0/0/0", "0/42/0/0"),
RA06_R1_2 = c("0/0/0/0", "25/3/791/3", "0/4/0/0", "0/0/0/8", "0/31/0/3"))
#split and count the elements that do not equal zero and them collapse them
test_data%>%
mutate(across(RA02_R1_2:RA06_R1_2, ~map_dbl(., ~str_split(.x, pattern = "/") %>%
map_dbl(., ~sum(.x != "0") )))) %>%
unite(col = "final", everything(), sep = " ")
#> # A tibble: 5 x 1
#> final
#> <chr>
#> 1 denovoLocus10 9 C 0 1 0
#> 2 denovoLocus12 7 G 3 3 4
#> 3 denovoLocus14 37 C 0 1 1
#> 4 denovoLocus16 5 T 0 0 1
#> 5 denovoLocus17 4 C 1 1 2
First, using across, I transform the columns that contain the "/"-separated counts: I split each element on "/" with str_split, then count the elements that are not "0" (sum(.x != "0")). It is a little complicated because splitting produces a list, so you need to map over the list to pull the values out. Lastly, unite collapses all the columns into the single string format you wanted.
I have a dataframe with 5 columns, each column containing lists of variable lengths. This is what a row in my dataframe looks like:
A B
1 [aircrafts, they, agreement, airplane] [are, built, made, built]
Now I would like to 'unpack' or 'unstack' these lists so that each cell only contains one value (one word). In the unpacking process, the words in cells from one column should be combined pairwise with the corresponding value in the next column. The result would then be:
A B
1 aircrafts are
2 they built
3 agreement made
4 airplane built
For reference, my full dataframe looks as follows:
obj rel1 \
0 [Boeing] [sells]
1 [aircrafts, they, agreement, airplane] [are, built, made, built]
2 [exception, these] [are, are]
3 [sales, contract] [regulated, consist]
4 [contract] [stipulates]
5 [acquisition] [has]
6 [contract] [managed]
7 [employee] [act]
8 [salesperson, Boeing] [change, ensures]
9 [airlines, airlines] [related, have]
10 [Boeing] [keep]
dep1 rel2 \
0 [aircrafts] [to]
1 [] [on, with]
2 [] [of, of, for]
3 [] [by, of, with, of, of]
4 [elements] [across, as]
5 [details] [of, as, of, for]
6 [] [by]
7 [] [as, for]
8 [] [Given, of, over, for]
9 [company] [to, for]
10 [track, aircrafts] [of, between, to, of]
dep2
0 [companies]
1 [demand, customer]
2 [airplanes, scope, case]
3 [means, contracts, companies, acquisitions, ai...
4 [acquisitions, conditions]
5 [airplane, model, airplane, options]
6 [salesperson]
7 [salesperson, contracts]
8 [term, contracts, time, client]
9 [other, example]
10 [relationships, companies, airlines, buyer]
How can I perform the 'unpacking' and rearranging operations in python? It would be great if these operations could be performed on the dataframe itself. If this proves to be difficult or impossible, is there a way I could rearrange the data in the lists before combining them into a dataframe?
Thank you very much for any help or advice.
I believe you want to be able to read the rows as sentences
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(
obj=[['Boeing'], ['aircrafts', 'they', 'agreement', 'airplane'], ['exception', 'these']],
rel1=[['sells'], ['are', 'built', 'made', 'built'],['are', 'are']],
dep1=[['aircrafts'], [], []],
rel2=[['to'], ['on', 'with'], ['of', 'of', 'for']]
))
# Output dataframe
out = pd.DataFrame()
# Keep track of which set of rows we have already appended to the dataframe
row_counter = 0
# Loop through each row in the input dataframe
for row, value in df.iterrows():
    # Get the max len of each list in this row
    rows_needed = value.map(lambda x: len(x)).max()
    for i in range(rows_needed):
        # Set a name for this row (numeric)
        new_row = pd.Series(name=row_counter + i)
        # Loop through each header and create a single row per item in the list
        for header in value.index:
            # Find if there will be a value here or null
            this_value = np.nan
            if i < len(df.loc[row, header]):
                this_value = df.loc[row, header][i]
            # Add this single result to a series for this row
            new_row = new_row.set_value(header, this_value)
        # Add this row series to the full dataframe
        out = out.append(new_row)
    row_counter += rows_needed
out
Out[1]:
dep1 obj rel1 rel2
0 aircrafts Boeing sells to
1 NaN aircrafts are on
2 NaN they built with
3 NaN agreement made NaN
4 NaN airplane built NaN
5 NaN exception are of
6 NaN these are of
7 NaN NaN NaN for
order = ['obj', 'rel1', 'dep1', 'rel2']
out = out[order]
out
Out[2]:
obj rel1 dep1 rel2
0 Boeing sells aircrafts to
1 aircrafts are NaN on
2 they built NaN with
3 agreement made NaN NaN
4 airplane built NaN NaN
5 exception are NaN of
6 these are NaN of
7 NaN NaN NaN for
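Note that Series.set_value and DataFrame.append have since been removed from pandas, so on a current version the same row-wise unpacking could be sketched with itertools.zip_longest instead (my own variant of the idea above, not the original code):

import pandas as pd
from itertools import zip_longest

rows = []
for _, row in df.iterrows():
    # pair the i-th element of every list in the row, padding shorter lists with NaN
    for combo in zip_longest(*row, fillvalue=float('nan')):
        rows.append(dict(zip(df.columns, combo)))
out = pd.DataFrame(rows, columns=df.columns)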
I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:
df = pd.DataFrame(index = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
columns = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
Specifically, I want to apply the custom function edit_distance() or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:
for seq in df.index:
    for seq2 in df.columns:
        df.loc[seq, seq2] = edit_distance(seq, seq2)
This produces the result I want:
ae azde afgle arlde afghijklbcmde
afghijklde 8 7 5 6 3
afghijklmde 9 8 6 7 2
ade 1 1 3 2 10
afghilmde 7 6 4 5 4
amde 2 1 3 2 9
What is a better way to do this, perhaps using applymap()? Everything I've tried with applymap(), apply, or df.iterrows() has returned errors of the kind AttributeError: 'float' object has no attribute 'index'. Thanks.
Turns out there's an even better way to do this. onepan's dictionary-comprehension answer (below) is good but returns the df index and columns in arbitrary order. Using a nested .apply() accomplishes the same thing at about the same speed and doesn't change the row/column order. The key is not to get hung up on naming the df's rows and columns first and filling in the values second; instead, do it the other way around, initially treating the future index and columns as standalone pandas Series.
series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
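The same order-preserving matrix can also be built in a single step from a nested list comprehension (a small variation on the same idea):

df = pd.DataFrame([[edit_distance(r, c) for c in series_cols] for r in series_rows],
                  index=series_rows, columns=series_cols)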
You could use comprehensions, which speeds it up ~4.5x on my PC:
first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f:{s:edit_distance(f, s) for s in second} for f in first}, orient='index')
# output
#                ae  azde  afgle  arlde  afghijklbcmde
# afghijklde      8     7      5      6              3
# afghijklmde     9     8      6      7              2
# ade             1     1      3      2             10
# afghilmde       7     6      4      5              4
# amde            2     1      3      2              9
# this matches edit_distance('ae', 'afghijklde') == 8, e.g.
Note: I used this code for edit_distance (the first response in your link):
def edit_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
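A quick sanity check against the expected output above:

print(edit_distance('afghijklde', 'ae'))  # 8
print(edit_distance('ade', 'azde'))       # 1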