I have two dataframes, one that contains a large amount of textual data scraped from PDF documents, and another that contains categories and subcategories.
For each subcategory, I need to calculate the percentage of documents that contain at least one mention of the subcategory (e.g. for the subcategory "apple", calculate the percentage of documents that contain "apple"). I'm able to calculate the subcategory percentage correctly. However, when I attempt to populate the dataframe with the value, an incorrect value is displayed.
For each category, I need to calculate the percentage of documents that contain at least one mention of any of its subcategories (e.g. for the category "fruit", calculate the percentage of documents that contain "apple" or "banana"). This value is harder to calculate, as it's not a subtotal of the subcategory percentages: a document that mentions both "apple" and "banana" must be counted only once. I'm trying to calculate it through a combination of groupby and apply, but I've gotten stuck.
The document dataframe looks like this:
The categories dataframe looks like this:
This is what I'm aiming for:
This is what I have so far:
import pandas as pd
documents = {'Text': ['apple apple', 'banana apple', 'carrot carrot carrot', 'spinach','hammer']}
doc_df = pd.DataFrame(data=documents)
print(doc_df,'\n')
categories = {'Category': ['fruit', 'fruit', 'vegetable', 'vegetable'],
              'Subcategory': ['apple', 'banana', 'carrot', 'spinach']}
cat_df = pd.DataFrame(data=categories)
print(cat_df,'\n')
total_docs = doc_df.shape[0]
cat_df['Subcat_Percentage'] = 0
cat_df['Cat_Percentage'] = 0
cat_df = cat_df[['Category', 'Cat_Percentage', 'Subcategory', 'Subcat_Percentage']]
for idx, subcategory in enumerate(cat_df['Subcategory']):
    total_docs_with_subcat = doc_df[doc_df['Text'].str.contains(subcategory)].shape[0]
    subcat_percentage = total_docs_with_subcat / total_docs  # calculation is correct
    cat_df.at[idx, 'Subcat_Percentage'] = subcat_percentage  # wrong value is output
    cat_percentage = cat_df.groupby('Category').apply(lambda x: (doc_df[doc_df['Text'].str.contains(subcategory)].shape[0])  # this doesn't work
    cat_df.at[idx, 'Cat_Percentage'] = cat_percentage
print('\n', cat_df,'\n')
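It could be optimized further, but try this:
# Join each category's subcategories into one regex alternation, e.g. 'apple|banana'
agg_category = cat_df.groupby('Category')['Subcategory'].agg('|'.join)

def percentage_cat(category):
    # Share of documents mentioning at least one subcategory of this category
    return len(doc_df[doc_df['Text'].str.contains(agg_category[category])]) / len(doc_df)

def percentage_subcat(subcategory):
    # Share of documents mentioning this subcategory (len counts rows, whatever the number of columns)
    return len(doc_df[doc_df['Text'].str.contains(subcategory)]) / len(doc_df)

cat_df['percentage_category'] = cat_df['Category'].apply(percentage_cat)
cat_df['sub_percentage'] = cat_df['Subcategory'].apply(percentage_subcat)
cat_df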
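A minimal sketch of the same idea, rebuilt on the question's sample data: taking the mean of the boolean mask returned by str.contains gives the fraction of matching documents directly. The helper name share_of_docs and the re.escape call (which guards against subcategories containing regex characters) are additions for illustration, not part of the answer above.

```python
import re
import pandas as pd

doc_df = pd.DataFrame({'Text': ['apple apple', 'banana apple', 'carrot carrot carrot', 'spinach', 'hammer']})
cat_df = pd.DataFrame({'Category': ['fruit', 'fruit', 'vegetable', 'vegetable'],
                       'Subcategory': ['apple', 'banana', 'carrot', 'spinach']})

def share_of_docs(terms):
    """Fraction of documents containing at least one of the terms (hypothetical helper)."""
    pattern = '|'.join(re.escape(term) for term in terms)
    return doc_df['Text'].str.contains(pattern).mean()

cat_df['Subcat_Percentage'] = cat_df['Subcategory'].apply(lambda s: share_of_docs([s]))

# One value per category, then broadcast back onto each of its rows
per_category = cat_df.groupby('Category')['Subcategory'].apply(share_of_docs)
cat_df['Cat_Percentage'] = cat_df['Category'].map(per_category)
```

Because each document contributes a single True or False to the mask, a document that mentions both "apple" and "banana" is counted once for "fruit".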
I want to create a new column conditional on two other columns in Python.
Below is the dataframe:
name    address
apple   hello1234
banana  happy111
apple   str3333
pie     diary5144
I want to create a new column "want", conditional on the columns "name" and "address".
The rules are as follows:
(1) If the value in "name" is apple, the value in "want" should be the first five characters of "address".
(2) If the value in "name" is banana, the value in "want" should be the first four characters of "address".
(3) If the value in "name" is pie, the value in "want" should be the first three characters of "address".
The dataframe I want look like this:
name    address    want
apple   hello1234  hello
banana  happy111   happ
apple   str3333    str33
pie     diary5144  dia
How can I solve this problem? Thanks!
I hope you are well,
import pandas as pd

# Initialize the data as a dict of lists.
data = {'Name': ['Apple', 'Banana', 'Apple', 'Pie'],
        'Address': ['hello1234', 'happy111', 'str3333', 'diary5144']}

# Create the DataFrame
df = pd.DataFrame(data)

# Add an empty column
df['Want'] = ''

for i in range(len(df)):
    # Write through df.loc instead of df['Want'].iloc[i], which is a chained
    # assignment and may not update df. This relies on the default 0..n-1 index.
    if df['Name'].iloc[i] == "Apple":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:5]
    elif df['Name'].iloc[i] == "Banana":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:4]
    elif df['Name'].iloc[i] == "Pie":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:3]

# Print the DataFrame
print(df)
I hope it helps,
Have a lovely day
I think a more general way of doing this is to create a dict that maps each name to a lambda function, and then apply the matching function to each row of your dataset.
Creating the dataset:
import pandas as pd
data = {
    'name': ['apple', 'banana', 'apple', 'pie'],
    'address': ['hello1234', 'happy111', 'str3333', 'diary5144']
}
df = pd.DataFrame(data)
Defining the conditional dict:
conditionalMap = {
    'apple': lambda s: s[:5],
    'banana': lambda s: s[:4],
    'pie': lambda s: s[:3]
}
Applying the map:
df.loc[:, 'want'] = df.apply(lambda row: conditionalMap[row['name']](row['address']), axis=1)
With the resulting df:
     name    address   want
0   apple  hello1234  hello
1  banana   happy111   happ
2   apple    str3333  str33
3     pie  diary5144    dia
You could do the following:
for string, length in {"apple": 5, "banana": 4, "pie": 3}.items():
    mask = df["name"].eq(string)
    df.loc[mask, "want"] = df.loc[mask, "address"].str[:length]
Iterate over the 3 conditions: string is the string on which the length requirement depends, and the length requirement is stored in length.
Build a mask via df["name"].eq(string) which selects the rows with value string in column name.
Then set column want at those rows to the adequately clipped column address values.
Result for the sample dataframe:
name address want
0 apple hello1234 hello
1 banana happy111 happ
2 apple str3333 str33
3 pie diary5144 dia
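A loop-free variant, sketched under the assumption that every value in name has an entry in the mapping: Series.map looks up the required length for each row, and a list comprehension slices each address accordingly. The names lengths and per_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['apple', 'banana', 'apple', 'pie'],
    'address': ['hello1234', 'happy111', 'str3333', 'diary5144'],
})

# Number of leading characters to keep for each name
lengths = {'apple': 5, 'banana': 4, 'pie': 3}

# Assumes every name is a key of `lengths`; a missing name would map to NaN
# and the slice below would fail.
per_row = df['name'].map(lengths)
df['want'] = [addr[:n] for addr, n in zip(df['address'], per_row)]
```

Keeping the rules in a single dict means a new rule is one more key rather than one more branch.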
I have run into a problem with my Python code.
I am creating a movie filter after I scraped IMDb for certain movies.
However, the problem is that movies with multiple genres show up several times, as identical entries, in my movie filter.
My code is as follows:
def create_movies_drop_down(self):
    movies = []
    if "genres" in self.mappingQuery:
        mapping = [row for row in dataLakeDB["nordisk-film-movie-mapping"].find(dict(genres = self.mappingQuery["genres"]))]
    else:
        mapping = [row for row in dataLakeDB["nordisk-film-movie-mapping"].find()]
    for row in mapping:
        if row["title"] not in movies:
            movies.append(dict(movie = row["title"][0].upper()+row["title"][1:],
                               imdbPageID = row["imdbPageID"]))
    return movies
For example, because the movie "DRUK" has the genres comedy and drama, it shows up twice with the same title and IMDb page ID.
I have tried different arguments, but I can't figure out why this happens.
Can anyone help here?
Edit: The mapping for 1 movie is like this:
[{'_id': '6028139039cba4ae2722f8d9',
  'castList': '[Rosa Salazar, Christoph Waltz, Jennifer Connelly, Mahershala Ali, Ed Skrein]',
  'clientID': 'FILM',
  'dcmCampaignID': [''],
  'director': 'Robert Rodriguez',
  'dv360InsertionOrderID': ['7675053', '7675055', '7675065', '7683006', '7863461'],
  'genres': ['action', 'adventure', 'sci-fi'],
  'imdbPageID': '0437086',
  'imdbPageURL': 'https://www.imdb.com/title/tt0437086',
  'imdbRating': '7.3',
  'marathonCountryID': 'PMDK',
  'posterURL': 'https://m.media-amazon.com/images/M/MV5BMTQzYWYwYjctY2JhZS00NTYzLTllM2UtZWY5ZTk0NmYwYzIyXkEyXkFqcGdeQXVyMzgxODM4NjM#.V1_UX182_CR0,0,182,268_AL.jpg',
  'title': 'alita: battle angel\xa0(2019)'}
Since movies is a list of dictionaries (which are unhashable), converting it to a set to get rid of duplicates will not work. Instead, you have to iterate and append each movie to the movies list only if it is not already there. You have already tried to do this with the if statement inside the for loop. The problem is that the condition is always True: you are checking whether the title string is in a list of dictionaries, rather than checking for the whole dictionary object. You can fix it like this:
def create_movies_drop_down(self):
    movies = []
    if "genres" in self.mappingQuery:
        mapping = [row for row in
                   dataLakeDB["nordisk-film-movie-mapping"].find(dict(genres=self.mappingQuery["genres"]))]
    else:
        mapping = [row for row in dataLakeDB["nordisk-film-movie-mapping"].find()]
    for row in mapping:
        movie_dic = dict(movie=row["title"][0].upper() + row["title"][1:],
                         imdbPageID=row["imdbPageID"])
        if movie_dic not in movies:
            movies.append(movie_dic)
    return movies
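The `movie_dic not in movies` check scans the whole list for every row, which can get slow for a large collection. A possible alternative, sketched here with made-up sample rows instead of the MongoDB query, remembers the IDs already seen in a set. The function name unique_movies and the sample data are hypothetical, and the sketch assumes that imdbPageID uniquely identifies a movie.

```python
def unique_movies(rows):
    """Return one {'movie', 'imdbPageID'} dict per distinct imdbPageID, in first-seen order."""
    seen_ids = set()
    movies = []
    for row in rows:
        if row["imdbPageID"] in seen_ids:
            continue  # same movie returned again, e.g. once per genre
        seen_ids.add(row["imdbPageID"])
        title = row["title"]
        movies.append(dict(movie=title[0].upper() + title[1:],
                           imdbPageID=row["imdbPageID"]))
    return movies

# Hypothetical rows: the same movie appears once per genre
sample_rows = [
    {"title": "druk", "imdbPageID": "0000001", "genres": "comedy"},
    {"title": "druk", "imdbPageID": "0000001", "genres": "drama"},
    {"title": "alita: battle angel", "imdbPageID": "0437086", "genres": "action"},
]
result = unique_movies(sample_rows)
```

Strings are hashable, so the set lookup works even though the movie dictionaries themselves cannot be put in a set.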
I need a little help. I know this is probably very easy; I tried, but I didn't reach the goal.
# Import pandas library
import pandas as pd
data1 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600]]
df1 = pd.DataFrame(data1, columns = ['Country', 'Bottle_Weight'])
data2 = [['India', 350], ['India', 600],['India', 200], ['Bangladesh', 350],['Bangladesh', 600]]
df2 = pd.DataFrame(data2, columns = ['Country', 'Bottle_Weight'])
data3 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600],['Bangladesh', 200]]
df3 = pd.DataFrame(data3, columns = ['Country', 'Bottle_Weight'])
So basically I want to create a function that checks the mapping by comparing the unique bottle weights of every other country with those of the first country.
For the 1st dataframe, it should return the text: All unique values of 'Bottle_Weight' are mapped with all unique countries
For the 2nd dataframe, it should return the text: 'Country_name' not mapped 'Column name' 'value'
In this case: 'Bangladesh' not mapped with 'Bottle_Weight' 200
For the 3rd dataframe, it should return the text: All unique values of 'Bottle_Weight' are mapped with all unique countries (and on a new line) 'Country_name' mapped with new value '200'
It is not a particularly efficient algorithm, but I think this should get you the results you are looking for.
import numpy as np

def check_weights(df):
    success = True
    countries = df['Country'].unique()
    # The first country's bottle weights are the reference
    first_weights = df.loc[df['Country'] == countries[0], 'Bottle_Weight'].unique()
    for country in countries[1:]:
        weights = df.loc[df['Country'] == country, 'Bottle_Weight'].unique()
        for weight in first_weights:
            # Report every reference weight that this country lacks
            if not np.any(weights == weight):
                success = False
                print(f"{country} does not have bottle weight {weight}")
    if success:
        print("All bottle weights are shared with another country")
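The same comparison can be sketched with Python sets, which also makes it easy to report weights that a later country has but the first one lacks (the third dataframe in the question). The function name compare_weights and the exact message wording are assumptions; adapt them to the text you need.

```python
import pandas as pd

def compare_weights(df):
    """Compare each country's unique bottle weights with the first country's (hypothetical helper)."""
    # sort=False keeps the countries in order of first appearance
    weights = df.groupby('Country', sort=False)['Bottle_Weight'].apply(set)
    reference = weights.iloc[0]
    missing, extra = [], []
    for country, country_weights in weights.iloc[1:].items():
        for weight in sorted(reference - country_weights):
            missing.append(f"{country} not mapped with Bottle_Weight {weight}")
        for weight in sorted(country_weights - reference):
            extra.append(f"{country} mapped with new value {weight}")
    header = missing or ["All unique values of Bottle_Weight are mapped with all unique countries"]
    return header + extra
```

Returning the messages as a list instead of printing them leaves the caller free to join them with newlines or to inspect them in a test.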
I have a large JSON file with 24 columns and 700k rows. One of the columns contains a dictionary, so I selected it as shown below:
dataset = pd.read_json('ecommerce-events - Copia.json', lines=True)
dataset.loc[dataset['eventType']=="transaction"]
The transaction column contains "price". I want to sum all prices multiplied by their quantities. How do I do this with pandas?
'url': 'da7caa77e2729e12b32a9d7d1a324652ce2264a6',
'referrer': '6e03ee62984224d0c0f08d4b68b819297d7f4d14',
'order': 5545, # unique transaction id
'orderItems': [{ # list of products bought in that transaction
'product': 16493, # product id
'price': 19.9, # product unit price
'quantity': 1.0
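import pandas as pd

def summation(x):
    # Cost of one item: unit price times quantity
    value = x["price"] * x["qun"]
    return value

df = pd.DataFrame({"Transaction": [[{"price": 23, "qun": 2}], [{"price": 25, "qun": 2}], [{"price": 24, "qun": 2}]]})
# Sum over every item in each row's list, not only the first one
df["summation_value"] = df["Transaction"].apply(lambda items: sum(summation(item) for item in items))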
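For the orderItems layout shown in the question, where each row holds a list of item dicts, one possible approach is to explode the lists into one item per row and then multiply the columns. This is a sketch on made-up sample data: the values, the product IDs 1 and 2, and the variable names are illustrative, and it assumes every item has price and quantity keys.

```python
import pandas as pd

# Hypothetical sample in the shape shown in the question
dataset = pd.DataFrame({
    "eventType": ["transaction", "transaction", "pageview"],
    "order": [5545, 5546, None],
    "orderItems": [
        [{"product": 16493, "price": 19.9, "quantity": 1.0}],
        [{"product": 1, "price": 10.0, "quantity": 2.0},
         {"product": 2, "price": 5.0, "quantity": 3.0}],
        None,
    ],
})

transactions = dataset.loc[dataset["eventType"] == "transaction"]

# One row per purchased item; the original row label is kept as the index
items = transactions["orderItems"].explode().dropna()
items_df = pd.DataFrame(items.tolist(), index=items.index)
items_df["line_total"] = items_df["price"] * items_df["quantity"]

total_revenue = items_df["line_total"].sum()
revenue_per_order = items_df["line_total"].groupby(level=0).sum()
```

Because the exploded rows keep the label of the transaction they came from, grouping on the index gives a per-transaction total alongside the grand total.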
I am trying to understand how to apply a function within a groupby, i.e. to each of the groups of a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Stock' : ['apple', 'ford', 'google', 'samsung','walmart', 'kroger'],
                   'Sector' : ['tech', 'auto', 'tech', 'tech','retail', 'retail'],
                   'Price': np.random.randn(6),
                   'Signal' : np.random.randn(6)}, columns= ['Stock','Sector','Price','Signal'])
dfg = df.groupby(['Sector'],as_index=False)
type(dfg)
pandas.core.groupby.DataFrameGroupBy
I want to get the sum ( Price * (1/Signal) ) group by 'Sector'.
i.e. The resulting output should look like
Sector | Value
auto   | 0.744944
retail | -0.572164053
tech   | -1.454632
I can get the results by creating separate dataframes, but I was looking for a way to
operate within each of the grouped (sector) frames.
I can find the mean or sum of Price:
dfg.agg({'Price' : [np.mean, np.sum] }).head(2)
but not get sum ( Price * (1/Signal) ), which is what I need.
Thanks,
You provided random data, so there is no way we can get the exact number that you got. But based on what you just described, I think the following will do:
In [121]:
(df.Price/df.Signal).groupby(df.Sector).sum()
Out[121]:
Sector
auto -1.693373
retail -5.137694
tech -0.984826
dtype: float64
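With fixed numbers instead of random data, the result can be checked by hand. The sketch below uses made-up Price and Signal values and returns a two-column frame shaped like the one asked for; the column name Value is taken from the question's desired output.

```python
import pandas as pd

df = pd.DataFrame({
    'Stock': ['apple', 'ford', 'google', 'samsung', 'walmart', 'kroger'],
    'Sector': ['tech', 'auto', 'tech', 'tech', 'retail', 'retail'],
    'Price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],    # made-up values
    'Signal': [1.0, 2.0, 1.0, 2.0, 5.0, 3.0],   # made-up values
})

# Compute Price * (1 / Signal) per row first, then sum it within each sector
result = (
    df.assign(Value=df['Price'] / df['Signal'])
      .groupby('Sector', as_index=False)['Value']
      .sum()
)
```

For example, the tech rows give 1/1 + 3/1 + 4/2 = 6. Computing the row-wise ratio before grouping keeps the aggregation a plain sum, so no custom function has to be passed to the groupby.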