I have the following dataframe named marketing, from which I would like to extract the value of source= in each row. Is there a way to write a general regex function that I can also apply to other columns to extract the word after the equals sign?
Data
source=book,social_media=facebook,ads=Facebook
source=book,ads=Facebook,customer=2
cost=2, customer=3
I'm using Python and I have tried the following:
df = pd.DataFrame()
def find_keywords(row_string):
    tags = [x for x in row_string if x.startswith('source=')]
    return tags
df['Data'] = marketing['Data'].apply(lambda row : find_keywords(row))
Is there a more efficient way to extract these values and place them into columns, like this:
source social_media ads customer costs
book facebook facebook - -
book - facebook 2 -
You can split each string value of the column into a dict, then use pd.json_normalize to convert the dicts into columns.
out = pd.json_normalize(marketing['Data'].apply(lambda x: dict([map(str.strip, i.split('=')) for i in x.split(',')]))).dropna(subset='source')
print(out)
source social_media ads customer cost
0 book facebook Facebook NaN NaN
1 book NaN Facebook 2 NaN
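If you also want a small, general regex helper that you can apply to other columns one key at a time, a minimal sketch could look like this (the function name and the keys passed to it are illustrative):

def extract_value(series, key):
    # Grab the text after "key=" up to the next comma (or end of string).
    return series.str.extract(rf'{key}\s*=\s*([^,]+)')[0].str.strip()

marketing['source'] = extract_value(marketing['Data'], 'source')
marketing['customer'] = extract_value(marketing['Data'], 'customer')

Rows that don't contain the key simply get NaN.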
Here's another option:
Sample dataframe marketing is:
marketing = pd.DataFrame(
    {"Data": ["source=book,social_media=facebook,ads=Facebook",
              "source=book,ads=Facebook,customer=2",
              "cost=2, customer=3"]}
)
Data
0 source=book,social_media=facebook,ads=Facebook
1 source=book,ads=Facebook,customer=2
2 cost=2, customer=3
Now this
result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
          .str.split(r"\s*=\s*", expand=True).pivot(columns=0))
does produce
1
0 ads cost customer social_media source
0 Facebook NaN NaN facebook book
1 Facebook NaN 2 NaN book
2 NaN 2 3 NaN NaN
which is almost what you're looking for, except for the extra column level and the column ordering. So the following modification
result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
          .str.split(r"\s*=\s*", expand=True).rename(columns={0: "columns"})
          .pivot(columns="columns").droplevel(level=0, axis=1))
result = result[["source", "social_media", "ads", "customer", "cost"]]
should fix that:
columns source social_media ads customer cost
0 book facebook Facebook NaN NaN
1 book NaN Facebook 2 NaN
2 NaN NaN NaN 3 2
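For completeness, a slightly more compact variant of the same idea uses str.extractall (just a sketch; the regex assumes keys and values never contain '=' or ','):

pairs = marketing["Data"].str.extractall(r"\s*([^=,]+?)\s*=\s*([^,]+)")
pairs.columns = ["key", "value"]
result = pairs.droplevel("match").pivot(columns="key", values="value")

The columns come out in lexicographic order, so reordering them as above may still be needed.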
I have a table of work experiences where each row represents a job in chronological order from the first job to the most recent job. For data science purposes I'm trying to create a new table based on this table that displays new job attributes and old job attributes on the same row. For example, the original table would be like:
uniqueID  personID  startdate  enddate  title    functions
1         A1        1/1/21     12/1/21  Analyst  data science
2         A1        1/1/22     12/1/22  Manager  admin
The new table would be something like this:
uniqueID  personID  new_title  new_function  old_title  old_function
1         A1        Analyst    data science  nan        nan
2         A1        Manager    admin         Analyst    data science
I tried to use some groupby variations but haven't been able to get this result.
If I understand correctly, you're looking for a shift:
cols = ['title', 'functions']
df[['old_' + c for c in cols]] = df.groupby('personID')[cols].shift(1)
df = df.drop(['startdate', 'enddate'], axis=1).rename({c: 'new_' + c for c in cols}, axis=1)
Output:
>>> df
uniqueID personID new_title new_functions old_title old_functions
0 1 A1 Analyst data science NaN NaN
1 2 A1 Manager admin Analyst data science
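As a self-contained sketch, rebuilding the sample table from the question so the snippet can be run directly:

import pandas as pd

df = pd.DataFrame({
    "uniqueID": [1, 2],
    "personID": ["A1", "A1"],
    "startdate": ["1/1/21", "1/1/22"],
    "enddate": ["12/1/21", "12/1/22"],
    "title": ["Analyst", "Manager"],
    "functions": ["data science", "admin"],
})

cols = ['title', 'functions']
# Within each person, pull the previous job's attributes onto the current row.
df[['old_' + c for c in cols]] = df.groupby('personID')[cols].shift(1)
df = df.drop(['startdate', 'enddate'], axis=1).rename({c: 'new_' + c for c in cols}, axis=1)
print(df)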
companies.xlsx
company To
1 amazon hi#test.de
2 google bye#test.com
3 amazon hi#tld.com
4 starbucks hi#test.de
5 greyhound bye#tuz.de
emails.xlsx
hi#test.de bye#test.com hi#tld.com ...
1 amazon google microsoft
2 starbucks amazon tesla
3 Grey Hound greyhound
4 ferrari
So I have the two Excel sheets above and read both of them:
file1 = pd.ExcelFile('data/companies.xlsx')
file2 = pd.ExcelFile('data/emails.xlsx')
df_companies = file1.parse('sheet1')
df_emails = file2.parse('sheet1')
What I'm trying to accomplish is:
check if df_companies['To'] is an existing header in df_emails
if the header exists in df_emails, search that header's column for df_companies['company']
if the company is found, add a column to df_companies and fill in '1', if not fill in '0'
E.g.: company amazon has the To email hi#test.de in companies.xlsx. In emails.xlsx the header hi#test.de exists and amazon is also found in that column - so it's a '1'.
Does anyone know how to accomplish this?
Here's one approach. Convert df_emails to a dictionary and map it to df_companies. Then, compare the mapped column with df_companies['company'].
df_companies['check'] = df_companies['To'].map(df_emails.to_dict(orient='list')).fillna('')
df_companies['check'] = df_companies.apply(lambda x: x['company'] in x['check'], axis=1).astype(int)
company To check
1 amazon hi#test.de 1
2 google bye#test.com 1
3 amazon hi#tld.com 0
4 starbucks hi#test.de 1
5 greyhound bye#tuz.de 0
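For reference, a self-contained sketch using hand-built frames in place of the two Excel files (the exact cell layout of emails.xlsx is partly assumed from the listing above):

import pandas as pd

df_companies = pd.DataFrame({
    "company": ["amazon", "google", "amazon", "starbucks", "greyhound"],
    "To": ["hi#test.de", "bye#test.com", "hi#tld.com", "hi#test.de", "bye#tuz.de"],
})
df_emails = pd.DataFrame({
    "hi#test.de": ["amazon", "starbucks", "Grey Hound"],
    "bye#test.com": ["google", "amazon", "greyhound"],
    "hi#tld.com": ["microsoft", "tesla", None],
})

# Map each 'To' address to the list of companies in the matching column,
# then flag whether the row's own company appears in that list.
df_companies['check'] = df_companies['To'].map(df_emails.to_dict(orient='list')).fillna('')
df_companies['check'] = df_companies.apply(lambda x: x['company'] in x['check'], axis=1).astype(int)
print(df_companies)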
I have a dataframe with the following columns
Movie Rating Genre_0 Genre_1 Genre_2
MovieA 8.9 Action Comedy Family
MovieB 9.1 Horror NaN NaN
MovieC 4.4 Comedy Family Adventure
MovieD 7.7 Action Adventure NaN
MovieE 9.5 Adventure Comedy NaN
MovieF 7.5 Horror NaN NaN
MovieG 8.6 Horror NaN NaN
I'd like to get a dataframe which has the value count for each genre and the average rating across all rows in which the genre appears
Genre value_count Average_Rating
Action 2 8.3
Comedy 3 7.6
Horror 3 8.4
Family 2 6.7
Adventure 3 7.2
I have tried the following code and am able to get the value counts. However, I am unable to get the average rating of each genre based on the rows in which it appears. Any form of help is much appreciated, thank you.
#create a list for the genre columns
genre_col = [col for col in df if col.startswith('Genre_')]
#get value counts of genres
genre_counts = df[genre_col].apply(pd.Series.value_counts).sum(1).to_frame(name='Count')
genre_counts.index.name = 'Genre'
genre_counts = genre_counts.reset_index()
You can .melt the dataframe, then group the melted frame on genre and aggregate using a dictionary that specifies the columns and their corresponding aggregation functions:
# filter and melt the dataframe
m = df.filter(regex=r'Rating|Genre').melt('Rating', value_name='Genre')
# group and aggregate
dct = {'Value_Count': ('Genre', 'count'), 'Average_Rating': ('Rating', 'mean')}
df_out = m.groupby('Genre', as_index=False).agg(**dct)
>>> df_out
Genre Value_Count Average_Rating
0 Action 2 8.30
1 Adventure 3 7.20
2 Comedy 3 7.60
3 Family 2 6.65
4 Horror 3 8.40
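As a runnable sketch with the sample data from the question:

import pandas as pd

df = pd.DataFrame({
    "Movie":   ["MovieA", "MovieB", "MovieC", "MovieD", "MovieE", "MovieF", "MovieG"],
    "Rating":  [8.9, 9.1, 4.4, 7.7, 9.5, 7.5, 8.6],
    "Genre_0": ["Action", "Horror", "Comedy", "Action", "Adventure", "Horror", "Horror"],
    "Genre_1": ["Comedy", None, "Family", "Adventure", "Comedy", None, None],
    "Genre_2": ["Family", None, "Adventure", None, None, None, None],
})

m = df.filter(regex=r'Rating|Genre').melt('Rating', value_name='Genre')
df_out = m.groupby('Genre', as_index=False).agg(Value_Count=('Genre', 'count'),
                                                Average_Rating=('Rating', 'mean'))
print(df_out)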
The process of encoding each genre as its value count is called frequency encoding; it can be done with this code:
df_frequency_map = df.Genre_0.value_counts().to_dict()
df['Genre0_frequency_map'] = df.Genre_0.map(df_frequency_map)
To add the average rating as a feature in your dataset, I think you can do the same thing, but calculate the per-genre average rating before calling to_dict():
df_mean_map = df.groupby('Genre_0')['Rating'].mean().to_dict()
df['Genre0_mean_frequency_map'] = df.Genre_0.map(df_mean_map)
I want to add "NSW" to the end of each town name in a pandas dataframe. The dataframe currently looks like this:
0 Parkes NaN
1 Forbes NaN
2 Yanco NaN
3 Orange NaN
4 Narara NaN
5 Wyong NaN
I need every town to also have the word NSW added to it
Try with
df['Name'] = df['Name'] + 'NSW'
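If you want a space between the town name and the suffix (e.g. Parkes NSW), and assuming the town column really is called Name, a small variation of the same idea:

# 'Name' is an assumed label for the town column shown above.
df['Name'] = df['Name'].astype(str) + ' NSW'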
I have a dataframe containing the query part of multiple urls.
For eg.
in=2015-09-19&stars_4=yes&min=4&a=3&city=New+York,+NY,+United+States&out=2015-09-20&search=1\n
in=2015-09-14&stars_3=yes&min=4&a=3&city=London,+United+Kingdom&out=2015-09-15&search=1\n
in=2015-09-26&Filter=175&min=5&a=2&city=New+York,+NY,+United+States&out=2015-09-27&search=2\n
My desired dataframe should be:
in Filter stars min a max city country out search
--------------------------------------------------------------------------------
2015-09-19 NAN stars_4 4 3 NAN NY US 2015-09-20 1
2015-09-14 NAN stars_3 4 3 NAN LONDON UK 2015-09-15 1
2015-09-26 175 NAN 5 2 NAN NY US 2015-09-27 2
Is there any easy way out for this using regex?
Any help will be much appreciated! Thanks in advance!
A quick-and-dirty fix would be to just use list comprehensions:
json_data = [{c[0]: c[1] for c in [b.split('=') for b in line.strip().split('&')]}
             for line in open('data_file.txt')]
df = pd.DataFrame.from_records(json_data)
This won't solve your location classification issues, but will get you a better dataframe from which to work.
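If the query strings already live in a dataframe column rather than a file, urllib.parse.parse_qs can do the splitting and also URL-decodes values such as New+York (a sketch; the column name Query is assumed):

from urllib.parse import parse_qs

import pandas as pd

urls = pd.DataFrame({"Query": [
    "in=2015-09-19&stars_4=yes&min=4&a=3&city=New+York,+NY,+United+States&out=2015-09-20&search=1",
    "in=2015-09-14&stars_3=yes&min=4&a=3&city=London,+United+Kingdom&out=2015-09-15&search=1",
]})

# parse_qs returns {key: [values]}; keep the first value for each key.
records = urls["Query"].map(lambda q: {k: v[0] for k, v in parse_qs(q.strip()).items()})
df = pd.DataFrame(list(records))
print(df)

Like the snippet above, this still leaves splitting city into city/country and normalising stars_4/stars_3 as separate steps.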