retrieve multiple substrings from a string in pandas dataframe

I have a dataframe that contains strings of email addresses in the format:
d = {'Country':'A', 'Email':'123#abc.com,456#def.com,789#ghi.com'}
df = pd.DataFrame(data=d, index=[0])  # index is required because all values are scalars
and I want only the usernames of the emails. So the new dataframe should look like this:
d = {'Country':'A', 'Email':'123,456,789'}
df1 = pd.DataFrame(data=d, index=[0])
The best way I could think of is to split the original string by comma, delete the domain part of each email, and join the list back together. Is there a better way to solve this problem?

If you want a string as output, you can remove everything from each # up to the next comma. Use str.replace with the #[^,]+ regex:
df['Email'] = df['Email'].str.replace(r'#[^,]+', '', regex=True)
Output:
Country Email
0 A 123,456,789
For a list you could use str.findall:
df['Email'] = df['Email'].str.findall(r'[^,]+(?=#)')
Output:
Country Email
0 A [123, 456, 789]

This is a regex question, not really a pandas question, but here's a solution that returns a list (which you can join together as a string):
import re
df['Email'].apply(lambda s: re.findall(r'\w+(?=#)', s))
Output:
0 [123, 456, 789]
Name: Email, dtype: object
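If a single comma-separated string is wanted instead of a list, the matches can be joined back together per row. The following is a minimal sketch, assuming the sample data from the question (the values are wrapped in lists here so the DataFrame can be built directly):

```python
import pandas as pd

# Sample data from the question, wrapped in lists to build a one-row frame.
df = pd.DataFrame({'Country': ['A'],
                   'Email': ['123#abc.com,456#def.com,789#ghi.com']})

# Find every run of word characters that is immediately followed by '#',
# then join the matches of each row with commas.
usernames = df['Email'].str.findall(r'\w+(?=#)').str.join(',')
print(usernames.tolist())  # ['123,456,789']
```

Series.str.join works on a column of lists, so no explicit apply is needed for this step.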

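Try this (note that this pattern captures the domain parts after the #, not the usernames asked for in the question):
import pandas as pd
d = {'Country':['A', 'B'], 'Email':['123#abc.com,456#def.com,789#ghi.com', '134#abc.com,436#def.com,229#ghi.com']}
df = pd.DataFrame(d)
# raw string with an escaped dot so that "." only matches a literal dot
df['Email'] = df['Email'].str.findall(r'#\w+\.com').apply(', '.join).str.replace('#','')
df
Output
Country Email
0 A abc.com, def.com, ghi.com
1 B abc.com, def.com, ghi.com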

Here is one way to do it:
import re
# iterate over the dictionary values and remove everything from # up to the next comma or quote
d = {k: re.sub(r"#[^,']+", '', v) for k, v in d.items()}
d
{'Country': 'A', 'Email': '123,456,789'}

Related

Dataframe or CSV to JSON object array

This is probably an easy one for the python pros. So, please forgive my naivety.
Here is my data:
0 xyz#tim.com 13239169023 jane bo
1 tim#tim.com 13239169023 lane ko
2 jim#jim.com 13239169023 john do
Here is what I get as output:
[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]
My Code:
df = pd.read_csv('profiles.csv')
print(df)
data = df.to_json(orient="records")
print(data)
Output I want:
{"profiles":[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]}
Adding below does NOT work.
output = {"profiles": data}
It adds single quotes around the data, and profiles is NOT in double quotes (basically NOT valid JSON), like so:
{'profiles': '[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]'}

You can use df.to_dict to output to a dictionary instead of a json-formatted string:
import pandas as pd
df = pd.read_csv('data.csv')
data = df.to_dict(orient='records')
output = {'profiles': data}
print(output)
Returns:
{'profiles': [{'0': 1, 'xyz#tim.com': 'tim#tim.com', '13239169023': 13239169023, 'jane': 'lane', 'bo': 'ko'}, {'0': 2, 'xyz#tim.com': 'jim#jim.com', '13239169023': 13239169023, 'jane': 'john', 'bo': 'do'}]}

I think I found a solution.
Changes:
data = df.to_dict(orient="records")
output = {}
output["profiles"] = data
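Printing a Python dict shows its repr, which uses single quotes, so the printed text is still not JSON. One way to get a valid JSON string is to serialize the finished dict once with json.dumps. This is a minimal sketch under stated assumptions: the small in-memory DataFrame stands in for profiles.csv, and the column names are taken from the desired output in the question. It reuses the to_json call from the question and parses it back, so that only plain Python values reach json.dumps.

```python
import json
import pandas as pd

# Stand-in for pd.read_csv('profiles.csv'); column names follow the desired output.
df = pd.DataFrame({
    'email': ['xyz#tim.com', 'tim#tim.com'],
    'phone_number': [13239169023, 13239169023],
    'first_name': ['jane', 'lane'],
    'last_name': ['bo', 'ko'],
})

# to_json gives a JSON string; json.loads turns it into a list of plain dicts.
records = json.loads(df.to_json(orient='records'))

# Wrap the records, then serialize the whole structure in one step.
json_text = json.dumps({'profiles': records})
print(json_text)
```

The result is a string with double-quoted keys that can be written to a file or sent over the network as is.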

Create a nested dict containing list from a file

For example, I want to convert a txt file containing
Math, Calculus, 5
Math, Vector, 3
Language, English, 4
Language, Spanish, 4
into the dictionary:
data={'Math':{'name':['Calculus', 'Vector'], 'score':[5,3]}, 'Language':{'name':['English', 'Spanish'], 'score':[4,4]}}
I am having trouble appending values to create the lists inside the inner dict. I'm very new to this and would not understand import statements. Thank you so much for all your help!

For each line, find the 3 values, then add them to a dict structure:
from pathlib import Path
result = {}
for row in Path("test.txt").read_text().splitlines():
    subject_type, subject, score = row.split(", ")
    if subject_type not in result:
        result[subject_type] = {'name': [], 'score': []}
    result[subject_type]['name'].append(subject)
    result[subject_type]['score'].append(int(score))
You can simplify it with a defaultdict, which creates the inner mapping if the key isn't already present:
from collections import defaultdict
from pathlib import Path
result = defaultdict(lambda: {'name': [], 'score': []})
for row in Path("test.txt").read_text().splitlines():
    subject_type, subject, score = row.split(", ")
    result[subject_type]['name'].append(subject)
    result[subject_type]['score'].append(int(score))
With pandas.DataFrame you can directly load the formatted data and output it in the format you want:
import pandas as pd
df = pd.read_csv("test.txt", sep=", ", engine="python", names=['key', 'name', 'score'])
df = df.groupby('key').agg(list)
result = df.to_dict(orient='index')
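Since the question asks to avoid imports, the same idea can also be sketched with only built-in dict methods. This is an illustrative variant, not part of the answer above; the lines list is a stand-in for the file contents so the example runs on its own.

```python
# Stand-in for the file contents. With a real file, the list could be read with:
#     with open("test.txt") as f:
#         lines = f.read().splitlines()
lines = [
    "Math, Calculus, 5",
    "Math, Vector, 3",
    "Language, English, 4",
    "Language, Spanish, 4",
]

result = {}
for line in lines:
    subject_type, subject, score = line.split(", ")
    # setdefault returns the existing inner dict, or stores and returns a new one.
    entry = result.setdefault(subject_type, {'name': [], 'score': []})
    entry['name'].append(subject)
    entry['score'].append(int(score))

print(result)
```

dict.setdefault plays the same role as the "if subject_type not in result" check, in a single call.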

From your data:
data={'Math':{'name':['Calculus', 'Vector'], 'score':[5,3]},
'Language':{'name':['English', 'Spanish'], 'score':[4,4]}}
If you want to append to the list inside your dictionary, you can do:
data['Math']['name'].append('Algebra')
data['Math']['score'].append(4)
If you want to add a new dictionary, you can do:
data['Science'] = {'name':['Chemistry', 'Biology'], 'score':[2,3]}
I am not sure if that is what you wanted but I hope it helps!

Convert pandas dataframe columns into nested python dictionary

I want to create a Python dictionary from pandas data frame column 2 (source) and column 3 (description), grouped by column 1 (title).
Also, I want to get values for only the provided titles:
titles = ['Test1','Test2']
title source description
1 Test1 ABC description1
2 Test2 ABC description2
3 Test2 DEF description3
4 Test3 XYZ description4
output = {'Test1':{'ABC':'description1'},'Test2':{'ABC':'description2','DEF':'description3'}}

First filter with boolean indexing using Series.isin, then use GroupBy.apply with a lambda function to build a Series of dicts, and finally call Series.to_dict:
titles = ['Test1','Test2']
d = (df[df['title'].isin(titles)]
     .groupby('title')[['source','description']]
     .apply(lambda x: dict(x.to_numpy()))
     .to_dict())
print (d)
{'Test1': {'ABC': 'description1'}, 'Test2': {'ABC': 'description2', 'DEF': 'description3'}}

You can group the dataframe by title and then use Python's zip function to create the inner dictionary from source and description:
final_dict = {}
all_groups = df.groupby('title')
for title in titles:
    title_group = all_groups.get_group(title)
    source_desc = dict(zip(title_group.source, title_group.description))
    final_dict[title] = source_desc  # key by the title string; the group DataFrame is not hashable
print(final_dict)

Try this:
result = {}
filter_ = ['Test1','Test2']
for x in df[df['title'].isin(filter_)].to_dict(orient='records'):
    result.setdefault(x['title'], {}).update({x['source']: x['description']})
print(result)
{'Test1': {'ABC': 'description1'}, 'Test2': {'ABC': 'description2', 'DEF': 'description3'}}
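The answers above assume that df already exists. For reference, here is a self-contained sketch that rebuilds the sample table from the question and combines the isin filter with a dict comprehension over the groups. The variable names selected and result are only illustrative.

```python
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    'title': ['Test1', 'Test2', 'Test2', 'Test3'],
    'source': ['ABC', 'ABC', 'DEF', 'XYZ'],
    'description': ['description1', 'description2', 'description3', 'description4'],
})
titles = ['Test1', 'Test2']

# Keep only the requested titles, then build one inner dict per group.
selected = df[df['title'].isin(titles)]
result = {
    title: dict(zip(group['source'], group['description']))
    for title, group in selected.groupby('title')
}
print(result)
```

If a source appears twice under the same title, the later description overwrites the earlier one, as in any dict.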

How to transform a string column to a list of string list column in pandas

I have a pandas dataframe like this:
df = pd.DataFrame ({'names': ['John;Joe;Tom', 'Justin', 'Ryan;John']})
names
0 John;Joe;Tom
1 Justin
2 Ryan;John
I want to transform the column to a string list column like below:
0 ['John', 'Joe', 'Tom']
1 ['Justin']
2 ['Ryan', 'John']
I did the following:
df.names.apply(lambda x: x.split(';'))
what I got is:
0 [John, Joe, Tom]
1 [Justin]
2 [Ryan, John]
I lost all the quotes. Does anyone know how to fix that? Thanks a lot.

You never lost the quotes.
pandas simply does not show quotes around strings when it displays lists inside a Series or DataFrame.
Check the following example.
df = pd.DataFrame ({'names': ['John;Joe;Tom', 'Justin', 'Ryan;John']})
df.names = df.names.apply(lambda x: x.split(';'))
df.names.iloc[0]
The output is ['John', 'Joe', 'Tom'] as you expected.
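To confirm that the elements are still ordinary strings, the types can be inspected directly. A short sketch using the same sample data, with Series.str.split used in place of the lambda:

```python
import pandas as pd

df = pd.DataFrame({'names': ['John;Joe;Tom', 'Justin', 'Ryan;John']})
df['names'] = df['names'].str.split(';')

first = df['names'].iloc[0]
print(type(first))     # <class 'list'>
print(repr(first))     # ['John', 'Joe', 'Tom']
print(type(first[0]))  # <class 'str'>
```

Only the tabular display of the Series omits the quotes; the underlying values are a list of str objects.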

As Gilseung mentioned, the data is already what you wanted. But if you really insist on adding quotes as extra characters to your output, try this:
def add(x):
    temp_list = x.split(';')
    temp_list = [f"'{name}'" for name in temp_list]  # adds literal quote characters to each string
    return temp_list
df = df.names.apply(add)
which gives you this output:
0 ['John', 'Joe', 'Tom']
1 ['Justin']
2 ['Ryan', 'John']
Name: names, dtype: object

How do I save result of multiple “for” loops into a dataframe?

How can I add the outputs of different for loops into one dataframe? For example, I have scraped data from a website and have lists of names, emails and phone numbers, each collected with a loop. I want to add all outputs into a table in a single dataframe.
I am able to do it for one single loop but not for multiple loops.
Please look at the code and output in attached images.
If I remove zip from the for loop, it gives the error "too many values to unpack".
Loop
phone = soup.find_all(class_ = "directory_item_phone directory_item_info_item")
for phn in phone:
    print(phn.text.strip())
## Output - List of Numbers
Code for df
df = list()
for name, mail, phn in zip(faculty_name, email, phone):
    df.append(name.text.strip())
    df.append(mail.text.strip())
    df.append(phn.text.strip())
df = pd.DataFrame(df)
df

An efficient way to create a pandas.DataFrame is to first create a dict and then convert it into a DataFrame.
In your case you could probably do:
import pandas as pd
D = {'name': [], 'mail': [], 'phone': []}
for name, mail, phn in zip(faculty_name, email, phone):
    D['name'].append(name.text.strip())
    D['mail'].append(mail.text.strip())
    D['phone'].append(phn.text.strip())
df = pd.DataFrame(D)
Another way, with a lambda function:
import pandas as pd
text_strip = lambda s: s.text.strip()
D = {
    'name': list(map(text_strip, faculty_name)),
    'mail': list(map(text_strip, email)),
    'phone': list(map(text_strip, phone))
}
df = pd.DataFrame(D)
If the lists don't all have the same length, you may try this (but I am not sure it is very efficient):
import pandas as pd
columns_names = ['name', 'mail', 'phone']
all_lists = [faculty_name, email, phone]
max_length = max(map(len, all_lists))
# pre-fill every column with None so shorter lists are padded at the end
D = {c_name: [None]*max_length for c_name in columns_names}
for c_name, l in zip(columns_names, all_lists):
    for ind, element in enumerate(l):
        D[c_name][ind] = element
df = pd.DataFrame(D)
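For lists of unequal length, the standard library's itertools.zip_longest offers a shorter route to the same padding. This is a sketch under stated assumptions: the three lists below are hypothetical stand-ins for the already stripped text values, not real scraped data.

```python
from itertools import zip_longest

import pandas as pd

# Hypothetical stand-ins for the stripped text of the scraped elements.
names = ['Ann', 'Bob', 'Cid']
mails = ['ann@example.com', 'bob@example.com']  # one entry short
phones = ['111', '222', '333']

# zip_longest pads the shorter lists with None so every row has three values.
rows = list(zip_longest(names, mails, phones, fillvalue=None))
df = pd.DataFrame(rows, columns=['name', 'mail', 'phone'])
print(df)
```

Values are still paired purely by position, so a missing entry in the middle of a list would shift every later value in that column up by one row.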

Try this,
data = {'name':[name.text.strip() for name in faculty_name],
'mail':[mail.text.strip() for mail in email],
'phn':[phn.text.strip() for phn in phone],}
df = pd.DataFrame.from_dict(data)
