I have a DataFrame and I'd like only specific parts of the strings to be made uppercase, with underscores added around them.
| TYPE               | NAME  |
|--------------------|-------|
| Contract Employee  | John  |
| Full Time Employee | Carol |
| Temporary Employee | Kyle  |
I'd like the words "Contract" and "Temporary" made uppercase, with an underscore before and after the word, like this:
| TYPE                 | NAME  |
|----------------------|-------|
| _CONTRACT_ Employee  | John  |
| Full Time Employee   | Carol |
| _TEMPORARY_ Employee | Kyle  |
I tried using str.upper(), but that made the entire cell uppercase; I only want those specific words changed.
EDIT: I should mention that sometimes the words are not capitalized, if that matters. Often a value will appear as temporary employee instead of Temporary Employee.
Here is one option using re.sub with a replacement function:

import re

def type_to_upper(match):
    return match.group(1).upper()

text = "Contract Employee"
output = re.sub(r'\b(Contract|Temporary)\b', type_to_upper, text)
# output: 'CONTRACT Employee'
EDIT:
This is the same approach applied within pandas, which also addresses the latest edit about the words sometimes being lowercase:
test dataframe:
TYPE NAME
0 Contract Employee John
1 Full Time Employee Carol
2 Temporary Employee Kyle
3 contract employee John
4 Full Time employee Carol
5 temporary employee Kyle
solution:
def type_to_upper(match):
    return '_{}_'.format(match.group(1).upper())

df.TYPE = df.TYPE.str.replace(r'\b([Cc]ontract|[Tt]emporary)\b', type_to_upper, regex=True)
result:
df
TYPE NAME
0 _CONTRACT_ Employee John
1 Full Time Employee Carol
2 _TEMPORARY_ Employee Kyle
3 _CONTRACT_ employee John
4 Full Time employee Carol
5 _TEMPORARY_ employee Kyle
Note that this only addresses exactly the two capitalizations defined in the OP's request. For complete case insensitivity it's even simpler:
df.TYPE = df.TYPE.str.replace(r'\b(contract|temporary)\b', type_to_upper, case=False, regex=True)
Something that modifies the DataFrame (without regex or anything):

words = ['Contract', 'Temporary']
df['TYPE'] = df['TYPE'].apply(lambda x: ' '.join(['_' + w.upper() + '_' if w in words else w for w in x.split()]))

This splits each value, uppercases and wraps the matching words, and joins the pieces back together, all inside an apply. (Note that this matches the words case-sensitively.)
And now print(df) gives:
TYPE NAME
0 _CONTRACT_ Employee John
1 Full Time Employee Carol
2 _TEMPORARY_ Employee Kyle
This is a simple and easy way, using replace with a dictionary.
Please refer to the pandas docs for Series.replace.
df["TYPE"] = df["TYPE"].replace({'Contract': '_CONTRACT_', 'Temporary': '_Temporary_'}, regex=True)
Just reproduced:
>>> df
TYPE Name
0 Contract Employee John
1 Full Time Employee Carol
2 Temporary Employee Kyle
>>> df["TYPE"] = df["TYPE"].replace({'Contract': '_CONTRACT_', 'Temporary': '_TEMPORARY_'}, regex=True)
>>> df
TYPE Name
0 _CONTRACT_ Employee John
1 Full Time Employee Carol
2 _TEMPORARY_ Employee Kyle
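One caveat (not from the original answer): replace with regex=True and plain string patterns does substring replacement, so a value like Contracted Employee would become _CONTRACT_ed Employee. If that is a concern, word boundaries can be added to the same dict-based replace:

df["TYPE"] = df["TYPE"].replace({r'\bContract\b': '_CONTRACT_', r'\bTemporary\b': '_TEMPORARY_'}, regex=True)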
U9 beat me to it, using lambda and split() on the input:
def match_and_upper(match):
    matches = ["Contract", "Temporary"]
    if match in matches:
        return match.upper()
    return match

input = "Contract Employee"
output = " ".join(map(lambda x: match_and_upper(x), input.split()))
# Result: CONTRACT Employee
Answering part of my own question here. Using the regex @Tim Biegeleisen provided, I did a string replace on the column.
df["TYPE"] = df["TYPE"].str.replace(r'\b(Contract)\b', '_CONTRACT_')
I have a dataframe with a single column which contains the names of people as shown below.
name
--------------
john doe
john david doe
doe henry john
john henry
I want to count the number of times each pair of words appears together in a name, regardless of the order. In this example, the words john and doe appear together in three names: john doe, john david doe, and doe henry john.
Expected output
name1 | name2 | count
----------------------
david | doe | 1
doe | henry | 1
doe | john | 3
henry | john | 2
Notice that name1 is the word that comes first in alphabetical order. Currently I have a brute-force solution:
Create a list of all unique words in the dataframe.
For each unique word W in this list, filter the records in the original dataframe which contain W.
From the filtered records, count the frequency of the other words. This gives the number of times W appears with each other word.
Question: This works fine for a small number of records, but it is not efficient for a large number of records, since it runs in quadratic time. How can I generate the output faster? Is there a function or package that can produce these counts?
Note: I tried n-gram extraction from NLP packages, but this overestimates the counts because it internally appends all the names into one long string. The last word of one name and the first word of the next then show up as a word sequence in the appended string, which inflates the counts.
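For reference, a minimal sketch of the brute-force approach described above (the sample data and helper names are mine, not from the original post):

import pandas as pd
from collections import Counter

df = pd.DataFrame({'name': ['john doe', 'john david doe', 'doe henry john', 'john henry']})

# for each unique word, scan all names containing it and count co-occurring words
pair_counts = Counter()
unique_words = {word for name in df['name'] for word in name.split()}
for w in unique_words:
    for name in df['name']:
        words = name.split()
        if w in words:
            for other in words:
                if other > w:  # count each unordered pair once, alphabetically
                    pair_counts[(w, other)] += 1

print(pair_counts[('doe', 'john')])  # 3

This scans every name for every unique word, which is the quadratic behaviour in question.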
Here's a solution which is still quadratic, but over a smaller n, and which hides most of the work in compiled code (so it should execute faster):

from collections import Counter
from itertools import combinations

import pandas as pd

combs = df['name'].apply(lambda x: list(combinations(sorted(x.split()), 2)))
counts = Counter(combs.explode())
res = pd.Series(counts).rename_axis(['name1', 'name2']).reset_index(name='count')
Output for your sample data:
name1 name2 count
0 doe john 3
1 david doe 1
2 david john 1
3 doe henry 1
4 henry john 2
I suggest the below:

import itertools
import pandas as pd

# 1) Create all 2-word combinations from the words in the names using itertools.
# (Pairs that never co-occur in a name will simply get a count of 0.)
words = [w for name in df.name.tolist() for w in name.split()]
pairs = {tuple(sorted(p)) for p in itertools.combinations(words, 2) if p[0] != p[1]}
df_counts = pd.DataFrame(sorted(pairs), columns=["name1", "name2"])

# 2) Create a new column by iterating through the rows, checking whether both
# words appear in each name's word list, and summing the boolean outputs.
df_counts["counts"] = df_counts.apply(
    lambda row: sum(row["name1"] in name.split() and row["name2"] in name.split()
                    for name in df.name.tolist()),
    axis=1)
I hope this helps.
This problem is an instance of 'Association Rule Mining' with just 2 items per transaction. There are simple algorithms like 'Apriori' as well as efficient ones like 'FP-Growth', and you can find a lot of resources to learn about them.
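As an illustration of that framing, a sketch using the third-party mlxtend library (an assumption on my part; its API may differ across versions, and the sample names are from the question):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# each name is a 'transaction' whose items are its words
transactions = [n.split() for n in ['john doe', 'john david doe', 'doe henry john', 'john henry']]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# keep itemsets of exactly two words; support * n_transactions recovers the raw count
itemsets = apriori(onehot, min_support=0.25, use_colnames=True)
pairs = itemsets[itemsets['itemsets'].apply(len) == 2].copy()
pairs['count'] = (pairs['support'] * len(transactions)).round().astype(int)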
I have a csv file as shown below. The Name column has names separated by commas; I want to split them on the commas and append them to new columns in the same csv, similar to Text to Columns in Excel. The problem is that rows have varying numbers of names.
| Address | Name                |
|---------|---------------------|
| 1st st  | John, Smith         |
| 2nd st. | Andrew, Jane, Aaron |
My pandas code looks something like this:
df1 = pd.read_csv('sample.csv')
df1['Name'] = df1['Name'].str.split(',', expand=True)
df1.to_csv('results.csv',index=None)
Of course this doesn't work, because "Columns must be same length as key". The expected output is:
| Address | Name   |       |       |
|---------|--------|-------|-------|
| 1st st  | John   | Smith |       |
| 2nd st. | Andrew | Jane  | Aaron |
Count the max number of commas, then assign to new columns accordingly:

max_commas = df['name'].str.split(',').str.len().max()
df[[f'name_{x}' for x in range(max_commas)]] = df['name'].str.split(',', expand=True)
input df:
col name
0 1st st john, smith
1 2nd st andrew, jane, aron
2 3rd st harry, philip, anna, james
output:
col name name_0 name_1 name_2 name_3
0 1st st john, smith john smith None None
1 2nd st andrew, jane, aron andrew jane aron None
2 3rd st harry, philip, anna, james harry philip anna james
You can do
out = df.join(df['Name'].str.split(', ',expand=True).add_prefix('name_'))
I want to insert several different values into just one cell.
E.g.
Friends' names
ID | Grade | Names
----+--------------+----------------------------
1 | elementary | Kai, Matthew, Grace
2 | guidance | Eli, Zoey, David, Nora, William
3 | High school | Emma, James, Levi, Sophia
Or as a list or dictionary:
ID | Grade | Names
----+--------------+------------------------------
1 | elementary | [Kai, Matthew, Grace]
2 | guidance | [Eli, Zoey, David, Nora, William]
3 | High school | [Emma, James, Levi, Sophia]
or
ID | Grade | Names
----+--------------+---------------------------------------------
1 | elementary | { a:Kai, b:Matthew, c:Grace}
2 | guidance | { a:Eli, b:Zoey, c:David, d:Nora, e:William}
3 | High school | { a:Emma, b:James, c:Levi, d:Sophia}
Is there a way?
Yes, there is a way, but that doesn't mean you should do it this way.
You could, for example, save your values as a JSON string and store that string in the column. If you later want to add a value, you can simply parse the JSON, add the value, and put it back into the database. (It might also work with a BLOB, but I'm not sure.)
However, I would not recommend saving a list inside a column, as SQL is not meant to be used like that.
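A minimal sketch of the JSON idea, using Python's built-in sqlite3 and json modules (the table layout and the added name are made up for illustration):

import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE grades (id INTEGER PRIMARY KEY, grade TEXT, names TEXT)")
conn.execute("INSERT INTO grades VALUES (1, 'elementary', ?)",
             (json.dumps(['Kai', 'Matthew', 'Grace']),))

# to add a value later: read, parse, append, write back
names = json.loads(conn.execute("SELECT names FROM grades WHERE id = 1").fetchone()[0])
names.append('Liam')  # hypothetical new name
conn.execute("UPDATE grades SET names = ? WHERE id = 1", (json.dumps(names),))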
What I would recommend is to have one table for the grades, each with its own primary key, like this:
ID | Grade
----+-------------
 1 | Elementary
 2 | Guidance
 3 | High school
And then another table containing all the names, with its own primary key and the GradeID as a foreign key. E.g.:
ID | GradeID | Name
----+---------+---------
 1 | 1       | Kai
 2 | 1       | Matthew
 3 | 1       | Grace
 4 | 2       | Eli
 5 | 2       | Zoey
 6 | 2       | David
 7 | 2       | Nora
 8 | 2       | William
 9 | 3       | Emma
10 | 3       | James
11 | 3       | Levi
12 | 3       | Sophia
If you want to know more about this, you should read about Normalization in SQL.
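To make the recommended design concrete, here is a minimal sketch in Python's built-in sqlite3 (the table and column names follow the example above, but this is only one possible layout):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE grade (
        id    INTEGER PRIMARY KEY,
        grade TEXT NOT NULL
    );
    CREATE TABLE name (
        id       INTEGER PRIMARY KEY,
        grade_id INTEGER NOT NULL REFERENCES grade(id),
        name     TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO grade VALUES (1, 'Elementary'), (2, 'Guidance'), (3, 'High school')")
conn.executemany("INSERT INTO name (grade_id, name) VALUES (?, ?)",
                 [(1, 'Kai'), (1, 'Matthew'), (1, 'Grace'), (2, 'Eli')])

# all names in a given grade
print(conn.execute(
    "SELECT n.name FROM name n JOIN grade g ON n.grade_id = g.id WHERE g.grade = 'Elementary'"
).fetchall())  # [('Kai',), ('Matthew',), ('Grace',)]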
I have two df's, one for user names and another for real names. I'd like to know how I can check whether a user name in my first df contains a real name from the other df, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545', since these names appear in my df2. I tried using str.contains, but with it I can only verify whether a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!
You can use loc[row, column] (see the documentation for the loc method) together with the Series.str.contains method to select the usernames you need to replace with the real names. In my opinion, this solution is clear in terms of readability.
for real_name in df2['realName'].to_list():
    df1.loc[df1['userName'].str.contains(real_name), 'userName'] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
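One caveat with this loop: str.contains does substring matching, so a real name like joe would also match a username like joelle. A vectorized sketch of the same idea, building a single alternation pattern from df2 (it still matches substrings, and if one real name is a prefix of another the pattern order matters):

pattern = '(' + '|'.join(df2['realName']) + ')'
extracted = df1['userName'].str.extract(pattern, expand=False)
df1['userName'] = extracted.fillna(df1['userName'])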
Say I have a pandas DataFrame like so:
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
Name
0 John Doe
1 Jane Smith
2 John Doe
3 Jane Smith
4 Jack Dawson
5 John Doe
And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:
df:
Name UUID
0 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
1 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
2 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
3 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
4 Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
5 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
The UUIDs should be generated by the uuid.uuid4() function.
My current idea is to use a groupby("Name").cumcount() to identify which rows have the same name and which are different. Then I'd create a dictionary with a key of the cumcount and a value of the uuid and use that to add the uuids to the DF.
While that would work, I'm wondering if there's a more efficient way to do this?
Grouping the data frame and applying uuid.uuid4 will be more efficient than looping through the groups. Since you want to keep the original shape of your data frame, you should use the pandas function transform.
Using your sample data frame, we'll add a column in order to have a series to apply transform to. Since uuid.uuid4 doesn't take any arguments, it really doesn't matter what the column contains.
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1
Now to use transform:
import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())
+----+--------------+--------------------------------------+
| | Name | UUID |
+----+--------------+--------------------------------------+
| 0 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 1 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 3 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4 | Jack Dawson | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+
uuid.uuid4 will be called as many times as there are distinct groups.
How about this:
names = df['Name'].unique()
for name in names:
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
You could shorten it to:
for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
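A loop-free variant of the same idea (a sketch): generate one UUID per unique name with a dict comprehension, then map it onto the column.

import uuid

uuid_map = {name: uuid.uuid4() for name in df['Name'].unique()}
df['UUID'] = df['Name'].map(uuid_map)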