Values in my DataFrame look like this:
id val
big_val_167 80
renv_100 100
color_100 200
color_60/write_10 200
I want to remove everything in the id column values starting from the first _ that is followed by a number. So the desired result must look like:
id val
big_val 80
renv 100
color 200
color 200
How can I do that? I know that str.replace() can be used, but I don't understand how to write the regular expression part of it.
You can use a regex (re.search) to find the first occurrence of _ followed by a digit, and then slice the string up to that position.
Code:
import re
import pandas as pd
def fix_id(id_):
    # Find the first occurrence of "_" followed by a digit in the id
    digit_search = re.search(r"_\d", id_)
    # If there is no such pattern, keep the id unchanged
    return id_[:digit_search.start()] if digit_search else id_
# Your df
df = pd.DataFrame({"id": ["big_val_167", "renv_100", "color_100", "color_60/write_10"],
"val": [80, 100, 200, 200]})
df["id"] = df["id"].apply(fix_id)
print(df)
Output:
id val
0 big_val 80
1 renv 100
2 color 200
3 color 200
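Since you asked about str.replace() specifically: the same result can be had with a vectorized regex replace (a minimal sketch, using the same df as above). The pattern removes everything from the first _ that is followed by a digit to the end of the string:
df["id"] = df["id"].str.replace(r"_\d.*", "", regex=True)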
I have a dataframe as shown below. It has 3 columns named "TTN_163_2.5_-40", "TTN_163_2.7_-40" and "TTN_163_3.6_-40".
I need to select the columns whose name contains '2.5', '3.6' or '2.7' (one at a time).
I also have some column names which contain 1.6, 1.62 and 1.656, and I need to select these separately. When I write df_psrr_funct_1V6.filter(regex='1\.6|^xvalues$') I get all the columns corresponding to 1.6, 1.62 and 1.656, which I don't want. May I know how to select them uniquely?
I used this method (df_psrr_funct = df_psrr_funct.filter(regex='2.5')) but it is not capturing the 1st column (xvalues).
Sample dataframe
xvalues TTN_163_2.5_-40 TTN_163_2.7_-40 TTN_163_3.6_-40
23.0279 -58.7591 -58.5892 -60.0966
30.5284 -58.6903 -57.3153 -59.9111
May I know how to do this?
Expand the regex with | for "or". ^ matches the start of the string and $ the end, so ^xvalues$ extracts the column named exactly xvalues and avoids columns whose names only contain it as a substring, like xvalues 1 or aaa xvalues:
df_psrr_funct = df_psrr_funct.filter(regex=r'2\.5|^xvalues$')
print (df_psrr_funct)
xvalues TTN_163_2.5_-40
0 23.0279 -58.7591
1 30.5284 -58.6903
EDIT: If you need to match the value between underscores exactly, use:
print (df_psrr_funct)
xvalues TTN_163_1.6_-40 TTN_163_1.62_-40 TTN_163_1.656_-40
0 23.0279 -58.7591 -58.5892 -60.0966
1 30.5284 -58.6903 -57.3153 -59.9111
df_psrr_funct = df_psrr_funct.filter(regex=r'_1\.6_|^xvalues$')
print (df_psrr_funct)
xvalues TTN_163_1.6_-40
0 23.0279 -58.7591
1 30.5284 -58.6903
Another approach:
df_psrr_funct.filter(regex=r'^\D+$|2\.5')
xvalues TTN_163_2.5_-40
0 23.0279 -58.7591
1 30.5284 -58.6903
using regex for this doesn't make much sense... just do
columns_with_2point5 = [c for c in df.columns if "2.5" in c]
only_cool_cols = df[['xvalues'] + columns_with_2point5]
don't overcomplicate it...
if you don't need the first column you can just use filter with like instead of one of the regex solutions (see the first comment from @BeRT2me)
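For completeness, a minimal sketch of that like-based variant (it assumes the sample column names shown above):
import pandas as pd

df_psrr_funct = pd.DataFrame({
    "xvalues": [23.0279, 30.5284],
    "TTN_163_2.5_-40": [-58.7591, -58.6903],
    "TTN_163_2.7_-40": [-58.5892, -57.3153],
    "TTN_163_3.6_-40": [-60.0966, -59.9111]})

# `like` does a plain substring match on column names, so no escaping is needed
selected = df_psrr_funct.filter(like="2.5")
# join xvalues back in if you still want it as the first column
result = df_psrr_funct[["xvalues"]].join(selected)
print(result)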
I'm really amateur-level with both python and pandas, but I'm trying to solve an issue for work that's stumping me.
I have two dataframes, let's call them dfA and dfB:
dfA:
project_id Category Initiative
10
20
30
40
dfB:
project_id field_id value
10 100 lorem
10 200 lorem1
10 300 lorem2
20 200 ipsum
20 300 ipsum1
20 500 ipsum2
Let's say I know "Category" from dfA correlates to field_id "100" from dfB, and "Initiative" correlates to field_id "200".
I need to look through dfB and for a given project_id/field_id combination, take the corresponding value in the "value" column and place it in the correct cell in dfA.
The result would look like this:
dfA:
project_id Category Initiative
10 lorem lorem1
20 ipsum
30
40
Bonus difficulty: not every project in dfA exists in dfB, and not every field_id is used in every project_id.
I hope I've explained this well enough; I feel like there must be a relatively simple way to handle this that I'm missing.
You could do something like this, although it's not very elegant; there must be a better way. I had to use try/except for the cases where the project_id is not available in dfB. I put NaN values for the missing ones, but you can easily use empty strings instead.
import numpy as np

def get_value(row):
    try:
        res = dfB[(dfB['field_id'] == 100) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
    except IndexError:
        res = np.nan
    row['Category'] = res
    try:
        res = dfB[(dfB['field_id'] == 200) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
    except IndexError:
        res = np.nan
    row['Initiative'] = res
    return row

dfA = dfA.apply(get_value, axis=1)
EDIT: as mentioned in the comments, this is not very flexible since some values are hardcoded, but you can easily change that with something like the code below. This way, if a field_id changes or you need to add/remove a column, you only update the dictionary.
columns_fields = {"Category": 100, "Initiative": 200}

def get_value(row):
    for key, value in columns_fields.items():
        try:
            res = dfB[(dfB['field_id'] == value) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
        except IndexError:
            res = np.nan
        row[key] = res
    return row

dfA = dfA.apply(get_value, axis=1)
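A sketch of a more vectorized alternative (my own suggestion, not part of the answer above): pivot dfB so that each field_id becomes its own column, then left-merge onto dfA. Projects missing from dfB, or field_ids missing for a project, simply come out as NaN:
import pandas as pd

# Sample frames laid out as in the question
dfA = pd.DataFrame({"project_id": [10, 20, 30, 40]})
dfB = pd.DataFrame({"project_id": [10, 10, 10, 20, 20, 20],
                    "field_id": [100, 200, 300, 200, 300, 500],
                    "value": ["lorem", "lorem1", "lorem2", "ipsum", "ipsum1", "ipsum2"]})

# One column per field_id (assumes each project_id/field_id pair appears at most once)
wide = dfB.pivot(index="project_id", columns="field_id", values="value")

# Keep only the fields of interest and rename them to the dfA labels
wide = wide.reindex(columns=[100, 200]).rename(columns={100: "Category", 200: "Initiative"})

dfA = dfA.merge(wide.reset_index(), on="project_id", how="left")
print(dfA)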
Let us say I have the following simple data frame. But in reality, I have hundreds of thousands of rows like this.
df
ID Sales
倀굖곾ꆹ譋῾理 100
倀굖곾ꆹ 50
倀굖곾ꆹ譋῾理 70
곾ꆹ텊躥㫆 60
My idea is to replace the Chinese-character IDs with randomly generated 8-digit numbers, so it looks something like below.
ID Sales
13434535 100
67894335 50
13434535 70
10986467 60
The numbers are randomly generated, but they should also preserve uniqueness: for example, rows 0 and 2 have the same original ID, so they must get the same replacement ID.
Can anyone help with this in Python/pandas? Any existing solution is also welcome.
The primary method here will be to use Series.map() on the 'ID' column to assign the new values. From the docs, map() is:
"Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series."
which is exactly what you're looking for.
Here are some options for generating the new IDs:
1. Randomly generated 8-digit integers, as asked
You can first create a mapping from each unique ID in the dataframe to a randomly generated 8-digit integer, then use Series.map() on the 'ID' column to assign the new values back. I've included a while loop to ensure that the generated IDs are unique.
import random

original_ids = df['ID'].unique()

while True:
    new_ids = {id_: random.randint(10_000_000, 99_999_999) for id_ in original_ids}
    if len(set(new_ids.values())) == len(original_ids):
        # all the generated IDs were unique
        break
    # otherwise this repeats until they are

df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 91154173 100
1 27127403 50
2 91154173 70
3 55892778 60
Edit & warning: the original IDs are Chinese characters, and they are already length 8. Since there are far more than 10 possible Chinese characters, the space of original IDs can be much larger than the 90 million available 8-digit numbers, so with enough distinct original IDs it could become impossible to assign a unique 8-digit number to each. Unless you are memory bound, I'd recommend using 16-24 digits. Or even better...
2. Use UUIDs. [IDEAL]
You can still use the "integer" version of the ID instead of hex. This has the added benefit of not needing to check for uniqueness:
import uuid
original_ids = df['ID'].unique()
new_ids = {cid: uuid.uuid4().int for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
(If you are okay with hex id's, change uuid.uuid4().int above to uuid.uuid4().hex.)
Output:
ID Sales
0 10302456644733067873760508402841674050 100
1 99013251285361656191123600060539725783 50
2 10302456644733067873760508402841674050 70
3 112767087159616563475161054356643068804 60
2.B. Smaller numbers from UUIDs
If the ID generated above is too long, you could truncate it, with some minor risk. Here, I'm only using the first 16 hex characters and converting those to an int. You may put that in the uniqueness loop check as done for option 1, above.
import uuid
original_ids = df['ID'].unique()
DIGITS = 16 # number of hex digits of the UUID to use
new_ids = {cid: int(uuid.uuid4().hex[:DIGITS], base=16) for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 14173925717660158959 100
1 10599965012234224109 50
2 14173925717660158959 70
3 13414338319624454663 60
3. Creating a mapping based on the actual value
This group of options has these advantages:
- no uniqueness check is needed, since the new ID is derived deterministically from the original ID, so original IDs which were the same will generate the same new ID
- no mapping dict needs to be created in advance
3.A. CRC32
(Higher probability of finding a collision with different IDs, compared to option 2.B. above.)
import zlib
df['ID'] = df['ID'].map(lambda cid: zlib.crc32(bytes(cid, 'utf-8')))
Output:
ID Sales
0 2083453980 100
1 1445801542 50
2 2083453980 70
3 708870156 60
3.B. Python's built-in hash() of the original ID [My preferred approach in this scenario]
- Can be done in one line, no imports needed
- Reasonably unlikely to produce collisions for IDs which are different
(Note that Python salts string hashes per process, so the generated IDs will differ between runs.)
df['ID'] = df['ID'].map(hash)
Output:
ID Sales
0 4663892623205934004 100
1 1324266143210735079 50
2 4663892623205934004 70
3 6251873913398988390 60
3.C. MD5Sum, or anything from hashlib
Since the IDs are expected to be short (8 chars) and there are not huge numbers of them, even with a truncated MD5 digest the probability of a collision is very low.
import hashlib
DIGITS = 16 # number of hex digits of the hash to use
df['ID'] = df['ID'].str.encode('utf-8').map(lambda x: int(hashlib.md5(x).hexdigest()[:DIGITS], base=16))
Output:
ID Sales
0 17469287633857111608 100
1 4297816388092454656 50
2 17469287633857111608 70
3 11434864915351595420 60
I'm not very expert in pandas, so here is a solution using NumPy + pandas. Because it uses fast NumPy operations, it will be much faster than a pure-Python solution, especially if you have thousands of rows.
import pandas as pd, numpy as np

df = pd.DataFrame([
    ['倀굖곾ꆹ譋῾理', 100],
    ['倀굖곾ꆹ', 50],
    ['倀굖곾ꆹ譋῾理', 70],
    ['곾ꆹ텊躥㫆', 60],
], columns = ['ID', 'Sales'])

# unique original ids and, for each row, the index of its unique id
u, iv = np.unique(df.ID.values, return_inverse = True)

while True:
    # draw one candidate 8-digit id per unique original id
    ids = np.random.randint(10 ** 7, 10 ** 8, u.size)
    # accept only if all candidates are distinct
    if np.all(np.unique(ids, return_counts = True)[1] <= 1):
        break

df.ID = ids[iv]
print(df)
Output:
ID Sales
0 31043191 100
1 36168634 50
2 31043191 70
3 17162753 60
Given a dataframe df, create a list of the ids:
id_list = list(df.ID)
Then import the random package
from random import randint
from collections import deque
def idSetToNumber(id_list):
    # one unique random 8-digit number per unique id, in a stable order
    id_set = deque(dict.fromkeys(id_list))
    checked_numbers = []
    while len(id_set) > 0:
        # draw a candidate 8-digit number
        candidate = randint(10000000, 99999999)
        # only keep it if it has not been used yet
        if candidate not in checked_numbers:
            checked_numbers.append(candidate)
            id_set.popleft()
    return checked_numbers
This gives one unique 8-digit number for each of your unique keys, in first-appearance order.
Then create a dictionary
checked_numbers = idSetToNumber(id_list)
unique_ids = list(dict.fromkeys(id_list))  # same order as inside idSetToNumber
name2id = {}
for i in range(len(checked_numbers)):
    name2id[unique_ids[i]] = checked_numbers[i]
Last step, replace all the pandas ID fields with the ones in the dictionary.
for i in range(df.shape[0]):
    df.loc[i, 'ID'] = str(name2id[df.loc[i, 'ID']])
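As a side note (my own addition), the same name2id dictionary can also be applied in one vectorized step instead of the row loop:
df['ID'] = df['ID'].map(name2id).astype(str)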
I would:
- identify the unique ID values
- build (from np.random) an array of unique values of the same size
- build a transformation dataframe with that array
- use merge to replace the original ID values
Possible code:
import numpy as np

trans = df[['ID']].drop_duplicates()  # unique ID values
n = len(trans)

# np.random.seed(0)  # uncomment for reproducible pseudo-random sequences
while True:
    # build a larger array to have a higher chance of getting enough unique values
    arr = np.unique(np.random.randint(10000000, 100000000, n + n // 2))
    if len(arr) >= n:
        arr = arr[:n]  # ok, keep only the required number
        break

trans['new'] = arr  # ok, we have our transformation table
df['ID'] = df.merge(trans, how='left', on='ID')['new']  # done...
With your sample data (and with np.random.seed(0)), it gives:
ID Sales
0 12215104 100
1 48712131 50
2 12215104 70
3 70969723 60
Per @Arty's comment, np.unique returns an ascending sequence. If you do not want that, shuffle it before using it for the transformation table:
...
np.random.shuffle(arr)
trans['new'] = arr
...
I have 2 different DataFrames for which I am trying to match string columns (names).
Below are just small samples of the DFs:
df1 (127000,3)
Code Name PostalCode
150 Maarc 47111
250 Kirc 41111
170 Moic 42111
140 Nirc 44111
550 Lacter 47111
df2 (38000,3)
Code NAME POSTAL_CODE
150 Marc 47111
250 Kikc 41111
170 Mosc 49111
140 NiKc 44111
550 Lacter 47111
The aim is to create another DF3 as shown below
Code NAME Best Match Score
150 Marc Maarc 0.9
250 Karc Kirc 0.9
The following code gives the expected output
import difflib
from functools import partial
f = partial(difflib.get_close_matches, possibilities= df1['Name'].tolist(), n=1)
matches = df2['NAME'].map(f).str[0].fillna('')
scores = [difflib.SequenceMatcher(None, x, y).ratio()
for x, y in zip(matches, df2['NAME'])]
df3 = df2.assign(best=matches, score=scores)
df3.sort_values(by='score')
The problem
Matching those strings for only 2 rows takes around 30 sec. This task has to be done for 1K rows, which will take hours!
The Question
How can I speed up the code?
I was thinking about something like fetchall?
EDIT
Even the fuzzywuzzy library has been tried; it takes even longer than difflib, with the following code:
from fuzzywuzzy import fuzz

def get_fuzz(df, w):
    s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    idx = s.idxmax()
    return {'Name': df['Name'].iloc[idx], 'CODE': df['Code'].iloc[idx], 'Value': s.max()}

df2['NAME'].apply(lambda x: get_fuzz(df1, x))
df2 = df2.assign(search=df2['NAME'].apply(lambda x: get_fuzz(df1, x)))
So I was able to speed up the matching step by using the postal code column as a discriminant, going from 1h40 to 7 min of computation.
Below is the code that matches the Name column and retrieves the name with the best score, using the same sample df1 and df2 shown in the question:
%%time
import difflib
import numpy as np
from functools import partial

def difflib_match(df1, df2):
    # Initialise the result columns with NaN
    df2['best'] = np.nan
    df2['score'] = np.nan
    # Unique postal codes present in df2
    postal_codes = df2['POSTAL_CODE'].unique()
    # Loop over each postal code and only compare names that share it in both DFs
    for m, code in enumerate(postal_codes):
        # Print progress every 100 postal codes processed
        if m % 100 == 0:
            print(m, 'of', len(postal_codes))
        df1_code = df1[df1['PostalCode'] == code]
        df2_code = df2[df2['POSTAL_CODE'] == code]
        # Restrict the candidate names to this postal code
        f = partial(difflib.get_close_matches, possibilities=df1_code['Name'].tolist(), n=1)
        # Best match for each df2 name within this postal code
        matches = df2_code['NAME'].map(f).str[0].fillna('')
        # Retrieve the score for each match
        scores = [difflib.SequenceMatcher(None, x, y).ratio()
                  for x, y in zip(matches, df2_code['NAME'])]
        # Assign the results back to df2
        for i, name in enumerate(df2_code['NAME']):
            df2['best'].where(df2['NAME'] != name, matches.iloc[i], inplace=True)
            df2['score'].where(df2['NAME'] != name, scores[i], inplace=True)
    return df2

# Apply function
df_diff = difflib_match(df1, df2)

# Display DF
print('Shape: ', df_diff.shape)
df_diff.head()
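For reference, here is a more compact sketch of the same postal-code blocking idea using groupby; this is my own variation, and it assumes the column names from the samples above:
import difflib
import pandas as pd

def match_block(block, candidates_by_code):
    # block.name is the POSTAL_CODE of this group
    candidates = candidates_by_code.get(block.name, [])
    best, score = [], []
    for name in block['NAME']:
        m = difflib.get_close_matches(name, candidates, n=1)
        best.append(m[0] if m else '')
        score.append(difflib.SequenceMatcher(None, best[-1], name).ratio())
    return block.assign(best=best, score=score)

# candidate df1 names grouped by postal code
candidates_by_code = df1.groupby('PostalCode')['Name'].apply(list).to_dict()

df3 = (df2.groupby('POSTAL_CODE', group_keys=False)
          .apply(match_block, candidates_by_code=candidates_by_code))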
The fastest way I can think of to match strings is using regex. It's a search language designed to find matches in a string.
You can see an example here:
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
# x is a re.Match object (truthy) if the pattern matches, otherwise None
*Taken from: https://www.w3schools.com/python/python_regex.asp
Since I don't know anything about DataFrames, I don't know how to implement regex in your code, but I hope the regex functions might help you.
I have a data frame with a number column, such as:
CompteNum
100
200
300
400
500
and a file with a mapping of all these numbers to other numbers, which I import into Python and convert into a dictionary:
{100: 1, 200: 2, 300: 3, 400: 4, 500: 5}
And I am creating a second column in the data frame that combines both numbers in the format df number + dict number: 100 becomes 1001, and so on...
## dictionary
accounts = pd.read_excel("mapping-accounts.xlsx")
accounts = accounts[['G/L Account #','FrMap']]
accounts = accounts.set_index('G/L Account #').to_dict()['FrMap']
## data frame --> CompteNum is the Number Column
df['CompteNum'] = df['CompteNum'].map(accounts).astype(str) + df['CompteNum'].astype(str)
The problem is that my output then is 100.01.0 instead of 1001, and that creates additional manual work in the output Excel file. I have tried:
df['CompteNum'] = df['CompteNum'].str.replace('.0', '')
but it doesn't delete ALL of the zeros, and I would want the extra ones removed as well. Any suggestions?
The problem is missing values for non-matched keys after map; a possible solution is:
print (df)
CompteNum
0 100
1 200
2 300
3 400
4 500
5 40
accounts1 = {100: 1, 200:2, 300:3, 400:4, 500:5}
s = df['CompteNum'].astype(str)
s1 = df['CompteNum'].map(accounts1).dropna().astype(int).astype(str)
df['CompteNum'] = (s + s1).fillna(s)
print (df)
CompteNum
0 1001
1 2002
2 3003
3 4004
4 5005
5 40
Your solution should be changed to a regex replace: use $ for end of string and escape the ., because . is a special regex character (it matches any character):
df['CompteNum'] = df['CompteNum'].str.replace(r'\.0$', '', regex=True)
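As a final sketch (my own variation, assuming the accounts dictionary from the question), the trailing .0 artefact can also be avoided by never going through float strings at all, using pandas' nullable Int64 dtype and only combining where a mapping exists:
import pandas as pd

df = pd.DataFrame({"CompteNum": [100, 200, 300, 400, 500, 40]})
accounts = {100: 1, 200: 2, 300: 3, 400: 4, 500: 5}

mapped = df["CompteNum"].map(accounts)  # float with NaN where there is no match
combined = df["CompteNum"].astype(str) + mapped.astype("Int64").astype(str)
# keep the original number (as a string) where there was no mapping
df["CompteNum"] = combined.mask(mapped.isna(), df["CompteNum"].astype(str))
print(df)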