I have a pandas dataframe in which one column holds strings whose words are separated by '_'. I would like to extract the last element of each string (which is a number) and put it in a new column.
I tried the following
df = pd.DataFrame({'strings':['some_string_25','a_different_one_13','and_a_last_one_40']})
df.assign(number = lambda x: x.strings.str.split('_')[0])
but it gives me this in my last column
number
some
string
25
but I would like to get this
number
25
13
40
How can I do this?
Use Series.str.split to split the strings and select the last value of each list by indexing with .str[-1], or use Series.str.extract to capture the trailing integer: (\d+) matches one or more digits and $ anchors the match to the end of the string:
df['last'] = df['strings'].str.split('_').str[-1]
df['last1'] = df['strings'].str.extract(r'(\d+)$')
print(df)
strings last last1
0 some_string_25 25 25
1 a_different_one_13 13 13
2 and_a_last_one_40 40 40
The difference between the two approaches shows up with changed data:
df = pd.DataFrame({'strings':['some_string_25','a_different_one_13','and_a_last_one_40',
'aaaa', 'sss58']})
df['last'] = df['strings'].str.split('_').str[-1]
df['last1'] = df['strings'].str.extract(r'(\d+)$')
print(df)
strings last last1
0 some_string_25 25 25
1 a_different_one_13 13 13
2 and_a_last_one_40 40 40
3 aaaa aaaa NaN
4 sss58 sss58 58
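Both approaches return strings, not numbers. If a numeric column is wanted, a minimal sketch (the column name number and the use of pd.to_numeric are my additions, not part of the answer above) could look like this:

```python
import pandas as pd

df = pd.DataFrame({'strings': ['some_string_25', 'a_different_one_13',
                               'and_a_last_one_40', 'aaaa', 'sss58']})

# expand=False returns a Series instead of a one-column DataFrame
digits = df['strings'].str.extract(r'(\d+)$', expand=False)

# errors='coerce' keeps NaN for rows that have no trailing digits
df['number'] = pd.to_numeric(digits, errors='coerce')
print(df)
```

Because of the NaN in the 'aaaa' row, the result is typically a float column rather than an integer one.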
You can do:
df['number']=df['strings'].apply(lambda row: row.split('_')[-1])
or:
df['number']=[row[-1] for row in df['strings'].str.split('_')]
Please try this
df = pd.DataFrame({'strings':['some_string_25','a_different_one_13','and_a_last_one_40']})
df['number'] = df.strings.apply(lambda x: x.split('_')[-1])
df
I have a large pandas dataset with a messy string column which contains for example:
72.1
61
25.73.20
33.12
I'd like to fill the gaps in order to match a pattern like XX.XX.XX (X are only numbers):
72.10.00
61.00.00
25.73.20
33.12.00
thank you!
How about defining base_str = '00.00.00' and then completing the string in each row with the remaining characters of base_str:
base_str = '00.00.00'
df = pd.DataFrame({'ms_str':['72.1','61','25.73.20','33.12']})
print(df)
df['ms_str'] = df['ms_str'].apply(lambda x: x+base_str[len(x):])
print(df)
Output:
ms_str
0 72.1
1 61
2 25.73.20
3 33.12
ms_str
0 72.10.00
1 61.00.00
2 25.73.20
3 33.12.00
Here is a vectorized solution that works for this particular pattern. First pad with zeros on the right to eight characters, then replace every third character with a dot:
df['col'].str.ljust(8, fillchar='0').str.replace(r'(..).', r'\1.', regex=True)
Output:
0 72.10.00
1 61.00.00
2 25.73.20
3 33.12.00
Name: col, dtype: object
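If the values might not line up with fixed character positions, another option is to pad each dot-separated part on its own. This is only a sketch: pad_code and its parts and width parameters are hypothetical names, and it assumes no part is longer than two digits.

```python
import pandas as pd

df = pd.DataFrame({'ms_str': ['72.1', '61', '25.73.20', '33.12']})

def pad_code(value, parts=3, width=2):
    """Right-pad each dot-separated part with zeros and append missing parts."""
    pieces = value.split('.')
    # add as many all-zero parts as are needed to reach `parts`
    pieces += ['0' * width] * (parts - len(pieces))
    return '.'.join(p.ljust(width, '0') for p in pieces)

df['padded'] = df['ms_str'].apply(pad_code)
print(df)
```

This is not vectorized, so on a very large column the ljust/replace version is likely to be faster.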
I have a column that looks like this:
Age
[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)
and want to remove the "[", "-" and ")". Instead of showing the range, such as 0-10, I would like to show the middle value for every row in the column.
Yet another solution:
The dataframe:
df = pd.DataFrame({'Age':['[0-10)','[10-20)','[20-30)','[30-40)','[40-50)','[50-60)','[60-70)','[70-80)']})
df
Age
0 [0-10)
1 [10-20)
2 [20-30)
3 [30-40)
4 [40-50)
5 [50-60)
6 [60-70)
7 [70-80)
The code:
df['Age'] = df.Age.str.extract(r'(\d+)-(\d+)').astype('int').mean(axis=1).astype('int')
The result:
df
Age
0 5
1 15
2 25
3 35
4 45
5 55
6 65
7 75
If you want to explode a row into multiple rows where each row carries a value from the range, you can do this:
data = '''[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)'''
df = pd.DataFrame({'Age': data.splitlines()})
df['Age'] = df['Age'].str.extract(r'\[(\d+)-(\d+)\)').astype(int).apply(lambda r: list(range(r[0], r[1])), axis=1)
df.explode('Age')
Note that I assume your Age column is string typed, so I used extract to get the boundaries of each range and converted them to a list of integers. Exploding the dataframe on the modified Age column then gives a new row for each integer in the list. Values in other columns are copied accordingly.
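To see how the other columns are copied, here is a small sketch with shorter ranges and an extra column. The group column and the two ranges are made up for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'Age': ['[0-3)', '[3-5)'], 'group': ['a', 'b']})

# columns 0 and 1 hold the lower and upper bound of each range
bounds = df['Age'].str.extract(r'\[(\d+)-(\d+)\)').astype(int)

# one list of integers per row, upper bound excluded as in '[lo-hi)'
df['Age'] = pd.Series(
    [list(range(lo, hi)) for lo, hi in zip(bounds[0], bounds[1])],
    index=df.index,
)

out = df.explode('Age').reset_index(drop=True)
print(out)
```

Each value of group is repeated once per integer of its range.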
I tried this:
import pandas as pd
import re
data = {
    'age_range': [
        '[0-10)',
        '[10-20)',
        '[20-30)',
        '[30-40)',
        '[40-50)',
        '[50-60)',
        '[60-70)',
        '[70-80)',
    ]
}
df = pd.DataFrame(data=data)
def get_middle_age(age_range):
    pattern = r'(\d+)'
    ages = re.findall(pattern, age_range)
    return int((int(ages[0])+int(ages[1]))/2)
df['age'] = df.apply(lambda row: get_middle_age(row['age_range']), axis=1)
I have a dataframe with 3 columns Replaced_ID, New_ID and Installation Date of New_ID.
Each New_ID replaces the Replaced_ID.
Replaced_ID New_ID Installation Date (of New_ID)
3 5 16/02/2018
5 7 17/05/2019
7 9 21/06/2019
9 11 23/08/2020
25 39 16/02/2017
39 41 16/08/2018
My goal is to get a dataframe which includes the first and last record of the sequence. I care only for the first Replaced_ID value and the last New_ID value.
i.e from above dataframe I want this
Replaced_ID New_ID Installation Date (of New_ID)
3 11 23/08/2020
25 41 16/08/2018
Sorting by date and shifting is not the solution here, as far as I can tell.
I also tried joining the New_ID column with Replaced_ID, but that returns only the previous step of the sequence.
I need a way to get the sequences [3,5,7,9,11] and [25,39,41] by combining the Replaced_ID and New_ID columns across all rows.
I mostly care about getting the first Replaced_ID value and the last New_ID value, not the Installation Date, because I can join that in at the end.
Any ideas here? Thanks.
First, let's create the DataFrame:
import pandas as pd
import numpy as np
from io import StringIO
data = """Replaced_ID,New_ID,Installation Date (of New_ID)
3,5,16/02/2018
5,7,17/05/2019
7,9,21/06/2019
9,11,23/08/2020
25,39,16/02/2017
39,41,16/08/2018
11,14,23/09/2020
41,42,23/10/2020
"""
### note that I've added two rows to check whether it works with non-consecutive rows
### defining some short hands
r = "Replaced_ID"
n = "New_ID"
i = "Installation Date (of New_ID)"
df = pd.read_csv(StringIO(data),header=0,parse_dates=True,sep=",")
df[i] = pd.to_datetime(df[i], dayfirst=True)  # dates are given as DD/MM/YYYY
And now for my actual solution:
a = df[[r,n]].values.flatten()
### returns a flat list of r and n values which clearly show duplicate entries, i.e.:
# [ 3 5 5 7 7 9 9 11 25 39 39 41 11 14 41 42]
### now keep only the values that occur once,
# and reshape them so that the first column gives the lowest (replaced) id
# and the second column gives the highest (new) id, i.e.:
# [[ 3 14]
# [25 42]]
# (np.unique returns sorted values, so this pairing relies on the id ranges of the chains not interleaving)
u, c = np.unique(a, return_counts=True)
res = u[c == 1].reshape(-1, 2)
### now filter the dataframe where "New_ID" is equal to the second column of res, i.e. [14,42],
# and replace the entries in "r" with the "lowest possible values" of r
# (the assignment is positional, so the filtered rows must be in the same order as res)
dfn = df[df[n].isin(res[:, 1].tolist())].copy()
dfn[r] = res[:, 0]
print(dfn)
Which yields:
Replaced_ID New_ID Installation Date (of New_ID)
6 3 14 2020-09-23
7 25 42 2020-10-23
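The counting approach depends on how the unique ids happen to sort. A sketch that follows each chain explicitly might be more robust. It assumes every id is replaced at most once and that there are no cycles; the names successor, starts and chains are mine.

```python
import pandas as pd

df = pd.DataFrame({
    'Replaced_ID': [3, 5, 7, 9, 25, 39, 11, 41],
    'New_ID':      [5, 7, 9, 11, 39, 41, 14, 42],
})

# map every replaced id to the id that replaced it
successor = dict(zip(df['Replaced_ID'], df['New_ID']))

# a chain starts at an id that is replaced but never appears as a New_ID
starts = sorted(set(successor) - set(successor.values()))

rows = []
for start in starts:
    current = start
    while current in successor:  # follow the chain to its end
        current = successor[current]
    rows.append({'Replaced_ID': start, 'New_ID': current})

chains = pd.DataFrame(rows)
print(chains)
```

The installation date can then be joined back in on New_ID, as the question suggests.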
Assuming the dates are sorted, you can create a helper series, then group by it and aggregate:
df['Installation Date (of New_ID)']=pd.to_datetime(df['Installation Date (of New_ID)'], dayfirst=True)
s = df['Replaced_ID'].ne(df['New_ID'].shift()).cumsum()
out = df.groupby(s).agg(
{"Replaced_ID":"first","New_ID":"last","Installation Date (of New_ID)":"last"}
)
print(out)
Replaced_ID New_ID Installation Date (of New_ID)
1 3 11 2020-08-23
2 25 41 2018-08-16
The helper series s differentiates the groups by comparing each Replaced_ID with the New_ID of the previous row (which is what shift() provides); where they do not match, it yields True. Series.cumsum then turns those True flags into a running count, so each chain gets its own group number:
print(s)
0 1
1 1
2 1
3 1
4 2
5 2
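To make the intermediate steps visible, the helper series can be built up column by column. This is only an illustrative sketch on the question's six rows; the steps frame and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Replaced_ID': [3, 5, 7, 9, 25, 39],
                   'New_ID': [5, 7, 9, 11, 39, 41]})

steps = pd.DataFrame({
    'Replaced_ID': df['Replaced_ID'],
    # New_ID of the previous row; NaN for the first row
    'prev_New_ID': df['New_ID'].shift(),
})
# True wherever a row does not continue the previous row's chain
steps['new_chain'] = steps['Replaced_ID'].ne(steps['prev_New_ID'])
# the running count of True values labels each chain
steps['group'] = steps['new_chain'].cumsum()
print(steps)
```

Note that this only works when the rows of one chain are adjacent and in order.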
I wonder: if you have two columns (A = 'Name', B = 'Name_Age'), is there a quick way to remove 'Name' from 'Name_Age' so that you get 'Age', like a reversed concatenation?
I've thought about splitting the string, but in some cases (when there is no separator to split on) I really need a method to remove the strings of one column from the strings of another.
#example data below:
import pandas as pd
data = {'Name':['Mark','Matt','Michael'], 'Name_Age':['Mark 14','Matt 29','Michael 18']}
df = pd.DataFrame(data)
You can try using pandas apply function, which lets you define your own function to be passed to every row of the dataframe:
def age_from_name_age(name, name_age):
    return name_age.replace(name, '').strip()
df['Age'] = df.apply(lambda x: age_from_name_age(x['Name'], x['Name_Age']),
                     axis='columns')
age_from_name_age takes two strings (a name and a name_age), and returns just the age. Then, in the apply statement, I define an anonymous lambda function that just takes in a row and passes the correct fields to age_from_name_age.
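The same idea can also be written without a row-wise apply, by zipping the two columns in a list comprehension. This is a sketch of an alternative, not a claim about which is faster on your data; passing 1 as the count to str.replace is my addition, so that only the first occurrence of the name is removed.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mark', 'Matt', 'Michael'],
                   'Name_Age': ['Mark 14', 'Matt 29', 'Michael 18']})

# pair each name with its name_age string and strip the name out
df['Age'] = [name_age.replace(name, '', 1).strip()
             for name, name_age in zip(df['Name'], df['Name_Age'])]
print(df)
```

The resulting Age values are still strings; use astype(int) if numbers are needed.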
Using string slicing:
df['Age'] = df.apply(lambda row: row['Name_Age'][len(row['Name']):], axis=1).astype(int)
You can use str.split() with a space separator to split the values into separate columns, and then rename the columns.
1) Using str.split()
>>> df['Name_Age'].str.split(" ", expand=True).rename(columns={0:'Name', 1:'Age'})
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
OR
>>> df = df['Name_Age'].str.split(" ", expand=True).rename(columns={0:'Name', 1:'Age'})
>>> df
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
OR, by converting the split lists into a new dataframe:
>>> pd.DataFrame(df.Name_Age.str.split().tolist(), columns="Name Age".split())
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
2) Another option using str.partition
>>> df['Name_Age'].str.partition(" ", True).rename(columns={0:'Name', 2:'Age'}).drop(1, axis=1)
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
3) Another option using df.assign with a lambda
Use split() with the default separator and assign the values to a new column Age.
>>> df.assign(Age = df.Name_Age.apply(lambda x: x.split()[1]))
Name Name_Age Age
0 Mark Mark 14 14
1 Matt Matt 29 29
2 Michael Michael 18 18
OR
>>> df.Name_Age.apply(lambda x: pd.Series(str(x).split())).rename({0:"Name",1:"Age"}, axis=1)
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
I have 2 dataframes and I need to add a new column to the first dataframe, using values from the second.
The first df is
ID,"url","used_at","active_seconds"
8075643aab791cec7dc9d18926958b67,"sberbank.ru/ru/person/promo/10mnl?utm_source=Vesti.ru&utm_medium=html&utm_campaign=10_million_users_SBOL_dec2015&utm_term=every14_syncbanners",2016-01-01 00:03:16,183
a04a8041ffa6fe1b85471ca5af1ee575,"online.rsb.ru/hb/faces/system/login/rslogin.jsp?credit=false",2016-01-01 00:04:36,42
a04a8041ffa6fe1b85471ca5af1ee575,"online.rsb.ru/hb/faces/system/login/sms/sms.jsp?smsAuth=true",2016-01-01 00:05:18,22
a04a8041ffa6fe1b85471ca5af1ee575,"online.rsb.ru/hb/faces/rs/RSIndex.jspx",2016-01-01 00:05:40,14
a04a8041ffa6fe1b85471ca5af1ee575,"online.rsb.ru/hb/faces/rs/payments/PaymentReq.jspx",2016-01-01 00:05:54,22
ba880911a6d54f6ea6d3145081a0e0dd,"homecredit.ru/help/quest/feedback.php",2016-01-01 00:06:12,2
Second df looks like
URL Code
citibank\.ru\/russia\/info\/rus\/contacts_form\.htm 15
citibank\.ru\/russia\/info\/rus\/contacts\.htm 15
gazprombank\.ru\/contacts\/ 15
gazprombank\.ru\/feedback\/ 15
gazprombank\.ru\/additional_office\/ 15
homecredit\.ru\/help\/quest\/feedback\.php 15
homecredit\.ru\/offices\/* 15
If I don't have a regex, I use
df1['code'] = df1.url.map(df2.set_index('URL')['Code'])
But I can't do this, because df2.URL is regex.
But
df1['code'] = df1['url'].replace(df2['URL'], df2['Code'], regex=True)
doesn't work.
As per my comment, the pandas.Series.replace() method doesn't allow Series objects as the to_replace and value arguments. Passing lists instead works:
df1['code'] = df1.url.replace(df2.URL.values, df2.Code.values, regex=True)
print(df1[['url', 'code']])
produces the following output:
url \
0 sberbank.ru/ru/person/promo/10mnl?utm_source=V...
1 online.rsb.ru/hb/faces/system/login/rslogin.js...
2 online.rsb.ru/hb/faces/system/login/sms/sms.js...
3 online.rsb.ru/hb/faces/rs/RSIndex.jspx
4 online.rsb.ru/hb/faces/rs/payments/PaymentReq....
5 homecredit.ru/help/quest/feedback.php
code
0 sberbank.ru/ru/person/promo/10mnl?utm_source=V...
1 online.rsb.ru/hb/faces/system/login/rslogin.js...
2 online.rsb.ru/hb/faces/system/login/sms/sms.js...
3 online.rsb.ru/hb/faces/rs/RSIndex.jspx
4 online.rsb.ru/hb/faces/rs/payments/PaymentReq....
5 15
In answer to your additional comments, you can't get df2.Code in df1.code in rows where df1.url doesn't match any of the regex strings, but you can provide a value (e.g. None) for these cases to be put in the column instead. This is, for example, done by adding the following line:
df1['code'] = df1.apply(lambda x: None if x.code == x.url else x.code, axis=1)
where print(df1[['url', 'code']]) returns the following:
url code
0 sberbank.ru/ru/person/promo/10mnl?utm_source=V... NaN
1 online.rsb.ru/hb/faces/system/login/rslogin.js... NaN
2 online.rsb.ru/hb/faces/system/login/sms/sms.js... NaN
3 online.rsb.ru/hb/faces/rs/RSIndex.jspx NaN
4 online.rsb.ru/hb/faces/rs/payments/PaymentReq.... NaN
5 homecredit.ru/help/quest/feedback.php 15.0
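An alternative that avoids the replace-then-clean-up step is to loop over the patterns and fill the code wherever str.contains reports a match. This is only a sketch on a two-row excerpt of the data above; it assumes the first matching pattern should win and that one pass per pattern is acceptable for the size of df2.

```python
import pandas as pd

df1 = pd.DataFrame({'url': [
    'online.rsb.ru/hb/faces/rs/RSIndex.jspx',
    'homecredit.ru/help/quest/feedback.php',
]})
df2 = pd.DataFrame({
    'URL': [r'gazprombank\.ru\/contacts\/',
            r'homecredit\.ru\/help\/quest\/feedback\.php'],
    'Code': [15, 15],
})

# start with NaN everywhere, so unmatched urls stay NaN
df1['code'] = float('nan')
for pattern, code in zip(df2['URL'], df2['Code']):
    matched = df1['url'].str.contains(pattern, regex=True)
    # only fill rows that no earlier pattern has matched
    df1.loc[matched & df1['code'].isna(), 'code'] = code

print(df1)
```

Unlike Series.replace, the url values are never written into the code column, so no second pass is needed to blank them out.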