I have two dataframes and I need to add a new column to the first dataframe, using values from the second.
First df is:
ID,"url","used_at","active_seconds"
8075643aab791cec7dc9d18926958b67,"sberbank.ru/ru/person/promo/10mnl?utm_source=Vesti.ru&utm_medium=html&utm_campaign=10_million_users_SBOL_dec2015&utm_term=every14_syncbanners",2016-01-01 00:03:16,183
a04a8041ffa6fe1b85471ca5af1ee575,"online.rsb.ru/hb/faces/system/login/rslogin.jsp?credit=false",2016-01-01 00:04:36,42
a04a8041ffa6fe1b85471ca5af1ee575,"online.rsb.ru/hb/faces/system/login/sms/sms.jsp?smsAuth=true",2016-01-01 00:05:18,22
a04a8041ffa6fe1b85471ca5af1ee575,"online.rsb.ru/hb/faces/rs/RSIndex.jspx",2016-01-01 00:05:40,14
a04a8041ffa6fe1b85471ca5af1ee575,"online.rsb.ru/hb/faces/rs/payments/PaymentReq.jspx",2016-01-01 00:05:54,22
ba880911a6d54f6ea6d3145081a0e0dd,"homecredit.ru/help/quest/feedback.php",2016-01-01 00:06:12,2
Second df looks like
URL Code
citibank\.ru\/russia\/info\/rus\/contacts_form\.htm 15
citibank\.ru\/russia\/info\/rus\/contacts\.htm 15
gazprombank\.ru\/contacts\/ 15
gazprombank\.ru\/feedback\/ 15
gazprombank\.ru\/additional_office\/ 15
homecredit\.ru\/help\/quest\/feedback\.php 15
homecredit\.ru\/offices\/* 15
If I didn't have regexes, I would use
df1['code'] = df1.url.map(df2.set_index('URL')['Code'])
But I can't do that here, because df2.URL contains regex patterns.
But
df1['code'] = df1['url'].replace(df2['URL'], df2['Code'], regex=True)
doesn't work.
As per my comment, the pandas.Series.replace() method doesn't allow Series objects as the to_replace and value arguments. Passing lists instead works:
df1['code'] = df1.url.replace(df2.URL.values, df2.Code.values, regex=True)
print(df1[['url', 'code']])
produces the following output:
url \
0 sberbank.ru/ru/person/promo/10mnl?utm_source=V...
1 online.rsb.ru/hb/faces/system/login/rslogin.js...
2 online.rsb.ru/hb/faces/system/login/sms/sms.js...
3 online.rsb.ru/hb/faces/rs/RSIndex.jspx
4 online.rsb.ru/hb/faces/rs/payments/PaymentReq....
5 homecredit.ru/help/quest/feedback.php
code
0 sberbank.ru/ru/person/promo/10mnl?utm_source=V...
1 online.rsb.ru/hb/faces/system/login/rslogin.js...
2 online.rsb.ru/hb/faces/system/login/sms/sms.js...
3 online.rsb.ru/hb/faces/rs/RSIndex.jspx
4 online.rsb.ru/hb/faces/rs/payments/PaymentReq....
5 15
In answer to your additional comments: you can't get df2.Code into df1.code for rows where df1.url doesn't match any of the regex strings, but you can put a placeholder value (e.g. None) in the column for those cases instead. This can be done, for example, by adding the following line:
df1['code'] = df1.apply(lambda x: None if x.code == x.url else x.code, axis=1)
after which print(df1[['url', 'code']]) returns the following:
url code
0 sberbank.ru/ru/person/promo/10mnl?utm_source=V... NaN
1 online.rsb.ru/hb/faces/system/login/rslogin.js... NaN
2 online.rsb.ru/hb/faces/system/login/sms/sms.js... NaN
3 online.rsb.ru/hb/faces/rs/RSIndex.jspx NaN
4 online.rsb.ru/hb/faces/rs/payments/PaymentReq.... NaN
5 homecredit.ru/help/quest/feedback.php 15.0
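Putting it together, here's a minimal self-contained sketch of the whole approach (the sample rows and regex list are shortened from the question's data for brevity):
import pandas as pd

df1 = pd.DataFrame({'url': ['sberbank.ru/ru/person/promo/10mnl',
                            'homecredit.ru/help/quest/feedback.php']})
df2 = pd.DataFrame({'URL': [r'homecredit\.ru\/help\/quest\/feedback\.php',
                            r'gazprombank\.ru\/contacts\/'],
                    'Code': [15, 15]})

# replace matching urls with their code; non-matches pass through unchanged
df1['code'] = df1.url.replace(df2.URL.values, df2.Code.values, regex=True)
# blank out the non-matches (replace left those urls untouched)
df1['code'] = df1.apply(lambda x: None if x.code == x.url else x.code, axis=1)
print(df1[['url', 'code']])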
I have a dataframe like this:
Seq Value
20-35-ABCDE 14268142.986651151
21-33-ABEDFD 4204281194.109206
61-72-ASDASD 172970.7123134008912
61-76-ASLDKAS 869238.232460215262
63-72-ASDASD string1
63-76-OIASD 20823821.49471747433
64-76-ASDAS(s)D string1
65-72-AS*AS 8762472.99003354316
65-76-ASYAD*S* 32512348.3285536161
66-76-A(AD(AD)) 3843230.72933184169
I want to rank the rows based on the Value, highest to lowest, and return the top 50% of the rows (where the row number could change over time).
I wrote this to do the ranking:
import sys
import pandas as pd

df = pd.read_csv(sys.argv[1], sep='\t')
df.columns = ['Seq', 'Value']
# can't convert up front because there are some strings
# pd.to_numeric(df['Value'])
df2 = df.sort_values(['Value'], ascending=True).head(10)
print(df2)
The output is like this:
Seq Value
17210 ASK1 0.0
15061 ASD**ASDHA 0.0
41110 ASD(£)DA 1.4355078174305618
50638 EILMH 1000.7985554926368
62019 VSEFMTRLF 10000.89805735126
41473 LEDSAGES 10002.182707004016
41473 LEDSASDES 10000886.012834921
So I guess it sorted them as strings instead of floats. I'm struggling to understand how to sort by float when some of the entries in that column say string1: I want to sort by all the floats and then just put all the string1 rows at the end. After that I want to be able to return the Seq values in the top 50% of the sorted rows.
Can someone help me with this, even just the sorting part?
The problem is that your column is storing the values as strings, so they will sort according to string sorting, not numeric sorting. You can sort numerically using the key of DataFrame.sort_values, which also allows you to preserve the string values in that column.
Another option would be to turn that column into a numeric column before the sort, but then the non-numeric values must be replaced with NaN.
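For instance, a rough sketch of that second option, using the question's df (note it permanently overwrites the strings with NaN):
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')  # strings become NaN
df = df.sort_values('Value')  # NaN rows go to the end by default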
Sample data
import pandas as pd
df = pd.DataFrame({'Seq': [1,2,3,4,5],
'Value': ['11', '2', '1.411', 'string1', '91']})
# String sorting
df.sort_values('Value')
#Seq Value
#2 3 1.411
#0 1 11
#1 2 2
#4 5 91
#3 4 string1
Code
# Numeric sorting
df.sort_values('Value', key=lambda x: pd.to_numeric(x, errors='coerce'))
Seq Value
2 3 1.411
1 2 2
0 1 11
4 5 91
3 4 string1
Honestly, I think it would make more sense to use one column for the float values and one for the strings. That said, you can convert to numeric only for the sorting using the key parameter of sort_values. The NaN/strings will be pushed to the end.
df.sort_values(by='Value', key=lambda x: pd.to_numeric(x, errors='coerce'))
output:
Seq Value
2 61-72-ASDASD 172970.7123134008912
3 61-76-ASLDKAS 869238.232460215262
9 66-76-A(AD(AD)) 3843230.72933184169
7 65-72-AS*AS 8762472.99003354316
0 20-35-ABCDE 14268142.986651151
5 63-76-OIASD 20823821.49471747433
8 65-76-ASYAD*S* 32512348.3285536161
1 21-33-ABEDFD 4204281194.109206
4 63-72-ASDASD string1
6 64-76-ASDAS(s)D string1
An alternative, splitting the floats and strings apart:
s = df['Value']
(df.assign(Value=pd.to_numeric(s, errors='coerce'),
Strings=lambda d: s.where(d['Value'].isna())
)
.sort_values(by=['Value', 'Strings'])
)
output:
Seq Value Strings
2 61-72-ASDASD 1.729707e+05 NaN
3 61-76-ASLDKAS 8.692382e+05 NaN
9 66-76-A(AD(AD)) 3.843231e+06 NaN
7 65-72-AS*AS 8.762473e+06 NaN
0 20-35-ABCDE 1.426814e+07 NaN
5 63-76-OIASD 2.082382e+07 NaN
8 65-76-ASYAD*S* 3.251235e+07 NaN
1 21-33-ABEDFD 4.204281e+09 NaN
4 63-72-ASDASD NaN string1
6 64-76-ASDAS(s)D NaN string1
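To also get the top 50% the question asks for, one rough sketch building on the key-based sort (descending, so the highest values come first; the string1 rows become NaN under the key and sort last):
ranked = df.sort_values('Value',
                        key=lambda x: pd.to_numeric(x, errors='coerce'),
                        ascending=False)
top_half = ranked.head(len(ranked) // 2)  # integer division: top 50% of rows
print(top_half['Seq'])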
I have found examples of how to remove a column based on all values or a threshold, but I have not been able to find a solution to my particular problem: dropping a column if its last row is NaN. The reason is that I'm using time series data in which collection doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. However, I don't want columns whose most recent value is NaN, as that means they're defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
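Note that with axis=1, subset refers to row labels, so df.index[-1] restricts the NaN check to the last row, and how='any' drops a column if that single value is missing. A quick self-contained check, assuming the sample frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
print(df.dropna(axis=1, subset=[df.index[-1]], how='any'))  # B is dropped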
You can use .iloc, .loc and .notna() to sort out your problem.
df = pd.DataFrame({"A":[np.nan, 1,"x",4],
"B":["t",2,"y",np.nan],
"C":["x",3,"z",6]})
df = df.loc[:,df.iloc[-1,:].notna()]
You can use a boolean Series to select the column to drop
df.drop(df.loc[:,df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for col in temp_df.columns:
    if temp_df[col].iloc[-1] == 'nan':
        temp_df = temp_df.drop(col, axis=1)
This will work for you.
Basically what I'm doing here is looping over all columns and checking if the last entry is the string 'nan', then dropping that column.
temp_df.columns
holds the column labels.
temp_df.drop(col, axis=1)
col is the column label and axis=1 says you want to drop a column rather than a row.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that someone can compare anything they wish to.
Another method I found is pd.isnull(), a function in the pandas library, which works like this:
for col in temp_df.columns:
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)
I have a dataframe like this.
print(df)
      ID  ... Control
0  PDF-1  ...     NaN
1  PDF-3  ...     NaN
2  PDF-4  ...     NaN
I want to get only the number part of the ID column, so the result will be:
1
3
4
How can I get just part of the strings in a dataframe column?
How about just replacing the common PDF- prefix?
df['ID'].str.replace('PDF-', '')
Could you please try the following.
df['ID'].replace(regex=True, to_replace=r'([^\d])', value=r'')
One could refer to the documentation for df.replace.
Basically this uses a regex to remove everything apart from digits in the column named ID: \d denotes digits, and [^\d] matches everything apart from digits.
Another possibility using Regex is:
df.ID.str.extract(r'(\d+)')
This avoids changing the original data just to extract the integers.
So for the following simple example:
import pandas as pd
df = pd.DataFrame({'ID':['PDF-1','PDF-2','PDF-3','PDF-4','PDF-5']})
print(df.ID.str.extract(r'(\d+)'))
print(df)
we get the following:
0
0 1
1 2
2 3
3 4
4 5
ID
0 PDF-1
1 PDF-2
2 PDF-3
3 PDF-4
4 PDF-5
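If you need actual integers rather than digit strings, you can convert the extracted column afterwards; a small sketch (ID_num is just an illustrative name for the new column):
df['ID_num'] = df['ID'].str.extract(r'(\d+)', expand=False).astype(int)
Here expand=False makes extract return a Series instead of a one-column DataFrame, which is what allows the direct assignment.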
Find "PDF-" ,and replace it with nothing
df['ID'] = df['ID'].str.replace('PDF-', '')
Then, to print it the way you asked, I'd convert the column to a string with no index:
print(df['ID'].to_string(index=False))
What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_100 columns there will need to be. I just need to gather all associated atc_codes into one row for each visit_id.
This seems like a group_by and then a pivot but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge() but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!
First use cumcount to number the values within each group; those numbers become the columns created by pivot. Then add any missing columns with reindex, prefix the column names with add_prefix, and finally reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = (df.assign(num=g)
        .pivot(index='visit_id', columns='num', values='atc_code')
        .reindex(range(1, 8), axis=1)
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
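For the final step mentioned in the question, a rough sketch of pasting the atc_1 ... atc_7 columns into one composite code, skipping missing values (the atc_composite name is just illustrative):
atc_cols = [c for c in df.columns if c.startswith('atc_')]
df['atc_composite'] = df[atc_cols].apply(
    lambda row: ','.join(row.dropna().astype(str)), axis=1)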
I have two dataframes like the following:
df1
id name
-------------------------
0 43 c
1 23 t
2 38 j
3 9 s
df2
user id
--------------------------------------------------
0 222087 27,26
1 1343649 6,47,17
2 404134 18,12,23,22,27,43,38,20,35,1
3 1110200 9,23,2,20,26,47,37
I want to split all the ids in df2 into multiple rows and join the resultant dataframe to df1 on "id".
I do the following:
b = pd.DataFrame(df2['id'].str.split(',').tolist(), index=df2.user_id).stack()
b = b.reset_index()[[0, 'user_id']]  # the stacked values column is currently labeled 0
b.columns = ['id', 'user_id']
When I try to merge, I get NaNs in the resultant dataframe.
pd.merge(b, df1, on = "id", how="left")
id user name
-------------------------------------
0 27 222087 NaN
1 26 222087 NaN
2 6 1343649 NaN
3 47 1343649 NaN
4 17 1343649 NaN
So, I tried doing the following:
b['name']=np.nan
for i in range(0, len(df1)):
b['name'][(b['id'] == df1['id'][i])] = df1['name'][i]
It still gives the same result as above. I am confused as to what could cause this because I am sure both of them should work!
Any help would be much appreciated!
I read similar posts on SO but none seemed to have a concrete answer. I am also not sure whether this is even a coding issue at all.
Thanks in advance!
The problem is that you need to convert the id column of b to int after the split, because the output of string functions is always string, even when the content is numeric.
b.id = b.id.astype(int)
Another solution is to convert df1.id to string:
df1.id = df1.id.astype(str)
You got NaNs because there was no match: str values don't match int values.
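For reference, a compact self-contained sketch of the whole flow on newer pandas (0.25+), where explode replaces the manual split/stack and the dtype fix is applied before the merge (column names follow the question's display):
import pandas as pd

df1 = pd.DataFrame({'id': [43, 23, 38, 9], 'name': ['c', 't', 'j', 's']})
df2 = pd.DataFrame({'user': [222087, 1343649],
                    'id': ['27,26', '6,47,17']})

b = df2.assign(id=df2['id'].str.split(',')).explode('id')
b['id'] = b['id'].astype(int)  # match df1.id's int dtype before merging
print(pd.merge(b, df1, on='id', how='left'))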