pandas and "re" - search for total and partial strings - python

This is an extended question from this topic. I would like to search strings for complete and partial matches, using keywords like those in the following Series "w":
rigour*
*demeanour*
centre*
*arbour
fulfil
This means that I want to find words like rigour and rigours, endemeanour and demeanours, centre and centres, harbour and arbour, and fulfil. The keyword list is therefore a mix of complete and partial strings to find. I would like to apply the search to this DataFrame "df":
ID;name
01;rigour
02;rigours
03;endemeanour
04;endemeanours
05;centre
06;centres
07;encentre
08;fulfil
09;fulfill
10;harbour
11;arbour
12;harbours
What I tried so far is the following:
r = re.compile(r'.*({}).*'.format('|'.join(w.values)), re.IGNORECASE)
then I've built a mask to filter the DataFrame:
mask = [m.group(1) if m else None for m in map(r.search, df['name'])]
in order to get a new column with the Keyword found:
df['keyword'] = mask
What I'm expecting is the following resulting DataFrame:
ID;name;keyword
01;rigour;rigour
02;rigours;rigour
03;endemeanour;demeanour
04;endemeanours;demeanour
05;centre;centre
06;centres;centre
07;encentre;None
08;fulfil;fulfil
09;fulfill;None
10;harbour;arbour
11;arbour;arbour
12;harbours;None
This works when the list w contains no *. However, I ran into several problems formatting the keyword list w with the * conditions so that re.compile runs correctly.
Any help would be really appreciated.

It looks like your input Series w needs to be adjusted so that it can be used as a regex pattern, like this:
rigour.*
.*demeanour.*
centre.*
\\b.*arbour\\b
\\bfulfil\\b
Note that * in a regex must follow something; it does not work on its own. It means that whatever precedes it can be repeated 0 or more times.
Note also that fulfil is part of fulfill, so if you want a strict match you need to tell the regex so. For example, the word-boundary anchor \b makes the pattern match the string only as a whole word.
Here is what your regex might look like to give you the results you need:
s = '({})'.format('|'.join(w.values))
r = re.compile(s, re.IGNORECASE)
r
re.compile(r'(rigour.*|.*demeanour.*|centre.*|\b.*arbour\b|\bfulfil\b)', re.IGNORECASE)
The new column can then be built with the pandas .where method, like this:
df['keyword'] = df.name.where(df.name.str.match(r), None)
df
ID name keyword
0 1 rigour rigour
1 2 rigours rigours
2 3 endemeanour endemeanour
3 4 endemeanours endemeanours
4 5 centre centre
5 6 centres centres
6 7 encentre None
7 8 fulfil fulfil
8 9 fulfill None
9 10 harbour harbour
10 11 arbour arbour
11 12 harbours None
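The output above returns the whole matched word, whereas the question asked for the keyword itself. Here is a minimal sketch of one way to get that, assuming * only ever appears at the start or end of a keyword and stands for any run of word characters. The helper names wildcard_to_regex and find_keyword are made up for illustration and are not part of either post.

```python
import re

keywords = ["rigour*", "*demeanour*", "centre*", "*arbour", "fulfil"]
names = ["rigour", "rigours", "endemeanour", "endemeanours", "centre", "centres",
         "encentre", "fulfil", "fulfill", "harbour", "arbour", "harbours"]


def wildcard_to_regex(keyword):
    """Turn a keyword with optional leading/trailing '*' into a compiled regex."""
    core = re.escape(keyword.strip("*"))
    # Assumption: '*' means "any run of word characters" on that side
    prefix = r"\w*" if keyword.startswith("*") else ""
    suffix = r"\w*" if keyword.endswith("*") else ""
    return re.compile(prefix + core + suffix, re.IGNORECASE)


# Pair each bare keyword with its compiled pattern
patterns = [(kw.strip("*"), wildcard_to_regex(kw)) for kw in keywords]


def find_keyword(name):
    """Return the first keyword whose pattern matches the whole name, else None."""
    for core, pattern in patterns:
        if pattern.fullmatch(name):
            return core
    return None


found = [find_keyword(name) for name in names]
```

With a DataFrame, the same function could be applied as df['keyword'] = df['name'].map(find_keyword).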

Related

Pandas Dataframe replace string based on length

I have a pandas dataframe (df2) with about 160,000 rows. I'm trying to change some of the values in a column (url).
The strings in this column have lengths between 108 and 150 characters. If the string is not 108 characters, I want to replace it with the same string, cutting off the last 10 characters. If the string is 108 characters, I want to leave it alone. Please note that I'm not trying to make every string 108 characters; I'm just trying to cut off the last 10 characters of any string that isn't 108 characters.
example: len(s) = 114, replace with s[:-10]
I built a function that will do this, but it's insanely slow, probably because it rebuilds the dataframe in each loop.
for i in df2.url:
    if len(i) != 108:
        new_i = i[:-10]
        df2 = df2.replace(i,new_i)
There has to be a faster way to do this, but I haven't been able to figure out how. I would love the expertise of someone more versed in pandas.
Below is an example of 200 rows of the column I'm trying to change:
['https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301108?gameHash=bde58669fc59c853&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291187?gameHash=f7fcd2d6ca775fb5&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291192?gameHash=005335984c8f8a3a&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301128?gameHash=fcbd2630c0faec49&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301159?gameHash=9a7726176fdabfde&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301169?gameHash=5d816e6d30d2b659&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301183?gameHash=396641afdcdd99d9&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271494?gameHash=bd51798e1358c47f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130153?gameHash=00a7861ac0a23aef',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271495?gameHash=0d828bbc9aa9996c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271497?gameHash=bd4810bb801abf24',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130166?gameHash=1cff679b64acb047',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130177?gameHash=1f92cbefd9a965e0',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271500?gameHash=abbdae6c3e7b4006',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271505?gameHash=7c970a84e132a578',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130182?gameHash=ccb50f6e86e4c3df',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130193?gameHash=0995997660a65721',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301262?gameHash=c594a9a52f46cc50',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130196?gameHash=31553f5bb6ba4420',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301270?gameHash=5b3babb5d392d78d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130201?gameHash=3d2aa031c17d90ae',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301290?gameHash=31ce80069fdbc873',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130210?gameHash=91c7b22cded939ff',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301305?gameHash=3f8d664b3b988446',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130221?gameHash=a8580ee66ffbb525',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291406?gameHash=5220923eb35c42c6',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291426?gameHash=83c7c51530ea074e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291442?gameHash=28f7b485f710168f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291458?gameHash=49cc14d02ccd0674',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291470?gameHash=f087c853097c2dd9',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261474?gameHash=e6c01a288de5dc41',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130229?gameHash=1489421028163983',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261475?gameHash=c984e795d6406cd5',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130243?gameHash=5491d110de253089',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261482?gameHash=f2283324f82caa66&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130253?gameHash=f8e39ae785d11c0c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130264?gameHash=a98718c088ce663c',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261488?gameHash=6517011920487fbf&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291651?gameHash=5ec1b3473060dfd2',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291682?gameHash=a8f2c06d04117279',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291703?gameHash=cfb2d078f289825c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291737?gameHash=cf67a15df43c2bb2',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291748?gameHash=7a3c085cf703d7bd',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291789?gameHash=51e5ed28085fd299',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291812?gameHash=e540d208bbc69bb3',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291835?gameHash=a75ab48a22470022',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291845?gameHash=2eab12f8ffd0dfd0',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130294?gameHash=ecf040ad60fa9726&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130299?gameHash=499a21480080a722&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130306?gameHash=d0e60bf49b6bf008&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292296?gameHash=3db885bd11a047bc',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292315?gameHash=2ecf71aaea031312',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292329?gameHash=5ed85b948b32b8e8',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292341?gameHash=7335d6ca06763dc0&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292345?gameHash=6f86444cce429244',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292348?gameHash=c6a4eec48810e8d5&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292353?gameHash=6db57c090ed235bd&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292354?gameHash=79845cdf9a6e88db',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120429?gameHash=436739b9e99a246e&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120436?gameHash=58bc4281a76534f3&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120440?gameHash=4b74592ff226c39f&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120447?gameHash=9358d210749ab778&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292579?gameHash=14865e88bd1e30a7',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292607?gameHash=0ae34d7f67620dc4',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292635?gameHash=f94944bb4f061f0d',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292648?gameHash=1338dde99c71877f&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302501?gameHash=f71748ae9cad5866&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302519?gameHash=672c1377c3d37ed0&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302531?gameHash=49cf9a8f3942b9c8&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302595?gameHash=314d39ea940b354f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302628?gameHash=0ab39ec364a3ff5b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302635?gameHash=5625553825f5994e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302651?gameHash=555c7cd73dff952d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292960?gameHash=e3ce73c142354517',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1292974?gameHash=ab79b8f6f354bc0b',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302827?gameHash=6a1a5de57a7ce6b9&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302855?gameHash=f9144d0822d68632&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302881?gameHash=369cd071defeadd9&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1302906?gameHash=c65d2e76e9aa721e&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130488?gameHash=411522a3de69bb79',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130489?gameHash=51c4c81c13a484c7',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130496?gameHash=9575986535e4f4c2',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293312?gameHash=8e2209227e28843b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120557?gameHash=f5bec07774ed5a5e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293319?gameHash=762cb3a92744846f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130515?gameHash=548d7e528ef1f81e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293370?gameHash=8a70038d2eba61de',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293393?gameHash=841d85edbfa78057',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130518?gameHash=6764d64a5ef8377e&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120578?gameHash=838a1db0f44411c8&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120583?gameHash=c542c3368048efd6&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120585?gameHash=925d9c523a0b0bdb&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293765?gameHash=53412e36eb2eab86',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303478?gameHash=8df5ef3d826ad211&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303509?gameHash=d0849b1ba82d4826&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293812?gameHash=48d825f1bb110b55',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303533?gameHash=3a712b015a672d8d&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293850?gameHash=0a29fdee10ed35d0',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293885?gameHash=1ffaffd98da7e806',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303581?gameHash=2bf61273d44c302f&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293897?gameHash=77ccf507e1eaa05c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293899?gameHash=aa93723cded96f3b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293901?gameHash=f5fb660360f96ad6',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293909?gameHash=245dbdf428788434',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303619?gameHash=2e2f2ff9c6a32595&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303626?gameHash=3bba86d0f9ff1d11&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293929?gameHash=f4b6f53e68bbbc86',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303641?gameHash=25ffa91aeb9ed707',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293950?gameHash=e2f3a99412844d36',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303655?gameHash=4ff2ebbe72e635bb',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1293964?gameHash=10bd6ec239231196',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130548?gameHash=afd267703d3cbbb1&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303666?gameHash=a30e98d241d22eef',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130553?gameHash=f4360fb632593491&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130560?gameHash=e1e5bae936585a24&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120607?gameHash=f4b702f689f87c90',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130563?gameHash=43a7c73ecd281a63&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120622?gameHash=c87f08d06f392f3f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120629?gameHash=6b39ee929c2ebc47',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120638?gameHash=eb17c2013b9ee77d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120649?gameHash=aab6f321110ef3ed',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294174?gameHash=8f5cb3f02bf790d7&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303861?gameHash=02847551947ca67d&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294191?gameHash=e574ac58bbe81abb&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303876?gameHash=e733bc45e47f4856&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303904?gameHash=d8aac7332b9edfe8',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294233?gameHash=6762c2c72bc47359&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303974?gameHash=28f566b2fa35a32a&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294260?gameHash=246841d34c9660aa&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1303999?gameHash=619d5a2d571a1b01&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304020?gameHash=99508a2da285eb4c&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304032?gameHash=30e5b243a407326c&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304035?gameHash=f8e3702e77f87cc7',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304054?gameHash=db39c4bcb7c2320e&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304075?gameHash=a3d0d6acfb8b92f1&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304079?gameHash=8339c23d8d925f8b',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304081?gameHash=d3e5c8f0270ce96f&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304111?gameHash=64ded61e41c18ccd',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304122?gameHash=bf7e80351592ce98',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304132?gameHash=ff37582431bd7e7b&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130611?gameHash=a099e1df984018a1',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304158?gameHash=62b9c13c8cecf652',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294417?gameHash=746905a629b8f374',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130621?gameHash=9d171c9622870a7b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304165?gameHash=c34ae80c4ee8c7bd',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304169?gameHash=ee6bc6a087a6bc36&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304193?gameHash=fe234e8ca7d2343f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130630?gameHash=b1b183ad3374db06',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304217?gameHash=b29c2b7461c7700f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304223?gameHash=7c70a52e69b01c56',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130643?gameHash=15bb88ac79a622a1&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120705?gameHash=a6532b3af6accaf2',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304264?gameHash=f5e69d8e2f6bae5e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1120711?gameHash=79659ad2d107f0d9',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304282?gameHash=f9ea42cec97e930f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130676?gameHash=f3e34e47140460ff',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304302?gameHash=617f3af3e7d2ab4d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130696?gameHash=532d412c3f38c0c5',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304324?gameHash=01bf1a7465a412ba',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130709?gameHash=1491ee8228ad66ec',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304356?gameHash=fc584c5143087c0b',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304374?gameHash=175112ba57cbce5e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304391?gameHash=72ad86120a14eb54',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304399?gameHash=2536b98ac19e617d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304412?gameHash=a30a480459e9151c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294710?gameHash=0d01dd80aa803997',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294725?gameHash=3ae63821918e2b43',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294738?gameHash=5fd20a0eec2c86f4',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304487?gameHash=0d5d1e4e719e8c46&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130723?gameHash=c5113e7f25839c2e&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304504?gameHash=fda4895c0bea1e8a&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294750?gameHash=fa9335e1a61165a5',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130734?gameHash=88de231d14ea4b07&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304521?gameHash=b70af7bde6c54520&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294784?gameHash=7d1bf4754cda9b46',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130739?gameHash=4ddb470392dd9248&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304544?gameHash=b14635dac9add7b4&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1294813?gameHash=51f4579db6e7049f',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130748?gameHash=8f544ac73c53a606&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304574?gameHash=a5dcde67b90f29e3&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304609?gameHash=7a5a6778a7074f09',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304635?gameHash=4cda5972cf8dd6bd',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304671?gameHash=67e29eccbbc8f667',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304691?gameHash=1572b2bb76b73da1',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304700?gameHash=b50bc9265ac35f9f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304739?gameHash=b80bb99cbce5bd71',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304768?gameHash=6ed6e9d7108f27e0',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304785?gameHash=13054e8f14fcae76',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304794?gameHash=b4c27881f0c4481c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304884?gameHash=29e8f7f002108b46',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304889?gameHash=7774d22665d9526d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304894?gameHash=bac5c580914f7aaa',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1304928?gameHash=b8b029b3d4002fbc',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130793?gameHash=a3f6e45612b56302',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295356?gameHash=731afc76037bd245',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295369?gameHash=e743122ca08b77d8',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295383?gameHash=072ee1028f03f4c9',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295402?gameHash=560c984fc1ba1168',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295440?gameHash=32cdbf5ce1441159&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295467?gameHash=bcc21c92fa78e889&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1295489?gameHash=95386dea5cf1ab09&tab=overview']
Basic Solution
The solution below uses a lambda function passed to pandas.Series.apply().
df['url'] = df['url'].apply(lambda x: x if len(x) == 108 else x[:-10])
Here, each value within df['url'] (x) remains the same if len(x) == 108, otherwise it is updated to be x[:-10].
Handling Exceptions
The solution below is similar to the one above, but adds some basic exception handling inside the url_trim() function called by pandas.Series.apply().
This is more robust than the first solution: it will not halt execution when an unexpected value in df['url'] raises an exception (for example, if numpy.nan is used for null values). In those cases the value is simply left unchanged.
def url_trim(x):
    try:
        if len(x) != 108:
            return x[:-10]
        else:
            return x
    except TypeError:
        # Values without a length (e.g. numpy.nan) are returned unchanged
        return x
df['url'] = df['url'].apply(url_trim)
The following code checks the length of each value in the url column and trims the last 10 characters when the string is not exactly 108 characters long. The modified URL is stored in a new modified_url column.
# Get string length
df["string_length"] = df["url"].astype(str).str.len()
# Create a filter for strings that are not exactly 108 characters long
filter_length = df["string_length"] != 108
# Copy the original URLs, then trim only the rows selected by the filter
df["modified_url"] = df["url"]
df.loc[filter_length, "modified_url"] = df.loc[filter_length, "url"].astype(str).str[:-10]
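The same rule can also be written without calling a Python function per row, which may matter with about 160,000 rows. This is a minimal sketch on made-up toy data (two placeholder strings, not real URLs), using the pandas .str accessor and Series.where.

```python
import pandas as pd

# Toy data standing in for df["url"]: one 108-character string and one longer one
df = pd.DataFrame({"url": ["a" * 108, "b" * 118]})

# Boolean mask of the rows whose length differs from 108
needs_trim = df["url"].str.len() != 108

# Series.where keeps values where the condition is True and takes the
# replacement elsewhere, so only rows that need trimming get url[:-10]
df["url"] = df["url"].where(~needs_trim, df["url"].str[:-10])

lengths = df["url"].str.len().tolist()
```

Here the 118-character string is cut to 108 characters only because it happened to be 10 characters too long; in general the result is simply 10 characters shorter than the input.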

Ranking how direct spaCy dependencies are on tree

I have a SpaCy dependency tree made by this code:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_lg')  # assumes this English model is installed
text = "We could say to them that if in fact that's all there is, then we could, Oh, we can do something."
print(displacy.render(nlp(text), style='dep', jupyter = True, options = {'distance': 120}))
That prints out this:
SpaCy determines that this entire string is connected in a dependency tree. What I am trying to figure out is how to discern how direct or indirect the connection is between a word and the next word. For example, looking at the first 3 words:
'We' is connected to the next word 'could', because it is directly connected to 'say', which is directly connected to 'could'. Therefore, it is 2 connection points away from the next word.
'could' is directly connected to 'say'. Therefore, it is 1 connection point away from the next word.
and so on.
Essentially, I want to make a df that would look like this:
word connection_points_to_next_word
We 2
could 1
say 1
...
I'm not sure how to achieve this. As SpaCy makes this graph, I'm sure there is some efficient way to calculate the number of vertices required to connect adjacent nodes, but all of SpaCy's tools I've found, such as:
token.lefts
token.rights
token.subtree
token.children
more here https://spacy.io/api/token
include connection information, but not how direct this connection is. Any ideas on how to approach this problem?
Using the networkx library, we can build an undirected graph from the edgelist of token-children relationships. I am using the index of the token in the document as a unique identifier so that repeat words are treated as separate nodes.
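import spacy
import networkx as nx
nlp = spacy.load('en_core_web_lg')
text = "We could say to them that if in fact that's all there is, then we could, Oh, we can do something."
doc = nlp(text)
# Collect one (token index, child index) edge per dependency arc
edges = []
for tok in doc:
    edges.extend([(tok.i, child.i) for child in tok.children])
# Build the undirected graph queried by the shortest-path code
graph = nx.Graph(edges)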
The shortest path between neighboring tokens can be calculated as below:
for idx, _ in enumerate(doc):
    if idx < len(doc)-1:
        print(doc[idx], doc[idx+1], nx.shortest_path_length(graph,source=idx, target=idx+1))
Output:
We could 2
could say 1
say to 1
to them 1
them that 4
that if 3
if in 2
in fact 1
fact that 3
that 's 1
's all 1
all there 2
there is 1
is , 4
, then 2
then we 2
we could 2
could , 2
, Oh 2
Oh , 2
, we 2
we can 2
can do 1
do something 1
something . 3
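The same distances can be computed without networkx by running a plain breadth-first search over the head links. This is a minimal sketch on a hand-written toy tree: the heads list is a made-up stand-in for [tok.head.i for tok in doc], not real spaCy output, and tree_distance is a hypothetical helper name.

```python
from collections import deque


def tree_distance(heads, source, target):
    """Number of edges between two tokens in an undirected dependency tree."""
    # Build an adjacency list from child -> head links (the root points at itself)
    neighbours = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if child != head:
            neighbours[child].add(head)
            neighbours[head].add(child)

    # Standard breadth-first search starting from the source token
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable, e.g. tokens that belong to different trees


words = ["We", "could", "say", "something"]
heads = [2, 2, 2, 2]  # hypothetical: every word attaches to the root "say"

distances = [tree_distance(heads, i, i + 1) for i in range(len(words) - 1)]
```

Returning None for unreachable pairs is a design choice for this sketch; it covers texts with several sentences, where adjacent tokens can sit in separate trees.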

Search pattern not unique? - Regular expression

I want to write a function to clean the index column of the dataframe.
1. Delete every row that has a high-level ID. For example, delete
East Kootenay (5901) 01010
2. Reduce the index to the 7-digit number for low-level IDs. For example, turn
East Kootenay A (5901017) RDA 02020
into 5901017
3. If the label has two pairs of parentheses, keep only the 7-digit number in the second pair. For example,
Sechelt (Part) (5929803) IGD 02020 to 5929803
Capital H (Part 1) (5917054) RDA 01020 to 5917054
Capital H (Part 2) (5917056) RDA 02030 to 5917056
T'Sou-ke 1 (Sooke 1) (5917817) IRI 01010 to 5917817
T'Sou-ke 2 (Sooke 2) (5917818) IRI 00000 to 5917818
Here is example code that only works when there is a single pair of parentheses:
import re
import pandas as pd
def extract_id(s):
    # Raw string so the backslashes reach the regex engine unchanged
    m = re.search(r'\((.*)\)', s)
    if m:
        i = int(m.group(0)[1:-1])
        return i
if __name__ == '__main__':
    # Read data
    census_subdivision_profile = pd.read_excel('../data/census_subdivision_profile.xlsx', sheetname='Data',
                                               index_col='Geography', encoding='utf-8').T
    print(census_subdivision_profile.head())
    print(census_subdivision_profile.shape)
    census_subdivision_profile.index = census_subdivision_profile.index.map(extract_id)
    print(census_subdivision_profile.index)
To see the full code, see another question I posted earlier
Merge dataframes that have indices that one contains another (but not the same)
I think you intended '\(([^)]*)\)' ... hth
I don't understand the distinction between points 2 and 3. In both cases you're just wanting to extract the 7 digit number in brackets? In that case I'd be more explicit with the regex, like \((\d{7})\)
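The explicit pattern suggested in the comment above can be sketched as follows. This is a minimal example, assuming every low-level ID is exactly seven digits inside its own parentheses; the name `ID_PATTERN` and the `labels` list (copied from the question's examples) are only for illustration.

```python
import re

# Exactly seven digits enclosed in parentheses; "(Part 1)" and "(5901)" do not match
ID_PATTERN = re.compile(r"\((\d{7})\)")

def extract_id(label):
    """Return the 7-digit ID as an int, or None for labels without one."""
    match = ID_PATTERN.search(label)
    return int(match.group(1)) if match else None

labels = [
    "East Kootenay (5901) 01010",
    "East Kootenay A (5901017) RDA 02020",
    "Sechelt (Part) (5929803) IGD 02020",
    "Capital H (Part 1) (5917054) RDA 01020",
    "T'Sou-ke 1 (Sooke 1) (5917817) IRI 01010",
]

ids = [extract_id(s) for s in labels]
print(ids)
```

Labels that come back as None are the high-level rows, so they can be dropped after mapping the function over the index.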

Return values from a Python Entrez dictionary of dictionaries

I want to scrape the Interactions table from the Entrez Gene page.
The Interactions table is populated from a web server and when I tried to use the XML package in R, I could get the Entrez gene page, but the Interactions table body was empty (it had not been populated by the web server).
Dealing with the web server issue in R may be solvable (and I'd love to see how), but it seemed Biopython was an easier path.
I put together the following, which gives me what I want for an example gene:
# Pull the Entrez gene page for MAP1B using Biopython
from Bio import Entrez
Entrez.email = "jamayfie#vasci.umass.edu"
handle = Entrez.efetch(db="gene", id="4131", retmode="xml")
record = Entrez.read(handle)
handle.close()
PPI_Entrez = []
PPI_Sym = []
# Find the Dictionary that contains the Interaction table
for x in range(1, len(record[0]["Entrezgene_comments"])):
    if ('Gene-commentary_heading', 'Interactions') in record[0]["Entrezgene_comments"][x].items():
        for y in range(0, len(record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'])):
            EntrezID = record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_src']['Dbtag']['Dbtag_tag']['Object-id']['Object-id_id']
            PPI_Entrez.append(EntrezID)
            Sym = record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor']
            PPI_Sym.append(Sym)
# Return the desired values: I want the Entrez ID and Gene symbol for each interacting protein
PPI_Entrez # Returns the EntrezID
PPI_Sym # Returns the gene symbol
This code works and gives me what I want. But I think it's ugly, and I am concerned that it will break if the format of the Entrez gene page changes even slightly. In particular, there must be a better way to extract the desired information than specifying the full path, as I do with:
record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor']
But I cannot figure out how to search through a dictionary of dictionaries without specifying each level I want to descend. When I try functions like find(), they operate on the next level down, but not all the way to the bottom.
Is there a wildcard symbol, a Python equivalent of "//", or a function I can use to get to ['Object-id_id'] without naming the full path? Other suggestions for cleaner code are also appreciated.
I'm not sure about XPath in Python, but if the code works, I would not worry about removing the full paths or about whether the Entrez Gene XML will change. Since you first tried R, you could get the XML using a system call to Entrez Direct, as below, or a package like rentrez.
doc <- xmlParse( system("efetch -db=gene -id=4131 -format xml", intern=TRUE) )
Next, get the nodes corresponding to rows in the table at http://www.ncbi.nlm.nih.gov/gene/4131#interactions
x <- getNodeSet(doc, "//Gene-commentary_heading[.='Interactions']/../Gene-commentary_comment/Gene-commentary" )
length(x)
[1] 64
x[1]
x[50]
Try the easy stuff first
xmlToDataFrame(x[1:4])
Gene-commentary_type Gene-commentary_text Gene-commentary_refs Gene-commentary_source Gene-commentary_comment
1 18 Affinity Capture-MS 24457600 BioGRID110304BioGRID 255BioGRID110304255GeneID8726EEDBioGRID114265
2 18 Reconstituted Complex 20195357 BioGRID110304BioGRID 255BioGRID110304255GeneID2353FOSBioGRID108636
3 18 Reconstituted Complex 20195357 BioGRID110304BioGRID 255BioGRID110304255GeneID1936EEF1DBioGRID108256
4 18 Affinity Capture-MS 2345592220562859 BioGRID110304BioGRID 255BioGRID110304255GeneID6789STK4BioGRID112665
Gene-commentary_create-date Gene-commentary_update-date
1 2014461120 201410513330
2 201312810490 201410513330
3 201312810490 201410513330
4 20137710360 201410513330
Some tags like text, refs, source, and dates should be easy to parse
sapply(x, function(x) paste( xpathSApply(x, ".//PubMedId", xmlValue), collapse=", "))
I'm not sure about the comments or how Products, Interactants and Other Genes listed in the table are stored in the XML, but I get one or three symbols and three ids for each node here.
sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Other-source_anchor", xmlValue), collapse=" + "))
sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Object-id_id", xmlValue), collapse=" + "))
Finally, since I think Entrez Gene just copies IntAct and BioGRID, you could try those sites too. BioGRID has a really simple REST service, but you have to register for a key.
url <- "http://webservice.thebiogrid.org/interactions?geneList=MAP1B&taxId=9606&includeHeader=TRUE&accesskey=[ your ACCESSKEY ]"
biogrid <- read.delim(url)
dim(biogrid)
[1] 58 24
head(biogrid[, c(8:9,12)])
Official.Symbol.Interactor.A Official.Symbol.Interactor.B Experimental.System
1 ANP32A MAP1B Two-hybrid
2 MAP1B ANP32A Two-hybrid
3 RASSF1 MAP1B Affinity Capture-Western
4 RASSF1 MAP1B Two-hybrid
5 ANP32A MAP1B Affinity Capture-Western
6 GAN MAP1B Affinity Capture-Western
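On the Python side of the question (reaching 'Object-id_id' without naming the full path), one option is a small recursive generator that walks nested dictionaries and lists. This is a sketch, assuming the parsed record behaves like ordinary nested dicts and lists; `find_key` is a hypothetical helper, and `sample` is a hand-made miniature of the structure, not real Entrez output.

```python
def find_key(obj, key):
    """Yield every value stored under `key` at any depth of nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            # Keep descending: the same key may also occur deeper down
            yield from find_key(v, key)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from find_key(item, key)

# Hypothetical miniature of one Interactions entry list (not real Entrez output)
sample = {
    "Gene-commentary_comment": [
        {"Gene-commentary_source": [{
            "Other-source_anchor": "EED",
            "Other-source_src": {"Dbtag": {"Dbtag_tag": {"Object-id": {"Object-id_id": "8726"}}}},
        }]},
        {"Gene-commentary_source": [{
            "Other-source_anchor": "FOS",
            "Other-source_src": {"Dbtag": {"Dbtag_tag": {"Object-id": {"Object-id_id": "2353"}}}},
        }]},
    ]
}

ids = list(find_key(sample, "Object-id_id"))
symbols = list(find_key(sample, "Other-source_anchor"))
print(ids, symbols)
```

A blanket search returns every occurrence of the key, so it is probably safest to call it on the sub-dictionary whose heading is 'Interactions' rather than on the whole record.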

Python Basics – slicing a long string and combine the sliced in wanted pieces

Environment: Win 7; Python 2.7.6
Hello all…I need to pick up some texts from a string, which looks like:
“C-603WallWizard45256CCCylinders:2HorizontalOpposedBore:1-1/4Stroke:1-1/8Length: SingleVerticalBore:1-111Height:6Width:K-720Cooling:AirWeight:6LBS1.5H.P.#54500RPMC-60150ccGasEngineCylinder:4VerticalInlineBore:1Stroke:1Cycle:4Weight:6-1/2LBSLength:10Width: :AirLength16Cooling:AirLength:5Width:4L-233Height:6Weight: 4TheBlackKnightc-609SteamEngineBore:11/16Stroke:11/16Length:3Width:3Height:4TheChallengerC-600Bore:1Stroke:1P-305Weight:18LBSLength:12Width:7Height:8C-606Wall15ccGasEngineJ-142Cylinder:SingleVerticalBore:1Stroke:1-1/8Cooling:1Stroke:1-1/4HP:: /4Stroke:1-7/:6Width:6Height:9Weight:4LBS1.75H.P.#65200RPM”
What I want to extract:
I. Combinations of 1 letter + 3 digits, joined by ‘-’, such as C-603, K-720, C-606, etc.
II. Runs of 5 consecutive digits, such as 45256, 54500, 60150, 65200, etc.
My idea is to:
slice the string into individual characters, like ‘C’, ‘-’, ‘6’, ‘0’, ‘3’, … ‘R’, ‘P’, ‘M’
combine them into 4-character and 5-character pieces, like ‘C-60’, ‘-603’, ‘603W’… and ‘C-603W’, ‘-603W’ , ‘603Wa’
pick out the ones that fit criteria I and II
Does this sound like a workable approach? If so, what commands can I use for these steps?
Thanks.
Going with regular expressions is one way to do it:
>>> data = '''C-603WallWizard45256CCCylinders:2HorizontalOpposedBore:1-1/4Stroke:1-1/8Length: SingleVerticalBore:1-111Height:6Width:K-720Cooling:AirWeight:6LBS1.5H.P.#54500RPMC-60150ccGasEngineCylinder:4VerticalInlineBore:1Stroke:1Cycle:4Weight:6-1/2LBSLength:10Width: :AirLength16Cooling:AirLength:5Width:4L-233Height:6Weight: 4TheBlackKnightc-609SteamEngineBore:11/16Stroke:11/16Length:3Width:3Height:4TheChallengerC-600Bore:1Stroke:1P-305Weight:18LBSLength:12Width:7Height:8C-606Wall15ccGasEngineJ-142Cylinder:SingleVerticalBore:1Stroke:1-1/8Cooling:1Stroke:1-1/4HP:: /4Stroke:1-7/:6Width:6Height:9Weight:4LBS1.75H.P.#65200RPM'''
>>> import re
>>> one_letter_three_numbers = re.compile(r'.\-\d{3}', re.IGNORECASE)
>>> re.findall(one_letter_three_numbers, data)
['C-603', '1-111', 'K-720', 'C-601', 'L-233', 'c-609', 'C-600', 'P-305', 'C-606', 'J-142']
>>> five_continuous = re.compile(r'\d{5}', re.IGNORECASE)
>>> re.findall(five_continuous, data)
['45256', '54500', '60150', '65200']
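Note that the first pattern starts with `.`, which matches any character, so it also returns '1-111', and it returns 'C-601' cut out of 'C-60150'. A stricter variant might look like the sketch below. It assumes a code must start with a letter and must not be followed by another digit, and that a five-digit run must not be part of a longer run; the shortened `sample` string is only for illustration.

```python
import re

# Shortened sample in the same style as the question's data
sample = ("C-603WallWizard45256CCBore:1-111Height:6Width:K-720"
          "Weight:6LBS1.5H.P.#54500RPMC-60150ccGasEngine")

# One ASCII letter, a hyphen, exactly three digits, and no fourth digit after them
code_pattern = re.compile(r"[A-Za-z]-\d{3}(?!\d)")

# Exactly five digits with no digit immediately before or after
five_digit_pattern = re.compile(r"(?<!\d)\d{5}(?!\d)")

codes = code_pattern.findall(sample)
numbers = five_digit_pattern.findall(sample)
print(codes)
print(numbers)
```

Whether 'C-60150' should count as the code 'C-601' or as the number '60150' depends on the data; drop the `(?!\d)` lookahead from the first pattern if the former is wanted.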
