This question already has answers here:
Python: Get URL path sections
(7 answers)
Closed 5 months ago.
I have a string-type URL: https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing
I want only the part after "d/" and before "/view", so from the above string I only want 1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82. How can I do this in Python, with regex or any other function? Right now I am using this:
a = df['image']
for i in a:
    print(type(i))
    res = re.findall('/d/(.*)/view', i)
    print(res)
but I am getting an empty list for res every time it prints.
I guess /d/(.*)/view is the regex you are looking for. re.findall('/d/(.*)/view', url)[0] works with your example.
import re

s = "--https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing"
res = re.findall('/d/(.*)/view', s)
if len(res) == 0:
    print('bad url')
else:
    print(res[0])
ADDITION (example using dataframes)
import pandas as pd
import re
df = pd.DataFrame(
    [["url1", "--https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing"],
     ["url2", "--https://drive.google.com/file/d/second/one/view?usp=sharing"],
     ["noUrl", "anotherstring"]],
    columns=['name', 'image'])
Then
for i in df['image']:
    res = re.findall('/d/(.*)/view', i)
    print(res)
output
['1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82']
['second/one']
[]
as expected: a length-1 list whose res[0] element is the pattern you want when there is a match, or an empty list when there is not.
If your url contains more than one /d/.../view pattern you may even get a list longer than 1, but an empty list only happens when the string has no .../d/.../view... form at all.
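As a side note, if you are working with the whole column at once, pandas can apply the same regex without an explicit Python loop via Series.str.extract. A minimal sketch, assuming the same df as above (the column name file_id is just an example):
df['file_id'] = df['image'].str.extract(r'/d/(.*)/view', expand=False)
print(df)
Rows without a match get NaN in file_id, which plays the same role as the empty list in the loop version.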
You can use the str.split method.
First split the string on "/d/", then take the second part and split it on "/view". The first part of that new string is the id you want.
html.split("/d/")[1].split("/view")[0]
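A minimal, self-contained version of that approach (using url as the variable name, since html is not defined above):
url = "https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing"
file_id = url.split("/d/")[1].split("/view")[0]
print(file_id)  # 1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82
Note that split("/d/")[1] raises an IndexError if the string contains no "/d/", so you may want to guard it when looping over a whole column.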
If your string is always going to look like this, I would break the string apart using the "/" as a delimiter.
s = "https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing"
s = s.split('/')[5]
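If the position of the id within the path can shift, a slightly more defensive sketch is to parse the URL first and take the segment that follows 'd' (this assumes the id always comes right after a d path segment):
from urllib.parse import urlparse

s = "https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing"
parts = urlparse(s).path.split('/')    # ['', 'file', 'd', '<id>', 'view']
file_id = parts[parts.index('d') + 1]  # segment right after 'd'
print(file_id)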
Related
I have a column containing strings that are made up of different words but always have a similar structure. E.g.:
2cm off ORDER AGAIN (191 1141)
I want to extract the sub-string that starts after the second space and ends at the space before the opening bracket/parenthesis. So in this example I want to extract ORDER AGAIN.
Is this possible?
You could use str.extract here:
df["out"] = df["col"].str.extract(r'^\w+ \w+ (.*?)(?: \(|$)')
Note that this answer is robust even if the string doesn't have a (...) term at the end.
Here is a demo showing that the regex logic is working.
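A minimal, runnable version of that demo (the column name col is assumed, matching the snippet above; the second row is made up to show the no-parenthesis case):
import pandas as pd

df = pd.DataFrame({"col": ["2cm off ORDER AGAIN (191 1141)", "1mm on SOMETHING ELSE"]})
df["out"] = df["col"].str.extract(r'^\w+ \w+ (.*?)(?: \(|$)')
print(df["out"].tolist())  # ['ORDER AGAIN', 'SOMETHING ELSE']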
You can try the following:
r"2cm off ORDER AGAIN (191 1141)".split(r"(")[0].split(" ", maxsplit=2)[-1].strip()
#Out[3]: 'ORDER AGAIN'
If the pattern of data is similar to what you have posted then I think the below code snippet should work for you:
import re
data = "2cm off ORDER AGAIN (191 1141)"
extr = re.match(r".*?\s.*?\s(.*)\s\(.*", data)
if extr:
    print(extr.group(1))
You can try the following code
s = '2cm off ORDER AGAIN (191 1141)'
second_space = s.find(' ', s.find(' ') + 1)
openparenthesis = s.find('(')
substring = s[second_space + 1 : openparenthesis].strip()
print(substring)  # ORDER AGAIN
This question already has answers here:
Remove char at specific index - python
(8 answers)
Closed 3 years ago.
I am new to Python and I'm currently just messing around a little bit... but I am stuck on this one problem. I am trying to remove certain indexes from a string with a for loop. My idea was something like this:
Text="Gvusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas"
for i in range(0,7):
    Text = Text.replace(Text[i], "")
print(Text)
But it only removes one character at a time and seems to restore the ones that were already replaced, for example:
1.loop: vusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas
2.loop: Gusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas
Surely, there are many ways to get the desired result. Below is a good one based on your logic (using a for loop). Since you are replacing a character with an empty string, it is better to remove the desired character directly.
In the code, I have turned the text into a list for easy handling. After that, I have removed all the characters at the given indexes. Finally, a join operation yields the desired text.
text = "Gvusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas"
text_list = list(text)
for index in range(0, 7):
    text_list.remove(text[index])
text = ''.join(text_list)
print(text) # dsz8audvbsauzdgsavuczisagbcsuzaicbhas
If I understand what you are trying to do, you want to omit the letters of Text at indexes 0, 1, 2, 3, 4, 5 and 6, but your code doesn't do that. In the first iteration it takes the first letter, G, and removes every occurrence of it from the string (there is only one). In the next iteration Text[1] is u, because Text is now vusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas after G was removed; since there are five occurrences of u, Text becomes vsaibdsz8advbsazdgsavczisagbcszaicbhas, and so on.
You can put print(Text) inside the for loop and watch the results:
>>> Text="Gvusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas"
>>> for i in range(0,7):
...     Text = Text.replace(Text[i], "")
...     print(Text)
...
vusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas
vsaibdsz8advbsazdgsavczisagbcszaicbhas
vsibdsz8dvbszdgsvczisgbcszicbhs
vsidsz8dvszdgsvczisgcszichs
vidz8dvzdgvczigczich
viz8vzgvczigczich
viz8vzvcziczich
In Python you can do that without a loop; the best way is to use slicing, as follows:
Text = Text[7:]
This will give you Text equal to dsz8audvbsauzdgsavuczisagbcsuzaicbhas.
If your goal is to reach this through a loop (supposing you need Text at every iteration), you can try this:
>>> Text="Gvusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas"
>>> for i in range(0,7):
...     Text = Text[1:]
...     print(Text)
...
vusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas
usaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas
saibdsz8audvbsauzdgsavuczisagbcsuzaicbhas
aibdsz8audvbsauzdgsavuczisagbcsuzaicbhas
ibdsz8audvbsauzdgsavuczisagbcsuzaicbhas
bdsz8audvbsauzdgsavuczisagbcsuzaicbhas
dsz8audvbsauzdgsavuczisagbcsuzaicbhas
I hope this will help!
Take a look at string slicing:
text = "Gvusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas"
newtext = text[1:]
print(text)     # --> "Gvusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas"
print(newtext)  # --> "vusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas"
You can replace
Text=Text.replace(Text[i], "")
with
Text = Text[:i] + Text[i+1:]
which uses slicing to rebuild a string without the specific index.
Side note: variable names should be lower case.
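If the indexes you want to drop are not a contiguous block at the start, a small sketch of the same idea using enumerate (the index set {0, 3, 5} is only an example):
text = "Gvusaibdsz8audvbsauzdgsavuczisagbcsuzaicbhas"
drop = {0, 3, 5}  # positions to remove
result = "".join(ch for i, ch in enumerate(text) if i not in drop)
print(result)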
This question already has answers here:
strange behavior of parenthesis in python regex
(3 answers)
Closed 4 years ago.
I am trying to extract an ID from a string with python3. The regex returns more than one item, despite there being only one in the text:
text_total = 'Lore Ippsum Ref. 116519LN Perlmutt'
>>> re.findall(r"Ref\.? ?(([A-Z\d\.]+)|([\d.]+))", text_total)
[('116519LN', '116519LN', '')]
I am looking for a single trimmed result, ideally not wrapped in a list.
That's why my original line is:
[x for x in re.findall(r"Ref\.? ?(([A-Z\d\.]+)|([\d.]+))", text_total)][0]
The regex has an OR as I am also trying to match
Lore Ippsum Ref. 1166AB.39AZU2.123 Lore Ippsum
How can I retrieve just one result from the text and match both conditions?
The groups inside your OR group, so to speak, are "capturing groups". You need to make them non-capturing using the ?: syntax inside those groups, and let the outer group stay a capturing group.
import re
text_total = 'Lore Ippsum Ref. 116519LN Perlmutt'
re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text_total)
#result ['116519LN']
Note that this still gets you multiple matches if there are many. You can use re.search for just first match.
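For example, a minimal sketch with re.search, which stops at the first match:
import re

text_total = 'Lore Ippsum Ref. 116519LN Perlmutt'
m = re.search(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text_total)
if m:
    print(m.group(1))  # 116519LN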
You don't necessarily need an OR; you can do Ref\.? ?([a-zA-Z.0-9]+) (note the space at the end of the regex: it is used as the end of the match; also note the character class does not include a space, so the match cannot run past the id into the following words).
import re
pattern = r"Ref\.? ?([a-zA-Z.0-9]+) "
text_total = "Lore Ippsum Ref. 116519LN Perlmutt"
results = re.findall(pattern, text_total)
print(results[0])
This question already has answers here:
Reading csv containing a list in Pandas
(4 answers)
Pandas DataFrame stored list as string: How to convert back to list
(9 answers)
Closed 1 year ago.
I have this 'file.csv' file to read with pandas:
Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"
using
df = pd.read_csv('file.csv', sep='|')
the output is:
Title Tags
0 T1 [Tag1,Tag2]
1 T1 [Tag1,Tag2,Tag3]
2 T2 [Tag3,Tag1]
I know that the column Tags is a full string, since:
In [64]: df['Tags'][0][0]
Out[64]: '['
I need to read it as a list of strings like ["Tag1","Tag2"]. I tried the solution provided in this question but no luck there, since the [ and ] characters actually mess things up.
The expecting output should be:
In [64]: df['Tags'][0][0]
Out[64]: 'Tag1'
You can split the string manually:
>>> df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))
>>> df.Tags[0]
['Tag1', 'Tag2']
You could use the built-in ast.literal_eval; it works for tuples as well as lists.
import ast
import pandas as pd
df = pd.DataFrame({"mytuples": ["(1,2,3)"]})
print(df.iloc[0,0])
# >> '(1,2,3)'
df["mytuples"] = df["mytuples"].apply(ast.literal_eval)
print(df.iloc[0,0])
# >> (1,2,3)
EDIT: eval should be avoided! If the string being evaluated is os.system('rm -rf /') it will start deleting all the files on your computer (here). For ast.literal_eval the string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None (here). Thanks @TrentonMcKinney :)
Or
df.Tags=df.Tags.str[1:-1].str.split(',').tolist()
I think you could use the json module.
import json
import pandas as pd
df = pd.read_csv('file.csv', sep='|')
df['Tags'] = df['Tags'].apply(lambda x: json.loads(x))
So this will load your dataframe as before, then apply a lambda function to each of the items in the Tags column. The lambda function calls json.loads(), which converts the string representation of the list to an actual list. Note that this only works if the cell is valid JSON, i.e. the tags are quoted (e.g. ["Tag1","Tag2"]); with unquoted tags such as [Tag1,Tag2] it raises a JSONDecodeError, so one of the split-based approaches is safer for the data shown in the question.
You can convert the string to a list using strip and split.
df_out = df.assign(Tags=df.Tags.str.strip('[]').str.split(','))
df_out.Tags[0][0]
Output:
'Tag1'
Your df['Tags'] is a column of strings. If you print that column you should get ["[Tag1,Tag2]", "[Tag1,Tag2,Tag3]", "[Tag3,Tag1]"], which is why, when you take the first element of the first element, you actually get the first character of the string rather than what you want.
You need to parse that string afterwards. Doing something like
df['Tags'][0] = df['Tags'][0].split(',')
will give you a list, but as you saw in the example you cited, its first element still contains the bracket:
in:  df['Tags'][0][0]
out: '[Tag1'
What you need is a way to parse the string editing out multiple characters. You can use a simple regex expression to do this. Something like:
import re
df['Tags'][0] = re.findall(r"[\w']+", df['Tags'][0])
print(df['Tags'][0][0])
will print:
'Tag1'
Using the other answer involving Pandas converters you might write a converter like this:
def clean(seq_string):
    return re.findall(r"[\w']+", seq_string)
If you don't know regex, they can be quite powerful, but also unpredictable if you're not sure of the content of your input strings. The expression used here, r"[\w']+", matches runs of word characters (alphanumerics and underscores, plus the apostrophe) and treats everything else as a boundary for re.findall to split on.
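If it helps, this is roughly how such a converter could be wired into read_csv, so the parsing happens at load time (a sketch, assuming the same file.csv and | separator as in the question):
import re
import pandas as pd

def clean(seq_string):
    return re.findall(r"[\w']+", seq_string)

df = pd.read_csv('file.csv', sep='|', converters={'Tags': clean})
print(df['Tags'][0][0])  # Tag1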
This question already has answers here:
How to find and replace nth occurrence of word in a sentence using python regular expression?
(9 answers)
Closed 6 years ago.
I'm looking to remove a ',' (comma) from a string, but only the second time the comma occurs as it needs to be in the correct format for reverse geocoding...
As an example I have the following string in python:
43,14,3085
How would I convert it to the following format:
43,143085
I have tried using regex and str.split() but have not achieved the result yet.
If you're sure that string only contains two commas and you want to remove the last one you can use rsplit with join:
>>> s = '43,14,3085'
>>> ''.join(s.rsplit(',', 1))
'43,143085'
In the above, rsplit splits starting from the end, at most the number of times given as the second parameter:
>>> parts = s.rsplit(',', 1)
>>> parts
['43,14', '3085']
Then join is used to combine the parts together:
>>> ''.join(parts)
'43,143085'
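More generally, if you ever need to drop the nth comma rather than the last one, the same split/join idea can be parameterized (a sketch; n is 1-based and the string is assumed to contain at least n commas):
def remove_nth_comma(s, n):
    parts = s.split(',', n)  # split on at most n commas
    return ','.join(parts[:n]) + parts[n]

print(remove_nth_comma('43,14,3085', 2))  # 43,143085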
What about something like:
i = s.find(',')
s[:i] + ',' + s[i+1:].replace(",", "")
This will get rid of all your commas except the first one:
string = '43,14,3085'
splited = string.split(',')
string = ",".join(splited[0:2])
string += "".join(splited[2:])
print(string)