I have the following string,
s = {$deletedFields:name:[standardizedSkillUrn,standardizedSkill],entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,25),name:Political Campaigns,$type:com.linkedin.voyager.identity.profile.Skill},{$deletedFields:[standardizedSkillUrn,standardizedSkill],entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,28),name:Politics,$type:com.linkedin.voyager.identity.profile.Skill},name:
{$deletedFields:[standardizedSkillUrn,standardizedSkill],entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,27),name:Political Consulting,$type:com.linkedin.voyager.identity.profile.Skill},
{$deletedFields:[standardizedSkillUrn,standardizedSkill],entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,26),name:Grassroots Organizing,$type:com.linkedin.voyager.identity.profile.Skill},
{$deletedFields:[],profileId:ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,elements:[urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,25),urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,26),urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,27),urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,28)],paging:urn:li:fs_profileView:ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,skillView,paging,$type:com.linkedin.voyager.identity.profile.SkillView,$id:urn:li:fs_profileView:ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,skillView},
{$deletedFields:[]
I want to grab
name:Political Campaigns
name:Politics
name:Political Consulting
name:Grassroots Organizing
name = [Political Campaigns, Politics, Political Consulting, Grassroots Organizing]
The above string is from a file I want to scrape.
Keep in mind that name has many instances in the file.
Is there a way to match fs_skill, then some garbage value, then look for name: near it and grab that string, ending at the comma?
data = [pair[5:] for pair in s.split(',') if pair[:5] == 'name:' and pair[5:6].isalpha()]
Output:
['Political Campaigns', 'Politics', 'Political Consulting', 'Grassroots Organizing']
Try the snippet above; hope this helps.
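If you'd rather anchor on fs_skill explicitly, as the question asks, a regex sketch along these lines should also work (shown on a shortened two-entry version of the string; the pattern skips the parenthesized id after fs_skill: and captures the name: value up to the next comma):

```python
import re

# shortened two-entry version of the string from the question
s = ("{$deletedFields:[standardizedSkillUrn,standardizedSkill],"
     "entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,25),"
     "name:Political Campaigns,$type:com.linkedin.voyager.identity.profile.Skill},"
     "{$deletedFields:[standardizedSkillUrn,standardizedSkill],"
     "entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,28),"
     "name:Politics,$type:com.linkedin.voyager.identity.profile.Skill}")

# skip the parenthesized id after fs_skill:, then capture the name: value
names = re.findall(r'fs_skill:\([^)]*\),name:([^,]+)', s)
# names == ['Political Campaigns', 'Politics']
```

This also ignores the fs_skill URNs inside the elements:[...] list, since those are not immediately followed by name:.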
Example dataframe:
data = pd.DataFrame({'Name': ['Nick', 'Matthew', 'Paul'],
'Text': ["Lived in Norway, England, Spain and Germany with his car",
"Used his bikes in England. Loved his bike",
"Lived in Alaska"]})
Example list:
example_list = ["England", "Bike"]
What I need
I want to create a new column, called x, where, if a term from example_list is found as a string/substring in data.Text (case-insensitive), the word it was found in is added to the new column.
Output
So in row 1, the word England was found and returned, and bike was found and returned, as well as bikes (which contains bike as a substring).
Progress so far:
I have managed - with the following code - to return terms that match regardless of case; however, it won't find substrings... e.g. if I search for "bike" and it finds "bikes", I want it to return "bikes".
import re

pattern = fr'({"|".join(example_list)})'
data['Text'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(", ")
I think I might have found a solution for your pattern there:
pattern = fr'({"|".join("[a-zA-Z]*" + ex + "[a-zA-Z]*" for ex in example_list)})'
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
Basically what I do is extend the pattern by optionally allowing letters before the word (I think you don't explicitly mention this, so maybe this part has to be omitted) and after it.
As an output I get the following:
I'm just not so sure, in which format you want this x-column. In your code you join it via commas (which I followed here) but in the picture you only have a list of the values. If you specify this, I could update my solution.
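For reference, here is the same pattern run with plain re.findall per row (equivalent to the Series.str.findall call, using the example data from the question):

```python
import re

example_list = ["England", "Bike"]
texts = ["Lived in Norway, England, Spain and Germany with his car",
         "Used his bikes in England. Loved his bike",
         "Lived in Alaska"]

# extend each term with optional letters on both sides so substrings match too
pattern = fr'({"|".join("[a-zA-Z]*" + ex + "[a-zA-Z]*" for ex in example_list)})'
x = [",".join(re.findall(pattern, t, flags=re.IGNORECASE)) for t in texts]
# x == ['England', 'bikes,England,bike', '']
```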
I have a text file and a task in Python involving reading it:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have ideas about how it can be improved? (It gives me an error after some words; I guess the error happens because one of the Mr. occurrences is at the end of a line.)
orig_text = open('emma.txt', encoding='UTF-8')
lines = orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print(wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
Try this:
import re

text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run:
from collections import Counter
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
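Putting those pieces together, a minimal sketch (the sample text here is made up):

```python
import re
from collections import Counter

text = "When did Mr. Churchill told Mr. James Brown about the fish. Mr. Churchill smiled."
# x[0] is the full match; [4:] drops the leading 'Mr. '
names = [x[0][4:] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
counts = dict(Counter(names))
# counts == {'Churchill': 2, 'James Brown': 1}
```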
This can easily be done using a regex and a capturing group. In this scenario you might want to do something like:
import re

# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file)  # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
from collections import Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory, take a look at this question
In the 'details' column, every entry has 'Mobile' and 'Email' text inside it. I want to separate out the mobile number and email ID of the corresponding entries into individual columns using Python.
Please help.
Thanks in advance!
You could try something like this -
import pandas as pd

data = pd.read_csv('AIOS_data.csv')
data['Mobile'] = data['Details'].str.extract(r'(Mobile:[\s\S]+?Email:)', expand=False)
data['Mobile'] = data['Mobile'].str.replace(r'Mobile:|Email:', '', regex=True).str.strip()
data['Email'] = data['Details'].str.extract(r'(Email:[\s\S]+)', expand=False)
data['Email'] = data['Email'].str.replace('Email:', '', regex=False).str.strip()
Use Series.str.extract with a regex to capture the values between Mobile and Email: \s* means zero or more spaces, and (.*) captures everything in between:
df[['Mobile','Email']] = df['Details'].str.extract(r'Mobile:\s*(.*)\s+Email:\s*(.*)')
If you also want the address:
cols = ['Address','Mobile','Email']
df[cols] = df['Details'].str.extract(r'Address:\s*(.*)\s*Mobile:\s*(.*)\s+Email:\s*(.*)')
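To see what the capture groups grab, here is the same regex applied with plain re.search to a hypothetical Details value:

```python
import re

# hypothetical 'Details' value
detail = "Address: 12 Main St Mobile: 9789617285 Email: user@example.com"
m = re.search(r'Mobile:\s*(.*)\s+Email:\s*(.*)', detail)
# m.groups() == ('9789617285', 'user@example.com')
```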
Without providing the full code, I guess you have to take three steps:
Read the csv-file into memory. Python has a handy module for that called csv (documentation)
Once you have done this, you can iterate over each row, and search in detail for the mobile number and email address. If detail is always written in the same way, you can just use the str.find() method (documentation) for that.
E.g.
detail = "Address: 108/81-B, METTU STREET, SE...KKAL TAMIL NADU 637409 Mobile: 9789617285 Email: Leens1794#gmail.com"
mobile_start = detail[detail.find("Mobile:")+8:] # => '9789617285 Email: Leens1794#gmail.com'
mobile = mobile_start[:mobile_start.find(' ')] # => '9789617285'
(You do the same for email)
You store the results (mobile and email) in new columns and export them to csv, again using the `csv` module.
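Step two can be sketched as a small helper (the email handling mirrors the mobile slicing; reading and writing the rows themselves would go through the csv module as described above):

```python
def extract_contact(detail):
    # slice off everything after "Mobile: " (7 chars plus the space)
    mobile_start = detail[detail.find('Mobile:') + 8:]
    mobile = mobile_start[:mobile_start.find(' ')]
    # same idea for "Email: " (6 chars plus the space)
    email_start = detail[detail.find('Email:') + 7:]
    email = email_start.split()[0]
    return mobile, email

pair = extract_contact("Address: 108/81-B, METTU STREET, SE...KKAL TAMIL NADU 637409 "
                       "Mobile: 9789617285 Email: Leens1794#gmail.com")
# pair == ('9789617285', 'Leens1794#gmail.com')
```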
I want to extract text from the website and the format is like this:
Avalon
Avondale
Bacon Park Area
How do I just select those 'a' tags with href="#N" because there are several more?
I tried creating a list to iterate through but when I try the code, it selects only one element.
loc = ['#N0', '#N1', '#N2', '#N3', '#N4', '#N5', ..., '#N100']
for i in loc:
    name = soup.find('a', attrs={'href': i})
print(name)
I get
Avalon
not
Avalon
Avondale
<a href="#N4">Bacon Park Area</a>
How do I get just:
Avalon
Avondale
Bacon Park Area
Thanks in advance!
You're iterating over the items, but not putting them anywhere. So when you are done with your loop all that's left in name is the last item.
You can put them in a list like below, and access the .text attribute to get just the name from the tag:
names = []
for i in loc:
    names.append(soup.find('a', attrs={'href': i}).text)
Result:
In [15]: names
Out[15]: ['Bacon Park Area', 'Avondale', 'Avalon']
If you want to skip creating the loc list entirely, you can just do:
import re
names = [tag.text for tag in soup.find_all('a',href=re.compile(r'#N\d+'))]
In a regular expression, the \d means digit and the + means one or more instances of.
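As a rough illustration of that href pattern without BeautifulSoup (stdlib re on a made-up HTML snippet; for real pages, stick with the soup-based version above):

```python
import re

html = ('<a href="#N4">Avalon</a>'
        '<a href="#N5">Avondale</a>'
        '<a href="#N6">Bacon Park Area</a>'
        '<a href="other">Skip me</a>')

# keep only anchors whose href is "#N" followed by one or more digits
names = [m.group(1) for m in re.finditer(r'<a href="#N\d+">([^<]+)</a>', html)]
# names == ['Avalon', 'Avondale', 'Bacon Park Area']
```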
I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
print(repr(item.title[0:-1]))
I include that because, as you can see, item.title prints as a repr() string, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw, venue, date, latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band):
        send venue name + ", Dallas, TX" to google for geocoding
        return lat, long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re

s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # trailing ')' restored
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object:
print(info.group(1))  # or info.groups()[0]
print('"%s","%s","%s"' % tuple(g.strip() for g in info.groups()))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's comes from using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, in which case it uses "s so the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from, but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which just adds the quotes you see when printing.
Your code should look something like this:
import re
import feedparser            # from www.feedparser.org
from geopy import geocoders  # from GeoPy

us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)

lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))

result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
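The regex-to-CSV step on its own can be exercised without the feed or geocoder (titles taken from the question, with the trailing parenthesis restored; coordinates omitted since they come from geocoding):

```python
import re

titles = ['randy travis (Billy Bobs 3/21)',
          'Michael Schenker Group (House of Blues Dallas 3/26)']

rows = []
for title in titles:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', title)
    if m:
        band, venue, date = m.groups()
        rows.append(','.join([band, venue, date]))
# rows == ['randy travis,Billy Bobs,3/21',
#          'Michael Schenker Group,House of Blues Dallas,3/26']
```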