I want to get names from a website in a list.
soup = bs4.BeautifulSoup(page.text, 'html.parser')
tbl = soup.find('ul', class_='static-top-names part1')
for link in tbl:
    names = link.get_text()
    print(names)
So I'm trying to get some names from a website. When I apply the code above and iterate over the result, I get the output below.
John
Mark
Steve and so on.
I want to get rid of the number in the text data and have just the names in list format.
All I want is to get these pure names and hopefully put them in a list. Any help?
If the format is always #. name, then you can do the following:
name.split('. ', 1)[1]
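A minimal sketch of this split-based approach, assuming the scraped strings look like '1. John'. The sample list raw_names is made up for illustration, since the exact page output is not shown in the question:

```python
# Hypothetical sample of scraped text; the real strings would come from link.get_text().
raw_names = ['1. John', '2. Mark', '3. Steve']

# Split once on '. ' and keep everything after the number prefix.
names = [text.split('. ', 1)[1] for text in raw_names]

print(names)
```

Note that the [1] index raises an IndexError for any string that does not contain '. ', so this only fits if the format really is consistent.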
Use a regular expression for consistency.
import re
s = '1.TEST'
print(re.sub(r'^\d+\.\s*', '', s))
This prints TEST only. The pattern matches a leading number of any length followed by a literal dot (the dot must be escaped, otherwise it matches any character) plus any whitespace after it, and replaces the match with an empty string.
Iterate over your original list and apply the same substitution with a list comprehension:
new_list = [re.sub(r'^\d+\.\s*', '', s) for s in original_list]
This should give you the new list as per your requirement.
You can simply split on the '.' character, or on a space if there is a space before the name.
So name.split(' ')[-1] or name.split('.')[-1] would give just the name (add .strip() if a leading space remains). Then you can append those names to a list.
Something like this:
names = [link.get_text().split(' ')[-1] for link in tbl]
This will give you a list of just the names. I used [-1] as the index because your text contains only two items after splitting on a space. If there are more items, use the appropriate index.
I have a dataset. From the column 'Tags' I want to extract, for each row, all the items that contain the word player. It could be repeated or appear on its own in the same cell. Something like this:
'view_snapshot_hi:hab,like_hi:hab,view_snapshot_foinbra,completed_profile,view_page_investors_landing,view_foinbra_inv_step1,view_foinbra_inv_step2,view_foinbra_inv_step3,view_snapshot_acium,player,view_acium_inv_step1,view_acium_inv_step2,view_acium_inv_step3,player_acium-ronda-2_r1,view_foinbra_rinv_step1,view_page_makers_landing'
expected output:
'player,player_acium-ronda-2_r1'
And I need both.
df["Tags"] = df["Tags"].str.ectract(r'*player'*,?\s*')
I tried this but it's not working.
You need to use Series.str.extract keeping in mind that the pattern should contain a capturing group embracing the part you need to extract.
The pattern you need is player[^,]*:
df["Tags"] = df["Tags"].str.extract(r'(player[^,]*)', expand=False)
Passing expand=False returns a Series/Index rather than a DataFrame.
Note that Series.str.extract finds and fetches the first match only. To get all matches, use Series.str.findall, which takes no expand argument. The first line below stores a list of matches per row; the second joins them back into a single comma-separated string:
df["Tags"] = df["Tags"].str.findall(r'player[^,]*')
df["Tags"] = df["Tags"].str.findall(r'player[^,]*').str.join(",")
A simple list comprehension also gives what you want:
words_with_players = [item for item in your_str.split(',') if 'player' in item]
players = ','.join(words_with_players)
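To compare the regex and split approaches outside pandas, here is a small sketch using only the standard library. The tags string is a shortened, hypothetical version of the cell shown in the question:

```python
import re

# Shortened, hypothetical version of one 'Tags' cell.
tags = ('view_snapshot_acium,player,view_acium_inv_step1,'
        'player_acium-ronda-2_r1,view_page_makers_landing')

# Regex version: each match starts at 'player' and stops before the next comma.
by_regex = re.findall(r'player[^,]*', tags)

# Split version: keep whole comma-separated items that contain 'player'.
by_split = [item for item in tags.split(',') if 'player' in item]

print(','.join(by_regex))
print(by_regex == by_split)
```

The two agree on this sample, but they can differ: for an item such as 'multi_player_x' the regex would return only 'player_x', while the split version keeps the whole item.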
I have elements that I've scraped off of a website and when I print them using the following code, they show up neatly as spaced out elements.
print("\n" + time_element)
prints like this
F
4pm-5:50pm
but when I pass time_element into a dataframe as a column and convert it to a string, the output looks like this
# b' \n F\n \n 4pm-5:50pm\n
I am having trouble understanding why it appears so and how to get rid of this "\n" character. I tried using regex to match the "F" and the "4pm-5:50pm" and I thought this way I could separate out the data I need. But using various methods including
# Define the list and the regex pattern to match
time = df['Time']
pattern = '[A-Z]+'
# Filter out all elements that match the pattern
filtered = [x for x in time if re.match(pattern, x)]
print(filtered)
I get back an empty list.
From my research, I understand the "\n" represents a new line and that there might be invisible characters. However, I'm not understanding more about how they behave so I can get rid of them/around them to extract the data that I need.
When I pass the data to csv format, it prints like this all in one cell
F
4pm-5:50pm
but I still end up in the similar place when it comes to separating out the data that I need.
You can call strip() on the text when you extract it from the website to remove the leading and trailing "\n" characters.
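A short sketch of why the regex in the question returns nothing and how the whitespace can be removed. It assumes the value has already been decoded to a regular string; the sample time_element below only resembles the output shown in the question:

```python
import re

# Hypothetical value resembling the scraped text shown in the question.
time_element = " \n F\n \n 4pm-5:50pm\n"

# str.split() with no argument splits on any run of whitespace (spaces, "\n")
# and drops the empty pieces, leaving only the visible tokens.
day, hours = time_element.split()

# re.match only matches at the very start of the string, so the leading
# whitespace makes it fail; re.search scans the whole string instead.
match_result = re.match('[A-Z]+', time_element)
search_result = re.search('[A-Z]+', time_element)

print(day, hours)
print(match_result, search_result.group())
```

Unpacking into day and hours only works when the text holds exactly two tokens; for a DataFrame column, the same split could be applied per row.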
I have the following two different cases of list of strings:
my_list1=['_','net_my_name','_64', '_66']
my_list2=['net_another_file']
I would like to extract
net_my_name as my name in case I have type of lists such as my_list1;
net_another_file as another file in case I have type of lists such as my_list2.
To do so, I was thinking of:
in case I find a situation like that one described by my_list1, then remove elements that are numerical, then split on _ to take the last two items (i.e. my name);
in case I find a situation like that one described by my_list2, then split on _ to take the last two items (i.e. another file).
If I removed numerical values, where they occur, I would have my_name as last word, i.e. my name as last two words.
Expected output:
my name
another file
Can you please tell me how to 'translate' in code the steps above? Thank you
Consider this code:
import re
string = "net_another_file777"
string = re.sub("[0-9]", "", string) # "net_another_file"
L = string.split('_')[-2:] # ['another', 'file']
Now you just have to go through the list and apply this to every element.
Hope this helps you.
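One way to apply this to both kinds of list is sketched below. The helper name last_two_words is hypothetical, and it assumes the wanted element is the first one that still has at least two words after the digits are removed:

```python
import re

def last_two_words(parts):
    """Return the last two '_'-separated words of the first element
    that has at least two non-empty words once digits are removed."""
    for part in parts:
        cleaned = re.sub(r'[0-9]', '', part)           # '_64' becomes '_'
        words = [w for w in cleaned.split('_') if w]   # ignore empty pieces
        if len(words) >= 2:
            return ' '.join(words[-2:])
    return None  # nothing usable in this list

my_list1 = ['_', 'net_my_name', '_64', '_66']
my_list2 = ['net_another_file']

print(last_two_words(my_list1))
print(last_two_words(my_list2))
```

Elements such as '_' or '_64' leave no words after cleaning, so they are skipped without any special handling.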
I am trying to split a string I extract on the first occurrence of a comma. I have tried using the split, but something is wrong, as it doesn't split.
for i in range(len(items)):
    alldata = items[i].getText().encode('utf-8').split(',', 1)
    csvfile.writerow(alldata)
The variable items contains the data I extract from an URL. The output in the CSV file is put in one column. I want it to be on two columns. An example of the data (alldata) I get in the CSV file, looks like this:
['\n\n\n1958\n\n\nGeorge Lees\n']
Using this data as an example, I need the year 1958 to be on one column, and the name George Lees to be on another column instead of the new lines.
EDIT
Forgot to mention what I meant with the commas. The reason why I mentioned the commas is that I also tried splitting on whitespaces. When I did that I got the data:
['1958', 'George', 'Lees']
So what I tried to achieve was to split the data on the first comma occurrence. That's why I did split(',', 1), forgetting that I also need to split on whitespace. So my problem is that I don't know how to split so that the year is in one column and the whole name is in another column. I got
['\n\n\n1958\n\n\nGeorge Lees\n']
When I tried to split with split(',', 1)
You can use strip to remove all whitespace at the start and end, then split on "\n" to get the required output. I have also used filter to remove any empty strings.
Ex:
A = ['\n\n\n1958\n\n\nGeorge Lees\n']
print(list(filter(None, A[0].strip().split("\n"))))
Output:
['1958', 'George Lees']
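To get the two values into separate CSV columns, the cleaned list can be passed straight to writerow. This is only a sketch: split_fields is a hypothetical helper, and io.StringIO stands in for the real output file:

```python
import csv
import io

def split_fields(text):
    """Split scraped text on newlines and drop blank pieces."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Sample taken from the question; io.StringIO stands in for a real file.
scraped = ['\n\n\n1958\n\n\nGeorge Lees\n']
buffer = io.StringIO()
writer = csv.writer(buffer)
for text in scraped:
    # Each list element becomes its own column in the row.
    writer.writerow(split_fields(text))

print(buffer.getvalue())
```

Because the name contains a space but no newline, 'George Lees' stays together in one column.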
I want to remove specific value from a Unicode list i.e field
u'abv,( field),apn,army,elev,fema'
But when I try something like result.remove('(field)'), it stops working and gives an error.
Convert it into a list and use remove:
s = u'abv,( field),apn,army,elev,fema'
res = s.split(",")
res.remove("army") # lets assume we need to remove army
['abv', '( field)', 'apn', 'elev', 'fema']
You can make your output list back to string as well, if you wish
output = ",".join(res)
'abv,( field),apn,elev,fema'
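Note that list.remove raises a ValueError when the value is not in the list, and the item in the string is '( field)' with a space, not '(field)'. That may be the cause of the error in the question, though the traceback is not shown. A sketch that ignores spaces when comparing (remove_item is a hypothetical helper):

```python
def remove_item(csv_text, target):
    """Remove every comma-separated item equal to target, ignoring spaces."""
    def normalize(item):
        return item.replace(' ', '')

    wanted = normalize(target)
    kept = [item for item in csv_text.split(',') if normalize(item) != wanted]
    return ','.join(kept)

s = u'abv,( field),apn,army,elev,fema'
print(remove_item(s, '(field)'))
```

Unlike list.remove, this drops every matching item and does nothing if there is no match.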