I have a list with address information.
The items in the list can appear in any order.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of the list into its own string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but there is more than one pattern for "region". For example, there can be "South reg." or "South r-n".
How can I combine multiple patterns?
The item 4 in the list is the building number. It can be digits only, or something like 4k1.
How can I extract the building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to take the first word, but if you know the set of accepted regions, it would be better to build the regex from those valid values rather than from the first word.
Also, for the building extraction, I am not sure of which are the characters you want to keep, versus the ones which you may want to remove. In this case I chose to keep only alphanumeric, meaning that everything else would be stripped.
CODE
import re
list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']
def GetFirstWord(list2,column):
    return re.search(r'\w+', list2[column].strip()).group()
def KeepAlpha(list2,column):
    return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())
print(GetFirstWord(list1,0))
print(KeepAlpha(list1,2))
OUTPUT
South
4k1
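On the question of combining patterns: a minimal sketch, assuming the three spellings quoted in the question ("region", "reg.", "r-n") are the only ones that occur, and that a building number is an item that starts with digits and contains only letters and digits. The names region_re and building_re are made up for illustration.

```python
import re

address = [' South region', ' district KTS', ' 4k1', ' app. 106', ' ent. 1', ' st. 15']

# Alternation with | combines several spellings into one pattern.
# Assumption: the region marker is the last word of the item.
region_re = re.compile(r'\b(?:region|reg\.|r-n)\s*$')

# Assumption: a building number is digits optionally followed by letters/digits (4, 4k1).
building_re = re.compile(r'\d+[A-Za-z0-9]*')

regions = [a.strip() for a in address if region_re.search(a)]
buildings = [a.strip() for a in address if building_re.fullmatch(a.strip())]

print(regions)
print(buildings)
```

search is used for the region because the marker sits at the end of the item, and fullmatch for the building so that items such as ' app. 106' are not picked up just because they contain digits.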
I have a dataset that has a "tags" column in which each row is a list of tags. For example, the first entry looks something like this
df['tags'][0]
result = "[' Leisure Trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"
I have been able to remove the trailing whitespace from all elements and only the leading whitespace from the first element (so I get something like the below).
['Leisure trip', ' Couple', ' Duplex Double Room', ' Stayed 6 nights']
Does anyone know how to remove the leading whitespace from all but the first element in these lists? They are not of uniform length or anything. Below is the code I have used to get the final result above:
clean_tags_list = []
for item in reviews['Tags']:
    string = item.replace("[", "")
    string2 = string.replace("'", "")
    string3 = string2.replace("]", "")
    string4 = string3.replace(",", "")
    string5 = string4.strip()
    string6 = string5.lstrip()
    #clean_tags_list.append(string4.split(" "))
    clean_tags_list.append(string6.split(" "))
clean_tags_list[0]
['Leisure trip', ' Couple', ' Duplex Double Room', ' Stayed 6 nights']
IIUC you want to apply strip for the first element and right strip for the other ones. Then, first convert your 'string list' to an actual list with ast.literal_eval and apply strip and rstrip:
from ast import literal_eval
df.tags.agg(literal_eval).apply(lambda x: [item.strip() if i == 0 else item.rstrip() for i, item in enumerate(x)])
If I understand correctly, you can use the code below :
import pandas as pd
df = pd.DataFrame({'tags': [[' Leisure Trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']]})
df['tags'] = df['tags'].apply(lambda x: [x[0].strip()] + [e.rstrip() for e in x[1:]])
>>> print(df)
I was also able to figure it out with the below code. (I know that this isn't very efficient but it worked).
will_clean_tag_list = []
for row in clean_tags_list:
    for col in range(len(row)):
        row[col] = row[col].strip()
    will_clean_tag_list.append(row)
Thank you all for the insight! This has been my first post and I really appreciate the help.
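For reference, a more compact sketch that strips every element. It assumes each row is stored as the string form of a list, as in the df['tags'][0] example above; the variable raw is a made-up stand-in for one such row.

```python
from ast import literal_eval

# Hypothetical sample row, in the same string form as df['tags'][0] in the question.
raw = "[' Leisure Trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"

# literal_eval turns the string into a real list; strip() removes whitespace on both sides.
tags = [tag.strip() for tag in literal_eval(raw)]
print(tags)
```

With pandas, the same expression could be wrapped in a function and passed to df['tags'].apply.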
['ISBN: 9789353765170', 'Pages: 64', 'Size: 294 x 219', 'Language: English', 'Book Binding: Paperback', 'Weight: 350 gm.']
I split each item with l.split(":") to get a nested list, but that isn't helping:
[['ISBN', ' 9789353765170'], ['Pages', ' 64'], ['Size', ' 294 x 219'], ['Language', ' English'], ['Book Binding', ' Paperback'], ['Weight', ' 350 gm.']]
I have a list that looks like the one above. I want to put each attribute's values into a separate list so that I can create a dataframe from them. The problem is that some of the lists also include a "format" attribute.
How do I pick out these elements into their own lists, and fill in "Not available" for attributes that are missing, in Python?
IIUC use:
import pandas as pd
L = ['ISBN: 9789353765170', 'Pages: 64', 'Size: 294 x 219',
     'Language: English', 'Book Binding: Paperback', 'Weight: 350 gm.']
df = pd.DataFrame((x.split(': ') for x in L), columns=['a','b'])
print (df)
              a              b
0          ISBN  9789353765170
1         Pages             64
2          Size      294 x 219
3      Language        English
4  Book Binding      Paperback
5        Weight        350 gm.
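For the "Not available" part of the question, a minimal sketch: it assumes every entry has the form "Name: value", and it uses two made-up books (the second one, including its ISBN and "Format" value, is purely hypothetical) plus a column list chosen by hand.

```python
# Hypothetical input: one list of "Name: value" strings per book.
books = [
    ['ISBN: 9789353765170', 'Pages: 64', 'Size: 294 x 219', 'Language: English',
     'Book Binding: Paperback', 'Weight: 350 gm.'],
    ['ISBN: 0000000000000', 'Format: Hardcover', 'Pages: 120'],  # made-up second book
]

# Assumption: these are all the attributes that should become columns.
columns = ['ISBN', 'Pages', 'Size', 'Language', 'Book Binding', 'Weight', 'Format']

rows = []
for book in books:
    # Split only on the first ': ' so that values containing ': ' stay intact.
    record = dict(entry.split(': ', 1) for entry in book)
    # Fill in a placeholder for every attribute this book does not have.
    rows.append({col: record.get(col, 'Not available') for col in columns})

print(rows[1])
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows), which gives one row per book and one column per attribute.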
I'm just starting to get to grips with Scrapy. So far, I've figured out how to extract the relevant sections of a web page and to crawl through web pages.
However, I'm still unsure as to how one can format the results in a meaningful tabular format.
When the scraped data is in a table format, it's straightforward enough. However, sometimes the data isn't. e.g. this link
I can access the names using
response.xpath('//div[@align="center"]//h3').extract()
Then I can access the details using
response.xpath('//div[@align="center"]//p').extract()
Now, I need to format the data like this, so I can save it to a CSV file.
Name: J Speirs Farms Ltd
Herd Prefix: Pepperstock
Membership No. 7580
Dept. Herd Mark: UK244821
Membership Type: Youth
Year Joined: 2006
Address: Pepsal End Farm, Pepperstock, Luton, Beds
Postcode: LU1 4LH
Region: East Midlands
Telephone: 01582450962
Email:
Website:
Ideally, I'd like to define the structure of the data, then populate it according to the scraped data, because in some cases certain fields are not available, e.g. Email: and Website:
I don't need the answer, but would appreciate it if someone could point me in the right direction.
All of the data seem to be separated by newlines, so simply use str.splitlines():
> names = response.xpath('//div[@align="center"]//a[@name]')
> details = names[0].xpath('following-sibling::p[1]/text()').extract_first().splitlines()
['J Speirs Farms Ltd ', 'Herd Prefix: Pepperstock ', 'Membership No. 7580 ', 'Dept. Herd Mark: UK244821 ', 'Membership Type: Youth ', 'Year Joined: 2006 ', 'Address: Pepsal End Farm ', ' Pepperstock ', ' Luton ', ' Beds ', 'Postcode: LU1 4LH ', 'Region: East Midlands ', 'Telephone: 01582450962 ']
> name = names[0].xpath('@name').extract_first()
'J+Speirs+Farms+Ltd+++'
Now you just need to figure out how to parse those bits into a clean format:
Some values (here the address) are split across multiple lines, but you can repair the list by checking whether each member contains ':' or 'No.'; if it does not, it belongs to the preceding member that does:
clean_details = [f'Name: {details[0]}']
# first item is name, skip
for d in details[1:]:
    if ':' in d or 'No.' in d:
        clean_details.append(d)
    else:
        clean_details[-1] += d
Finally parse the cleaned up details list we have:
item = {}
for detail in clean_details:
    values = detail.split(':')
    if len(values) < 2:  # e.g. Membership No.
        values = detail.split('No.')
    if len(values) == 2:  # e.g. telephone: 1337
        label, text = values
        item[label] = text.strip()
>>> pprint(item)
{'Address': 'Pepsal End Farm Pepperstock Luton Beds',
'Dept. Herd Mark': 'UK244821',
'Herd Prefix': 'Pepperstock',
'Membership ': '7580',
'Membership Type': 'Youth',
'Name': 'J Speirs Farms Ltd',
'Postcode': 'LU1 4LH',
'Region': 'East Midlands',
'Telephone': '01582450962',
'Year Joined': '2006'}
You can define a class for the items you want to save and import the class to your spider. Then you can directly save the items.
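One way to pin the structure down up front, sketched here with the standard library's csv module rather than a Scrapy item class. The FIELDS list is an assumption taken from the layout in the question, and the item dict is a made-up, partly filled record.

```python
import csv
import io

# Assumption: column names and order follow the desired layout in the question.
# Note that the parsing loop shown earlier yields the key 'Membership ', not 'Membership No.'.
FIELDS = ['Name', 'Herd Prefix', 'Membership No.', 'Dept. Herd Mark', 'Membership Type',
          'Year Joined', 'Address', 'Postcode', 'Region', 'Telephone', 'Email', 'Website']

# Hypothetical parsed record in which several fields are missing.
item = {'Name': 'J Speirs Farms Ltd', 'Postcode': 'LU1 4LH', 'Telephone': '01582450962'}

buffer = io.StringIO()  # stands in for an open CSV file
# restval is written for every field that the row dict does not contain.
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval='')
writer.writeheader()
writer.writerow(item)

csv_text = buffer.getvalue()
print(csv_text)
```

Because the field list is fixed, every row gets the same columns, and missing fields such as Email and Website simply come out empty.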
I have a list within a list, and I am trying to iterate through one list, and then in the inner list I want to search for a value, and if this value is present, place that list in a variable.
Here's what I have, which doesn't seem to be doing the job:
for z, g in range(len(tablerows), len(andrewlist)):
    tablerowslist = tablerows[z]
    if "Andrew Alexander" in tablerowslist:
        andrewlist[g] = tablerowslist
Any ideas?
This is the list structure:
[['Kyle Bazzy', 'FUP dropbox message', '8/18/2011', 'Swing Trade Stocks</a>', ' ', 'Affiliate blog'], ['Kyle Bazzy', 'FUP dropbox message', '8/18/2011', 'Swing Trade Software</a>', ' ', 'FUP from dropbox message. Affiliate blog'], ['Kyle Bazzy', 'FUP dropbox message', '8/18/2011', 'Start Day Trading (Blog)</a>', ' ', 'FUP from dropbox message'], ['Kyle Bazzy', 'Call, be VERY NICE', '8/18/2011', ' ', 'r24867</a>', 'We have been very nice to him, but he wants to cancel, we need to keep being nice and seeing what is wrong now.'], ['Jason Raznick', 'Reach out', '8/18/2011', 'Lexis Nexis</a>', ' ', '-'], ['Andrew Alexander', 'Check on account in one week', '8/18/2011', ' ', 'r46876</a>', '-'], ['Andrew Alexander', 'Cancel him from 5 dollar feed', '8/18/2011', ' ', 'r37693</a>', '-'], ['Aaron Wise', 'FUP with contract', '8/18/2011', 'YouTradeFX</a>', ' ', "Zisa is on vacation...FUP next week and then try again if she's still gone."], ['Aaron Wise', 'Email--JASON', '8/18/2011', 'Lexis Nexis</a>', ' ', 'email by today'], ['Sarah Knapp', '3rd FUP', '8/18/2011', 'Steven L. Pomeranz</a>', ' ', '-'], ['Sarah Knapp', 'Are we really interested in partnering?', '8/18/2011', 'Reverse Spins</a>', ' ', "V. political, doesn't seem like high quality content. Do we really want a partnership?"], ['Sarah Knapp', '2nd follow up', '8/18/2011', 'Business World</a>', ' ', '-'], ['Sarah Knapp', 'Determine whether we are actually interested in partnership', '8/18/2011', 'Fayrouz In Dallas</a>', ' ', "Hasn't updated since September 2010."], ['Sarah Knapp', 'See email exchange w/Autumn; what should happen', '8/18/2011', 'Graham and Doddsville</a>', ' ', "Wasn't sure if we could partner bc of regulations, but could do something meant simply to increase traffic both ways."], ['Sarah Knapp', '3rd follow up', '8/18/2011', 'Fund Action</a>', ' ', '-']]
For any inner list that contains a particular value, say, Andrew Alexander, I want to collect these into a separate list.
For example:
[['Andrew Alexander', 'Check on account in one week', '8/18/2011', ' ', 'r46876</a>', '-'], ['Andrew Alexander', 'Cancel him from 5 dollar feed', '8/18/2011', ' ', 'r37693</a>', '-']]
Assuming you have a list whose elements are lists, this is what I'd do:
andrewlist = [row for row in tablerows if "Andrew Alexander" in row]
>>> #I have a list within a list,
>>> lol = [[1, 2, 42, 3], [4, 5, 6], [7, 42, 8]]
>>> found = []
>>> #iterate through one list,
>>> for i in lol:
...     #in the inner list I want to search for a value
...     if 42 in i:
...         #if this value is present, place that list in a variable
...         found.append(i)
...
>>> found
[[1, 2, 42, 3], [7, 42, 8]]
for z, g in range(len(tablerows), len(andrewlist)):
This means "make a list of the numbers which are between the length of tablerows and the length of andrewlist, and then look at each of those numbers in turn, and treat those numbers as a list of two values, and assign the two values to z and g each time through the loop".
A number cannot be treated as a list of two values, so this fails.
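A small sketch of that failure, using made-up contents for tablerows and andrewlist (the real ones are not shown in the question):

```python
# Hypothetical data: 3 rows, and a 5-element placeholder list.
tablerows = [['Andrew Alexander', 'a'], ['Kyle Bazzy', 'b'], ['Sarah Knapp', 'c']]
andrewlist = [None] * 5

error = None
try:
    # range(3, 5) yields the plain integers 3 and 4; each one is then
    # unpacked into z and g, which is not possible for an int.
    for z, g in range(len(tablerows), len(andrewlist)):
        pass
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

If len(tablerows) is not smaller than len(andrewlist), for instance when andrewlist starts out empty, the range is empty and the loop body silently never runs, so no error appears at all.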
You need to be much, much clearer about what you are doing. Show an example of the contents of tablerows before the loop, and the contents of andrewlist before the loop, and what it should look like afterwards. Your description is muddled: I can only guess that when you say "and then I want to iterate through one list" you mean one of the lists in your list-of-lists; but I can't tell whether you want one specific one, or each one in turn. And then when you next say "and then in the inner list I want to...", I have no idea what you're referring to.