Scrape data from multiple categories


I am scraping a product at softsurroundings.com.
This is the product link: https://www.softsurroundings.com/p/estelle-dress/
This product has 3 size categories: ["Misses","Petites","Women"]. Each size category has further sizes, i.e.
for "Misses" we have ["XS","S","M","L","XL"]
for "Petites" we have ["PXS","PS","PM","PL","PXL"]
for "Women" we have ["1X","2X","3X"]
I am confused about which CSS selector to use to get the sizes of all three categories.
I only get the sizes of the Misses category, because only the Misses category is shown when the website loads.
The current code I have is:
raw_skus = []
for sku_sel in response.css('.dtlFormBulk.flexItem .size[class="box size"]'):
    sku = {
        'sku_id': sku_sel.css('.size ::attr(id)').get(),
        'size': sku_sel.css('.size ::text').get()
    }
    raw_skus.append(sku)
return raw_skus
The above code returns:
[
{'sku_id': 'size_501', 'size': 'XS'},
{'sku_id': 'size_601', 'size': 'S'},
{'sku_id': 'size_701', 'size': 'M'},
{'sku_id': 'size_801', 'size': 'L'},
{'sku_id': 'size_901', 'size': 'XL'}
]
I am only getting sizes from the Misses category. I need the sizes from the other two categories appended to the list too.
Please help.

Related

How to check a substring in a string up until a certain occurrence of a character in python?

I receive orders that contain the name of customer, price of order and the quantity of the order.
The format of the order looks like this: {'first order': ['Alex', '100#2']}
(100 refers to a price and 2 refers to a quantity).
So I have different orders: {'first order': ['Alex', '99#2'], 'second order': ['Ann', '101#2'], 'third order': ['Nick', '110#3']}
We need to compare the prices and see which is the highest price and which is the lowest.
I was thinking of doing this by cutting the substring into two parts, the first part before the '#' symbol and the second part after the '#' symbol, then extract the numbers from the first part and compare them with others.
What's the most efficient way you can tell to solve this issue?
Thank you

I'd suggest transforming the dictionary into a list of dictionaries and converting the price and quantity in each string to floats. For example:
orders = {
    "first order": ["Alex", "99#2"],
    "second order": ["Ann", "101#2"],
    "third order": ["Nick", "110#3"],
}
orders = [
    {
        "order": k,
        "name": name,
        "price": float(pq.split("#")[0]),
        "quantity": float(pq.split("#")[1]),
    }
    for k, (name, pq) in orders.items()
]
Then, to find the highest and lowest price, you can use the built-in max() and min() functions with a key:
highest = max(orders, key=lambda k: k["price"])
lowest = min(orders, key=lambda k: k["price"])
print(highest)
print(lowest)
Prints:
{'order': 'third order', 'name': 'Nick', 'price': 110.0, 'quantity': 3.0}
{'order': 'first order', 'name': 'Alex', 'price': 99.0, 'quantity': 2.0}
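If you would rather keep the original dictionary untouched, here is a minimal sketch of a variation. It uses str.partition, which splits at the first '#' only, and assumes the quantity is always a whole number. The helper names parse_order and price_of are made up for this example.

```python
orders = {
    "first order": ["Alex", "99#2"],
    "second order": ["Ann", "101#2"],
    "third order": ["Nick", "110#3"],
}


def parse_order(value):
    """Split 'price#quantity' at the first '#' and convert both parts."""
    price, _, quantity = value.partition("#")
    return float(price), int(quantity)


def price_of(item):
    """Key function: price of an (order_name, [customer, 'price#quantity']) pair."""
    _, (_, value) = item
    return parse_order(value)[0]


# max()/min() walk the items once each and compare only the parsed price.
highest = max(orders.items(), key=price_of)
lowest = min(orders.items(), key=price_of)

print(highest)
print(lowest)
```

Here highest and lowest are (order name, [customer, 'price#quantity']) pairs taken straight from the original dictionary.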

Returning an empty dictionary for the datasource of a datatable in plotly/dash

My callback function reads a value selected by the user (site name), then queries data for that site and returns 3 figures and one list of records (df.to_dict('records')) to supply the data for a datatable.
If the user selects a site for which there is no data, I return {}. That seems to break it. If I select a site, the data table fills in properly, switch to another site, same thing. Once I select a site with no data, the data table will no longer update, no matter which site I select.
Some relevant code:
The output is defined as:
Output('emission_table','data'),
The return from the callback is:
return time_series_figure,emissions_df.to_dict('records'),site_map,hotspot_figure
The table is defined in the layout as:
html.Div(style={'float':'left','padding':'5px','width':'49%'}, children = [
    dash_table.DataTable(id='emission_table', data=[],columns=[
        # {'id': "site", 'name': "Site"},
        {'id': "dateformatted", 'name': "date"},
        {'id': "device", 'name': "device"},
        {'id': "emission", 'name': "Emission"},
        {'id': "methane", 'name': "CH4"},
        {'id': "wdir", 'name': "WDIR"},
        {'id': "wspd", 'name': "WSPD"},
        {'id': "wd_std", 'name': "WVAR"}],
        # {'id': "url", 'name':'(Link for Google Maps)','presentation':'markdown'}],
        fixed_rows={'headers': True},
        row_selectable='multi',
        style_table={'height': '500px', 'overflowY': 'auto'},
        style_cell={'textAlign': 'left'})
]),
Any ideas what is happening? Is there a better way for the callback to return an empty data source for the datatable?
Thanks!

You haven't shared enough of your code (your callback specifically) to see exactly what is happening. However, this:
"If the user selects a site for which there is no data, I return {}"
is at least one reason why it doesn't work. The data property of a Dash DataTable needs to be a list, not a dictionary. You can, however, put dictionaries inside the list. Each dictionary inside the list corresponds to a row in the DataTable.
So, to reiterate and answer your question more directly:
"Is there a better way for the callback to return an empty data source for the datatable?"
Yes: return a list. An empty list ([]) gives an empty table, and otherwise the list can hold any number of dictionaries.
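As a rough sketch of that idea, the callback could pass whatever it has through a small guard before returning it. The helper name to_table_data is hypothetical and not part of Dash; it only makes sure the value handed to the data property is always a list.

```python
def to_table_data(records):
    """Return a list of row dicts, falling back to [] when there is nothing to show."""
    if not records:  # covers None, {} and []
        return []
    return list(records)


# With data: the rows pass through unchanged.
print(to_table_data([{"device": "A1", "emission": 0.4}]))

# Without data: an empty list instead of an empty dict.
print(to_table_data({}))
```

In the callback from the question this would be used as to_table_data(emissions_df.to_dict('records')). Note that to_dict('records') on an empty DataFrame already returns [], so the guard mainly matters on code paths that return {} or None by hand.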

xlsxwriter chart create dynamic rows

I'm trying to create charts with the xlsxwriter Python module.
It works fine, but I would like to avoid hard-coding the number of rows.
This example charts the rows up to row 30:
chart.add_series({
    'name': 'SNR of old AP',
    'values': '=Depart!$D$2:$D$30',
    'marker': {'type': 'circle'},
    'data_labels': {'value': True,'num_format':'#,##0'},
})
For 'values', I would like the row count to be dynamic. How do I do this?
Thanks.

"It works fine, but I would like to not have to hard code the row amount"
XlsxWriter supports a list syntax in add_series() for this exact case. So your example could be written as:
chart.add_series({
    'name': 'SNR of old AP',
    'values': ['Depart', 1, 3, 29, 3],
    'marker': {'type': 'circle'},
    'data_labels': {'value': True, 'num_format':'#,##0'},
})
The list is [sheet_name, first_row, first_col, last_row, last_col], with rows and columns zero-indexed, so you can set any of those values programmatically.
See the docs for add_series().
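For instance, the last row can be derived from the length of the data that was written. The following is only a sketch: column_range is a hypothetical helper (not part of XlsxWriter) and the SNR readings are made up. It assumes the data sits in one column starting on the row below a header.

```python
def column_range(sheet, col, n_rows, first_row=1):
    """Build [sheet, first_row, first_col, last_row, last_col] for one column.

    Rows and columns are zero-indexed, so first_row=1 is Excel row 2.
    """
    last_row = first_row + n_rows - 1
    return [sheet, first_row, col, last_row, col]


snr_values = [31, 28, 35, 40]  # made-up readings written to column D
values_ref = column_range('Depart', 3, len(snr_values))
print(values_ref)  # ['Depart', 1, 3, 4, 3], i.e. $D$2:$D$5
```

The resulting list can then be passed as 'values' in chart.add_series().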

How to convert a list of dictionaries into a list of each dictionary in the list's values?

I am trying to produce a list of categories that I can pass to my HTML template to render a nav bar of all the categories. In the products collection in my MongoDB database, every product has a category field. Using the code below, I generate a pymongo cursor of all the categories.
categories = Database.DATABASE[ProductConstants.COLLECTION].find({}, {'category': True, '_id': False})
print(categories)
<pymongo.cursor.Cursor at 0x1049cc668>
Putting categories in a list gives me
categories = list(categories)
print(categories)
[{'category': 'Phones'},
{'category': 'Phones'},
{'category': 'Phones'},
{'category': 'Phones'},
{'category': 'Phones'},
{'category': 'Soaps'}]
This seems to be a step in the right direction. I would like the end output for categories to simply be:
print(categories)
['Phones', 'Soaps'].
I have tried doing this:
categories = [category.values() for category in categories]
print(categories)
[dict_values(['Phones']),
dict_values(['Phones']),
dict_values(['Phones']),
dict_values(['Phones']),
dict_values(['Soaps'])]
If I could get rid of the dict_values I could potentially flatten this list using sum(categories, []) and then put that into a set() such that I don't have any duplicates. I believe this would give me the desired result but am not sure how to go about it. Perhaps I am going down the wrong route and there is a better way to go about all of this? Advice would be appreciated.

Try this:
categories = Database.DATABASE[ProductConstants.COLLECTION].find({}, {'category': True, '_id': False}).distinct('category')
print(categories)
The resulting list contains each distinct category value once.

It sounds like you need a set of categories:
categories = [{'category': 'Phones'},
{'category': 'Phones'},
{'category': 'Phones'},
{'category': 'Phones'},
{'category': 'Phones'},
{'category': 'Soaps'}]
# use a set to eliminate the duplicate categories
c = set(d['category'] for d in categories)
print(list(c))
Output:
['Soaps', 'Phones']
Update:
c = {d['category'] for d in categories} # equivalent using a set comprehension
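A set does not keep any particular order, which is why the output above may not match the order in the data. If the order of first appearance matters for the nav bar, one possible variation (a sketch, relying on dicts preserving insertion order in Python 3.7+) is dict.fromkeys:

```python
categories = [{'category': 'Phones'},
              {'category': 'Phones'},
              {'category': 'Phones'},
              {'category': 'Phones'},
              {'category': 'Phones'},
              {'category': 'Soaps'}]

# dict keys are unique and keep insertion order, so duplicates collapse
# while the first occurrence of each category keeps its position
unique_categories = list(dict.fromkeys(d['category'] for d in categories))
print(unique_categories)  # ['Phones', 'Soaps']
```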

Using Python and xlsxwriter to concatenate cell references with cell count

Hi, I have a working script that generates a line chart using XlsxWriter. However, I am looking for a way to concatenate an earlier hit count with the cell range for my generated chart. The script iterates over several similar files in a directory, so the overall 'hit count' varies for each file.
The script first looks through a text file for a string and collects some stats using line splitting. It drops the collected figures into Excel and increments a hit count (total) each time the particular string is found.
Then charts are generated using the collected stats.
Here's my chart generating section:
chart1 = workbook.add_chart({'type': 'line'})
chart1.add_series({
    'name': 'My Chart',
    'categories': '=Sheet1!$A$2:$A$2200',
    'values': '=Sheet1!$B$2:$B$2200',
    'line': {'color': 'purple'},
})
I am hoping to generate the chart by using the 'total' count as the last row of the range. So I am looking for something along the lines of
'categories': '=Sheet1!$A$2:$A$'+total,
'values': '=Sheet1!$B$2:$B$'+total,
I hope this makes sense. Basically, I want the cell row range to vary with the count of hits. Is this possible? Or, alternatively, is there a 'last row' reference in XlsxWriter for this type of circumstance?
Thanks,
MikG

The chart.add_series() method also accepts a list of the form [sheet_name, first_row, first_col, last_row, last_col] (zero-indexed) instead of a range string, so you can do something like this:
chart1.add_series({
    'name': 'My Chart',
    'categories': ['Sheet1', 1, 0, total - 1, 0],
    'values': ['Sheet1', 1, 1, total - 1, 1],
    'line': {'color': 'purple'},
})
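If you prefer to keep the range-string form from the question, note that '...$A$' + total raises a TypeError when total is an integer, because a str and an int cannot be concatenated. A minimal sketch using an f-string instead, assuming total holds the Excel row number of the last data row (abs_column_range is a hypothetical helper, not part of XlsxWriter):

```python
def abs_column_range(sheet, column, first_row, last_row):
    """Build an absolute single-column range such as '=Sheet1!$A$2:$A$2200'."""
    return f"={sheet}!${column}${first_row}:${column}${last_row}"


total = 2200  # example value; in the script this would be the hit count

categories_ref = abs_column_range("Sheet1", "A", 2, total)
values_ref = abs_column_range("Sheet1", "B", 2, total)

print(categories_ref)  # =Sheet1!$A$2:$A$2200
print(values_ref)      # =Sheet1!$B$2:$B$2200
```

These strings can be used directly as 'categories' and 'values' in chart1.add_series().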
