Python Pandas Web Scraping

I'm trying to turn a list of tables on this page into a Pandas DataFrame:
https://intermediaries.hsbc.co.uk/products/product-finder/
I want to select the customer type box only, select each of its options in turn (from first to last), and then click "Find product" to display the table for each one, before concatenating all the DataFrames into one DataFrame.
So far I have managed to select the first element and print the table, but I can't seem to turn it into a pandas DataFrame, as I get a ValueError: Must pass 2-d input. shape=(1, 38, 12)
This is my code:
def product_type_button(self):
    select = Select(self.driver.find_element_by_id('Availability'))
    try:
        select.select_by_visible_text('First time buyer')
    except NoSuchElementException:
        print('The item does not exist')
    time.sleep(5)
    self.driver.find_element_by_xpath('//button[@type="button" and (contains(text(),"Find product"))]').click()
    time.sleep(5)

def create_dataframe(self):
    data1 = pd.read_html(self.driver.page_source)
    print(data1)
    data2 = pd.DataFrame(data1)
    time.sleep(5)
    data2.to_csv('Data1.csv')
I would like to find a way to print the table for each element, maybe by selecting by index instead, and then concatenate them into one DataFrame. Any help would be appreciated.

All the data for the table is located inside a JavaScript file. You can use re/json to parse it and then construct the DataFrame:
import re
import json
import requests
import pandas as pd

js_src = "https://intermediaries.hsbc.co.uk/component---src-pages-products-product-finder-js-9c7004fb8446c3fe0a07.js"
data = requests.get(js_src).text

# the product list is embedded in the bundle as JSON.parse('...'), so pull out the JSON string
data = re.search(r"JSON\.parse\('(.*)'\)", data).group(1)
data = json.loads(data)

df = pd.DataFrame(data)
print(df.head().to_markdown(index=False))
df.to_csv("data.csv", index=False)
Prints:
| Changed | NewProductCode | PreviousProductCode | ProductType | Deal Period (Fixed until) | ProductDescription1 | ProductTerm | Availability | Repayment Basis | Min LTV % | MaxLTV | Minimum Loan ? | Fees Payable Per | Rate | Reversionary Rate % | APR | BookingFee | Cashback | CashbackValue | ERC - Payable | Unlimited lump sum payments Premitted (without fees) | Unlimited overpayment permitted (without fees) | Overpayments | ERC | Portable | Completionfee | Free Legals for Remortgage | FreeStandardValuation | Loading |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Continued | 4071976 |  | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR and IO | 0% | 60% | 10,000 | 5,000,000 | 5.99% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
| Continued | 4071977 |  | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR and IO | 0% | 70% | 10,000 | 2,000,000 | 6.04% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
| Continued | 4071978 |  | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR and IO | 0% | 75% | 10,000 | 2,000,000 | 6.04% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
| Continued | 4071979 |  | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR | 0% | 80% | 10,000 | 1,000,000 | 6.14% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
| Continued | 4071980 |  | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR | 0% | 85% | 10,000 | 750,000 | 6.19% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
and saves data.csv.
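Since the JavaScript file already contains every customer type, you could also skip Selenium entirely and just split the parsed frame on the Availability column. A small sketch (the column name is taken from the output above; 'FTB' is the first-time-buyer value in the sample data):

# one DataFrame per customer type
for availability, group in df.groupby("Availability"):
    print(availability, group.shape)

# or keep only the first-time-buyer rows
ftb = df[df["Availability"] == "FTB"]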

A minimal change to your code: pd.read_html returns a list of dataframes for all tables found on the webpage.
Since there is only one table on your page, data1 is a list containing 1 dataframe. This is where the error Must pass 2-d input. shape=(1, 38, 12) comes from: pd.DataFrame is handed a list of 1 dataframe of shape (38, 12), i.e. 3-dimensional input.
You probably want to do just:
data2 = data1[0]
data2.to_csv(...)
(Also, no need to sleep after reading from the webpage)
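If you do still want to drive the page once per customer type and concatenate the results, here is a rough, untested sketch in the style of your own methods (it reuses your locators and older Selenium-style calls; the fixed sleeps would ideally become explicit waits):

def scrape_all_customer_types(self):
    frames = []
    n_options = len(Select(self.driver.find_element_by_id('Availability')).options)
    for i in range(n_options):
        # re-find the select each time in case the page re-renders it
        Select(self.driver.find_element_by_id('Availability')).select_by_index(i)
        self.driver.find_element_by_xpath(
            '//button[@type="button" and contains(text(), "Find product")]').click()
        time.sleep(5)
        frames.append(pd.read_html(self.driver.page_source)[0])
    return pd.concat(frames, ignore_index=True)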

Related

Mapping data frame descriptions based on values of multiple columns

I need to generate a mapping dataframe with each unique code and a description I want prioritised, but need to do it based off a set of prioritisation options. So for example the starting dataframe might look like this:
Filename TB Period Company Code Desc. Amount
0 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 98 100
1 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 7 200
2 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1000 ZX -100
3 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1000 29 -200
4 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1001 BA 100
5 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1001 9 200
6 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 ARC -100
7 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 86 -200
The options I have for prioritisation of descriptions are:
Firstly to search for viable options in each Period, so for example Closing first, then if not found Opening, then if not found Prior.
If multiple descriptions are in the prioritised period, prioritise either longest or first instance.
So for example, if I wanted prioritisation of Closing, then Opening, then Prior, with longest string, I should get a mapping dataframe that looks like this:
Code New Desc.
FOXTROT__1000 29
FOXTROT__1001 ARC
Just for context, I have a fairly simple way to do all this in tkinter, but it's dependent on generating a GUI of inconsistent codes and comboboxes of their descriptions, which is then used to generate a mapping dataframe.
The issue is that for large volumes (>1,000 and up to 30,000 inconsistent codes) it becomes impractical to generate a GUI, so for large volumes I need a way to auto-generate the mapping dataframe directly from the initial data whilst circumventing tkinter entirely.
import numpy as np
import pandas as pd

# Create a new column which ranks the hierarchy given the value of Period
df['NewFilterColumn'] = np.where(df['Period'] == 'Closing', 1,
                        np.where(df['Period'] == 'Opening', 2,
                        np.where(df['Period'] == 'Prior', 3, None)))

# sort by the existing description column ('Desc.' in the question, not 'New Desc.')
df = df.sort_values(by=['NewFilterColumn', 'Code', 'Desc.'], ascending=True)
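A possible continuation of this idea that produces the mapping frame itself. It assumes the Desc. column from the question holds the descriptions, reuses the NewFilterColumn ranking above, and therefore also assumes the Period values match 'Closing'/'Opening'/'Prior' exactly:

# prefer the longest description within the best-ranked period
df['DescLen'] = df['Desc.'].astype(str).str.len()
df = df.sort_values(['Code', 'NewFilterColumn', 'DescLen'],
                    ascending=[True, True, False])

# the first row per code after sorting carries the prioritised description
mapping = (df.drop_duplicates('Code')[['Code', 'Desc.']]
             .rename(columns={'Desc.': 'New Desc.'})
             .reset_index(drop=True))
print(mapping)

On the sample data this yields 29 for FOXTROT__1000 and ARC for FOXTROT__1001, matching the expected output. To prefer the first instance rather than the longest description, drop the DescLen sort key.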

By default, how can I view all the rows in a Series and/or DataFrame?

By default, whenever I view a Series or DataFrame, it only gives me the first five rows and the last five rows as a preview. How do I view all the rows? Is there a method for that?
For example,
df[df["First Name"].duplicated()]
First Name Gender Start Date Last Login Time Salary Bonus % Senior Management Team
327 Aaron Male 1994-01-29 2020-04-22 18:48:00 58755 5.097 True Marketing
440 Aaron Male 1990-07-22 2020-04-22 14:53:00 52119 11.343 True Client Services
937 Aaron NaN 1986-01-22 2020-04-22 19:39:00 63126 18.424 False Client Services
141 Adam Male 1990-12-24 2020-04-22 20:57:00 110194 14.727 True Product
302 Adam Male 2007-07-05 2020-04-22 11:59:00 71276 5.027 True Human Resources
... ... ... ... ... ... ... ... ...
902 NaN Male 2001-05-23 2020-04-22 19:52:00 103877 6.322 True Distribution
925 NaN Female 2000-08-23 2020-04-22 16:19:00 95866 19.388 True Sales
946 NaN Female 1985-09-15 2020-04-22 01:50:00 133472 16.941 True Distribution
947 NaN Male 2012-07-30 2020-04-22 15:07:00 107351 5.329 True Marketing
951 NaN Female 2010-09-14 2020-04-22 05:19:00 143638 9.662 True NaN
You can change the viewing options for Jupyter like so:
pd.set_option('display.max_rows', df.shape[0])
An alternative to pd.set_option(): create a custom loop that walks through the dataframe in sets of 60 rows (or whatever your maximum printable rows is). This approach does repeat the column headers for each batch of 60 rows, but it was a fun "alternative" to code and turns out to be a viable way of printing very large numbers of rows (100,000 or so). I created a random dataframe of floats 100,000 rows long and it took less than a second to run.
import numpy as np
import pandas as pd
import math

nrows = 100000
df = pd.DataFrame(np.random.rand(nrows, 4), columns=list('ABCD'))

i = 0
for x in range(0, int(math.ceil(nrows / 60))):
    print(df.iloc[i:i + 60, :].tail(60))
    i += 60
The benefit of my approach depends on how many rows you want to show. I just tried the maximum number of rows with the pd.set_option method on 100,000 rows, and when simply calling df (instead of print(df)) my page became unresponsive, because it creates one extremely long page with no scrollbar. When you print you get a scrollbar, so it's far less intensive and, in my opinion, better practice for printing a large number of rows.
So why not just raise the limit with pd.set_option('display.max_rows', None) and use print(df)? Wouldn't that work?
That worked for 10,000 rows, but I received this error when doing 100,000 rows:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
Perhaps you want to adjust NotebookApp.iopub_data_rate_limit, but then it gets more technical and you might have to go to the command line and change config settings:
IOPub data rate exceeded in Jupyter notebook (when viewing image)
My solution lets you print all rows without touching pd.options or having to manually edit these limits in configuration files. Of course, again, this really depends on how many rows you want to print in your terminal.
This is explained in the following link.
https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/
An excerpt from the link provides these 4 options:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)
(In recent pandas versions display.max_colwidth takes None rather than -1; the -1 form is deprecated.)
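If you only want to lift the limits temporarily for a single display, pandas also provides a context manager, so the defaults are restored automatically afterwards:

import pandas as pd

# show every row and column for this one print only
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)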

Pandas.read_csv issue

I'm trying to read the messages from my dataset, but pandas can't read the class column the same way it appears in the CSV file.
messages = pandas.read_csv('bitcoin_reddit.csv', delimiter='\t',
                           names=["title", "class"])
print (messages)
Under the class label, pandas only reads NaN.
This is my CSV file:
title,url,timestamp,class
"It's official! 1 Bitcoin = $10,000 USD",https://v.redd.it/e7io27rdgt001,29/11/2017 17:25,0
The last 3 months in 47 seconds.,https://v.redd.it/typ8fdslz3e01,4/2/2018 18:42,0
It's over 9000!!!,https://i.imgur.com/jyoZGyW.gifv,26/11/2017 20:55,1
Everyone who's trading BTC right now,http://cdn.mutually.com/wp-content/uploads/2017/06/08-19.jpg,7/1/2018 12:38,1
I hope James is doing well,https://i.redd.it/h4ngqma643101.jpg,1/12/2017 1:50,1
Weeeeeeee!,https://i.redd.it/iwl7vz69cea01.gif,17/1/2018 1:13,0
Bitcoin.. The King,https://i.redd.it/4tl0oustqed01.jpg,1/2/2018 5:46,1
Nothing can increase by that much and still be a good investment.,https://i.imgur.com/oWePY7q.jpg,14/12/2017 0:02,1
"This is why I want bitcoin to hit $10,000",https://i.redd.it/fhzsxgcv9nyz.jpg,18/11/2017 18:25,1
Bitcoin Doesn't Give a Fuck.,https://v.redd.it/ty2y74gawug01,18/2/2018 15:19,-1
Working Hard or Hardly Working?,https://i.redd.it/c2o6204tvc301.jpg,12/12/2017 12:49,1
The separator in your csv file is a comma, not a tab. With delimiter='\t' there are no tabs to split on, so each whole line ends up in the first column (title) and class is filled with NaN. Since , is the default separator, there is no need to define it.
However, names= defines custom names for the columns, and your header row already provides those names. Passing the column names you are interested in to usecols is all you need:
>>> pd.read_csv(file, usecols=['title', 'class'])
title class
0 It's official! 1 Bitcoin = $10,000 USD 0
1 The last 3 months in 47 seconds. 0
2 It's over 9000!!! 1
3 Everyone who's trading BTC right now 1
4 I hope James is doing well 1
5 Weeeeeeee! 0

How to find last 24 hours data from pandas data frame

I have data with two columns: one is description and the other is publishedAt. I applied the sort function on the publishedAt column and got the output in descending order of date. Here is a sample of my data frame:
description publishedAt
13 Bitcoin price has failed to secure momentum in... 2018-05-06T15:22:22Z
16 Brian Kelly, a long-time contributor to CNBC’s... 2018-05-05T15:56:48Z
2 The bitcoin price is less than $100 away from ... 2018-05-05T13:14:45Z
12 Mati Greenspan, a senior analyst at eToro and ... 2018-05-04T16:05:37Z
52 A Singaporean startup developing ‘smart bankno... 2018-05-04T14:02:30Z
75 Cryptocurrencies are set to make a comeback on... 2018-05-03T08:10:19Z
76 The bitcoin price is hovering near its best le... 2018-04-30T16:26:57Z
74 In today’s climate of ICOs with 100 billion to... 2018-04-30T12:03:31Z
27 Investment guru Warren Buffet remains unsold o... 2018-04-29T17:22:19Z
22 The bitcoin price has increased by around $400... 2018-04-28T12:28:35Z
68 Bitcoin futures volume reached an all-time hig... 2018-04-27T16:32:44Z
14 Biotech-company-turned-cryptocurrency-investme... 2018-04-27T14:25:15Z
67 The bitcoin price has rebounded to $9,200 afte... 2018-04-27T06:24:42Z
Now I want the descriptions from the last 3 hours, 6 hours, 12 hours and 24 hours.
How can I find them?
Thanks
As a simple solution within Pandas, you can use the DataFrame.last(offset) function. Be sure to set the publishedAt column as the dataframe's DatetimeIndex. A similar function for getting rows at the start of a dataframe is DataFrame.first(offset).
Here is an example using the provided offsets:
df.last('24h')
df.last('12h')
df.last('6h')
df.last('3h')
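The indexing step mentioned above is easy to miss, so here is a minimal sketch of it, assuming the publishedAt column still holds the ISO-8601 strings shown in the question:

import pandas as pd

df['publishedAt'] = pd.to_datetime(df['publishedAt'])  # parse the ISO-8601 timestamps
df = df.set_index('publishedAt').sort_index()          # last()/first() need a DatetimeIndex

last_day = df.last('24h')['description']               # likewise '12h', '6h', '3h'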
Assuming that the dataframe is called df:
import datetime as dt

# assumes publishedAt already holds datetime values; use hours = 6, 12, 24 as needed
df[df['publishedAt'] >= (dt.datetime.now() - dt.timedelta(hours=3))]['description']
If you need the intervals to be exclusive, i.e. the descriptions within the last 6 hours but not the ones within the last 3 hours, you'll need to combine the two conditions with array-like logical operators from numpy such as numpy.logical_and(arr1, arr2) inside the first bracket.
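Note that in the sample data publishedAt holds strings, so the comparison above needs the column converted first; a small adjustment, using a timezone-aware "now" because the sample timestamps are UTC:

import pandas as pd

df['publishedAt'] = pd.to_datetime(df['publishedAt'], utc=True)
cutoff = pd.Timestamp.now(tz='UTC') - pd.Timedelta(hours=3)   # hours = 6, 12, 24 as needed
recent = df.loc[df['publishedAt'] >= cutoff, 'description']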

Compare multiple cells in openpyxl

I need to compare multiple cells in openpyxl but I have not been successful. To be more precise, I have an .xlsx file that I import into my Python script; it contains 4 columns and around 70,000 rows. Rows that have the same first 3 columns must be combined, adding up the number that appears in the fourth column.
For example
Row 1 .. Type of material: A | Location: NY | Month of sale: January | Cost: 100
..
Row 239 Type of material: A | Location: NY | Month of sale: January | Cost: 150
..
Row 1020 Type of material: A | Location: NY | Month of sale: January | Cost: 80
..
etc
Assuming that those were the only such matches, a new data table must be generated (for example in a new sheet) where only one row appears for them, like this:
Type of material: A | Location: NY | Month of sale: January | Cost: 330 (sum of costs)
And so on, with all the data in .xlsx file to get a new consolidated table.
I hope to have been clear with the explanation, but if it was not, I can be even more precise if necessary.
As I mentioned at the beginning, I have not been successful so far, so I will appreciate any help!
Thank you very much
Instead of reading it via openpyxl, I would use pandas:
import pandas as pd
raw_data = pd.read_excel(filename, header=0)
summary = raw_data.groupby(['Type of material', 'Location', 'Month of sale'])['Cost'].sum()
If this raises a KeyError, you'll need to fix the column labels to match your sheet's headers.
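If the consolidated table should end up in a new spreadsheet, as the question suggests, one way to finish (continuing from the summary Series produced above) is:

# write one row per unique (material, location, month) combination to a new workbook
summary.reset_index().to_excel('consolidated.xlsx', index=False)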
