Load webscraping results into pandas dataframe and export to excel - python

I'm trying to write a short Python snippet that loops through different web pages structured in the same way (i.e. same number of columns / rows), loads all the information into a pandas dataframe, and finally exports it to Excel.
I managed to write all the code that gathers what should be the column headers (in the dt HTML tag) and the rows (in the dd HTML tag), but I'm having issues placing all this info into a pandas dataframe.
for row in rows:
    QA_link = row.find('td', class_='views-field views-field-nothing-1').find('a', href=True)['href']
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    QA_table = soup_QA.find('dl', class_='dl-horizontal SingleRulebookRecord')
    if boolInitialiseTable:
        QA_hdr = [str.replace(link.string, ':', '') for link in QA_table.findAll('dt')]
        QA_details = [str(link.string) for link in QA_table.findAll('dd')]
        df = pd.DataFrame()
        df = pd.concat([df, pd.DataFrame(QA_details).transpose()], ignore_index=True, axis=0)
        boolInitialiseTable = False
        df.columns = QA_hdr
    else:
        QA_details = [str(link.string) for link in QA_table.findAll('dd')]
        df = pd.concat([df, pd.DataFrame(QA_details).transpose()])
Here rows contains all the different web pages that need to be accessed to gather the info I need to put in the pandas dataframe.
So from the HTML table like content of:
<dl class="dl-horizontal SingleRulebookRecord">
<dt>Question ID:</dt>
<dd>2020_5469 </dd>
<dt>Topic:</dt>
<dd>Weather</dd>
<dt>Date</dt>
<dd>06/06/2020</dd>
</dl>
I would like to get a pandas dataframe with:
Question ID  Topic    Date
2020_5469    Weather  06/06/2020
Finally df.to_excel('results.xlsx') should do the job of exporting everything into Excel.
I feel that all this transposing in the code is not the correct way of doing it. In addition, the dtype of the fields of the table is object and not string as I would expect, but maybe that is not a problem.

I would do it like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
links = ['https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5469',
         'https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5128']
dfs = []
for QA_link in links:
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    QA_hdr = [link.get_text() for link in soup_QA.findAll('dt')]
    QA_details = [[link.get_text() for link in soup_QA.findAll('dd')]]
    dfs.append(pd.DataFrame(QA_details, columns=QA_hdr))
df_all = pd.concat(dfs, axis=0).reset_index(drop=True)
# check for NaN values (columns not shared between urls)
print(df_all[df_all.columns[df_all.isna().any()]].T)
                                                                 0    1
Name of institution / submitter:       BearingPoint Switzerland AG  NaN
Country of incorporation / residence:                  Switzerland  NaN
Answer prepared by:                    Answer prepared by the EBA.  NaN
Subparagraph:                                                  NaN  (f)
df_all.iloc[:,:5].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Question ID:  2 non-null      object
 1   Legal Act:    2 non-null      object
 2   Topic:        2 non-null      object
 3   Article:      2 non-null      object
 4   Paragraph:    2 non-null      object
dtypes: object(5)
memory usage: 208.0+ bytes
Notice that QA_details is a nested list: each nested list fills a new row; it's just that you only have one per page. For example, here's how it works if you have two nested lists:
lst = [[1,2],[3,4]]
df = pd.DataFrame(lst, columns=['A','B'])
print(df)
   A  B
0  1  2
1  3  4
As for the reason why the Dtype is given as object, see e.g. this SO post. But all your cells will in fact contain strings, which we can easily check. E.g.:
cols = df_all.columns[df_all.notna().all()]
print(all([isinstance(i, str) for i in df_all.loc[0, cols]]))
# True
Finally, yes, df.to_excel('results.xlsx') will work to export the df to Excel. Perhaps use df.to_excel('results.xlsx', index=False) to avoid exporting the index.
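If the pages ever list their fields in a different order, a small variant (just a sketch, assuming each <dt> label is immediately followed by its matching <dd> value) is to zip the two tag lists into a dict per page and let pandas align the columns by name:
import requests
from bs4 import BeautifulSoup
import pandas as pd

records = []
for QA_link in links:  # same list of URLs as above
    soup = BeautifulSoup(requests.get(QA_link).text, 'html.parser')
    dl = soup.find('dl', class_='dl-horizontal SingleRulebookRecord')
    # pair each <dt> label with its <dd> value; strip whitespace and the trailing colon
    record = {dt.get_text(strip=True).rstrip(':'): dd.get_text(strip=True)
              for dt, dd in zip(dl.findAll('dt'), dl.findAll('dd'))}
    records.append(record)

df_all = pd.DataFrame(records)  # columns are aligned by name, missing fields become NaN
df_all.to_excel('results.xlsx', index=False)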

Related

Comparing 2 revisions of excel files in python pandas

I am very new to pandas. It might be a silly question to some of you.
I am looking to compare 2 excel files and output the changes or the new entries
old.csv
Product Price Description
1 1.25 Product 1
2 2.25 Product 2
3 3.25 Product 3
new.csv
Product Price Description
1 1.25 Product 1 # Product 2 not in list
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
TRIED
import pandas as pd
import numpy as np
import requests
url = '<SomeUrl>/<PriceList>.xls'
resp = requests.get(url)
df = pd.DataFrame(pd.read_excel(resp.content))
df.to_csv('new.csv')
old = pd.read_csv('old.csv')
new = pd.read_csv('new.csv')
changes = new.loc[new['Price'] != old['Price']]
changes_csv = changes[['Product', 'Price', 'Description']]
print(changes_csv)
EXPECTING
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
I get the correct results if the length matches exactly. Otherwise I get
ValueError: Can only compare identically-labeled Series objects
BONUS
It would be awesome if I could produce output for discontinued products.
You could create a master index of all products, create 2 old/new dataframes using all the master index, then use df.compare() to compare the two databases:
import pandas as pd
df1 = pd.DataFrame([[1,1.25,'Product 1'],[2,2.25,'Product 2'],[3,3.25,'Product 3']], columns=['Product','Price','Description'])
df2 = pd.DataFrame([[1,1.25,'Product 1'],[3,3.5,'Product 2'],[4,4.25,'Product 3 Change']], columns=['Product','Price','Description'])
df1product = df1[['Product']]
df2product = df2[['Product']]
dfproducts = df1product.merge(df2product, on='Product', how='outer')
df1 = dfproducts.merge(df1, how='left', on='Product')
df1.set_index(df1['Product'], inplace=True)
df2 = dfproducts.merge(df2, how='left', on='Product')
df2.set_index(df2['Product'], inplace=True)
dfcompare = df1.compare(df2, align_axis=0)
I have solved the problem. Even though @WCeconomics kindly took the time to type the code out, it did not help me get the output I wanted, likely due to me being a noob with pandas.
This is how I solved it, so as it is useful to the community.
import pandas as pd
import openpyxl # to write excel files
from openpyxl.utils.dataframe import dataframe_to_rows
old = pd.read_excel('old.xls')
new = pd.read_excel('new.xls')
# data for these is in the same format as in question, with 'Product Description' instead of 'Description'
merged = old.merge(new, on='Product', how='outer', suffixes=('_old', '_new'))
df = merged[['Product', 'Product Description_old', 'Price_old', 'Price_new']]
changes = df.loc[(df['Price_new'] > df['Price_old'])].dropna(how='any', axis=0)
wb = openpyxl.Workbook()
ws = wb.active
for r in dataframe_to_rows(changes, index=False, header=True):
    ws.append(r)
wb.save('avp_changes.xls')
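For the bonus part (discontinued products), here is a small sketch using the same old/new frames read above: an outer merge with indicator=True marks the rows that exist only in the old file.
merged = old.merge(new, on='Product', how='outer', suffixes=('_old', '_new'), indicator=True)
# '_merge' is 'left_only' for products present in old.xls but missing from new.xls
discontinued = merged.loc[merged['_merge'] == 'left_only', ['Product', 'Product Description_old', 'Price_old']]
print(discontinued)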

Extract values within the quote signs into two separate columns with python

How can I extract the values within the quote signs into two separate columns with Python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
I would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
I have tried several str.extract calls but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
"'FRH01';'29300'",
"'FRT02';'29310'",
"'FRH03';'29340'",
"'FRH05';'29350'",
"'FRG02';'29360'"],
columns = ['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split:
p1 = []
p2 = []
# strip the quotes first, then split on ';'
for row in df['postcode'].str.replace("'", "").str.split(';'):
    p1.append(row[0])
    p2.append(row[1])
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2

How to filter dataframe only by month and year?

I want to select many cells which are filtered only by month and year. For example, there are 01.01.2017, 15.01.2017, 03.02.2017 and 15.02.2017 cells. I want to group these cells just by looking at the month and year information. If they are in January, they should be grouped together.
Output Expectation:
01.01.2017 ---- 1
15.01.2017 ---- 1
03.02.2017 ---- 2
15.02.2017 ---- 2
Edit: I have 2 datasets in different Excel files, as you can see below (screenshots: first data, second data).
What I'm trying to do is get the 'Su Seviye' data for every 'DH_ID' separately from the first dataset, and then paste these data into the 'Kuyu Yüksekliği' column of the second dataset. But the problems are that every 'DH_ID' is in a different sheet, and while the first dataset only has month and year information, the second dataset additionally has day information. How can I write this kind of code?
import pandas as pd
df = pd.read_excel('...Gözlem kuyu su seviyeleri- 2017.xlsx', sheet_name= 'GÖZLEM KUYULARI1', header=None)
df2 = pd.read_excel('...YERALTI SUYU GÖZLEM KUYULARI ANALİZ SONUÇLAR3.xlsx', sheet_name= 'HJ-8')
HJ8 = df.iloc[:, [0,5,7,9,11,13,15,17,19,21,23,25,27,29]]
##writer = pd.ExcelWriter('yıllarsuseviyeler.xlsx')
##HJ8.to_excel(writer)
##writer.save()
rb = pd.read_excel('...yıllarsuseviyeler.xlsx')
rb.loc[0,7]='01.2022'
rb.loc[0,9]='02.2022'
rb.loc[0,11]='03.2022'
rb.loc[0,13]='04.2022'
rb.loc[0,15]='05.2021'
rb.loc[0,17]='06.2022'
rb.loc[0,19]='07.2022'
rb.loc[0,21]='08.2022'
rb.loc[0,23]='09.2022'
rb.loc[0,25]='10.2022'
rb.loc[0,27]='11.2022'
rb.loc[0,29]='12.2022'
You can see what I have done above.
First, you can convert the date column to a datetime object, then get the year and month part with to_period, and finally get the group number with ngroup().
df['group'] = df.groupby(pd.to_datetime(df['date'], format='%d.%m.%Y').dt.to_period('M')).ngroup() + 1
         date  group
0  01.01.2017      1
1  15.01.2017      1
2  03.02.2017      2
3  15.02.2017      2
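To match the two Excel files on month and year only (the second one also has day information), the same to_period('M') trick can serve as a merge key. Here is a sketch with hypothetical frames standing in for the two workbooks (the 'period' and 'date' column names are assumptions):
import pandas as pd

# hypothetical stand-ins for the two workbooks
df_levels = pd.DataFrame({'period': ['01.2022', '02.2022'], 'Su Seviye': [10.5, 11.2]})
df_analysis = pd.DataFrame({'date': ['15.01.2022', '03.02.2022']})

# normalise both sides to a monthly period so day information is ignored
df_levels['month'] = pd.to_datetime(df_levels['period'], format='%m.%Y').dt.to_period('M')
df_analysis['month'] = pd.to_datetime(df_analysis['date'], format='%d.%m.%Y').dt.to_period('M')

# every row of the second file picks up the matching monthly 'Su Seviye' value
df_analysis = df_analysis.merge(df_levels[['month', 'Su Seviye']], on='month', how='left')
df_analysis = df_analysis.rename(columns={'Su Seviye': 'Kuyu Yüksekliği'})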

Apply row logic on date while extracting only multiple columns of a dataframe

I am extracting a data frame in pandas and want to only extract rows where the date is after a variable.
I can do this in multiple steps but would like to know if it is possible to apply all logic in one call for best practice.
Here is my code
import pandas as pd
self.min_date = "2020-05-01"
#Extract DF from URL
self.df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
#Here is where the error lies, I want to extract the columns ["Subject","Reference","Date of case"] but where the date is after min_date.
self.df = self.df.loc[["Date of case" < self.min_date], ["Subject","Reference","Date of case"]]
return(self.df)
I keep getting the error: "IndexError: Boolean index has wrong length: 1 instead of 100"
I cannot find the solution online because every answer is too specific to the scenario of the person that asked the question.
e.g. this solution only works for if you are calling one column: How to select rows from a DataFrame based on column values?
I appreciate any help.
Replace this:
["Date of case" < self.min_date]
with this:
self.df["Date of case"] < self.min_date
That is:
self.df = self.df.loc[self.df["Date of case"] < self.min_date,
["Subject","Reference","Date of case"]]
You have a slight syntax issue.
Keep in mind that it's best practice to convert string dates into pandas datetime objects using pd.to_datetime.
min_date = pd.to_datetime("2020-05-01")
#Extract DF from URL
df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
#Here is where the error lies, I want to extract the columns ["Subject","Reference","Date of case"] but where the date is after min_date.
df['Date of case'] = pd.to_datetime(df['Date of case'])
df = df.loc[df["Date of case"] > min_date, ["Subject","Reference","Date of case"]]
Output:
Subject Reference Date of case
0 Salmonella enterica ser. Enteritidis (presence... 2020.2145 2020-05-22
1 migration of primary aromatic amines (0.4737 m... 2020.2131 2020-05-22
2 celery undeclared on green juice drink from Ge... 2020.2118 2020-05-22
3 aflatoxins (B1 = 29.4 µg/kg - ppb) in shelled ... 2020.2146 2020-05-22
4 too high content of E 200 - sorbic acid (1772 ... 2020.2125 2020-05-22
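Since the question asks for everything in a single call, DataFrame.query can also express the filter: backticks handle the space in the column name and @ references a local variable (a sketch assuming the same df and min_date as above):
filtered = df.query("`Date of case` > @min_date")[["Subject", "Reference", "Date of case"]]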

Converting a column of strings to numbers in Pandas

How do I get the Units column to numeric?
I have a Google spreadsheet that I am reading in; the date column gets converted fine, but I'm not having much luck getting the Unit Sales column to convert to numeric. I'm including all the code, which uses requests to get the data:
from StringIO import StringIO
import requests
import pandas as pd

act = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak_wF7ZGeMmHdFZtQjI1a1hhUWR2UExCa2E4MFhiWWc&output=csv&gid=1')
dataact = act.content
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=['date'])
actdf.rename(columns={'Unit Sales': 'Units'}, inplace=True) #incase the space in the name is messing me up
The different methods I have tried to get Units to numeric:
actdf=actdf['Units'].convert_objects(convert_numeric=True)
#actdf=actdf['Units'].astype('float32')
Then I want to resample, but I'm getting strange string concatenations since the numbers are still strings:
#actdfq=actdf.resample('Q',sum)
#actdfq.head()
actdf.head()
#actdf
So the df looks like this, with just Units and the date index:
date
2013-09-01 3,533
2013-08-01 4,226
2013-07-01 4,281
Name: Units, Length: 161, dtype: object
You have to specify the thousands separator:
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=['date'], thousands=',')
This will work
In [13]: s
Out[13]:
0 4,223
1 3,123
dtype: object
In [14]: s.str.replace(',','').convert_objects(convert_numeric=True)
Out[14]:
0 4223
1 3123
dtype: int64
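Note that convert_objects was deprecated and later removed from pandas; with current versions the same conversion can be done with pd.to_numeric (a small sketch assuming the same comma-separated strings):
import pandas as pd

s = pd.Series(['4,223', '3,123'])
print(pd.to_numeric(s.str.replace(',', '')))
# 0    4223
# 1    3123
# dtype: int64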
