I've hit a brick wall with my Scrapy scrape of an HTML table. Basically, I have a piece of code that works by first collecting the column names, using them as keys, and pairing each of them with its corresponding XPath entry in a separate object. The rows are then put into a pandas DataFrame, which is ultimately converted into a CSV for final use.
import scrapy
from scrapy.selector import Selector
import re
import pandas as pd

class PostSpider(scrapy.Spider):
    name = "standard_squads"
    start_urls = [
        "https://fbref.com/en/comps/11/stats/Serie-A-Stats",
    ]

    def parse(self, response):
        # map each header name to its (1-based) column position
        column_index = 1
        columns = {}
        for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
            column_name = column_node.xpath("./text()").extract_first()
            print("column name is: " + column_name)
            columns[column_name] = column_index
            column_index += 1

        # build one dict per table row, keyed by the header names
        matches = []
        for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
            match = {}
            for column_name in columns.keys():
                if column_name == 'Squad':
                    match[column_name] = row.xpath('th/a/text()').extract_first()
                else:
                    match[column_name] = row.xpath(
                        "./td[{index}]//text()".format(index=columns[column_name] - 1)
                    ).extract_first()
            matches.append(match)

        print(matches)
        df = pd.DataFrame(matches, columns=columns.keys())
        yield df.to_csv("test_squads.csv", sep=",", index=False)
However, I just realised that the column header names in the XPath response (//*[@id="stats_standard_squads"]/thead/tr[2]/th) actually contain duplicates (for example, xG appears twice in the table on that page, as does xA). Because of this, when I loop through columns.keys() the duplicates get tossed away, so I only end up with 20 columns in the final CSV instead of 25.
I'm not sure what to do now. I've tried adding the column names to a list, adding them as dataframe headers and then appending a new row each time, but it seems to be a lot of boilerplate. I was hoping there might be a simpler solution for this automated scrape that allows duplicate names in the pandas dataframe columns.
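To see the problem in isolation, here is a minimal sketch (the header values are just placeholders): a plain dict keeps only one entry per key, so the duplicated headers collapse before pandas ever sees them, even though a DataFrame itself is perfectly happy to hold duplicate column labels.

import pandas as pd

headers = ['Squad', 'xG', 'xA', 'xG', 'xA']            # duplicates, as on the page
columns = {name: i for i, name in enumerate(headers, start=1)}
print(list(columns))                                   # ['Squad', 'xG', 'xA'] (duplicates are gone)

df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=headers)
print(df.columns.tolist())                             # pandas keeps all five labels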
What about reading the column names into a list and adding suffixes to the duplicates:
def parse(self, response):
    columns = []
    for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
        column_name = column_node.xpath("./text()").extract_first()
        columns.append(column_name)

    matches = []
    for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
        match = {}
        suffixes = {}
        for column_index, column_name in enumerate(columns):
            # build a unique key for the current column
            if column_name not in suffixes:
                suffixes[column_name] = 1
                df_name = column_name  # no suffix for the first occurrence
            else:
                suffixes[column_name] += 1
                df_name = f'{column_name}_{suffixes[column_name]}'
            if column_name == 'Squad':
                match[df_name] = row.xpath('th/a/text()').extract_first()
            else:
                match[df_name] = row.xpath(
                    "./td[{index}]//text()".format(index=column_index)
                ).extract_first()
        matches.append(match)

    print(matches)
    df = pd.DataFrame(matches)  # columns is a list here, so let pandas take the labels from the dict keys
    yield df.to_csv("test_squads.csv", sep=",", index=False)
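For what it's worth, here is the suffix logic above run on a made-up header list outside of Scrapy, to show what the resulting keys look like:

suffixes = {}
renamed = []
for name in ['Squad', 'xG', 'xA', 'xG', 'xA']:
    if name not in suffixes:
        suffixes[name] = 1
        renamed.append(name)                        # first occurrence keeps its original name
    else:
        suffixes[name] += 1
        renamed.append(f'{name}_{suffixes[name]}')  # later occurrences get _2, _3, ...
print(renamed)  # ['Squad', 'xG', 'xA', 'xG_2', 'xA_2']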
I am trying to do the following tasks:
Open a DOCX file using python-docx library
Count the # of tables in the DOCX: table_count = len(document.tables)
The function read_docx_table() extracts a table and creates a dataframe from it.
My objective here is as follows:
Extract ALL tables from the DOCX
Find the table that is empty
Delete the empty table
Save the DOCX
My code is as follows:
import pandas as pd
from docx import Document
import numpy as np

document = Document('tmp.docx')
table_count = len(document.tables)
table_num = table_count
print(f"Number of tables in the Document is: {table_count}")
nheader = 1
i = 0

def read_docx_table(document, table_num=1, nheader=1):
    table = document.tables[table_num-1]
    data = [[cell.text for cell in row.cells] for row in table.rows]
    df = pd.DataFrame(data)
    if nheader == 1:
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    elif nheader == 2:
        outside_col, inside_col = df.iloc[0], df.iloc[1]
        h_index = pd.MultiIndex.from_tuples(list(zip(outside_col, inside_col)))
        df = pd.DataFrame(data, columns=h_index).drop(df.index[[0, 1]]).reset_index(drop=True)
    elif nheader > 2:
        print("More than two headers. Not Supported in Current version.")
        df = pd.DataFrame()
    return df

def Delete_table(table):
    print(f" We are deleting table now. Table index is {table}")
    print(f"Type of index before casting is {type(table)}")
    index = int(table)
    print(f"Type of index is {type(index)}")
    try:
        print("Deleting started...")
        document.tables[index]._element.getparent().remove(document.tables[index]._element)
    except Exception as e:
        print(e)

while (i < table_count):
    print(f"Dataframe table number is {i} ")
    df = read_docx_table(document, i, nheader)
    df = df.replace('', np.nan)
    print(df)
    if (df.dropna().empty):
        print(f'Empty DataFrame. Table Index = {i}')
        print('Deleting Empty table...')
        #Delete_table(i)
        try:
            document.tables[i]._element.getparent().remove(document.tables[i]._element)
            print("Table deleted...")
        except Exception as e:
            print(e)
    else:
        print("DF is not empty...")
        print(df.size)
    i += 1

document.save('OUT.docx')
My INPUT docx has 3 tables:
But, my CODE gives me the following Output:
It is keeping the empty table and deleting the non-empty table.
Is there something I am missing? I am doubting my logic for checking whether a table is empty, if (df.dropna().empty):
The df.dropna().empty check treats a table as empty when none of its non-header rows is completely filled in, i.e. every row has at least one blank cell. Is that the intent? If so, then the check itself seems okay to me.
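A quick way to sanity-check that behaviour on toy data (column names invented for the example):

import pandas as pd
import numpy as np

fully_blank = pd.DataFrame({'A': ['', ''], 'B': ['', '']}).replace('', np.nan)
partly_blank = pd.DataFrame({'A': ['x', ''], 'B': ['', 'y']}).replace('', np.nan)
no_blanks = pd.DataFrame({'A': ['x'], 'B': ['y']})

print(fully_blank.dropna().empty)   # True: every row has a NaN, so all rows are dropped
print(partly_blank.dropna().empty)  # True as well: each row still contains one blank cell
print(no_blanks.dropna().empty)     # False: at least one row has no blank cells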
Two points:
The docx library does not necessarily return the tables in the order they appear in the document.
When you delete a table with your code, you immediately skip the next one to be returned (which, per the point above, might not be the next one in the document), because you increment your counter even after deleting. That means not all tables get processed.
As I worked through the logic I did some rearranging to understand it. I think you might also have been getting exceptions on indexing into the tables after deleting? I have included my edits below; hopefully they help.
import pandas as pd
from docx import Document
import numpy as np

def read_docx_table(document, table_num, header_rows):
    table = document.tables[table_num]
    data = [[cell.text for cell in row.cells] for row in table.rows]
    df = pd.DataFrame(data)
    if header_rows == 1:
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    elif header_rows == 2:
        outside_col, inside_col = df.iloc[0], df.iloc[1]
        h_index = pd.MultiIndex.from_tuples(list(zip(outside_col, inside_col)))
        df = pd.DataFrame(data, columns=h_index).drop(df.index[[0, 1]]).reset_index(drop=True)
    else:  # header_rows not 1 or 2
        print("More than two headers. Not Supported in Current version.")
        df = pd.DataFrame()
    return df

def table_is_empty(document, table_num, header_rows):
    df = read_docx_table(document, table_num, header_rows)
    df = df.replace('', np.nan)
    return df.dropna().empty

def delete_table(document, table_num):
    document.tables[table_num]._element.getparent().remove(document.tables[table_num]._element)

document = Document('tmp.docx')
header_rows = 1
table_count = len(document.tables)
print(f"Number of tables in the input document is: {table_count}")

table_num = 0
while table_num < table_count:
    if table_is_empty(document, table_num, header_rows):
        delete_table(document, table_num)
        table_count = len(document.tables)
    else:
        table_num += 1

print(f"Number of tables in the output document is: {table_count}")
document.save('OUT.docx')
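One design note on the rewrite, in case it is not obvious: the loop only advances table_num when the current table is kept, and it re-reads len(document.tables) after every deletion, so the index never skips past a table the way the always-incrementing counter did.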
The task is to wrap URLs in an Excel file with an HTML tag.
For this, I have a function and the following code that works for one column, named ANSWER:
import pandas as pd
import numpy as np
import string
import re

def hyperlinksWrapper(myString):
    # finding all substrings that look like a URL
    URLs = re.findall(r"(?P<url>https?://[^','')'' ''<'';'\s\n]+)", myString)
    # print(URLs)
    # replacing each URL by a link wrapped into <a> html-tags
    for link in URLs:
        wrappedLink = '<a href="' + link + '">' + link + '</a>'
        myString = myString.replace(link, wrappedLink)
    return(myString)

# Opening the original XLSX file
filename = "Excel.xlsx"
df = pd.read_excel(filename)

# Filling all the empty cells in the ANSWER column with the value "n/a"
df.ANSWER.replace(np.NaN, "n/a", inplace=True)

# Going through the ANSWER column and applying hyperlinksWrapper to each cell
for i in range(len(df.ANSWER)):
    df.ANSWER[i] = hyperlinksWrapper(df.ANSWER[i])

# Export to Excel
df.to_excel('Excel_refined.xlsx')
The question is: how do I look not at just one column, but at all the columns (each cell) in the dataframe, without specifying the exact column names?
Perhaps you're looking for something like this:
import pandas as pd
import numpy as np
import string
import re

def hyperlinksWrapper(myString):
    # finding all substrings that look like a URL
    URLs = re.findall(r"(?P<url>https?://[^','')'' ''<'';'\s\n]+)", myString)
    # print(URLs)
    # replacing each URL by a link wrapped into <a> html-tags
    for link in URLs:
        wrappedLink = '<a href="' + link + '">' + link + '</a>'
        myString = myString.replace(link, wrappedLink)
    return(myString)

# dummy dataframe
df = pd.DataFrame(
    {'answer_col1': ['https://example.com', 'https://example.org', np.nan],
     'answer_col2': ['https://example.net', 'Hello', 'World']}
)

# as suggested in the comments (replaces all NaNs in df)
df.fillna("n/a", inplace=True)

# option 1
# loops over every column of df
for col in df.columns:
    # applies hyperlinksWrapper to every row in col
    df[col] = df[col].apply(hyperlinksWrapper)

# [UPDATED] option 2
# applies hyperlinksWrapper to every element of df
df = df.applymap(hyperlinksWrapper)
df.head()
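One hedged caveat about option 2: in pandas 2.1 and later, DataFrame.applymap is deprecated in favour of DataFrame.map, so on newer versions the element-wise call would look like this:

# pandas >= 2.1: element-wise equivalent of applymap
df = df.map(hyperlinksWrapper)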
I have a problem with multi-index column names. I'm using xlrd to convert Excel data to JSON with json.dumps, but it only gives me a single row of column names. I have read about multi-level JSON, but I have no idea how to do it using xlrd.
Here is a sample of my table's column names:
Sample of code:
for i in path:
    with xlrd.open_workbook(i) as wb:
        print([i])
        kwd = 'sage'
        print(wb.sheet_names())
        for j in range(wb.nsheets):
            worksheet = wb.sheet_by_index(j)
            data = []
            n = 0
            nn = 0
            keyword = 'sage'
            keyword2 = 'adm'
            try:
                skip = skip_row(worksheet, n, keyword)
                keys = [v.value for v in worksheet.row(skip)]
            except:
                try:
                    skip = skip_row2(worksheet, nn, keyword2)
                    keys = [v.value for v in worksheet.row(skip)]
                except:
                    continue
            print(keys)
            for row_number in range(check_skip(skip), worksheet.nrows):
                if row_number == 0:
                    continue
                row_data = {}
                for col_number, cell in enumerate(worksheet.row(row_number)):
                    row_data[keys[col_number]] = cell.value
                data.append(row_data)
            print(json.dumps({'Data': data}))
Oh, and by the way: each worksheet has a different number of rows to skip before the column names, which is why my code has the skip-row functions. After skipping those rows I find the exact location of my column names, and then I start to read the values. But that is where the problem comes from, in my view, because I actually have two rows of column names, and I am still confused about how to do a multi-level JSON with xlrd, or at least how to join the two header rows with xlrd (which I guess it can't).
Desired outcome, a multi-level JSON:
{ "Data": [{ "ID": "997", "Tax": [{"Date": "9/7/2019", "Total": 2300, "Grand Total": 340000}], "Tax ID": "ST-000", .... }]}
PS: I've tried to use pandas, but it gives me a lot of trouble since I work with big data.
You can use multi-indexing in pandas; first you need to get the header row indexes for each sheet.
header_indexes = get_header_indexes(excel_filepath, sheet_index) #returns list of header indexes
You need to write the get_header_indexes function yourself; it scans a sheet and returns the header row indexes.
Then you can use pandas to get the JSON from the dataframe:
import pandas as pd
df = pd.read_excel(excel_filepath, header=header_indexes, sheet_name=sheet_index)
data = df.to_dict(orient="records")
For multiple header rows, data contains a list of dicts where each key is a tuple; you can reformat it into the final JSON as per your requirement.
Note: for very large files, consider reading and converting the data in chunks to keep memory usage down.
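As a rough sketch of that last reformatting step, under the assumption that the second header level is blank for single-level columns (in practice pandas may label those something like 'Unnamed: 0_level_1', so adjust the test accordingly); the field names below simply mirror the desired output in the question:

import json

# roughly what df.to_dict(orient="records") yields with a two-row header:
# every key is a (top_level, sub_level) tuple
records = [
    {("ID", ""): "997",
     ("Tax", "Date"): "9/7/2019",
     ("Tax", "Total"): 2300,
     ("Tax", "Grand Total"): 340000,
     ("Tax ID", ""): "ST-000"},
]

data = []
for rec in records:
    item = {}
    for (top, sub), value in rec.items():
        if sub:                                    # second-level name present: nest it
            item.setdefault(top, [{}])[0][sub] = value
        else:                                      # single-level column: keep it flat
            item[top] = value
    data.append(item)

print(json.dumps({"Data": data}))
# {"Data": [{"ID": "997", "Tax": [{"Date": "9/7/2019", "Total": 2300, "Grand Total": 340000}], "Tax ID": "ST-000"}]}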
I have a big Excel sheet with information about different companies, all together in a single cell for each company, and my goal is to separate this into different columns by following patterns in order to scrape the info from the first column. The original data looks like this:
My goal is to achieve a dataframe like this:
I have created the following code to use the patterns Mr., Affiliation:, E-mail:, and Mobile because they are repeated in every single row the same way. However, I don't know how to use the findall() function to scrape all the info I want from each row of the desired column.
import openpyxl
import re
import sys
import pandas as pd

reload(sys)
sys.setdefaultencoding('utf8')

wb = openpyxl.load_workbook('/Users/ap/info1.xlsx')
ws = wb.get_sheet_by_name('Companies')
w = {'Name': [], 'Affiliation': [], 'Email': []}
for row in ws.iter_rows('C{}:C{}'.format(ws.min_row, ws.max_row)):
    for cells in row:
        aa = cells.value
        a = re.findall(r'Mr.(.*?)Affiliation:', aa, re.DOTALL)
        a1 = "".join(a).replace('\n', ' ')
        b = re.findall(r'Affiliation:(.*?)E-mail', aa, re.DOTALL)
        b1 = "".join(b).replace('\n', ' ')
        c = re.findall(r'E-mail(.*?)Mobile', aa, re.DOTALL)
        c1 = "".join(c).replace('\n', ' ')
        w['Name'].append(a1)
        w['Affiliation'].append(b1)
        w['Email'].append(c1)
        print(cells.value)
df = pd.DataFrame(data=w)
df.to_excel(r'/Users/ap/info2.xlsx')
I would go with this, which just replaces the 'Affiliation:', 'E-mail:' and 'Mobile:' labels with a delimiter and then splits and assigns the pieces to the right columns:
df['Name'] = np.nan
df['Affiliation'] = np.nan
df['Email'] = np.nan
df['Mobile'] = np.nan

for i in range(0, len(df)):
    full_value = df['Companies'].loc[i]
    full_value = full_value.replace('Affiliation:', ';').replace('E-mail:', ';').replace('Mobile:', ';')
    full_value = full_value.split(';')
    df.loc[i, 'Name'] = full_value[0]
    df.loc[i, 'Affiliation'] = full_value[1]
    df.loc[i, 'Email'] = full_value[2]
    df.loc[i, 'Mobile'] = full_value[3]

del df['Companies']
print(df)
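If you would rather keep a single regular expression, closer to your findall attempt, a hedged alternative is pandas' Series.str.extract with named groups. The sketch below assumes you start from the original dataframe (with the Companies column still present) and that every cell really does contain the Mr. / Affiliation: / E-mail: / Mobile: markers in that order:

import re
import pandas as pd

pattern = (r'Mr\.(?P<Name>.*?)'
           r'Affiliation:(?P<Affiliation>.*?)'
           r'E-mail:(?P<Email>.*?)'
           r'Mobile:(?P<Mobile>.*)')

# each named group becomes a column; DOTALL lets the fields span line breaks
extracted = df['Companies'].str.extract(pattern, flags=re.DOTALL)
extracted = extracted.apply(lambda col: col.str.strip())
df = df.drop(columns='Companies').join(extracted)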
I think I have a fairly simple question for a Python expert. With a lot of struggling I put together the code below. I am opening an Excel file, transforming it into a list of lists and adding a column to this list of lists. Now I want to rename and recalculate the values of this added column. How do I write it so that I always take the last column of the list of lists, even though the number of columns can differ?
import xlrd
file_location = "path"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
data = [x + [0] for x in data]
If you have a function called calculate_value that takes a row and returns the value for that row, you could do it like this:
def calculate_value(row):
    # calculate it...
    return value

def add_calculated_column(rows, func):
    result_rows = []
    for row in rows:
        # create a new row to avoid changing the old data
        new_row = row + [func(row)]
        result_rows.append(new_row)
    return result_rows

data_with_column = add_calculated_column(data, calculate_value)
I found an easier and more flexible way of adjusting the values in the last column.
counter = 0
for row in data:
    counter = counter + 1
    if counter == 1:
        value = 'Vrspng'
    else:
        value = counter
    row.append(value)
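To come back to the 'always take the last column' part of the question: negative indexing works for rows with any number of columns, so the added column can be renamed and recalculated in place. A small sketch with invented values and a placeholder recalculation rule:

data = [['header_a', 'header_b', 0],
        ['a', 10, 0],
        ['b', 20, 0]]

for i, row in enumerate(data):
    if i == 0:
        row[-1] = 'Vrspng'     # rename the header of the added (last) column
    else:
        row[-1] = row[1] * 2   # recalculate the last column from another column (placeholder rule)

print(data)  # [['header_a', 'header_b', 'Vrspng'], ['a', 10, 20], ['b', 20, 40]]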