Tabula only read/convert pages 2-16 - python

I'm trying to convert an online PDF into a pandas DataFrame so I can work with and manipulate it. I know that Tabula automatically converts a PDF into a pandas DataFrame, but the issue I am having is that the first page has an extra column in the middle that contains a column name, and every row below it comes through as NaN. The other pages don't have that problem.
My thought is to fix up the first page and then merge it with the rest after adjusting the column headers and such, but the problem I am running into is that I am unable to tell Tabula to look at pages 2-99 and ignore page 1.
Any help would be most appreciated:
import tabula
import pandas as pd

url = 'https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_Tables.pdf'
df = tabula.read_pdf(url, pages=1)[0]
dfs = df
dfs_cols = ['Metropolitan Statistical Area', 'N/A', 'National Ranking*', '1-Year', 'Quarter', '5-Year']
del dfs['N/A']

url = 'https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_Tables.pdf'
df = tabula.read_pdf(url, pages='2-99')[0]
dfa = df
dfa
When I run this second read_pdf call it returns an error and I can't get past it.
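One possible sketch of the approach described above, assuming the extra page-1 column is the second one and that every page shares the same five columns; note that read_pdf returns a list of DataFrames, so the pages='2-99' result has to be concatenated rather than indexed with [0]:
import tabula
import pandas as pd

url = 'https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_Tables.pdf'
cols = ['Metropolitan Statistical Area', 'National Ranking*', '1-Year', 'Quarter', '5-Year']

# Page 1: drop the extra middle column (assumed here to be column index 1)
first = tabula.read_pdf(url, pages=1)[0]
first = first.drop(columns=first.columns[1])
first.columns = cols

# Pages 2 onward: read_pdf returns one DataFrame per detected table,
# so concatenate the whole list instead of taking only element [0]
rest = pd.concat(tabula.read_pdf(url, pages='2-99'), ignore_index=True)
rest.columns = cols  # assumes each page's table has the same five columns

full = pd.concat([first, rest], ignore_index=True)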

Related

Scraping a table including hyperlinks and uploading it to a Google Sheet using python

I'm trying to scrape a table, keep any hyperlinks it contains, and then upload the result to a Google Sheet.
I have finished the first part using this code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# Define URL
url = 'https://www.marefa.org/%D9%82%D8%A7%D8%A6%D9%85%D8%A9_%D8%A3%D9%81%D8%B6%D9%84_%D8%A7%D9%84%D9%83%D8%AA%D8%A8_%D8%A7%D9%84%D8%B9%D8%B1%D8%A8%D9%8A%D8%A9'

# Ask the hosting server to fetch the url
response = requests.get(url)
print(response.status_code)
# If the output is 200, the server allows us to collect the page

soup = BeautifulSoup(response.content, 'lxml')
indiatable = soup.find('table', {'class': "wikitable"})
df = pd.read_html(str(indiatable))
# convert list to dataframe
df = pd.DataFrame(df[0])
df.head()
#df.to_excel("hi2.xlsx", index=False)
Can anyone give me a hint on how to finish the rest of the task?
I'm not sure how to write hyperlinks in a way that Sheets will recognize them as hyperlinks, but the following will create extra columns to hold the links:
indiatable_headers = [col for cols in [
    [chead, chead + ' [link]'] for chead in [
        c.get_text(strip=True) for c in
        indiatable.select_one('tr:has(th)').select('th')
    ]
] for col in cols]
# indiatable_headers = ['الترتيب', 'الترتيب [link]', 'الاسم', 'الاسم [link]', 'المؤلف', 'المؤلف [link]', 'البلد', 'البلد [link]', 'الرواية', 'الرواية [link]']

indiatable_data = [[
    col for cols in [[
        c.get_text(strip=True) if c.text.strip() else (
            c.img.get('alt', '') if c.find('img') else ''  # no text --> look for an image with an alt attribute
        ),
        'https://www.marefa.org' + c.a.get('href') if c.find('a') else None
    ] for c in r.select('td')] for col in cols
] for r in indiatable.select('tr') if not r.find('th')]

indiadf = pd.DataFrame(
    indiatable_data, columns=indiatable_headers
).dropna(axis='columns', how='all').set_index(indiatable_headers[0])
# drops empty columns (i.e. '[link]' columns with no links) and sets the first column as the index
After that, you can create a spreadsheet with indiadf.to_excel (DON'T use index=False this time unless you want to lose the الترتيب column, since that has now been set as the index), and then manually upload it...
If you want to automate the Google Sheets part as well, you can look into the Sheets API and modules like pydrive or pygsheets.
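As a rough, untested sketch of that automation with pygsheets (the key file and the spreadsheet title below are placeholders; you would need to create a service-account key and share the spreadsheet with that account):
import pygsheets

gc = pygsheets.authorize(service_file='service_account.json')  # service-account key (placeholder path)
sh = gc.open('Arabic books')                                   # spreadsheet title (placeholder)
wks = sh.sheet1
# set_dataframe writes the DataFrame to the worksheet starting at A1, headers included
wks.set_dataframe(indiadf.reset_index(), start='A1')
reset_index() is used here so the الترتيب column (now the index) is written out as an ordinary column.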

How to preserve complicated excel header formats when manipulating data using Pandas Python?

I am parsing a large Excel data file into another one, but the headers are very abnormal. I tried to use "read_excel skiprows" and that did not work. I also tried to include the header with
df = pd.read_excel(user_input, header=[1:3], sheet_name='PN Projection'), but then I get this error: "ValueError: cannot join with no overlapping index names." To get around this I tried to name the columns by location, and that did not work either.
When I run the code shown below, everything works fine, but past column "U" the header titles come out as "Unnamed: 1, 2, ...". I understand this is because pandas is treating the first row (which is empty there) as the header, but how do I fix this? Is there a way to preserve the headers without manually typing in the format for each cell? Any and all help is appreciated, thank you!
[screenshot: small section of the Excel file header]
The code I am trying to run:
#!/usr/bin/env python
import sys
import os
import pandas as pd
#load source excel file
user_input = input("Enter the path of your source excel file (omit 'C:'): ")
#reads the source excel file
df = pd.read_excel(user_input, sheet_name = 'PN Projection')
#Filtering dataframe
#Filters out rows with 'EOL' in column 'item status' and 'xcvr' in 'description'
df = df[~(df['Item Status'] == 'EOL')]
df = df[~(df['Description'].str.contains("XCVR", na=False))]
#Filters in rows with "XC" or "spartan" in 'description' column
df = df[(df['Description'].str.contains("XC", na=False) | df['Description'].str.contains("Spartan", na=False))]
print(df)
#Saving to a new spreadsheet called Filtered Data
df.to_excel('filtered_data.xlsx', sheet_name='filtered_data')
If you do not need the top 2 rows, then:
df = pd.read_excel(user_input, sheet_name='PN Projection', skiprows=range(0, 2))
This has worked for me when handling several strangely formatted files. Let me know if this isn't what you're looking for, or if there are additional issues.
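If the top rows do need to be kept as a combined header, a hedged alternative (not from the original answer) is passing a list of header rows to read_excel; the row numbers and the column-flattening step below are guesses that would need adjusting to the real sheet layout:
import pandas as pd

# Multi-row header: a list of row positions, e.g. the sheet's second and third rows
df = pd.read_excel(user_input, sheet_name='PN Projection', header=[1, 2])

# read_excel then builds a MultiIndex of columns; flatten it so lookups like
# df['Item Status'] keep working (dropping the 'Unnamed: ...' filler labels)
df.columns = [' '.join(str(part) for part in col if not str(part).startswith('Unnamed')).strip()
              for col in df.columns]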

Extract multiple page web table into Excel

I have a table that spans many pages. I'm able to pull the info from a designated page into a CSV table. My goal now is to have this iterate through all the pages and add each page's data to the bottom of the previous page's. Here is the code so far, which works on a single page:
import requests
import pandas as pd
url = 'https://www.mineralanswers.com/oklahoma/producers?page=1'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')
The page URL is set up in the "...producers?page=1, ...producers?page=2, ...producers?page=3" format, so I feel like it's likely possible using a loop; I'm just having trouble appending the data instead of overwriting it.
Here is corrected example code to fetch 3 pages and append them to one DataFrame.
import requests
import pandas as pd

df = pd.DataFrame()
for page in range(1, 4):
    url = 'https://www.mineralanswers.com/oklahoma/producers?page=' + str(page)
    html = requests.get(url).content
    df_list = pd.read_html(html)
    # pd.concat replaces DataFrame.append, which was removed in current pandas
    df = pd.concat([df, df_list[-1]], ignore_index=True)

df.to_csv('my data.csv')
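Since the question's goal is an Excel file rather than a CSV, one small optional addition (assuming an Excel writer engine such as openpyxl is installed; the filename is a placeholder) is:
df.to_excel('producers.xlsx', index=False)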

Extracting multiple tables into a .csv

I have a csv file, which I am using to search uniprot.org for multiple variants of a protein, an example of this is the following website:
https://www.uniprot.org/uniprot/?query=KU168294+env&sort=score
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv

df = pd.read_csv('Env_seq_list.csv')
second_column_df = df['Accession']
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df_list = pd.read_html(page)
    df = df_list[-1]
    print(df.loc[df['Gene names'] == 'env'])
If I just use the print call, it works fine and I get back the tables I'm after. I'm stuck at this point because if I instead use the pandas df.to_csv function I cannot seem to get it to work alongside df.loc. Additionally, simply using df.to_csv only writes the last search result to the .csv, which I'm pretty sure is because that call sits inside the for loop; however, I am unsure how to fix this. Any help would be greatly appreciated :-)
I would suggest that you take the df you find each time through the loop, and append it to a 'final' df. Then outside the loop, you can run to_csv on that 'final' df. Code below:
final_df = pd.DataFrame()
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df_list = pd.read_html(page)
    df = df_list[-1]
    # print(df.loc[df['Gene names'] == 'env'])
    final_df = pd.concat([final_df, df.loc[df['Gene names'] == 'env']], axis=0)

final_df.to_csv('/path/to/save/csv')

Read specific columns with pandas or other python module

I have a csv file from this webpage.
I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
59 which in the header is star_name
60 which in the header is ra.
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program from malfunctioning when the columns are moved around again in the future, as long as the column names stay the same.
Until now I have tried various approaches using the csv module and, recently, the pandas module, both without any luck.
EDIT (added two lines + the header of my datafile. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,
An easy way to do this is using the pandas library like this.
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name
The problem here was skipinitialspace, which removes the spaces in the header, so ' star_name' becomes 'star_name'.
According to the latest pandas documentation, you can read a csv file selecting only the columns you want to read.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = True)
Here we use usecols, which reads only the selected columns into the dataframe.
We use low_memory=True so that the file is processed internally in chunks.
The above answers are for Python 2, so for Python 3 users I am giving this answer. You can use the code below:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)
I got a solution to the above problem in a different way: although I still read the entire csv file, I tweak the display part to show only the desired content.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print(df[['star_name', 'ra']])
This can help in some scenarios when learning the basics and filtering data by the columns of a dataframe.
I think you need to try this method.
import pandas as pd
data_df = pd.read_csv('data.csv')
print(data_df['star_name'])
print(data_df['ra'])
