I have a CSV file that I am using to search uniprot.org for multiple variants of a protein; an example search is the following URL:
https://www.uniprot.org/uniprot/?query=KU168294+env&sort=score
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv

df = pd.read_csv('Env_seq_list.csv')
second_column_df = df['Accession']
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df_list = pd.read_html(page)
    df = df_list[-1]
    print(df.loc[df['Gene names'] == 'env'])
The print call works fine and I get back the tables I'm after. I'm stuck at this point because I cannot get df.to_csv to work alongside df.loc, and a plain df.to_csv only writes the last search result to the .csv, which I'm pretty sure is because that call sits inside the for loop. I'm unsure how to fix this. Any help would be greatly appreciated :-)
I would suggest that you take the df you find each time through the loop, and append it to a 'final' df. Then outside the loop, you can run to_csv on that 'final' df. Code below:
final_df = pd.DataFrame()
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df_list = pd.read_html(page)
    df = df_list[-1]
    # print(df.loc[df['Gene names'] == 'env'])
    final_df = pd.concat([final_df, df.loc[df['Gene names'] == 'env']], axis=0)
final_df.to_csv('/path/to/save/csv')
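As a side note, concatenating into final_df on every pass re-copies the accumulated frame each time. Collecting the filtered results in a list and calling pd.concat once after the loop scales better. A minimal sketch of that pattern, using small made-up frames in place of the scraped UniProt tables:

```python
import pandas as pd

# Made-up frames standing in for the per-accession search results
results = [
    pd.DataFrame({'Gene names': ['env', 'gag'], 'Entry': ['A1', 'A2']}),
    pd.DataFrame({'Gene names': ['env'], 'Entry': ['B1']}),
]

frames = []
for df in results:
    # Keep only the rows for the 'env' gene, as in the original loop
    frames.append(df.loc[df['Gene names'] == 'env'])

# One concat after the loop instead of one per iteration
final_df = pd.concat(frames, ignore_index=True)
print(final_df)
```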
I have data like this. What I am trying to do is create a rule based on domain names for my project. I want to create a new column named new_url based on the domains: if a domain contains .cdn. it should take the string before .cdn., otherwise it should call the url parser library and parse the URL another way. The problem is that in the CSV file I create (cleanurl.csv) there is no new_url column. When I print the parsed URLs in the code I can see them, and the if and else branches are working. Could you help me please?
import pandas as pd
import url_parser
from url_parser import parse_url, get_url, get_base_url
import numpy as np

df = pd.read_csv("C:\\Users\\myuser\\Desktop\\raw_data.csv", sep=';')
i = -1
for x in df['domain']:
    i = i + 1
    print("*", x, "*")
    if '.cdn.' in x:
        parsed_url = x.split('.cdn')[0]
        print(parsed_url)
        df.iloc[i]['new_url'] = parsed_url
    else:
        parsed_url = get_url(x).domain + '.' + get_url(x).top_domain
        print(parsed_url)
        df.iloc[i]['new_url'] = parsed_url
df.to_csv("C:\\Users\\myuser\\Desktop\\cleanurl.csv", sep=';')
Use .loc[row, 'column'] to create the new column. Chained indexing like df.iloc[i]['new_url'] = ... assigns into a temporary copy of the row, so the original frame is never modified:

for idx, x in df['domain'].items():
    if '.cdn.' in x:
        df.loc[idx, 'new_url'] = x.split('.cdn')[0]
    else:
        df.loc[idx, 'new_url'] = get_url(x).domain + '.' + get_url(x).top_domain
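The difference between the two assignment styles can be shown in a few lines. This sketch uses a tiny made-up frame; the first write silently lands on a temporary, the second creates the column for real:

```python
import pandas as pd

df = pd.DataFrame({'domain': ['a.cdn.example.com', 'example.org']})

# Chained indexing: df.iloc[0] hands back an intermediate Series, so the
# assignment lands on that temporary and the frame itself never changes.
df.iloc[0]['new_url'] = 'a'
created_by_chained = 'new_url' in df.columns   # False

# .loc with (row label, column label) writes into the frame directly,
# creating the column on first assignment.
df.loc[0, 'new_url'] = 'a'
created_by_loc = 'new_url' in df.columns       # True

print(created_by_chained, created_by_loc)
```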
I'm trying to convert an online PDF into pandas so I can work with and manipulate it. I know that Tabula automatically converts into a pandas DataFrame, but the issue I am having is that the first page has an extra column in the middle containing a column name, with every row below it converting to NaN; the other pages don't have that problem.
My thought is to fix up the first page and then merge it with the rest, after adjusting the column headers and such, but the problem I am running into is that I am unable to tell tabula to look at pages 2-99 and ignore page 1.
Any help would be most appreciated:
import tabula
import pandas as pd

url = 'https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_Tables.pdf'
df = tabula.read_pdf(url, pages=1)[0]
dfs = df
dfs_cols = ['Metropolitan Statistical Area', 'N/A', 'National Ranking*', '1-Year', 'Quarter', '5-Year']
del dfs['N/A']

df = tabula.read_pdf(url, pages='2-99')[0]
dfa = df
dfa
When I run this second read_pdf call it returns an error and I can't get past it.
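One thing worth noting is that tabula.read_pdf returns a list of DataFrames (typically one per detected table), so pages='2-99' should hand back many frames rather than one. The merge step the question describes, dropping the spurious page-1 column and then concatenating everything, can be sketched with small made-up frames standing in for the real tabula output:

```python
import pandas as pd

# Made-up frames standing in for tabula.read_pdf(url, pages=...) results;
# page 1 has the extra all-NaN middle column described in the question.
page1 = pd.DataFrame({'MSA': ['Akron, OH'], 'Rank': [100], 'N/A': [float('nan')]})
later_pages = [
    pd.DataFrame({'MSA': ['Boston, MA'], 'Rank': [5]}),
    pd.DataFrame({'MSA': ['Chicago, IL'], 'Rank': [40]}),
]

# Drop the spurious middle column from page 1, then align and concatenate
page1 = page1.drop(columns=['N/A'])
full = pd.concat([page1] + later_pages, ignore_index=True)
print(full)
```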
I have a big Excel sheet with information about different companies, all together in a single cell for each company, and my goal is to separate this into different columns by following patterns to scrape the info from the first column. The original data looks like this:
My goal is to achieve a dataframe like this:
I have created the following code to use the patterns Mr., Affiliation:, E-mail:, and Mobile, because they are repeated the same way in every single row. However, I don't know how to use the findall() function to scrape all the info I want from each row of the desired column.
import openpyxl
import re
import sys
import pandas as pd

reload(sys)
sys.setdefaultencoding('utf8')

wb = openpyxl.load_workbook('/Users/ap/info1.xlsx')
ws = wb.get_sheet_by_name('Companies')
w = {'Name': [], 'Affiliation': [], 'Email': []}
for row in ws.iter_rows('C{}:C{}'.format(ws.min_row, ws.max_row)):
    for cells in row:
        aa = cells.value
        a = re.findall(r'Mr.(.*?)Affiliation:', aa, re.DOTALL)
        a1 = "".join(a).replace('\n', ' ')
        b = re.findall(r'Affiliation:(.*?)E-mail', aa, re.DOTALL)
        b1 = "".join(b).replace('\n', ' ')
        c = re.findall(r'E-mail(.*?)Mobile', aa, re.DOTALL)
        c1 = "".join(c).replace('\n', ' ')
        w['Name'].append(a1)
        w['Affiliation'].append(b1)
        w['Email'].append(c1)
        print cells.value
df = pd.DataFrame(data=w)
df.to_excel(r'/Users/ap/info2.xlsx')
I would go with this, which just replaces 'Affiliation:', 'E-mail:' and 'Mobile:' with a delimiter, then splits and assigns each piece to the right column:
import numpy as np

df['Name'] = np.nan
df['Affiliation'] = np.nan
df['Email'] = np.nan
df['Mobile'] = np.nan

for i in range(0, len(df)):
    full_value = df['Companies'].loc[i]
    full_value = full_value.replace('Affiliation:', ';').replace('E-mail:', ';').replace('Mobile:', ';')
    full_value = full_value.split(';')
    df.loc[i, 'Name'] = full_value[0]
    df.loc[i, 'Affiliation'] = full_value[1]
    df.loc[i, 'Email'] = full_value[2]
    df.loc[i, 'Mobile'] = full_value[3]

del df['Companies']
print(df)
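The same splitting can also be done in one vectorized pass with Series.str.extract and named groups, assuming every cell really follows the Mr./Affiliation:/E-mail:/Mobile pattern. A sketch with made-up rows:

```python
import pandas as pd

# Made-up 'Companies' cells shaped like the ones described in the question
df = pd.DataFrame({'Companies': [
    'Mr. John Doe Affiliation: Acme Corp E-mail: jd@acme.com Mobile: 123',
    'Mr. Jane Roe Affiliation: Foo Ltd E-mail: jr@foo.com Mobile: 456',
]})

# One vectorized extract instead of a Python loop; each named group
# becomes a column of the resulting DataFrame.
pattern = (r'Mr\.\s*(?P<Name>.*?)\s*Affiliation:\s*(?P<Affiliation>.*?)'
           r'\s*E-mail:\s*(?P<Email>.*?)\s*Mobile:\s*(?P<Mobile>.*)')
out = df['Companies'].str.extract(pattern)
print(out)
```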
I want to download a number of .csv files, convert them to pandas DataFrames, and append them to each other.
The CSVs can be accessed via a URL that is created each day; using datetime the URLs can easily be generated and put in a list.
I am able to open each of these individually.
But when I try to open a number of them and append them together I get an empty DataFrame. The code looks like this:
# Imports
import datetime
import pandas as pd

# Testing that a single .csv file opens
data = pd.read_csv('https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin01022018.csv')
data.iloc[:5]

# Taking headings to use to create new dataframe
data_headings = list(data.columns.values)

# Setting up strings for the url
path_start = 'https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin'
file = ".csv"

# Getting dates which are used in the url
start = datetime.datetime.strptime("01-02-2018", "%d-%m-%Y")
end = datetime.datetime.strptime("04-02-2018", "%d-%m-%Y")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]

# Creating new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)

# Creating list of urls
date_list = []
for date in date_generated:
    date_string = date.strftime("%d%m%Y")
    x = path_start + date_string + file
    date_list.append(x)

# Opening and appending csv files from the list of urls
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df.append(data_link)
print(df)
I have checked the CSV files themselves and they are not empty. Any help would be appreciated.
Cheers,
Sandy
You are never storing the appended dataframe. The line:
df.append(data_link)
Should be
df = df.append(data_link)
However, this may be the wrong approach. You really want to use the array of URLs and concatenate them. Check out this similar question and see if it can improve your code!
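That concatenation approach can be sketched as follows, with io.StringIO buffers standing in for the daily Betfair URLs (pd.read_csv accepts either):

```python
import io
import pandas as pd

# StringIO buffers standing in for the daily CSV URLs; read_csv reads both
fake_urls = [
    io.StringIO('horse,price\nA,2.5\nB,3.0\n'),
    io.StringIO('horse,price\nC,4.0\n'),
]

# Read every source, then concatenate once instead of appending in a loop
frames = [pd.read_csv(src) for src in fake_urls]
df = pd.concat(frames, ignore_index=True)
print(df)
```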
I really can't understand what you wanted to do here:

# Creating new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)
By the way, try this:

for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df = df.append(data_link.copy())
Hello, I am trying to read in multiple files, create a DataFrame of the specific key information I need from each file, and then append each of those DataFrames to a main DataFrame called topics. I have tried the following code.
import pandas as pd
import numpy as np
from lxml import etree
import os

topics = pd.DataFrame()
for filename in os.listdir('./topics'):
    if not filename.startswith('.'):
        # print(filename)
        tree = etree.parse('./topics/' + filename)
        root = tree.getroot()
        childA = []
        elementT = []
        ElementA = []
        for child in root:
            elementT.append(str(child.tag))
            ElementA.append(str(child.attrib))
            childA.append(str(child.attrib))
            for element in child:
                elementT.append(str(element.tag))
                ElementA.append(str(element.attrib))
                childA.append(str(child.attrib))
                for sub in element:
                    # print('***', child.attrib, ':', element.tag, ':', element.attrib, '***')
                    elementT.append(str(sub.tag))
                    ElementA.append(str(sub.attrib))
                    childA.append(str(child.attrib))
        df = pd.DataFrame()
        df['c'] = np.array(childA)
        df['t'] = np.array(ElementA)
        df['a'] = np.array(elementT)
        file = df['t'].str.extract(r'([A-Z][A-Z].*[words.xml])#')
        start = df['t'].str.extract(r'words([0-9]+)')
        stop = df['t'].str.extract(r'.*words([0-9]+)')
        tags = df['a'].str.extract(r'.*([topic]|[pointer]|[child])')
        rootTopic = df['c'].str.extract(r'rdhillon.(\d+)')
        df['f'] = file
        df['start'] = start
        df['stop'] = stop
        df['tags'] = tags
        # c = topic, r = pointer, d = child
        df['topicID'] = rootTopic
        df = df.iloc[:, 3:]
        topics.append(df)
However, when I call topics I get the following (empty) output:

topics
Out[19]:

Can someone please let me know where I am going wrong? Any suggestions for improving my messy code would also be appreciated.
Unlike lists, when you append to a DataFrame you get back a new object. So topics.append(df) returns an object that you never store anywhere, and topics remains the empty DataFrame you declared before the loop. You can fix this with:
topics = topics.append(df)
However, appending to a DataFrame within a loop is a very costly exercise. Instead you should append each DataFrame to a list within the loop and call pd.concat() on the list of DataFrames after the loop.
import pandas as pd
topics_list = []
for filename in os.listdir('./topics'):
# All of your code
topics_list.append(df) # Lists are modified with append
# After the loop one call to concat
topics = pd.concat(topics_list)