I have a series of HTML files stored in a local folder ("destination folder"). These HTML files all contain a number of tables. What I want to do is locate the tables I'm interested in using keywords, grab those tables in their entirety, write them to a text file, and save that file to the same local folder ("destination folder").
This is what I have for now:
import re
from bs4 import BeautifulSoup

filename = open('filename.txt', 'r')
soup = BeautifulSoup(filename, "lxml")
data = []
for keyword in keywords.split(','):
    u = 1
    try:
        txtfile = destinationFolder + ticker + '_' + companyname[:10] + '_' + item[1] + '_' + item[3] + '_' + keyword + u + '.txt'
        mots = soup.find_all(string=re.compile(keyword))
        for mot in mots:
            for row in mot.find("table").find_all("tr"):
                data = [cell.get_text(strip=True) for cell in row.find_all("td")]
                data = data.get_string()
                with open(txtfile, 'wb') as t:
                    t.write(data)
                txtfile.close()
            u = u + 1
    except:
        pass
filename.close()
Not sure what's happening in the background, but I don't get my txt file at the end like I'm supposed to. The process doesn't fail; it runs its course to the end, but the txt file is nowhere to be found in my local folder when it's done. I'm sure I'm looking in the correct folder: the same path is used elsewhere in my code and works fine.
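For reference, here is a minimal sketch of the same idea with the bare except removed so that errors actually surface, assuming keywords, destinationFolder, ticker, companyname and item are defined as in the original code. find_all(string=...) returns text nodes, so the enclosing table is reached with find_parent("table"), and the counter u is converted with str(u) before being concatenated into the file name:

import re
from bs4 import BeautifulSoup

with open('filename.txt', 'r') as f:  # the source HTML file
    soup = BeautifulSoup(f, "lxml")

for keyword in keywords.split(','):
    u = 1
    # every text node that contains the keyword
    for mot in soup.find_all(string=re.compile(keyword)):
        table = mot.find_parent("table")  # the table that encloses the match
        if table is None:
            continue
        rows = []
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
            rows.append('\t'.join(cells))
        txtfile = (destinationFolder + ticker + '_' + companyname[:10] + '_'
                   + item[1] + '_' + item[3] + '_' + keyword + str(u) + '.txt')
        with open(txtfile, 'w') as t:  # text mode; closed automatically by the with block
            t.write('\n'.join(rows))
        u += 1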
I have an Excel file with a column containing over 4000 URLs, each in a different cell. I need to use Python to open each URL in Chrome, scrape some of the data from the website, and paste it into Excel, then do the same for the next URL. Could you please help me with that?
Export the Excel file to a CSV file and read the data from it like this:
def data_collector(url):
    # do your scraping here and return the data that you want to write in place of the url
    return url

with open("myfile.csv") as fobj:
    content = fobj.read()

# the line below returns the urls as a list
urls = content.replace(",", " ").split()

for url in urls:
    data_to_be_write = data_collector(url)
    # extra quotes are added to prevent the csv from breaking; it is better to
    # use the csv module to write csv files, but for ease of understanding
    # I did it like this, hoping you will adapt it yourself
    content = "\"" + content.replace(url, data_to_be_write) + "\""

with open("new_file.csv", "wt") as fnew:
    fnew.write(content)
After running this code you will get new_file.csv; open it with Excel and you will see your desired data in place of each URL.
If you want to keep the URL together with its data, just append the data to the URL in the string, separated by a colon.
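Since the comment above already points at the csv module, here is a rough sketch of the same idea using csv.reader and csv.writer instead of raw string replacement. It assumes myfile.csv has one URL per row in the first column and reuses the data_collector function from the snippet above:

import csv

def data_collector(url):
    # do your scraping here and return the data for this url
    return url

with open("myfile.csv", newline="") as fobj:
    reader = csv.reader(fobj)
    rows = [row for row in reader if row]  # skip empty rows

with open("new_file.csv", "w", newline="") as fnew:
    writer = csv.writer(fnew)
    for row in rows:
        url = row[0]
        writer.writerow([url, data_collector(url)])  # keep the url and its data side by side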
I am trying to develop a Python script for my data engineering project, and I want to loop over 47 URLs stored in a dataframe, download the CSV file behind each one, and store it on my local machine. Below is an example using one of the URLs:
import requests

test_url = "https://data.cdc.gov/api/views/pj7m-y5uh/rows.csv?accessType=DOWNLOAD"
req = requests.get(test_url)
url_content = req.content
csv_file = open('cdc6.csv', 'wb')
csv_file.write(url_content)
csv_file.close()
This works for a single file, but instead of opening a CSV file and writing the data into it by hand for each URL, I want to download all the files directly and save them to my local machine.
You want to iterate and then download each file to a folder. Iteration is easy using the .items() method on a pandas Series (here, the column of URLs) and passing it into a loop; see the pandas documentation.
Then you want to download each item. urllib.request has a urlretrieve(url, filename) function for downloading a hosted file to a local file, which is elaborated on in the urllib documentation.
Your code may look like:
import urllib.request

for index, url in url_df.items():
    urllib.request.urlretrieve(url, "cdcData" + str(index) + ".csv")
or if you want to preserve the original names:
for index, url in url_df.items():
    name = url.split("/")[-1]
    urllib.request.urlretrieve(url, name)
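Alternatively, staying with the requests approach from the question, a rough sketch of the same loop might look like this (assuming url_df is the pandas Series holding the 47 URLs):

import requests

def download_csv(url, filename):
    # fetch the response and write it to disk in binary mode
    r = requests.get(url)
    r.raise_for_status()  # fail loudly on a bad status code
    with open(filename, 'wb') as f:
        f.write(r.content)

for index, url in url_df.items():
    download_csv(url, "cdcData" + str(index) + ".csv")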
I am re-framing an existing question for simplicity. I have the following code to download Excel files from a company Share Point site.
import requests
import pandas as pd
def download_file(url):
    filename = url.split('/')[-1]
    r = requests.get(url)
    with open(filename, 'wb') as output_file:
        output_file.write(r.content)

df = pd.read_excel(r'O:\Procurement Planning\QA\VSAF_test_macro.xlsm')
df['Name'] = 'share_point_file_path_documentName'  # I'm appending the SP file path to the document name
file = df['Name']  # I only need the file path column, I don't need the rest of the dataframe

# for loop for download
for url in file:
    download_file(url)
The downloads happen and I don't get any errors in Python; however, when I try to open the files I get an error from Excel saying it cannot open the file because the file format or extension is not valid. If I print a link in Jupyter Notebooks it does open correctly, so the issue appears to be with the download.
Check r.status_code. This must be 200, or you have the wrong URL or no permission.
Open the downloaded file in a text editor. It might be an HTML file (Office Online).
If the URL contains a web=1 query parameter, remove it or replace it with web=0.
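A small sketch of those checks in code, adapting the download_file function from the question (the exact headers SharePoint returns may vary, so treat this as a starting point):

import requests

def download_file(url):
    filename = url.split('/')[-1]
    r = requests.get(url)
    print(url, r.status_code)  # anything other than 200 means a wrong URL or missing permission
    content_type = r.headers.get('Content-Type', '')
    if 'text/html' in content_type:
        # an HTML page (e.g. Office Online or a login page) was returned instead of the workbook
        print('Got HTML instead of the file:', url)
        return
    with open(filename, 'wb') as output_file:
        output_file.write(r.content)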
I have a folder with lots of .txt files. I want to merge all the .txt files into a single .csv file, line by line/row by row.
I have tried the following Python code; it works fine, but I have to change the .txt file name each time to add that file's content as a row in the .csv.
import re
import csv
from bs4 import BeautifulSoup
raw_html = open('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/fsdl.txt')
cleantext = BeautifulSoup(raw_html, "lxml").text
#print(cleantext)
print (re.sub('\s+',' ', cleantext))
#appending to csv as row
row = [re.sub('\s+',' ', cleantext)]
with open('LT_Corpus.csv', 'a') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(row)
csvFile.close()
I am hoping for a better and faster way to automate the process without changing file names by hand. Any recommendation is welcome.
Accessing a list of filenames
The following should get you closer to what you want.
import os will give you access to the os.listdir() function that lists all the files in a directory. You may need to provide the path to your data folder, if the data files are not in the same folder as your script.
This should look something like:
os.listdir('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/')
Using all the filenames in that directory, you can then open each one individually, by parsing through them with a for loop.
import re
import csv
from bs4 import BeautifulSoup
import os
filenames = os.listdir('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/')
for file in filenames:
    raw_html = open('/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/' + file)
    cleantext = BeautifulSoup(raw_html, "lxml").text
    output = re.sub('\s+', ' ', cleantext)  # saved the result using a variable
    print(output)                           # the variable can be reused
    row = [output]                          # as needed, in different contexts
    with open('LT_Corpus.csv', 'a') as csvFile:
        writer = csv.writer(csvFile)
        writer.writerow(row)
Several other nuances: I removed the csvFile.close() call at the end. When using with context managers, the file is closed automatically when you leave the scope of the context manager's code block (i.e. the indented section below the with statement). Having said this, there might be merit to simply opening the csv file once, leaving it open while you open the txt files one by one and write their content to it, and only closing the csv at the very end, as sketched below.
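A rough sketch of that variant, with the csv file opened once outside the loop (same hard-coded folder path as above, adjust as needed):

import re
import csv
import os
from bs4 import BeautifulSoup

folder = '/home/erdal/Dropbox/Marburg/LA/LT_CORPUS/'

# open the csv once and keep it open while all the txt files are processed
with open('LT_Corpus.csv', 'a', newline='') as csvFile:
    writer = csv.writer(csvFile)
    for file in os.listdir(folder):
        if not file.endswith('.txt'):
            continue  # skip anything that is not a .txt file
        with open(os.path.join(folder, file)) as raw_html:
            cleantext = BeautifulSoup(raw_html, "lxml").text
        writer.writerow([re.sub(r'\s+', ' ', cleantext)])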
Hi, I am getting all the folders like this:
entries=dbx.files_list_folder('').entries
print (entries[1].name)
print (entries[2].name)
But I am unable to locate the files inside these folders. I searched on the internet, but so far I haven't found a working function.
After listing entries using files_list_folder (and files_list_folder_continue), you can check the type, and then download them if desired using files_download, like this:
entries = dbx.files_list_folder('').entries
for entry in entries:
    if isinstance(entry, dropbox.files.FileMetadata):  # this entry is a file
        md, res = dbx.files_download(entry.path_lower)
        print(md)                # this is the metadata for the downloaded file
        print(len(res.content))  # `res.content` contains the file data
Note that this code sample doesn't properly paginate using files_list_folder_continue nor does it contain any error handling.
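For completeness, a hedged sketch of what pagination could look like, assuming dbx is an authenticated dropbox.Dropbox client; recursive=True also descends into subfolders:

import dropbox

def list_all_entries(dbx, path=''):
    # collect entries across pages using files_list_folder_continue
    result = dbx.files_list_folder(path, recursive=True)
    entries = list(result.entries)
    while result.has_more:
        result = dbx.files_list_folder_continue(result.cursor)
        entries.extend(result.entries)
    return entries

for entry in list_all_entries(dbx):
    if isinstance(entry, dropbox.files.FileMetadata):
        md, res = dbx.files_download(entry.path_lower)
        print(md.name, len(res.content))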
There are two possible ways to do that: either you can write the content to a file, or you can create a link (either one that redirects to the browser, or just a downloadable link).
First way:
metadata, response = dbx.files_download(file_path+filename)
with open(metadata.name, "wb") as f:
    f.write(response.content)
Second way:
link = dbx.sharing_create_shared_link(file_path+filename)
print(link.url)
If you want the link to be directly downloadable, replace the dl=0 query parameter with dl=1:
path = link.url.replace("?dl=0", "?dl=1")
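As a follow-up, a small sketch of how such a direct-download link could then be fetched with requests (assuming file_path and filename are defined as above):

import requests

link = dbx.sharing_create_shared_link(file_path + filename)
direct_url = link.url.replace("?dl=0", "?dl=1")  # dl=1 serves the raw file instead of the preview page

r = requests.get(direct_url)
r.raise_for_status()
with open(filename, "wb") as f:
    f.write(r.content)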