This link contains CSV files for daily reports of COVID-19.
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports
What is the best way to get all the CSV files into a single dataframe?
I tried the code below from other questions, but it doesn't work.
from pathlib import Path
import pandas as pd
files = Path('https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports')
csv_only = files.rglob('*.csv')
combo = [pd.read_csv(f)
           .assign(f.stem)
           .fillna(0)
         for f in csv_only]
one_df = pd.concat(combo,ignore_index=True)
one_df = one_df.drop_duplicates('date')
print(one_df)
How could I use requests to read all the files?
You can simply use the requests module to get the names of all the .csv files present, which eliminates the need for glob:
import requests
url = "https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports"
csv_only = [i.split("=")[1][1:-1] for i in requests.get(url).text.split(" ") if '.csv' in i and 'title' in i]
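With those names you can then read each report straight from GitHub's raw-content host and concatenate; here is a minimal sketch building on csv_only above (the raw.githubusercontent.com path is the standard raw mirror for this repo, and fetching every daily report will take a while):
import pandas as pd

# raw.githubusercontent.com serves the file contents directly (standard raw mirror of the repo)
raw_base = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/"
            "master/csse_covid_19_data/csse_covid_19_daily_reports/")

# read each daily report, tag it with its file name, and stack everything into one dataframe
combo = [pd.read_csv(raw_base + name).assign(source_file=name) for name in csv_only]
one_df = pd.concat(combo, ignore_index=True)
print(one_df)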
pathlib only works with filesystems, so this won't do. csv_only will be an empty generator, since there is no such location on your disk. You need to fetch the data from GitHub with actual HTTP requests. I did something similar for personal use some time ago; you can have a look and modify it accordingly (it uses the GitHub API, so you'll need a token).
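For what it's worth, a minimal sketch of that GitHub API route, assuming the public contents endpoint (unauthenticated calls work but are rate-limited; a token lifts the limit):
import requests
import pandas as pd

# public GitHub contents API; the listing is capped at 1000 entries per directory
api_url = ("https://api.github.com/repos/CSSEGISandData/COVID-19/contents/"
           "csse_covid_19_data/csse_covid_19_daily_reports")
listing = requests.get(api_url).json()

# every entry exposes a download_url that points at the raw file
frames = [pd.read_csv(item["download_url"])
          for item in listing
          if item["name"].endswith(".csv")]
one_df = pd.concat(frames, ignore_index=True)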
I am trying to read a CSV file from a Bitbucket URL into a df using Python. Also, for the work I am doing I cannot read it locally; it has to come from Bitbucket every time.
Any ideas on how to do this? Thank you!
Here is my example:
import pandas as pd

url = 'https://bitbucket.EXAMPLE.com/EXAMPLE/EXAMPLE/EXAMPLE/EXAMPLE/raw/wpcProjects.csv?at=refs%2Fheads%2Fmaster'
colnames = ['project_id', 'project_name', 'gourmet_url']
df7 = pd.read_csv(url, names=colnames)
However, the output is not correct; it's not the df being output but some bad data.
You have multiple options, but your question is actually two separate questions:
1. How to get a file (a .csv in this case) from a remote location.
2. How to load a csv into a "df", which is a pandas data frame.
For #2, you simply import pandas and use the df = pandas.read_csv() function call. See the documentation! If the CSV file were in the current directory, you would do pandas.read_csv('myfile.csv').
For #1, the CSV is on a server somewhere; in this case it happens to be on Bitbucket's servers, accessed from their website. You can fetch it and save it locally, then read it; you can fetch it to a temporary location, read it into pandas, and discard it; or you could even read the data from the file into Python as a string. Having a lot of options doesn't mean they are all useful, though; I am just listing them for completeness. Looking at the documentation, pandas already has remote fetching built into the read_csv() function: if the path you pass in is a valid URL, pandas will fetch it for you. Per the docs,
"Valid URL schemes include http, ftp, s3, gs, and file".
If you want to save it locally, you can once again use pandas, with the .to_csv() method of a DataFrame.
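For example, a minimal sketch of both steps (the URL and the output filename here are purely illustrative):
import pandas as pd

# read_csv accepts a URL directly, as long as it points at the raw CSV and not an HTML page
df = pd.read_csv("https://example.com/path/to/raw/data.csv")  # hypothetical URL

# save a local copy with to_csv
df.to_csv("local_copy.csv", index=False)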
FOR BITBUCKET SPECIFICALLY:
You need to make sure to link to the 'raw' file on bitbucket. Get the link to the raw file, and pass that in. The link used to view the file on your web browser is not the direct link to the raw file by default, it's a webpage that offers a view into that file. Get the raw file link, then pass that into pandas.
Code example:
Assume we want (a random csv file I found on bitbucket):
https://bitbucket.org/pedrorijo91/nodejstutorial/src/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv?at=master
What you need is a link to the raw file! Clicking on the '...' menu and pressing 'Open raw', we get:
https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv
Let's look at this in detail. The link is the same up to the project name:
https://bitbucket.org/pedrorijo91/nodejstutorial/
afterwards, the raw file is under raw/ instead of src/,
then comes the same identifier (the commit hash, random-looking but identical in both links):
db4c991864e65c4d72e98a1dc94e33606e3adde9/
Finally, it's the same directory structure:
node_modules/levelmeup/data/horse_js.csv
The first link ends with ?at=master, which is parsed by the web server, and goes through src/; the second link, the actual link to the raw file, goes through raw/ instead and ends with .csv.
import pandas as pd
RAW_Bitbucket_URL = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
df = pd.read_csv(RAW_Bitbucket_URL)
The above code is successful for me.
Alternatively, you can download the entire file by making the request with requests, and then read it as a local file with pandas.read_csv().
>>> import pandas as pd
>>> import requests
>>> url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
>>> r = requests.get(url, allow_redirects=True)
>>> open('file.csv', 'wb').write(r.content)
>>> pd.read_csv('file.csv', encoding='utf-8-sig').head()
ID Tweet Date Via
0 374667940827635712 So, yes, a 100% JS App is 100% awesome 08:59:32, 9-3, 2013 web
1 374656867466637312 "vituperating priests" who rail against JavaSc... 08:15:32, 9-3, 2013 web
2 374654221292806144 Node/Browserify/CJS folks, is there any benefit 08:05:01, 9-3, 2013 Twitter for iPhone
3 374640446955212800 100% JavaScript applications. You may get some 07:10:17, 9-3, 2013 Twitter for iPhone
4 374613490763169792 A node.js app that will order you a sandwich 05:23:10, 9-3, 2013 web
I want to import a public dataset from Kaggle (https://www.kaggle.com/unsdsn/world-happiness?select=2017.csv) into a local Jupyter notebook. I don't want to use any credentials in the process.
I have seen various solutions, including pd.read_html, pd.read_csv and pd.read_table (pd = pandas).
I also found solutions that require a login.
The first set of solutions is the one I am interested in, though I see that they work on other websites because there is a link to the raw data.
I have been clicking everywhere in the Kaggle interface but found no direct URL to the raw data.
Bottom line: Is it possible to use say pd.read_csv to directly get data from the website into your local notebook? If so, how?
You can automate kaggle.cli
Follow the instructions at https://github.com/Kaggle/kaggle-api to download and save kaggle.json for authentication.
import kaggle.cli
import sys
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
# download data set
# https://www.kaggle.com/unsdsn/world-happiness?select=2017.csv
dataset = "unsdsn/world-happiness"
# invoke the kaggle CLI programmatically by faking its command-line arguments
sys.argv = [sys.argv[0]] + f"datasets download {dataset}".split(" ")
kaggle.cli.main()
# the download is a zip archive named after the dataset
zfile = ZipFile(f"{dataset.split('/')[1]}.zip")
# read every file in the archive into its own dataframe
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
dfs["2017.csv"]
I am a beginner at machine learning and am exploring datasets for my NLP project. I got the data from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html, and I am trying to create a pandas dataframe by parsing the XML data. I also want to add a label (1) to the positive reviews. Can someone please help me with the code? A sample output is given below.
from bs4 import BeautifulSoup
positive_reviews = BeautifulSoup(open('/content/drive/MyDrive/sorted_data_acl/electronics/positive.review', encoding='utf-8').read())
positive_reviews = positive_reviews.findAll('review_text')
positive_reviews[0]
<review_text>
I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad. It will run my cable modem, router, PC, and LCD monitor for 5 minutes. This is more than enough time to save work and shut down. Equally important, I know that my electronics are receiving clean power.
I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.
As always, Amazon had it to me in <2 business days
</review_text>
The main issue to note is that the data is pseudo-XML. The approach:
download tar.gz file and unzip / untar
build dictionary of all files
workaround to deal with the pseudo XML: insert a document element into the string representation of each document
then it's a simple case of using list/dict comprehensions to generate the pandas constructor format
dfs is a dictionary of data frames, ready to be used
import requests
from pathlib import Path
from tarfile import TarFile
from bs4 import BeautifulSoup
import io
import pandas as pd
# download tar with pseudo XML...
url = "http://www.cs.jhu.edu/%7Emdredze/datasets/sentiment/domain_sentiment_data.tar.gz"
fn = Path.cwd().joinpath(url.split("/")[-1])
if not fn.exists():
    r = requests.get(url, stream=True)
    with open(fn, 'wb') as f:
        for chunk in r.raw.stream(1024, decode_content=False):
            if chunk:
                f.write(chunk)
# untar downloaded file and generate a dictionary of all files
TarFile.open(fn, "r:gz").extractall()
files = {f"{p.parent.name}/{p.name}": p for p in Path.cwd().joinpath("sorted_data_acl").glob("**/*") if p.is_file()}
# convert all files into dataframes in a dict
dfs = {}
for file in files.keys():
    with open(files[file]) as f:
        text = f.read()
    # pseudo xml with no root element stops it from being well formed
    # force one in...
    soup = BeautifulSoup(f"<root>{text}</root>", "xml")
    # simple case of each review is a row and each child element is a column
    dfs[file] = pd.DataFrame([{c.name: c.text.strip("\n") for c in r.children if c.name} for r in soup.find_all("review")])
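With dfs in hand, adding the label the question asks for is straightforward; the dictionary keys follow the "folder/file" pattern built above (a sketch, assuming the archive's electronics/positive.review file):
# label the positive electronics reviews with 1
positive_df = dfs["electronics/positive.review"]
positive_df["label"] = 1
print(positive_df.head())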
I recently started diving into algo trading and building a bot for crypto trading.
For this I created a backtester with pandas to run different strategies with different parameters. The datasets (CSV files) I use are rather large (around 40 MB each).
These are processed fine, but as soon as I want to save the processed data to a CSV, nothing happens. No output whatsoever, not even an error message. I tried using the full path, I tried saving it with just the filename, I even tried saving it as a .txt file. Nothing seems to work. I also tried the solutions I was able to find on Stack Overflow.
I am using Anaconda3 in case that could be the source of my problem.
Here is the part of my code which tries to save the dataframe to a file.
results_df = pd.DataFrame(results)
results_df.columns = ['strategy', 'number_of_trades', "capital"]
print(results_df)
for i in range(2, len(results_df)):
    if results_df.capital.iloc[i] < results_df.capital.iloc[0]:
        results_df.drop([i], axis="index")
#results to csv
current_dir = os.getcwd()
results_df.to_csv(os.getcwd()+'\\file.csv')
print(results_df)
Thank you for your help!
You can simplify your code a great deal and write it as follows (it should also run faster):
results_df = pd.DataFrame(results)
results_df.columns = ['strategy', 'number_of_trades', "capital"]
print(results_df)
first_row_capital = results_df.capital.iloc[0]
indexer_capital_smaller = results_df.capital < first_row_capital
values_to_delete = indexer_capital_smaller[indexer_capital_smaller].index
results_df.drop(index=values_to_delete, inplace=True)
#results to csv
current_dir = os.getcwd()
results_df.to_csv(os.getcwd()+'\\file.csv')
print(results_df)
I think the main problem in your code might be that you write the csv each time you find an entry in the dataframe where capital satisfies the condition, and that you only write it at all if such a case is found.
And if you only do the deletion for the csv output and don't need the filtered dataframe in memory afterwards, you can make it even simpler:
results_df = pd.DataFrame(results)
results_df.columns = ['strategy', 'number_of_trades', "capital"]
print(results_df)
first_row_capital = results_df.capital.iloc[0]
indexer_capital_smaller = results_df.capital < first_row_capital
#results to csv
current_dir = os.getcwd()
results_df[~indexer_capital_smaller].to_csv(os.getcwd()+'\\file.csv')
print(results_df[~indexer_capital_smaller])
This second variant only applies a filter before writing the filtered lines and before printing the content.
I've tried many times to find a way to import this data from this PDF.
(http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf) It's a report from an agricultural department in Brazil. I need just the first one.
My mission is to develop a program that gets some specific points of this report and build a paragraph with it.
The thing is that I couldn't find a way to import the table correctly.
I've tried to use tabula-py, but it didn't work very well.
Does anyone know how can I import it?
Python 3.6 / macOS High Sierra
PS: It needs to be done with Python only, because this code will be uploaded to Heroku, so I can't install software there. (BTW, I think even tabula-py might not work there, as it needs Java installed... but I will try anyway.)
Here is what I tried:
import tabula
import requests
url = "http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf"
response = requests.get(url)
df = tabula.read_pdf(url)
tabula.convert_into("teste.pdf", "output.csv", output_format="csv", area=(67.14, 23.54,284.12, 558.01)) #I tried also without area.
I think tabula expects a file, not a URL. Try this:
#!/usr/bin/env python3
import tabula
import requests
url = "http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf"
filename = "16032018194928.pdf"
response = requests.get(url)
with open(filename, 'wb') as f:
    f.write(response.content)
df = tabula.read_pdf(filename)
print(df)
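If you also want the CSV output your convert_into call was aiming for, you can point it at the downloaded file instead of the URL (a sketch; the area values are copied from your question and may need adjusting):
tabula.convert_into(filename, "output.csv", output_format="csv",
                    pages=1, area=(67.14, 23.54, 284.12, 558.01))
Note that tabula-py still needs a Java runtime underneath, as you suspected, so that constraint remains on Heroku.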