Import csv from Kaggle url into a pandas DataFrame - python

I want to import a public dataset from Kaggle (https://www.kaggle.com/unsdsn/world-happiness?select=2017.csv) into a local jupyter notebook. I don't want to use any credencials in the process.
I saw diverse solutions including: pd.read_html, pd.read_csv, pd.read_table (pd = pandas).
I also found the solutions that imply a login.
The first set of solutions are the ones I am interested in, though I see that they work on other websites because there is a link to the raw data.
I have been clincking everywhere in the kaggle interface but find no direct url to raw data.
Bottom line: Is it possible to use say pd.read_csv to directly get data from the website into your local notebook? If so, how?

You can automate kaggle.cli
follow the instructions to download and save kaggle.json for authentication https://github.com/Kaggle/kaggle-api
import kaggle.cli
import sys
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
# download data set
# https://www.kaggle.com/unsdsn/world-happiness?select=2017.csv
dataset = "unsdsn/world-happiness"
sys.argv = [sys.argv[0]] + f"datasets download {dataset}".split(" ")
kaggle.cli.main()
zfile = ZipFile(f"{dataset.split('/')[1]}.zip")
dfs = {f.filename:pd.read_csv(zfile.open(f)) for f in zfile.infolist() }
dfs["2017.csv"]

Related

Access file and read/write from shared/network folder using python

i want to read and write data from network folder, so far i have tried
os.open("\u drive path") , open("\u drive path")
but it says accesss or permission denied
but when i use
os.startfile("\u drive path")
I always try r strings when connecting to a network drive (especially if using pandas) try doing this to put the file into a dataframe
import pandas as pd
desired_file = r'\\networkdrive\folder\file.csv'
df = pd.read_csv(desired_file, , encoding='utf-8')
This makes it easier for us to just look at as people with the r string but if you use
print(desired_file)
You can see that python reads it the way that it needs to be formatted for pandas

Extracting tables in 1000's of PDF using tabula area argument

I have around 970 pdf files with same format and i want to extract the table from these pdf's. after doing some research i am able to extract table using tabula-area argument, Unfortuntely the area parameters are not same for each pdf hence i cannot iterate . So, if anyone can help me with automating finding this area arguments for each pdf it would be great help.
as you can see in image i have to use area otherwise the junk in header is also parsed. Here is the script i am able to execute successfully for first pdf, but i need to extract from 970files which is not possible manually. PLS. HELP!!
#author: Jiku-tlenova
"""
import numpy as np
import matplotlib as plt
import pandas as pd
import os
import re
import PyPDF2 as rdpdf
import tabula
path = "/codes/python/"
os.chdir(path)
from convert_pdf_to_txt import convert_pdf_to_txt
os.getcwd()
pa="s/"
os.chdir(path+pa)
files= os.listdir(".")
ar=[187.65,66.35,606.7,723.11]
tablist=[]
for file in files:
i=0
pgnum=2;endval=0
weind=re.findall("\d+", file)
print(file)
reader = rdpdf.PdfFileReader(file)
while endval==0:
table0 =tabula.read_pdf(file, pages = i+2, spreadsheet=True,multiple_tables = False ,lattice=True,area=ar) #pandas_options={'header': 'infer'}
table0=table0.dropna(how="all",axis=1)
#foramtiing headers
head=(table0.iloc[0,:]+table0.iloc[1,:]).T
table0.columns=head
table0=table0.drop([0, 1])
table0=table0.iloc[:-1] #delete last row - not needed
mys=table0[table0.columns[-1]]
val=mys.isnull().all()
if val==True:
endval=1
tablist.append(table0)
i=i+1```
finally able to do it myself....basically took code from R and used wrapper....seems R support community is much active in stack than python one.....thanks

COVID-19 data analysis with Python from Github CSV

This link contains CSV files for daily reports of COVID-19.
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports
What is the best solution to get all the csv files in a dataframe?
I tried the code bellow from other questions but it doesnt work.
from pathlib import Path
import pandas as pd
files = Path('https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports')
csv_only = files.rglob('*.csv')
combo = [pd.read_csv(f)
.assign(f.stem)
.fillna(0)
for f in csv_only]
one_df = pd.concat(combo,ignore_index=True)
one_df = one_df.drop_duplicates('date')
print(one_df)
How could i fit requests to read all the files?
You can simply use requests module to get the names of all the .csv present, which would eliminate the need to run glob:
import requests
url = "https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports"
csv_only = [i.split("=")[1][1:-1] for i in requests.get(url).text.split(" ") if '.csv' in i and 'title' in i]
pathlib only works with filesystems so this won't do. csv_only will be an empty generator since there is no such location on your disk. You need to fetch the data from github with actual http requests. I did something for some personal stuff some time ago, you can have a look and modify it accordingly(uses the github API so you'll need to get one).

Extracted tables from PDF returned incorrect data - Python

I've tried many times to find a way to import this data from this PDF.
(http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf) It's a report from a agri department in Brazil. I need just the first one.
My mission is to develop a program that gets some specific points of this report and build a paragraph with it.
The thing is that I couldn't find a way to import the table correctly.
I've tried to use tabula-py, but didn't work very well.
Does anyone know how can I import it?
Python 3.6 / Mac hight Sierra
ps: It need to be done just with python, because this code will be upload at Heroku, so I can't install softwares there. (BTW, I think even the tabula-py would not work there as I need to have Java installed... but I will try anyway)
Here what I tried:
import tabula
import requests
url = "http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf"
response = requests.get(url)
df = tabula.read_pdf(url)
tabula.convert_into("teste.pdf", "output.csv", output_format="csv", area=(67.14, 23.54,284.12, 558.01)) #I tried also without area.
I think tabula expects a file, not a URL. Try this:
#!/usr/bin/env python3
import tabula
import requests
url = "http://www.imea.com.br/upload/publicacoes/arquivos/16032018194928.pdf"
filename = "16032018194928.pdf"
response = requests.get(url)
with open(filename, 'wb') as f:
f.write(response.content)
df = tabula.read_pdf(filename)
print(df)

Import csv Python with Spyder

I am trying to import a csv file into Python but it doesn't seem to work unless I use the Import Data icon.
I've never used Python before so apologies is I am doing something obviously wrong. I use R and I am trying to replicate the same tasks I do in R in Python.
Here is some sample code:
import pandas as pd
import os as os
Main_Path = "C:/Users/fan0ia/Documents/Python_Files"
Area = "Pricing"
Project = "Elasticity"
Path = os.path.join(R_Files, Business_Area, Project)
os.chdir(Path)
#Read in the data
Seasons = pd.read_csv("seasons.csv")
Dep_Sec_Key = pd.read_csv("DepSecKey.csv")
These files import without any issues but when I execute the following:
UOM = pd.read_csv("FINAL_UOM.csv")
Nothing shows in the variable explorer panel and I get this in the IPython console:
In [3]: UOM = pd.read_csv("FINAL_UOM.csv")
If I use the Import Data icon and use the wizard selecting DataFrame on the preview tab it works fine.
The same file imports into R with the same kind of command so I don't know what I am doing wrong? Is there any way to see what code was generated by the wizard so I can compare it to mine?
Turns out the data had imported, it just wasn't showing in the variable explorer

Categories

Resources