I am trying to import a CSV file into Python, but it doesn't seem to work unless I use the Import Data icon.
I've never used Python before, so apologies if I am doing something obviously wrong. I normally use R and am trying to replicate in Python the tasks I do in R.
Here is some sample code:
import pandas as pd
import os
Main_Path = "C:/Users/fan0ia/Documents/Python_Files"
Area = "Pricing"
Project = "Elasticity"
Path = os.path.join(Main_Path, Area, Project)
os.chdir(Path)
#Read in the data
Seasons = pd.read_csv("seasons.csv")
Dep_Sec_Key = pd.read_csv("DepSecKey.csv")
These files import without any issues but when I execute the following:
UOM = pd.read_csv("FINAL_UOM.csv")
Nothing shows in the variable explorer panel and I get this in the IPython console:
In [3]: UOM = pd.read_csv("FINAL_UOM.csv")
If I use the Import Data icon and go through the wizard, selecting DataFrame on the preview tab, it works fine.
The same file imports into R with the equivalent command, so I don't know what I am doing wrong. Is there any way to see the code the wizard generated so I can compare it to mine?
Turns out the data had imported; it just wasn't showing in the variable explorer.
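For anyone who hits the same confusion, a quick way to confirm the DataFrame really loaded, regardless of whether the variable explorer refreshes, is to inspect it in the IPython console. A minimal check, reusing the names from the question:
import pandas as pd

UOM = pd.read_csv("FINAL_UOM.csv")

# verify the read worked even if nothing appears in the variable explorer
print(UOM.shape)   # (number of rows, number of columns)
print(UOM.head())  # first five rows
print(UOM.dtypes)  # inferred column types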
Related
I want to import a public dataset from Kaggle (https://www.kaggle.com/unsdsn/world-happiness?select=2017.csv) into a local Jupyter notebook. I don't want to use any credentials in the process.
I have seen various solutions, including pd.read_html, pd.read_csv, and pd.read_table (pd = pandas).
I also found solutions that require a login.
The first set of solutions is the one I am interested in, though I see they work on other websites only because those sites expose a link to the raw data.
I have been clicking everywhere in the Kaggle interface but can find no direct URL to the raw data.
Bottom line: is it possible to use, say, pd.read_csv to get data from the website directly into my local notebook? If so, how?
You can automate the download with kaggle.cli.
Follow the instructions at https://github.com/Kaggle/kaggle-api to download and save kaggle.json for authentication.
import kaggle.cli
import sys
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
# download data set
# https://www.kaggle.com/unsdsn/world-happiness?select=2017.csv
dataset = "unsdsn/world-happiness"
# kaggle.cli.main() reads its arguments from sys.argv, so fake a CLI invocation
sys.argv = [sys.argv[0]] + f"datasets download {dataset}".split(" ")
kaggle.cli.main()
# the CLI saves a zip named after the dataset; read every file inside it
zfile = ZipFile(f"{dataset.split('/')[1]}.zip")
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
dfs["2017.csv"]
I am currently looking into working from different PCs on the same ownCloud data, doing file imports such as:
```data = pd.read_csv(r"C:\Users\User\ownCloud\Sample Information\folder1\folder2\data.csv")```
However, only "/Sample Information/folder1/folder2/data.csv" is independent of the PC that I use.
The Jupyter notebook would be somewhere like this: r"C:\Users\User\ownCloud\Simulations\folder1\folder2\data_model.ipynb"
I tried stuff like the following, but it wouldn't work:
```data = pd.read_csv(".../Sample Information/folder1/folder2/data.csv")```
Is there a concise way of doing these imports, or do I need to use the os module and chain os.path.dirname(...) calls (repeated once per subfolder until the ownCloud root is reached) around os.getcwd()?
Thanks for your help!
EDIT:
After some pointers I now use either of the following two solutions:
import os
ipynb_path = os.getcwd()
n_th_folder = 4
### Version 1
split = ipynb_path.split(os.sep)[:-n_th_folder]
ownCloud_path = (os.sep).join(split)
### Version 2
ownCloud_path = ipynb_path
for _ in range(n_th_folder):
    ownCloud_path = os.path.dirname(ownCloud_path)
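A slightly tidier variant of the same idea (just a sketch, assuming the notebook really does sit n_th_folder levels below the ownCloud root) uses pathlib, which replaces both the split/join and the repeated os.path.dirname calls:
import pandas as pd
from pathlib import Path

n_th_folder = 4
ownCloud_path = Path.cwd().parents[n_th_folder - 1]  # parents[0] is the immediate parent

data = pd.read_csv(ownCloud_path / "Sample Information" / "folder1" / "folder2" / "data.csv")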
You could replace the username in the path with its value via string interpolation.
import os
data = pd.read_csv(r"C:\Users\{}\ownCloud\Sample Information\folder1\folder2\data.csv".format(os.getlogin()))
I have around 970 PDF files with the same format and I want to extract the table from each of them. After doing some research I am able to extract the table using tabula's area argument. Unfortunately the area parameters are not the same for each PDF, so I cannot simply iterate. If anyone can help me automate finding the area argument for each PDF, it would be a great help.
As you can see in the image, I have to pass area, otherwise the junk in the header is also parsed. Here is the script I was able to run successfully for the first PDF, but I need to extract from 970 files, which is not possible manually. Please help!
# author: Jiku-tlenova
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import re
import PyPDF2 as rdpdf
import tabula
path = "/codes/python/"
os.chdir(path)
from convert_pdf_to_txt import convert_pdf_to_txt
os.getcwd()
pa="s/"
os.chdir(path+pa)
files= os.listdir(".")
ar = [187.65, 66.35, 606.7, 723.11]  # table area in points: [top, left, bottom, right]
tablist=[]
for file in files:
    i = 0
    pgnum = 2; endval = 0
    weind = re.findall(r"\d+", file)
    print(file)
    reader = rdpdf.PdfFileReader(file)
    while endval == 0:
        table0 = tabula.read_pdf(file, pages=i + 2, spreadsheet=True, multiple_tables=False, lattice=True, area=ar)  # pandas_options={'header': 'infer'}
        table0 = table0.dropna(how="all", axis=1)
        # formatting headers: combine the first two rows into column names
        head = (table0.iloc[0, :] + table0.iloc[1, :]).T
        table0.columns = head
        table0 = table0.drop([0, 1])
        table0 = table0.iloc[:-1]  # delete last row - not needed
        mys = table0[table0.columns[-1]]
        val = mys.isnull().all()
        if val == True:
            endval = 1
        tablist.append(table0)
        i = i + 1
Finally able to do it myself... basically I took the code from R and used a wrapper. It seems the R support community on Stack Overflow is much more active than the Python one. Thanks.
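Since the wrapper code is not shown above, here is one possible pure-Python sketch of the automation, not the author's R-based solution: it leans on tabula-py's built-in table detection (guess=True) instead of a hard-coded area argument, which may or may not filter out the header junk in every one of the 970 files:
import os
import tabula

pdf_dir = "s/"   # folder containing the PDFs, as in the question's script
tablist = []

for file in sorted(os.listdir(pdf_dir)):
    if not file.lower().endswith(".pdf"):
        continue
    # let tabula guess the table region on each page instead of passing area=...
    tables = tabula.read_pdf(os.path.join(pdf_dir, file), pages="all",
                             guess=True, lattice=True, multiple_tables=True)
    tablist.extend(tables)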
I am trying to access a data.frame from the R global environment and import it into Python in the PyCharm IDE, but I am not able to figure out how to do it.
I tried the following:
Since I don't know how to access the global environment where my target data.frame is stored, I created another R script (myscript.R) in which I saved the data.frame as an rds object and read it back:
saveRDS(dfcast, file = "forecast.rds")
my_data <- readRDS(file = "forecast.rds")
However, when I try to read the rds file with the following Python code:
import os
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage
cwd = os.getcwd()
pandas2ri.activate()
os.chdir('C:/Users/xx/myscript.R')
readRDS = robjects.r['readRDS']
df = readRDS('forecast.rds')
df = pandas2ri.ri2py(df)
df.head()
I get the following error:
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file 'forecast.rds', probable reason 'No such file or directory'
Please show me how to deal with this; I just want to access a data.frame from R in Python.
The data.frame is actually a forecast generated by another R script, which takes about 7-8 minutes to run. So instead of running it again in Python, I want the processing to happen in R and then import the forecast data frame into Python for further analysis. Since I am in the middle of building the analysis module, I don't want the R forecast function to run again and again while I am debugging it. Hence, I want to access the result directly from R.
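No accepted answer is reproduced here, but a minimal sketch of the same rpy2 route from the question, with the two most likely culprits corrected (os.chdir must point at the directory containing forecast.rds rather than at the .R script, and the file has to be written with saveRDS on the R side), would look roughly like this; 'C:/Users/xx' is just the placeholder path from the question:
import os
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

pandas2ri.activate()

# change into the directory that holds forecast.rds (a directory, not the .R file)
os.chdir('C:/Users/xx')

readRDS = robjects.r['readRDS']
r_df = readRDS('forecast.rds')   # only works if R wrote the file with saveRDS()
df = pandas2ri.ri2py(r_df)       # rpy2 2.x; newer releases use pandas2ri.rpy2py
print(df.head())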
I have a .csv file on my F: drive on Windows 7 64-bit that I'd like to read into pandas and manipulate.
None of the examples I see read from anything other than a simple file name (e.g. 'foo.csv').
When I try this I get error messages that aren't making the problem clear to me:
import pandas as pd
trainFile = "F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv"
trainData = pd.read_csv(trainFile)
The error message says:
IOError: Initializing from file failed
I'm missing something simple here. Can anyone see it?
Update:
I did get more information like this:
import csv

if __name__ == '__main__':
    trainPath = 'F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv'
    trainData = []
    with open(trainPath, 'r') as trainCsv:
        trainReader = csv.reader(trainCsv, delimiter=',', quotechar='"')
        for row in trainReader:
            trainData.append(row)
    print(trainData)
I got a permission error on read. When I checked the properties of the file, I saw that it was read-only. I was able to read 892 lines successfully after unchecking it.
Now pandas is working as well. No need to move the file or amend the path. Thanks for looking.
I cannot promise that this will work, but it's worth a shot:
import pandas as pd
import os
trainFile = "F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv"
pwd = os.getcwd()
os.chdir(os.path.dirname(trainFile))
trainData = pd.read_csv(os.path.basename(trainFile))
os.chdir(pwd)
A better solution is to use raw string literals like r'pathname\filename' rather than 'pathname\filename', so backslashes are not interpreted as escape sequences. See the Lexical Analysis section of the Python documentation for more details.
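For example, with the path from the question, either of these avoids backslashes being read as escape sequences:
import pandas as pd

# raw string: backslashes are taken literally
trainData = pd.read_csv(r"F:\Projects\Python\coursera\intro-to-data-science\kaggle\data\train.csv")

# or simply use forward slashes, which pandas and Windows both accept
trainData = pd.read_csv("F:/Projects/Python/coursera/intro-to-data-science/kaggle/data/train.csv")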
I also got the same issue and got it resolved.
Check that the path to the file is correct.
I initially had the path like
dfTrain = pd.read_csv("D:\\Kaggle\\labeledTrainData.tsv",header=0,delimiter="\t",quoting=3)
This returned an error because the path was wrong. Then I changed the path as below, and it works fine.
dfTrain = pd.read_csv("D:\\Kaggle\\labeledTrainData.tsv\\labeledTrainData.tsv",header=0,delimiter="\t",quoting=3)
This was because my earlier path was not correct. Hope you get yours resolved.
This happens to me quite often. Usually I open the csv file in Excel, save it as an xlsx file, and it works.
so instead of
df = pd.read_csv(r"...\file.csv")
Use:
df = pd.read_excel(r"...\file.xlsx")
If you're sure the path is correct, make sure no other programs have the file open. I got that error once, and closing the Excel file made the error go away.
Try this:
import os
import pandas as pd
trainFile = os.path.join('F:', os.sep, 'Projects', 'Python', 'coursera', 'intro-to-data-science', 'kaggle', 'data', 'train.csv')  # os.sep after the drive letter keeps the path absolute; 'F:' alone would be drive-relative
trainData = pd.read_csv(trainFile)