How to load a large XML dataset file in Python?

Hi, I am working on a data analysis project in Python where I have an XML file of around 2.8 GB, which is too large to open. I downloaded EmEditor, which helped me open the file. The problem is that when I try to load the file in Python on Google Colaboratory like this:
import xml.etree.ElementTree as ET
tree = ET.parse('dataset.xml')  # dataset.xml is the name of my file
root = tree.getroot()
I get the error No such file or directory: 'dataset.xml'. I have my dataset.xml file on my desktop and it can be opened using EmEditor, which gives me the idea that it can be edited and loaded via EmEditor, but I don't know. I would appreciate your help with loading the data in Python on Google Colab.

Google Colab runs remotely on a computer from Google, and can't access files that are on your desktop.
To open the file in Python, you'll first need to transfer the file to your Colab instance. There are multiple ways to do this, and you can find them here: https://colab.research.google.com/notebooks/io.ipynb
The easiest is probably this:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
Although keep in mind that every time you start a new Colab session you'll need to re-upload the file. This is because Google would like to use the computer for someone else when you are not using it, and thus wipes all the data on the machine.
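Once the file has been uploaded, the original ET.parse('dataset.xml') call should be able to find it. For a 2.8 GB file, though, building the full tree may exhaust Colab's memory; here is a minimal sketch using iterparse instead, assuming the elements you care about have the placeholder tag name record (adjust for your schema):
import xml.etree.ElementTree as ET

# Stream through the document instead of building the whole tree in memory.
# 'record' is a placeholder tag name; replace it with the element you actually need.
count = 0
for event, elem in ET.iterparse('dataset.xml', events=('end',)):
    if elem.tag == 'record':
        count += 1
        # ... process elem here ...
        elem.clear()  # free the element so memory use stays bounded
print(count, 'records parsed')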

Related

Python: how to know if a file has finished uploading into HDFS

So I have 2 scripts:
script1 uploads a file to HDFS
script2 accesses the folder and reads the files every n seconds
My upload script is like this:
from hdfs import InsecureClient
from requests import Session
from requests.auth import HTTPBasicAuth
session = Session()
session.auth = HTTPBasicAuth('hadoop', 'password')
client_hdfs = InsecureClient('http://hadoop.domain.com:50070', user='hadoop', session=session)
client_hdfs.upload(hdfsPath,filePath,overwrite=True)
When I read https://martin.atlassian.net/wiki/spaces/lestermartin/blog/2019/03/21/1172373509/are+partially-written+hdfs+files+accessible+not+exactly+but+much+more+yes+than+I+previously+thought
or the Stack Overflow question "Accessing a file that is being written",
it seems that when I upload using the hadoop dfs -put command (or -copyFromLocal or -cp) it creates [filename].COPYING while the file is not finished yet. But with the Python script it seems to create the file under its final name while the size increases over time until the upload completes (so we can download it before it is complete and get a corrupted file).
I want to ask if there is a way to upload the file using Python so that we know whether the file has finished uploading or not.
Actually I have another work-around: upload the files into a temporary folder and move them to the correct folder after everything is finished (I am still trying to do this), but any other ideas would be appreciated.
You can use the same strategy as HDFS:
create [filename].COPYING
when the data is uploaded, rename it to [filename]
I feel like you suggested the same thing with a temp folder instead of a name change, but it amounts to the same idea. Just so you know, renaming a file is extremely cheap and fast, so by all means this is a good strategy.
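A minimal sketch of that strategy with the same hdfs client as in the question (host, credentials, and paths are placeholders carried over from the example; note that rename fails if the destination already exists):
from hdfs import InsecureClient
from requests import Session
from requests.auth import HTTPBasicAuth

session = Session()
session.auth = HTTPBasicAuth('hadoop', 'password')
client_hdfs = InsecureClient('http://hadoop.domain.com:50070', user='hadoop', session=session)

final_path = '/data/incoming/dataset.csv'   # placeholder: the path the reader script watches
temp_path = final_path + '.COPYING'         # readers skip names ending in .COPYING

client_hdfs.upload(temp_path, 'dataset.csv', overwrite=True)
client_hdfs.rename(temp_path, final_path)   # publish the file only once it is complete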

Python - How to download a file with given name from online repository (given its URL)

I currently have a main Python script that works by analyzing a given CSV file present in its own local working folder. With the aim of automating the analysis of more than one CSV file, I'm currently trying to build another script that performs the following tasks:
Download into the local working folder a CSV file, identified by its name among the many in an online repository (a OneDrive folder), for which I have the corresponding URL (for the OneDrive folder, not directly the file).
Run the main script and analyze it.
Remove the analyzed CSV file from the local folder and repeat the process.
I'm having some issues with the identification and download of the CSV files.
I've seen some approaches using the requests module, but they were more about downloading a file directly from a given URL, not looking it up by name and fetching it from an online repository. For this reason I'm not even sure how to start here.
What I'm looking for is something like:
url = 'https://1drv.ms/xxxxxxxxx'
file_name = 'title.csv'
# -> Download(link = url, file = file_name)
Thanks in advance to anyone who'll take some time to read this! :)
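For the simpler half of the problem, downloading a file once a direct URL is known, a requests-based sketch could look like this (the URL and function name are placeholders; looking the file up by name inside the OneDrive folder would additionally need the Microsoft Graph API or a per-file share link, which is not shown here):
import requests

def download(url, file_name):
    # Stream the response to disk so large CSVs never have to fit in memory.
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(file_name, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

download('https://example.com/title.csv', 'title.csv')  # placeholder URL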

Python: how to import data in a folder

Hey guys, I am new to Python and have been trying to use a Google Colaboratory notebook to learn pandas. I have been trying to import data but was unable to do so, the error being:
`FileNotFoundError: [Errno 2] No such file or directory: './train.csv'`
but I had the CSV file in the same folder my notebook is in.
This is the code I ran. I have no idea why it doesn't work. Thanks for any suggestions.
train = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")
Assuming you uploaded your files to Google Colab correctly, I suspect that you're not using the exact location of the files (test.csv and train.csv).
Once you navigate to the location of the files, find the location using
pwd
Once you find the location, you can read the files in pandas
train = pd.read_csv(location_to_train_file)
test = pd.read_csv(location_to_test_file)
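As a minimal sketch, assuming the files were uploaded to the Colab session (via files.upload() or the file browser), you can confirm the working directory and use absolute paths; /content is Colab's default working directory:
import os
import pandas as pd

print(os.getcwd())        # confirm the notebook's current working directory
print(os.listdir('.'))    # check that train.csv and test.csv are really there

train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')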

Where to upload a csv to then read it when coding on Jupyter

I need to upload a few CSV files somewhere on the internet, to be able to use them in Jupyter later using read_csv.
What would be some easy ways to do this?
The CSV contains a database. I want to upload it somewhere and use it in Jupyter using read_csv so that other people can run the code when I send them my file.
The CSV contains a database.
Since the CSV contains a database, I would not suggest uploading it on GitHub as mentioned by Steven K in the other answer. It would be a better option to upload it to either Google Drive or Dropbox, as rightly said there.
To read the file from Google Drive, you could try the following:
Upload the file on Google Drive, click on "Get shareable link", and ensure that anybody with the link can access it.
Click on copy link and get the file ID associated with the CSV.
Ex: If this is the URL https://drive.google.com/file/d/108ARMaD-pUJRmT9wbXfavr2wM0Op78mX/view?usp=sharing then 108ARMaD-pUJRmT9wbXfavr2wM0Op78mX is the file ID.
Simply use the file ID in the following sample code
import pandas as pd
gdrive_file_id = '108ARMaD-pUJRmT9wbXfavr2wM0Op78mX'
data = pd.read_csv(f'https://docs.google.com/uc?id={gdrive_file_id}&export=download', encoding='ISO-8859-1')
Here you are opening up the CSV to anybody with access to the link. A better and more controlled approach would be to share access with known people and use a library like PyDrive, which is a wrapper around Google's official Python API client.
NOTE: Since your question does not mention the version of Python you are using, I've assumed Python 3.6+ and used an f-string in line 3 of the code. If you use any version of Python before 3.6, you would have to use the format method to substitute the value of the variable into the string.
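For reference, a minimal PyDrive sketch might look like the following (it reuses the example file ID from above, and the OAuth flow shown assumes a local browser is available; in Colab the authentication step is different):
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
gauth.LocalWebserverAuth()   # opens a browser window for the OAuth consent screen
drive = GoogleDrive(gauth)

# Download the shared CSV by its file ID and save it locally.
csv_file = drive.CreateFile({'id': '108ARMaD-pUJRmT9wbXfavr2wM0Op78mX'})
csv_file.GetContentFile('data.csv')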
You could use any cloud storage provider like Dropbox or Google Drive. Alternatively, you could use Github.
To do this in your notebook, import pandas and call read_csv like you normally would for a local file, passing the file's URL instead of a local path.
import pandas as pd
url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c = pd.read_csv(url)
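If you go with the Dropbox option instead, a shared link normally ends in ?dl=0 (a preview page); changing that to ?dl=1 turns it into a direct download that read_csv can fetch. A sketch with a placeholder link:
import pandas as pd

dropbox_url = "https://www.dropbox.com/s/abc123/countries.csv?dl=1"  # placeholder link
df = pd.read_csv(dropbox_url)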

Downloading public files in Google Drive (Python)

Suppose that someone gives me a link that enables me to download a public file in Google Drive.
I want to write a program that can read the link and then download it as a text file.
For example, https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit is one of the files in my Google Drive.
Everyone can access this file.
But how can I write a Python program that downloads the text file given the above link?
Could someone have some pieces of sample code for me?
It seems that the Google Drive SDK could be useful(?), but is there any way to do it without using the SDK?
First you need to write a program that slices the file ID out of the link of the file you have uploaded.
For example, in the link that you gave:
https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit
the ID is 1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU
Save it in some variable, say file_id.
Now, to get the download link, substitute it into:
https://docs.google.com/uc?export=download&id=<file_id>
This link will download the file.
If the above answer doesn't work for you, use the following links:
To save as a .txt file:
https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/export?format=txt
To save as a .docx file:
https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/export?format=docx
Generally, the trick is to replace edit at the end of the URL with export?format=txt. Hope it helps!
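A minimal Python sketch of that approach, using the document ID from the question's example link:
import requests

doc_id = "1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU"
export_url = f"https://docs.google.com/document/d/{doc_id}/export?format=txt"

response = requests.get(export_url)
response.raise_for_status()   # fails loudly if the document is not publicly accessible

with open("document.txt", "wb") as f:
    f.write(response.content)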
