Create a pandas dataframe from a qrc resource file - python

I would like to save a CSV file into a qrc file and then read it, putting its contents in a pandas dataframe, but I have run into a problem.
I created a qrc file called res.qrc:
<!DOCTYPE RCC><RCC version="1.0">
<qresource>
<file>dataset.csv</file>
</qresource>
</RCC>
I compiled it obtaining the res_rc.py file.
To read it I created a python script called resource.py:
import pandas as pd
import res_rc
from PySide.QtCore import *
file = QFile(":/dataset.csv")
df = pd.read_csv(file.fileName())
print(df)
But I obtain the error: IOError: File :/dataset.csv does not exist
All the files (resource.py, res.qrc, res_rc.py, dataset.csv) are in the same folder.
If I do res_rc.qt_resource_data I can see the contents.
How can I create the pandas dataframe?

A qresource path is a virtual path that only Qt knows how to resolve, and its internal handling can change without warning, so pandas cannot open it by name. In these cases, read all the data through Qt and wrap it in a stream with io.BytesIO:
import io
import pandas as pd
from PySide import QtCore
import res_rc
file = QtCore.QFile(":/dataset.csv")
if file.open(QtCore.QIODevice.ReadOnly):
    f = io.BytesIO(file.readAll().data())
    df = pd.read_csv(f)
    print(df)
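The in-memory step can be tried without Qt. The following is a minimal sketch, assuming the resource has already been read into a bytes object; the literal `raw` is made-up sample data standing in for `file.readAll().data()`.

```python
import io

import pandas as pd

# Hypothetical sample bytes; in the answer above they come from
# file.readAll().data() on the opened QFile.
raw = b"name,value\nalpha,1\nbeta,2\n"

# Wrap the bytes in a file-like object that pandas can read from.
buffer = io.BytesIO(raw)
df = pd.read_csv(buffer)
print(df)
```

Because read_csv only needs a file-like object, the same pattern should apply to any source that can hand back the file contents as bytes.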

Related

Loading a parquet file from a GitHub repository

I tried to read a parquet (.parq) file I have stored in a GitHub project, using the following script:
import pandas as pd
import numpy as np
import ipywidgets as widgets
import datetime
from ipywidgets import interactive
from IPython.display import display, Javascript
import warnings
warnings.filterwarnings('ignore')
parquet_file = r'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')
and it gave me this error:
ArrowInvalid: Could not open Parquet input source '': Parquet
magic bytes not found in footer. Either the file is corrupted or this
is not a parquet file.
Does anyone know what this error message means and how I can load the file in my GitHub repository? Thank you in advance.
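The github.com "blob" URL returns an HTML page rather than the file itself, which is why the Parquet magic bytes are not found. You should use the URL under the domain raw.githubusercontent.com instead.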
As for your example:
parquet_file = 'https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')
You can read parquet files directly from a web URL like this. However, when reading a data file from a git repository you need to make sure it is the raw file url:
url = 'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq?raw=true'
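The mapping between the two URL forms can be sketched as a small helper. This is an illustration only: `github_blob_to_raw` is a hypothetical name, not part of pandas or any GitHub library, and it assumes the URL follows the usual `/<user>/<repo>/blob/<ref>/<path>` layout.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url):
    """Convert a github.com 'blob' URL to its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    # Expected path layout: /<user>/<repo>/blob/<ref>/<path/to/file>
    segments = parts.path.lstrip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a GitHub blob URL: " + url)
    # Drop the "blob" segment and keep everything else in order.
    user, repo, _, *rest = segments
    raw_path = "/" + "/".join([user, repo] + rest)
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


blob_url = "https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq"
raw_url = github_blob_to_raw(blob_url)
print(raw_url)
```

The resulting string can then be passed to pd.read_parquet in place of the original blob URL.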

Pandas gives an unordered csv file

What can I do to make this (1 Pic):
look like this one with pandas (2 Pic):
Here's the code I used to make the csv file in the 1 Picture
import pandas as pd
import os
all_months_data = pd.DataFrame()
files = [file for file in os.listdir('Sales_Data/')]
for file in files:
    df = pd.read_csv('Sales_Data/' + file)
    all_months_data = pd.concat([all_months_data, df])
all_months_data.to_csv('all_data.csv')
I just figured out the problem: Excel itself had read my CSV file as text.
I did this and it worked:
Open Excel
Go to 'Data' tab
Select 'From Text/CSV' and select the .CSV file you want to import.
Click 'Import' and you're done!

How to read a specific file from a tar file using Windows?

I have a tar file with several files compressed in it. I need to read one specific file (it is in csv format) using pandas. I tried to use the following code:
import tarfile
tar = tarfile.open('my_files.tar', 'r:gz')
f = tar.extractfile('some_files/need_to_be_read.csv')
import pandas as pd
df = pd.read_csv(f.read())
but it throws up the following error:
OSError: Expected file path name or file-like object, got <class 'bytes'> type
on the last line of the code. How do I go about this to read this file?
When you call pandas.read_csv(), you need to give it a filename or file-like object. tar.extractfile() returns a file-like object. Instead of reading the file into memory, pass the file to Pandas.
So remove the .read() part:
import tarfile
tar = tarfile.open('my_files.tar', 'r:gz')
f = tar.extractfile('some_files/need_to_be_read.csv')
import pandas as pd
df = pd.read_csv(f)
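To try this without an existing archive, here is a self-contained sketch that builds a small gzip-compressed tar in memory and reads one member back. The member name and CSV content are made up for illustration.

```python
import io
import tarfile

import pandas as pd

# Build a small gzip-compressed tar archive in memory so the sketch is
# self-contained; the member name and CSV content are made up.
csv_bytes = b"id,score\n1,0.5\n2,0.75\n"
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="some_files/example.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))
archive.seek(0)

# Read one member back: extractfile() returns a file-like object,
# which is passed to read_csv directly, without calling .read().
with tarfile.open(fileobj=archive, mode="r:gz") as tar:
    member = tar.extractfile("some_files/example.csv")
    df = pd.read_csv(member)

print(df)
```

If the bytes have already been read, wrapping them in io.BytesIO gives pandas a file-like object again.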

How can I read pickle file containing pandas data frame from qrc resource file with pandas read_pickle?

I have a simple UI app created in PyQt5. I would like to keep all of my resource files in the qrc resources.
I am using a pickle file to store a previously created DataFrame. In my app I read the saved pickle with pandas. When I tried to do it from the qrc_resources Python module (created with pyrcc5), I got an error.
I used the same approach as in this answer:
Create a pandas dataframe from a qrc resource file
Resources file:
<!DOCTYPE RCC><RCC version="1.0">
<qresource>
<file alias="AA_data.pkl">resources/AA_data.pkl</file>
</qresource>
</RCC>
Python code:
import bisect, io
import pandas as pd
from PyQt5.QtGui import QImage
from PyQt5.QtCore import QFile, QIODevice
import qrc_resources
file = QFile(':/AA_data.pkl')
if file.open(QIODevice.ReadOnly):
    f = io.BytesIO(file.readAll().data())
    AA_df = pd.read_pickle(f)
Error:
ValueError: Unrecognized compression type: infer
If I do something similar with an Excel file, it works, but with the pickle file format I get an error. I am not very familiar with data serialization and cannot figure out what I am doing wrong.
You must use None for compression:
import io
import pandas as pd
from PyQt5.QtCore import QFile, QIODevice
import qrc_resources
file = QFile(':/AA_data.pkl')
if file.open(QIODevice.ReadOnly):
    f = io.BytesIO(file.readAll().data())
    AA_df = pd.read_pickle(f, compression=None)
    print(AA_df)
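The same call can be checked without Qt. This is a minimal sketch, assuming the resource holds an uncompressed pickle; the DataFrame contents are made up and `raw` stands in for `file.readAll().data()`.

```python
import io
import pickle

import pandas as pd

# Made-up frame standing in for the DataFrame stored in the resource.
original = pd.DataFrame({"residue": ["A", "G"], "mass": [71.04, 57.02]})

# Stand-in for file.readAll().data(): the raw bytes of an uncompressed pickle.
raw = pickle.dumps(original)

# Stating compression=None explicitly, as in the answer, avoids relying
# on compression being inferred from a file name.
restored = pd.read_pickle(io.BytesIO(raw), compression=None)
print(restored)
```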

Iterating through excel files and capturing a specific cell value in each file

I have a directory of participation forms (as excel files) from clients, and I want to write a script that will grab all of the relevant cells from the participation form and write them to an excel doc where each client is on its own row. When I try and iterate through the directory using the following code:
import os
import xlrd
import xlwt
from xlrd import open_workbook
from xlwt import easyxf
import pandas as pd
from pandas import np
import csv
for i in os.listdir("filepath"):
    book=xlrd.open_workbook("filepath",i)
    print book
    sheet=book.sheet_by_index(0)
    a1=sheet.cell_value(rowx=8, colx=3)
    print a1
I get the error: IOError: [Errno 13] Permission denied: 'filepath'
EDIT Here is the Full Traceback after making edits suggested by Steven Rumbalski:
Traceback (most recent call last):
File "C:\Users\Me\Desktop\participation_form.py", line 11, in <module>
book=xlrd.open_workbook(("Y:/Directory1/Directory2/Signup/", i))
File "c:\python27\lib\site-packages\xlrd\__init__.py", line 394, in open_workbook
f = open(filename, "rb")
TypeError: coercing to Unicode: need string or buffer, tuple found
xlrd.open_workbook expects its first argument to be a full path to a file. You are trying to open the folder and not the file. You need to join the filepath and the filename. Do
book = xlrd.open_workbook(os.path.join("filepath", i))
You may also want to guard against trying to open things that are not Excel files. You could add this as the first line of your loop:
if not i.endswith((".xls", ".xlsx")): continue
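The two fixes above (joining the folder with the file name, and skipping non-Excel entries) can be combined into one helper. This is a sketch only: `list_excel_files` is a hypothetical name, and lowercasing the file name before the extension check is an added assumption so that names such as FORM.XLSX are also matched.

```python
import os
import tempfile


def list_excel_files(folder):
    """Return full paths of the .xls/.xlsx files directly inside folder."""
    paths = []
    for name in sorted(os.listdir(folder)):
        # Skip anything that is not an Excel workbook by extension.
        if not name.lower().endswith((".xls", ".xlsx")):
            continue
        # Join folder and file name so the result is a usable path.
        paths.append(os.path.join(folder, name))
    return paths


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as folder:
    for name in ("client_a.xlsx", "client_b.xls", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(p) for p in list_excel_files(folder)]

print(found)
```

Each returned path can then be handed to xlrd.open_workbook inside the loop.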
You can simplify all of this with the glob module and the .read_excel() method in pandas (which you already seem to be importing). The following iterates over all the files in some directory that match "*.xlsx", parses them into data frames, and prints out the contents of the appropriate cell.
from glob import glob
import pandas as pd
for f in glob("/my/path/to/files/*.xlsx"):
    print(pd.read_excel(f).iloc[8, 3])
