My .xlsb file fetches real-time data from a third-party service (Bloomberg) and performs calculations on it. After the calculations finish in Excel, how do I import the file into Python?
I tried the methods I found online, but none of them worked: they returned NA for the cells that required real-time calculations.
Try the latest xlsb2xlsx package on PyPI:
pip install xlsb2xlsx
python -m xlsb2xlsx /filepath_with_xlsb_file
Then you can use pandas with something like:
import pandas as pd
df = pd.read_excel('your_filepath.xlsx')
And work with the df object from there.
See https://pypi.org/project/xlsb2xlsx/ for more info.
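If you would rather skip the conversion step, pandas can also read .xlsb files directly through the pyxlsb engine, with the caveat that it only sees the values Excel last saved, so the workbook must be saved after the real-time calculations finish. A minimal sketch:
# A minimal sketch, assuming the workbook was saved after Excel finished recalculating.
# Requires: pip install pyxlsb
import pandas as pd

# engine="pyxlsb" reads the cached cell values stored in the binary workbook;
# it does not re-evaluate formulas or live (RTD) feeds.
df = pd.read_excel('your_filepath.xlsb', engine='pyxlsb')
print(df.head())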
I have to read input from a text file listing a sector and all of its sales, store it in a 2D array, and be able to write a function that averages the data.
I have done this in Java, but I want to know how to do it in Python.
I suggest using the pandas library in Python. It has several handy features, for example the DataFrame (a 2D, array-like structure that allows all sorts of manipulations and calculations).
You can install it using pip from your command line with the following command:
python3 -m pip install pandas
Your code should look something like this:
import pandas as pd
# Which pandas function you use depends on the file type you're working with;
# for a plain text or CSV file, pd.read_csv("sector_data.txt") would be the usual choice.
df = pd.read_excel("sector_data.xlsx")
# If there's a dividends column in your data, the following would calculate the mean.
dividend_mean = df["Dividends"].mean()
I suggest reading some of the pandas documentation; it's a powerful library.
https://pandas.pydata.org/docs/user_guide/10min.html
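For the plain text file in the question, a minimal sketch (the file name, delimiter, and column names here are assumptions, since the actual layout wasn't shown):
import pandas as pd

# Hypothetical layout: whitespace-separated "Sector" and "Sales" columns.
df = pd.read_csv("sector_data.txt", sep=r"\s+", names=["Sector", "Sales"])

# The underlying 2D array, if you really need one.
data_2d = df.values

# Average sales overall, and per sector.
overall_average = df["Sales"].mean()
average_by_sector = df.groupby("Sector")["Sales"].mean()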
I am a newbie to JSON and Python programming, and I am looking for some help converting the JSON data below (Spotify related-artist information) so that it can be loaded into an Excel or CSV file.
Expected output columns:
[screenshot: expected output columns]
[screenshot: JSON related-artist information]
You can use pandas in this case. Here's an example:
import pandas
pandas.read_json("spotify.json").to_excel("spotify.xlsx")
Basically, you need to install pandas first with this command:
pip install pandas
Then you can use it right away as suggested above. Remember to put the file in the same directory as your script if you don't want to use a full path, or change your current working directory like this:
import os
import sys  # needed for sys.argv below

# Change the working directory to the folder containing the script.
path = os.path.abspath(os.path.dirname(sys.argv[0]))
os.chdir(path)
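If the JSON is nested (the Spotify related-artists response usually wraps everything in an "artists" list), read_json alone may not flatten it into the columns you expect. A sketch using pandas.json_normalize, assuming that structure; the column names are assumptions too:
import json
import pandas as pd

with open("spotify.json", encoding="utf-8") as f:
    payload = json.load(f)

# Assumes the file looks like {"artists": [{...}, {...}]}; adjust the key if yours differs.
df = pd.json_normalize(payload["artists"])

# Keep only the columns you need (these names are an assumption).
df = df[["name", "id", "popularity"]]
df.to_excel("spotify.xlsx", index=False)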
Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to understand them just to see the contents of a local ORC file in Python?
The filename is someFile.snappy.orc
I can see online that spark.read.orc('someFile.snappy.orc') works, but even after import pyspark it throws an error.
I haven't been able to find any great options; there are a few dead projects trying to wrap the Java reader. However, pyarrow does have an ORC reader that doesn't require pyspark. It's a bit limited, but it works.
import pandas as pd
import pyarrow.orc as orc
with open(filename, 'rb') as file:  # the ORC reader expects a binary file handle
    data = orc.ORCFile(file)
    df = data.read().to_pandas()
In case import pyarrow.orc as orc does not work (it did not work for me on Windows 10), you can read the file into a Spark DataFrame and then convert it to a pandas DataFrame:
import findspark
findspark.init()  # must run before importing pyspark so it can be found on the path
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_spark = spark.read.orc('example.orc')
df_pandas = df_spark.toPandas()
Starting with pandas 1.0.0, there is a built-in function for this in pandas:
https://pandas.pydata.org/docs/reference/api/pandas.read_orc.html
import pandas as pd
import pyarrow.orc
df = pd.read_orc('/tmp/your_df.orc')
Be sure to read this warning about dependencies; the function might not work on Windows:
https://pandas.pydata.org/docs/getting_started/install.html#install-warn-orc
"If you want to use read_orc(), it is highly recommended to install pyarrow using conda."
ORC, like Avro and Parquet, is a format specifically designed for massive storage. You can think of these files "like a CSV": they all contain data, each with its own particular structure (different from CSV or JSON, of course!).
Reading an ORC file with pyspark should be easy, as long as your environment has Hive support.
To answer your question, I'm not sure you will be able to read it in a local environment without Hive; I've never done it (you can do a quick test with the following code):
Loads ORC files, returning the result as a DataFrame.
Note: Currently ORC support is only available together with Hive support.
>>> df = spark.read.orc('python/test_support/sql/orc_partitioned')
Hive is a data-warehouse system that lets you query your data on HDFS (a distributed file system) through MapReduce, much like a traditional relational database (with SQL-like queries, though it doesn't support 100% of the standard SQL features!).
Edit: try the following to create a new Spark session. Not to be rude, but I suggest you follow one of the many PySpark tutorials to understand the basics of this "world". Everything will be much clearer.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
The easiest way is to use pyorc:
import pyorc
import pandas as pd
with open(r"my_orc_file.orc", "rb") as orc_file:
reader = pyorc.Reader(orc_file)
orc_data = reader.read()
orc_schema = reader.schema
columns = list(orc_schema.fields)
df = pd.DataFrame(data=orc_data, columns=columns)
I did not want to submit a Spark job just to read local ORC files, nor to depend on pandas. This worked for me.
import pyarrow.orc as orc
data_reader = orc.ORCFile("/path/to/orc/part_file.zstd.orc")
data = data_reader.read()
source = data.to_pydict()
I am having a problem reading an Excel file from a download link using pandas. The excelString below loads correctly and looks like an Excel file, but when I try to parse it with pandas it says the file name is too long. Any assistance would be appreciated. This is a useful generic problem to solve for anyone accessing iShares index-membership info.
import urllib.request
import pandas as pd
f = urllib.request.urlopen('https://www.ishares.com/us/239714/fund-download.dl')
excelString = f.read().decode('utf-8')
pd.ExcelFile(excelString)
The Error returned is OSError: [Errno 36] File name too long
Works fine for me using Python3 and pandas 0.16.2 - do you have the latest version?
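For what it's worth, the error usually means pandas is treating the downloaded string as a file path. A sketch of one workaround that wraps the raw bytes in an in-memory buffer instead, assuming the download really is an .xlsx workbook:
import io
import urllib.request
import pandas as pd

# Download the raw bytes; do not decode to text, since .xlsx is a binary format.
with urllib.request.urlopen('https://www.ishares.com/us/239714/fund-download.dl') as f:
    raw = f.read()

# A file-like buffer keeps pandas from interpreting the content as a path.
xls = pd.ExcelFile(io.BytesIO(raw))
df = xls.parse(xls.sheet_names[0])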
In the lab that I work in, we process a lot of data produced by a 96 well plate reader. I'm trying to write a script that will perform a few calculations and output a bar graph using matplotlib.
The problem is that the plate reader outputs data into a .xlsx file. I understand that some modules like pandas have a read_excel function; can you explain how I should go about reading the Excel file and putting it into a DataFrame?
Thanks
Data sample of a 24 well plate (for simplicity):
0.0868 0.0910 0.0912 0.0929 0.1082 0.1350
0.0466 0.0499 0.0367 0.0445 0.0480 0.0615
0.6998 0.8476 0.9605 0.0429 1.1092 0.0644
0.0970 0.0931 0.1090 0.1002 0.1265 0.1455
I'm not exactly sure what you mean when you say array, but if you mean a matrix, you might be looking for:
import pandas as pd
df = pd.read_excel([path here])
df.to_numpy()  # formerly df.as_matrix(), which was deprecated and later removed
This returns a numpy.ndarray type.
This task is super easy in Pandas these days.
import pandas as pd
df = pd.read_excel('file_name_here.xlsx', sheet_name='Sheet1')
or
df = pd.read_csv('file_name_here.csv')
This returns a pandas.DataFrame object which is very powerful for performing operations by column, row, over an entire df, or over individual items with iterrows. Not to mention slicing in different ways.
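Since the end goal was a bar graph of the plate-reader data, here is a rough sketch of the full path from .xlsx to matplotlib; the file name, the assumption of no header row, and the choice to average each plate row are all illustrative:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file holding the absorbance values shown above, with no header row.
df = pd.read_excel("plate_data.xlsx", header=None)

# Example calculation: mean absorbance per row of the plate.
row_means = df.mean(axis=1)

# Bar graph of the per-row means.
plt.bar(range(len(row_means)), row_means)
plt.xlabel("Plate row")
plt.ylabel("Mean absorbance")
plt.title("Mean absorbance per plate row")
plt.show()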
There is the awesome xlrd package, with a quick-start example in its documentation.
You can just google it to find code snippets. I have never used pandas' read_excel function, but xlrd covers all my needs, and can offer even more, I believe.
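A short sketch of what that looks like with xlrd; note that xlrd 2.0+ dropped .xlsx support, so this assumes an older xlrd or an .xls file, and the file name is made up:
import xlrd

# Open the workbook and grab the first sheet.
workbook = xlrd.open_workbook("plate_data.xls")
sheet = workbook.sheet_by_index(0)

# Read every row into a list of lists (a simple 2D array of cell values).
data = [sheet.row_values(row_index) for row_index in range(sheet.nrows)]
print(data[0])  # first row of values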
You could also try it with my wrapper library, which uses xlrd as well:
import pyexcel as pe # pip install pyexcel
import pyexcel.ext.xls # pip install pyexcel-xls
your_matrix = pe.get_array(file_name=path_here) # done