converting a whole table/dataframe in pyarrow to dictionary-encoded columns - python

I am loading a Parquet file with Apache Arrow (pyarrow), and so far I have to transfer it to pandas, convert the string columns to categorical, and send it back as an Arrow table (to save it later as a Feather file).
The code looks like this:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as ft

df = pq.read_table(inputFile)
# convert to pandas
df2 = df.to_pandas()
# get all cols that need to be transformed and cast them
list_str_obj_cols = df2.columns[df2.dtypes == "object"].tolist()
for str_obj_col in list_str_obj_cols:
    df2[str_obj_col] = df2[str_obj_col].astype("category")
print(df2.dtypes)
# get back from pandas to arrow
table = pa.Table.from_pandas(df2)
# write the file to the filesystem
ft.write_feather(table, outputFile, compression='lz4')
Is there any way to do this directly with pyarrow? Would it be faster?
Thanks in advance

In pyarrow, "categorical" is referred to as "dictionary encoded", so I think your question is whether it is possible to dictionary-encode the columns of an existing table. You can use the pyarrow.compute.dictionary_encode function to do this. Putting it all together:
import pyarrow as pa
import pyarrow.compute as pc

def dict_encode_all_str_columns(table):
    new_arrays = []
    for index, field in enumerate(table.schema):
        if field.type == pa.string():
            new_arr = pc.dictionary_encode(table.column(index))
            new_arrays.append(new_arr)
        else:
            new_arrays.append(table.column(index))
    return pa.Table.from_arrays(new_arrays, names=table.column_names)

table = pa.Table.from_pydict({'int': [1, 2, 3, 4], 'str': ['x', 'y', 'x', 'y']})
print(table)
print(dict_encode_all_str_columns(table))
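Applied to your files, a minimal sketch of the full round trip without going through pandas (reusing the inputFile/outputFile names from your question):
import pyarrow.parquet as pq
import pyarrow.feather as ft

# read parquet, dictionary-encode the string columns, write feather with lz4
table = pq.read_table(inputFile)
table = dict_encode_all_str_columns(table)
ft.write_feather(table, outputFile, compression='lz4')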

Related

Is there a way to overwrite existing data using pandas to_parquet with partitions?

I'm using pandas to write a parquet file using the to_parquet function with partitions. Example:
df.to_parquet('gs://bucket/path', partition_cols=['key'])
The issue is that every time I run the code, it adds a new parquet file in the partition, and when you read the data you get all the data from each time the script was run. Essentially, the data appends each time.
Is there a way to overwrite the data every time you write using pandas?
I have found dask to be helpful for reading and writing parquet. It defaults the file name on write (which you can alter) and will replace the parquet file if you use the same name, which I believe is what you are looking for. You can append data to the partition by setting 'append' to True, which is more intuitive to me, or you can set 'overwrite' to True, which will remove all files in the partition/folder prior to writing the file. Reading parquet works well too; the partition columns are included in the dataframe on read.
https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html
See below some code I used to satisfy myself of the behaviour of dask.dataframe.to_parquet:
import pandas as pd
from dask import dataframe as dd
import numpy as np

dates = pd.date_range("2015-01-01", "2022-06-30")
df_len = len(dates)
df_1 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_2 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_1["date"] = dates
df_1["YEAR"] = df_1["date"].dt.year
df_1["MONTH"] = df_1["date"].dt.month
df_2["date"] = dates
df_2["YEAR"] = df_2["date"].dt.year
df_2["MONTH"] = df_2["date"].dt.month

ddf_1 = dd.from_pandas(df_1, npartitions=1)
ddf_2 = dd.from_pandas(df_2, npartitions=1)

name_function = lambda x: f"monthly_data_{x}.parquet"

ddf_1.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_1.head())

ddf_first_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_first_write.head())

ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_2.head())

ddf_second_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_second_write.head())
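And a minimal sketch of the overwrite behaviour described above, using the same hypothetical folder and name_function as in the example (overwrite is documented in the linked dask.dataframe.to_parquet page):
ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
    overwrite=True,  # clears existing files under dask_test_folder before writing
)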
Yeah, there is. If you read the pandas docs you'll see that to_parquet supports **kwargs and uses engine='pyarrow' by default. That takes you to the pyarrow docs, where you'll see there are two ways of doing this. One is partition_filename_cb, which needs legacy dataset support and will be deprecated.
The other is basename_template, which is the new way, introduced because of the performance cost of running a callable/lambda to name each partition. You need to pass a string such as "string_{i}". It only works with legacy support off, and the saved files will be "string_0", "string_1", ...
You can't use both at the same time.
import pandas as pd

def write_data(
    df: pd.DataFrame,
    path: str,
    file_format="csv",
    comp_zip="snappy",
    index=False,
    partition_cols: list[str] = None,
    basename_template: str = None,
    storage_options: dict = None,
    **kwargs,
) -> None:
    # dispatch to the matching pandas writer, e.g. DataFrame.to_parquet
    getattr(pd.DataFrame, f"to_{file_format}")(
        df,
        f"{path}.{file_format}",
        compression=comp_zip,
        index=index,
        partition_cols=partition_cols,
        basename_template=basename_template,
        storage_options=storage_options,
        **kwargs,
    )
Try this.
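A hypothetical call, following the claim above that basename_template is forwarded through **kwargs to pyarrow (the bucket path, partition column and template name here are made up):
write_data(
    df,
    "gs://bucket/path",
    file_format="parquet",
    partition_cols=["key"],
    basename_template="part-{i}",
    storage_options={"token": "credentials.json"},
)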

Saving results of df.show()

I need to capture the contents of a field from a table in order to append it to a filename. I have sorted out the renaming process. Is there any way I can save the output of the following in order to append it to the renamed file? I can't use Scala; it has to be in Python.
df = sqlContext.sql("select replace(value,'-','') from dbsmets1mig02_technical_build.tbl_Tech_admin_data where type = 'Week_Starting'")
df.show()
Have you tried using indexing?
import pandas as pd

df = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start': ['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration': [2, 3, 5, 6],
    'Hrs': [0.6, 1, 1.2, 0.3]})
df.iloc[3][1]
The syntax is df.iloc[<index of row containing desired entry>][<position of your entry>]
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
You could convert the df object into a pandas DataFrame/Series object, and then use other Python commands more easily on it:
pandasdf = df.toPandas()
Once you have this as a pandas data frame - say it looks something like this:
import pandas as pd
pandasdf = pd.DataFrame({"col1" : ["20191122"]})
Then you can pull out the string and use f-strings to join it into a filename:
date = pandasdf.iloc[0, 0]
filename = f"my_file_{date}.csv"
Then we have the filename object as 'my_file_20191122.csv'

Combining two FITS tables on a single column entry using Python

I read in two .FITS tables and place them into "list_a" and "list_b". List_b is a subset of List_a, but has some additional columns, e.g. "age", that I'd like to add to my output. This is the current way I'm doing things:
import numpy as np

file = open("MyFile.txt", "w+")
for ii in range(100000):
    name = list_B[np.where(list_A['NAME'][ii] == list_B['NAME'])]['NAME']
    thing_from_b = list_B[np.where(list_A['NAME'][ii] == list_B['NAME'])]['AGE']
    if len(name) > 0:
        file.write(" {} {} \n".format(list_A['NAME'][ii], thing_from_b))
file.close()
but it is so slow and clunky that I'm sure there must be a better, more Pythonic method.
Assuming "List_a" and "List_b" are both tables, and that you want to get the "ages" from "List_b" for the entries that appear in both "List_a" and "List_b", you could use Pandas as in your approach. But Astropy also has a built-in join operation for tables.
So I'm guessing you have something akin to:
>>> from astropy.table import Table
>>> tab_a = Table({'NAME': ['A', 'B', 'C']})
>>> tab_b = Table({'NAME': ['A', 'C', 'D'], 'AGE': [1, 3, 4]})
If you are reading from a FITS file you can use, for example Table.read to read a FITS table into a Table object (among other approaches).
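For example (the .fits filenames here are hypothetical):
>>> tab_a = Table.read('list_a.fits')
>>> tab_b = Table.read('list_b.fits')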
Then you can use join to join the two tables where their name is the same:
>>> from astropy.table import join
>>> tab_c = join(tab_a, tab_b, keys='NAME')
>>> tab_c
<Table length=2>
NAME  AGE
str1 int64
---- -----
   A     1
   C     3
I think that may be what you're asking.
You could then write this out to an ASCII format (similar to your example) like:
>>> import sys
>>> tab_c.write(sys.stdout, format='ascii.no_header')
A 1
C 3
(Here you could replace sys.stdout with a filename; I was just using it for demonstration purposes). As you can see there are many built-in output formats for Tables, though you can also define your own.
There are lots of goodies like this already in Astropy that should save you in many cases from reinventing the wheel when it comes to table manipulation and file format handling--just peruse the docs to get a better feel :)
Turns out that turning the lists into DataFrames and then doing a pandas merge does work very well:
import numpy as np
import pandas as pd
from pandas import DataFrame
from astropy.table import Table

list_a_table = Table(list_a)
list_a_df = DataFrame(np.array(list_a_table))
list_b_table = Table(list_b)
list_b_df = DataFrame(np.array(list_b_table))
df_merge = pd.merge(list_a_df, list_b_df, on="name")

Dask Parquet loading files with data schema

This is a question related to this post.
I am experimenting with Dask and Parquet files. I loaded the New York parking violations data I downloaded here.
I read the data files, find the common columns, apply datatypes, and afterwards save everything as a parquet collection:
from dask import dataframe as dd
from dask.diagnostics import ProgressBar
import numpy as np

base_url = 'origin/nyc-parking-tickets/'

fy14 = dd.read_csv(base_url + '*2014_.csv')
fy15 = dd.read_csv(base_url + '*2015.csv')
fy16 = dd.read_csv(base_url + '*2016.csv')
fy17 = dd.read_csv(base_url + '*2017.csv')

data = [fy14, fy15, fy16, fy17]
col_set = [set(d.columns) for d in data]
common_columns = list(set.intersection(*col_set))

# Set proper column types
dtype_tuples = [(x, np.str) for x in common_columns]
dtypes = dict(dtype_tuples)
floats = ['Feet From Curb', 'Issuer Code', 'Issuer Precinct', 'Law Section', 'Vehicle Year', 'Violation Precinct']
ints32 = ['Street Code1', 'Street Code2', 'Street Code3', 'Summons Number']
ints16 = ['Violation Code']
for item in floats: dtypes[item] = np.float32
for item in ints32: dtypes[item] = np.int32
for item in ints16: dtypes[item] = np.int16

# Read Data
data = dd.read_csv(base_url + '*.csv', dtype=dtypes, usecols=common_columns)  # usecols not in Dask documentation, but from pandas

# Write data as parquet
target_url = 'target/nyc-parking-tickets-pq/'
with ProgressBar():
    data.to_parquet(target_url)
When I attempt to reload the data
data2 = dd.read_parquet(target_url, engine='pyarrow')
I get a ValueError, namely that some of the partitions have a different file format. Looking at the output, I can see that the 'Violation Legal Code' column is in one partition interpreted as null, presumably because the data is too sparse for sampling.
In the post with the original question two solutions are suggested. The first is about entering dummy values, the other is supplying column types when loading the data. I would like to do the latter and I am stuck.
In the dd.read_csv method I can pass the dtype argument, for which I just enter the dtypes dictionary defined above. dd.read_parquet does not accept that keyword. The documentation seems to suggest that categories is taking over that role, but even when passing categories=dtypes, I still get the same error.
How can I pass type specifications in dask.dataframe.read_parquet?
You cannot pass dtypes to read_parquet because Parquet files know their own dtypes (in CSV it is ambiguous). Dask DataFrame expects that all files of a dataset have the same schema; as of 2019-03-26, there is no support for loading data with mixed schemas.
That being said, you could do this yourself using something like Dask Delayed, do whatever manipulations you need to do on a file-by-file basis, and then convert those into a Dask DataFrame with dd.from_delayed. More information about that is here:
https://docs.dask.org/en/latest/delayed.html
https://docs.dask.org/en/latest/delayed-collections.html
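For example, a rough sketch of that file-by-file approach (assuming csv_files is a list of the per-year CSV paths, and dtypes, common_columns and target_url come from the setup in the question):
import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed
def load_one(path):
    # read a single CSV and force the sparse column to a string dtype,
    # so that every partition ends up with the same schema
    df = pd.read_csv(path, dtype=dtypes, usecols=common_columns)
    df['Violation Legal Code'] = df['Violation Legal Code'].astype(str)
    return df

parts = [load_one(f) for f in csv_files]
data = dd.from_delayed(parts)
data.to_parquet(target_url)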
It seems the problem was with the parquet engine. When I changed the code to
data.to_parquet(target_url, engine='fastparquet')
and
data2 = dd.read_parquet(target_url, engine='fastparquet')
the writing and loading worked fine.

DataFrame constructor not properly called! error

I am new to Python and I am facing a problem creating a DataFrame in the format of key and value, i.e.:
data = [{'key':'[GlobalProgramSizeInThousands]','value':'1000'},]
Here is my code:
columnsss = ['key','value'];
query = "select * from bparst_tags where tag_type = 1 ";
result = database.cursor(db.cursors.DictCursor);
result.execute(query);
result_set = result.fetchall();
data = "[";
for row in result_set:
    data += "{'value': %s , 'key': %s }," % ( row["tag_expression"], row["tag_name"] )
data += "]" ;
df = DataFrame(data , columns=columnsss);
But when I pass the data in DataFrame it shows me
pandas.core.common.PandasError: DataFrame constructor not properly called!
while if I print the data and assign the same value to the data variable manually, then it works.
You are providing a string representation of a dict to the DataFrame constructor, and not a dict itself. So this is the reason you get that error.
So if you want to use your code, you could do:
df = DataFrame(eval(data))
But it would be better not to create the string in the first place, and instead to put the data directly in a list of dicts. Something roughly like:
data = []
for row in result_set:
    data.append({'value': row["tag_expression"], 'key': row["tag_name"]})
But probably even this is not needed, as depending on what exactly is in your result_set you could probably:
provide it directly to a DataFrame: DataFrame(result_set)
or use the pandas read_sql_query function to do this for you (see the docs, and the sketch below)
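For the second option, a rough sketch (assuming db_connection is the same MySQLdb connection your cursor came from, and reusing the table and column names from your query):
import pandas as pd

query = "select tag_name, tag_expression from bparst_tags where tag_type = 1"
df = pd.read_sql_query(query, db_connection)
df = df.rename(columns={"tag_name": "key", "tag_expression": "value"})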
Just ran into the same error, but the above answer could not help me.
My code worked fine on my computer which was like this:
import pandas as pd

test_dict = {'x': '123', 'y': '456', 'z': '456'}
df = pd.DataFrame(test_dict.items(), columns=['col1', 'col2'])
However, it did not work on another platform. It gave me the same error as mentioned in the original question. I tried the code below, simply adding list() around the dictionary items, and it worked smoothly after that:
df=pd.DataFrame(list(test_dict.items()),columns=['col1','col2'])
Hopefully, this answer can help whoever ran into a similar situation like me.
import json
import pandas as pd

# Opening JSON file
f = open('data.json')

# returns JSON object as a dictionary
data1 = json.load(f)

# converting it into a dataframe
df = pd.DataFrame.from_dict(data1, orient='index')
