Looking for a Python data structure for cleaning/annotating large datasets

I'm doing a lot of cleaning, annotating and simple transformations on very large Twitter datasets (~50M messages). I'm looking for some kind of data structure that would contain column info the way pandas does, but works with iterators rather than reading the whole dataset into memory at once. I'm considering writing my own, but I wondered if there was something with similar functionality out there. I know I'm not the only one doing things like this!
Desired functionality:
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds.columns
['id', 'message']
>>> ds.iterator.next()
[2385, "Hi it's me, Sally!"]
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds_tok = get_tokens(ds)
>>> ds_tok.columns
['message_id', 'token', 'n']
>>> ds_tok.iterator.next()
[2385, "Hi", 0]
>>> ds_tok.iterator.next()
[2385, "it's", 1]
>>> ds_tok.iterator.next()
[2385, "me", 2]
>>> ds_tok.to_sql(db_info)
UPDATE: I've settled on a combination of dict iterators and pandas dataframes to satisfy these needs.
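The rough shape of what I ended up with looks something like the sketch below. This is hand-waved, not my exact implementation; get_tokens, the 'tokens' table name and the engine object are just placeholders.
import itertools
import pandas as pd

class DataStream:
    def __init__(self, columns, rows):
        self.columns = columns          # column names, like a DataFrame
        self.iterator = iter(rows)      # lazy row iterator (lists/tuples)

    def map_rows(self, func, columns):
        # func takes one row and yields zero or more output rows
        return DataStream(columns, (out for row in self.iterator for out in func(row)))

    def to_frames(self, chunksize=100000):
        # materialize the stream as pandas DataFrames, chunksize rows at a time
        while True:
            chunk = list(itertools.islice(self.iterator, chunksize))
            if not chunk:
                break
            yield pd.DataFrame(chunk, columns=self.columns)

def get_tokens(row):
    message_id, message = row
    for n, token in enumerate(message.split()):
        yield [message_id, token, n]

# ds_tok = ds.map_rows(get_tokens, columns=['message_id', 'token', 'n'])
# for frame in ds_tok.to_frames():
#     frame.to_sql('tokens', engine, if_exists='append', index=False)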

As commented, there is a chunksize argument for read_sql which means you can work on the SQL results piecemeal. I would probably use an HDFStore to save the intermediate results... or you could just append it back to another SQL table.
dfs = pd.read_sql(..., chunksize=100000)
store = pd.HDFStore("store.h5")
for df in dfs:
    clean_df = ...  # whatever munging you have to do
    store.append("df", clean_df)
(see the HDF5 section of the docs), or
dfs = pd.read_sql(..., chunksize=100000)
for df in dfs:
    clean_df = ...
    clean_df.to_sql(..., if_exists='append')
see the SQL section of the docs.

Related

Efficient way of computing a DataFrame using concat and split

I am new to Python/pandas/NumPy and I need to create the following DataFrame:
DF = pd.concat([pd.Series(x[2]).apply(lambda r: pd.Series(re.split('\#|/',r))).assign(id=x[0]) for x in hDF])
where hDF is a dataframe that has been created by:
hDF=pd.DataFrame(h.DF)
and h.DF is a list whose elements looks like this:
['5203906',
 ['highway=primary',
  'maxspeed=30',
  'oneway=yes',
  'ref=N 22',
  'surface=asphalt'],
 ['3655224911#1.735928/42.543651',
  '3655224917#1.735766/42.543561',
  '3655224916#1.735694/42.543523',
  '3655224915#1.735597/42.543474',
  '4817024439#1.735581/42.543469']]
However, in some cases the list is very long (O(10^7)) and also the list in h.DF[*][2] is very long, so I run out of memory.
I can obtain the same result, avoiding the use of the lambda function, like so:
DF = pd.concat([pd.Series(x[2]).str.split('\#|/', expand=True).assign(id=x[0]) for x in hDF])
But I am still running out of memory in the cases where the lists are very long.
Can you think of a possible solution to obtain the same results without starving resources?
I managed to make it work using the following code:
import itertools
import numpy as np
import pandas as pd

bl = []
for x in h.DF:
    # split "id#lon/lat" on "#", keep the "lon/lat" part, then split that on "/"
    data = np.loadtxt(
        np.loadtxt(x[2], dtype=str, delimiter="#")[:, 1], dtype=float, delimiter="/"
    ).tolist()
    [i.append(x[0]) for i in data]  # tack the way id onto every row
    bl.append(data)
bbl = list(itertools.chain.from_iterable(bl))
DF = pd.DataFrame(bbl).rename(columns={0: "lon", 1: "lat", 2: "wayid"})
Now it's super fast :)

Is there a way to overwrite existing data using pandas to_parquet with partitions?

I'm using pandas to write a parquet file using the to_parquet function with partitions. Example:
df.to_parquet('gs://bucket/path', partition_cols=['key'])
The issue is that every time I run the code, it adds a new parquet file to the partition, so when you read the data you get everything from each time the script was run. Essentially, the data is appended on every run.
Is there a way to overwrite the data every time you write using pandas?
I have found dask helpful for reading and writing parquet. It defaults the file name on write (which you can alter) and will replace the parquet file if you use the same name, which I believe is what you are looking for. You can append data to the partition by setting 'append' to True, which is more intuitive to me, or you can set 'overwrite' to True, which removes all files in the partition/folder before writing the file (see the short sketch after the example below). Reading parquet works well too, with the partition columns included in the dataframe on read.
https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html
See below some code I used to satisfy myself of the behaviour of dask.dataframe.to_parquet:
import pandas as pd
from dask import dataframe as dd
import numpy as np
dates = pd.date_range("2015-01-01", "2022-06-30")
df_len = len(dates)
df_1 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_2 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_1["date"] = dates
df_1["YEAR"] = df_1["date"].dt.year
df_1["MONTH"] = df_1["date"].dt.month
df_2["date"] = dates
df_2["YEAR"] = df_2["date"].dt.year
df_2["MONTH"] = df_2["date"].dt.month
ddf_1 = dd.from_pandas(df_1, npartitions=1)
ddf_2 = dd.from_pandas(df_2, npartitions=1)
name_function = lambda x: f"monthly_data_{x}.parquet"
ddf_1.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_1.head())
ddf_first_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_first_write.head())
ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_2.head())
ddf_second_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_second_write.head())
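If, instead, you want the second write to replace what the first one left behind, the overwrite flag mentioned above should be the relevant knob. A hedged sketch, reusing the names from the example (note it clears the whole output folder before writing):
ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
    overwrite=True,   # remove the existing contents of dask_test_folder first
)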
Yeah, there is. If you read the pandas docs you'll see that to_parquet supports **kwargs and uses engine='pyarrow' by default, which takes you to the pyarrow docs. There you'll see there are two ways of doing this. One is partition_filename_cb, which needs legacy dataset support and will be deprecated. The other is basename_template, which is the new way, introduced because of the performance cost of running a callable/lambda to name each partition. You pass a string like "string_{i}", and it only works with legacy support off; the saved files will be "string_0", "string_1", and so on. You can't use both at the same time.
def write_data(
    df: pd.DataFrame,
    path: str,
    file_format="csv",
    comp_zip="snappy",
    index=False,
    partition_cols: list[str] = None,
    basename_template: str = None,
    storage_options: dict = None,
    **kwargs,
) -> None:
    # Look up the matching DataFrame writer (to_csv, to_parquet, ...) by name
    getattr(pd.DataFrame, f"to_{file_format}")(
        df,
        f"{path}.{file_format}",
        compression=comp_zip,
        index=index,
        partition_cols=partition_cols,
        basename_template=basename_template,
        storage_options=storage_options,
        **kwargs,
    )
Try this.
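A hedged usage sketch of the basename_template route, assuming a reasonably recent pyarrow and that pandas forwards the extra keyword through **kwargs to pyarrow's dataset writer (the bucket path and partition column are the ones from the question):
# With a fixed basename_template the part files keep the same names on every
# run, so re-running the script overwrites them instead of stacking new files
# next to the old ones (exact behaviour depends on your pyarrow version).
df.to_parquet(
    'gs://bucket/path',
    partition_cols=['key'],
    basename_template='part-{i}.parquet',
)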

Combining two FITS tables on a single column entry using Python

I read in two .FITS tables and place them into "list_a" and "list_b". List_b is a subset of List_a, but has some additional columns, e.g. "age", that I'd like to add to my output. This is the current way I'm doing things:
file = open("MyFile.txt","w+")
for ii in range(100000):
name = list_B[np.where((list_A['NAME'][ii] == list_B['NAME']))]['NAME']
thing_from_b = list_B[np.where((list_A['NAME'][ii] == list_B['NAME']))]['AGE']
if (len(name) > 0) :
file.write(" {} {} \n".format(list_A['NAME'][ii], age )
file.close()
but it is so slow and clunky that I'm sure there must be a better, more Pythonic method.
Assume "List_a" and "List_b" are both tables, and that you want to get the "ages" from "List_b" for entries where both "List_a" and "List_b", you can use Pandas as in your approach. But Astropy also has a built-in join operation for tables.
So I'm guessing you have something akin to:
>>> from astropy.table import Table
>>> tab_a = Table({'NAME': ['A', 'B', 'C']})
>>> tab_b = Table({'NAME': ['A', 'C', 'D'], 'AGE': [1, 3, 4]})
If you are reading from a FITS file you can use, for example, Table.read to read a FITS table into a Table object (among other approaches).
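For example (the file names here are just placeholders for wherever your two tables live):
>>> from astropy.table import Table
>>> tab_a = Table.read('list_a.fits')
>>> tab_b = Table.read('list_b.fits')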
Then you can use join to join the two tables where their name is the same:
>>> from astropy.table import join
>>> tab_c = join(tab_a, tab_b, keys='NAME')
>>> tab_c
<Table length=2>
NAME  AGE
str1 int64
---- -----
   A     1
   C     3
I think that maybe be what you're asking.
You could then write this out to an ASCII format (similar to your example) like:
>>> import sys
>>> tab_c.write(sys.stdout, format='ascii.no_header')
A 1
C 3
(Here you could replace sys.stdout with a filename; I was just using it for demonstration purposes). As you can see there are many built-in output formats for Tables, though you can also define your own.
There are lots of goodies like this already in Astropy that should save you in many cases from reinventing the wheel when it comes to table manipulation and file format handling--just peruse the docs to get a better feel :)
Turns out turning the lists into DataFrames and then doing a pandas merge does work very well:
import numpy as np
import pandas as pd
from pandas import DataFrame
from astropy.table import Table
list_a_table = Table(list_a)
list_a_df = DataFrame(np.array(list_a_table))
list_b_table = Table(list_b)
list_b_df = DataFrame(np.array(list_b_table))
df_merge = pd.merge(list_a_df, list_b_df, on="name")

Dask Parquet loading files with data schema

This is a question related to this post.
I am experimenting with Dask and Parquet files. I loaded the New York parking violations data I downloaded here.
I read the data files, find the common columns, apply datatypes, and afterwards save everything as a parquet collection:
from dask import dataframe as dd
from dask.diagnostics import ProgressBar
import numpy as np
base_url = 'origin/nyc-parking-tickets/'
fy14 = dd.read_csv(base_url + '*2014_.csv')
fy15 = dd.read_csv(base_url + '*2015.csv')
fy16 = dd.read_csv(base_url + '*2016.csv')
fy17 = dd.read_csv(base_url + '*2017.csv')
data = [fy14, fy15, fy16, fy17]
col_set = [set(d.columns) for d in data]
common_columns = list(set.intersection(*col_set))
# Set proper column types
dtype_tuples = [(x, str) for x in common_columns]  # default everything to string
dtypes = dict(dtype_tuples)
floats = ['Feet From Curb', 'Issuer Code', 'Issuer Precinct', 'Law Section', 'Vehicle Year', 'Violation Precinct']
ints32 = ['Street Code1', 'Street Code2', 'Street Code3', 'Summons Number']
ints16 = ['Violation Code']
for item in floats: dtypes[item] = np.float32
for item in ints32: dtypes[item] = np.int32
for item in ints16: dtypes[item] = np.int16
# Read Data
data = dd.read_csv(base_url + '*.csv', dtype=dtypes, usecols=common_columns) # usecols not in Dask documentation, but from pandas
# Write data as parquet
target_url = 'target/nyc-parking-tickets-pq/'
with ProgressBar():
    data.to_parquet(target_url)
When I attempt to reload the data
data2 = dd.read_parquet(target_url, engine='pyarrow')
I get a ValueError, namely that some of the partitions have a different file format. Looking at the output, I can see that the 'Violation Legal Code' column is in one partition interpreted as null, presumably because the data is too sparse for sampling.
In the post with the original question two solutions are suggested. The first is about entering dummy values, the other is supplying column types when loading the data. I would like to do the latter and I am stuck.
In the dd.read_csv method I can pass the dtype argument, for which I just use the dtypes dictionary defined above. dd.read_parquet does not accept that keyword. The documentation seems to suggest that categories is taking over that role, but even when passing categories=dtypes, I still get the same error.
How can I pass type specifications in dask.dataframe.read_parquet?
You cannot pass dtypes to read_parquet because Parquet files know their own dtypes (with CSV it is ambiguous). Dask DataFrame expects all files of a dataset to have the same schema; as of 2019-03-26, there is no support for loading data with mixed schemas.
That being said, you could do this yourself using something like Dask Delayed: do whatever manipulations you need on a file-by-file basis, and then convert those into a Dask DataFrame with dd.from_delayed (a rough sketch follows the links below). More information about that is here:
https://docs.dask.org/en/latest/delayed.html
https://docs.dask.org/en/latest/delayed-collections.html
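A minimal sketch of that approach, reusing target_url and the dtypes dict from the question; the part-file glob pattern and the astype coercion are my assumptions, not a tested fix:
import dask
import dask.dataframe as dd
import pandas as pd
from glob import glob

@dask.delayed
def load_one(path):
    # read a single part file with plain pandas, then force the desired schema
    df = pd.read_parquet(path, engine='pyarrow')
    return df[list(dtypes)].astype(dtypes)

files = sorted(glob(target_url + 'part.*.parquet'))  # dask's default part naming
parts = [load_one(f) for f in files]
# meta describes the schema so dask doesn't have to sample the data
meta = pd.DataFrame({col: pd.Series(dtype=t) for col, t in dtypes.items()})
data2 = dd.from_delayed(parts, meta=meta)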
It seems the problem was with the parquet engine. When I changed the code to
data.to_parquet(target_url, engine='fastparquet')
and
data2 = dd.read_parquet(target_url, engine='fastparquet')
the writing and loading worked fine.

How to use pandas apply instead of a list loop?

I have a dataframe like this:
df = pd.DataFrame({'market': ['ES', 'UK', 'DE'],
                   'provider': ['A', 'B', 'C'],
                   'item': ['X', 'Y', 'Z']})
Then I have a list with the providers and the following loop:
providers_list = ['A','B','C']
for provider in providers_list:
    a = df.loc[df['provider'] == provider]
That loop creates a dataframe for each provider, which I later write to an Excel file. I would like to use the apply function for speed. I have transformed the code like this:
providers_list = pd.DataFrame({'provider': ['A', 'B', 'C']})

def report(provider):
    a = df.loc[df['provider'] == provider]

providers_list.apply(report)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py",
line 1190, in wrapper
raise ValueError("Can only compare identically-labeled "
ValueError: ('Can only compare identically-labeled Series objects',
'occurred at index provider')
Thanks
The apply method is generally inefficient. It's nothing more than a glorified loop with some extra functionality. Instead, you can use GroupBy to cycle through each provider:
for prov, df_prov in df.groupby('provider'):
df_prov.to_excel(f'{prov}.xlsx', index=False)
If you only want to include a pre-defined list of providers in your output, you can define a GroupBy object and iterate your list instead:
providers_list = ['A', 'B', 'C']
grouper = df.groupby('provider')
for prov in providers_list:
    grouper.get_group(prov).to_excel(f'{prov}.xlsx', index=False)
If you're interested in the speed of your process as a whole, I strongly advise you avoid Excel: exporting to csv, csv.gz or pkl will all be much more efficient. For large datasets, filtering your dataframe is unlikely to be the bottleneck; writing to Excel is.
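For example, the same grouping loop writing gzip-compressed CSVs instead (compression is inferred from the extension):
for prov, df_prov in df.groupby('provider'):
    df_prov.to_csv(f'{prov}.csv.gz', index=False)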
This worked for me with a million entries for each provider in under a second:
import pandas as pd
from tqdm import tqdm
tqdm.pandas(desc="Progress:")
df = pd.DataFrame({'market': ['ES', 'UK', 'DE'] * 1000000,
                   'provider': ['A', 'B', 'C'] * 1000000,
                   'item': ['X', 'Y', 'Z'] * 1000000})
grouped = df.groupby("provider")
providers_list = ['A','B','C']
for prov in tqdm(providers_list):
    frame_name = prov
    globals()[frame_name] = pd.DataFrame(grouped.get_group(prov))
print(A)
print(B)
print(C)
100%|██████████| 3/3 [00:00<00:00, 9.59it/s]
