I would like to save multiple parquet files from a Dask dataframe, one parquet file for each unique value in a specific column. Hence, the number of parquet files should be equal to the number of unique values in that column.
For example, given the following dataframe, I want to save four parquet files, because there are four unique values in column "A".
import pandas as pd
from dask import dataframe as dd
df = pd.DataFrame(
{
"A": [1, 1, 2, 3, 1, 3, 6, 6],
"B": ["A", "L", "C", "D", "A", "B", "A", "B"],
"C": [1, 2, 3, 4, 5, 6, 7, 8],
}
)
ddf = dd.from_pandas(df, npartitions=2)
for i in ddf["A"].unique().compute():
    ddf.loc[ddf["A"] == i].to_parquet(f"file_{i}.parquet", schema="infer")
I am not sure if looping over the Dask dataframe is the right approach to scale this up (the result of unique().compute() could be bigger than my memory). Moreover, I am unsure whether I have to sort the data beforehand.
If you have suggestions on how to implement this properly, or things to take into account, I would be happy to hear them!
This is not exactly what you are after, but it's possible to use the partition_on option of .to_parquet:
ddf.to_parquet("file_parquet", schema="infer", partition_on="A")
Note that this does not guarantee one file per unique value as you want. Instead, there will be one subfolder per unique value inside file_parquet, each potentially containing more than one file.
You can achieve this by setting the index to the column of interest and setting the divisions to follow the unique values in that column.
This should do the trick:
import dask.dataframe as dd
import pandas as pd
import numpy as np
# create dummy dataset with 3 partitions
df = pd.DataFrame(
{"letter": ["a", "b", "c", "a", "a", "d", "d", "b", "c", "b", "a", "b", "c", "e", "e", "e"], "number": np.arange(0,16)}
)
ddf = dd.from_pandas(df, npartitions=3)
# set index to column of interest
ddf = ddf.set_index('letter').persist()
# generate sorted list of divisions (last value needs to be repeated)
# note: list.append returns None, so build the list by concatenation instead
index_values = sorted(df.letter.unique())
divisions = index_values + [index_values[-1]]
# repartition
ddf = ddf.repartition(divisions=divisions).persist()
# write out partitions as separate parquet files
for i in range(ddf.npartitions):
    ddf.partitions[i].to_parquet(f"file_{i}.parquet", engine='pyarrow')
Note the double occurrence of the value 'e' in the list of divisions. As per the Dask docs: "Divisions includes the minimum value of every partition’s index and the maximum value of the last partition’s index." This means the last value needs to be included twice since it serves as both the start of and the end of the last partition's index.
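As a small sketch of the division-building step on its own: the helper below (make_divisions is a made-up name, not a Dask function) sorts the unique values and repeats the last one. It only uses pandas, so it can be checked without running Dask.

```python
import pandas as pd


def make_divisions(values):
    """Return the sorted unique values with the last one repeated.

    Hypothetical helper: the result has the shape described above for
    `divisions` when every unique value should get its own partition.
    """
    unique_sorted = sorted(pd.Series(values).unique())
    return unique_sorted + unique_sorted[-1:]


letters = ["a", "b", "c", "a", "a", "d", "d", "b",
           "c", "b", "a", "b", "c", "e", "e", "e"]
divisions = make_divisions(letters)
print(divisions)
```

Sorting matters here because divisions have to be in increasing order, whereas unique() returns values in order of first appearance.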
I have a question about storing data from .dat files in the right row of a dataframe. Here is a minimal example.
I already have a dataframe like this:
data = {'col1': [1, 2, 3, 4],'col2': ["a", "b", "c", "d"]}
df = pd.DataFrame(data, index=['row_exp1','row_exp2','row_exp3','row_exp4'])
Now I want to add a new column called col3 with numpy arrays in each single cell. Thus, I will have 4 numpy arrays, one in every cell.
I get the numpy arrays from a .dat file.
The important part is that I have to find the right row. I have 4 .dat files, and each .dat file corresponds to a row name. For example, the first .dat file is named 230109_exp3_foo.dat, so it corresponds to the third row of my dataframe.
Then the algorithm has to put the data from the .dat file in the right cell:
          col1  col2  col3
row_exp1  1     a
row_exp2  2     b
row_exp3  3     c     [1,2,3,4,5,6]
row_exp4  4     d
The other entries should be NaN and I would fill them with the right numpy array in the next loop.
I think the difficult part is selecting the right row and matching it with the file name of the .dat file.
If you're working with time series data, this isn't how you want to structure your dataframe. Read up on "tidy" data. (https://r4ds.had.co.nz/tidy-data.html)
Every column is a variable. Every row is an observation.
So let's assume you're loading your data with a function called load_data that accepts a file name:
def load_data(filename):
    # load the data, fill in your own details
    pass
Then you would build up your dataframe like this:
meta_data = {
'col1': [1, 2, 3, 4],
'col2': ["a", "b", "c", "d"],
}
list_of_dataframes = []
for n, fname in enumerate(filenames):
    this_array = load_data(fname)
    list_of_dataframes.append(
        pd.DataFrame({
            'row_num': list(range(len(this_array))),
            'col1': meta_data['col1'][n],
            'col2': meta_data['col2'][n],
            'values': this_array,
        })
    )
df = pd.concat(list_of_dataframes, ignore_index=True)
Maybe this helps:
# Assumption: every .dat file name follows the same pattern
list_of_files = ['230109_exp3_foo.dat', '230109_exp2_foo.dat', '230109_exp1_foo.dat', '230109_exp4_foo.dat']
# for each index label, collect the files whose name contains the part after 'row_'
files_match = df.reset_index()['index'].map(lambda x: [y for y in list_of_files if x.replace('row_', '') in y])
# if I understand correctly, you already know how to read a .dat file,
# so you can insert your own function in place of function_for_reading_dat_file
df['col3'] = files_match.map(lambda x: function_for_reading_dat_file(x[0]) if len(x) != 0 else 'None')
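A minimal sketch of the matching step, assuming file names always look like 230109_exp3_foo.dat and row labels like row_exp3. The names attach_arrays and fake_loader are hypothetical, and the loader is a stand-in for whatever actually reads the .dat file. Rows without a matching file are left as None.

```python
import re

import numpy as np
import pandas as pd


def attach_arrays(df, filenames, loader, column="col3"):
    """Store loader(filename) in the row whose label matches the file name."""
    by_label = {}
    for name in filenames:
        match = re.search(r"_(exp\d+)_", name)
        if match is None:
            continue  # file name does not follow the assumed pattern
        label = f"row_{match.group(1)}"
        if label in df.index:
            by_label[label] = loader(name)
    # one cell per row, None where no file matched
    cells = [by_label.get(label) for label in df.index]
    out = df.copy()
    # dtype=object keeps each array as a single cell value
    out[column] = pd.Series(cells, index=df.index, dtype=object)
    return out


data = {"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "d"]}
df = pd.DataFrame(data, index=["row_exp1", "row_exp2", "row_exp3", "row_exp4"])


def fake_loader(name):
    # stand-in for the real .dat reader
    return np.array([1, 2, 3, 4, 5, 6])


result = attach_arrays(df, ["230109_exp3_foo.dat"], fake_loader)
print(result)
```

Extracting the exp number with a regular expression, rather than testing for a substring, avoids exp1 accidentally matching a file for exp10.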
I am working with a pandas DataFrame (df) inside a for loop, and it may have different columns on each iteration. For example, in the first iteration, df may have columns "A", "B", "C", "D", and "E". In the second iteration, df may have columns "B", "C", "E", "F", "G", and "H". I want to drop certain columns, e.g. "A", "F", and "G", from df. If I use the line below inside the for loop, the first iteration raises an error: "['F' 'G'] not found in axis."
df = df.drop(['A', 'F', 'G'], axis=1)
Similarly, the second iteration raises an error: "['A'] not found in axis." How can I solve this problem?
Try passing errors='ignore':
out = df.drop(["A","F","G"], errors = 'ignore', axis = 1)
Filter the list of columns to only include those that are actually present in the DataFrame, e.g.:
df = df.drop(df.columns.intersection(['A', 'F', 'G']), axis=1)
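A short sketch of the loop scenario from the question, with two empty frames standing in for the two iterations (the frames and variable names are made up for illustration):

```python
import pandas as pd

unwanted = ["A", "F", "G"]

# two frames with different column sets, standing in for two loop iterations
frames = [
    pd.DataFrame(columns=["A", "B", "C", "D", "E"]),
    pd.DataFrame(columns=["B", "C", "E", "F", "G", "H"]),
]

remaining = []
for df in frames:
    # errors="ignore" skips labels that are not present instead of raising
    df = df.drop(columns=unwanted, errors="ignore")
    remaining.append(list(df.columns))

print(remaining)
```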
df = pd.DataFrame([
["a", 1],
["a", 2],
["b", 5],
["b", 11]
])
df.columns=["c1","c2"]
grouped = df.groupby(["c1"])["c2"].apply(list)
grouped = grouped.reset_index()
grouped["c3"] = "[11,12]" #add list like string manually
#grouped["true_list_c2"] = grouped["c2"].apply(eval)
grouped["true_list_c3"] = grouped["c3"].apply(eval)
print(grouped)
If I try to convert the manually added column "c3" to a true Python list, it works.
But if I try the same for the aggregated column "c2", it raises an error: eval() arg 1 must be a string, bytes or code object
What is the reason? What is the difference between the "c2" and "c3" columns?
The aggregated column "c2" is a series of lists, and eval doesn't accept lists. If you cast it to str first, grouped["true_list_c2"] = grouped["c2"].apply(str).apply(eval) (just like "c3"), it works just fine. Note, though, that the values in "c2" are already real Python lists, so no conversion is actually needed.
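To make the difference between the two columns visible, here is a small sketch that inspects the element types. It uses ast.literal_eval from the standard library instead of eval, which is a substitution on my part: it parses Python literals only and does not execute arbitrary code.

```python
import ast

import pandas as pd

df = pd.DataFrame([["a", 1], ["a", 2], ["b", 5], ["b", 11]], columns=["c1", "c2"])
grouped = df.groupby("c1")["c2"].apply(list).reset_index()
grouped["c3"] = "[11,12]"  # a string that only looks like a list

# "c2" already holds real lists, so there is nothing to convert
c2_types = grouped["c2"].map(type).tolist()

# "c3" holds strings; literal_eval turns each one into a list
grouped["true_list_c3"] = grouped["c3"].apply(ast.literal_eval)
c3_types = grouped["true_list_c3"].map(type).tolist()

print(c2_types, c3_types)
```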
I have a DataFrame like the one below:
rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({"ID" : ["1", "2", "1", "2", "2"],
"category" : ["A", "B", "A", "C", "B"],
"status" : ["active", "finished", "active", "finished", "other"],
"Date": rng})
I need to create a DataFrame with 2 calculated columns:
New1 = category of the last agreement with "active" status
New2 = category of the last agreement with "finished" status
To be more precise, I give the result DataFrame below:
Assuming the dataframe is already sorted by date, we want to keep the last row where "status" == "active" and the last row where "status" == "finished". We also want to keep only the first and second columns, renaming category to "New1" for the active status and to "New2" for the finished status.
last_active = df[df.status == "active"].iloc[-1, [0, 1]].rename({"category": "New1"})
last_finished = df[df.status == "finished"].iloc[-1, [0, 1]].rename({"category": "New2"})
This gives two pandas Series that we want to concatenate side by side, then transpose to have one entry per row:
pd.concat([last_active, last_finished], axis=1, sort=False).T
Perhaps you also want to call reset_index() afterwards, to get a fresh RangeIndex in your resulting DataFrame.
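The expected result is not shown in the question, so the following is only a sketch under one possible reading: that New1 and New2 should be computed per ID rather than over the whole frame. The helper name last_category is made up.

```python
import pandas as pd

rng = pd.date_range("2020-12-01", periods=5, freq="D")
df = pd.DataFrame({
    "ID": ["1", "2", "1", "2", "2"],
    "category": ["A", "B", "A", "C", "B"],
    "status": ["active", "finished", "active", "finished", "other"],
    "Date": rng,
})


def last_category(frame, status):
    """Category of the most recent row with the given status, per ID."""
    subset = frame[frame["status"] == status].sort_values("Date")
    return subset.groupby("ID")["category"].last()


# IDs that never had a given status end up with NaN in that column
result = pd.concat(
    [
        last_category(df, "active").rename("New1"),
        last_category(df, "finished").rename("New2"),
    ],
    axis=1,
)
print(result)
```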
I have a pandas dataframe with the following columns: cust_email, transaction_id, transaction_timestamp
I want to subset the dataframe so that it includes only those email ids that have exactly one transaction (i.e. only one transaction_id and transaction_timestamp per cust_email).
You can use drop_duplicates and set parameter keep to False. If you want to drop duplicates by a specific column you can use the subset parameter:
df.drop_duplicates(subset="cust_email", keep=False)
For example
import pandas as pd
data = pd.DataFrame()
data["col1"] = ["a", "a", "b", "c", "c", "d", "e"]
data["col2"] = [1,2,3,4,5,6,7]
print(data)
print()
print(data.drop_duplicates(subset="col1", keep=False))
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
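Applied to the columns from the question, a sketch with made-up transactions might look like this. It assumes one row per transaction and also shows a groupby-based filter that gives the same rows under that assumption.

```python
import pandas as pd

# made-up transactions for illustration
df = pd.DataFrame({
    "cust_email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
    "transaction_id": [101, 102, 103, 104],
    "transaction_timestamp": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
    ),
})

# keep=False drops every row whose email appears more than once
single = df.drop_duplicates(subset="cust_email", keep=False)

# alternative: count rows per email and keep only the groups of size 1
rows_per_email = df.groupby("cust_email")["transaction_id"].transform("size")
single_alt = df[rows_per_email == 1]

print(single)
```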