Subset pandas dataframe - python

I have a pandas dataframe with the following columns: cust_email, transaction_id, transaction_timestamp.
I want to subset the dataframe to include only those email ids which have exactly one transaction (i.e. only one transaction_id and transaction_timestamp per cust_email).

You can use drop_duplicates and set parameter keep to False. If you want to drop duplicates by a specific column you can use the subset parameter:
df.drop_duplicates(subset="cust_email", keep=False)
For example
import pandas as pd
data = pd.DataFrame()
data["col1"] = ["a", "a", "b", "c", "c", "d", "e"]
data["col2"] = [1,2,3,4,5,6,7]
print(data)
print()
print(data.drop_duplicates(subset="col1", keep=False))
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
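Applied to the columns from the question, this could look like the sketch below. The sample rows are made up for illustration, and the groupby variant is an alternative formulation added for comparison, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "cust_email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "c@x.com", "d@x.com"],
    "transaction_id": [101, 102, 103, 104, 105, 106],
    "transaction_timestamp": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-03",
        "2021-01-04", "2021-01-05", "2021-01-06",
    ]),
})

# keep=False drops every row whose cust_email occurs more than once,
# so only customers with a single transaction remain.
single = df.drop_duplicates(subset="cust_email", keep=False)

# Alternative: keep rows whose cust_email group has exactly one row.
by_count = df[df.groupby("cust_email")["cust_email"].transform("size") == 1]

print(single)
```

Both variants keep the original row index, so the two results can be compared directly.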

Related

How to use slice to exclude rows and columns from dataframe

I have a DataFrame
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product([["A", "B"], ["AA", "BB"]])
columns = pd.MultiIndex.from_product([["X", "Y"], ["XX", "YY"]])
df = pd.DataFrame([[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]], index = index, columns = columns)
and slice
toSkip = ((slice(None), slice(None)), (["X"], slice(None)))
I know that I can write df.loc[slice] to get the subset of the DataFrame that corresponds to this slice. But how can I do the opposite, i.e. get the difference between the original df and the part obtained with that slice?
How to invert slicing
To get the idea let's make it more complicated.
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product([["A", "B", "C"], ["AA", "BB", "CC"]])
columns = pd.MultiIndex.from_product([["X", "Y", "Z"], ["XX", "YY", "ZZ"]])
data = (
np
.arange(len(index) * len(columns))
.reshape(len(index), len(columns))
)
df = pd.DataFrame(data, index, columns)
Let's say I want to process all the data except the inner square (B,Y).
I can get the square by slicing. To get others I'm gonna use a boolean mask:
mask = pd.DataFrame(True, index, columns)
toSkip = ((['B'], slice(None)), (['Y'], slice(None)))
mask.loc[toSkip] = False
Now I can transform everything else by indexing with the mask:
# just for illustration purposes
# let's invert the sign of numbers
df[mask] *= -1
Here's the output:
If slice is a boolean Series, the logical negation operator ~ gives the opposite of the condition. So,
df[~slice]
will return the rows that don't satisfy the condition slice.
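A minimal sketch of the negation idea on a small, made-up frame. The name cond is used instead of slice only to avoid shadowing Python's built-in slice.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "value": [1, 5, 10, 20]})

# A boolean Series aligned with the rows of df.
cond = df["value"] > 4

selected = df[cond]   # rows where the condition holds
rest = df[~cond]      # rows where it does not

print(selected)
print(rest)
```

Together, selected and rest cover every row of df exactly once.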
Not sure if this is what you want, but you can drop the index and columns of the DataFrame selected by toSkip:
toSkip = ((slice(None), slice(None)), (["X"], slice(None)))
tmp = df.loc[toSkip]
out = df.drop(index=tmp.index, columns=tmp.columns)
print(out)
Empty DataFrame
Columns: [(Y, XX), (Y, YY)]
Index: []

Drop all the columns of pandas DataFrame whose names match with the names given in a list

I am working with a pandas DataFrame (df) inside a for loop, and its columns may differ between iterations. For example, in the first iteration df may have columns "A", "B", "C", "D", and "E"; in the second, "B", "C", "E", "F", "G", and "H". I want to drop certain columns, e.g. "A", "F", and "G", from df. If I use the line below inside the loop, the first iteration raises an error: "['F' 'G'] not found in axis."
df = df.drop(['A', 'F', 'G'], axis=1)
Similarly, the second iteration raises an error: "['A'] not found in axis." How can I solve this problem?
Try passing errors='ignore':
out = df.drop(["A","F","G"], errors = 'ignore', axis = 1)
Filter the list of columns to only include those that are actually present in the DataFrame, eg:
df = df.drop(df.columns.intersection(['A', 'F', 'G']), axis=1)
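Both suggestions can be checked side by side. The sketch below uses two empty, hypothetical frames standing in for the two loop iterations from the question.

```python
import pandas as pd

to_drop = ["A", "F", "G"]

# Two hypothetical frames standing in for two iterations of the loop.
frames = [
    pd.DataFrame(columns=["A", "B", "C", "D", "E"]),
    pd.DataFrame(columns=["B", "C", "E", "F", "G", "H"]),
]

results_ignore = []
results_intersection = []
for df in frames:
    # Option 1: let drop() silently skip labels that are not present.
    out1 = df.drop(columns=to_drop, errors="ignore")
    # Option 2: drop only the labels that actually exist in this frame.
    out2 = df.drop(columns=df.columns.intersection(to_drop))
    results_ignore.append(list(out1.columns))
    results_intersection.append(list(out2.columns))

print(results_ignore)
print(results_intersection)
```

Neither option raises, and both leave the same columns in each iteration.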

Save multiple parquet files from dask dataframe

I would like to save multiple parquet files from a Dask dataframe, one parquet file for each unique value in a specific column. Hence, the number of parquet files should be equal to the number of unique values in that column.
For example, given the following dataframe, I want to save four parquet files, because there are four unique values in column "A".
import pandas as pd
from dask import dataframe as dd
df = pd.DataFrame(
{
"A": [1, 1, 2, 3, 1, 3, 6, 6],
"B": ["A", "L", "C", "D", "A", "B", "A", "B"],
"C": [1, 2, 3, 4, 5, 6, 7, 8],
}
)
ddf = dd.from_pandas(df, npartitions=2)
for i in ddf["A"].unique().compute():
    ddf.loc[ddf["A"] == i].to_parquet(f"file_{i}.parquet", schema="infer")
I am not sure if looping over the Dask dataframe is the right approach to scale this up (probably the unique().compute() can be bigger than my memory). Moreover I am unsure if I have to order beforehand.
If you have some suggestions how to properly implement this or things to take into account, I would be happy!
This is not exactly what you are after, but it's possible to use partition_on option of .to_parquet:
ddf.to_parquet("file_parquet", schema="infer", partition_on="A")
Note that this does not guarantee one file per partition as you want, instead there will be subfolders inside file_parquet, containing potentially more than one file.
You can achieve this by setting the index to the column of interest and setting the divisions to follow the unique values in that column.
This should do the trick:
import dask.dataframe as dd
import pandas as pd
import numpy as np
# create dummy dataset with 3 partitions
df = pd.DataFrame(
{"letter": ["a", "b", "c", "a", "a", "d", "d", "b", "c", "b", "a", "b", "c", "e", "e", "e"], "number": np.arange(0,16)}
)
ddf = dd.from_pandas(df, npartitions=3)
# set index to column of interest
ddf = ddf.set_index('letter').persist()
# generate list of divisions: sorted unique values, with the last value repeated
# (list.append returns None, so append in place instead of assigning its result)
divisions = sorted(df.letter.unique())
divisions.append(divisions[-1])
# repartition
ddf = ddf.repartition(divisions=divisions).persist()
# write out partitions as separate parquet files
for i in range(ddf.npartitions):
    ddf.partitions[i].to_parquet(f"file_{i}.parquet", engine='pyarrow')
Note the double occurrence of the value 'e' in the list of divisions. As per the Dask docs: "Divisions includes the minimum value of every partition’s index and the maximum value of the last partition’s index." This means the last value needs to be included twice since it serves as both the start of and the end of the last partition's index.
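To make the shape of the divisions list concrete, here is a small pandas-only sketch that builds it for the letter column above. It involves no Dask calls; it only illustrates that n unique values give n + 1 division boundaries, i.e. n partitions.

```python
import pandas as pd

letters = pd.Series(
    ["a", "b", "c", "a", "a", "d", "d", "b", "c", "b", "a", "b", "c", "e", "e", "e"]
)

# Sorted unique index values: one partition starts at each of them.
starts = sorted(letters.unique())

# Divisions = start of every partition + end of the last partition,
# so the last value appears twice.
divisions = starts + [starts[-1]]

print(divisions)
print(len(divisions) - 1)  # number of partitions these divisions describe
```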

Data manipulation in DataFrame in Python Pandas?

I have DataFrame like below:
import pandas as pd
rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({"ID" : ["1", "2", "1", "2", "2"],
"category" : ["A", "B", "A", "C", "B"],
"status" : ["active", "finished", "active", "finished", "other"],
"Date": rng})
And I need to create DataFrame and calculate 2 columns:
New1 = category of the last agreement with "active" status
New2 = category of the last agreement with "finished" status
To be more precise, the result DataFrame I expect is given below:
Assuming the dataframe is already sorted by date, we want to keep the last row where "status" == "active" and the last row where "status" == "finished". We also want to keep only the first and second columns, and we rename category to "New1" for the active status and to "New2" for the finished status.
last_active = df[df.status == "active"].iloc[-1, [0, 1]].rename({"category": "New1"})
last_finished = df[df.status == "finished"].iloc[-1, [0, 1]].rename({"category": "New2"})
We now have two pandas Series that we want to concatenate side by side, then transpose to have one entry per row:
pd.concat([last_active, last_finished], axis=1, sort=False).T
You may also want to call reset_index() afterwards, to get a fresh RangeIndex in the resulting DataFrame.
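Put together with the sample data from the question, the steps above can be run as the sketch below. Using reset_index(drop=True) is an assumption about what is wanted here: it discards the old row labels instead of keeping them as a column.

```python
import pandas as pd

rng = pd.date_range("2020-12-01", periods=5, freq="D")
df = pd.DataFrame({
    "ID": ["1", "2", "1", "2", "2"],
    "category": ["A", "B", "A", "C", "B"],
    "status": ["active", "finished", "active", "finished", "other"],
    "Date": rng,
})

# Last "active" and last "finished" rows, keeping only ID and category.
last_active = df[df.status == "active"].iloc[-1, [0, 1]].rename({"category": "New1"})
last_finished = df[df.status == "finished"].iloc[-1, [0, 1]].rename({"category": "New2"})

# Side-by-side concat, transpose to one entry per row, then a fresh index.
result = pd.concat([last_active, last_finished], axis=1, sort=False).T
result = result.reset_index(drop=True)

print(result)
```

Note that each row has a value in only one of New1 and New2; the other is NaN.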

Adding a new row to a pandas data frame when columns have different data type?

I have a 2-column pandas data frame, initialized with df = pd.DataFrame([], columns = ["A", "B"]). Column A needs to be of type float, and column B is of type datetime.datetime. I need to add my first values to it (i.e. new rows), but I can't seem to figure out how to do it. I can't do new_row = [x, y] then append it since x and y are not of the same type. How should I go about adding these rows? Thank you.
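import datetime
import pandas as pd
from numpy.random import rand
Option 1 - make the new row a DataFrame and concatenate it with the existing one (DataFrame.append was removed in pandas 2.0, so pd.concat is used here; pd.datetime is likewise unavailable in recent versions, so the standard datetime module is used):
df = pd.DataFrame([], columns=["A", "B"])
T = datetime.datetime(2000, 1, 1)
df2 = pd.DataFrame(columns=["A", "B"], data=[[rand(), T]])
df = pd.concat([df, df2], ignore_index=True)
Or, Option 2 - create an empty DataFrame with a preallocated index and then assign by position:
df = pd.DataFrame(index=range(5), columns=["A", "B"])
T = datetime.datetime(2000, 1, 1)
df.iloc[0, :] = [rand(), T]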
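A third possibility, not from the original answer, is to assign the row directly with df.loc. A Python list can hold values of different types, so a mixed float/datetime row is not a problem. The explicit casts at the end are a precaution, on the assumption that the columns of the empty frame start out as object dtype.

```python
import datetime
import pandas as pd

df = pd.DataFrame([], columns=["A", "B"])

# Assigning to a label that does not exist yet adds a new row.
df.loc[len(df)] = [0.5, datetime.datetime(2000, 1, 1)]
df.loc[len(df)] = [1.5, datetime.datetime(2000, 1, 2)]

# Cast explicitly so A is float and B is datetime64.
df = df.astype({"A": "float64"})
df["B"] = pd.to_datetime(df["B"])

print(df)
print(df.dtypes)
```

Adding rows one at a time is slow for large amounts of data; collecting rows in a list and building the DataFrame once is usually preferable.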
