Data manipulation in DataFrame in Python Pandas? - python

I have a DataFrame like the one below:
import pandas as pd
rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({"ID": ["1", "2", "1", "2", "2"],
                   "category": ["A", "B", "A", "C", "B"],
                   "status": ["active", "finished", "active", "finished", "other"],
                   "Date": rng})
I need to create a new DataFrame with two calculated columns:
New1 = category of the last agreement with "active" status
New2 = category of the last agreement with "finished" status
To be more precise, I give the result DataFrame below:

Assuming the dataframe is already sorted by date, we want to keep the last row where "status" == "active" and the last row where "status" == "finished". We also keep only the first and second columns (ID and category), renaming category to "New1" for the active status and to "New2" for the finished status.
last_active = df[df.status == "active"].iloc[-1, [0, 1]].rename({"category": "New1"})
last_finished = df[df.status == "finished"].iloc[-1, [0, 1]].rename({"category": "New2"})
This gives two pandas Series, which we concatenate side by side and then transpose so that each Series becomes one row:
pd.concat([last_active, last_finished], axis=1, sort=False).T
You may also want to call reset_index(drop=True) afterwards to get a fresh RangeIndex in the resulting DataFrame.
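The expected output is not shown in the question, so the following is only a sketch under one assumption: that the last category per status is wanted separately for each ID. It sorts by date, keeps the last row per (ID, status) pair, and pivots the statuses into columns. The names last, wide, and result are illustrative.

```python
import pandas as pd

rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({"ID": ["1", "2", "1", "2", "2"],
                   "category": ["A", "B", "A", "C", "B"],
                   "status": ["active", "finished", "active", "finished", "other"],
                   "Date": rng})

# Keep only the most recent row for every (ID, status) combination.
last = (df.sort_values("Date")
          .drop_duplicates(subset=["ID", "status"], keep="last"))

# One row per ID, one column per status, holding the category.
wide = last.pivot(index="ID", columns="status", values="category")

# Keep the two statuses of interest and rename them (assumed column names).
result = (wide.reindex(columns=["active", "finished"])
              .rename(columns={"active": "New1", "finished": "New2"})
              .reset_index())
result.columns.name = None

print(result)
```

IDs that never had a given status get NaN in the corresponding column.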

Related

Merging Lists of Different Format into Pandas DataFrame

I have two different lists and an id:
id = 1
timestamps = [1,2,3,4]
values = ['A','B','C','D']
I want to combine them into a pandas DataFrame that looks like this:
id | timestamp | value
1 | 1 | A
1 | 2 | B
1 | 3 | C
1 | 4 | D
Each iteration of a for loop produces a new pair of lists and a new ID, which should then be appended to the existing data frame. The pseudocode would look like this:
# for each sample in group:
# do some calculation to create the two lists
# merge the lists into the data frame, using the ID as index
So far I have tried pd.concat like this:
pd.concat([
    existing_dataframe,
    pd.DataFrame(
        {
            "id": id,
            "timestamp": timestamps,
            "value": values,
        }
    ),
])
But there seems to be a problem that the ID field and the other lists are of different lengths. Thanks for your help!
Use:
pd.DataFrame(
    {
        "timestamp": timestamps,
        "value": values,
    }
).assign(id=id).reindex(columns=["id", "timestamp", "value"])
Or:
df = pd.DataFrame(
    {
        "timestamp": timestamps,
        "value": values,
    }
)
df.insert(loc=0, column='id', value=id)
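For the loop described in the question, one common pattern is to collect the per-sample frames in a list and call pd.concat once at the end, rather than growing a DataFrame inside the loop. The sketch below assumes a hypothetical samples dict standing in for "do some calculation to create the two lists"; the names samples and frames are not from the original post.

```python
import pandas as pd

# Hypothetical stand-in for the per-sample calculation: id -> (timestamps, values)
samples = {
    1: ([1, 2, 3, 4], ['A', 'B', 'C', 'D']),
    2: ([5, 6], ['E', 'F']),
}

frames = []
for sample_id, (timestamps, values) in samples.items():
    # The scalar id is broadcast to every row of this sample's frame.
    frame = pd.DataFrame({"timestamp": timestamps, "value": values}).assign(id=sample_id)
    frames.append(frame)

# Concatenate once, then put the columns in the desired order.
result = pd.concat(frames, ignore_index=True)[["id", "timestamp", "value"]]
print(result)
```

If the ID should be the index, as the pseudocode suggests, result.set_index("id") can be applied afterwards.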

Grouping a Pandas df by two columns and aggregate over lvl 1 in group

I'm new to programming overall, and I'm struggling with some pandas df aggregation.
I'm trying to group a df by two columns, "A" and "B", and then have the resulting series show the frequency of B over all the data, not only within each group.
This is what I have tried:
group = df.groupby(['A', 'B']).size()  # this shows only the within-group frequency of B
Let's say A is a transaction ID and B is a product. I want to know how many times each product appears across all transactions, while keeping this grouping structure and keeping the result as a grouped series rather than converting it back to a df.
Thank you
You can use pd.pivot_table to build the summary:
# Import packages
import pandas as pd
# Initialize a sample dataframe
df = pd.DataFrame({
    "Transacion_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Product": ["milk", "milk", "milk", "milk", "milk",
                "bread", "bread", "bread", "bread"],
    "Region": ["Eastern", "Eastern", "Eastern", "Eastern", "Eastern",
               "Western", "Western", "Western", "Western"]
})
# Display the dataframe (display() is available in Jupyter/IPython; use print() elsewhere)
display(df)
# Use the pd.pivot_table function to create the summary
table = pd.pivot_table(
    df,
    values='Transacion_ID',
    index=['Product'],
    aggfunc='count')
# Finally show the results
display(table)
You can also simply use the groupby function followed by the agg function as follows:
# Groupby and aggregate
table = df.groupby(['Product']).agg({
    'Transacion_ID': 'count'
})
# Finally show the results
display(table)
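The question also asks to keep the (A, B) grouped Series while showing the overall frequency of B. One possible sketch is to count B over the whole frame and map those counts onto the B level of the grouped index. The sample data and the names overall and result below are illustrative, not from the original post.

```python
import pandas as pd

# Illustrative data: A is a transaction id, B is a product.
df = pd.DataFrame({
    "A": [1, 1, 2, 3],
    "B": ["milk", "bread", "milk", "milk"],
})

# Grouped Series indexed by (A, B), as in the question.
group = df.groupby(["A", "B"]).size()

# Frequency of each product over all rows.
overall = df["B"].value_counts()

# Look up the overall count for the B level of every (A, B) pair.
b_level = group.index.get_level_values("B")
result = pd.Series(b_level.map(overall).to_numpy(), index=group.index, name="B_total")
print(result)
```

The result keeps the same MultiIndex as group, so the two Series can be compared or combined row by row.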

Rows of one column shifting under other column in Pandas after str.split?

I'm cleaning data (from a CSV file) in pandas, and one of the columns (pic of its first 4 rows) has hundreds of values in each row, separated by commas.
I use str.split(',', expand=True) and am able to spread the values across several columns. However, rows of one column end up shifted under another column (as can be seen in pic 2).
Is there any method to get the values under their respective columns?
Note: Each row is associated with unique ID.
I've been stuck on this problem for quite some time and couldn't resolve it. Any help would be highly appreciated!
Edit 1: TL;DR
-- Input -- First 2 rows of a column for an example--
{"crap1": 12, "NAME": "John", "AGE": "30","SEX": "M", "crap2": 34, ....... "ID": 01}
{"crap1": 56, "NAME": "Anna", "AGE": "25","SEX": "F", "crap2": 78, ....... "ID": 02}
-- Desired Output -- Derive 4 columns from 1, based on values in each row
NAME | AGE | SEX | ID
John | 30 | M | 01
Anna | 25 | F | 02
You can try expanding the column with multiple entries into a separate dataframe and then joining them back into the original dataframe.
df2 = df.col1.str.split(',',expand=True)
Along the way, you can also give the new columns names and drop the original column that you expanded.
df2.columns = ['col2_%d'%idx for idx,__ in enumerate(df2.columns)]
df = df.drop(columns=['col1'])
df = pd.concat([df,df2],axis=1)
Since your example was an image, I couldn't test it out on that specific case. Here's a small working example to illustrate the idea :D
import pandas as pd

def get_example_data():
    df = pd.DataFrame(
        {
            'col1': ['abc', 'def', 'ghi,jkl', 'abc,def', 'def'],
            'col2': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ']
        }
    )
    return df

def clean_dataframe(df):
    # expand the column into a separate dataframe
    df2 = df.col1.str.split(',', expand=True)
    print(df2)
    # in case you would like to retain the original column name: col1 --> col1_0, col1_1
    df2.columns = ['col1_%d' % idx for idx, __ in enumerate(df2.columns)]
    print(df2)
    # drop original column
    df = df.drop(columns=['col1'])
    # concat expanded column
    df = pd.concat([df, df2], axis=1)
    print(df)
    return df

if __name__ == '__main__':
    df = get_example_data()
    print(df)
    df = clean_dataframe(df)
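The rows in "Edit 1" look like key/value records, so splitting on commas places values by position rather than by key, which is one plausible cause of the shifting. If each cell is a valid JSON string, parsing it keeps values under their own keys. This is a sketch under that assumption: the column name raw is hypothetical, the elided fields are left out, and the IDs are quoted here because a bare 01 is not valid JSON.

```python
import json
import pandas as pd

# Hypothetical input: one column holding a JSON string per row.
df = pd.DataFrame({
    "raw": [
        '{"crap1": 12, "NAME": "John", "AGE": "30", "SEX": "M", "crap2": 34, "ID": "01"}',
        '{"crap1": 56, "NAME": "Anna", "AGE": "25", "SEX": "F", "crap2": 78, "ID": "02"}',
    ]
})

# Parse every cell into a dict, then build columns from the dict keys.
parsed = df["raw"].apply(json.loads)
wanted = ["NAME", "AGE", "SEX", "ID"]
result = pd.DataFrame(parsed.tolist(), index=df.index)[wanted]
print(result)
```

Because columns are matched by key, a row with extra or missing fields no longer pushes later values into the wrong column; missing keys simply become NaN.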

Pythonic way of replace values in one column from a two column table

I have a df with the origin and destination between two points. I want to convert the strings to a numerical index, and I need a representation that lets me convert back for model interpretation.
df1 = pd.DataFrame({"Origin": ["London", "Liverpool", "Paris", "..."], "Destination": ["Liverpool", "Paris", "Liverpool", "..."]})
I separately created a new index on the sorted values.
df2 = pd.DataFrame({"Location": ["Liverpool", "London", "Paris", "..."], "Idx": ["1", "2", "3", "..."]})
What I want to get is this:
df3 = pd.DataFrame({"Origin": ["2", "1", "3", "..."], "Destination": ["1", "3", "1", "..."]})
I am sure there is a simpler way of doing this, but the only two methods I can think of are: (1) left-join df2 onto df1 twice, matching Origin to Location and then Destination to Location, and remove the extraneous columns; or (2) loop over every item in df1 and df2 and replace matching values. I've done the looped version and it works, but it's not very fast, which is to be expected.
I am sure there must be an easier way to replace these values but I am drawing a complete blank.
You can use .map():
mapping = dict(zip(df2.Location, df2.Idx))
df1.Origin = df1.Origin.map(mapping)
df1.Destination = df1.Destination.map(mapping)
print(df1)
Prints:
  Origin Destination
0      2           1
1      1           3
2      3           1
3    ...         ...
Or "bulk" .replace():
df1 = df1.replace(mapping)
print(df1)
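The question also mentions converting back for model interpretation. A minimal sketch, assuming the index values in df2 are unique, is to invert the same dictionary and map again. The names inverse, encoded, and decoded are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Origin": ["London", "Liverpool", "Paris"],
                    "Destination": ["Liverpool", "Paris", "Liverpool"]})
df2 = pd.DataFrame({"Location": ["Liverpool", "London", "Paris"],
                    "Idx": ["1", "2", "3"]})

# Forward mapping (location -> index) and its inverse (index -> location).
mapping = dict(zip(df2["Location"], df2["Idx"]))
inverse = {idx: loc for loc, idx in mapping.items()}

# Encode both columns, then decode them again to check the round trip.
encoded = df1.apply(lambda col: col.map(mapping))
decoded = encoded.apply(lambda col: col.map(inverse))

print(encoded)
print(decoded)
```

Unlike .replace(), .map() turns any value that is missing from the dictionary into NaN, which makes unmapped locations easy to spot.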

Subset pandas dataframe

I have a pandas dataframe which has following columns: cust_email, transaction_id, transaction_timestamp
I want to subset the pandas dataframe and include only those email IDs that have exactly one transaction (i.e., only one transaction_id and transaction_timestamp per cust_email).
You can use drop_duplicates and set parameter keep to False. If you want to drop duplicates by a specific column you can use the subset parameter:
df.drop_duplicates(subset="cust_email", keep=False)
For example
import pandas as pd
data = pd.DataFrame()
data["col1"] = ["a", "a", "b", "c", "c", "d", "e"]
data["col2"] = [1,2,3,4,5,6,7]
print(data)
print()
print(data.drop_duplicates(subset="col1", keep=False))
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
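drop_duplicates(keep=False) counts rows, so it treats a customer as having several transactions whenever the email appears on more than one row. If the same transaction can appear on multiple rows, a possible alternative is to count distinct transaction_id values per email. The sample data below is made up for illustration.

```python
import pandas as pd

# Made-up sample using the column names from the question.
df = pd.DataFrame({
    "cust_email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
    "transaction_id": [1, 2, 3, 4],
    "transaction_timestamp": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
    ),
})

# Number of distinct transactions per email, aligned to the original rows.
n_tx = df.groupby("cust_email")["transaction_id"].transform("nunique")

# Keep only rows whose email has exactly one distinct transaction.
single = df[n_tx == 1]
print(single)
```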
