Consider the following dataset:
After running the code:
convert_dummy1 = convert_dummy.pivot(index='Product_Code', columns='Month', values='Sales').reset_index()
The data is in the right form, but the columns index is named 'Month', and I cannot seem to remove it at all. I have tried code such as the following, but it does nothing.
del convert_dummy1.index.name
I can save the dataset to a csv, delete the ID column, and then read the csv - but there must be a more efficient way.
Dataset after reset_index():
convert_dummy1
Month  Product_Code    0   1   2  3    4
0           10133.9    0   0   0  0    0
1           10146.9  120  80  60  0  100
convert_dummy1.index = pd.RangeIndex(len(convert_dummy1.index))  # fresh 0..n-1 index
del convert_dummy1.columns.name  # drops the leftover 'Month' columns name
convert_dummy1
   Product_Code    0   1   2  3    4
0       10133.9    0   0   0  0    0
1       10146.9  120  80  60  0  100
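Note that del on the name attribute relies on older pandas behaviour and may raise AttributeError on recent versions; a safer sketch is to assign None instead:
convert_dummy1.columns.name = None  # clears the leftover 'Month' name on current pandas as well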
Since you pivot with columns='Month', each column of the output corresponds to a month. If you reset the index after the pivot, you can check the column names with convert_dummy1.columns.values, which in your case should return:
array(['Product_Code', 1, 2, 3, 4, 5], dtype=object)
while convert_dummy1.columns.names should return:
FrozenList(['Month'])
So to rename 'Month', use the rename_axis function (it returns a new DataFrame, so assign the result back):
convert_dummy1 = convert_dummy1.rename_axis('index', axis=1)
Output:
index  Product_Code    1    2    3    4    5
0             10133  0.0  NaN  NaN  NaN  NaN
1             10234  NaN  NaN  0.0  NaN  NaN
2             10245  NaN  0.0  NaN  NaN  NaN
3             10345  NaN  NaN  NaN  NaN  0.0
4             10987  NaN  NaN  NaN  1.0  NaN
If you wish to reproduce it, this is my code:
df1=pd.DataFrame({'Product_Code':[10133,10245,10234,10987,10345], 'Month': [1,2,3,4,5], 'Sales': [0,0,0,1,0]})
df2=df1.pivot_table(index='Product_Code', columns='Month', values='Sales').reset_index().rename_axis('index',axis=1)
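If you would rather drop the axis name entirely instead of renaming it, rename_axis also accepts None on recent pandas versions; a minimal sketch using the repro frame above:
df2 = df2.rename_axis(None, axis=1)  # removes the columns-level name left by the pivot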
I have a DataFrame with (several) grouping variables and (several) value variables. My goal is to set the last n non-NaN values to NaN. So let's take a simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'value': [1, 2, np.nan, 9, 8]})
df
Out[1]:
id value
0 1 1.0
1 1 2.0
2 1 NaN
3 2 9.0
4 2 8.0
The desired result for n=1 would look like the following:
Out[53]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
Use groupby.cumcount(): number the non-NaN values within each group and keep a value only while its position is below the group size minus N:
N=1
groups = df.loc[df['value'].notna()].groupby('id')
enum = groups.cumcount()
sizes = groups['value'].transform('size')
df['value'] = df['value'].where(enum < sizes - N)
Output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
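If you need this for several value columns, the same logic wraps naturally into a helper; a small sketch (mask_last_n is a hypothetical name):
def mask_last_n(df, group_col, value_col, n):
    # Set the last n non-NaN values of value_col to NaN within each group.
    groups = df.loc[df[value_col].notna()].groupby(group_col)
    enum = groups.cumcount()
    sizes = groups[value_col].transform('size')
    df[value_col] = df[value_col].where(enum < sizes - n)
    return df

mask_last_n(df, 'id', 'value', n=1)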
You can take a reversed cumsum after groupby to count, for each row, how many not-NA values lie between it and the end of its group, then keep a value only while more than N remain:
df['value'].where(df['value'].notna().iloc[::-1].groupby(df['id']).cumsum() > 1, inplace=True)  # > N, with N=1 here
df
Out[86]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
One option: create a reversed cumcount on the non-NA values:
N = 1
m = (df
     .loc[df['value'].notna()]
     .groupby('id')
     .cumcount(ascending=False)
     .lt(N)
)
df.loc[m[m].index, 'value'] = np.nan
Similar approach with boolean masking:
m = df['value'].notna()
df['value'] = df['value'].mask(m[::-1].groupby(df['id']).cumsum().le(N))
Output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
You can do it with MultiIndex.from_product:
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
x.columns = new_cols[::len(extra_level)]  # keep the first element of extra_level for each original column
x = x.reindex(columns=new_cols)
print(x)
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
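A short usage sketch on the result (assuming x from the snippet above); the hierarchical columns are addressed as (level0, level1) tuples:
a_only = x.xs('a', axis=1, level=1)  # recovers the original single-level frame
x[(0, 'b')] = 0.0                    # write into one of the previously-NaN 'b' columns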
Very much like @Ben.T I am using MultiIndex.from_product (note the parentheses so the chain parses as one expression):
x = (x.assign(l='a')
      .set_index('l', append=True)
      .unstack()
      .reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a', 'b']]), axis=1))
Output:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
I have this data frame:
id-input  id-output        Date  Price  Type
       1          3  20/09/2020    100   ABC
       2          1  20/09/2020    200   ABC
       2          1  21/09/2020    300   ABC
       1          3  21/09/2020     50    AD
       1          2  21/09/2020     40    AD
I want to get this output:
id-inp-ABC  id-out-ABC    Date-ABC  Price-ABC  Type-ABC  id-inp-AD  id-out-AD     Date-AD  Price-AD  Type-AD
         1           3  20/09/2020         10       ABC          2          1  20/09/2020        10       AD
        1'           3  20/09/2020         90       ABC        NaN        NaN         NaN       NaN      NaN
         2           1  20/09/2020         40       ABC          1          2  21/09/2020        40       AD
        2'           1  20/09/2020        160       ABC        NaN        NaN         NaN       NaN      NaN
         2           1  21/09/2020        300       ABC        NaN        NaN         NaN       NaN      NaN
My idea is to:
- divide the dataframe into two dataframes by Type,
- iterate through both dataframes and check whether the same id-input == id-output,
- check if the price is equal; if not, split the row and subtract the price,
- rename the columns and merge them.
grp = df.groupby('Type')
transformed_df_list = []
for idx, frame in grp:
    frame.reset_index(drop=True, inplace=True)
    transformed_df_list.append(frame.copy())
ABC = pd.DataFrame(transformed_df_list[0])
AD = pd.DataFrame(transformed_df_list[1])
for i, row in ABC.iterrows():
    for j, row1 in AD.iterrows():
        if row['id-inp'] == row1['id-out']:
            row_df = pd.DataFrame([row1])
            row_df = row_df.rename(columns={'id-inp': 'id-inp-AD', 'id-out': 'id-out-AD', 'Date': 'Date-AD', 'price': 'price-AD'})
            output = pd.merge(ABC.set_index('id-inp', drop=False), row_df.set_index('id-out-AD', drop=False), how='left', left_on=['id-inp'], right_on=['id-inp-AD'])
but the result is NaN in the id-inp-AD, id-out-AD, Date-AD, Price-AD, Type-AD part,
and row_df contains just the last row:
1 2 21/09/2020 40 AD
I also want the iteration to respect the order, and each insert into the output dataframe to be sorted by date.
The most elegant way to solve your problem is to use pandas.DataFrame.pivot. You end up with multilevel column names instead of a single level; if you need to get back to single-level column names, check the second answer here, or see the flattening sketch after the output below.
import pandas as pd
input = [
[1, 3, '20/09/2020', 100, 'ABC'],
[2, 1, '20/09/2020', 200, 'ABC'],
[2, 1, '21/09/2020', 300, 'ABC'],
[1, 3, '21/09/2020', 50, 'AD'],
[1, 2, '21/09/2020', 40, 'AD']
]
df = pd.DataFrame(data=input, columns=["id-input", "id-output", "Date", "Price", "Type"])
df_pivot = df.pivot(columns=["Type"])
print(df_pivot)
Output
id-input id-output Date Price
Type ABC AD ABC AD ABC AD ABC AD
0 1.0 NaN 3.0 NaN 20/09/2020 NaN 100.0 NaN
1 2.0 NaN 1.0 NaN 20/09/2020 NaN 200.0 NaN
2 2.0 NaN 1.0 NaN 21/09/2020 NaN 300.0 NaN
3 NaN 1.0 NaN 3.0 NaN 21/09/2020 NaN 50.0
4 NaN 1.0 NaN 2.0 NaN 21/09/2020 NaN 40.0
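A minimal sketch of the flattening mentioned above, assuming df_pivot from the snippet (the 'name-Type' labels are just one possible convention):
df_pivot.columns = ['-'.join(map(str, col)) for col in df_pivot.columns]
print(df_pivot.columns.tolist())
# ['id-input-ABC', 'id-input-AD', 'id-output-ABC', 'id-output-AD', ...]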
I have a Dataframe of the form
date_time uids
2018-10-16 23:00:00 1000,1321,7654,1321
2018-10-16 23:10:00 7654
2018-10-16 23:20:00 NaN
2018-10-16 23:30:00 7654,1000,7654,1321,1000
2018-10-16 23:40:00 691,3974,3974,323
2018-10-16 23:50:00 NaN
2018-10-17 00:00:00 NaN
2018-10-17 00:10:00 NaN
2018-10-17 00:20:00 27,33,3974,3974,7665,27
This is a very big data frame containing 10-minute time intervals and the ids that appeared during each interval.
I want to iterate over this DataFrame 6 rows at a time (corresponding to 1 hour) and create a DataFrame containing each ID and the number of times it appears in each 10-minute slot of that hour.
The expected output is one dataframe per hour. For example, in the above case the dataframe for the hour 23:00-00:00 would have this form:
uid 1 2 3 4 5 6
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
and so on
How can I do this efficiently?
I don't have an exact solution but you could create a pivot table: ids on the index and datetimes on the columns. Then you just have to select the columns you want.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"date_time": [
"2018-10-16 23:00:00",
"2018-10-16 23:10:00",
"2018-10-16 23:20:00",
"2018-10-16 23:30:00",
"2018-10-16 23:40:00",
"2018-10-16 23:50:00",
"2018-10-17 00:00:00",
"2018-10-17 00:10:00",
"2018-10-17 00:20:00",
],
"uids": [
"1000,1321,7654,1321",
"7654",
np.nan,
"7654,1000,7654,1321,1000",
"691,3974,3974,323",
np.nan,
np.nan,
np.nan,
"27,33,3974,3974,7665,27",
],
}
)
df["date_time"] = pd.to_datetime(df["date_time"])
df = (
df.set_index("date_time") #do not use set_index if date_time is current index
.loc[:, "uids"]
.str.extractall(r"(?P<uids>\d+)")
.droplevel(level=1)
) # separate all the ids
df["number"] = df.index.minute.astype(float) / 10 + 1 # get the number 1 to 6 depending on the minutes
df_pivot = df.pivot_table(
values="number",
index="uids",
columns=["date_time"],
) #dataframe with all the uids on the index and all the datetimes in columns.
You can apply this to the whole dataframe or just a subset containing 6 rows. Then you rename your columns.
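For instance, a sketch of that last step, selecting the columns that fall in hour 23 and relabelling them (assuming df_pivot from above; note that intervals with no uids do not appear as columns):
hour_cols = [c for c in df_pivot.columns if c.hour == 23]  # the 23:00-23:50 slots present
hourly = df_pivot[hour_cols]
hourly.columns = range(1, len(hour_cols) + 1)              # relabel consecutively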
You can use the function crosstab:
df['date_time'] = pd.to_datetime(df['date_time'])  # ensure datetime dtype
df['uids'] = df['uids'].str.split(',')
df = df.explode('uids')
df['date_time'] = df['date_time'].dt.minute.floordiv(10).add(1)  # 10-minute slot number 1..6
pd.crosstab(df['uids'], df['date_time'], dropna=False)
Output:
date_time 1 2 3 4 5 6
uids
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
27 0 0 2 0 0 0
323 0 0 0 0 1 0
33 0 0 1 0 0 0
3974 0 0 2 0 2 0
691 0 0 0 0 1 0
7654 1 1 0 2 0 0
7665 0 0 1 0 0 0
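To get one table per hour, as the question asks, the same idea can be grouped by the hour; a sketch assuming the original df (datetime strings and comma-separated uids):
df['date_time'] = pd.to_datetime(df['date_time'])
tmp = df.assign(uids=df['uids'].str.split(',')).explode('uids').dropna(subset=['uids'])
tmp['slot'] = tmp['date_time'].dt.minute.floordiv(10).add(1)
for hour, grp in tmp.groupby(tmp['date_time'].dt.floor('h')):
    print(hour)
    print(pd.crosstab(grp['uids'], grp['slot']))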
We can achieve this by extracting the minutes from your datetime column, then using pivot_table to get the wide format:
df['date_time'] = pd.to_datetime(df['date_time'])
df['minute'] = df['date_time'].dt.minute // 10
piv = (df.assign(uids=df['uids'].str.split(','))
.explode('uids')
.pivot_table(index='uids', columns='minute', values='minute', aggfunc='size')
)
minute 0 1 2 3 4
uids
1000 1.0 NaN NaN 2.0 NaN
1321 2.0 NaN NaN 1.0 NaN
27 NaN NaN 2.0 NaN NaN
323 NaN NaN NaN NaN 1.0
33 NaN NaN 1.0 NaN NaN
3974 NaN NaN 2.0 NaN 2.0
691 NaN NaN NaN NaN 1.0
7654 1.0 1.0 NaN 2.0 NaN
7665 NaN NaN 1.0 NaN NaN
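To match the question's 1-6 slot labels and integer counts, one extra step on piv (a sketch):
piv = piv.rename(columns=lambda m: m + 1).fillna(0).astype(int)  # slots 1..6, zero-filled counts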
I have a problem with mapping values from another dataframe.
These are samples of two dataframes:
df1
product class_1 class_2 class_3
141A 11 13 5
53F4 12 11 18
GS24 14 12 10
df2
id  product_type_0  product_type_1  product_type_2  product_type_3  measure_0  measure_1  measure_2  measure_3
 1            141A            GS24             NaN             NaN          1          3        NaN        NaN
 2            53F4             NaN             NaN             NaN          1        NaN        NaN        NaN
 3            53F4            141A            141A             NaN          2          2          1        NaN
 4            141A            GS24             NaN             NaN          3          2        NaN        NaN
What I'm trying to get is the following:
I need to add new columns called "max_class_0" through "max_class_3", whose values are taken from df1.
For each order number (_0, _1, _2, _3), look at the corresponding product_type column (for example product_type_1) and take the row of df1 where product has the same value. Then look at the matching measure column (for example measure_1): if its value is 1 (at most four different values are possible in the original data), the new column max_class_1 gets the class_1 value of that product, in this case 11.
I think it's a little simpler than I explained it.
Desired output
id  product_type_0  product_type_1  product_type_2  product_type_3  measure_0  measure_1  measure_2  measure_3  max_class_0  max_class_1  max_class_2  max_class_3
 1            141A            GS24             NaN             NaN          1          3        NaN        NaN            1           10          NaN          NaN
 2            53F4             NaN             NaN             NaN          1        NaN        NaN        NaN           12          NaN          NaN          NaN
 3            53F4            141A            141A             NaN          2          2          1        NaN           11           13           11          NaN
 4            141A            GS24             NaN             NaN          3          2        NaN        NaN            5           12          NaN          NaN
The code I have tried:
df2['max_class_1'] = None
df2['max_class_2'] = None
df2['max_class_3'] = None
def get_max_class(product_df, measure_df, product_type_column, measure_column, max_class_columns):
    for index, row in measure_df.iterrows():
        product_df_new = product_df[product_df['product'] == row[product_type_column]]
        for ind, r in product_df_new.iterrows():
            if row[measure_column] == 1:
                row[max_class_columns] = r['class_1']
            elif row[measure_column] == 2:
                row[max_class_columns] = r['class_2']
            elif row[measure_column] == 3:
                row[max_class_columns] = r['class_3']
            else:
                row[max_class_columns] = "There is no measure or type"
    return measure_df

# And the function calls
first_class = get_max_class(product_df=df1, measure_df=df2, product_type_column='product_type_1', measure_column='measure_1', max_class_columns='max_class_1')
second_class = get_max_class(product_df=df1, measure_df=first_class, product_type_column='product_type_2', measure_column='measure_2', max_class_columns='max_class_2')
third_class = get_max_class(product_df=df1, measure_df=second_class, product_type_column='product_type_3', measure_column='measure_3', max_class_columns='max_class_3')
I'm pretty sure there is a simpler solution, but I don't know why this is not working: I'm getting all None values, nothing changes.
pd.DataFrame.lookup is the standard method for lookups by row and column labels. (Incidentally, your loop changes nothing because iterrows yields copies: assigning to row never writes back into measure_df, which is why you keep seeing None.)
Your problem is complicated by the existence of null values. But this can be accommodated by modifying your input mapping dataframe.
Step 1
Rename columns in df1 to integers and add an extra row / column. We will use the added data later to deal with null values.
def rename_cols(x):
    return x if not x.startswith('class') else int(x.split('_')[-1])
df1 = df1.rename(columns=rename_cols)
df1 = df1.set_index('product')
df1.loc['X'] = 0
df1[0] = 0
Your mapping dataframe now looks like:
print(df1)
1 2 3 0
product
141A 11 13 5 0
53F4 12 11 18 0
GS24 14 12 10 0
X 0 0 0 0
Step 2
Iterate the number of categories and use pd.DataFrame.lookup. Notice how we fillna with X and 0, exactly what we used for additional mapping data in Step 1.
n = df2.columns.str.startswith('measure').sum()
for i in range(n):
    rows = df2['product_type_{}'.format(i)].fillna('X')
    cols = df2['measure_{}'.format(i)].fillna(0).astype(int)
    df2['max_{}'.format(i)] = df1.lookup(rows, cols)
Result
print(df2)
id product_type_0 product_type_1 product_type_2 product_type_3 measure_0 \
0 1 141A GS24 NaN NaN 1
1 2 53F4 NaN NaN NaN 1
2 3 53F4 141A 141A NaN 2
3 4 141A GS24 NaN NaN 3
measure_1 measure_2 measure_3 max_0 max_1 max_2 max_3
0 3.0 NaN NaN 11 10 0 0
1 NaN NaN NaN 12 0 0 0
2 2.0 1.0 NaN 11 13 11 0
3 2.0 NaN NaN 5 12 0 0
You can convert the 0 to np.nan if required. This will be at the expense of converting your series from int to float, since NaN is considered float.
Of course, if X and 0 are valid values, you can use alternative filler values from the start.
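One caveat: pd.DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of an equivalent Step 2 for recent versions, indexing positionally into the underlying array (df1 and df2 prepared as above):
n = df2.columns.str.startswith('measure').sum()
for i in range(n):
    rows = df2['product_type_{}'.format(i)].fillna('X')
    cols = df2['measure_{}'.format(i)].fillna(0).astype(int)
    row_idx = df1.index.get_indexer(rows)    # positions of each product in df1's index
    col_idx = df1.columns.get_indexer(cols)  # positions of each measure in df1's columns
    df2['max_{}'.format(i)] = df1.to_numpy()[row_idx, col_idx]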