groupby and stack column into a single column - python

I have a dataframe as follows:
df = pd.DataFrame({'Product' : ['A'],
'Size' : ["['XL','L','S','M']"],
'Color' : ["['Blue','Red','Green']"]})
print(df)
Product Size Color
0 A ['XL','L','S','M'] ['Blue','Red','Green']
I need to transform the frame for an ingestion system which only accepts the following format:
target_df = pd.DataFrame({'Description' : ['Product','Color','Color','Color','Size','Size','Size','Size'],
'Agg' : ['A','Blue','Green','Red','XL','L','S','M']})
Description Agg
0 Product A
1 Color Blue
2 Color Green
3 Color Red
4 Size XL
5 Size L
6 Size S
7 Size M
I've tried every combination of explode, groupby and even iterrows, but I can't get the rows to line up. I have thousands of products. With a few groupby and explode calls I can stack the columns, but then I get duplicate product names, which I need to avoid. The order is important too.

Try:
df['Size']=df['Size'].map(eval)
df['Color']=df['Color'].map(eval)
df=df.stack().explode()
Outputs:
0 Product A
Size XL
Size L
Size S
Size M
Color Blue
Color Red
Color Green
dtype: object
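Since eval executes whatever the strings contain, a variant using ast.literal_eval may be safer for ingested data. The sketch below assumes every list column holds a string that is a valid Python list literal; the names list_cols, parsed and out are only illustrative. It also reshapes the stacked result into the two-column Description/Agg layout from the question.

```python
from ast import literal_eval

import pandas as pd

df = pd.DataFrame({'Product': ['A'],
                   'Size': ["['XL','L','S','M']"],
                   'Color': ["['Blue','Red','Green']"]})

# Columns assumed to contain stringified Python lists
list_cols = ['Size', 'Color']

parsed = df.copy()
for col in list_cols:
    # literal_eval only parses literals; it does not run arbitrary code
    parsed[col] = parsed[col].map(literal_eval)

out = (parsed.stack()                             # one entry per (row, column)
             .explode()                           # one entry per list element
             .reset_index(level=0, drop=True)     # drop the original row number
             .rename_axis('Description')
             .reset_index(name='Agg'))
print(out)
```

Because stack works row by row, each product's values should stay together, in the original column order, when the frame holds many products.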

Here's a solution without eval:
(df.T[0].str.strip('[]')
.str.split(',', expand=True)
.stack().str.strip("''")
.reset_index(level=1, drop=True)
.rename_axis(index='Description')
.reset_index(name='Agg')
)
Output:
Description Agg
0 Product A
1 Size XL
2 Size L
3 Size S
4 Size M
5 Color Blue
6 Color Red
7 Color Green

Although both answers are already sufficient, I thought this one was nice to work out. Here's a method using explode and melt:
from ast import literal_eval
# loop needed, because somehow apply(literal_eval) wasn't working
for col in df[['Size', 'Color']]:
    df[col] = df[col].apply(literal_eval)
dfn = df.explode('Size').reset_index(drop=True)
dfn['Color'] = df['Color'].explode().reset_index(drop=True).reindex(dfn.index)
dfn = dfn.melt(var_name='Description', value_name='Agg').ffill().drop_duplicates().reset_index(drop=True)
Description Agg
0 Product A
1 Size XL
2 Size L
3 Size S
4 Size M
5 Color Blue
6 Color Red
7 Color Green

Related

Plotting multiple groups from a dataframe with datashader as lines

I am trying to make plots with datashader. The data is a time series of points in polar coordinates. I managed to transform them to Cartesian coordinates (to get equally spaced pixels), and I can plot them with datashader.
Where I am stuck: if I plot them with line() instead of points(), the whole dataframe is connected as a single line. I would like to plot the data group by group (the groups are the names in list_of_names) onto the canvas as separate lines.
Data can be found here
I get this kind of image with datashader
This is a zoomed-in view of the plot generated with points() instead of line(). The goal is to produce the same plot, but with connected lines instead of points.
import datashader as ds, pandas as pd, colorcet
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('file.csv')
print(df)
starlink_name = df.loc[:,'Name']
starlink_alt = df.loc[:,'starlink_alt']
starlink_az = df.loc[:,'starlink_az']
name = starlink_name.values
alt = starlink_alt.values
az = starlink_az.values
print(name)
print(df['Name'].nunique())
df['Date'] = pd.to_datetime(df['Date'])
for name, df_name in df.groupby('Name'):
    print(name)
df_grouped = df.groupby('Name')
list_of_names = list(df_grouped.groups)
print(len(list_of_names))
#########################################################################################
#i want this kind of plot with connected lines with datashader
#########################################################################################
fig = plt.figure()
ax = fig.add_axes([0.1,0.1,0.8,0.8], polar=True)
# ax.invert_yaxis()
ax.set_theta_zero_location('N')
ax.set_rlim(90, 60, 1)
# Note: you must set the end of arange to be slightly larger than 90 or it won't include 90
ax.set_yticks(np.arange(0, 91, 15))
ax.set_rlim(bottom=90, top=0)
for name in list_of_names:
    df2 = df_grouped.get_group(name)
    ax.plot(np.deg2rad(df2['starlink_az']), df2['starlink_alt'], linestyle='solid', marker='.',linewidth=0.5, markersize=0.1)
plt.show()
print(df)
#########################################################################################
#transformation to Cartesian coordinates
#########################################################################################
df['starlink_alt'] = 90 - df['starlink_alt']
df['x'] = df.apply(lambda row: np.deg2rad(row.starlink_alt) * np.cos(np.deg2rad(row.starlink_az)), axis=1)
df['y'] = df.apply(lambda row: -1 * np.deg2rad(row.starlink_alt) * np.sin(np.deg2rad(row.starlink_az)), axis=1)
#########################################################################################
# this is what i want but as lines group per group
#########################################################################################
cvs = ds.Canvas(plot_width=2000, plot_height=2000)
agg = cvs.points(df, 'y', 'x')
img = ds.tf.shade(agg, cmap=colorcet.fire, how='eq_hist')
#########################################################################################
#here i am stuck
#########################################################################################
for name in list_of_names:
    df2 = df_grouped.get_group(name)
    cvs = ds.Canvas(plot_width=2000, plot_height=2000)
    agg = cvs.line(df2, 'y', 'x')
    img = ds.tf.shade(agg, cmap=colorcet.fire, how='eq_hist')
#plt.imshow(img)
plt.show()
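As a side note on the transformation step in the code above, the row-wise apply calls can probably be replaced by vectorized NumPy operations, which tends to matter for large time series. This is a minimal sketch with made-up sample values; the names r and theta are illustrative, and unlike the original code it leaves starlink_alt unmodified.

```python
import numpy as np
import pandas as pd

# Made-up sample values standing in for the columns read from file.csv
df = pd.DataFrame({'starlink_alt': [90.0, 60.0, 30.0, 0.0],
                   'starlink_az': [0.0, 90.0, 180.0, 270.0]})

# Angular distance from the zenith, in radians, used as the radius
r = np.deg2rad(90 - df['starlink_alt'])
# Azimuth in radians
theta = np.deg2rad(df['starlink_az'])

# Same formulas as the row-wise lambdas, applied to whole columns at once
df['x'] = r * np.cos(theta)
df['y'] = -r * np.sin(theta)
print(df)
```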
To do this, you have a couple of options. One is to insert NaN rows as breakpoints into your dataframe when using cvs.line. You need datashader to "pick up the pen", as it were, by inserting a row of NaNs after each group. It's not the slickest, but it's the currently recommended solution.
Really simple, hacky example:
In [17]: df = pd.DataFrame({
...: 'name': list('AABBCCDD'),
...: 'x': np.arange(8),
...: 'y': np.arange(10, 18),
...: })
In [18]: df
Out[18]:
name x y
0 A 0 10
1 A 1 11
2 B 2 12
3 B 3 13
4 C 4 14
5 C 5 15
6 D 6 16
7 D 7 17
This block groups on the 'name' column, then reindexes each group to be one element longer than the original data:
In [20]: res = df.set_index('name').groupby('name').apply(
...: lambda x: x.reset_index(drop=True).reindex(np.arange(len(x) + 1))
...: )
In [21]: res
Out[21]:
x y
name
A 0 0.0 10.0
1 1.0 11.0
2 NaN NaN
B 0 2.0 12.0
1 3.0 13.0
2 NaN NaN
C 0 4.0 14.0
1 5.0 15.0
2 NaN NaN
D 0 6.0 16.0
1 7.0 17.0
2 NaN NaN
You can plug this reindexed dataframe into datashader to have multiple disconnected lines in the result.
This is a still-open issue on the datashader repo, including additional examples and boilerplate code: https://github.com/holoviz/datashader/issues/257
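The same NaN-break idea can also be written without groupby.apply, which may be easier to carry across pandas versions. The helper name add_nan_breaks and its parameters are hypothetical, and the sketch only builds the frame; it does not call datashader itself.

```python
import numpy as np
import pandas as pd


def add_nan_breaks(frame, group_col, value_cols):
    """Return value_cols with one all-NaN row appended after each group."""
    pieces = [
        # Reindexing one position past the end creates the NaN separator row
        group[value_cols].reset_index(drop=True).reindex(range(len(group) + 1))
        for _, group in frame.groupby(group_col, sort=False)
    ]
    return pd.concat(pieces, ignore_index=True)


df = pd.DataFrame({'name': list('AABBCCDD'),
                   'x': np.arange(8),
                   'y': np.arange(10, 18)})

broken = add_nan_breaks(df, 'name', ['x', 'y'])
print(broken)
```

Under these assumptions, broken could then be passed to cvs.line(broken, 'x', 'y') in place of the original frame.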
Other options include restructuring your data to accommodate one of cvs.line's other formats. From the Canvas.line docstring:
def line(self, source, x=None, y=None, agg=None, axis=0, geometry=None,
         antialias=False):
    Parameters
    ----------
    source : pandas.DataFrame, dask.DataFrame, or xarray.DataArray/Dataset
        The input datasource.
    x, y : str or number or list or tuple or np.ndarray
        Specification of the x and y coordinates of each vertex
        * str or number: Column labels in source
        * list or tuple: List or tuple of column labels in source
        * np.ndarray: When axis=1, a literal array of the
          coordinates to be used for every row
    agg : Reduction, optional
        Reduction to compute. Default is ``any()``.
    axis : 0 or 1, default 0
        Axis in source to draw lines along
        * 0: Draw lines using data from the specified columns across
          all rows in source
        * 1: Draw one line per row in source using data from the
          specified columns
There are a number of additional examples in the cvs.line docstring. You can pass arrays as the x, y arguments giving multiple columns to use in forming lines when axis=1, or you can pass a dataframe with ragged array values.
See this pull request adding the line options (h/t to #James-a-bednar in the comments) for a discussion of their use.

Plot made of array from a pandas dataset in Python

The problem is: I have a SQLAlchemy database called NumFav with arrays of favourite numbers of some people, which uses such a structure:
id name numbers
0 Vladislav [2, 3, 5]
1 Michael [4, 6, 7, 9]
numbers is postgresql.ARRAY(Integer)
I want to make a plot with the id of each person on X and their numbers as dots on Y, in order to show which numbers have been chosen.
I extract data using
df = pd.read_sql(Session.query(NumFav).statement, engine)
How can I create a plot with such data?
You can explode the number lists into "long form":
df = df.explode('numbers')
df['color'] = df.id.map({0: 'red', 1: 'blue'})
# id name numbers color
# 0 Vladislav 2 red
# 0 Vladislav 3 red
# 0 Vladislav 5 red
# 1 Michael 4 blue
# 1 Michael 6 blue
# 1 Michael 7 blue
# 1 Michael 9 blue
Then you can directly plot.scatter:
df.plot.scatter(x='name', y='numbers', c='color')
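One detail worth checking before plotting: explode keeps the object dtype of the original list column, so a numeric cast may be needed. The sketch below rebuilds the question's table by hand (in the question it comes from read_sql); the name long_df is illustrative.

```python
import pandas as pd

# Sample frame mirroring the table in the question
df = pd.DataFrame({'id': [0, 1],
                   'name': ['Vladislav', 'Michael'],
                   'numbers': [[2, 3, 5], [4, 6, 7, 9]]})

# One row per favourite number, with a fresh 0..n-1 index
long_df = df.explode('numbers', ignore_index=True)

# explode leaves 'numbers' as object dtype, so cast it to integers
long_df['numbers'] = long_df['numbers'].astype(int)
print(long_df.dtypes)
```

With numeric columns, something like long_df.plot.scatter(x='id', y='numbers') would put the ids on X, as the question asks.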
Like this:
import matplotlib.pyplot as plt
for idx, row in df.iterrows():
    plt.plot(row['numbers'])
plt.legend(df['name'])
plt.show()

Seaborn: How to temporarily replace non-numeric values in a column with numeric values for distribution?

This may sound like a strange question, but I was wondering if it's possible to temporarily replace non-numeric values in a column with numeric values, so that we can see the distribution.
I ask because the distplot function only works for numerical values, not non-numeric ones.
Therefore, consider the sample data I have (shown below).
ID  Colour
---------------
1   Red
2   Red
3   Blue
4   Red
5   Blue
Would it be possible to temporarily replace "Red" and "Blue" with numerical values? For example: replacing "Red" with 1 and "Blue" with 0.
Replacing the non-numeric values (Red and Blue) with numeric values (1 and 0) would allow me to generate a distribution plot to see the density of "Red" and "Blue" in my dataset.
How would I achieve this, so that I can see the distribution and density of the Red and Blue colours in my dataset using a distplot?
Thanks.
Consider the sample data:
>>> import pandas as pd
>>> data = {"Colour": ["Red", "Red", "Blue", "Red", "Blue"]}
>>> df = pd.DataFrame(data)
>>> df
Colour
0 Red
1 Red
2 Blue
3 Red
4 Blue
You can then create a colour_map for the values:
>>> colour_map = {'Red': 1, 'Blue': 0}
Then, by applying .map() against the column:
>>> df['Colour'] = df['Colour'].map(colour_map)
Result:
>>> df
Colour
0 1
1 1
2 0
3 1
4 0
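Since the question asks for a temporary replacement, one option is to keep the mapped values in a separate Series instead of overwriting the column. This is a small sketch; the names colour_codes and shares are illustrative, and value_counts is shown only as a quick numeric check of the proportions.

```python
import pandas as pd

df = pd.DataFrame({'Colour': ['Red', 'Red', 'Blue', 'Red', 'Blue']})

colour_map = {'Red': 1, 'Blue': 0}

# Keep the mapped values in a separate Series; df['Colour'] is untouched
colour_codes = df['Colour'].map(colour_map)

# Share of each colour, computed directly from the original labels
shares = df['Colour'].value_counts(normalize=True)
print(colour_codes.tolist())
print(shares)
```

Here colour_codes could be handed to a plotting function while df keeps its original labels.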

How to plot specific colors for a range of values in python dataframe?

I have a df that looks like below:
S.No Date A
0 12/07/03 76
1 12/07/13 1
2 12/07/23 32
3 12/08/03 12
4 12/08/04 22
5 12/08/05 11
I want a plot where the Y axis is A and the X axis is the Date. The problem is with the color: I want all the occurrences of 76 in red, 32 in blue, and all other values of A in green. Is this possible?
Yes, you can do so:
# define the color according to the values of df['A']
colors = np.select((df['A'].eq(76), df['A'].eq(32)), ('r','b'), 'g')
# pass the color to plt.scatter
plt.scatter(x=df['Date'],y=df['A'], c=colors)
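When the colors depend on a few exact values, a dictionary lookup is a possible alternative to np.select. This is a sketch of that swapped-in technique, not the method used above; the name special is illustrative and the frame is rebuilt from the question's sample rows.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['12/07/03', '12/07/13', '12/07/23',
                            '12/08/03', '12/08/04', '12/08/05'],
                   'A': [76, 1, 32, 12, 22, 11]})

# Values with a dedicated color; everything else falls back to green
special = {76: 'r', 32: 'b'}

# Unmapped values become NaN after map, then fillna supplies the default
colors = df['A'].map(special).fillna('g')
print(colors.tolist())
```

The result can be passed to plt.scatter in the same way, for example as c=colors.tolist().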

Adding percentage values onto horizontal bar charts in matplotlib

My pandas dataset, df4, consists of 14 Colour Groups (Green, Blue etc) and 12 Categories (1, 2 etc). I am creating a horizontal bar chart for each category.
print(df4.head())
BASE VOLUME Color Group Type
0 6.0 GREEN 1
1 3.5 GREEN 1
2 2.5 GREEN 2
3 1.5 GREEN 2
4 2.5 BLUE 4
Here is the code below, with how the graph looks for 2 of the categories. On some of the categories, the percentages are all vertically lined up, but on others the percentages are wild.
#groupby / pivot transformation, and reindex
s='1'
dfrr = df4[df4['Type'] == s]
df5 = dfrr.groupby(['Color Group']).sum().sort_values("BASE VOLUME", ascending=False)
data = df5.reset_index().iloc[:,[0,2]]
data.columns = ['Color Group', 'BASE VOLUME']
x= data['BASE VOLUME']
y= data['Color Group']
data2 = data
data2['BASE VOLUME %'] = data2['BASE VOLUME']
data2 = data2.iloc[:,[0,2]]
data2['BASE VOLUME %'] = 100*data2['BASE VOLUME %']/(sum(data2['BASE VOLUME %']))
plt.figure(figsize=(10,6))
clrs = ['red' if (x > 10) else 'gray' for x in data2['BASE VOLUME %']]
ax = sns.barplot(x=x, y=y, data=data2, palette=clrs)
ax.set_xlabel('Base Volume',fontsize=15)
ax.set_ylabel('Color Group',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
for i, v in enumerate(data2['BASE VOLUME %']):
    ax.text(v + 0, i + 0.15, str("{0:.1f}%".format(v)), color='black', fontweight='bold', fontsize=14)
For Category 1, the percentages are lined up:
For Category 4, the percentages are not lined up:
The problem may be that there are only 7 colour groups in Category 4, compared to all 14 in Category 1. How can I tweak the code so that, whatever I set as 's' (i.e. the category), the percentages line up?
