The problem is: I have a SQLAlchemy database called NumFav with arrays of favourite numbers of some people, which uses such a structure:
id name numbers
0 Vladislav [2, 3, 5]
1 Michael [4, 6, 7, 9]
numbers is postgresql.ARRAY(Integer)
I want to make a plot with the id of each person on X and their numbers as dots on Y, in order to show which numbers have been chosen.
I extract data using
df = pd.read_sql(Session.query(NumFav).statement, engine)
How can I create a plot with such data?
You can explode the number lists into "long form":
df = df.explode('numbers')
df['color'] = df.id.map({0: 'red', 1: 'blue'})
# id name numbers color
# 0 Vladislav 2 red
# 0 Vladislav 3 red
# 0 Vladislav 5 red
# 1 Michael 4 blue
# 1 Michael 6 blue
# 1 Michael 7 blue
# 1 Michael 9 blue
Then you can directly plot.scatter:
df.plot.scatter(x='name', y='numbers', c='color')
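The explode step can be tried without a database. The following is a minimal sketch, assuming a hand-built frame in place of the pd.read_sql result; the names long_df and plot_favourites are made up for illustration. Because explode leaves an object column, the sketch casts numbers to int before plotting, and it puts id on X as the question asks.

```python
import pandas as pd

# Stand-in for the frame returned by pd.read_sql (sample rows from the question)
df = pd.DataFrame({
    "id": [0, 1],
    "name": ["Vladislav", "Michael"],
    "numbers": [[2, 3, 5], [4, 6, 7, 9]],
})

# One row per (person, number); explode yields an object column, so cast to int
long_df = df.explode("numbers").astype({"numbers": int}).reset_index(drop=True)


def plot_favourites(frame):
    """Scatter ids on X against favourite numbers on Y (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    ax = frame.plot.scatter(x="id", y="numbers")
    ax.set_xticks(sorted(frame["id"].unique()))  # one tick per person
    plt.show()
```

Calling plot_favourites(long_df) draws the dots; the reshaping above it runs on its own.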
Like this:
import matplotlib.pyplot as plt
for idx, row in df.iterrows():
    plt.plot(row['numbers'])
plt.legend(df['name'])
plt.show()
Related
I have a data frame with two corresponding sets of columns, e.g. like this sample containing people and their rating of three fruits as well as their ability to detect a fruit ('corresponding' means that banana_rati corresponds to banana_reco etc.).
import pandas as pd
df_raw = pd.DataFrame(data=[["name1", 10, 10, 9, 10, 10, 10],
                            ["name2", 10, 10, 8, 10, 8, 4],
                            ["name3", 10, 8, 8, 10, 8, 8],
                            ["name4", 5, 10, 10, 5, 10, 8]],
                      columns=["name", "banana_rati", "mango_rati", "orange_rati",
                               "banana_reco", "mango_reco", "orange_reco"])
Suppose I now want to find each respondent's favorite fruit, which I define as the highest-rated fruit.
I do this via:
cols_find_max = ["banana_rati", "mango_rati", "orange_rati"] # columns to find the maximum in
mxs = df_raw[cols_find_max].eq(df_raw[cols_find_max].max(axis=1), axis=0) # bool indicator if the cell contains the row-wise maximum value across cols_find_max
However, some respondents gave more than one fruit the highest rating:
df_raw['highest_rated_fruits'] = mxs.dot(mxs.columns + ' ').str.rstrip(', ').str.replace("_rati", "").str.split()
df_raw['highest_rated_fruits']
# Out:
# [banana, mango]
# [banana, mango]
# [banana]
# [mango, orange]
I now want to use the maximum of ["banana_reco", "mango_reco", "orange_reco"] for tie breaks. If this also gives no tie break, I want a random selection of fruits from the so-determined favorite ones.
Can someone help me with this?
The expected output is:
df_raw['fav_fruit']
# Out
# mango # <- random selection from banana (rating: 10, recognition: 10) and mango (same values)
# banana # <- highest ratings: banana, mango; highest recognition: banana
# banana # <- highest rating: banana
# mango # <- highest ratings: mango, orange; highest recognition: mango
UPDATED
Here's a way to do what your question asks:
from random import sample
df = pd.DataFrame({
    'name': [c[:-len('_rati')] for c in df_raw.columns if c.endswith('_rati')]})
df = df.assign(rand=df.name + '_rand', tupl=df.name + '_tupl')
num = len(df)
df_raw[df.rand] = [sample(range(num), k=num) for _ in range(len(df_raw))]
df_ord = pd.DataFrame(
    {f'{fr}_tupl': df_raw.apply(
        lambda x: tuple(x[(f'{fr}_{suff}' for suff in ('rati','reco','rand'))]), axis=1)
     for fr in df.name})
df_raw['fav_fruit'] = df_ord.apply(lambda x: df.name[list(x==x.max())].squeeze(), axis=1)
df_raw = df_raw.drop(columns=df.rand)
Sample output:
name banana_rati mango_rati orange_rati banana_reco mango_reco orange_reco fav_fruit
0 name1 10 10 9 10 10 10 banana
1 name2 10 10 8 10 8 4 banana
2 name3 10 8 8 10 8 8 banana
3 name4 5 10 10 5 10 8 mango
Explanation:
Create one new column per fruit, ending in rand, so that together they hold a randomly shuffled sequence of the integers 0 through (number of fruits - 1) for each row.
Create one new column per fruit, ending in tupl, containing the 3-tuple (rati, reco, rand) for that fruit.
Because the rand values within a row are all distinct, the 3-tuples always break ties, so for each row the favorite fruit is simply the one whose tuple equals the row's maximum tuple.
Drop the intermediate columns and we're done.
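The tuple idea can also be written row by row. This is a sketch of the same technique rather than the code above: it builds the (rating, recognition, random) key inside a helper and lets Python's lexicographic tuple comparison pick the winner. The names fruits, rng and pick_favourite are made up for illustration, and the fixed seed is only there to make reruns repeatable.

```python
import random

import pandas as pd

df_raw = pd.DataFrame(
    data=[["name1", 10, 10, 9, 10, 10, 10],
          ["name2", 10, 10, 8, 10, 8, 4],
          ["name3", 10, 8, 8, 10, 8, 8],
          ["name4", 5, 10, 10, 5, 10, 8]],
    columns=["name", "banana_rati", "mango_rati", "orange_rati",
             "banana_reco", "mango_reco", "orange_reco"])

# Fruit names derived from the *_rati columns
fruits = [c[:-len("_rati")] for c in df_raw.columns if c.endswith("_rati")]
rng = random.Random(0)  # seeded only so the sketch is repeatable


def pick_favourite(row):
    # Tuples compare element by element: rating first, then recognition,
    # then a random number that only matters when the first two are tied.
    keys = {f: (row[f"{f}_rati"], row[f"{f}_reco"], rng.random()) for f in fruits}
    return max(keys, key=keys.get)


df_raw["fav_fruit"] = df_raw.apply(pick_favourite, axis=1)
```

For the sample data only the first row has a genuine tie (banana and mango), so only that row can differ between seeds.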
Try:
import numpy as np
mxs.dot(mxs.columns + ' ').str.rstrip(', ').str.replace("_rati", "").str.split().apply(lambda x: x[np.random.randint(len(x))])
This adds .apply(lambda x: x[np.random.randint(len(x))]) to the end of your last statement and randomly selects an element from the list.
Run 1:
0 banana
1 banana
2 banana
3 orange
Run 2:
0 mango
1 banana
2 banana
3 orange
I have a DataFrame that I read from a CSV file, and I want to store the individual values from its rows in variables so that I can use them in a later step. Note that I do not want the result as a Series but as plain values such as integers. I am still learning, and I could not understand the resources I have consulted. Thank you in advance.
X Y Z
1 2 3
3 2 1
4 5 6
I want the values in a variable as x=1,3,4 and so on, as stated above.
There are many ways to do this; one simple method is to iterate over the DataFrame's index. To illustrate, I will create a dictionary, convert it to a DataFrame, and then iterate over its rows.
# Start by importing pandas as pd
import pandas as pd
# Define a dictionary that contains a player's stats (just for illustration, not real data)
myData = {'Football Club': ['Chelsea', 'Man Utd', 'Inter Milan', 'Everton'],
          'Matches Played': [2, 32, 36, 37],
          'Goals Scored': [1, 12, 24, 25],
          'Assist Given': [0, 0, 11, 6],
          'Red card': [0, 0, 0, 0],
          'Yellow Card': [0, 4, 4, 3]}
# Next, create a DataFrame from the dictionary (the 'Assist Given' column is not selected)
df = pd.DataFrame(myData, columns=['Football Club', 'Matches Played', 'Goals Scored', 'Red card', 'Yellow Card'])
# See what the data look like
print("This is the created Dataframe from the dictionary:\n", df)
print("\n Now, you can iterate over selected rows or all the rows using index attribute as follows:\n")
# Store the values in variables
for indIte in df.index:
    clubs = df['Football Club'][indIte]
    goals = df['Goals Scored'][indIte]
    matches = df['Matches Played'][indIte]
    # To see the results that can be used later in the same program
    print(clubs, matches, goals)
#You will get the following results:
This is the created Dataframe from the dictionary :
Football Club Matches Played Goals Scored Red card Yellow Card
0 Chelsea 2 1 0 0
1 Man Utd 32 12 0 4
2 Inter Milan 36 24 0 4
3 Everton 37 25 0 3
Now, you can iterate over selected rows or all the rows using index attribute as follows:
Chelsea 2 1
Man Utd 32 12
Inter Milan 36 24
Everton 37 25
Use:
x, y, z = df.to_dict(orient='list').values()
>>> x
[1, 3, 4]
>>> y
[2, 2, 5]
>>> z
[3, 1, 6]
df.values returns the DataFrame's data as a NumPy array, so you can work with df.values directly in subsequent processing.
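To make the difference between a Series, a list, an array and a single scalar concrete, here is a small sketch, assuming the X/Y/Z frame from the question is rebuilt by hand instead of being read from CSV. It uses to_numpy(), which returns the same kind of array as df.values.

```python
import pandas as pd

# Hand-built copy of the question's table (stands in for pd.read_csv)
df = pd.DataFrame({"X": [1, 3, 4], "Y": [2, 2, 5], "Z": [3, 1, 6]})

arr = df.to_numpy()            # 2-D NumPy array, one row per DataFrame row
x = df["X"].tolist()           # plain Python list of ints for one column
first_x = int(df.at[0, "X"])   # a single cell, converted to a built-in int
row0 = arr[0].tolist()         # first row as a plain Python list
```

df.at reads exactly one cell by row label and column name, which is usually the most direct route when a single integer is wanted rather than a Series.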
This may sound like a strange question, but I was wondering if it's possible to temporarily replace non-numeric values in a column with numeric values, so that we can see the distribution.
I ask because the distplot function only works for numerical values, not non-numeric ones.
Therefore, consider the sample data I have (shown below).
ID Colour
---------------
1 Red
---------------
2 Red
---------------
3 Blue
---------------
4 Red
---------------
5 Blue
---------------
Would it be possible to temporarily replace "Red" and "Blue" with numerical values? For example: replacing "Red" with 1 and "Blue" with 0.
Replacing the non-numeric values (Red and Blue) with numeric values (1 and 0) would allow me to generate a distribution plot to see the density of "Red" and "Blue" in my dataset.
How can I achieve this, so that I can see the distribution and density of the Red and Blue colours in my dataset using a distplot?
Thanks.
Consider the sample data:
>>> import pandas as pd
>>> data = {"Colour": ["Red", "Red", "Blue", "Red", "Blue"]}
>>> df = pd.DataFrame(data)
>>> df
Colour
0 Red
1 Red
2 Blue
3 Red
4 Blue
You can then create a colour_map for the values:
>>> colour_map = {'Red': 1, 'Blue': 0}
Then apply .map() to the column:
>>> df['Colour'] = df['Colour'].map(colour_map)
Result:
>>> df
Colour
0 1
1 1
2 0
3 1
4 0
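Since the question asks for a temporary replacement, one option is to keep the mapped values in a separate Series instead of overwriting the column. The sketch below assumes the same five-row sample; the names codes and shares are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Colour": ["Red", "Red", "Blue", "Red", "Blue"]})
colour_map = {"Red": 1, "Blue": 0}

# Numeric codes live in their own Series; df["Colour"] keeps its labels
codes = df["Colour"].map(colour_map)

# Share of each colour, computed directly from the original labels
shares = df["Colour"].value_counts(normalize=True)
```

The codes Series can then be passed to a plotting function that needs numbers, while shares already gives the proportion of each colour without any mapping.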
I have a dataframe as follows:
df = pd.DataFrame({'Product' : ['A'],
'Size' : ["['XL','L','S','M']"],
'Color' : ["['Blue','Red','Green']"]})
print(df)
Product Size Color
0 A ['XL','L','S','M'] ['Blue','Red','Green']
I need to transform the frame for an ingestion system which only accepts the following format:
target_df = pd.DataFrame({'Description' : ['Product','Color','Color','Color','Size','Size','Size','Size'],
'Agg' : ['A','Blue','Green','Red','XL','L','S','M']})
Description Agg
0 Product A
1 Color Blue
2 Color Green
3 Color Red
4 Size XL
5 Size L
6 Size S
7 Size M
I've attempted all forms of explode, groupby and even iterrows, but I can't get it to line up. I have thousands of products. With a few groupby and explode calls I can stack the columns, but then I get duplicate product names, which I need to avoid; the order is important too.
Try:
df['Size']=df['Size'].map(eval)
df['Color']=df['Color'].map(eval)
df=df.stack().explode()
Outputs:
0 Product A
Size XL
Size L
Size S
Size M
Color Blue
Color Red
Color Green
dtype: object
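The stacked Series above can be turned into the two-column Description/Agg layout with a couple of index operations. This is a sketch under two assumptions: the list strings are parsed with ast.literal_eval instead of eval (it only accepts Python literals), and the variable name long_df is made up for illustration.

```python
from ast import literal_eval

import pandas as pd

df = pd.DataFrame({'Product': ['A'],
                   'Size': ["['XL','L','S','M']"],
                   'Color': ["['Blue','Red','Green']"]})

# Parse the string representations into real lists
for col in ['Size', 'Color']:
    df[col] = df[col].map(literal_eval)

long_df = (df.stack()                         # index: (row, column name)
             .explode()                       # one entry per list element
             .reset_index(level=0, drop=True) # drop the original row number
             .rename_axis('Description')      # column names become 'Description'
             .reset_index(name='Agg'))        # values become 'Agg'
```

The rows come out in the source column order (Product, Size, Color), not the Product, Color, Size order of the target frame, so a final reordering step would still be needed if that exact order matters.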
Here's a solution without eval:
(df.T[0].str.strip('[]')
.str.split(',', expand=True)
.stack().str.strip("''")
.reset_index(level=1, drop=True)
.rename_axis(index='Description')
.reset_index(name='Agg')
)
Output:
Description Agg
0 Product A
1 Size XL
2 Size L
3 Size S
4 Size M
5 Color Blue
6 Color Red
7 Color Green
Although both of the answers are already sufficient, I thought this one was nice to work out. Here's a method using explode and melt:
from ast import literal_eval
# needed, because somehow apply(literal_eval) wasn't working
for col in df[['Size', 'Color']]:
    df[col] = df[col].apply(literal_eval)
dfn = df.explode('Size').reset_index(drop=True)
dfn['Color'] = df['Color'].explode().reset_index(drop=True).reindex(dfn.index)
dfn = dfn.melt(var_name='Description', value_name='Agg').ffill().drop_duplicates().reset_index(drop=True)
Description Agg
0 Product A
1 Size XL
2 Size L
3 Size S
4 Size M
5 Color Blue
6 Color Red
7 Color Green
I am working on a pandas data frame and trying to subset it such that the cumulative sum of the qty column is not greater than 18 and the percentage of yellow rows selected is not less than 65%, and then to run multiple iterations of the same. However, sometimes the loop runs forever, and sometimes it does produce results but they are the same in every iteration.
Everything after the while loop was taken from the post "Python random sample selection based on multiple conditions".
df=pd.DataFrame({'id':['A','B','C','D','E','G','H','I','J','k','l','m','n','o'],'color':['red','red','orange','red','red','red','red','yellow','yellow','yellow','yellow','yellow','yellow','yellow'], 'qty':[5,2, 3, 4, 7, 6, 8, 1, 5,2, 3, 4, 7, 6]})
df_sample = df
for x in range(2):
    sample_s = df.sample(n=df.shape[0])
    sample_s = sample_s[(sample_s.qty.cumsum() <= 30)]
    sample_size = len(sample_s)
    while sum(df['qty']) > 18:
        yellow_size = 0.65
        df_yellow = df[df['color'] == 'yellow'].sample(int(yellow_size*sample_size))
        others_size = 1 - yellow_size
        df_others = df[df['color'] != 'yellow'].sample(int(others_size*sample_size))
        df = pd.concat([df_yellow, df_others]).sample(frac=1)
    print(df)
This is the result I get when it works; both iterations give the same output.
color id qty
red H 2
yellow n 3
yellow J 5
red G 2
yellow I 1
red D 4
color id qty
red H 2
yellow n 3
yellow J 5
red G 2
yellow I 1
red D 4
I am really hoping someone can help me resolve this issue.
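One way to read the two requirements is as a rejection-sampling problem: shuffle the rows, keep the leading rows whose running qty total stays within the cap, and retry until the yellow share is high enough. The sketch below is only an illustration under stated assumptions: the yellow percentage is measured by row count (not by qty), the helper name draw_sample and its parameters are made up, and each draw starts from the original frame instead of overwriting df.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'J', 'k', 'l', 'm', 'n', 'o'],
    'color': ['red', 'red', 'orange', 'red', 'red', 'red', 'red',
              'yellow', 'yellow', 'yellow', 'yellow', 'yellow', 'yellow', 'yellow'],
    'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5, 2, 3, 4, 7, 6]})


def draw_sample(frame, max_qty=18, min_yellow=0.65, seed=0, max_tries=1000):
    """Return a random subset with sum(qty) <= max_qty and enough yellow rows."""
    for attempt in range(max_tries):
        # Shuffle all rows; a different seed per attempt gives a different order
        shuffled = frame.sample(frac=1, random_state=seed + attempt)
        # qty is positive, so this mask keeps a leading run of rows
        picked = shuffled[shuffled['qty'].cumsum() <= max_qty]
        yellow_share = (picked['color'] == 'yellow').mean()
        if yellow_share >= min_yellow:
            return picked
    raise RuntimeError("no sample satisfied both constraints")


# Two independent draws, each taken from the untouched original frame
samples = [draw_sample(df, seed=1000 * i) for i in range(2)]
```

Bounding the retries with max_tries avoids an endless loop when the constraints cannot be met, and drawing from the original df each time keeps one iteration from feeding its result into the next.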