Plotting a given column name across different data frames in Python

All, I have multiple dataframes like
# Plain lists keep attr1/attr2 numeric; wrapping mixed rows in np.array
# would turn every value into a string.
df1 = pd.DataFrame([
    ['a', 1, 2],
    ['b', 3, 4],
    ['c', 5, 6]],
    columns=['name', 'attr1', 'attr2'])
df2 = pd.DataFrame([
    ['a', 2, 3],
    ['b', 4, 5],
    ['c', 6, 7]],
    columns=['name', 'attr1', 'attr2'])
df3 = pd.DataFrame([
    ['a', 3, 4],
    ['b', 5, 6],
    ['c', 7, 8]],
    columns=['name', 'attr1', 'attr2'])
Each of these dataframes is generated at a specific time step, say T=[t1, t2, t3].
I would like to plot attr1 or attr2 from the different dataframes as a function of time T, for 'a', 'b' and 'c', all on the same graph.
For example: plot attr1 vs. time for 'a', 'b' and 'c'.

If I understand correctly, first assign a column T to each of your dataframes, then concatenate the three. You can then group by the name column, iterate over the groups, and plot T against attr1 or attr2:
import matplotlib.pyplot as plt
dfs = pd.concat([df1.assign(T=1), df2.assign(T=2), df3.assign(T=3)])
for name, data in dfs.groupby('name'):
    # One line per name; astype(float) guards against string-typed values.
    plt.plot(data['T'], data['attr2'].astype(float), label=name)
plt.xlabel('Time')
plt.ylabel('attr2')
plt.legend()
plt.show()
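
As a variation on the approach above, the concatenated frame can be pivoted so that each name becomes its own column indexed by time. This is only a sketch: the numeric time values 1, 2, 3 are assumed stand-ins for t1, t2, t3, and the frames are rebuilt here with numeric columns so the block runs on its own.

```python
import pandas as pd

# Three snapshots with the same layout as df1, df2, df3 in the question.
frames = [
    pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [1, 3, 5], 'attr2': [2, 4, 6]}),
    pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [2, 4, 6], 'attr2': [3, 5, 7]}),
    pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [3, 5, 7], 'attr2': [4, 6, 8]}),
]
times = [1, 2, 3]  # assumed numeric stand-ins for t1, t2, t3

# Tag each snapshot with its time step, then stack them into one long frame.
stacked = pd.concat(
    [frame.assign(T=t) for frame, t in zip(frames, times)],
    ignore_index=True,
)

# One row per time step, one column per name, values taken from attr1.
wide = stacked.pivot(index='T', columns='name', values='attr1')
print(wide)
```

With matplotlib installed, calling wide.plot() on this table should draw one line per name against T, which is the same picture the groupby loop produces.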

Related

How to slice in a hybrid (mixed) style

Given an arbitrary df:
df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[7,8,9,10],[10,11,12,13],[14,15,16,17]], columns=['A', 'B','C','D'])
cols_in = list(df)[0:2]+list(df)[4:]
Now:
x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i,cols_in])
Obviously, the loop raises an error, because cols_in holds column labels and iloc only accepts integer positions.
How can I apply this kind of mixed slicing to df, as in the append call above?
It seems like you want to exclude one column? There is no column at position 4, so depending on which columns you are after, something like this might be what you want:
df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[7,8,9,10],[10,11,12,13],[14,15,16,17]], columns=['A', 'B','C','D'])
If you want to get the column indices from column names, you can do:
import numpy as np
cols = ['A', 'B', 'D']
# integer positions of the wanted columns, usable with iloc
cols_in = np.nonzero(df.columns.isin(cols))[0]
x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i, cols_in].to_list())
x
Output:
[[1, 2, 4], [4, 5, 7], [7, 8, 10], [10, 11, 13], [14, 15, 17]]
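
If the row-by-row loop is not required, the same nested list can probably be built in one step. The sketch below shows two options, assuming the wanted columns are 'A', 'B' and 'D' as in the answer above: selecting by label with loc, or translating labels to positions with Index.get_indexer and selecting with iloc.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [10, 11, 12, 13], [14, 15, 16, 17]],
    columns=['A', 'B', 'C', 'D'],
)
cols = ['A', 'B', 'D']

# Option 1: select by label, then convert every row to a plain list.
by_label = df.loc[:, cols].values.tolist()

# Option 2: translate labels to integer positions, then select by position.
positions = df.columns.get_indexer(cols)
by_position = df.iloc[:, positions].values.tolist()

print(by_label)
print(by_position)
```

Both variants avoid mixing labels into iloc, which is what triggered the error in the question.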

Convert DataFrame into multi-dimensional array with the column names of DataFrame

Below is the DataFrame I want to work with:
df = pd.DataFrame({'A': [1,1,1],
                   'B': [2,2,3],
                   'C': [4,5,4]})
Each row of df forms a unique key. The objective is to create the following list of nested lists:
parameter = [[['A', 1],['B', 2], ['C', 4]],
             [['A', 1],['B', 2], ['C', 5]],
             [['A', 1],['B', 3], ['C', 4]]]
This is related to this question, where I have to iterate over the parameters; instead of providing them to my function manually, I have to collect all parameters from the rows of df in a list.
You could use the following list comprehension, which zips the values of each row with the columns of the dataframe. Note that each pair comes out as [value, column]:
from itertools import repeat
[list(map(list, zip(row, cols))) for row, cols in zip(df.values.tolist(), repeat(df.columns))]
[[[1, 'A'], [2, 'B'], [4, 'C']],
 [[1, 'A'], [2, 'B'], [5, 'C']],
 [[1, 'A'], [3, 'B'], [4, 'C']]]
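
The output above pairs each value with its column as [value, column], while the question asks for [column, value]. A minimal sketch that produces the requested order, assuming the same df as in the question, simply puts the column name first in each pair:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1],
                   'B': [2, 2, 3],
                   'C': [4, 5, 4]})

# For every row, pair each column name with that row's value, name first.
parameter = [
    [[col, val] for col, val in zip(df.columns, row)]
    for row in df.values.tolist()
]
print(parameter)
```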

Selecting different rows from different GroupBy groups

As opposed to GroupBy.nth, which selects the same index for each group, I would like to take specific indices from each group. For example, if my GroupBy object consisted of four groups and I would like the 1st, 5th, 10th, and 15th from each respectively, then I would like to be able to pass x = [0, 4, 9, 14] and get those rows.
This is kind of a strange thing to want; is there a reason?
In any case, to do what you want, try this:
df = pd.DataFrame([['a', 1], ['a', 2],
                   ['b', 3], ['b', 4], ['b', 5],
                   ['c', 6], ['c', 7]],
                  columns=['group', 'value'])
def index_getter(which):
    # Returns a function that picks the configured position from each group.
    def get(series):
        # series.name is the group key when called via groupby().apply()
        return series.iloc[which[series.name]]
    return get
which = {'a': 0, 'b': 2, 'c': 1}
df.groupby('group')['value'].apply(index_getter(which))
Which results in:
group
a 1
b 5
c 7
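
An alternative that avoids apply and keeps the full rows is to compare each row's position within its group against the position wanted for that group. This is a sketch under the same assumptions as the answer above (the which mapping from group key to zero-based position is taken from it):

```python
import pandas as pd

df = pd.DataFrame(
    [['a', 1], ['a', 2], ['b', 3], ['b', 4], ['b', 5], ['c', 6], ['c', 7]],
    columns=['group', 'value'],
)
which = {'a': 0, 'b': 2, 'c': 1}  # zero-based position wanted per group

# Position of every row inside its own group: 0, 1, 0, 1, 2, 0, 1.
position_in_group = df.groupby('group').cumcount()

# Position requested for the group each row belongs to.
wanted_position = df['group'].map(which)

# Keep only the rows whose position matches the requested one.
picked = df[position_in_group == wanted_position]
print(picked)
```

A group whose requested position is out of range simply contributes no row here, whereas the iloc-based version would raise an IndexError.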

Can I do a conditional sort on two different columns, but where the order of two columns is reversed based on the secondary condition?

Edit: Since writing this, I remembered a third necessary condition. If the difference between the values at index 1 (time) is greater than or equal to 2, then the rows should be sorted normally by the index 1 (time) column. So, because the time value for B is 6, which is within 2 of the T time of 5, the address rule applies and B should come before T. However, for T and K, for example, because the value 7 for K is 2 greater than the value 5 for T, T should come first.
Let's say I have this array
input = [['user_id', 'time', 'address'],
['F', 5, 5],
['T', 5, 8],
['B', 6, 6],
['K', 7, 7],
['J', 7, 9],
['M', 9, 10]]
I'd like to sort the rows -- first in ascending order by index 1 (time). However, secondarily, if index 2 (address) for a given user_id such as 'B' is less than index 2 (address) for another user such as 'T', I'd like user_id 'B' to come before user_id 'T'.
So the final output would look like this:
output = [['user_id', 'time', 'address'],
['F', 5, 5],
['B', 6, 6],
['T', 5, 8],
['K', 7, 7],
['J', 7, 9],
['M', 9, 10]]
If possible, I'd like to do this without Pandas.
>>> import functools
>>> from pprint import pprint
>>>
>>> def compare(item1, item2):
...     # Time decides when the gap is at least 2 in either direction;
...     # otherwise the address decides.
...     dt = item1[1] - item2[1]
...     return dt if abs(dt) >= 2 else item1[2] - item2[2]
...
>>> output = [input[0]] + sorted(input[1:], key=functools.cmp_to_key(compare))
>>> pprint(output)
[['user_id', 'time', 'address'],
 ['F', 5, 5],
 ['B', 6, 6],
 ['T', 5, 8],
 ['K', 7, 7],
 ['J', 7, 9],
 ['M', 9, 10]]
For the built-in function sorted you can provide a custom key function. If a plain two-level sort is enough, the key can return a tuple of columns 1 and 2: rows are ordered by column 1 first, and rows with the same value in that column are ordered by column 2. Note that this ignores the "within 2" condition from the edit, so 'T' (time 5) ends up ahead of 'B' (time 6).
data = [['user_id', 'time', 'address'],
        ['F', 5, 5],
        ['B', 6, 6],
        ['T', 5, 8],
        ['K', 7, 7],
        ['J', 7, 9],
        ['M', 9, 10]]
data_sorted = [data[0]] + sorted(data[1:], key=lambda row: (row[1], row[2]))
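
To make the difference between the two answers concrete, the sketch below runs both on the rows from the question (header row omitted). The comparator here uses abs() so that the threshold of 2 applies in both directions, which is my reading of the rule rather than something the question states. One caveat: a "within 2" rule is not transitive in general, so on other data the comparator may not define a consistent order.

```python
from functools import cmp_to_key

rows = [
    ['F', 5, 5],
    ['T', 5, 8],
    ['B', 6, 6],
    ['K', 7, 7],
    ['J', 7, 9],
    ['M', 9, 10],
]

def compare(a, b):
    """Order by time when times differ by 2 or more, else by address."""
    dt = a[1] - b[1]
    return dt if abs(dt) >= 2 else a[2] - b[2]

# Plain two-level sort: time first, address as tie-breaker.
by_tuple = sorted(rows, key=lambda row: (row[1], row[2]))

# Threshold rule: close times are ordered by address instead.
by_rule = sorted(rows, key=cmp_to_key(compare))

print([row[0] for row in by_tuple])  # ['F', 'T', 'B', 'K', 'J', 'M']
print([row[0] for row in by_rule])   # ['F', 'B', 'T', 'K', 'J', 'M']
```

Only the second ordering matches the output requested in the question.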

How to plot different parts of same Pandas Series column with different colors, having a customized index?

This is a follow-up for my previous question here.
Let's say I have a Series like this:
testdf = pd.Series([3, 4, 2, 5, 1, 6, 10], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
When plotting, this is the result:
testdf.plot()
However, I want to plot, say, the line through the first 4 values in blue (the default) and the rest of the line in red. Trying the solution suggested in the post mentioned above, this is the result I get:
fig, ax = plt.subplots(1, 1)
testdf.plot(ax=ax,color='b')
testdf.iloc[3:].plot(ax=ax,color='r')
I only get the expected result if I don't define my Series with a custom index:
testdf = pd.Series([3, 4, 2, 5, 1, 6, 10])
fig, ax = plt.subplots(1, 1)
testdf.plot(ax=ax,color='b')
testdf.iloc[3:].plot(ax=ax,color='r')
How could I achieve the desired result, then?
I wanted to write a comment, but it was too long, so I am writing it here.
What you want to achieve works well if you plot bars (which are discrete):
import pandas as pd
import numpy as np
df = pd.Series([3, 4, 2, 5, 1, 6, 10], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df.plot(kind='bar', color=np.where(df.index < 'e', 'b', 'r'))
But not for lines (which are continuous), as you already noticed.
If you don't need custom indices, you can use:
df = pd.Series([3, 4, 2, 5, 1, 6, 10])
cut = 4
ax = df.iloc[:cut].plot(color='b')
# start one point early so the red segment connects to the blue one
df.iloc[cut - 1:].plot(ax=ax, color='r')
With custom indices, you should split your series into two parts, each padded with NaN so that both keep the full index. One example:
df = pd.Series([3, 4, 2, 5, 1, 6, 10], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df1 = pd.Series(np.where(df.index < 'e', df.values, np.nan), index=df.index)
df2 = pd.Series(np.where(df.index >= 'd', df.values, np.nan), index=df.index)
ax = df1.plot(color='b')
df2.plot(ax=ax, color='r')
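
The split above compares index labels, which relies on the labels sorting alphabetically. A position-based variant is sketched below; split_for_two_colors is a hypothetical helper name, and the idea is the same NaN padding, driven by row position so that it should work for any index.

```python
import numpy as np
import pandas as pd

def split_for_two_colors(series, cut):
    """Return (head, tail): the first `cut` points and the rest.

    Both pieces keep the full index and are padded with NaN. The tail
    starts one point early so the two line segments share a point.
    """
    pos = np.arange(len(series))
    head = series.where(pos < cut)
    tail = series.where(pos >= cut - 1)
    return head, tail

s = pd.Series([3, 4, 2, 5, 1, 6, 10], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
head, tail = split_for_two_colors(s, cut=4)
print(head)
print(tail)
```

With matplotlib installed, ax = head.plot(color='b') followed by tail.plot(ax=ax, color='r') should then draw the two segments on the same axes.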
