I have the following code:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 6), index=dates, columns=["a","b","c","a_x","b_x","c_x"])
which results in the following:
a b c a_x b_x c_x
2013-01-01 -0.871681 0.938965 -0.804039 0.329384 -1.211573 0.160477
2013-01-02 1.673895 2.017654 2.181771 0.336220 0.389709 0.246264
2013-01-03 -0.670211 -0.561792 -0.747824 -0.837123 0.129040 1.044153
2013-01-04 -0.571023 -0.430249 0.024393 1.017622 1.072909 0.816249
2013-01-05 0.074952 -0.119953 0.245248 2.658196 -1.525059 1.131054
2013-01-06 0.203816 0.379939 -0.162919 -0.674444 -0.650636 0.415143
I want to generate simple line plots - three in total, each plotting one of the pairs:
a and a_x, b and b_x, c and c_x
I know how to generate the charts, but since the table is big and follows a consistent column naming pattern, I was wondering whether this could be done with a for loop. For example, the original table would also have columns d and d_x, columns e and e_x, and so on.
You could use groupby along axis=1, grouping by the first element of the split column names:
for _, data in df.groupby(df.columns.str.split('_').str[0], axis=1):
    data.plot()
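If your pandas version warns about groupby(axis=1) (it is deprecated in recent releases), a plain for loop over the base column names does the same job. A minimal sketch, assuming every base column x has a matching x_x partner:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 6), index=dates,
                  columns=["a", "b", "c", "a_x", "b_x", "c_x"])

# Base names are the columns without an underscore suffix
bases = [c for c in df.columns if '_' not in c]
for base in bases:
    df[[base, base + '_x']].plot(title=base)  # one line chart per pair
plt.show()
```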
I have a pandas dataframe which looks like this one:
Name A_x B_x C_x A_y B_y C_y
ab xyz 2 abc123 xyz 2 abc123
cd yza 2 def456 zab 1 NaN
ef zab 3 jkl012 abc 3 jkl012
What I now want to do is compare column A_x with A_y, B_x with B_y, and C_x with C_y. I would like a function that returns a list of the Name values for which the compared columns do not match.
E.g.:
When comparing A_x to A_y I would like to get a list like this:
list_A = ['cd','ef']
Comparison of B_x and B_y:
list_B = ['cd']
Comparison of C_x and C_y:
list_C = ['cd']
Note: The dataframe contains some integer but also some string and NaN values.
How could I get such a function?
Solution: reshape the DataFrame into two columns, x and y, by splitting the column names on _ and using DataFrame.stack; replace missing values, then use boolean indexing with Series.ne to keep the rows where x and y differ; finally aggregate the Name values into lists and convert to a dictionary of all non-matching names:
df1 = df.set_index('Name')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack(0, dropna=False).fillna('missing')
d = df1[df1['x'].ne(df1['y'])].reset_index().groupby('level_1')['Name'].agg(list).to_dict()
print(d)
{'A': ['cd', 'ef'], 'B': ['cd'], 'C': ['cd']}
Then select using a key of the dict:
print(d['A'])
['cd', 'ef']
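A more explicit alternative (a sketch, not from the original answer): loop over the base names and build each list with a boolean mask, filling NaN on both sides first so missing values count as mismatches:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['ab', 'cd', 'ef'],
                   'A_x': ['xyz', 'yza', 'zab'], 'B_x': [2, 2, 3],
                   'C_x': ['abc123', 'def456', 'jkl012'],
                   'A_y': ['xyz', 'zab', 'abc'], 'B_y': [2, 1, 3],
                   'C_y': ['abc123', np.nan, 'jkl012']})

d = {}
for base in ['A', 'B', 'C']:
    # NaN never compares equal, so fill it with a sentinel on both sides
    left = df[base + '_x'].fillna('missing')
    right = df[base + '_y'].fillna('missing')
    d[base] = df.loc[left.ne(right), 'Name'].tolist()
print(d)  # {'A': ['cd', 'ef'], 'B': ['cd'], 'C': ['cd']}
```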
I've got a dataframe with a multiindex of the form:
(label, date)
where label is a string and date is a DateTimeIndex.
I want to slice my dataframe by date; say for example, I want to get all the rows between 2007 and 2009:
df.loc[:, '2007':'2009']
It seems like the second part (where I've put the date) is actually slicing the column.
How do I slice on date?
You can use partial string indexing - DatetimeIndex partial string indexing also works on a DataFrame with a MultiIndex:
df = pd.DataFrame(np.random.randn(20, 1),
                  columns=['A'],
                  index=pd.MultiIndex.from_product(
                      [['a', 'b'],
                       pd.date_range('20050101', periods=10, freq='10M')]))
idx = pd.IndexSlice
df1 = df.loc[idx[:, '2007':'2009'], :]
print(df1)
A
a 2007-07-31 0.325027
2008-05-31 -1.307117
2009-03-31 -0.556454
b 2007-07-31 1.808920
2008-05-31 1.245404
2009-03-31 -0.425046
Another idea is to use loc with the axis=0 parameter:
df1 = df.loc(axis=0)[:, '2007':'2009']
print(df1)
A
a 2007-07-31 0.325027
2008-05-31 -1.307117
2009-03-31 -0.556454
b 2007-07-31 1.808920
2008-05-31 1.245404
2009-03-31 -0.425046
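Another option (a sketch; the dates are built with DateOffset rather than the '10M' alias, which is deprecated in recent pandas) is a boolean mask built from the date level with get_level_values, which avoids IndexSlice entirely:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['a', 'b'],
     pd.date_range('2005-01-31', periods=10, freq=pd.DateOffset(months=10))])
df = pd.DataFrame(np.random.randn(20, 1), columns=['A'], index=idx)

# Level 1 holds the dates; compare its years against the wanted window
years = df.index.get_level_values(1).year
df1 = df[(years >= 2007) & (years <= 2009)]
print(df1)  # six rows: three dates per label
```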
I have 2 different data frames that share a column called date. I want to plot both data frames with the common date column on the X axis and the value on the Y axis. Also, I want to do this after concatenating both data frames into a third frame. Here is what I currently do:
df1 = pd.DataFrame({'value': [1,2,3,4,5], 'date': [20,40,60,80,100]})
df2 = pd.DataFrame({'value': [11,21,31,41,51], 'date': [20,40,60,80,100]})
df3 = pd.concat([df1, df2], keys=['df1','df2'], axis=1)
df3.plot()
plt.show()
but the resulting plot is not what I wanted: it draws 4 lines, as can be seen from the legend.
How could I get just 2 lines with a common X axis and the difference reflected on the Y axis? Please note that I want to do this after concatenating the data frames df1 and df2 and by calling plot on df3.
You could use the "date" column as index before concatenating.
df1 = pd.DataFrame({'value': [1,2,3,4,5], 'date': [20,40,60,80,100]})
df2 = pd.DataFrame({'value': [11,21,31,41,51], 'date': [20,40,60,80,100]})
df3 = pd.concat([df1.set_index("date"), df2.set_index("date")], keys=['df1','df2'], axis=1)
df3.plot()
This creates a dataframe with only the two "value" columns and the date as index.
When plotting the index is used as x values and for each column a line is drawn.
You could also ignore the column index and set new column names afterwards:
df3 = pd.concat([df1.set_index("date"), df2.set_index("date")], axis=1, ignore_index=True)
df3.columns = ['df1', 'df2']
Or you can drop the level of the column index that is common to both columns after concatenation:
df3 = pd.concat([df1.set_index("date"), df2.set_index("date")], keys=['df1','df2'], axis=1)
df3.columns = df3.columns.droplevel(level=1)
Try:
df3=pd.merge(df1,df2,on='date')
df3.plot.line(x="date")
plt.show()
First, since the dates appear to be the same, you can merge on the date column:
df3=pd.merge(df1,df2,on='date')
value_x date value_y
0 1 20 11
1 2 40 21
2 3 60 31
3 4 80 41
4 5 100 51
Another way to do it, using matplotlib: plot date vs value_x and date vs value_y:
plt.plot(df3["date"],df3["value_x"],label="df1")
plt.plot(df3["date"],df3["value_y"],label="df2")
plt.legend()
plt.show()
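Putting the merge route together as a self-contained sketch (renaming the suffixed columns first so the legend shows the frame names; the Agg backend line is only there so the snippet runs headless):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

df1 = pd.DataFrame({'value': [1, 2, 3, 4, 5], 'date': [20, 40, 60, 80, 100]})
df2 = pd.DataFrame({'value': [11, 21, 31, 41, 51], 'date': [20, 40, 60, 80, 100]})

# merge on the shared date column, then rename the suffixed value columns
df3 = (pd.merge(df1, df2, on='date')
         .rename(columns={'value_x': 'df1', 'value_y': 'df2'}))
ax = df3.plot(x='date')  # two lines, labelled df1 and df2
```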
I have a lot of separate dataframes in a list, each with MultiIndexed columns, forming time series over different periods and lengths. I would like to do three things:
1. Bring together all of the separate dataframes
2. Append and sort dataframes with identical MultiIndexed columns along the time axis
3. Concatenate dataframes with different MultiIndexed columns along the column axis (axis=1)
I know that by default `pandas.concat(objs, axis=1)` combines the columns and sorts the row index, but I would also like dataframes with identical labels and levels to be joined along the time axis instead of being placed completely side by side.
I should also mention that the dataframes with the same labels and levels are over different time periods that connect with one another but do not overlap.
As an example:
first, second, third = rand(5, 2), rand(5, 2), rand(10, 2)  # numpy.random.rand
a = pd.DataFrame(first, index=pd.date_range('1990-01-01', periods=5, freq='D'))
a.columns = pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b')])
b = pd.DataFrame(second, index=pd.date_range('1990-01-06', periods=5, freq='D'))
b.columns = pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b')])
c = pd.DataFrame(third, index=pd.date_range('1990-01-01', periods=10, freq='D'))
c.columns = pd.MultiIndex.from_tuples([('B', 'a'), ('B', 'b')])
pd.concat([a,b,c], axis=1)
Gives this:
Out[3]:
A B
a b a b a b
1990-01-01 0.351481 0.083324 NaN NaN 0.060026 0.124302
1990-01-02 0.486032 0.742887 NaN NaN 0.570997 0.633906
1990-01-03 0.145066 0.386665 NaN NaN 0.166567 0.147794
1990-01-04 0.257831 0.995324 NaN NaN 0.630652 0.534507
1990-01-05 0.446912 0.374049 NaN NaN 0.311473 0.727622
1990-01-06 NaN NaN 0.920003 0.051772 0.731657 0.393296
1990-01-07 NaN NaN 0.142397 0.837654 0.597090 0.833893
1990-01-08 NaN NaN 0.506141 0.056407 0.832294 0.222501
1990-01-09 NaN NaN 0.655442 0.754245 0.802421 0.743875
1990-01-10 NaN NaN 0.195767 0.880637 0.215509 0.857576
Is there an easy way to get this?
d = pd.concat([a, b])  # a.append(b) in older pandas; append was removed in pandas 2.0
pd.concat([d,c], axis=1)
Out[4]:
A B
a b a b
1990-01-01 0.351481 0.083324 0.060026 0.124302
1990-01-02 0.486032 0.742887 0.570997 0.633906
1990-01-03 0.145066 0.386665 0.166567 0.147794
1990-01-04 0.257831 0.995324 0.630652 0.534507
1990-01-05 0.446912 0.374049 0.311473 0.727622
1990-01-06 0.920003 0.051772 0.731657 0.393296
1990-01-07 0.142397 0.837654 0.597090 0.833893
1990-01-08 0.506141 0.056407 0.832294 0.222501
1990-01-09 0.655442 0.754245 0.802421 0.743875
1990-01-10 0.195767 0.880637 0.215509 0.857576
The key here is that I don't know how the dataframes will be ordered in the list. I basically need something that knows when to concat(obj, axis=1) or concat(obj, axis=0) and can combine my list of dataframes accordingly. Maybe there is something already in pandas that can do this?
I'm not sure there is a one line way to do this (there may be)...
This is one time I would consider creating an empty frame and then filling it:
In [11]: frames = [a, b, c]
Get the union of their indexes and columns (in modern pandas, + on Index objects no longer means set union, so use Index.union via reduce instead of sum):
In [12]: from functools import reduce
    ...: index = reduce(pd.Index.union, (x.index for x in frames))
    ...: cols = reduce(pd.Index.union, (x.columns for x in frames))
In [13]: res = pd.DataFrame(index=index, columns=cols)
Fill this in with each frame (by label):
In [14]: for df in [a, b, c]:
    ...:     res.loc[df.index, df.columns] = df
In [15]: res
Out[15]:
A B
a b a b
1990-01-01 0.8516285 0.4087078 0.577000 0.595293
1990-01-02 0.6544393 0.4377864 0.851378 0.595919
1990-01-03 0.3123428 0.03825423 0.834704 0.989195
1990-01-04 0.2314499 0.4971448 0.343455 0.770400
1990-01-05 0.1982945 0.9031414 0.466225 0.463490
1990-01-06 0.7370323 0.3923151 0.263120 0.892815
1990-01-07 0.09038236 0.8778266 0.643816 0.049769
1990-01-08 0.7199705 0.02114493 0.766267 0.472471
1990-01-09 0.06733081 0.443561 0.984558 0.443647
1990-01-10 0.4695022 0.5648693 0.870240 0.949072
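To automate the axis choice for an arbitrary list, you could also group the frames by their column index, stack each group along the time axis, and then concat the groups side by side. A sketch under the assumption that frames in a group never overlap in time (combine_frames is a hypothetical helper name):

```python
import numpy as np
import pandas as pd

def combine_frames(frames):
    """Concat frames with identical columns along axis=0,
    then join the resulting groups along axis=1."""
    groups = {}
    for df in frames:
        key = tuple(df.columns)  # identical MultiIndex columns -> same group
        groups.setdefault(key, []).append(df)
    merged = [pd.concat(dfs).sort_index() for dfs in groups.values()]
    return pd.concat(merged, axis=1)

# Example mirroring the question's a, b and c
rng = np.random.default_rng(0)
mi = pd.MultiIndex.from_tuples
a = pd.DataFrame(rng.random((5, 2)),
                 index=pd.date_range('1990-01-01', periods=5, freq='D'),
                 columns=mi([('A', 'a'), ('A', 'b')]))
b = pd.DataFrame(rng.random((5, 2)),
                 index=pd.date_range('1990-01-06', periods=5, freq='D'),
                 columns=mi([('A', 'a'), ('A', 'b')]))
c = pd.DataFrame(rng.random((10, 2)),
                 index=pd.date_range('1990-01-01', periods=10, freq='D'),
                 columns=mi([('B', 'a'), ('B', 'b')]))

res = combine_frames([a, b, c])
print(res.shape)  # (10, 4) - no NaN blocks
```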
I'd like to use a boolean index to select columns from a pandas dataframe with a datetime index as the column header:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=dates)
returns:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.173096 0.344348 1.059990 -1.246944 1.624399 -0.276052
B 0.277148 0.965226 -1.301612 -1.264500 -0.124489 1.704485
C -0.375106 0.103812 0.939749 -2.826329 -0.275420 0.664325
D 0.039756 0.631373 0.643565 -1.516543 -0.654626 -1.544038
I'd like to return only the first three columns.
I might do (assuming from datetime import datetime):
>>> df.loc[:, df.columns <= datetime(2013, 1, 3)]
2013-01-01 2013-01-02 2013-01-03
A 1.058112 0.883429 -1.939846
B 0.753125 1.664276 -0.619355
C 0.014437 1.125824 -1.421609
D 1.879229 1.594623 -1.499875
You can do vectorized comparisons on the column index directly without using the map/lambda combination.
I had a nice long chat with the duck, and finally realised it was as simple as this:
print(df.loc[:, :datetime(2013, 1, 3, 0, 0)])
I love Pandas.
EDIT:
Well, in fact that wasn't exactly what I wanted, because it relies on the 'query' date being present in the column headers. This is actually what I needed:
print(df.loc[:, df.columns.map(lambda col: col < datetime(2013, 1, 3, 0, 0))])
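As the earlier answer notes, the comparison is vectorized on the DatetimeIndex itself, so the map/lambda can be dropped. A runnable sketch:

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=dates)

# Comparing a DatetimeIndex with a datetime yields a boolean array
mask = df.columns <= datetime(2013, 1, 3)
print(df.loc[:, mask])  # the first three columns
```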