How to set the order of x-axis ticks? - python

df:
USER_ID EventNum
0 1390 17
1 4452 15
2 995 14
3 532 14
4 3281 14
... ... ...
5897 4971 1
5898 2637 1
5899 792 1
5900 5622 1
5901 1 1
[5902 rows x 2 columns]
I want to plot a figure using USER_ID as the X-axis and EventNum as the Y-axis.
To avoid cluttering the axis, I sample USER_ID values at a fixed interval to use as xticks, like this:
[1390, 4899, 4062, 366, 5001, 3383, 5003, 446, 2879, 3220, 4006, 4595, 1713, 2649, 2291, 5647, 2040, 5468, 3719, 4198, 5622]
I do it as below, but the xtick values are not placed in the order of the list; instead they appear along the axis in increasing numeric order.
xticks = [1390, 4899, 4062, 366, 5001, 3383, 5003, 446, 2879, 3220, 4006, 4595, 1713, 2649, 2291, 5647, 2040, 5468, 3719, 4198, 5622]
ax = df.plot(x='USER_ID', y=['EventNum'], use_index=False, rot=270)
ax.set_xticks(xticks)
ax.set_xlabel('User ID')
ax.set_ylabel('Event Number')
How can I fix this?

I figured it out: just replace the line
ax.set_xticks(xticks)
with
ax.set_xticks(pos, xticks)
where pos gives the axis positions at which the labels in xticks should be drawn. (The two-argument positions-plus-labels form requires matplotlib >= 3.5; on older versions, call ax.set_xticks(pos) followed by ax.set_xticklabels(xticks).) In my case they are, respectively:
pos: [0, 295, 590, 885, 1180, 1475, 1770, 2065, 2360, 2655, 2950, 3245, 3540, 3835, 4130, 4425, 4720, 5015, 5310, 5605, 5900]
xticks: [1390, 4899, 4062, 366, 5001, 3383, 5003, 446, 2879, 3220, 4006, 4595, 1713, 2649, 2291, 5647, 2040, 5468, 3719, 4198, 5622]
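A minimal end-to-end sketch of the fix, using random stand-in data since the original df isn't reproduced here. It relies on use_index=False plotting against the positional indices 0..len(df)-1, which is consistent with the pos values above:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data: 5902 users with event counts sorted descending,
# mimicking the shape of the question's dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'USER_ID': rng.permutation(np.arange(1, 5903)),
    'EventNum': np.sort(rng.integers(1, 18, size=5902))[::-1],
})

# Sample every 295th row: pos are positional x coordinates,
# and the labels are the USER_ID values found at those positions.
pos = np.arange(0, len(df), 295)
labels = df['USER_ID'].iloc[pos]

ax = df.plot(x='USER_ID', y=['EventNum'], use_index=False, rot=270)
ax.set_xticks(pos, labels)  # positions first, labels second (matplotlib >= 3.5)
ax.set_xlabel('User ID')
ax.set_ylabel('Event Number')
plt.show()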

Related

How to build a histogram from a pandas dataframe where each observation is a list?

I have a dataframe as follows. Each cell in the "Values" column holds a list of elements. I want to visualize the distribution of the values from the "Values" column using histograms, either stacked in rows or separated by colour (Area_code).
How can I extract the values and construct the histograms in plotly? Any other ideas are also welcome. Thank you.
Area_code Values
0 New_York [999, 54, 231, 43, 177, 313, 212, 279, 199, 267]
1 Dallas [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316]
2 XXX [560]
3 YYY [884, 13]
4 ZZZ [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]
If you reshape your data, this is a perfect case for px.histogram, and from there you can choose between several aggregations such as sum, average, and count through the histfunc argument:
fig = px.histogram(df, x='Area_code', y='Values', histfunc='sum')
fig.show()
You haven't specified what kind of output you're aiming for, so I'll leave it to you to change the histfunc argument and see which option suits your needs best.
I'm often inclined to urge users to rethink their entire data process, but I'll assume there are good reasons why you're stuck with what seems like a pretty unusual setup in your dataframe. The snippet below contains a complete data munging process to reshape your data into a so-called long format:
Area_code Values
0 New_York 999
1 New_York 54
2 New_York 231
3 New_York 43
4 New_York 177
5 New_York 313
6 New_York 212
7 New_York 279
8 New_York 199
9 New_York 267
10 Dallas 915
11 Dallas 183
12 Dallas 2326
13 Dallas 316
14 Dallas 206
15 Dallas 31
16 Dallas 317
17 Dallas 26
18 Dallas 31
19 Dallas 56
20 Dallas 316
21 XXX 560
22 YYY 884
23 YYY 13
24 ZZZ 203
And this is a perfect format for many of the great functionalities of plotly.express.
Complete code:
import plotly.express as px
import pandas as pd

# data input
df = pd.DataFrame({'Area_code': {0: 'New_York', 1: 'Dallas', 2: 'XXX', 3: 'YYY', 4: 'ZZZ'},
                   'Values': {0: [999, 54, 231, 43, 177, 313, 212, 279, 199, 267],
                              1: [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316],
                              2: [560],
                              3: [884, 13],
                              4: [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]}})

# data munging: build one (Area_code, value) row per list element
areas = []
value = []
for i, row in df.iterrows():
    for j, val in enumerate(row['Values']):
        areas.append(row['Area_code'])
        value.append(val)

df = pd.DataFrame({'Area_code': areas,
                   'Values': value})

# plotly
fig = px.histogram(df, x='Area_code', y='Values', histfunc='sum')
fig.show()
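As an aside, with pandas 0.25 or newer the same long-format reshape can be done without the explicit loop using DataFrame.explode; a small sketch, applied to the original df before the loop above reassigns it:

# explode() repeats each Area_code once per element of its list,
# producing the same long format as the loop above.
long_df = df.explode('Values').reset_index(drop=True)
long_df['Values'] = long_df['Values'].astype(int)  # explode leaves object dtype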

Iterating through Pandas groups to create DataFrames

I have a table containing production data on parts and the variables that were recorded during their production.
FORMAT:
Part | Variable1 | Variable2 | etc.
-----|-----------|-----------|-----
  1  |     X     |     X
  1  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  3  |     X     |     X
  3  |     X     |     X
  3  |     X     |     X
I can group these by part with
dfg = df.groupby("Part")  # where df is my dataframe of production data
I have also stored all the part numbers in the part_num array:
part_num = df['Part'].unique()
>out:
array([ 615, 629, 901, 908, 911, 959, 969, 1024, 1025, 1058, 1059,
1092, 1097, 1104, 1105, 1114, 1115, 1117, 1147, 1161, 1171, 1172,
1173, 1174, 1175, 1176, 1177, 1188, 1259, 1307, 1308, 1309, 1310,
1311, 1312, 1313, 1322, 1339, 1340, 1359, 1383, 1384, 1389, 1393,
1394, 1398, 1399, 1402, 1404, 1413, 1414, 1417, 1441, 1449, 1461,
1462, 1463, 1488, 1489, 1490, 1491, 1508, 1509, 1514, 1541, 1542,
1543, 1544, 1545, 1554, 1555, 1559, 1586, 1589, 1601, 1606, 1607,
1618, 1620, 1636, 1659, 1664, 1665, 1667, 1668, 1673, 1674, 1676,
1677, 1679, 1680, 1681, 1687, 1688, 1690, 1704, 1706, 1711, 1714,
1717, 1718, 1723, 1724, 1729, 1731, 1732, 1745, 1747, 1748, 1749,
1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763,
1764, 1765, 1766, 1767, 1768, 1769, 1773, 1774, 1779, 1780, 1783,
1784, 1785, 1787, 1789, 1790, 1791, 1792, 1800, 1845], dtype=int64)
How do I create a dataframe for each part number group?
So you want a dataframe for each unique part number group.
Do a groupby on the index (or on the "Part" column) and store each resulting dataframe in a dict (or list).
Dummy data:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(1, 10, (8, 7)),
...                   columns=['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'b4'],
...                   index=[1, 1, 2, 2, 2, 3, 5, 5])
>>> df.head(10)
a1 a2 a3 b1 b2 b3 b4
1 5 3 8 8 1 7 1
1 8 8 8 7 2 5 8
2 4 8 1 9 2 7 5
2 1 8 4 4 1 8 9
2 4 7 4 4 3 9 5
3 3 6 9 3 8 9 2
5 9 5 2 1 7 6 1
5 3 9 1 5 5 8 1
Grouped dict of dataframes:
>>> grouped_dict = {k: v for k, v in df.groupby(df.index)}
>>> grouped_dict[3]  # part number 3 dataframe
a1 a2 a3 b1 b2 b3 b4
3 3 6 9 3 8 9 2
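Translated to the production data in the question, the same pattern would look roughly like this (a sketch, assuming df and dfg are the dataframe and groupby object defined above):

# One dataframe per part number, keyed by the part number itself.
part_dfs = {part: group for part, group in df.groupby('Part')}

# e.g. the dataframe for part 615 (the first entry of part_num):
df_615 = part_dfs[615]

# Alternatively, pull a single group straight from the groupby object:
df_615 = dfg.get_group(615)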

Plotting select rows and columns of DataFrame (python)

I have searched around but have not found an exact way to do this. I have a data frame of several baseball teams and their statistics, like the following:
RK TEAM GP AB R H 2B 3B HR TB RBI AVG OBP SLG OPS
1 Milwaukee 163 5542 754 1398 252 24 218 2352 711 .252 .323 .424 .747
2 Chicago Cubs 163 5624 761 1453 286 34 167 2308 722 .258 .333 .410 .744
3 LA Dodgers 163 5572 804 1394 296 33 235 2461 756 .250 .333 .442 .774
4 Colorado 163 5541 780 1418 280 42 210 2412 748 .256 .322 .435 .757
5 Baltimore 162 5507 622 1317 242 15 188 2153 593 .239 .298 .391 .689
I want to plot two teams on the X-axis with perhaps 3 metrics (e.g. R, H, TB) on the Y-axis, with the two teams side by side in bar chart format. I haven't been able to figure out how to do this. Any ideas?
Thank you.
My approach was to create a new dataframe which only contains the columns you are interested in plotting:
import pandas as pd
import matplotlib.pyplot as plt

data = [[1, 'Milwaukee', 163, 5542, 754, 1398, 252, 24, 218, 2352, 711, .252, .323, .424, .747],
        [2, 'Chicago Cubs', 163, 5624, 761, 1453, 286, 34, 167, 2308, 722, .258, .333, .410, .744],
        [3, 'LA Dodgers', 163, 5572, 804, 1394, 296, 33, 235, 2461, 756, .250, .333, .442, .774],
        [4, 'Colorado', 163, 5541, 780, 1418, 280, 42, 210, 2412, 748, .256, .322, .435, .757],
        [5, 'Baltimore', 162, 5507, 622, 1317, 242, 15, 188, 2153, 593, .239, .298, .391, .689]]
df = pd.DataFrame(data, columns=['RK', 'TEAM', 'GP', 'AB', 'R', 'H', '2B', '3B', 'HR', 'TB', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS'])

# keep only the columns we want to plot
dfplot = df[['TEAM', 'R', 'H', 'TB']].copy()

fig = plt.figure()
ax = fig.add_subplot(111)
width = 0.4
dfplot.plot(kind='bar', x='TEAM', ax=ax, width=width, position=1)
plt.show()
This produces a grouped bar chart with one cluster of R, H, and TB bars per team.
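Since the question asks for only two teams side by side, you can filter the frame before plotting; a small sketch building on the code above, with Milwaukee and the Chicago Cubs as a hypothetical choice of teams:

# Keep only the two teams of interest, then plot as before.
teams = ['Milwaukee', 'Chicago Cubs']  # hypothetical selection
two = dfplot[dfplot['TEAM'].isin(teams)]
two.plot(kind='bar', x='TEAM', rot=0)
plt.show()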

Query hdf5 datetime column

I have an HDF5 file that contains a table whose time column is in datetime64[ns] format.
I want to get all the rows with time older than thresh. How can I do that? This is what I've tried:
thresh = pd.datetime.strptime('2018-03-08 14:19:41','%Y-%m-%d %H:%M:%S').timestamp()
hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )
I get the following error:
Traceback (most recent call last):
File "<ipython-input-80-fa444735d0a9>", line 1, in <module>
runfile('/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py', wdir='/home/joao/github/control_panel/controlpanel/controlpanel')
File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py", line 15, in <module>
hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 370, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 717, in select
return it.get_result()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1457, in get_result
results = self.func(self.start, self.stop, where)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 710, in func
columns=columns, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4141, in read
if not self.read_axes(where=where, **kwargs):
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 3340, in read_axes
self.selection = Selection(self, where=where, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4706, in __init__
self.condition, self.filter = self.terms.evaluate()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 556, in evaluate
self.condition = self.terms.prune(ConditionBinOp)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 118, in prune
res = pr(left.value, right.value)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 113, in pr
encoding=self.encoding).evaluate()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in evaluate
values = [self.convert_value(v) for v in rhs]
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in <listcomp>
values = [self.convert_value(v) for v in rhs]
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 185, in convert_value
v = pd.Timestamp(v)
File "pandas/_libs/tslib.pyx", line 390, in pandas._libs.tslib.Timestamp.__new__
File "pandas/_libs/tslib.pyx", line 1549, in pandas._libs.tslib.convert_to_tsobject
File "pandas/_libs/tslib.pyx", line 1735, in pandas._libs.tslib.convert_str_to_tsobject
ValueError: could not convert string to Timestamp
The where string is parsed by the PyTables query engine, which does not resolve the bare name thresh to your Python variable here; it ends up trying to convert the literal string to a Timestamp, which raises the ValueError above. Embed the timestamp value in the query string itself, as in the demo below.
Demo:
creating a sample DataFrame (100,000 rows):
In [9]: N = 10**5
In [10]: dates = pd.date_range('1980-01-01', freq='99T', periods=N)
In [11]: df = pd.DataFrame({'date':dates, 'val':np.random.rand(N)})
In [12]: df
Out[12]:
date val
0 1980-01-01 00:00:00 0.985215
1 1980-01-01 01:39:00 0.452295
2 1980-01-01 03:18:00 0.780096
3 1980-01-01 04:57:00 0.004596
4 1980-01-01 06:36:00 0.515051
... ... ...
99995 1998-10-27 15:45:00 0.509954
99996 1998-10-27 17:24:00 0.046636
99997 1998-10-27 19:03:00 0.026678
99998 1998-10-27 20:42:00 0.660652
99999 1998-10-27 22:21:00 0.839426
[100000 rows x 2 columns]
writing it to an HDF5 file in table format, with date as a queryable data column:
In [13]: df.to_hdf('d:/temp/test.h5', 'test', format='t', data_columns=['date'])
reading from the HDF5 file conditionally on that column:
In [14]: x = pd.read_hdf('d:/temp/test.h5', 'test', where="date > '1998-10-27 15:00:00'")
In [15]: x
Out[15]:
date val
99995 1998-10-27 15:45:00 0.509954
99996 1998-10-27 17:24:00 0.046636
99997 1998-10-27 19:03:00 0.026678
99998 1998-10-27 20:42:00 0.660652
99999 1998-10-27 22:21:00 0.839426
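Applied back to the question's setup, the fix would look something like this sketch (STORE and the 'gh1' key are taken from the question; it assumes the file was written in table format with time as a data column, as in the demo above):

import pandas as pd

# Build the threshold as a Timestamp rather than a float.
thresh = pd.Timestamp('2018-03-08 14:19:41')

# Embed the value in the query string so the PyTables parser sees a
# concrete timestamp instead of the unresolved name 'thresh'.
hdf = pd.read_hdf(STORE, 'gh1', where='time > "{}"'.format(thresh))

# For rows *older* than the threshold, as the question describes,
# flip the comparison:
hdf_older = pd.read_hdf(STORE, 'gh1', where='time < "{}"'.format(thresh))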

Getting the index column from a dataframe (KeyError)

I simply want to get the index column.
import pandas as pd
df1=pd.read_csv(path1, index_col='ID')
df1.head()
VAR1 VAR2 VAR3 OUTCOME
ID
28677 28 1 0.0 0
27170 59 1 0.0 1
39245 65 1 0.0 1
31880 19 1 0.0 0
41441 24 1 0.0 1
I can get regular columns like this:
df1["VAR1"]
ID
28677 28
27170 59
39245 65
31880 19
41441 24
31070 77
39334 63
....
38348 23
38278 52
28177 58
but I cannot get the index column:
>>> df1["ID"]
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'ID'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "C:\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3543, in get
loc = self.items.get_loc(item)
File "C:\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'ID'
I want to get the values of the index column. How do I do that?
Why do I get this error?
And if I want to merge two dataframes on the index column, how do I do that?
You get the KeyError because read_csv(path1, index_col='ID') consumed ID as the index, so it is no longer a regular column. The first column is now the index, so select it with:
print(df1.index)
Int64Index([28677, 27170, 39245, 31880, 41441], dtype='int64', name='ID')
If the index might be a MultiIndex, use get_level_values:
print(df1.index.get_level_values('ID'))
Int64Index([28677, 27170, 39245, 31880, 41441], dtype='int64', name='ID')
You can use the df.index property:
df.index (or df.index.values for a numpy array)
pd.Series(df.index) (if you want it as a Series)
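For the merge part of the question, a quick sketch; df2 and path2 are hypothetical stand-ins for a second dataframe that is also indexed by ID:

import pandas as pd

# Hypothetical second dataframe, also indexed by ID.
df2 = pd.read_csv(path2, index_col='ID')  # path2 is a stand-in

# Merge on the index of both frames (inner join by default):
merged = df1.merge(df2, left_index=True, right_index=True)

# Or join, which aligns on the index by default (left join):
merged = df1.join(df2)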
