Iterating through Pandas groups to create DataFrames - python

I have a table containing production data on parts and the variables that were recorded during their production.
FORMAT:

Part | Variable1 | Variable2 | ...
-----+-----------+-----------
  1  |     X     |     X
  1  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  3  |     X     |     X
  3  |     X     |     X
  3  |     X     |     X
I can group these by part with:
dfg = df.groupby("Part")  # where df is my DataFrame of production data
I have also stored all the part numbers in the part_num array:
part_num = df['Part'].unique()
>out:
array([ 615, 629, 901, 908, 911, 959, 969, 1024, 1025, 1058, 1059,
1092, 1097, 1104, 1105, 1114, 1115, 1117, 1147, 1161, 1171, 1172,
1173, 1174, 1175, 1176, 1177, 1188, 1259, 1307, 1308, 1309, 1310,
1311, 1312, 1313, 1322, 1339, 1340, 1359, 1383, 1384, 1389, 1393,
1394, 1398, 1399, 1402, 1404, 1413, 1414, 1417, 1441, 1449, 1461,
1462, 1463, 1488, 1489, 1490, 1491, 1508, 1509, 1514, 1541, 1542,
1543, 1544, 1545, 1554, 1555, 1559, 1586, 1589, 1601, 1606, 1607,
1618, 1620, 1636, 1659, 1664, 1665, 1667, 1668, 1673, 1674, 1676,
1677, 1679, 1680, 1681, 1687, 1688, 1690, 1704, 1706, 1711, 1714,
1717, 1718, 1723, 1724, 1729, 1731, 1732, 1745, 1747, 1748, 1749,
1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763,
1764, 1765, 1766, 1767, 1768, 1769, 1773, 1774, 1779, 1780, 1783,
1784, 1785, 1787, 1789, 1790, 1791, 1792, 1800, 1845], dtype=int64)
How do I create a dataframe for each part number group?

So you want a DataFrame for each unique part number group. Do a groupby on the index, then store each group's DataFrame in a dict (or list) as you iterate.
Dummy data:
>>> df = pd.DataFrame(np.random.randint(1, 10, (8, 7)),
...                   columns=['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'b4'],
...                   index=[1, 1, 2, 2, 2, 3, 5, 5])
>>> df.head(10)
a1 a2 a3 b1 b2 b3 b4
1 5 3 8 8 1 7 1
1 8 8 8 7 2 5 8
2 4 8 1 9 2 7 5
2 1 8 4 4 1 8 9
2 4 7 4 4 3 9 5
3 3 6 9 3 8 9 2
5 9 5 2 1 7 6 1
5 3 9 1 5 5 8 1
Grouped dict of DataFrames:
>>> grouped_dict = {k: v for k, v in df.groupby(df.index)}
>>> grouped_dict[3] # part number 3 dataframe
a1 a2 a3 b1 b2 b3 b4
3 3 6 9 3 8 9 2
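Applied to the question's layout, the same dict comprehension gives one DataFrame per part number. A minimal sketch with hypothetical production data (the column names mirror the question; the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical production data shaped like the question's table
df = pd.DataFrame({
    "Part": [1, 1, 2, 2, 2, 3],
    "Variable1": np.arange(6),
    "Variable2": np.arange(6) * 10,
})

# One DataFrame per part number, keyed by the part number itself
part_dfs = {part: group for part, group in df.groupby("Part")}

print(part_dfs[2])  # only the rows where Part == 2
```

Each value in `part_dfs` is a regular DataFrame, so `part_dfs[part_num[0]]` works directly with the `part_num` array from the question.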

Related

How to set x-axis tick order?

df:
USER_ID EventNum
0 1390 17
1 4452 15
2 995 14
3 532 14
4 3281 14
... ... ...
5897 4971 1
5898 2637 1
5899 792 1
5900 5622 1
5901 1 1
[5902 rows x 2 columns]
I want to plot a figure using USER_ID as the x-axis and EventNum as the y-axis.
To avoid cluttering the axis, I sample USER_ID values at a fixed interval to use as xticks, like this:
[1390, 4899, 4062, 366, 5001, 3383, 5003, 446, 2879, 3220, 4006, 4595, 1713, 2649, 2291, 5647, 2040, 5468, 3719, 4198, 5622]
I do it like this, but the xtick values are not placed in the order given (xticks); instead they are placed in increasing numeric order (see the figure below):
xticks = [1390, 4899, 4062, 366, 5001, 3383, 5003, 446, 2879, 3220, 4006, 4595, 1713, 2649, 2291, 5647, 2040, 5468, 3719, 4198, 5622]
ax = df.plot(x='USER_ID', y=['EventNum'], use_index=False, rot=270)
ax.set_xticks(xticks)
ax.set_xlabel('User ID')
ax.set_ylabel('Event Number')
How can I fix this?
I figured it out: just replace the line
ax.set_xticks(xticks)
with
ax.set_xticks(pos, xticks)
where pos holds the corresponding axis positions for the xtick values. (Passing labels as the second argument to set_xticks requires Matplotlib 3.5+; on older versions use ax.set_xticks(pos) followed by ax.set_xticklabels(xticks).)
In my case they are, respectively:
pos: [0, 295, 590, 885, 1180, 1475, 1770, 2065, 2360, 2655, 2950, 3245, 3540, 3835, 4130, 4425, 4720, 5015, 5310, 5605, 5900]
xticks: [1390, 4899, 4062, 366, 5001, 3383, 5003, 446, 2879, 3220, 4006, 4595, 1713, 2649, 2291, 5647, 2040, 5468, 3719, 4198, 5622]
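The idea can be sketched end to end on a small hypothetical frame (the USER_ID/EventNum values below are illustrative, and the version-portable two-call form is used):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the question's USER_ID / EventNum frame
df = pd.DataFrame({
    "USER_ID": [1390, 4899, 4062, 366, 5001],
    "EventNum": [17, 15, 14, 14, 13],
})

ax = df.plot(y="EventNum", use_index=False, rot=270)

# Place ticks at row positions, label them with the (unsorted) USER_IDs
pos = range(len(df))
ax.set_xticks(pos)                 # tick positions on the axis
ax.set_xticklabels(df["USER_ID"])  # labels keep the DataFrame's order
ax.set_xlabel("User ID")
ax.set_ylabel("Event Number")
```

Because the labels are attached to explicit positions, they stay in the DataFrame's row order rather than being sorted numerically.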

Is there a way to reshape a single index pandas DataFrame into a multi index to adapt to time series?

Here's a sample data frame:
import pandas as pd

sample_dframe = pd.DataFrame.from_dict(
    {
        "id": [123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456],
        "V1": [2552, 813, 496, 401, 4078, 952, 7279, 544, 450, 548, 433, 4696, 244, 9735, 4263, 642, 255, 2813, 496, 401, 4078952, 7279544],
        "V2": [3434, 133, 424, 491, 8217, 915, 7179, 5414, 450, 548, 433, 4696, 244, 9735, 4263, 642, 255, 2813, 496, 401, 4952, 4453],
        "V3": [382, 161, 7237, 7503, 561, 6801, 1072, 9660, 62107, 6233, 5403, 3745, 8613, 6302, 557, 4256, 9874, 3013, 9352, 4522, 3232, 58830],
        "V4": [32628, 4471, 4781, 1497, 45104, 8657, 81074, 1091, 370835, 2058, 4447, 7376, 302237, 6833, 48348, 3545, 4263, 642, 255, 2813, 4088920, 6323521],
    }
)
The sample above has shape (22, 5), with columns id and V1..V4. I need to convert it into a multi-index data frame (as a time series) where, for a given id, I group 5 values (time steps) from each of V1..V4.
That is, it should give me a frame of shape (2, 4, 5), since there are 2 unique id values.
IIUC, you might just want:
sample_dframe.set_index('id').stack()
NB. the output is a Series; for a DataFrame, add .to_frame(name='col_name').
Output:
id
123 V1 2552
V2 3434
V3 382
V4 32628
V1 813
...
456 V4 4088920
V1 7279544
V2 4453
V3 58830
V4 6323521
Length: 88, dtype: int64
Or, maybe:
(sample_dframe
.assign(time=lambda d: d.groupby('id').cumcount())
.set_index(['id', 'time']).stack()
.swaplevel('time', -1)
)
Output:
id time
123 V1 0 2552
V2 0 3434
V3 0 382
V4 0 32628
V1 1 813
...
456 V4 10 4088920
V1 11 7279544
V2 11 4453
V3 11 58830
V4 11 6323521
Length: 88, dtype: int64
import pandas as pd

df = pd.DataFrame.from_dict(
    {
        "id": [123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456],
        "V1": [2552, 813, 496, 401, 4078, 952, 7279, 544, 450, 548, 433, 4696, 244, 9735, 4263, 642, 255, 2813, 496, 401, 4078952, 7279544],
        "V2": [3434, 133, 424, 491, 8217, 915, 7179, 5414, 450, 548, 433, 4696, 244, 9735, 4263, 642, 255, 2813, 496, 401, 4952, 4453],
        "V3": [382, 161, 7237, 7503, 561, 6801, 1072, 9660, 62107, 6233, 5403, 3745, 8613, 6302, 557, 4256, 9874, 3013, 9352, 4522, 3232, 58830],
        "V4": [32628, 4471, 4781, 1497, 45104, 8657, 81074, 1091, 370835, 2058, 4447, 7376, 302237, 6833, 48348, 3545, 4263, 642, 255, 2813, 4088920, 6323521],
    }
)
print(df)
"""
id V1 V2 V3 V4
0 123 2552 3434 382 32628
1 123 813 133 161 4471
2 123 496 424 7237 4781
3 123 401 491 7503 1497
4 123 4078 8217 561 45104
5 123 952 915 6801 8657
6 123 7279 7179 1072 81074
7 123 544 5414 9660 1091
8 123 450 450 62107 370835
9 123 548 548 6233 2058
10 456 433 433 5403 4447
11 456 4696 4696 3745 7376
12 456 244 244 8613 302237
13 456 9735 9735 6302 6833
14 456 4263 4263 557 48348
15 456 642 642 4256 3545
16 456 255 255 9874 4263
17 456 2813 2813 3013 642
18 456 496 496 9352 255
19 456 401 401 4522 2813
20 456 4078952 4952 3232 4088920
21 456 7279544 4453 58830 6323521
"""
df = (
    df.set_index('id')
      .stack()
      .reset_index()
      .drop(columns='level_1')
      .rename(columns={0: 'V1_new'})
)
print(df)
"""
id V1_new
0 123 2552
1 123 3434
2 123 382
3 123 32628
4 123 813
.. ... ...
83 456 4088920
84 456 7279544
85 456 4453
86 456 58830
87 456 6323521
"""

Plotting select rows and columns of DataFrame (python)

I have searched around but have not found an exact way to do this. I have a data frame of several baseball teams and their statistics. Like the following:
RK  TEAM          GP    AB    R     H    2B  3B  HR   TB    RBI  AVG   OBP   SLG   OPS
1   Milwaukee     163   5542  754   1398 252  24  218  2352  711  .252  .323  .424  .747
2   Chicago Cubs  163   5624  761   1453 286  34  167  2308  722  .258  .333  .410  .744
3   LA Dodgers    163   5572  804   1394 296  33  235  2461  756  .250  .333  .442  .774
4   Colorado      163   5541  780   1418 280  42  210  2412  748  .256  .322  .435  .757
5   Baltimore     162   5507  622   1317 242  15  188  2153  593  .239  .298  .391  .689
I want to be able to plot two teams on the X-axis and then perhaps 3 metrics (ex: R, H, TB) on the Y-axis with the two teams side by side in bar chart format. I haven't been able to figure out how to do this. Any ideas?
Thank you.
My approach is to create a new dataframe that contains only the columns you are interested in plotting:
import pandas as pd
import matplotlib.pyplot as plt
data = [[1, 'Milwaukee', 163, 5542, 754, 1398, 252, 24, 218, 2352, 711, .252, .323, .424, .747],
[2, 'Chicago Cubs', 163, 5624, 761, 1453, 286, 34, 167, 2308, 722, .258, .333, .410, .744],
[3, 'LA Dodgers', 163, 5572, 804, 1394, 296, 33, 235, 2461, 756, .250, .333, .442, .774],
[4, 'Colorado', 163, 5541, 780, 1418, 280, 42, 210, 2412, 748, .256, .322, .435, .757],
[5, 'Baltimore', 162, 5507, 622, 1317, 242, 15, 188, 2153, 593, .239, .298, .391, .689]]
df = pd.DataFrame(data, columns=['RK', 'TEAM', 'GP', 'AB', 'R', 'H', '2B', '3B', 'HR', 'TB', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS'])
dfplot = df[['TEAM', 'R', 'H', 'TB']].copy()
fig = plt.figure()
ax = fig.add_subplot(111)
width = 0.4
dfplot.plot(kind='bar', x='TEAM', ax=ax, width=width, position=1)
plt.show()
This creates a grouped bar chart with one cluster of R, H, and TB bars per team.
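To show just the two teams side by side, filter the rows before plotting. A minimal sketch using a subset of the question's numbers (the team choice is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "TEAM": ["Milwaukee", "Chicago Cubs", "LA Dodgers"],
    "R": [754, 761, 804],
    "H": [1398, 1453, 1394],
    "TB": [2352, 2308, 2461],
})

# Keep only the two teams of interest, then plot the 3 metrics side by side
two = df[df["TEAM"].isin(["Milwaukee", "LA Dodgers"])]
ax = two.plot(kind="bar", x="TEAM", y=["R", "H", "TB"], rot=0)
ax.set_ylabel("Count")
```

Each team becomes one cluster on the x-axis with one bar per metric, which is the side-by-side layout the question asks for.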

Query hdf5 datetime column

I have a hdf5 file that contains a table where the column time is in datetime64[ns] format.
I want to get all the rows that are older than thresh. How can I do that? This is what I've tried:
thresh = pd.datetime.strptime('2018-03-08 14:19:41','%Y-%m-%d %H:%M:%S').timestamp()
hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )
I get the following error:
Traceback (most recent call last):
File "<ipython-input-80-fa444735d0a9>", line 1, in <module>
runfile('/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py', wdir='/home/joao/github/control_panel/controlpanel/controlpanel')
File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py", line 15, in <module>
hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 370, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 717, in select
return it.get_result()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1457, in get_result
results = self.func(self.start, self.stop, where)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 710, in func
columns=columns, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4141, in read
if not self.read_axes(where=where, **kwargs):
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 3340, in read_axes
self.selection = Selection(self, where=where, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4706, in __init__
self.condition, self.filter = self.terms.evaluate()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 556, in evaluate
self.condition = self.terms.prune(ConditionBinOp)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 118, in prune
res = pr(left.value, right.value)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 113, in pr
encoding=self.encoding).evaluate()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in evaluate
values = [self.convert_value(v) for v in rhs]
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in <listcomp>
values = [self.convert_value(v) for v in rhs]
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 185, in convert_value
v = pd.Timestamp(v)
File "pandas/_libs/tslib.pyx", line 390, in pandas._libs.tslib.Timestamp.__new__
File "pandas/_libs/tslib.pyx", line 1549, in pandas._libs.tslib.convert_to_tsobject
File "pandas/_libs/tslib.pyx", line 1735, in pandas._libs.tslib.convert_str_to_tsobject
ValueError: could not convert string to Timestamp
The where string is parsed by PyTables; it does not resolve a local variable like thresh, so embed the threshold in the query as a date string rather than a POSIX timestamp.
Demo:
Creating a sample DataFrame (100,000 rows):
In [9]: N = 10**5
In [10]: dates = pd.date_range('1980-01-01', freq='99T', periods=N)
In [11]: df = pd.DataFrame({'date':dates, 'val':np.random.rand(N)})
In [12]: df
Out[12]:
date val
0 1980-01-01 00:00:00 0.985215
1 1980-01-01 01:39:00 0.452295
2 1980-01-01 03:18:00 0.780096
3 1980-01-01 04:57:00 0.004596
4 1980-01-01 06:36:00 0.515051
... ... ...
99995 1998-10-27 15:45:00 0.509954
99996 1998-10-27 17:24:00 0.046636
99997 1998-10-27 19:03:00 0.026678
99998 1998-10-27 20:42:00 0.660652
99999 1998-10-27 22:21:00 0.839426
[100000 rows x 2 columns]
writing it to HDF5 file (index date column):
In [13]: df.to_hdf('d:/temp/test.h5', 'test', format='t', data_columns=['date'])
read HDF5 conditionally by index:
In [14]: x = pd.read_hdf('d:/temp/test.h5', 'test', where="date > '1998-10-27 15:00:00'")
In [15]: x
Out[15]:
date val
99995 1998-10-27 15:45:00 0.509954
99996 1998-10-27 17:24:00 0.046636
99997 1998-10-27 19:03:00 0.026678
99998 1998-10-27 20:42:00 0.660652
99999 1998-10-27 22:21:00 0.839426
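To use a variable threshold like the question's thresh, format its value into the where string. A minimal sketch assuming PyTables is installed; the file path and key name are illustrative:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small sample frame (names mirror the demo above)
N = 1000
df = pd.DataFrame({
    "date": pd.date_range("1980-01-01", freq="99min", periods=N),
    "val": np.random.rand(N),
})

# Write in table format with 'date' as a queryable data column
path = os.path.join(tempfile.mkdtemp(), "test.h5")
df.to_hdf(path, key="test", format="t", data_columns=["date"])

# PyTables does not resolve local variables in the where string,
# so embed the threshold's value in the query explicitly
thresh = pd.Timestamp("1980-02-01")
x = pd.read_hdf(path, "test", where=f"date > '{thresh}'")
print(len(x))
```

Keeping thresh as a pd.Timestamp (not a float from .timestamp()) is what avoids the "could not convert string to Timestamp" error.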

Get index column error from dataframe

I simply want to get the index column.
import pandas as pd
df1=pd.read_csv(path1, index_col='ID')
df1.head()
VAR1 VAR2 VAR3 OUTCOME
ID
28677 28 1 0.0 0
27170 59 1 0.0 1
39245 65 1 0.0 1
31880 19 1 0.0 0
41441 24 1 0.0 1
I can get many columns like:
df1["VAR1"]
ID
28677 28
27170 59
39245 65
31880 19
41441 24
31070 77
39334 63
....
38348 23
38278 52
28177 58
but I cannot get the index column:
>>> df1["ID"]
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'ID'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "C:\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3543, in get
loc = self.items.get_loc(item)
File "C:\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'ID'
I want to get the index column as a list.
How do I do it, and why do I get the error?
And if I want to merge two dataframes using the index column, how do I do that?
The first column is the index, so to select it use:
print (df1.index)
Int64Index([28677, 27170, 39245, 31880, 41441], dtype='int64', name='ID')
If the index could be a MultiIndex, use get_level_values:
print (df1.index.get_level_values('ID'))
Int64Index([28677, 27170, 39245, 31880, 41441], dtype='int64', name='ID')
You can use df.index property:
df.index (or df.index.values for numpy array)
pd.Series(df.index)
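For the merge part of the question: two frames that share an index can be merged on it with left_index/right_index (or the join shortcut). A minimal sketch with a hypothetical second frame:

```python
import pandas as pd

# Frames sharing the same ID index (values are illustrative)
df1 = pd.DataFrame({"VAR1": [28, 59]},
                   index=pd.Index([28677, 27170], name="ID"))
df2 = pd.DataFrame({"SCORE": [0.1, 0.9]},
                   index=pd.Index([27170, 28677], name="ID"))

# Merge on the index of both frames; rows align by ID, not by position
merged = df1.merge(df2, left_index=True, right_index=True)
# Equivalent shortcut: df1.join(df2)
print(merged)
```

No reset_index is needed; the ID index itself is the merge key.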
