I have a dataframe with two columns containing arrays in each cell. Here's some code to create a small example dataframe with the same features as mine.
import numpy as np
import pandas as pd
data = {'time': [
np.array(['2017-06-28T22:47:51.213500000', '2017-06-28T22:48:37.570900000', '2017-06-28T22:49:46.736800000']),
np.array(['2017-06-28T22:46:27.321600000', '2017-06-28T22:46:27.321600000', '2017-06-28T22:47:07.220500000', '2017-06-28T22:47:04.293000000']),
np.array(['2017-06-28T23:10:20.125000000', '2017-06-28T23:10:09.885000000', '2017-06-28T23:11:31.902000000'])
],
'depth': [
np.array([215.91168091, 222.89173789, 215.21367521]),
np.array([188.68945869, 208.23361823, 217.30769231, 229.87179487]),
np.array([169.84330484, 189.38746439, 178.91737892])
]
}
df = pd.DataFrame(data)
df
I want to plot the data as three individual shapes, one for each row, where the time values are treated as the x coordinates and the depth values are treated as the y coordinates. To do this, I want to make a list of arrays that looks something like this.
[array([['2017-06-28T22:47:51.213500000', 215.91168091],
['2017-06-28T22:48:37.570900000', 222.89173789],
['2017-06-28T22:49:46.736800000', 215.21367521]], dtype=object),
array([['2017-06-28T22:46:27.321600000', 188.68945869],
['2017-06-28T22:46:27.321600000', 208.23361823],
['2017-06-28T22:47:07.220500000', 217.30769231],
['2017-06-28T22:47:04.293000000', 229.87179487]], dtype=object),
array([['2017-06-28T23:10:20.125000000', 169.84330484],
['2017-06-28T23:10:09.885000000', 189.38746439],
['2017-06-28T23:11:31.902000000', 178.91737892]], dtype=object)]
Try zip within a list comprehension:
l = [np.array(list(zip(x,y))) for x, y in zip(df.time,df.depth)]
Out[385]:
[array([['2017-06-28T22:47:51.213500000', '215.91168091'],
['2017-06-28T22:48:37.570900000', '222.89173789'],
['2017-06-28T22:49:46.736800000', '215.21367521']], dtype='<U29'),
array([['2017-06-28T22:46:27.321600000', '188.68945869'],
['2017-06-28T22:46:27.321600000', '208.23361823'],
['2017-06-28T22:47:07.220500000', '217.30769231'],
['2017-06-28T22:47:04.293000000', '229.87179487']], dtype='<U29'),
array([['2017-06-28T23:10:20.125000000', '169.84330484'],
['2017-06-28T23:10:09.885000000', '189.38746439'],
['2017-06-28T23:11:31.902000000', '178.91737892']], dtype='<U29')]
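Note that zip coerces each mixed pair to a common string dtype ('<U29' above). If you need the object dtype shown in the desired output, so the depth values stay numeric, a minimal tweak is to request it explicitly; a sketch:

import numpy as np

# dtype=object keeps the timestamps as strings and the depths as floats
l = [np.array(list(zip(x, y)), dtype=object) for x, y in zip(df.time, df.depth)]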
I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
  unit classification
1    A        Energie
2  bar
3  CCM        Volumen
4  CDM        Volumen
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close but still throw the whole list into the column instead of making separate columns. record_path sounded right, but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
if d.get('__rowType', None)=='DATA' and 'data' in d],
columns=['unit', 'classification']
)
NB: this assumes test is the input list.
output:
unit classification
0 A Energie
1 bar
2 CCM Volumen
3 CDM Volumen
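If you'd rather stay with json_normalize (as hinted at in the question), something like this sketch also works, again assuming test is the input list:

norm = pd.json_normalize(test)
# keep only the DATA rows, then spread each 'data' list into its own columns
data_rows = norm.loc[norm['__rowType'] == 'DATA', 'data']
df = pd.DataFrame(data_rows.tolist(), columns=['unit', 'classification'])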
Instead of just giving you the code, I'll first explain how to do this in detail and then show the exact steps to follow and the final code. That way you'll understand everything well enough to handle similar situations in the future.
When you want to create a pandas dataframe with two columns, you can do this by creating a dictionary and passing it to the DataFrame constructor:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
   col1  col2
0     1     3
1     2     4
So if you want to have the dataframe you specified in your question the my_data dictionary should be like this:
my_data = {
'unit': ['A', 'bar', 'CCM', 'CDM'],
'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df)+1)
df
(Note the df.index = ... line: it's there because the index of the desired dataframe in your question starts at 1.)
So to get the dataframe you want, you just have to extract these values from the data you provided and convert them into exactly the dictionary described above (the my_data dictionary). You can do that like this:
# This will get the data values like 'bar', 'CCM', etc. from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
import numpy as np
import pandas as pd

d = YOUR_DATA
# This will get the data values like 'bar', 'CCM', etc.
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df)+1)
df #or print(df)
Note: of course you could do all of this in one complex line of code, but to avoid confusion I decided to split it into a few lines.
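For reference, a sketch of what that one complex line could look like (equivalent to the steps above, just denser):

df = pd.DataFrame(
    [x['data'] for x in d if x['__rowType'] == 'DATA'],
    columns=[c['name'].split('.')[-1]
             for x in d if x['__rowType'] == 'META'
             for c in x['data']],
)
df.index = np.arange(1, len(df) + 1)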
I'm collecting values from different arrays and a nested dictionary containing list values, like below. The lists contain millions of rows; I tried pandas dataframe concatenation but ran out of memory, so I resorted to a for loop.
array1_str = ['user_1', 'user_2', 'user_3','user_4' , 'user_5']
array2_int = [3,3,1,2,4]
nested_dict_w_list = {'outer_dict' : { 'inner_dict' : [[1.0001],[2.0033],[1.3434],[2.3434],[0.44224]]}}
final_out = [[array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]] for i in range(len(array2_int))]
I'm getting the output as
user_1, 3, [2.3434]
user_2, 3, [2.3434]
user_3, 1, [1.0001]
user_4, 2, [1.3434]
user_5, 4, [0.44224]
But I want the output as
user_1, 3, 2.3434
user_2, 3, 2.3434
user_3, 1, 1.0001
user_4, 2, 1.3434
user_5, 4, 0.44224
I need to eventually convert this to a parquet file. I'm using a Spark dataframe to convert it to parquet, but the schema is appearing as array(double), and I need it as just double. Any input is appreciated.
The below for loop is working, but is there a more efficient and elegant solution?
final_output = []
for i in range(len(array2_int)):
    index = nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]
    final_output.append((array1_str[i], array2_int[i], index[0]))
You can modify your original list comprehension, by indexing to item zero:
final_out = [
(array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]][0])
for i in range(len(array2_int))
]
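As a follow-up to the parquet concern: once the third element is a scalar float rather than a one-element list, the column should be written as a plain double. A sketch of doing that directly with pandas (this assumes pyarrow or fastparquet is installed; the file name is just an example):

import pandas as pd

df = pd.DataFrame(final_out, columns=['user', 'index', 'value'])
df.to_parquet('output.parquet')  # 'value' is written as double, not array<double>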
I have a bunch of tasks to distribute evenly across a date range.
The task lists always contain 5 elements, excluding the final chunk, which will vary between 1 and 5 elements.
The process I've put together outputs the following data structure:
[{'Project': array([['AAC789A'],
['ABL001A'],
['ABL001D'],
['ABL001E'],
['ABL001X']], dtype=object), 'end_date': '2020-10-01'},
{'Project': array([['ACZ885G_MA'],
['ACZ885H'],
['ACZ885H_MA'],
['ACZ885I'],
['ACZ885M']], dtype=object), 'end_date': '2020-10-02'},
{'Project': array([['IGE025C']], dtype=object), 'end_date': '2020-10-03'}]
...but I really need the following format...
Project,end_date
AAC789A,2020-10-01
ABL001A,2020-10-01
ABL001D,2020-10-01
ABL001E,2020-10-01
ABL001X,2020-10-01
ACZ885G_MA,2020-10-02
ACZ885H,2020-10-02
ACZ885H_MA,2020-10-02
ACZ885I,2020-10-02
ACZ885M,2020-10-02
IGE025C,2020-10-03
I've looked at repeating and chaining using itertools, but I don't seem to be getting anywhere with it.
This is my first time working heavily with Python. How would this typically be accomplished in Python?
This is how I'm currently attempting to do this, but I get the error below.
df = pd.concat([pd.Series(row['end_date'], row['Project'].split(','))
for _, row in df.iterrows()]).reset_index()
AttributeError: 'numpy.ndarray' object has no attribute 'split'
Here is a solution using NumPy's flatten method:
import pandas as pd
import numpy as np
data = [{'Project': np.array([['AAC789A'],
['ABL001A'],
['ABL001D'],
['ABL001E'],
['ABL001X']], dtype=object), 'end_date': '2020-10-01'},
{'Project': np.array([['ACZ885G_MA'],
['ACZ885H'],
['ACZ885H_MA'],
['ACZ885I'],
['ACZ885M']], dtype=object), 'end_date': '2020-10-02'},
{'Project': np.array([['IGE025C']], dtype=object), 'end_date': '2020-10-03'}]
clean = lambda di : { 'Project': di['Project'].flatten(), 'end_date': di['end_date']}
result = pd.concat([pd.DataFrame(clean(d)) for d in data])
result is a dataframe which can be exported to a csv format. It contains the following:
Project,end_date
AAC789A,2020-10-01
ABL001A,2020-10-01
ABL001D,2020-10-01
ABL001E,2020-10-01
ABL001X,2020-10-01
ACZ885G_MA,2020-10-02
ACZ885H,2020-10-02
ACZ885H_MA,2020-10-02
ACZ885I,2020-10-02
ACZ885M,2020-10-02
IGE025C,2020-10-03
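To actually write that out, a one-line sketch (the file name is just an example):

result.to_csv('projects.csv', index=False)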
I found an answer that met my need. See the link below; MaxU's answer served me best.
Using his explode method, I was able to accomplish my goal with one line of code.
df2 = explode(df.assign(var1=df.Project.str.split(',')), 'Project')
Split (explode) pandas dataframe string entry to separate rows
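As a side note, pandas 0.25+ ships DataFrame.explode built in, so on a recent version a sketch like this avoids the external helper entirely (it assumes the 'Project' cells are the (n, 1) arrays from above):

import numpy as np
import pandas as pd

df = pd.DataFrame(data)
df['Project'] = df['Project'].map(np.ravel)  # flatten the (n, 1) arrays to 1-D
result = df.explode('Project').reset_index(drop=True)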
I am trying to get a vector of specific dictionary values which are in a numpy array. Here is what the array looks like:
import numpy as np
edge_array = np.array(
[[1001, 7005, {'lanes': 9, 'length': 0.35, 'type': '99', 'modes': 'cw'}],
[1001, 8259, {'lanes': 10, 'length': 0.46, 'type': '99', 'modes': 'cw'}],
[1001, 14007, {'lanes': 7, 'length': 0.49, 'type': '99', 'modes': 'cw'}]])
I have vectors for the first two values of each row (i.e. 1001 and 7005), but I need another vector for the values associated with 'lanes'.
Here is my code so far:
row_idx = edge_array[:, 0]
col_idx = edge_array[:, 1]
lane_values = edge_array[:, 2['lanes']]
The error I get is as follows:
lane_values = edge_array[:, 2['lanes']]
TypeError: 'int' object has no attribute '__getitem__'
Please let me know if you need any further clarification, thanks!
The subexpression 2['lanes'] does not make sense: you are indexing into the number 2.
Instead, try:
[rec['lanes'] for rec in edge_array[:, 2]]
Or:
import operator
map(operator.itemgetter('lanes'), edge_array[:,2])
The above will give you a regular Python list (on Python 3, wrap the map call in list()); if you want a NumPy array, you'll have to call np.array() on the result.
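For instance, a minimal sketch of that last step:

import numpy as np

lane_values = np.array([rec['lanes'] for rec in edge_array[:, 2]])  # array([ 9, 10,  7])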
But the better solution here is to transform your data into a "structured array" which has named columns and then you can index efficiently by name. If your array has many rows, this will have a big impact on efficiency.
This is not a fully working example, which makes it hard to work with; the types are unclear. I suspect you are working with numpy somehow, but it's hard to tell.
In any case, indexing with 2['something'] is incorrect, and the error tells you why: you are trying to index into an integer with a key. Look up how indexing is done in Python / numpy.
But this is how you could extract your 'lanes':
list(map(lambda x: x['lanes'], edge_array[:, 2]))
# OR (if you want a vector/np-array)
vec_of_lanes = np.array(list(map(lambda x: x['lanes'], edge_array[:, 2])))
More in numpy-style:
vec_of_lanes = np.apply_along_axis(lambda x: x[2]['lanes'], 1, edge_array)
@Zwinck suggested a structured array. Here's one way of doing that.
Define a dtype for the dictionary part; it has fields with different dtypes:
dt1 = np.dtype([('lanes',int), ('length',float), ('type','S2'),('modes','S2')])
Embed that dtype in a larger one. I used a sub-array format for the first 2 values:
dt = np.dtype([('f0',int,(2,)), ('f1',dt1)])
Now create the array. I edited your expression to fit dt; the mix of tuples and lists is important. (I could have transferred the data from your object array instead; see the sketch after the output below.)
edge_array1 = np.array(
[([1001, 7005], ( 9, 0.35, '99','cw')),
([1001, 8259], ( 10, 0.46, '99','cw')),
([1001, 14007], (7, 0.49, '99', 'cw'))], dtype=dt)
Now the 2 int values can be accessed by the 'f0' field name:
In [513]: edge_array1['f0']
Out[513]:
array([[ 1001, 7005],
[ 1001, 8259],
[ 1001, 14007]])
while 'lanes' are accessed by a double application of field name indexing (since they are a field within the field):
In [514]: edge_array1['f1']['lanes']
Out[514]: array([ 9, 10, 7])
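A sketch of that transfer step, assuming edge_array is the original object array from the question:

rows = [([r[0], r[1]],
         (r[2]['lanes'], r[2]['length'], r[2]['type'], r[2]['modes']))
        for r in edge_array]
edge_array2 = np.array(rows, dtype=dt)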
Take, for instance, the following matrix:
matrix = [
['month','val1','val2','valn'],
['jan','100','200','300'],
['feb','101','201','302'],
['march','102','202','303'],
['april','103','203','303'],
['march','104','204','304']
]
I'd like to create a new matrix based on a list of column indexes or names (filter in), so
filter_col_indx = {0,2}
filter_col_name = {'month','val2'}
would produce the same output:
matrix2 = [
['month','val2'],
['jan','200'],
['feb','201'],
['march','202'],
['april','203'],
['march','204']
]
For large matrices, what would be the most efficient way to do this? The list of columns can vary.
Thanks
This can be done using operator.itemgetter:
import operator
matrix = [
['month','val1','val2','valn'],
['jan','100','200','300'],
['feb','101','201','302'],
['march','102','202','303'],
['april','103','203','303'],
['march','104','204','304']
]
filter_col_indx = [0,2]
getter = operator.itemgetter(*filter_col_indx)
matrix2 = [list(getter(row)) for row in matrix]
print(matrix2)
yields
[['month', 'val2'],
['jan', '200'],
['feb', '201'],
['march', '202'],
['april', '203'],
['march', '204']]
operator.itemgetter(*filter_col_indx) returns a function which takes a sequence as its argument and returns the 0th and 2nd items from the sequence. Thus, you can apply this function to each row to select the desired values from matrix.
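A quick illustration of what the getter does on a single row:

from operator import itemgetter

getter = itemgetter(0, 2)
getter(['jan', '100', '200', '300'])  # -> ('jan', '200')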
If you install pandas, then you could make matrix a DataFrame and select the desired columns like this:
import pandas as pd
matrix = [
['month','val1','val2','valn'],
['jan','100','200','300'],
['feb','101','201','302'],
['march','102','202','303'],
['april','103','203','303'],
['march','104','204','304']
]
df = pd.DataFrame(matrix[1:], columns=matrix[0])
print(df[['month', 'val2']])
yields
month val2
0 jan 200
1 feb 201
2 march 202
3 april 203
4 march 204
You might enjoy using pandas, since it makes a lot of data-munging operations very easy.
If you're always interested in whole columns, I think it would be appropriate to store the data using a dictionary containing the columns as lists:
data = {'month': ['jan', 'feb', 'march', 'april', 'march'],
'val1': [100, 101, 102, 103, 104],
'val2': [200, 201, 202, 203, 204],
...
}
To retrieve columns (which I have now written horizontally...), you do:
{key: data[key] for key in ['month', 'val2']}
This is a numpy version:
import numpy as np
matrix = np.array([
['month','val1','val2','valn'],
['jan','100','200','300'],
['feb','101','201','302'],
['march','102','202','303'],
['april','103','203','303'],
['march','104','204','304']
])
search = ['month', 'val2']
indexes = matrix[0,:].searchsorted(search) #search only the first row
# or indexes = [0, 2]
print(matrix[:, indexes])
>>> [['month' 'val2']
['jan' '200']
['feb' '201']
['march' '202']
['april' '203']
['march' '204']]
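One caveat: searchsorted assumes the row it searches is sorted, which happens to be true for ['month', 'val1', 'val2', 'valn'] but won't hold for arbitrary headers. A safer sketch for the general case:

# look up each column name in the header row, in order
indexes = [list(matrix[0]).index(name) for name in search]
print(matrix[:, indexes])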