Write struct columns to parquet with pyarrow - python

I have the following dataframe and schema:
import pandas as pd
import pyarrow as pa

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'c'])
SCHEMA = pa.schema([("a_and_b", pa.struct([('a', pa.int64()), ('b', pa.int64())])), ('c', pa.int64())])
Then I want to create a pyarrow table from df and save it to parquet with this schema. However, I could not find a way to create a proper type in pandas that would correspond to a struct type in pyarrow. Is there a way to do this?

For conversion from pandas to pa.struct you can use tuples (e.g. [(1, 2), (4, 5), (7, 8)]):
df_with_tuples = pd.DataFrame({
    "a_and_b": list(zip(df["a"], df["b"])),
    "c": df["c"]
})
pa.Table.from_pandas(df_with_tuples, SCHEMA)
or dicts (e.g. [{'a': 1, 'b': 2}, {'a': 4, 'b': 5}, {'a': 7, 'b': 8}]):
df_with_dict = pd.DataFrame({
    "a_and_b": df.apply(lambda x: {"a": x["a"], "b": x["b"]}, axis=1),
    "c": df["c"]
})
pa.Table.from_pandas(df_with_dict, SCHEMA)
When converting back from arrow to pandas, structs are represented as dicts:
pa.Table.from_pandas(df_with_dict, SCHEMA).to_pandas()['a_and_b']
| a_and_b |
|:-----------------|
| {'a': 1, 'b': 2} |
| {'a': 4, 'b': 5} |
| {'a': 7, 'b': 8} |

Related

Pandas: Pivot multi-index, with one 'shared' column

I have a pandas dataframe that can be represented like:
test_dict = {('a', 1): {'shared': 0, 'x': 1, 'y': 2, 'z': 3},
             ('a', 2): {'shared': 1, 'x': 2, 'y': 4, 'z': 6},
             ('b', 1): {'shared': 0, 'x': 10, 'y': 20, 'z': 30},
             ('b', 2): {'shared': 1, 'x': 100, 'y': 200, 'z': 300}}
example = pd.DataFrame.from_dict(test_dict).T
I am trying to figure out a way to turn this into a dataframe that looks like this dictionary representation:
res_dict = {1 : {'shared':0,'a':{'x':1, 'y':2, 'z':3}, 'b':{'x':10, 'y':20, 'z':30}},
2 : {'shared':1,'a':{'x':2, 'y':4, 'z':6},'b':{'x':100, 'y':200, 'z':300}}}
Any suggestions appreciated!
Thanks
A possible solution, which uses only dataframe manipulations and then converts to a dictionary (it needs numpy for the final row reordering):
import numpy as np

xyz = ['x', 'y', 'z']
out = (example.assign(xyz=example[xyz].apply(list, axis=1)).reset_index()
       .pivot(index='level_0', columns=['level_1', 'shared'], values='xyz')
       .applymap(lambda x: dict(zip(xyz, x))))
out.columns = out.columns.rename(None, level=0)
out.index = out.index.rename(None)
(pd.concat([out.droplevel(1, axis=1),
            out.columns.to_frame().reset_index(drop=True).iloc[:, 1]
            .to_frame().T.set_axis(out.columns.get_level_values(0), axis=1)])
 .iloc[np.arange(-1, len(out))].to_dict())
Output:
{
1: {
'shared': 0,
'a': {'x': 1, 'y': 2, 'z': 3},
'b': {'x': 10, 'y': 20, 'z': 30}
},
2: {
'shared': 1,
'a': {'x': 2, 'y': 4, 'z': 6},
'b': {'x': 100, 'y': 200, 'z': 300}
}
}
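If the dataframe detour feels heavy, the same nested dictionary can also be assembled directly by iterating over the MultiIndex rows. A plain-Python sketch, assuming the example frame from the question:

```python
import pandas as pd

test_dict = {('a', 1): {'shared': 0, 'x': 1, 'y': 2, 'z': 3},
             ('a', 2): {'shared': 1, 'x': 2, 'y': 4, 'z': 6},
             ('b', 1): {'shared': 0, 'x': 10, 'y': 20, 'z': 30},
             ('b', 2): {'shared': 1, 'x': 100, 'y': 200, 'z': 300}}
example = pd.DataFrame.from_dict(test_dict).T

res = {}
for (letter, num), row in example.iterrows():
    # Each inner key (1, 2) gets one entry holding 'shared' plus a per-letter dict.
    entry = res.setdefault(num, {'shared': int(row['shared'])})
    entry[letter] = {k: int(row[k]) for k in ['x', 'y', 'z']}
print(res)
```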

Export data in CSV but with different columns set. Can it be done via a library?

Data input:
[
{'a': 1, 'b': 2, 'c': 3},
{'b': 2, 'd': 4, 'e': 5, 'a': 1},
{'b': 2, 'd': 4, 'a': 1}
]
CVS output (columns order does not matter):
a, b, c, d, e
1, 2, 3
1, 2, , 4, 5
1, 2, , 4
The standard library csv module does not seem to cover this kind of input.
Is there some package or library for a single-method export?
Or a good solution to deal with column discrepancies?
It can be done fairly easily using the included csv module with a little preliminary processing.
import csv

data = [
    {'a': 1, 'b': 2, 'c': 3},
    {'b': 2, 'd': 4, 'e': 5, 'a': 1},
    {'b': 2, 'd': 4, 'a': 1}
]
fields = sorted(set().union(*(d.keys() for d in data)))  # Determine columns.
with open('output.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=fields)
    writer.writeheader()
    writer.writerows(data)
print('-fini-')
Contents of file produced:
a,b,c,d,e
1,2,3,,
1,2,,4,5
1,2,,4,
Straightforward with pandas:
import pandas as pd

lst = [
    {'a': 1, 'b': 2, 'c': 3},
    {'b': 2, 'd': 4, 'e': 5, 'a': 1},
    {'b': 2, 'd': 4, 'a': 1}
]
df = pd.DataFrame(lst)
print(df.to_csv(index=None))
Output (note that columns containing missing values are upcast to float, hence the 3.0, 4.0 and 5.0):
a,b,c,d,e
1,2,3.0,,
1,2,,4.0,5.0
1,2,,4.0,
You have to pass a restval argument to DictWriter, which is the default value written for keys missing from a dictionary:
writer = csv.DictWriter(file, fieldnames=list('abcde'), restval='')
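For illustration, a runnable sketch of the restval behaviour, writing to an in-memory buffer and using a hypothetical 'NA' placeholder so the filled cells are visible:

```python
import csv
import io

data = [
    {'a': 1, 'b': 2, 'c': 3},
    {'b': 2, 'd': 4, 'e': 5, 'a': 1},
    {'b': 2, 'd': 4, 'a': 1},
]
buf = io.StringIO()
# restval is written for every fieldname a row is missing.
writer = csv.DictWriter(buf, fieldnames=list('abcde'), restval='NA')
writer.writeheader()
writer.writerows(data)
print(buf.getvalue())
```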

Python deduplicate records - dedupe

I want to use https://github.com/datamade/dedupe to deduplicate some records in python. Looking at their examples
data_d = {}
for row in data:
    clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
    row_id = int(row['id'])
    data_d[row_id] = dict(clean_row)
the dictionary consumes quite a lot of memory compared to, e.g., a dictionary created by pandas from a pd.DataFrame, or even a normal pd.DataFrame.
If this format is required, how can I convert a pd.Dataframe efficiently to such a dictionary?
edit
Example what pandas generates
{'column1': {0: 1389225600000000000,
1: 1388707200000000000,
2: 1388707200000000000,
3: 1389657600000000000,....
Example what dedupe expects
{'1': {'column1': 1389225600000000000, 'column2': 'ddd'},
 '2': {'column1': 1111, 'column2': 'ddd'}, ...}
It appears that df.to_dict(orient='index') will produce the representation you are looking for:
import pandas
data = [[1, 2, 3], [4, 5, 6]]
columns = ['a', 'b', 'c']
df = pandas.DataFrame(data, columns=columns)
df.to_dict(orient='index')
results in
{0: {'a': 1, 'b': 2, 'c': 3}, 1: {'a': 4, 'b': 5, 'c': 6}}
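If dedupe really needs string record ids, as the expected example in the question suggests, the keys can be stringified on the way out. A small sketch using made-up data in the question's shape:

```python
import pandas as pd

df = pd.DataFrame({'column1': [1389225600000000000, 1388707200000000000],
                   'column2': ['ddd', 'ddd']})
# orient='index' keys the records by the frame's index; convert those keys
# to strings to match dedupe's example format.
data_d = {str(i): rec for i, rec in df.to_dict(orient='index').items()}
print(data_d)
```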
You can try something like this:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [6,7,8,9,10]})
A B
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
print(df.T.to_dict())
{0: {'A': 1, 'B': 6}, 1: {'A': 2, 'B': 7}, 2: {'A': 3, 'B': 8}, 3: {'A': 4, 'B': 9}, 4: {'A': 5, 'B': 10}}
This is the same output as in chthonicdaemon's answer, so that answer is probably better. I am using pandas.DataFrame.T to transpose index and columns.
A python dictionary is not required, you just need an object that allows indexing by column name. i.e. row['col_name']
So, assuming data is a pandas dataframe should just be able to do something like:
data_d = {}
for row_id, row in data.iterrows():
    data_d[row_id] = row
That said, the memory overhead of python dicts is not going to be where you have memory bottlenecks in dedupe.

How to retrieve and store multiple values from a python Data Frame?

I have the following Dataframe that represents a From-To distance matrix between pairs of points. I have predetermined "trips" that visit specific pairs of points that I need to calculate the total distance for.
For example,
Trip 1 = [A:B] + [B:C] + [B:D] = 6 + 5 + 8 = 19
Trip 2 = [A:D] + [B:E] + [C:E] = 6 + 15 + 3 = 24
import pandas as pd

graph = {'A': {'A': 0, 'B': 6, 'C': 10, 'D': 6, 'E': 7},
         'B': {'A': 10, 'B': 0, 'C': 5, 'D': 8, 'E': 15},
         'C': {'A': 40, 'B': 30, 'C': 0, 'D': 9, 'E': 3}}
df = pd.DataFrame(graph).T
df.to_excel('file.xls')
I have many "trips" that I need to repeat this process for, and I then need to store the values as a row in a new Dataframe that I can export to excel. I know I can use df.at[A,'B'] to retrieve specific values from the Dataframe, but how can I retrieve multiple values, sum them, store them in a new Dataframe, and then repeat for the next trip?
Thank you in advance for any help or guidance,
I think if you don't transpose then maybe an unstack will help?
import pandas as pd

graph = {'A': {'A': 0, 'B': 6, 'C': 10, 'D': 6, 'E': 7},
         'B': {'A': 10, 'B': 0, 'C': 5, 'D': 8, 'E': 15},
         'C': {'A': 40, 'B': 30, 'C': 0, 'D': 9, 'E': 3}}
df = pd.DataFrame(graph)
df = df.unstack()
df.index.names = ['start', 'finish']

# a list of tuples to represent the trip(s)
trip1 = [('A', 'B'), ('B', 'C'), ('B', 'D')]
trip2 = [('A', 'D'), ('B', 'E'), ('C', 'E')]
trips = [trip1, trip2]

my_trips = {}
for trip in trips:
    my_trips[str(trip)] = df.loc[trip].sum()

distance_df = pd.DataFrame(my_trips, index=['distance']).T
distance_df
distance
[('A', 'B'), ('B', 'C'), ('B', 'D')] 19
[('A', 'D'), ('B', 'E'), ('C', 'E')] 24
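Alternatively, keeping the From-To matrix exactly as built in the question (with the transpose), the df.at lookups the asker already knows can simply be summed per trip. A sketch with hypothetical trip names as row labels:

```python
import pandas as pd

graph = {'A': {'A': 0, 'B': 6, 'C': 10, 'D': 6, 'E': 7},
         'B': {'A': 10, 'B': 0, 'C': 5, 'D': 8, 'E': 15},
         'C': {'A': 40, 'B': 30, 'C': 0, 'D': 9, 'E': 3}}
df = pd.DataFrame(graph).T

trips = {'Trip 1': [('A', 'B'), ('B', 'C'), ('B', 'D')],
         'Trip 2': [('A', 'D'), ('B', 'E'), ('C', 'E')]}

# One df.at lookup per leg, summed per trip, collected into a new frame.
distances = pd.DataFrame(
    {'distance': {name: sum(df.at[a, b] for a, b in legs)
                  for name, legs in trips.items()}})
print(distances)
```

The resulting frame can then be exported with distances.to_excel(...) as in the question.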

Get the product of lists inside a dict while retaining the same keys

I have the following dict:
my_dict = {'A': [1, 2], 'B': [1, 4]}
And I want to end up with a list of dicts like this:
[
{'A': 1, 'B': 1},
{'A': 1, 'B': 4},
{'A': 2, 'B': 1},
{'A': 2, 'B': 4}
]
So, I'm after the product of dict's lists, expressed as a list of dicts using the same keys as the incoming dict.
The closest I got was (this is Python 2 code; iteritems was removed in Python 3):
my_dict = {'A': [1, 2], 'B': [1, 4]}
it = []
for k in my_dict.keys():
    current = my_dict.pop(k)
    for i in current:
        it.append({k2: i2 for k2, i2 in my_dict.iteritems()})
        it[-1].update({k: i})
Which, apart from looking hideous, doesn't give me what I want:
[
{'A': 1, 'B': [1, 4]},
{'A': 2, 'B': [1, 4]},
{'B': 1},
{'B': 4}
]
If anyone feels like solving a riddle, I'd love to see how you'd approach it.
You can use itertools.product for this, i.e. compute the Cartesian product of the values and then simply zip each result with the keys from the dictionary. Note that the ordering of a dict's keys() and corresponding values() stays consistent as long as the dict is not modified in between, so ordering won't be an issue here:
>>> from itertools import product
>>> my_dict = {'A': [1, 2], 'B': [1, 4]}
>>> keys = list(my_dict)
>>> [dict(zip(keys, p)) for p in product(*my_dict.values())]
[{'A': 1, 'B': 1}, {'A': 1, 'B': 4}, {'A': 2, 'B': 1}, {'A': 2, 'B': 4}]
You can use the itertools.product function within a list comprehension:
>>> from itertools import product
>>> [dict(i) for i in product(*[[(i,k) for k in j] for i,j in my_dict.items()])]
[{'A': 1, 'B': 1}, {'A': 1, 'B': 4}, {'A': 2, 'B': 1}, {'A': 2, 'B': 4}]
You can get the pairs containing each key and its values with the inner list comprehension:
>>> [[(i, k) for k in j] for i, j in my_dict.items()]
[[('A', 1), ('A', 2)], [('B', 1), ('B', 4)]]
Then product computes the Cartesian product of the preceding lists, and each combination is converted to a dictionary with the dict function.
With itertools:
>>> from itertools import product
>>> my_dict = {'A': [1, 2], 'B': [1, 4]}
>>> keys, items = zip(*my_dict.items())
>>> [dict(zip(keys, x)) for x in product(*items)]
[{'A': 1, 'B': 1}, {'A': 1, 'B': 4}, {'A': 2, 'B': 1}, {'A': 2, 'B': 4}]
Try this (note that this version handles exactly two keys, as in the example):
from itertools import product

def dict_product(values, first, second):
    return [
        {first: first_value, second: second_value}
        for first_value, second_value in product(values[first], values[second])
    ]
This is the result:
>>> dict_product({'A': [1, 2], 'B': [1, 4]}, 'A', 'B')
[{'A': 1, 'B': 1}, {'A': 1, 'B': 4}, {'A': 2, 'B': 1}, {'A': 2, 'B': 4}]

Categories

Resources