I have the following dataframe and schema:
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
SCHEMA = pa.schema([("a_and_b", pa.struct([('a', pa.int64()), ('b', pa.int64())])), ('c', pa.int64())])
Then I want to create a pyarrow table from df and save it to parquet with this schema. However, I could not find a way to create a proper type in pandas that would correspond to a struct type in pyarrow. Is there a way to do this?
For the conversion from pandas to pa.struct, you can use tuples (e.g. [(1, 2), (4, 5), (7, 8)]):
df_with_tuples = pd.DataFrame({
"a_and_b": list(zip(df["a"], df["b"])),
"c": df["c"]
})
pa.Table.from_pandas(df_with_tuples, SCHEMA)
or dicts (e.g. [{'a': 1, 'b': 2}, {'a': 4, 'b': 5}, {'a': 7, 'b': 8}]):
df_with_dict = pd.DataFrame({
"a_and_b": df.apply(lambda x: {"a": x["a"], "b": x["b"] }, axis=1),
"c": df["c"]
})
pa.Table.from_pandas(df_with_dict, SCHEMA)
When converting back from Arrow to pandas, structs are represented as dicts:
pa.Table.from_pandas(df_with_dict, SCHEMA).to_pandas()['a_and_b']
| a_and_b |
|:-----------------|
| {'a': 1, 'b': 2} |
| {'a': 4, 'b': 5} |
| {'a': 7, 'b': 8} |
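As a variation on the dict-based approach, the struct column can probably also be built without a row-wise apply. The sketch below assumes the same df as in the question and relies on DataFrame.to_dict("records"); the name struct_records is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

# One dict per row, restricted to the fields that make up the struct.
struct_records = df[["a", "b"]].to_dict("records")

df_with_dict = pd.DataFrame({
    "a_and_b": struct_records,  # object column holding one dict per row
    "c": df["c"],
})
print(df_with_dict)
```

The resulting df_with_dict has the same layout as the one built with apply, so it should be usable with pa.Table.from_pandas(df_with_dict, SCHEMA) in the same way.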
New to pandas and python so thank you in advance.
I have a table
# Create DataFrame
data = [{'analyte': 'sample1'},
{'analyte': 'bacon', 'CAS1': 1},
{'analyte': 'eggs', 'CAS1': 2},
{'analyte': 'money', 'CAS1': 3, 'CAS2': 1, 'Value2': 1.11},
{'analyte': 'shoe', 'CAS1': 4},
{'analyte': 'boy', 'CAS1': 5},
{'analyte': 'girl', 'CAS1': 6},
{'analyte': 'onion', 'CAS1': 7, 'CAS2': 4, 'Value2': 6.53},
{'analyte': 'sample2'},
{'analyte': 'bacon', 'CAS1': 1},
{'analyte': 'eggs', 'CAS1': 2, 'CAS2': 1, 'Value2': 7.88},
{'analyte': 'money', 'CAS1': 3},
{'analyte': 'shoe', 'CAS1': 4, 'CAS2': 3, 'Value2': 15.5},
{'analyte': 'boy', 'CAS1': 5},
{'analyte': 'girl', 'CAS1': 6},
{'analyte': 'onion', 'CAS1': 7}]
df = pd.DataFrame(data)
Before writing the pandas DataFrame to a MySQL database table, I need to split df into separate tables and then write each table to MySQL.
How can I split df by columns? Something like: if a column name contains the string "cas1", split it off into its own table:
for col in df.columns:
    if "cas1" in col:
        dfCas1 = df.split
        # add a unique index to identify which row it belongs to
    if "cas2" in col:
        dfCas2 = df.split
        # add a unique index to identify which row it belongs to
    if {"analyte","id" .etc } in col: # main table
        dfMain = df.split
dfMain.to_sql("Main", dbConnection, if_exists='fail')
dfCas1.to_sql("cas1", dbConnection, if_exists='fail')
dfCas2.to_sql("cas2", dbConnection, if_exists='fail')
expected
I'm not completely sure what you want to achieve, but I feel like you want to do something like splitting this:
+---------+----+------+--------+------+--------+
| Analyte | id | CAS1 | value1 | Cas2 | Value2 |
+---------+----+------+--------+------+--------+
| | | | | | |
+---------+----+------+--------+------+--------+
to this:
+---------+----+ +------+--------+ +------+--------+
| Analyte | id | | CAS1 | value1 | | Cas2 | Value2 |
+---------+----+ +------+--------+ +------+--------+
| | | | | | | | |
+---------+----+ +------+--------+ +------+--------+
The first one is obtained by calling e.g. df.loc[:, ['Analyte', 'id']]. For the other ones, adjust the column names.
Now for the unique index mentioned in your code comments: df.loc[:] keeps the index of the original table. You can use df.reset_index() to reset it to a unique integer index. If you also want to drop empty rows in one of your subtables before parsing, have a look at df.dropna().
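The column selection described above can be sketched as follows. This is only a minimal example under assumptions: it uses the first few rows of the question's data, and the helper split_columns and the grouping dict (which column-name substrings go into which table) are hypothetical names chosen for illustration. Writing to MySQL is left out because it needs a live connection.

```python
import pandas as pd

# First rows of the question's data.
data = [
    {"analyte": "sample1"},
    {"analyte": "bacon", "CAS1": 1},
    {"analyte": "eggs", "CAS1": 2},
    {"analyte": "money", "CAS1": 3, "CAS2": 1, "Value2": 1.11},
]
df = pd.DataFrame(data)


def split_columns(frame, groups):
    """Return one sub-frame per group, picking columns by case-insensitive substring."""
    parts = {}
    for name, needles in groups.items():
        cols = [c for c in frame.columns if any(n in c.lower() for n in needles)]
        # .loc keeps the original index, so rows can be linked back to the main table.
        parts[name] = frame.loc[:, cols].dropna(how="all")
    return parts


# Assumed grouping: which substrings belong to which table.
parts = split_columns(df, {"main": ["analyte"], "cas1": ["cas1"], "cas2": ["cas2", "value2"]})
for name, part in parts.items():
    print(name)
    print(part)
```

Each part could then be passed to to_sql as in the question. By default to_sql also writes the index as a column, so the preserved index can serve as the link between the tables.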
I am not 100% sure if this is what you mean, but:
dfCas1 = df[df.col.str.contains('cas1')]
dfCas2 = df[df.col.str.contains('cas2')]
dfMain = df[~(df.col.str.contains('cas1') | df.col.str.contains('cas2'))]
The ~ sign negates the selection, so dfMain holds all rows where col contains neither cas1 nor cas2. I hope this makes sense.
Given these data frames:
IncomingCount
-------------------------
Venue|Date | 08 | 10 |
-------------------------
Hotel|20190101| 15 | 03 |
Beach|20190101| 93 | 45 |
OutgoingCount
-------------------------
Venue|Date | 07 | 10 |
-------------------------
Beach|20190101| 30 | 5 |
Hotel|20190103| 05 | 15 |
How can I possibly merge (full join) the two tables resulting in something as following without having to manually loop through each row of both tables?
Dictionary:
[
{"Venue":"Hotel", "Date":"20190101", "08":{ "IncomingCount":15 }, "10":{ "IncomingCount":03 } },
{"Venue":"Beach", "Date":"20190101", "07":{ "OutgoingCount":30 }, "08":{ "IncomingCount":93 }, "10":{ "IncomingCount":45, "OutgoingCount":5 } },
{"Venue":"Hotel", "Date":"20190103", "07":{ "OutgoingCount":05 }, "10":{ "OutgoingCount":15 } }
]
The conditions are:
Venue and Date columns act like join conditions.
The other columns, represented in numbers, are dynamically created.
If a dynamic column does not exist for a row, it is excluded (or included with None as the value).
It's pretty fiddly, but it can be done by making use of Spark's create_map function.
Basically, divide the columns into four groups: keys (Venue, Date), common (10), only incoming (08), and only outgoing (07).
Then create a mapper per group (except the keys), mapping only what is available in that group. Apply the mapping, drop the old column, and rename the mapped column to the old name.
Lastly, convert all rows to dicts (via the DataFrame's rdd) and collect.
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, col, lit
spark = SparkSession.builder.appName('hotels_and_beaches').getOrCreate()
incoming_counts = spark.createDataFrame([('Hotel', 20190101, 15, 3), ('Beach', 20190101, 93, 45)], ['Venue', 'Date', '08', '10']).alias('inc')
outgoing_counts = spark.createDataFrame([('Beach', 20190101, 30, 5), ('Hotel', 20190103, 5, 15)], ['Venue', 'Date', '07', '10']).alias('out')
df = incoming_counts.join(outgoing_counts, on=['Venue', 'Date'], how='full')
outgoing_cols = {c for c in outgoing_counts.columns if c not in {'Venue', 'Date'}}
incoming_cols = {c for c in incoming_counts.columns if c not in {'Venue', 'Date'}}
common_cols = outgoing_cols.intersection(incoming_cols)
outgoing_cols = outgoing_cols.difference(common_cols)
incoming_cols = incoming_cols.difference(common_cols)
for c in common_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
for c in incoming_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
for c in outgoing_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
result = df.coalesce(1).rdd.map(lambda r: r.asDict()).collect()
print(result)
result:
[{'Venue': 'Hotel', 'Date': 20190101, '10': {'OutgoingCount': None, 'IncomingCount': 3}, '08': {'IncomingCount': 15}, '07': {'OutgoingCount': None}}, {'Venue': 'Hotel', 'Date': 20190103, '10': {'OutgoingCount': 15, 'IncomingCount': None}, '08': {'IncomingCount': None}, '07': {'OutgoingCount': 5}}, {'Venue': 'Beach', 'Date': 20190101, '10': {'OutgoingCount': 5, 'IncomingCount': 45}, '08': {'IncomingCount': 93}, '07': {'OutgoingCount': 30}}]
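The collected result still contains None counts, whereas the question prefers missing columns to be excluded. A possible post-processing step on the collected dicts is sketched below in plain Python; prune_nones is a hypothetical helper name, and the sample row is copied from the result above.

```python
def prune_nones(record):
    """Drop None counts from nested dicts, and drop nested dicts that end up empty."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, dict):
            kept = {k: v for k, v in value.items() if v is not None}
            if kept:  # skip columns that have no counts left at all
                cleaned[key] = kept
        else:
            cleaned[key] = value  # key columns such as Venue and Date pass through
    return cleaned


row = {
    'Venue': 'Hotel', 'Date': 20190101,
    '10': {'OutgoingCount': None, 'IncomingCount': 3},
    '08': {'IncomingCount': 15},
    '07': {'OutgoingCount': None},
}
pruned = prune_nones(row)
print(pruned)
```

Applied to every element, for example [prune_nones(r) for r in result], this should give the shape asked for in the question.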
I can get this so far:
import pandas as pd
import numpy as np
dd1 = {'venue': ['hotel', 'beach'], 'date':['20190101', '20190101'], '08': [15, 93], '10':[3, 45]}
dd2 = {'venue': ['beach', 'hotel'], 'date':['20190101', '20190103'], '07': [30, 5], '10':[5, 15]}
df1 = pd.DataFrame(data=dd1)
df2 = pd.DataFrame(data=dd2)
df1.columns = [f"IncomingCount:{x}" if x not in ['venue', 'date'] else x for x in df1.columns]
df2.columns = [f"OutgoingCount:{x}" if x not in ['venue', 'date'] else x for x in df2.columns ]
ll_dd = pd.merge(df1, df2, on=['venue', 'date'], how='outer').to_dict('records')
ll_dd = [{k:v for k,v in dd.items() if not pd.isnull(v)} for dd in ll_dd]
OUTPUT:
[{'venue': 'hotel',
'date': '20190101',
'IncomingCount:08': 15.0,
'IncomingCount:10': 3.0},
{'venue': 'beach',
'date': '20190101',
'IncomingCount:08': 93.0,
'IncomingCount:10': 45.0,
'OutgoingCount:07': 30.0,
'OutgoingCount:10': 5.0},
{'venue': 'hotel',
'date': '20190103',
'OutgoingCount:07': 5.0,
'OutgoingCount:10': 15.0}]
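To get from these flat "IncomingCount:08" keys to the nested form in the question, one option is to split each key on the colon. The following is a sketch in plain Python, assuming the "Kind:column" naming used above; nest is a hypothetical helper name.

```python
def nest(record, keys=("venue", "date")):
    """Turn {'IncomingCount:08': 15.0, ...} into {'08': {'IncomingCount': 15.0}, ...}."""
    nested = {k: record[k] for k in keys}
    for flat_key, value in record.items():
        if flat_key in keys:
            continue
        kind, column = flat_key.split(":", 1)  # e.g. 'IncomingCount', '08'
        nested.setdefault(column, {})[kind] = value
    return nested


flat = {
    'venue': 'beach',
    'date': '20190101',
    'IncomingCount:08': 93.0,
    'IncomingCount:10': 45.0,
    'OutgoingCount:07': 30.0,
    'OutgoingCount:10': 5.0,
}
nested = nest(flat)
print(nested)
```

With the list from above, this would be applied as [nest(dd) for dd in ll_dd]. The counts stay floats because the outer merge introduces NaN values before they are filtered out.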
The final result desired by the OP is a list of dictionaries in which all rows from the DataFrames that have the same Venue and Date have been clubbed together.
# Creating the DataFrames
df_Incoming = sqlContext.createDataFrame([('Hotel','20190101',15,3),('Beach','20190101',93,45)],('Venue','Date','08','10'))
df_Incoming.show()
+-----+--------+---+---+
|Venue| Date| 08| 10|
+-----+--------+---+---+
|Hotel|20190101| 15| 3|
|Beach|20190101| 93| 45|
+-----+--------+---+---+
df_Outgoing = sqlContext.createDataFrame([('Beach','20190101',30,5),('Hotel','20190103',5,15)],('Venue','Date','07','10'))
df_Outgoing.show()
+-----+--------+---+---+
|Venue| Date| 07| 10|
+-----+--------+---+---+
|Beach|20190101| 30| 5|
|Hotel|20190103| 5| 15|
+-----+--------+---+---+
The idea is to create a dictionary from each row and store all rows of the DataFrame as dictionaries in one big list. As a final step, we club together those dictionaries which have the same Venue and Date.
Since all rows in the DataFrame are stored as Row() objects, we use the collect() function to return all records as a list of Row(). Just to illustrate the output -
print(df_Incoming.collect())
[Row(Venue='Hotel', Date='20190101', 08=15, 10=3), Row(Venue='Beach', Date='20190101', 08=93, 10=45)]
But since we want a list of dictionaries, we can use a list comprehension to convert the rows -
list_Incoming = [row.asDict() for row in df_Incoming.collect()]
print(list_Incoming)
[{'10': 3, 'Date': '20190101', 'Venue': 'Hotel', '08': 15}, {'10': 45, 'Date': '20190101', 'Venue': 'Beach', '08': 93}]
But the numeric columns need to be in the form "08":{ "IncomingCount":15 } instead of "08":15, so we employ dictionary comprehensions to convert them into this form -
list_Incoming = [ {k:v if k in ['Venue','Date'] else {'IncomingCount':v} for k,v in dict_element.items()} for dict_element in list_Incoming]
print(list_Incoming)
[{'10': {'IncomingCount': 3}, 'Date': '20190101', 'Venue': 'Hotel', '08': {'IncomingCount': 15}}, {'10': {'IncomingCount': 45}, 'Date': '20190101', 'Venue': 'Beach', '08': {'IncomingCount': 93}}]
Similarly, we do the same for OutgoingCount -
list_Outgoing = [row.asDict() for row in df_Outgoing.collect()]
list_Outgoing = [ {k:v if k in ['Venue','Date'] else {'OutgoingCount':v} for k,v in dict_element.items()} for dict_element in list_Outgoing]
print(list_Outgoing)
[{'10': {'OutgoingCount': 5}, 'Date': '20190101', 'Venue': 'Beach', '07': {'OutgoingCount': 30}}, {'10': {'OutgoingCount': 15}, 'Date': '20190103', 'Venue': 'Hotel', '07': {'OutgoingCount': 5}}]
Final Step: Now that we have created the requisite lists of dictionaries, we need to club them together on the basis of Venue and Date.
from copy import deepcopy
def merge_lists(list_Incoming, list_Outgoing):
    # create dictionary from list_Incoming:
    dict1 = {(record['Venue'], record['Date']): record for record in list_Incoming}
    # compare elements in list_Outgoing to those on list_Incoming:
    result = {}
    for record in list_Outgoing:
        ckey = record['Venue'], record['Date']
        new_record = deepcopy(record)
        if ckey in dict1:
            for key, value in dict1[ckey].items():
                if key in ('Venue', 'Date'):
                    # Do not merge these keys
                    continue
                # Dict's "setdefault" finds a key/value, and if it is missing
                # creates a new one with the second parameter as value
                new_record.setdefault(key, {}).update(value)
        result[ckey] = new_record
    # Add values from list_Incoming that were not matched in list_Outgoing:
    for key, value in dict1.items():
        if key not in result:
            result[key] = deepcopy(value)
    return list(result.values())
res = merge_lists(list_Incoming, list_Outgoing)
print(res)
[{'10': {'OutgoingCount': 5, 'IncomingCount': 45},
'Date': '20190101',
'Venue': 'Beach',
'08': {'IncomingCount': 93},
'07': {'OutgoingCount': 30}
},
{'10': {'OutgoingCount': 15},
'Date': '20190103',
'Venue': 'Hotel',
'07': {'OutgoingCount': 5}
},
{'10': {'IncomingCount': 3},
'Date': '20190101',
'Venue': 'Hotel',
'08': {'IncomingCount': 15}
}]
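For comparison, the same clubbing step can be written as a single pass that groups records by (Venue, Date). This is only a sketch of an alternative, not the function above: merge_records is a hypothetical name, and it assumes every non-key value is already a dict such as {'IncomingCount': 3}.

```python
from collections import defaultdict


def merge_records(*record_lists, keys=("Venue", "Date")):
    """Group records from any number of lists by their key columns and merge the nested dicts."""
    merged = defaultdict(dict)
    for records in record_lists:
        for record in records:
            group = merged[tuple(record[k] for k in keys)]
            for name, value in record.items():
                if name in keys:
                    group[name] = value
                else:
                    # update() copies the entries, so the input dicts are not modified
                    group.setdefault(name, {}).update(value)
    return list(merged.values())


list_Incoming = [
    {'10': {'IncomingCount': 3}, 'Date': '20190101', 'Venue': 'Hotel', '08': {'IncomingCount': 15}},
    {'10': {'IncomingCount': 45}, 'Date': '20190101', 'Venue': 'Beach', '08': {'IncomingCount': 93}},
]
list_Outgoing = [
    {'10': {'OutgoingCount': 5}, 'Date': '20190101', 'Venue': 'Beach', '07': {'OutgoingCount': 30}},
    {'10': {'OutgoingCount': 15}, 'Date': '20190103', 'Venue': 'Hotel', '07': {'OutgoingCount': 5}},
]
merged = merge_records(list_Incoming, list_Outgoing)
print(merged)
```

Because the grouping is driven by the key tuple alone, the order of the input lists does not matter and no deepcopy is needed.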
I have a list of maps, e.g.
[{'a' : 10,'b': 20}, {'a' : 5,'b': 20} , {'b': 20} ,{'a' : 0,'b': 20}]
I want to get the average of values of a and b. So the expected output is
a = (10 + 5 + 0) / 3 = 5 (the record without a is not counted);
b = 80 / 4 = 20.
How can I do this efficiently using an RDD?
The easiest way might be to map each RDD element to a format like:
init = {'a': {'sum': 0, 'cnt': 0}, 'b': {'sum': 0, 'cnt': 0}}
i.e. record the sum and count for each key, and then reduce it.
Map function:
def map_fun(d, keys=['a', 'b']):
    map_d = {}
    for k in keys:
        if k in d:
            temp = {'sum': d[k], 'cnt': 1}
        else:
            temp = {'sum': 0, 'cnt': 0}
        map_d[k] = temp
    return map_d
Reduce function:
def reduce_fun(a, b, keys=['a', 'b']):
    from collections import defaultdict
    reduce_d = defaultdict(dict)
    for k in keys:
        reduce_d[k]['sum'] = a[k]['sum'] + b[k]['sum']
        reduce_d[k]['cnt'] = a[k]['cnt'] + b[k]['cnt']
    return reduce_d
rdd.map(map_fun).reduce(reduce_fun)
# defaultdict(<type 'dict'>, {'a': {'sum': 15, 'cnt': 3}, 'b': {'sum': 80, 'cnt': 4}})
Calculate the average:
d = rdd.map(map_fun).reduce(reduce_fun)
{k: v['sum']/v['cnt'] for k, v in d.items()}
{'a': 5, 'b': 20}
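The same sum-and-count idea can be tried locally without Spark. The sketch below uses functools.reduce as a stand-in for rdd.map(...).reduce(...); to_partial and combine are hypothetical names, and the (sum, count) tuples are just a more compact form of the dicts used above.

```python
from functools import reduce

records = [{'a': 10, 'b': 20}, {'a': 5, 'b': 20}, {'b': 20}, {'a': 0, 'b': 20}]
KEYS = ("a", "b")


def to_partial(record):
    """Map one record to a (sum, count) pair per key; a missing key contributes (0, 0)."""
    return {k: (record[k], 1) if k in record else (0, 0) for k in KEYS}


def combine(left, right):
    """Add two partial results key by key."""
    return {k: (left[k][0] + right[k][0], left[k][1] + right[k][1]) for k in KEYS}


totals = reduce(combine, map(to_partial, records))
# Skip keys that never occurred to avoid dividing by zero.
averages = {k: s / n for k, (s, n) in totals.items() if n}
print(totals)
print(averages)
```

The combine step is associative and commutative, which is what a distributed reduce requires, since partial results may be merged in any order.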
Given the structure of your data, you should be able to use the DataFrame API for this calculation. If you need an RDD, it is not too hard to get from the DataFrame back to an RDD.
from pyspark.sql import functions as F
df = spark.createDataFrame([{'a' : 10,'b': 20}, {'a' : 5,'b': 20} , {'b': 20} ,{'a' : 0,'b': 20}])
The DataFrame looks like this:
+----+---+
| a| b|
+----+---+
| 10| 20|
| 5| 20|
|null| 20|
| 0| 20|
+----+---+
The averages can then be calculated with the pyspark.sql functions (mean skips null values, so the record without a is not counted):
cols = df.columns
df_means = df.agg(*[F.mean(F.col(col)).alias(col+"_mean") for col in cols])
df_means.show()
OUTPUT:
+------+------+
|a_mean|b_mean|
+------+------+
| 5.0| 20.0|
+------+------+
You can use defaultdict to collect the values of each key in a list.
Then aggregate by dividing the sum of each list by its number of elements.
from collections import defaultdict
x = [{'a' : 10,'b': 20}, {'a' : 5,'b': 20} , {'b': 20} ,{'a' : 0,'b': 20}]
y = defaultdict(list)
# Group the values by key; a record that lacks a key adds nothing to that key's list.
for record in x:
    for k, v in record.items():
        y[k].append(v)
for k, v in y.items():
    print(k, "=", sum(v) / len(v))
Output:
a = 5.0
b = 20.0
>>> y
defaultdict(<class 'list'>, {'a': [10, 5, 0], 'b': [20, 20, 20, 20]})