Given these data frames:
IncomingCount
-------------------------
Venue|Date | 08 | 10 |
-------------------------
Hotel|20190101| 15 | 03 |
Beach|20190101| 93 | 45 |
OutgoingCount
-------------------------
Venue|Date | 07 | 10 |
-------------------------
Beach|20190101| 30 | 5 |
Hotel|20190103| 05 | 15 |
How can I possibly merge (full join) the two tables resulting in something as following without having to manually loop through each row of both tables?
Dictionary:
[
{"Venue":"Hotel", "Date":"20190101", "08":{ "IncomingCount":15 }, "10":{ "IncomingCount":03 } },
{"Venue":"Beach", "Date":"20190101", "07":{ "OutgoingCount":30 }, "08":{ "IncomingCount":93 }, "10":{ "IncomingCount":45, "OutgoingCount":15 } },
{"Venue":"Hotel", "Date":"20190103", "07":{ "OutgoingCount":05 }, "10":{ "OutgoingCount":15 } }
]
The conditions are:
Venue and Date columns act like join conditions.
The other columns, represented in numbers, are dynamically created.
If dynamically column does not exist, it gets excluded( or included with None as value ).
it's pretty fiddly, but it can be done by making use of the create_map function from spark.
basically divide the columns into four groups: keys (venue, date), common (10), only incoming (08), only outgoing (07).
then create mappers per group (except keys), mapping only what's available per group. apply mapping, drop the old column and rename the mapped column to the old name.
lastly convert all rows to dict (from df's rdd) and collect.
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, col, lit
spark = SparkSession.builder.appName('hotels_and_beaches').getOrCreate()
incoming_counts = spark.createDataFrame([('Hotel', 20190101, 15, 3), ('Beach', 20190101, 93, 45)], ['Venue', 'Date', '08', '10']).alias('inc')
outgoing_counts = spark.createDataFrame([('Beach', 20190101, 30, 5), ('Hotel', 20190103, 5, 15)], ['Venue', 'Date', '07', '10']).alias('out')
df = incoming_counts.join(outgoing_counts, on=['Venue', 'Date'], how='full')
outgoing_cols = {c for c in outgoing_counts.columns if c not in {'Venue', 'Date'}}
incoming_cols = {c for c in incoming_counts.columns if c not in {'Venue', 'Date'}}
common_cols = outgoing_cols.intersection(incoming_cols)
outgoing_cols = outgoing_cols.difference(common_cols)
incoming_cols = incoming_cols.difference(common_cols)
for c in common_cols:
df = df.withColumn(
c + '_new', create_map(
lit('IncomingCount'), col('inc.{}'.format(c)),
lit('OutgoingCount'), col('out.{}'.format(c)),
)
).drop(c).withColumnRenamed(c + '_new', c)
for c in incoming_cols:
df = df.withColumn(
c + '_new', create_map(
lit('IncomingCount'), col('inc.{}'.format(c)),
)
).drop(c).withColumnRenamed(c + '_new', c)
for c in outgoing_cols:
df = df.withColumn(
c + '_new', create_map(
lit('OutgoingCount'), col('out.{}'.format(c)),
)
).drop(c).withColumnRenamed(c + '_new', c)
result = df.coalesce(1).rdd.map(lambda r: r.asDict()).collect()
print(result)
result:
[{'Venue': 'Hotel', 'Date': 20190101, '10': {'OutgoingCount': None, 'IncomingCount': 3}, '08': {'IncomingCount': 15}, '07': {'OutgoingCount': None}}, {'Venue': 'Hotel', 'Date': 20190103, '10': {'OutgoingCount': 15, 'IncomingCount': None}, '08': {'IncomingCount': None}, '07': {'OutgoingCount': 5}}, {'Venue': 'Beach', 'Date': 20190101, '10': {'OutgoingCount': 5, 'IncomingCount': 45}, '08': {'IncomingCount': 93}, '07': {'OutgoingCount': 30}}]
I can get this so far:
import pandas as pd
import numpy as np
dd1 = {'venue': ['hotel', 'beach'], 'date':['20190101', '20190101'], '08': [15, 93], '10':[3, 45]}
dd2 = {'venue': ['beach', 'hotel'], 'date':['20190101', '20190103'], '07': [30, 5], '10':[5, 15]}
df1 = pd.DataFrame(data=dd1)
df2 = pd.DataFrame(data=dd2)
df1.columns = [f"IncomingCount:{x}" if x not in ['venue', 'date'] else x for x in df1.columns]
df2.columns = [f"OutgoingCount:{x}" if x not in ['venue', 'date'] else x for x in df2.columns ]
ll_dd = pd.merge(df1, df2, on=['venue', 'date'], how='outer').to_dict('records')
ll_dd = [{k:v for k,v in dd.items() if not pd.isnull(v)} for dd in ll_dd]
OUTPUT:
[{'venue': 'hotel',
'date': '20190101',
'IncomingCount:08': 15.0,
'IncomingCount:10': 3.0},
{'venue': 'beach',
'date': '20190101',
'IncomingCount:08': 93.0,
'IncomingCount:10': 45.0,
'OutgoingCount:07': 30.0,
'OutgoingCount:10': 5.0},
{'venue': 'hotel',
'date': '20190103',
'OutgoingCount:07': 5.0,
'OutgoingCount:10': 15.0}]
The final result as desired by the OP is a list of dictionaries, where all rows from the DataFrame which have same Venue and Date have been clubbed together.
# Creating the DataFrames
df_Incoming = sqlContext.createDataFrame([('Hotel','20190101',15,3),('Beach','20190101',93,45)],('Venue','Date','08','10'))
df_Incoming.show()
+-----+--------+---+---+
|Venue| Date| 08| 10|
+-----+--------+---+---+
|Hotel|20190101| 15| 3|
|Beach|20190101| 93| 45|
+-----+--------+---+---+
df_Outgoing = sqlContext.createDataFrame([('Beach','20190101',30,5),('Hotel','20190103',5,15)],('Venue','Date','07','10'))
df_Outgoing.show()
+-----+--------+---+---+
|Venue| Date| 07| 10|
+-----+--------+---+---+
|Beach|20190101| 30| 5|
|Hotel|20190103| 5| 15|
+-----+--------+---+---+
The idea is to create a dictionary from each row and have the all rows of the DataFrame stored as dictionaries in one big list. And as a final step, we club those dictionaries together which have same Venue and Date.
Since, all rows in the DataFrame are stored as Row() objects, we use collect() function to return all records as list of Row(). Just to illustrate the output -
print(df_Incoming.collect())
[Row(Venue='Hotel', Date='20190101', 08=15, 10=3), Row(Venue='Beach', Date='20190101', 08=93, 10=45)]
But, since we want list of dictionaries, we can use list comprehensions to convert them to a one -
list_Incoming = [row.asDict() for row in df_Incoming.collect()]
print(list_Incoming)
[{'10': 3, 'Date': '20190101', 'Venue': 'Hotel', '08': 15}, {'10': 45, 'Date': '20190101', 'Venue': 'Beach', '08': 93}]
But, since the numeric columns have been in the form like "08":{ "IncomingCount":15 }, instead of "08":15, so we employ dictionary comprehensions to convert them into this form -
list_Incoming = [ {k:v if k in ['Venue','Date'] else {'IncomingCount':v} for k,v in dict_element.items()} for dict_element in list_Incoming]
print(list_Incoming)
[{'10': {'IncomingCount': 3}, 'Date': '20190101', 'Venue': 'Hotel', '08': {'IncomingCount': 15}}, {'10': {'IncomingCount': 45}, 'Date': '20190101', 'Venue': 'Beach', '08': {'IncomingCount': 93}}]
Similarly, we do for OutgoingCount
list_Outgoing = [row.asDict() for row in df_Outgoing.collect()]
list_Outgoing = [ {k:v if k in ['Venue','Date'] else {'OutgoingCount':v} for k,v in dict_element.items()} for dict_element in list_Outgoing]
print(list_Outgoing)
[{'10': {'OutgoingCount': 5}, 'Date': '20190101', 'Venue': 'Beach', '07': {'OutgoingCount': 30}}, {'10': {'OutgoingCount': 15}, 'Date': '20190103', 'Venue': 'Hotel', '07': {'OutgoingCount': 5}}]
Final Step: Now, that we have created the requisite list of dictionaries, we need to club the list together on the basis of Venue and Date.
from copy import deepcopy
def merge_lists(list_Incoming, list_Outgoing):
# create dictionary from list_Incoming:
dict1 = {(record['Venue'], record['Date']): record for record in list_Incoming}
#compare elements in list_Outgoing to those on list_Incoming:
result = {}
for record in list_Outgoing:
ckey = record['Venue'], record['Date']
new_record = deepcopy(record)
if ckey in dict1:
for key, value in dict1[ckey].items():
if key in ('Venue', 'Date'):
# Do not merge these keys
continue
# Dict's "setdefault" finds a key/value, and if it is missing
# creates a new one with the second parameter as value
new_record.setdefault(key, {}).update(value)
result[ckey] = new_record
# Add values from list_Incoming that were not matched in list_Outgoing:
for key, value in dict1.items():
if key not in result:
result[key] = deepcopy(value)
return list(result.values())
res = merge_lists(list_Incoming, list_Outgoing)
print(res)
[{'10': {'OutgoingCount': 5, 'IncomingCount': 45},
'Date': '20190101',
'Venue': 'Beach',
'08': {'IncomingCount': 93},
'07': {'OutgoingCount': 30}
},
{'10': {'OutgoingCount': 15},
'Date': '20190103',
'Venue': 'Hotel',
'07': {'OutgoingCount': 5}
},
{'10': {'IncomingCount': 3},
'Date': '20190101',
'Venue': 'Hotel',
'08': {'IncomingCount': 15}
}]
Related
I want to convert dataframe.groupby type to json
Here's my Dataframe
month group source amount_1 amount_2
0 2022-10-21 value_1 source 10 100
1 2022-08-21 value_2 source 20 50
2 2022-08-21 value_3 source 30 50
3 2022-09-21 value_3 source 40 60
currently this code give the result that I want but the key field is fixed.
all_items = []
for key, val in df.groupby(['group', 'source']):
item = {'group': key[0], 'source': key[1], 'price1': {}, 'price2': {}}
for date in all_dates:
rows = val[ val['month'] == date ].reset_index(drop=True)
item['price1'][date] = int(rows['amount_1'].get(0, 0))
item['price2'][date] = int(rows['amount_2'].get(0, 0))
all_items.append(item)
print(json.dumps(all_items, indent=2))
I would like to generate JSON string dynamically based on existing column in Dataframe.
The column list in Dataframe is subjected to change upon each API call.
For example if "group" is False, then it won't appear in JSON string.
column_status = {
"group": True,
"source" : True
"price1": True,
"price2": True
}
End result that I want.
[{'group': 'value_1',
'source': 'source1',
'price1': {'2022-10-21': 10, '2022-09-21': 10, '2022-08-21': 0},
'price2': {'2022-10-21': 100, '2022-09-21': 100, '2022-08-21': 0}},
{'group': 'value_2',
'source': 'source2',
'price1': {'2022-10-21': 0, '2022-09-21': 0, '2022-08-21': 20},
'price2': {'2022-10-21': 0, '2022-09-21': 0, '2022-08-21': 50}},
{'group': 'value_3',
'source': 'source1',
'price1': {'2022-10-21': 0, '2022-09-21': 40, '2022-08-21': 30},
'price2': {'2022-10-21': 0, '2022-09-21': 60, '2022-08-21': 50}}]
I have a dataframe simplified here with 3 columns.
| id | channels | facebookCount |
|:---- |:------:| -----:|
| 0 | {'channel': 'Google', 'count': 0.0} | 3 |
| 1 | {'channel': 'Google', 'count': 4.0} | 0 |
| 2 | {'channel': 'Google', 'count': 3.0} | 6 |
The channels column was a simple count column like facebookCount. However, I transformed into a dictionary using apply and lambda as such:
data_df["channels"] = data_df["googleCount"].apply(
lambda x: {} if x is None else {"channel": "Google", "count": x})
How can I construct the channel column so that it has data for both facebook and google so that I have a list containing 2 dictionaries as seen below:
| id | channels |
|:---- |:------:|
| 0 | [{'channel': 'Google', 'count': 0.0}, {'channel': 'Facebook', 'count': 3.0}] |
| 1 | [{'channel': 'Google', 'count': 4.0}, {'channel': 'Facebook', 'count': 0.0}] |
| 2 | [{'channel': 'Google', 'count': 3.0}, {'channel': 'Facebook', 'count': 6.0}] |
I have tried creating both dictionaries and then setting channel as well as creating one dictionary and then merging the 2 using apply and lambda as well as a helper function as such
dict1 = data_df["30DayGoogleCampaignCount"].apply(
lambda x: {"channel": "Google", "count": x})
data_df["paidMediaChannels"] = data_df["30DayFacebookCampaignCount"].apply(
lambda x: self.Merge(dict1, {"channel": "facebook", "count": x}))
def Merge(self, dict1, dict2):
return(dict2.update(dict1))
Try something like:
import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2],
'channels': [{'channel': 'Google', 'count': 0.0},
{'channel': 'Google', 'count': 4.0},
{'channel': 'Google', 'count': 3.0}],
'facebookCount': [3, 0, 6]})
# Create List
df['channels'] = df.apply(
lambda x: [x['channels'],
{'channel': 'Facebook',
'count': x['facebookCount']}],
axis=1
)
# Drop facebookCount Column
df = df.drop(columns='facebookCount')
print(df.to_string())
df:
id channels
0 0 [{'channel': 'Google', 'count': 0.0}, {'channel': 'Facebook', 'count': 3}]
1 1 [{'channel': 'Google', 'count': 4.0}, {'channel': 'Facebook', 'count': 0}]
2 2 [{'channel': 'Google', 'count': 3.0}, {'channel': 'Facebook', 'count': 6}]
i am trying to convert a data-frame to a dict in the below format:
name age country state pincode
user1 10 in tn 1
user2 11 in tx 2
user3 12 eu gh 3
user4 13 eu io 4
user5 14 us pi 5
user6 15 us ew 6
the output groups users based on countries and had a dictionary of users with the details of users inside a dictionary
{
'in': {
'user1': {'age': 10, 'state': 'tn', 'pincode': 1},
'user2': {'age': 11, 'state': 'tx', 'pincode': 2}
},
'eu': {
'user3': {'age': 12, 'state': 'gh', 'pincode': 3},
'user4': {'age': 13, 'state': 'io', 'pincode': 4},
},
'us': {
'user5': {'age': 14, 'state': 'pi', 'pincode': 5},
'user6': {'age': 15, 'state': 'ew', 'pincode': 6},
}
}
I am currently doing this by below statement(this is not completely correct as i am using a list inside the loop, instead it should have been a dict):
op2 = {}
for i, row in sample2.iterrows():
if row['country'] not in op2:
op2[row['country']] = []
op2[row['country']] = {row['name'] : {'age':row['age'],'state':row['state'],'pincode':row['pincode']}}
I want a the solution to work if there are additional columns added to the df. for example telephone number. Since the statement I have written is static it won't give me the additional rows in my output. Is there a built in method in pandas that does this?
You can combine to_dict with groupby:
{k:v.drop('country',axis=1).to_dict('i')
for k,v in df.set_index('name').groupby('country')}
Output:
{'eu': {'user3': {'age': 12, 'state': 'gh', 'pincode': 3},
'user4': {'age': 13, 'state': 'io', 'pincode': 4}},
'in': {'user1': {'age': 10, 'state': 'tn', 'pincode': 1},
'user2': {'age': 11, 'state': 'tx', 'pincode': 2}},
'us': {'user5': {'age': 14, 'state': 'pi', 'pincode': 5},
'user6': {'age': 15, 'state': 'ew', 'pincode': 6}}}
One of the columns of my pandas dataframe looks like this
>> df
Item
0 [{"id":A,"value":20},{"id":B,"value":30}]
1 [{"id":A,"value":20},{"id":C,"value":50}]
2 [{"id":A,"value":20},{"id":B,"value":30},{"id":C,"value":40}]
I want to expand it as
A B C
0 20 30 NaN
1 20 NaN 50
2 20 30 40
I tried
dfx = pd.DataFrame()
for i in range(df.shape[0]):
df1 = pd.DataFrame(df.item[i]).T
header = df1.iloc[0]
df1 = df1[1:]
df1 = df1.rename(columns = header)
dfx = dfx.append(df1)
But this takes a lot of time as my data is huge. What is the best way to do this?
My original json data looks like this:
{
{
'_id': '5b1284e0b840a768f5545ef6',
'device': '0035sdf121',
'customerId': '38',
'variantId': '31',
'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
'item': [{'id': A, 'value': 20},
{'id': B, 'value': 30},
{'id': C, 'value': 50}
},
{
'_id': '5b1284e0b840a768f5545ef6',
'device': '0035sdf121',
'customerId': '38',
'variantId': '31',
'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
'item': [{'id': A, 'value': 20},
{'id': B, 'value': 30},
{'id': C, 'value': 50}
},
.............
}
I agree with #JeffH, you should really look at how you are constructing the DataFrame.
Assuming you are getting this from somewhere out of your control then you can convert to the your desired DataFrame with:
In []:
pd.DataFrame(df['Item'].apply(lambda r: {d['id']: d['value'] for d in r}).values.tolist())
Out[]:
A B C
0 20 30.0 NaN
1 20 NaN 50.0
2 20 30.0 40.0
I want to convert the below pandas data frame
data = pd.DataFrame([[1,2], [5,6]], columns=['10+', '20+'], index=['A', 'B'])
data.index.name = 'City'
data.columns.name= 'Age Group'
print data
Age Group 10+ 20+
City
A 1 2
B 5 6
in to an array of dictionaries, like
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
I am able to get the above expected result using the following loops
result = []
cols_name = data.columns.name
index_names = data.index.name
for index in data.index:
for col in data.columns:
result.append({cols_name: col, index_names: index, 'count': data.loc[index, col]})
Is there any better ways of doing this? Since my original data will be having large number of records, using for loops will take more time.
I think you can use stack with reset_index for reshape and last to_dict:
print (data.stack().reset_index(name='count'))
City Age Group count
0 A 10+ 1
1 A 20+ 2
2 B 10+ 5
3 B 20+ 6
print (data.stack().reset_index(name='count').to_dict(orient='records'))
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]