I am trying to clean the data with this code:
empty = {}
mess = lophoc_clean.query("lop_diemquatrinh.notnull()")[['lop_id', 'lop_diemquatrinh']]
keys = []
values = []
for index, rows in mess.iterrows():
    if len(rows['lop_diemquatrinh']) > 4:
        values.append(rows['lop_diemquatrinh'])
        keys.append(rows['lop_id'])
df = pd.DataFrame(dict(zip(keys, values)), index = [0]).transpose()
df.columns = ['data']
The result is a dictionary like this
{'data': {37: '[{"date_update":"31-03-2022","diemquatrinh":"6.0"}]',
38: '[{"date_update":"11-03-2022","diemquatrinh":"6.25"}]',
44: '[{"date_update":"25-12-2021","diemquatrinh":"6.0"},{"date_update":"28-04-2022","diemquatrinh":"6.25"},{"date_update":"28-07-2022","diemquatrinh":"6.5"}]',
1095: '[{"date_update":null,"diemquatrinh":null}]'}}
However, I don't know how to make them into a DataFrame with 3 columns like this. Please help me. Thank you!
id    updated_at  diemquatrinh
38    11-03-2022  6.25
44    25-12-2021  6.0
44    28-04-2022  6.25
44    28-07-2022  6.5
1095  null        null
Here you go.
from json import loads
from pprint import pp
import pandas as pd
def get_example_data():
    return [
        dict(id=38, updated_at="2022-03-11", diemquatrinh=6.25),
        dict(id=44, updated_at="2021-12-25", diemquatrinh=6),
        dict(id=44, updated_at="2022-04-28", diemquatrinh=6.25),
        dict(id=1095, updated_at=None),
    ]
df = pd.DataFrame(get_example_data())
df["updated_at"] = pd.to_datetime(df["updated_at"])
print(df.dtypes, "\n")
pp(loads(df.to_json()))
print()
print(df, "\n")
pp(loads(df.to_json(orient="records")))
It produces this output:
id int64
updated_at datetime64[ns]
diemquatrinh float64
dtype: object
{'id': {'0': 38, '1': 44, '2': 44, '3': 1095},
'updated_at': {'0': 1646956800000,
'1': 1640390400000,
'2': 1651104000000,
'3': None},
'diemquatrinh': {'0': 6.25, '1': 6.0, '2': 6.25, '3': None}}
id updated_at diemquatrinh
0 38 2022-03-11 6.25
1 44 2021-12-25 6.00
2 44 2022-04-28 6.25
3 1095 NaT NaN
[{'id': 38, 'updated_at': 1646956800000, 'diemquatrinh': 6.25},
{'id': 44, 'updated_at': 1640390400000, 'diemquatrinh': 6.0},
{'id': 44, 'updated_at': 1651104000000, 'diemquatrinh': 6.25},
{'id': 1095, 'updated_at': None, 'diemquatrinh': None}]
Either of the JSON data structures would be acceptable input for creating a new DataFrame from scratch.
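The example above starts from records that are already clean. As a minimal sketch of the step the question asks about, assuming every value is a JSON string shaped like the ones shown in the question, each string can be parsed with json.loads and flattened into one row per entry. The names raw and rows are only illustrative, and raw holds a shortened copy of the question's data.

```python
import json

import pandas as pd

# Shortened copy of the question's data: lop_id -> JSON string (assumed shape).
raw = {
    37: '[{"date_update":"31-03-2022","diemquatrinh":"6.0"}]',
    44: '[{"date_update":"25-12-2021","diemquatrinh":"6.0"},'
        '{"date_update":"28-04-2022","diemquatrinh":"6.25"}]',
    1095: '[{"date_update":null,"diemquatrinh":null}]',
}

# One output row per element of each parsed JSON list.
rows = [
    {"id": key, "updated_at": rec["date_update"], "diemquatrinh": rec["diemquatrinh"]}
    for key, text in raw.items()
    for rec in json.loads(text)
]

df = pd.DataFrame(rows)
# The dates are day-month-year strings; JSON null becomes NaT / NaN.
df["updated_at"] = pd.to_datetime(df["updated_at"], format="%d-%m-%Y")
df["diemquatrinh"] = pd.to_numeric(df["diemquatrinh"])
print(df)
```

Converting the columns after building the frame keeps the parsing step simple; whether the null rows should be kept or dropped depends on what the cleaned data is used for.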
I have dictionaries inside a tuple, and I want to know how to access them and create a DataFrame that merges all the dictionary values into a single row.
Example:
({'Id': '4', 'BU': 'usa', 'V_ID': '44', 'INV': 'inv1331', 'DT': '08/1/19', 'AMT': '1500'}, {'Id': '9', 'BU': 'usa', 'V_ID': '44', 'INV': 'inv4321', 'DT': '02/6/19', 'AMT': '1000'})
Expected Result:
Id_1 BU_1 V_ID_1 INV_1 DT_1 AMT_1 Id_2 BU_2 V_ID_2 INV_2 DT_2 AMT_2
---------------------------------------------------------------------------------------------
4 usa 44 inv1331 08/1/19 1500 9 usa 44 inv4321 02/6/19 1000
import pandas as pd
x = ({'Id': '4', 'BU': 'usa', 'V_ID': '44', 'INV': 'inv1331', 'DT': '08/1/19', 'AMT': '1500'}, {'Id': '9', 'BU': 'usa', 'V_ID': '44', 'INV': 'inv4321', 'DT': '02/6/19', 'AMT': '1000'})
data = {f"{k}_{i+1}": v for i, d in enumerate(x) for k, v in d.items()}
df = pd.DataFrame(data, index = [0])
Output:
>>> df
Id_1 BU_1 V_ID_1 INV_1 DT_1 ... BU_2 V_ID_2 INV_2 DT_2 AMT_2
0 4 usa 44 inv1331 08/1/19 ... usa 44 inv4321 02/6/19 1000
[1 rows x 12 columns]
Given these data frames:
IncomingCount
-------------------------
Venue|Date | 08 | 10 |
-------------------------
Hotel|20190101| 15 | 03 |
Beach|20190101| 93 | 45 |
OutgoingCount
-------------------------
Venue|Date | 07 | 10 |
-------------------------
Beach|20190101| 30 | 5 |
Hotel|20190103| 05 | 15 |
How can I merge (full join) the two tables into something like the following, without manually looping through each row of both tables?
Dictionary:
[
{"Venue":"Hotel", "Date":"20190101", "08":{ "IncomingCount":15 }, "10":{ "IncomingCount":3 } },
{"Venue":"Beach", "Date":"20190101", "07":{ "OutgoingCount":30 }, "08":{ "IncomingCount":93 }, "10":{ "IncomingCount":45, "OutgoingCount":5 } },
{"Venue":"Hotel", "Date":"20190103", "07":{ "OutgoingCount":5 }, "10":{ "OutgoingCount":15 } }
]
The conditions are:
Venue and Date columns act like join conditions.
The other columns, represented in numbers, are dynamically created.
If a dynamically created column does not exist for a row, it is excluded (or included with None as its value).
It's pretty fiddly, but it can be done by making use of the create_map function from PySpark.
Basically, divide the columns into four groups: keys (Venue, Date), common (10), only incoming (08), and only outgoing (07).
Then create mappers per group (except keys), mapping only what's available in that group. Apply the mapping, drop the old column, and rename the mapped column to the old name.
Lastly, convert all rows to dicts (from the DataFrame's RDD) and collect.
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, col, lit
spark = SparkSession.builder.appName('hotels_and_beaches').getOrCreate()
incoming_counts = spark.createDataFrame([('Hotel', 20190101, 15, 3), ('Beach', 20190101, 93, 45)], ['Venue', 'Date', '08', '10']).alias('inc')
outgoing_counts = spark.createDataFrame([('Beach', 20190101, 30, 5), ('Hotel', 20190103, 5, 15)], ['Venue', 'Date', '07', '10']).alias('out')
df = incoming_counts.join(outgoing_counts, on=['Venue', 'Date'], how='full')
outgoing_cols = {c for c in outgoing_counts.columns if c not in {'Venue', 'Date'}}
incoming_cols = {c for c in incoming_counts.columns if c not in {'Venue', 'Date'}}
common_cols = outgoing_cols.intersection(incoming_cols)
outgoing_cols = outgoing_cols.difference(common_cols)
incoming_cols = incoming_cols.difference(common_cols)
for c in common_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
for c in incoming_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
for c in outgoing_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
result = df.coalesce(1).rdd.map(lambda r: r.asDict()).collect()
print(result)
result:
[{'Venue': 'Hotel', 'Date': 20190101, '10': {'OutgoingCount': None, 'IncomingCount': 3}, '08': {'IncomingCount': 15}, '07': {'OutgoingCount': None}}, {'Venue': 'Hotel', 'Date': 20190103, '10': {'OutgoingCount': 15, 'IncomingCount': None}, '08': {'IncomingCount': None}, '07': {'OutgoingCount': 5}}, {'Venue': 'Beach', 'Date': 20190101, '10': {'OutgoingCount': 5, 'IncomingCount': 45}, '08': {'IncomingCount': 93}, '07': {'OutgoingCount': 30}}]
I can get this so far:
import pandas as pd
import numpy as np
dd1 = {'venue': ['hotel', 'beach'], 'date':['20190101', '20190101'], '08': [15, 93], '10':[3, 45]}
dd2 = {'venue': ['beach', 'hotel'], 'date':['20190101', '20190103'], '07': [30, 5], '10':[5, 15]}
df1 = pd.DataFrame(data=dd1)
df2 = pd.DataFrame(data=dd2)
df1.columns = [f"IncomingCount:{x}" if x not in ['venue', 'date'] else x for x in df1.columns]
df2.columns = [f"OutgoingCount:{x}" if x not in ['venue', 'date'] else x for x in df2.columns ]
ll_dd = pd.merge(df1, df2, on=['venue', 'date'], how='outer').to_dict('records')
ll_dd = [{k:v for k,v in dd.items() if not pd.isnull(v)} for dd in ll_dd]
OUTPUT:
[{'venue': 'hotel',
'date': '20190101',
'IncomingCount:08': 15.0,
'IncomingCount:10': 3.0},
{'venue': 'beach',
'date': '20190101',
'IncomingCount:08': 93.0,
'IncomingCount:10': 45.0,
'OutgoingCount:07': 30.0,
'OutgoingCount:10': 5.0},
{'venue': 'hotel',
'date': '20190103',
'OutgoingCount:07': 5.0,
'OutgoingCount:10': 15.0}]
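To get from these flat 'IncomingCount:08' keys to the nested shape the question asks for, one possible follow-up step is to split each key on the colon and group the counts under the column name. This is a sketch rather than part of the answer above: the helpers prefix and nest and the KEYS constant are hypothetical names, and the counts are cast back to int on the assumption that they are whole numbers.

```python
import pandas as pd

KEYS = ['venue', 'date']
df1 = pd.DataFrame({'venue': ['hotel', 'beach'], 'date': ['20190101', '20190101'],
                    '08': [15, 93], '10': [3, 45]})
df2 = pd.DataFrame({'venue': ['beach', 'hotel'], 'date': ['20190101', '20190103'],
                    '07': [30, 5], '10': [5, 15]})


def prefix(df, label):
    # Tag every non-key column with its source, e.g. '08' -> 'IncomingCount:08'.
    return df.rename(columns=lambda c: c if c in KEYS else f"{label}:{c}")


flat = pd.merge(prefix(df1, 'IncomingCount'), prefix(df2, 'OutgoingCount'),
                on=KEYS, how='outer').to_dict('records')


def nest(record):
    # Turn {'IncomingCount:08': 15.0} into {'08': {'IncomingCount': 15}},
    # skipping the NaN values introduced by the outer join.
    out = {}
    for key, value in record.items():
        if key in KEYS:
            out[key] = value
        elif not pd.isnull(value):
            label, slot = key.split(':', 1)
            out.setdefault(slot, {})[label] = int(value)
    return out


nested = [nest(r) for r in flat]
print(nested)
```

The order of the merged rows can differ between pandas versions, so code that consumes nested should look records up by venue and date rather than by position.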
The final result desired by the OP is a list of dictionaries in which all rows that share the same Venue and Date have been merged together.
# Creating the DataFrames
df_Incoming = sqlContext.createDataFrame([('Hotel','20190101',15,3),('Beach','20190101',93,45)],('Venue','Date','08','10'))
df_Incoming.show()
+-----+--------+---+---+
|Venue| Date| 08| 10|
+-----+--------+---+---+
|Hotel|20190101| 15| 3|
|Beach|20190101| 93| 45|
+-----+--------+---+---+
df_Outgoing = sqlContext.createDataFrame([('Beach','20190101',30,5),('Hotel','20190103',5,15)],('Venue','Date','07','10'))
df_Outgoing.show()
+-----+--------+---+---+
|Venue| Date| 07| 10|
+-----+--------+---+---+
|Beach|20190101| 30| 5|
|Hotel|20190103| 5| 15|
+-----+--------+---+---+
The idea is to create a dictionary from each row and store all rows of the DataFrame as dictionaries in one big list. As a final step, we merge the dictionaries that have the same Venue and Date.
Since all rows in the DataFrame are stored as Row() objects, we use the collect() function to return all records as a list of Row() objects. Just to illustrate the output:
print(df_Incoming.collect())
[Row(Venue='Hotel', Date='20190101', 08=15, 10=3), Row(Venue='Beach', Date='20190101', 08=93, 10=45)]
But since we want a list of dictionaries, we can use a list comprehension to convert the rows:
list_Incoming = [row.asDict() for row in df_Incoming.collect()]
print(list_Incoming)
[{'10': 3, 'Date': '20190101', 'Venue': 'Hotel', '08': 15}, {'10': 45, 'Date': '20190101', 'Venue': 'Beach', '08': 93}]
The numeric columns need to be in the form "08":{ "IncomingCount":15 } rather than "08":15, so we use a dictionary comprehension to convert them:
list_Incoming = [ {k:v if k in ['Venue','Date'] else {'IncomingCount':v} for k,v in dict_element.items()} for dict_element in list_Incoming]
print(list_Incoming)
[{'10': {'IncomingCount': 3}, 'Date': '20190101', 'Venue': 'Hotel', '08': {'IncomingCount': 15}}, {'10': {'IncomingCount': 45}, 'Date': '20190101', 'Venue': 'Beach', '08': {'IncomingCount': 93}}]
We do the same for OutgoingCount:
list_Outgoing = [row.asDict() for row in df_Outgoing.collect()]
list_Outgoing = [ {k:v if k in ['Venue','Date'] else {'OutgoingCount':v} for k,v in dict_element.items()} for dict_element in list_Outgoing]
print(list_Outgoing)
[{'10': {'OutgoingCount': 5}, 'Date': '20190101', 'Venue': 'Beach', '07': {'OutgoingCount': 30}}, {'10': {'OutgoingCount': 15}, 'Date': '20190103', 'Venue': 'Hotel', '07': {'OutgoingCount': 5}}]
Final step: now that we have created the required lists of dictionaries, we need to merge them on the basis of Venue and Date.
from copy import deepcopy
def merge_lists(list_Incoming, list_Outgoing):
    # create dictionary from list_Incoming:
    dict1 = {(record['Venue'], record['Date']): record for record in list_Incoming}
    # compare elements in list_Outgoing to those in list_Incoming:
    result = {}
    for record in list_Outgoing:
        ckey = record['Venue'], record['Date']
        new_record = deepcopy(record)
        if ckey in dict1:
            for key, value in dict1[ckey].items():
                if key in ('Venue', 'Date'):
                    # Do not merge these keys
                    continue
                # Dict's "setdefault" finds a key/value, and if it is missing
                # creates a new one with the second parameter as value
                new_record.setdefault(key, {}).update(value)
        result[ckey] = new_record
    # Add values from list_Incoming that were not matched in list_Outgoing:
    for key, value in dict1.items():
        if key not in result:
            result[key] = deepcopy(value)
    return list(result.values())
res = merge_lists(list_Incoming, list_Outgoing)
print(res)
[{'10': {'OutgoingCount': 5, 'IncomingCount': 45},
'Date': '20190101',
'Venue': 'Beach',
'08': {'IncomingCount': 93},
'07': {'OutgoingCount': 30}
},
{'10': {'OutgoingCount': 15},
'Date': '20190103',
'Venue': 'Hotel',
'07': {'OutgoingCount': 5}
},
{'10': {'IncomingCount': 3},
'Date': '20190101',
'Venue': 'Hotel',
'08': {'IncomingCount': 15}
}]
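The same merge can also be written more compactly if the lists are treated uniformly. The sketch below is an alternative, not the function above: merge_records is a hypothetical helper that accepts any number of record lists and builds fresh dictionaries keyed by (Venue, Date), so no deepcopy is needed. The two input lists are copied from the output shown earlier.

```python
def merge_records(*record_lists):
    # Map (Venue, Date) -> merged record; dicts keep insertion order.
    merged = {}
    for records in record_lists:
        for record in records:
            key = (record['Venue'], record['Date'])
            target = merged.setdefault(
                key, {'Venue': record['Venue'], 'Date': record['Date']}
            )
            for column, counts in record.items():
                if column not in ('Venue', 'Date'):
                    # Copy the counts into a new inner dict so inputs stay untouched.
                    target.setdefault(column, {}).update(counts)
    return list(merged.values())


list_Incoming = [
    {'10': {'IncomingCount': 3}, 'Date': '20190101', 'Venue': 'Hotel', '08': {'IncomingCount': 15}},
    {'10': {'IncomingCount': 45}, 'Date': '20190101', 'Venue': 'Beach', '08': {'IncomingCount': 93}},
]
list_Outgoing = [
    {'10': {'OutgoingCount': 5}, 'Date': '20190101', 'Venue': 'Beach', '07': {'OutgoingCount': 30}},
    {'10': {'OutgoingCount': 15}, 'Date': '20190103', 'Venue': 'Hotel', '07': {'OutgoingCount': 5}},
]

res = merge_records(list_Incoming, list_Outgoing)
print(res)
```

Like the original, this still collects everything to the driver, so it only suits results small enough to fit in memory.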
One of the columns of my pandas dataframe looks like this
>> df
Item
0 [{"id":A,"value":20},{"id":B,"value":30}]
1 [{"id":A,"value":20},{"id":C,"value":50}]
2 [{"id":A,"value":20},{"id":B,"value":30},{"id":C,"value":40}]
I want to expand it as
A B C
0 20 30 NaN
1 20 NaN 50
2 20 30 40
I tried
dfx = pd.DataFrame()
for i in range(df.shape[0]):
    df1 = pd.DataFrame(df.item[i]).T
    header = df1.iloc[0]
    df1 = df1[1:]
    df1 = df1.rename(columns = header)
    dfx = dfx.append(df1)
But this takes a lot of time as my data is huge. What is the best way to do this?
My original json data looks like this:
{
    {
        '_id': '5b1284e0b840a768f5545ef6',
        'device': '0035sdf121',
        'customerId': '38',
        'variantId': '31',
        'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
        'item': [{'id': A, 'value': 20},
                 {'id': B, 'value': 30},
                 {'id': C, 'value': 50}]
    },
    {
        '_id': '5b1284e0b840a768f5545ef6',
        'device': '0035sdf121',
        'customerId': '38',
        'variantId': '31',
        'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
        'item': [{'id': A, 'value': 20},
                 {'id': B, 'value': 30},
                 {'id': C, 'value': 50}]
    },
    .............
}
I agree with @JeffH, you should really look at how you are constructing the DataFrame.
Assuming you are getting this from somewhere out of your control, you can convert it to your desired DataFrame with:
In []:
pd.DataFrame(df['Item'].apply(lambda r: {d['id']: d['value'] for d in r}).values.tolist())
Out[]:
A B C
0 20 30.0 NaN
1 20 NaN 50.0
2 20 30.0 40.0
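For reference, here is a self-contained version of the same idea that can be run as is. It assumes the ids are strings ("A", "B", "C"), since the question shows them unquoted, and it uses a plain list comprehension instead of apply; the names rows and expanded are illustrative.

```python
import pandas as pd

# Sample data modelled on the question; ids are assumed to be strings.
df = pd.DataFrame({
    "Item": [
        [{"id": "A", "value": 20}, {"id": "B", "value": 30}],
        [{"id": "A", "value": 20}, {"id": "C", "value": 50}],
        [{"id": "A", "value": 20}, {"id": "B", "value": 30}, {"id": "C", "value": 40}],
    ]
})

# Collapse each list of {"id", "value"} pairs into a single {id: value} dict.
rows = [{d["id"]: d["value"] for d in items} for items in df["Item"]]

# pandas aligns the dicts by key and fills missing ids with NaN.
expanded = pd.DataFrame(rows, index=df.index)
print(expanded)
```

Building one list of dicts and calling the DataFrame constructor once avoids the repeated append in the original loop, which is where most of the time goes on large data.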
This question is a follow-up to a previous one, so I have posted it as a new question.
Suppose my DataFrame B looks something like this:
ID category words bucket_id
1 audi a4, a6 94
2 bugatti veyron, chiron 86
3 mercedez s-class, e-class 79
4 dslr canon, nikon 69
5 apple iphone,macbook,ipod 51
6 finance sales,loans,sales price 12
7 politics trump, election, votes 77
8 entertainment spiderman,thor, ironmen 88
9 music beiber, rihana,drake 14
........ ..............
......... .........
I want each matched category, along with its corresponding ID and bucket_id, as a dictionary. Something like:
{'id': 2, 'term': 'bugatti', 'bucket_id': 86}
{'id': 3, 'term': 'mercedez', 'bucket_id': 79}
{'id': 6, 'term': 'finance', 'bucket_id': 12}
{'id': 7, 'term': 'politics', 'bucket_id': 77}
{'id': 9, 'term': 'music', 'bucket_id': 14}
Edit:
I only want to match keywords that exactly equal an entry between two commas in the words column, not substrings of an entry and not entries that contain other words.
EDIT:
df = pd.DataFrame({'ID': [1, 2, 3],
'category': ['bugatti', 'entertainment', 'mercedez'],
'words': ['veyron,chiron', 'spiderman,thor,ironmen',
's-class,e-class,s-class'],
'bucket_id': [94, 86, 79]})
print (df)
ID category words bucket_id
0 1 bugatti veyron,chiron 94
1 2 entertainment spiderman,thor,ironmen 86
2 3 mercedez s-class,e-class,s-class 79
A = ['veyron','s-class','derman']
idx = [i for i, x in enumerate(df['words']) for y in x.split(',') if y in A]
print (idx)
[0, 2, 2]
L = (df.loc[idx, ['ID','category','bucket_id']]
       .rename(columns={'category':'term'})
       .to_dict(orient='records'))
print (L)
[{'ID': 1, 'term': 'bugatti', 'bucket_id': 94},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79}]
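Note that the row with ID 3 appears twice because 's-class' occurs twice in its words, and that the question's original data has spaces after some commas. A possible variation, sketched below under the assumption that each row should be reported at most once and that surrounding whitespace should be ignored, builds a boolean mask instead of an index list. The helper has_exact_match is a hypothetical name, and the sample words deliberately include spaces.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'category': ['bugatti', 'entertainment', 'mercedez'],
                   'words': ['veyron, chiron', 'spiderman,thor, ironmen',
                             's-class,e-class,s-class'],
                   'bucket_id': [94, 86, 79]})
A = {'veyron', 's-class', 'derman'}


def has_exact_match(words):
    # Split on commas, trim whitespace, and require a whole-token match,
    # so 'derman' does not match 'spiderman'.
    tokens = {w.strip() for w in words.split(',')}
    return not tokens.isdisjoint(A)


mask = df['words'].apply(has_exact_match)
L = (df.loc[mask, ['ID', 'category', 'bucket_id']]
       .rename(columns={'category': 'term'})
       .to_dict(orient='records'))
print(L)
```

Using a set for A keeps each membership check cheap when the keyword list is long.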