I have a dataframe simplified here with 3 columns.
| id | channels | facebookCount |
|:---- |:------:| -----:|
| 0 | {'channel': 'Google', 'count': 0.0} | 3 |
| 1 | {'channel': 'Google', 'count': 4.0} | 0 |
| 2 | {'channel': 'Google', 'count': 3.0} | 6 |
The channels column was originally a simple count column (googleCount), like facebookCount. I transformed it into a dictionary using apply and a lambda:
data_df["channels"] = data_df["googleCount"].apply(
lambda x: {} if x is None else {"channel": "Google", "count": x})
How can I construct the channels column so that it holds data for both Facebook and Google, i.e. a list containing two dictionaries as shown below?
| id | channels |
|:---- |:------:|
| 0 | [{'channel': 'Google', 'count': 0.0}, {'channel': 'Facebook', 'count': 3.0}] |
| 1 | [{'channel': 'Google', 'count': 4.0}, {'channel': 'Facebook', 'count': 0.0}] |
| 2 | [{'channel': 'Google', 'count': 3.0}, {'channel': 'Facebook', 'count': 6.0}] |
I have tried creating both dictionaries and then setting channels, as well as creating one dictionary and then merging the two with apply, a lambda, and a helper function:
dict1 = data_df["30DayGoogleCampaignCount"].apply(
    lambda x: {"channel": "Google", "count": x})
data_df["paidMediaChannels"] = data_df["30DayFacebookCampaignCount"].apply(
    lambda x: self.Merge(dict1, {"channel": "facebook", "count": x}))
def Merge(self, dict1, dict2):
    return(dict2.update(dict1))
Try something like:
import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2],
'channels': [{'channel': 'Google', 'count': 0.0},
{'channel': 'Google', 'count': 4.0},
{'channel': 'Google', 'count': 3.0}],
'facebookCount': [3, 0, 6]})
# Create List
df['channels'] = df.apply(
lambda x: [x['channels'],
{'channel': 'Facebook',
'count': x['facebookCount']}],
axis=1
)
# Drop facebookCount Column
df = df.drop(columns='facebookCount')
print(df.to_string())
df:
id channels
0 0 [{'channel': 'Google', 'count': 0.0}, {'channel': 'Facebook', 'count': 3}]
1 1 [{'channel': 'Google', 'count': 4.0}, {'channel': 'Facebook', 'count': 0}]
2 2 [{'channel': 'Google', 'count': 3.0}, {'channel': 'Facebook', 'count': 6}]
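If the original count columns are still available, a variation is to build the list straight from them instead of wrapping an existing dictionary. This is only a sketch: the googleCount column and the channel_columns mapping are assumptions made for illustration, and the counts are cast to float so that both entries have the format shown in the question.

```python
import pandas as pd

# Assumed starting point: one plain count column per channel.
df = pd.DataFrame({"id": [0, 1, 2],
                   "googleCount": [0.0, 4.0, 3.0],
                   "facebookCount": [3, 0, 6]})

# Hypothetical mapping from channel name to its count column.
channel_columns = {"Google": "googleCount", "Facebook": "facebookCount"}

# One list of dictionaries per row, one dictionary per channel.
df["channels"] = [
    [{"channel": name, "count": float(row[column])}
     for name, column in channel_columns.items()]
    for row in df[list(channel_columns.values())].to_dict("records")
]

# Drop the raw count columns once they are folded into channels.
df = df.drop(columns=list(channel_columns.values()))
print(df.to_string())
```

Adding a third channel would then only need another entry in channel_columns.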
I'm building a job that categorizes earnings and expenses from a transactions app.
For this, I need to get the most-used category for each title when a title appears with several categories.
For example, in this Scenario:
| title | category | count |
| ----- | -------------- | ----- |
| Pizza | food | 6 |
| Pizza | others_expense | 1 |
| Pizza | refund | 1 |
I want to return just the first row, because the title is the same in all three rows and the category food is used most frequently.
Code example
I want to get the result using only the Django ORM, because I work with different databases and it is faster than iterating over a large list.
Model:
class Movimentation(models.Model):
    title = models.CharField(max_length=50)
    value = models.FloatField()
    category = models.CharField(max_length=50)
Query:
My current query in the Django ORM is:
Movimentation.objects \
.values('title', 'category') \
.annotate(count=Count('*'))
Result:
[
{'title': 'Pizza', 'category': 'food', 'count': 6},
{'title': 'Pizza', 'category': 'others_expense', 'count': 1},
{'title': 'Pizza', 'category': 'refund', 'count': 1},
{'title': 'Hamburguer', 'category': 'food', 'count': 1},
{'title': 'Clothing', 'category': 'personal', 'count': 18},
{'title': 'Clothing', 'category': 'home', 'count': 15},
{'title': 'Clothing', 'category': 'others_expense', 'count': 1}
]
Expected result:
In this case, I would get just one row per title, with the most-used category.
[
{'title': 'Pizza', 'category': 'food', 'count': 6},
{'title': 'Hamburguer', 'category': 'food', 'count': 1},
{'title': 'Clothing', 'category': 'personal', 'count': 18}
]
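The selection rule itself (one row per title, keeping the row with the highest count) can be sketched in plain Python over the annotated rows. This only illustrates the expected result; it is not the ORM-only query asked for, and the rows list simply repeats the query result shown above.

```python
# Rows as returned by the values(...).annotate(count=Count('*')) query above.
rows = [
    {'title': 'Pizza', 'category': 'food', 'count': 6},
    {'title': 'Pizza', 'category': 'others_expense', 'count': 1},
    {'title': 'Pizza', 'category': 'refund', 'count': 1},
    {'title': 'Hamburguer', 'category': 'food', 'count': 1},
    {'title': 'Clothing', 'category': 'personal', 'count': 18},
    {'title': 'Clothing', 'category': 'home', 'count': 15},
    {'title': 'Clothing', 'category': 'others_expense', 'count': 1},
]

# Keep, for every title, the row with the largest count seen so far.
best_by_title = {}
for row in rows:
    current = best_by_title.get(row['title'])
    if current is None or row['count'] > current['count']:
        best_by_title[row['title']] = row

result = list(best_by_title.values())
print(result)
```

On ties, this sketch keeps the first row encountered for a title.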
import pandas as pd
d = [{'name': 'tv', 'value': 10, 'amount': 35},
{'name': 'tv', 'value': 10, 'amount': 14},
{'name': 'tv', 'value': 15, 'amount': 23},
{'name': 'tv', 'value': 34, 'amount': 56},
{'name': 'radio', 'value': 90, 'amount': 35},
{'name': 'radio', 'value': 90, 'amount': 65},
{'name': 'radio', 'value': 100, 'amount': 50},
{'name': 'dvd', 'value': 0.5, 'amount': 35},
{'name': 'dvd', 'value': 0.2, 'amount': 40},
{'name': 'dvd', 'value': 0.5, 'amount': 15}
]
df = pd.DataFrame(d)
dff = df.groupby(['name', 'value']).agg('sum').reset_index()
dfff = dff.groupby(['name']).apply(lambda x: round((x['amount']/x['amount'].sum())*100))
print(dff)
print(dfff)
name value amount
0 dvd 0.2 40
1 dvd 0.5 50
2 radio 90.0 100
3 radio 100.0 50
4 tv 10.0 49
5 tv 15.0 23
6 tv 34.0 56
name
dvd 0 44.0
1 56.0
radio 2 67.0
3 33.0
tv 4 38.0
5 18.0
6 44.0
I now want to take this dataset and concatenate the rows grouped on the name variable. The amount variable should be expressed as a proportion.
The final dataset should look like below, where the value is the first term and amount expressed as a proportion is the second term.
name concatenated_values
0 dvd 0.2, 44%, 0.5, 56%
1 radio 90, 67%, 100, 33%
.
.
.
Use a custom lambda function that flattens the nested lists, and pass it to GroupBy.apply:
dff = df.groupby(['name', 'value']).agg('sum').reset_index()
dff['amount'] = ((dff['amount'] / dff.groupby(['name'])['amount'].transform('sum')*100)
.round().astype(int).astype(str) + '%')
f = lambda x: ', '.join(str(z) for y in x.to_numpy() for z in y)
d = dff.groupby('name')[['value','amount']].apply(f).reset_index(name='concatenated_values')
print(d)
name concatenated_values
0 dvd 0.2, 44%, 0.5, 56%
1 radio 90.0, 67%, 100.0, 33%
2 tv 10.0, 38%, 15.0, 18%, 34.0, 44%
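The output above prints whole-number values as 90.0, while the desired output in the question shows 90. A possible variation, sketched under the assumption that the general "g" format is acceptable for the value column, builds each "value, percent" pair as a string first and then joins the pairs per name. The pair column name is made up for this example.

```python
import pandas as pd

d = [{'name': 'tv', 'value': 10, 'amount': 35},
     {'name': 'tv', 'value': 10, 'amount': 14},
     {'name': 'tv', 'value': 15, 'amount': 23},
     {'name': 'tv', 'value': 34, 'amount': 56},
     {'name': 'radio', 'value': 90, 'amount': 35},
     {'name': 'radio', 'value': 90, 'amount': 65},
     {'name': 'radio', 'value': 100, 'amount': 50},
     {'name': 'dvd', 'value': 0.5, 'amount': 35},
     {'name': 'dvd', 'value': 0.2, 'amount': 40},
     {'name': 'dvd', 'value': 0.5, 'amount': 15}]
df = pd.DataFrame(d)

# Total amount per (name, value) pair.
dff = df.groupby(['name', 'value'], as_index=False)['amount'].sum()

# Share of each pair within its name group, in percent.
share = dff['amount'] / dff.groupby('name')['amount'].transform('sum') * 100

# "g" drops the trailing ".0" from whole numbers; ".0f" rounds the percentage.
dff['pair'] = [f"{value:g}, {pct:.0f}%" for value, pct in zip(dff['value'], share)]

out = (dff.groupby('name')['pair']
          .agg(', '.join)
          .reset_index(name='concatenated_values'))
print(out)
```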
New to pandas and python so thank you in advance.
I have a table
import pandas as pd
# Create DataFrame
data = [{'analyte': 'sample1'},
{'analyte': 'bacon', 'CAS1': 1},
{'analyte': 'eggs', 'CAS1': 2},
{'analyte': 'money', 'CAS1': 3, 'CAS2': 1, 'Value2': 1.11},
{'analyte': 'shoe', 'CAS1': 4},
{'analyte': 'boy', 'CAS1': 5},
{'analyte': 'girl', 'CAS1': 6},
{'analyte': 'onion', 'CAS1': 7, 'CAS2': 4, 'Value2': 6.53},
{'analyte': 'sample2'},
{'analyte': 'bacon', 'CAS1': 1},
{'analyte': 'eggs', 'CAS1': 2, 'CAS2': 1, 'Value2': 7.88},
{'analyte': 'money', 'CAS1': 3},
{'analyte': 'shoe', 'CAS1': 4, 'CAS2': 3, 'Value2': 15.5},
{'analyte': 'boy', 'CAS1': 5},
{'analyte': 'girl', 'CAS1': 6},
{'analyte': 'onion', 'CAS1': 7}]
df = pd.DataFrame(data)
Before writing the pandas DataFrame into a MySQL database table, I need to split df into separate tables and then write each table to MySQL.
How can I split df by columns? Something like: if a column name contains the string "cas1", split those columns off:
for col in df.columns:
    if "cas1" in col:
        dfCas1 = df.split
        # add a unique index to identify which row it belongs to
    if "cas2" in col:
        dfCas2 = df.split
        # add a unique index to identify which row it belongs to
    if {"analyte","id" .etc } in col: # main table
        dfMain = df.split
dfMain.to_sql("Main", dbConnection, if_exists='fail')
dfCas1.to_sql("cas1", dbConnection, if_exists='fail')
dfCas2.to_sql("cas2", dbConnection, if_exists='fail')
expected
I'm not completely sure what you want to achieve, but I feel like you want to do something like splitting this:
+---------+----+------+--------+------+--------+
| Analyte | id | CAS1 | value1 | Cas2 | Value2 |
+---------+----+------+--------+------+--------+
| | | | | | |
+---------+----+------+--------+------+--------+
to this:
+---------+----+ +------+--------+ +------+--------+
| Analyte | id | | CAS1 | value1 | | Cas2 | Value2 |
+---------+----+ +------+--------+ +------+--------+
| | | | | | | | |
+---------+----+ +------+--------+ +------+--------+
The first one is obtained by calling e.g. df.loc[:, ['Analyte', 'id']]. For the other ones, adjust the column names.
Now for the uniq index that is within your code comments, df.loc[:] keeps the index of the original table. You can use df.reset_index() to reset it to a unique integer index. If you also want to drop empty rows in one of your subtables before parsing, have a look at df.dropna().
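Put together, the column selection and the shared index could look roughly like the sketch below. It uses a shortened copy of the data from the question; the row_id index name and the grouping of Value2 with CAS2 are assumptions made for illustration.

```python
import pandas as pd

# Shortened sample of the data from the question.
data = [{'analyte': 'sample1'},
        {'analyte': 'bacon', 'CAS1': 1},
        {'analyte': 'money', 'CAS1': 3, 'CAS2': 1, 'Value2': 1.11}]
df = pd.DataFrame(data)

# Name the index so every sub-table carries the same row identifier.
df = df.rename_axis('row_id')

# Select columns by name; the original index is kept in each sub-table.
df_main = df.loc[:, ['analyte']]
df_cas1 = df.loc[:, ['CAS1']].dropna(how='all')             # drop rows without CAS1 data
df_cas2 = df.loc[:, ['CAS2', 'Value2']].dropna(how='all')   # drop rows without CAS2 data

print(df_main, df_cas1, df_cas2, sep='\n\n')
```

Since to_sql writes the index as a column by default, each table written this way would contain row_id, which can be used to join the tables back together.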
I am not 100% sure if this is what you mean, but you can select columns by name with a boolean mask over df.columns:
has_cas1 = df.columns.str.contains('cas1', case=False)
has_cas2 = df.columns.str.contains('cas2', case=False)
dfCas1 = df.loc[:, has_cas1]
dfCas2 = df.loc[:, has_cas2]
dfMain = df.loc[:, ~(has_cas1 | has_cas2)]
The ~ sign negates the selection, so dfMain keeps all columns whose names contain neither cas1 nor cas2. I hope this makes sense.
Given these data frames:
IncomingCount
-------------------------
Venue|Date | 08 | 10 |
-------------------------
Hotel|20190101| 15 | 03 |
Beach|20190101| 93 | 45 |
OutgoingCount
-------------------------
Venue|Date | 07 | 10 |
-------------------------
Beach|20190101| 30 | 5 |
Hotel|20190103| 05 | 15 |
How can I possibly merge (full join) the two tables resulting in something as following without having to manually loop through each row of both tables?
Dictionary:
[
{"Venue":"Hotel", "Date":"20190101", "08":{ "IncomingCount":15 }, "10":{ "IncomingCount":03 } },
{"Venue":"Beach", "Date":"20190101", "07":{ "OutgoingCount":30 }, "08":{ "IncomingCount":93 }, "10":{ "IncomingCount":45, "OutgoingCount":5 } },
{"Venue":"Hotel", "Date":"20190103", "07":{ "OutgoingCount":05 }, "10":{ "OutgoingCount":15 } }
]
The conditions are:
Venue and Date columns act like join conditions.
The other columns, represented in numbers, are dynamically created.
If a dynamic column does not exist for a row, it gets excluded (or included with None as its value).
It's pretty fiddly, but it can be done by making use of the create_map function from Spark.
Basically, divide the columns into four groups: keys (Venue, Date), common (10), only incoming (08), only outgoing (07).
Then create mappers per group (except keys), mapping only what's available per group. Apply the mapping, drop the old column, and rename the mapped column to the old name.
Lastly, convert all rows to dicts (from the df's rdd) and collect.
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, col, lit
spark = SparkSession.builder.appName('hotels_and_beaches').getOrCreate()
incoming_counts = spark.createDataFrame([('Hotel', 20190101, 15, 3), ('Beach', 20190101, 93, 45)], ['Venue', 'Date', '08', '10']).alias('inc')
outgoing_counts = spark.createDataFrame([('Beach', 20190101, 30, 5), ('Hotel', 20190103, 5, 15)], ['Venue', 'Date', '07', '10']).alias('out')
df = incoming_counts.join(outgoing_counts, on=['Venue', 'Date'], how='full')
outgoing_cols = {c for c in outgoing_counts.columns if c not in {'Venue', 'Date'}}
incoming_cols = {c for c in incoming_counts.columns if c not in {'Venue', 'Date'}}
common_cols = outgoing_cols.intersection(incoming_cols)
outgoing_cols = outgoing_cols.difference(common_cols)
incoming_cols = incoming_cols.difference(common_cols)
for c in common_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
for c in incoming_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
for c in outgoing_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
result = df.coalesce(1).rdd.map(lambda r: r.asDict()).collect()
print(result)
result:
[{'Venue': 'Hotel', 'Date': 20190101, '10': {'OutgoingCount': None, 'IncomingCount': 3}, '08': {'IncomingCount': 15}, '07': {'OutgoingCount': None}}, {'Venue': 'Hotel', 'Date': 20190103, '10': {'OutgoingCount': 15, 'IncomingCount': None}, '08': {'IncomingCount': None}, '07': {'OutgoingCount': 5}}, {'Venue': 'Beach', 'Date': 20190101, '10': {'OutgoingCount': 5, 'IncomingCount': 45}, '08': {'IncomingCount': 93}, '07': {'OutgoingCount': 30}}]
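The collected result still contains None entries, whereas the question prefers missing columns to be excluded. One way to get there is a small post-processing step on the collected dictionaries, sketched below in plain Python. The drop_missing helper is a hypothetical name, and the sample list repeats the result shown above.

```python
def drop_missing(record):
    """Remove None counts, and drop columns that end up with no counts at all."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, dict):
            value = {k: v for k, v in value.items() if v is not None}
            if not value:
                continue  # nothing left for this column, so skip it entirely
        cleaned[key] = value
    return cleaned


# Sample input copied from the collected result above.
result = [
    {'Venue': 'Hotel', 'Date': 20190101, '10': {'OutgoingCount': None, 'IncomingCount': 3},
     '08': {'IncomingCount': 15}, '07': {'OutgoingCount': None}},
    {'Venue': 'Hotel', 'Date': 20190103, '10': {'OutgoingCount': 15, 'IncomingCount': None},
     '08': {'IncomingCount': None}, '07': {'OutgoingCount': 5}},
    {'Venue': 'Beach', 'Date': 20190101, '10': {'OutgoingCount': 5, 'IncomingCount': 45},
     '08': {'IncomingCount': 93}, '07': {'OutgoingCount': 30}},
]

cleaned = [drop_missing(r) for r in result]
print(cleaned)
```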
I can get this so far:
import pandas as pd
import numpy as np
dd1 = {'venue': ['hotel', 'beach'], 'date':['20190101', '20190101'], '08': [15, 93], '10':[3, 45]}
dd2 = {'venue': ['beach', 'hotel'], 'date':['20190101', '20190103'], '07': [30, 5], '10':[5, 15]}
df1 = pd.DataFrame(data=dd1)
df2 = pd.DataFrame(data=dd2)
df1.columns = [f"IncomingCount:{x}" if x not in ['venue', 'date'] else x for x in df1.columns]
df2.columns = [f"OutgoingCount:{x}" if x not in ['venue', 'date'] else x for x in df2.columns ]
ll_dd = pd.merge(df1, df2, on=['venue', 'date'], how='outer').to_dict('records')
ll_dd = [{k:v for k,v in dd.items() if not pd.isnull(v)} for dd in ll_dd]
OUTPUT:
[{'venue': 'hotel',
'date': '20190101',
'IncomingCount:08': 15.0,
'IncomingCount:10': 3.0},
{'venue': 'beach',
'date': '20190101',
'IncomingCount:08': 93.0,
'IncomingCount:10': 45.0,
'OutgoingCount:07': 30.0,
'OutgoingCount:10': 5.0},
{'venue': 'hotel',
'date': '20190103',
'OutgoingCount:07': 5.0,
'OutgoingCount:10': 15.0}]
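To go from there to the nested form asked for in the question, one option is to skip the merge and fold both frames into a dictionary keyed by (venue, date). This is a sketch rather than a finished solution; the merged and record names are made up for the example.

```python
import pandas as pd

dd1 = {'venue': ['hotel', 'beach'], 'date': ['20190101', '20190101'], '08': [15, 93], '10': [3, 45]}
dd2 = {'venue': ['beach', 'hotel'], 'date': ['20190101', '20190103'], '07': [30, 5], '10': [5, 15]}
df1 = pd.DataFrame(data=dd1)
df2 = pd.DataFrame(data=dd2)

merged = {}
for label, frame in (('IncomingCount', df1), ('OutgoingCount', df2)):
    for row in frame.to_dict('records'):
        # One output record per (venue, date) pair, created on first sight.
        key = (row['venue'], row['date'])
        record = merged.setdefault(key, {'venue': row['venue'], 'date': row['date']})
        for column, value in row.items():
            if column not in ('venue', 'date'):
                # Nest the count under its column name and source label.
                record.setdefault(column, {})[label] = value

ll_dd = list(merged.values())
print(ll_dd)
```

Columns that a frame does not have are never added to a record, so no null filtering is needed afterwards.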
The final result as desired by the OP is a list of dictionaries, where all rows from the DataFrame which have same Venue and Date have been clubbed together.
# Creating the DataFrames
df_Incoming = sqlContext.createDataFrame([('Hotel','20190101',15,3),('Beach','20190101',93,45)],('Venue','Date','08','10'))
df_Incoming.show()
+-----+--------+---+---+
|Venue| Date| 08| 10|
+-----+--------+---+---+
|Hotel|20190101| 15| 3|
|Beach|20190101| 93| 45|
+-----+--------+---+---+
df_Outgoing = sqlContext.createDataFrame([('Beach','20190101',30,5),('Hotel','20190103',5,15)],('Venue','Date','07','10'))
df_Outgoing.show()
+-----+--------+---+---+
|Venue| Date| 07| 10|
+-----+--------+---+---+
|Beach|20190101| 30| 5|
|Hotel|20190103| 5| 15|
+-----+--------+---+---+
The idea is to create a dictionary from each row and store all rows of the DataFrame as dictionaries in one big list. As a final step, we club together those dictionaries that have the same Venue and Date.
Since all rows in the DataFrame are stored as Row() objects, we use the collect() function to return all records as a list of Row(). Just to illustrate the output -
print(df_Incoming.collect())
[Row(Venue='Hotel', Date='20190101', 08=15, 10=3), Row(Venue='Beach', Date='20190101', 08=93, 10=45)]
But since we want a list of dictionaries, we can use a list comprehension to convert the rows into one -
list_Incoming = [row.asDict() for row in df_Incoming.collect()]
print(list_Incoming)
[{'10': 3, 'Date': '20190101', 'Venue': 'Hotel', '08': 15}, {'10': 45, 'Date': '20190101', 'Venue': 'Beach', '08': 93}]
But since the numeric columns should be in the form "08":{ "IncomingCount":15 } instead of "08":15, we employ dictionary comprehensions to convert them into this form -
list_Incoming = [ {k:v if k in ['Venue','Date'] else {'IncomingCount':v} for k,v in dict_element.items()} for dict_element in list_Incoming]
print(list_Incoming)
[{'10': {'IncomingCount': 3}, 'Date': '20190101', 'Venue': 'Hotel', '08': {'IncomingCount': 15}}, {'10': {'IncomingCount': 45}, 'Date': '20190101', 'Venue': 'Beach', '08': {'IncomingCount': 93}}]
We do the same for OutgoingCount -
list_Outgoing = [row.asDict() for row in df_Outgoing.collect()]
list_Outgoing = [ {k:v if k in ['Venue','Date'] else {'OutgoingCount':v} for k,v in dict_element.items()} for dict_element in list_Outgoing]
print(list_Outgoing)
[{'10': {'OutgoingCount': 5}, 'Date': '20190101', 'Venue': 'Beach', '07': {'OutgoingCount': 30}}, {'10': {'OutgoingCount': 15}, 'Date': '20190103', 'Venue': 'Hotel', '07': {'OutgoingCount': 5}}]
Final Step: Now that we have created the requisite lists of dictionaries, we need to club them together on the basis of Venue and Date.
from copy import deepcopy
def merge_lists(list_Incoming, list_Outgoing):
    # create dictionary from list_Incoming:
    dict1 = {(record['Venue'], record['Date']): record for record in list_Incoming}
    # compare elements in list_Outgoing to those on list_Incoming:
    result = {}
    for record in list_Outgoing:
        ckey = record['Venue'], record['Date']
        new_record = deepcopy(record)
        if ckey in dict1:
            for key, value in dict1[ckey].items():
                if key in ('Venue', 'Date'):
                    # Do not merge these keys
                    continue
                # Dict's "setdefault" finds a key/value, and if it is missing
                # creates a new one with the second parameter as value
                new_record.setdefault(key, {}).update(value)
        result[ckey] = new_record
    # Add values from list_Incoming that were not matched in list_Outgoing:
    for key, value in dict1.items():
        if key not in result:
            result[key] = deepcopy(value)
    return list(result.values())
res = merge_lists(list_Incoming, list_Outgoing)
print(res)
[{'10': {'OutgoingCount': 5, 'IncomingCount': 45},
'Date': '20190101',
'Venue': 'Beach',
'08': {'IncomingCount': 93},
'07': {'OutgoingCount': 30}
},
{'10': {'OutgoingCount': 15},
'Date': '20190103',
'Venue': 'Hotel',
'07': {'OutgoingCount': 5}
},
{'10': {'IncomingCount': 3},
'Date': '20190101',
'Venue': 'Hotel',
'08': {'IncomingCount': 15}
}]
So I'm trying to convert this list of dictionaries
List of dictionaries
[{
'cast_id': 1,
'character': 'W',
'credit_id': '5',
'gender': 2,
'id': 31,
'name': 'To',
'order': 0,
'profile_path': 'pQ'
},
{
'cast_id': 2,
'character': 'Bu',
'credit_id': '52',
'gender': 2,
'id': 12,
'name': 'Ti',
'order': 1,
'profile_path': 'uX'}]
into this dataframe structure:
Pandas DataFrame
|---------|-----------|-----------|--------|----|------|-------|--------------|
| cast_id | character | credit_id | gender | id | name | order | profile_path |
|---------|-----------|-----------|--------|----|------|-------|--------------|
| 1 | W | 5 | 2 | 31 | To | 0 | pQ |
|---------|-----------|-----------|--------|----|------|-------|--------------|
| 2 | Bu | 52 | 2 | 12 | Ti | 1 | uX |
|---------|-----------|-----------|--------|----|------|-------|--------------|
in Python, and I don't know how to do this.
You can use the ast module to convert the string list to a list object using ast.literal_eval.
Demo:
import pandas as pd
import ast
d = "[{'cast_id': 1, 'character': 'W', 'credit_id': '5', 'gender': 2, 'id': 31, 'name': 'To', 'order': 0, 'profile_path': 'pQ'} , {'cast_id': 2, 'character': 'Bu', 'credit_id': '52', 'gender': 2, 'id': 12, 'name': 'Ti', 'order': 1, 'profile_path': 'uX'}]"
d = ast.literal_eval(d)
df = pd.DataFrame(d) #Convert to a Dataframe.
print(df)
Output:
cast_id character credit_id gender id name order profile_path
0 1 W 5 2 31 To 0 pQ
1 2 Bu 52 2 12 Ti 1 uX
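If it is unclear whether the data arrives as a string or as an actual list, a small wrapper can handle both cases. This is a sketch; the cast_to_frame name is made up for the example, and ast.literal_eval is only applied when the input is a string.

```python
import ast

import pandas as pd


def cast_to_frame(cast):
    """Build a DataFrame from a list of dicts, or from its string representation."""
    if isinstance(cast, str):
        # Safely parse the Python literal; this does not execute arbitrary code.
        cast = ast.literal_eval(cast)
    return pd.DataFrame(cast)


as_list = [{'cast_id': 1, 'character': 'W', 'credit_id': '5', 'gender': 2,
            'id': 31, 'name': 'To', 'order': 0, 'profile_path': 'pQ'},
           {'cast_id': 2, 'character': 'Bu', 'credit_id': '52', 'gender': 2,
            'id': 12, 'name': 'Ti', 'order': 1, 'profile_path': 'uX'}]
as_string = str(as_list)

df_from_list = cast_to_frame(as_list)
df_from_string = cast_to_frame(as_string)
print(df_from_string)
```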
Please find my solution from what I understood.
You can use pandas DataFrame.from_dict(data, orient='columns', dtype=None).
import pandas as pd
def main():
    data = [{'cast_id': 1, 'character': 'W', 'credit_id': '5', 'gender': 2, 'id': 31, 'name': 'To', 'order': 0, 'profile_path': 'pQ'} , {'cast_id': 2, 'character': 'Bu', 'credit_id': '52', 'gender': 2, 'id': 12, 'name': 'Ti', 'order': 1, 'profile_path': 'uX'}]
    df = pd.DataFrame.from_dict(data)
    print(df)
if __name__ == '__main__':
    main()
Output
cast_id character credit_id gender id name order profile_path
0 1 W 5 2 31 To 0 pQ
1 2 Bu 52 2 12 Ti 1 uX