Create multiple DataFrames from a single DataFrame based on conditions by columns - python

New to pandas and Python, so thank you in advance.
I have a table:
# Create DataFrame
data = [{'analyte': 'sample1'},
        {'analyte': 'bacon', 'CAS1': 1},
        {'analyte': 'eggs', 'CAS1': 2},
        {'analyte': 'money', 'CAS1': 3, 'CAS2': 1, 'Value2': 1.11},
        {'analyte': 'shoe', 'CAS1': 4},
        {'analyte': 'boy', 'CAS1': 5},
        {'analyte': 'girl', 'CAS1': 6},
        {'analyte': 'onion', 'CAS1': 7, 'CAS2': 4, 'Value2': 6.53},
        {'analyte': 'sample2'},
        {'analyte': 'bacon', 'CAS1': 1},
        {'analyte': 'eggs', 'CAS1': 2, 'CAS2': 1, 'Value2': 7.88},
        {'analyte': 'money', 'CAS1': 3},
        {'analyte': 'shoe', 'CAS1': 4, 'CAS2': 3, 'Value2': 15.5},
        {'analyte': 'boy', 'CAS1': 5},
        {'analyte': 'girl', 'CAS1': 6},
        {'analyte': 'onion', 'CAS1': 7}]
df = pd.DataFrame(data)
Before writing the pandas DataFrame into a MySQL database table, I need to split df into separate tables and then write each one to MySQL.
How can I split df by columns? Something like: if a column name contains the string "cas1", split it out:
for col in df.columns:
    if "cas1" in col:
        dfCas1 = df.split
        # add a unique index to identify which row it belongs to
    if "cas2" in col:
        dfCas2 = df.split
        # add a unique index to identify which row it belongs to
    if {"analyte", "id", ...} in col:  # main table
        dfMain = df.split
dfMain.to_sql("Main", dbConnection, if_exists='fail')
dfCas1.to_sql("cas1", dbConnection, if_exists='fail')
dfCas2.to_sql("cas2", dbConnection, if_exists='fail')

I'm not completely sure what you want to achieve, but I feel like you want to do something like splitting this:
+---------+----+------+--------+------+--------+
| Analyte | id | CAS1 | value1 | Cas2 | Value2 |
+---------+----+------+--------+------+--------+
| | | | | | |
+---------+----+------+--------+------+--------+
to this:
+---------+----+ +------+--------+ +------+--------+
| Analyte | id | | CAS1 | value1 | | Cas2 | Value2 |
+---------+----+ +------+--------+ +------+--------+
| | | | | | | | |
+---------+----+ +------+--------+ +------+--------+
The first one is obtained by calling e.g. df.loc[:, ['Analyte', 'id']]. For the other ones, adjust the column names.
As for the unique index mentioned in your code comments: df.loc[:] keeps the index of the original table. You can use df.reset_index() to reset it to a unique integer index. If you also want to drop empty rows in one of your sub-tables before writing, have a look at df.dropna().
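A minimal sketch of that splitting, using the column names from the question (the sub-table names and the dropna/reset_index choices are assumptions about the desired layout):

```python
import pandas as pd

data = [{'analyte': 'sample1'},
        {'analyte': 'money', 'CAS1': 3, 'CAS2': 1, 'Value2': 1.11},
        {'analyte': 'shoe', 'CAS1': 4},
        {'analyte': 'onion', 'CAS1': 7, 'CAS2': 4, 'Value2': 6.53}]
df = pd.DataFrame(data)

# Each sub-table keeps the original row index (as an "index" column)
# so rows can be matched back to the main table later.
dfMain = df.loc[:, ['analyte']].reset_index()
dfCas1 = df.loc[:, ['CAS1']].dropna().reset_index()
dfCas2 = df.loc[:, ['CAS2', 'Value2']].dropna().reset_index()

print(dfCas2)
```

Each frame could then be written out as in the question, e.g. dfMain.to_sql("Main", dbConnection, if_exists='fail').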

I am not 100% sure if this is what you mean, but to select columns whose names contain a substring:
dfCas1 = df.loc[:, df.columns.str.contains('cas1', case=False)]
dfCas2 = df.loc[:, df.columns.str.contains('cas2', case=False)]
dfMain = df.loc[:, ~(df.columns.str.contains('cas1', case=False) | df.columns.str.contains('cas2', case=False))]
The ~ sign negates the selection, so dfMain keeps all columns whose names contain neither cas1 nor cas2. I hope this makes sense.

Related

Get top rows by one column Django

I'm building a job to categorize earnings and expenses from an app of financial movements.
For this, I need to get the most frequently used category for each title.
For example, in this scenario:
| title | category | count |
| ----- | -------------- | ----- |
| Pizza | food | 6 |
| Pizza | others_expense | 1 |
| Pizza | refund | 1 |
I want to return just the first row, because the title is the same and the category food is used most frequently.
Code example
I want to get the result using just the Django ORM, because I have different databases and it is faster than iterating over a large list.
Model:
class Movimentation(models.Model):
    title = models.CharField(max_length=50)
    value = models.FloatField()
    category = models.CharField(max_length=50)
Query:
My current Django ORM query is:
Movimentation.objects \
    .values('title', 'category') \
    .annotate(count=Count('*'))
Result:
[
    {'title': 'Pizza', 'category': 'food', 'count': 6},
    {'title': 'Pizza', 'category': 'others_expense', 'count': 1},
    {'title': 'Pizza', 'category': 'refund', 'count': 1},
    {'title': 'Hamburguer', 'category': 'food', 'count': 1},
    {'title': 'Clothing', 'category': 'personal', 'count': 18},
    {'title': 'Clothing', 'category': 'home', 'count': 15},
    {'title': 'Clothing', 'category': 'others_expense', 'count': 1}
]
Expected result:
In this case, I get just one row per title, with the most used category.
[
    {'title': 'Pizza', 'category': 'food', 'count': 6},
    {'title': 'Hamburguer', 'category': 'food', 'count': 1},
    {'title': 'Clothing', 'category': 'personal', 'count': 18}
]
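Outside the ORM, the reduction being asked for (one row per title, keeping the most frequent category) can be sketched in plain Python over the annotated rows above; this only illustrates the expected behaviour, not an ORM answer:

```python
# Sample rows as returned by the .values().annotate() query in the question.
rows = [
    {'title': 'Pizza', 'category': 'food', 'count': 6},
    {'title': 'Pizza', 'category': 'others_expense', 'count': 1},
    {'title': 'Pizza', 'category': 'refund', 'count': 1},
    {'title': 'Hamburguer', 'category': 'food', 'count': 1},
    {'title': 'Clothing', 'category': 'personal', 'count': 18},
    {'title': 'Clothing', 'category': 'home', 'count': 15},
    {'title': 'Clothing', 'category': 'others_expense', 'count': 1},
]

# Keep, per title, the row with the highest count.
best = {}
for row in rows:
    current = best.get(row['title'])
    if current is None or row['count'] > current['count']:
        best[row['title']] = row
result = list(best.values())
print(result)
```

Within the ORM, comparable results are commonly obtained with a Subquery ordered by -count and sliced to [:1], or, on PostgreSQL only, with .order_by('title', '-count').distinct('title').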

Merge 2 pandas dictionary columns

I have a dataframe simplified here with 3 columns.
| id | channels | facebookCount |
|:---- |:------:| -----:|
| 0 | {'channel': 'Google', 'count': 0.0} | 3 |
| 1 | {'channel': 'Google', 'count': 4.0} | 0 |
| 2 | {'channel': 'Google', 'count': 3.0} | 6 |
The channels column was a simple count column like facebookCount. However, I transformed it into a dictionary using apply and lambda, as such:
data_df["channels"] = data_df["googleCount"].apply(
    lambda x: {} if x is None else {"channel": "Google", "count": x})
How can I construct the channels column so that it has data for both Facebook and Google, i.e. a list containing 2 dictionaries as seen below:
| id | channels |
|:---- |:------:|
| 0 | [{'channel': 'Google', 'count': 0.0}, {'channel': 'Facebook', 'count': 3.0}] |
| 1 | [{'channel': 'Google', 'count': 4.0}, {'channel': 'Facebook', 'count': 0.0}] |
| 2 | [{'channel': 'Google', 'count': 3.0}, {'channel': 'Facebook', 'count': 6.0}] |
I have tried creating both dictionaries and then setting channels, as well as creating one dictionary and then merging the two using apply and lambda plus a helper function, as such:
dict1 = data_df["30DayGoogleCampaignCount"].apply(
    lambda x: {"channel": "Google", "count": x})
data_df["paidMediaChannels"] = data_df["30DayFacebookCampaignCount"].apply(
    lambda x: self.Merge(dict1, {"channel": "facebook", "count": x}))

def Merge(self, dict1, dict2):
    return(dict2.update(dict1))
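One likely reason the helper above returns nothing useful: dict.update mutates its receiver in place and returns None, so Merge always returns None. A merge that returns a new dict would look something like this (a sketch; the lowercase name merge is illustrative):

```python
def merge(dict1, dict2):
    # Copy dict2 so neither input is mutated, then overlay dict1.
    merged = dict(dict2)
    merged.update(dict1)
    return merged

print(merge({'channel': 'Google'}, {'count': 4.0}))
# {'count': 4.0, 'channel': 'Google'}
```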
Try something like:
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2],
                   'channels': [{'channel': 'Google', 'count': 0.0},
                                {'channel': 'Google', 'count': 4.0},
                                {'channel': 'Google', 'count': 3.0}],
                   'facebookCount': [3, 0, 6]})

# Build a two-element list per row
df['channels'] = df.apply(
    lambda x: [x['channels'],
               {'channel': 'Facebook',
                'count': x['facebookCount']}],
    axis=1
)
# Drop the facebookCount column
df = df.drop(columns='facebookCount')
print(df.to_string())
df:
id channels
0 0 [{'channel': 'Google', 'count': 0.0}, {'channel': 'Facebook', 'count': 3}]
1 1 [{'channel': 'Google', 'count': 4.0}, {'channel': 'Facebook', 'count': 0}]
2 2 [{'channel': 'Google', 'count': 3.0}, {'channel': 'Facebook', 'count': 6}]

How to use lambda in agg and groupBy when using pyspark?

I am just starting to study pyspark, and I got confused by the following code:
df.groupBy(['Category', 'Register']).agg({'NetValue': 'sum',
                                          'Units': 'mean'}).show(5, truncate=False)
df.groupBy(['Category', 'Register']).agg({'NetValue': 'sum',
                                          'Units': lambda x: pd.Series(x).nunique()}).show(5, truncate=False)
The first line works, but the second line fails. The error message is:
AttributeError: 'function' object has no attribute '_get_object_id'
It looks like I did not use the lambda function correctly, yet this is how I would use a lambda in a normal Python environment, where it works.
Could anyone help me here?
If you are okay with the performance of PySpark primitives using pure Python functions, the following code gives the desired result. You can modify the logic in _map to suit your specific need. I made some assumptions about what your data schema might look like.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Register', LongType(), True),
    StructField('NetValue', LongType(), True),
    StructField('Units', LongType(), True)
])
test_records = [
    {'Category': 'foo', 'Register': 1, 'NetValue': 1, 'Units': 1},
    {'Category': 'foo', 'Register': 1, 'NetValue': 2, 'Units': 2},
    {'Category': 'foo', 'Register': 2, 'NetValue': 3, 'Units': 3},
    {'Category': 'foo', 'Register': 2, 'NetValue': 4, 'Units': 4},
    {'Category': 'bar', 'Register': 1, 'NetValue': 5, 'Units': 5},
    {'Category': 'bar', 'Register': 1, 'NetValue': 6, 'Units': 6},
    {'Category': 'bar', 'Register': 2, 'NetValue': 7, 'Units': 7},
    {'Category': 'bar', 'Register': 2, 'NetValue': 8, 'Units': 8}
]
spark = SparkSession.builder.getOrCreate()
dataframe = spark.createDataFrame(test_records, schema)

def _map(pair):
    # Python 3 removed tuple parameter unpacking, so unpack explicitly
    (category, register), records = pair
    net_value_sum = 0
    uniques = set()
    for record in records:
        net_value_sum += record['NetValue']
        uniques.add(record['Units'])
    return category, register, net_value_sum, len(uniques)

new_dataframe = spark.createDataFrame(
    dataframe.rdd.groupBy(lambda x: (x['Category'], x['Register'])).map(_map),
    schema
)
new_dataframe.show()
Result:
+--------+--------+--------+-----+
|Category|Register|NetValue|Units|
+--------+--------+--------+-----+
| bar| 2| 15| 2|
| foo| 1| 3| 2|
| foo| 2| 7| 2|
| bar| 1| 11| 2|
+--------+--------+--------+-----+
If you need performance or to stick with the pyspark.sql framework, then see this related question and its linked questions:
Custom aggregation on PySpark dataframes

How to join dynamically named columns into dictionary?

Given these data frames:
IncomingCount
-------------------------
Venue|Date | 08 | 10 |
-------------------------
Hotel|20190101| 15 | 03 |
Beach|20190101| 93 | 45 |
OutgoingCount
-------------------------
Venue|Date | 07 | 10 |
-------------------------
Beach|20190101| 30 | 5 |
Hotel|20190103| 05 | 15 |
How can I merge (full join) the two tables, resulting in something like the following, without having to manually loop through each row of both tables?
Dictionary:
[
{"Venue":"Hotel", "Date":"20190101", "08":{ "IncomingCount":15 }, "10":{ "IncomingCount":03 } },
{"Venue":"Beach", "Date":"20190101", "07":{ "OutgoingCount":30 }, "08":{ "IncomingCount":93 }, "10":{ "IncomingCount":45, "OutgoingCount":15 } },
{"Venue":"Hotel", "Date":"20190103", "07":{ "OutgoingCount":05 }, "10":{ "OutgoingCount":15 } }
]
The conditions are:
Venue and Date columns act as join conditions.
The other columns, named with numbers, are dynamically created.
If a dynamic column does not exist, it gets excluded (or included with None as the value).
It's pretty fiddly, but it can be done by making use of the create_map function from Spark.
Basically, divide the columns into four groups: keys (Venue, Date), common (10), only incoming (08), only outgoing (07).
Then create mappers per group (except keys), mapping only what's available per group. Apply the mapping, drop the old column, and rename the mapped column to the old name.
Lastly, convert all rows to dicts (from the df's rdd) and collect.
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, col, lit
spark = SparkSession.builder.appName('hotels_and_beaches').getOrCreate()
incoming_counts = spark.createDataFrame([('Hotel', 20190101, 15, 3), ('Beach', 20190101, 93, 45)], ['Venue', 'Date', '08', '10']).alias('inc')
outgoing_counts = spark.createDataFrame([('Beach', 20190101, 30, 5), ('Hotel', 20190103, 5, 15)], ['Venue', 'Date', '07', '10']).alias('out')
df = incoming_counts.join(outgoing_counts, on=['Venue', 'Date'], how='full')
outgoing_cols = {c for c in outgoing_counts.columns if c not in {'Venue', 'Date'}}
incoming_cols = {c for c in incoming_counts.columns if c not in {'Venue', 'Date'}}
common_cols = outgoing_cols.intersection(incoming_cols)
outgoing_cols = outgoing_cols.difference(common_cols)
incoming_cols = incoming_cols.difference(common_cols)
for c in common_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
for c in incoming_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
for c in outgoing_cols:
    df = df.withColumn(
        c + '_new', create_map(
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)
result = df.coalesce(1).rdd.map(lambda r: r.asDict()).collect()
print(result)
result:
[{'Venue': 'Hotel', 'Date': 20190101, '10': {'OutgoingCount': None, 'IncomingCount': 3}, '08': {'IncomingCount': 15}, '07': {'OutgoingCount': None}}, {'Venue': 'Hotel', 'Date': 20190103, '10': {'OutgoingCount': 15, 'IncomingCount': None}, '08': {'IncomingCount': None}, '07': {'OutgoingCount': 5}}, {'Venue': 'Beach', 'Date': 20190101, '10': {'OutgoingCount': 5, 'IncomingCount': 45}, '08': {'IncomingCount': 93}, '07': {'OutgoingCount': 30}}]
I can get this so far:
import pandas as pd
import numpy as np
dd1 = {'venue': ['hotel', 'beach'], 'date':['20190101', '20190101'], '08': [15, 93], '10':[3, 45]}
dd2 = {'venue': ['beach', 'hotel'], 'date':['20190101', '20190103'], '07': [30, 5], '10':[5, 15]}
df1 = pd.DataFrame(data=dd1)
df2 = pd.DataFrame(data=dd2)
df1.columns = [f"IncomingCount:{x}" if x not in ['venue', 'date'] else x for x in df1.columns]
df2.columns = [f"OutgoingCount:{x}" if x not in ['venue', 'date'] else x for x in df2.columns ]
ll_dd = pd.merge(df1, df2, on=['venue', 'date'], how='outer').to_dict('records')
ll_dd = [{k:v for k,v in dd.items() if not pd.isnull(v)} for dd in ll_dd]
OUTPUT:
[{'venue': 'hotel',
'date': '20190101',
'IncomingCount:08': 15.0,
'IncomingCount:10': 3.0},
{'venue': 'beach',
'date': '20190101',
'IncomingCount:08': 93.0,
'IncomingCount:10': 45.0,
'OutgoingCount:07': 30.0,
'OutgoingCount:10': 5.0},
{'venue': 'hotel',
'date': '20190103',
'OutgoingCount:07': 5.0,
'OutgoingCount:10': 15.0}]
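The flat "IncomingCount:08"-style keys produced by this approach can then be nested into the shape the question asks for with a small post-processing step (a sketch; nest is an illustrative helper name):

```python
def nest(record):
    # Turn 'IncomingCount:08': 93.0 into '08': {'IncomingCount': 93.0},
    # merging both directions under the same numeric column.
    out = {}
    for key, value in record.items():
        if ':' in key:
            direction, column = key.split(':', 1)
            out.setdefault(column, {})[direction] = value
        else:
            out[key] = value
    return out

flat = {'venue': 'beach', 'date': '20190101',
        'IncomingCount:08': 93.0, 'IncomingCount:10': 45.0,
        'OutgoingCount:07': 30.0, 'OutgoingCount:10': 5.0}
nested = nest(flat)
print(nested)
```

Applied over the whole merged list: [nest(r) for r in ll_dd].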
The final result as desired by the OP is a list of dictionaries, where all rows from the DataFrame which have same Venue and Date have been clubbed together.
# Creating the DataFrames
df_Incoming = sqlContext.createDataFrame([('Hotel','20190101',15,3),('Beach','20190101',93,45)],('Venue','Date','08','10'))
df_Incoming.show()
+-----+--------+---+---+
|Venue| Date| 08| 10|
+-----+--------+---+---+
|Hotel|20190101| 15| 3|
|Beach|20190101| 93| 45|
+-----+--------+---+---+
df_Outgoing = sqlContext.createDataFrame([('Beach','20190101',30,5),('Hotel','20190103',5,15)],('Venue','Date','07','10'))
df_Outgoing.show()
+-----+--------+---+---+
|Venue| Date| 07| 10|
+-----+--------+---+---+
|Beach|20190101| 30| 5|
|Hotel|20190103| 5| 15|
+-----+--------+---+---+
The idea is to create a dictionary from each row and have all rows of the DataFrame stored as dictionaries in one big list. As a final step, we club together those dictionaries which have the same Venue and Date.
Since all rows in the DataFrame are stored as Row() objects, we use the collect() function to return all records as a list of Row(). Just to illustrate the output:
print(df_Incoming.collect())
[Row(Venue='Hotel', Date='20190101', 08=15, 10=3), Row(Venue='Beach', Date='20190101', 08=93, 10=45)]
But since we want a list of dictionaries, we can use a list comprehension to convert them:
list_Incoming = [row.asDict() for row in df_Incoming.collect()]
print(list_Incoming)
[{'10': 3, 'Date': '20190101', 'Venue': 'Hotel', '08': 15}, {'10': 45, 'Date': '20190101', 'Venue': 'Beach', '08': 93}]
But since we want the numeric columns in the form "08": { "IncomingCount": 15 } instead of "08": 15, we employ a dictionary comprehension to convert them:
list_Incoming = [ {k:v if k in ['Venue','Date'] else {'IncomingCount':v} for k,v in dict_element.items()} for dict_element in list_Incoming]
print(list_Incoming)
[{'10': {'IncomingCount': 3}, 'Date': '20190101', 'Venue': 'Hotel', '08': {'IncomingCount': 15}}, {'10': {'IncomingCount': 45}, 'Date': '20190101', 'Venue': 'Beach', '08': {'IncomingCount': 93}}]
We do the same for OutgoingCount:
list_Outgoing = [row.asDict() for row in df_Outgoing.collect()]
list_Outgoing = [ {k:v if k in ['Venue','Date'] else {'OutgoingCount':v} for k,v in dict_element.items()} for dict_element in list_Outgoing]
print(list_Outgoing)
[{'10': {'OutgoingCount': 5}, 'Date': '20190101', 'Venue': 'Beach', '07': {'OutgoingCount': 30}}, {'10': {'OutgoingCount': 15}, 'Date': '20190103', 'Venue': 'Hotel', '07': {'OutgoingCount': 5}}]
Final step: now that we have created the requisite lists of dictionaries, we need to club them together on the basis of Venue and Date.
from copy import deepcopy

def merge_lists(list_Incoming, list_Outgoing):
    # create dictionary from list_Incoming:
    dict1 = {(record['Venue'], record['Date']): record for record in list_Incoming}
    # compare elements in list_Outgoing to those in list_Incoming:
    result = {}
    for record in list_Outgoing:
        ckey = record['Venue'], record['Date']
        new_record = deepcopy(record)
        if ckey in dict1:
            for key, value in dict1[ckey].items():
                if key in ('Venue', 'Date'):
                    # do not merge these keys
                    continue
                # dict's "setdefault" finds a key/value, and if it is missing
                # creates a new one with the second parameter as value
                new_record.setdefault(key, {}).update(value)
        result[ckey] = new_record
    # add values from list_Incoming that were not matched in list_Outgoing:
    for key, value in dict1.items():
        if key not in result:
            result[key] = deepcopy(value)
    return list(result.values())

res = merge_lists(list_Incoming, list_Outgoing)
print(res)
[{'10': {'OutgoingCount': 5, 'IncomingCount': 45},
'Date': '20190101',
'Venue': 'Beach',
'08': {'IncomingCount': 93},
'07': {'OutgoingCount': 30}
},
{'10': {'OutgoingCount': 15},
'Date': '20190103',
'Venue': 'Hotel',
'07': {'OutgoingCount': 5}
},
{'10': {'IncomingCount': 3},
'Date': '20190101',
'Venue': 'Hotel',
'08': {'IncomingCount': 15}
}]

Aggregate List of Map in PySpark

I have a list of maps, e.g.
[{'a': 10, 'b': 20}, {'a': 5, 'b': 20}, {'b': 20}, {'a': 0, 'b': 20}]
I want to get the average of the values of a and b. So the expected output is
a = (10 + 5 + 0) / 3 = 5
b = 80 / 4 = 20
How can I do this efficiently using an RDD?
The easiest might be to map each rdd element to a format like:
init = {'a': {'sum': 0, 'cnt': 0}, 'b': {'sum': 0, 'cnt': 0}}
i.e. record the sum and count for each key, and then reduce.
Map function:
def map_fun(d, keys=['a', 'b']):
    map_d = {}
    for k in keys:
        if k in d:
            temp = {'sum': d[k], 'cnt': 1}
        else:
            temp = {'sum': 0, 'cnt': 0}
        map_d[k] = temp
    return map_d
Reduce function:
def reduce_fun(a, b, keys=['a', 'b']):
    from collections import defaultdict
    reduce_d = defaultdict(dict)
    for k in keys:
        reduce_d[k]['sum'] = a[k]['sum'] + b[k]['sum']
        reduce_d[k]['cnt'] = a[k]['cnt'] + b[k]['cnt']
    return reduce_d
rdd.map(map_fun).reduce(reduce_fun)
# defaultdict(<type 'dict'>, {'a': {'sum': 15, 'cnt': 3}, 'b': {'sum': 80, 'cnt': 4}})
Calculate the average:
d = rdd.map(map_fun).reduce(reduce_fun)
{k: v['sum']/v['cnt'] for k, v in d.items()}
{'a': 5, 'b': 20}
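Since map_fun and reduce_fun are plain Python, the pipeline can be sanity-checked locally without Spark by swapping rdd.map/rdd.reduce for the built-in equivalents (a sketch; definitions are repeated in condensed form so it is self-contained):

```python
from functools import reduce

def map_fun(d, keys=('a', 'b')):
    # per-record partial sums and counts
    return {k: {'sum': d.get(k, 0), 'cnt': 1 if k in d else 0} for k in keys}

def reduce_fun(a, b, keys=('a', 'b')):
    # combine two partial results key-wise
    return {k: {'sum': a[k]['sum'] + b[k]['sum'],
                'cnt': a[k]['cnt'] + b[k]['cnt']} for k in keys}

data = [{'a': 10, 'b': 20}, {'a': 5, 'b': 20}, {'b': 20}, {'a': 0, 'b': 20}]
agg = reduce(reduce_fun, map(map_fun, data))
averages = {k: v['sum'] / v['cnt'] for k, v in agg.items()}
print(averages)  # {'a': 5.0, 'b': 20.0}
```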
Given the structure of your data, you should be able to use the dataframe api to achieve this calculation. If you need an rdd, it is not too hard to get from the dataframe back to an rdd.
from pyspark.sql import functions as F
df = spark.createDataFrame([{'a' : 10,'b': 20}, {'a' : 5,'b': 20} , {'b': 20} ,{'a' : 0,'b': 20}])
The dataframe looks like this:
+----+---+
| a| b|
+----+---+
| 10| 20|
| 5| 20|
|null| 20|
| 0| 20|
+----+---+
Then it is simple to calculate the averages using the pyspark.sql functions:
cols = df.columns
df_means = df.agg(*[F.mean(F.col(col)).alias(col+"_mean") for col in cols])
df_means.show()
OUTPUT:
+------+------+
|a_mean|b_mean|
+------+------+
| 5.0| 20.0|
+------+------+
You can use a defaultdict to collect the values for similar keys as lists.
Then simply aggregate: the sum of each list's values divided by its number of elements.
from collections import defaultdict

x = [{'a': 10, 'b': 20}, {'a': 5, 'b': 20}, {'b': 20}, {'a': 0, 'b': 20}]
y = defaultdict(list)
[y[k].append(v) for i in x for k, v in i.items()]
for k, v in y.items():
    print(k, "=", sum(v) / len(v))
>>> y
defaultdict(<class 'list'>, {'a': [10, 5, 0], 'b': [20, 20, 20, 20]})
a = 5.0
b = 20.0
