Using Pandas and sqlite3 - python

Try to implement the privote_table of pandas to produce a table for each of party and each state shows how much the party receievd in total contributions from the state.
Is this the right way to do or i has to get into the data base and get fectched out. However the code below gives error.
party_and_state = candidates.merge(contributors, on='id')
party_and_state.pivot_table(df,index=["party","state"],values=["amount"],aggfunc=[np.sum])
The expected result could be something like the table below.
The first coulmn is the state name then the party D underneath the party D is the total votes from each state, the same applies with the party R
+-----------------+---------+--------+
| state | D | R |
+-----------------+---------+--------+
| AK | 500 | 900 |
| IL | 600 | 877 |
| FL | 200 | 400 |
| UT | 300 | 300 |
| CA | 109 | 90 |
| MN | 800 | 888 |

Consider the generalized pandas merge with pd as qualifier instead of a dataframe since the join fields are differently named hence requiring left_on and right_on args. Additionally, do not pass in df if running pivot_table as method of a dataframe since the called df is passed into the function.
Below uses the contributors and contributors_with_candidates text files. Also, per your desired results, you may want to use the values arg of pivot_table:
import numpy as np
import pandas as pd
contributors = pd.read_table('contributors_with_candidate_id.txt', sep="|")
candidates = pd.read_table('candidates.txt', sep="|")
party_and_state = pd.merge(contributors, candidates,
left_on=['candidate_id'], right_on=['id'])
party_and_state.pivot_table(index=["party", "state"],
values=["amount"], aggfunc=np.sum)
# amount
# party state
# D CA 1660.80
# DC 200.09
# FL 4250.00
# IL 200.00
# MA 195.00
# ...
# R AK 1210.00
# AR 14200.00
# AZ 120.00
# CA -6674.53
# CO -5823.00
party_and_state.pivot_table(index=["state"], columns=["party"],
values=["amount"], aggfunc=np.sum)
# amount
# party D R
# state
# AK NaN 1210.00
# AR NaN 14200.00
# AZ NaN 120.00
# CA 1660.80 -6674.53
# CO NaN -5823.00
# CT NaN 2300.00
Do note, you can do the merge as an inner join in SQL with read_sql:
party_and_state = pd.read_sql("SELECT c.*, n.* FROM contributors c " +
"INNER JOIN candidates n ON c.candidate_id = n.id",
con = db)
party_and_state.pivot_table(index=["state"], columns=["party"],
values=["amount"], aggfunc=np.sum)

Related

How to write a Function in python pandas to append the rows in dataframe in a loop?

I am being provided with a data set and i am writing a function.
my objectice is quiet simple. I have a air bnb data base with various columns my onjective is simple. I am using a for loop over neighbourhood group list (that i created) and i am trying to extract (append) the data related to that particular element in a empty dataframe.
Example:
import pandas as pd
import numpy as np
dict1 = {'id' : [2539,2595,3647,3831,12937,18198,258838,258876,267535,385824],'name':['Clean & quiet apt home by the park','Skylit Midtown Castle','THE VILLAGE OF HARLEM....NEW YORK !','Cozy Entire Floor of Brownstone','1 Stop fr. Manhattan! Private Suite,Landmark Block','Little King of Queens','Oceanview,close to Manhattan','Affordable rooms,all transportation','Home Away From Home-Room in Bronx','New York City- Riverdale Modern two bedrooms unit'],'price':[149,225,150,89,130,70,250,50,50,120],'neighbourhood_group':['Brooklyn','Manhattan','Manhattan','Brooklyn','Queens','Queens','Staten Island','Staten Island','Bronx','Bronx']}
df = pd.DataFrame(dict1)
df
I created a function as follows
nbd_grp = ['Bronx','Queens','Staten Islands','Brooklyn','Manhattan']
# Creating a function to find the cheapest place in neighbourhood group
dfdf = pd.DataFrame(columns = ['id','name','price','neighbourhood_group'])
def cheapest_place(neighbourhood_group):
for elem in nbd_grp:
data = df.loc[df['neighbourhood_group']==elem]
cheapest = data.loc[data['price']==min(data['price'])]
dfdf = cheapest.copy()
cheapest_place(nbd_grp)
My Expected Output is :
id
name
Price
neighbourhood group
267535
Home Away From Home-Room in Bronx
50
Bronx
18198
Little King of Queens
70
Queens
258876
Affordable rooms,all transportation
50
Staten Island
3831
Cozy Entire Floor of Brownstone
89
Brooklyn
3647
THE VILLAGE OF HARLEM....NEW YORK !
150
Manhattan
My advice is that anytime you are working in a database or in a dataframe and you think "I need to loop", you should think again.
When in a dataframe you are in a world of set-based logic and there is likely a better set-based way of solving the problem. In your case you can groupby() your neighbourhood_group and get the min() of the price column and then merge or join that result set back to your original dataframe to get your id and name columns.
That would look something like:
df_min_price = df.groupby('neighbourhood_group').price.agg(min).reset_index().merge(df, on=['neighbourhood_group','price'])
+-----+---------------------+-------+--------+-------------------------------------+
| idx | neighbourhood_group | price | id | name |
+-----+---------------------+-------+--------+-------------------------------------+
| 0 | Bronx | 50 | 267535 | Home Away From Home-Room in Bronx |
| 1 | Brooklyn | 89 | 3831 | Cozy Entire Floor of Brownstone |
| 2 | Manhattan | 150 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! |
| 3 | Queens | 70 | 18198 | Little King of Queens |
| 4 | Staten Island | 50 | 258876 | Affordable rooms,all transportation |
+-----+---------------------+-------+--------+-------------------------------------+

Convert multiple rows into one row with multiple columns in pyspark?

I have something like this (I've simplified the number of columns for brevity, there's about 10 other attributes):
id name foods foods_eaten color continent
1 john apples 2 red Europe
1 john oranges 3 red Europe
2 jack apples 1 blue North America
I want to convert it to:
id name apples oranges color continent
1 john 2 3 red Europe
2 jack 1 0 blue North America
Edit:
(1) I updated the data to show a few more of the columns.
(3) I've done
df_piv = df.groupBy(['id', 'name', 'color', 'continent', ...]).pivot('foods').avg('foods_eaten')
Is there a simpler way to do this sort of thing? As far as I can tell, I'll need to groupby almost every attribute to get my result.
Extending from what you have done so far and leveraging here
>>>from pyspark.sql import functions as F
>>>from pyspark.sql.types import *
>>>from pyspark.sql.functions import collect_list
>>>data=[{'id':1,'name':'john','foods':"apples"},{'id':1,'name':'john','foods':"oranges"},{'id':2,'name':'jack','foods':"banana"}]
>>>dataframe=spark.createDataFrame(data)
>>>dataframe.show()
+-------+---+----+
| foods| id|name|
+-------+---+----+
| apples| 1|john|
|oranges| 1|john|
| banana| 2|jack|
+-------+---+----+
>>>grouping_cols = ["id","name"]
>>>other_cols = [c for c in dataframe.columns if c not in grouping_cols]
>>> df=dataframe.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols])
>>>df.show()
+---+----+-----------------+
| id|name| foods|
+---+----+-----------------+
| 1|john|[apples, oranges]|
| 2|jack| [banana]|
+---+----+-----------------+
>>>df_sizes = df.select(*[F.size(col).alias(col) for col in other_cols])
>>>df_max = df_sizes.agg(*[F.max(col).alias(col) for col in other_cols])
>>> max_dict = df_max.collect()[0].asDict()
>>>df_result = df.select('id','name', *[df[col][i] for col in other_cols for i in range(max_dict[col])])
>>>df_result.show()
+---+----+--------+--------+
| id|name|foods[0]|foods[1]|
+---+----+--------+--------+
| 1|john| apples| oranges|
| 2|jack| banana| null|
+---+----+--------+--------+

Apply user-defined functions over a python datatable (not pandas dataframe)?

Datatable is popular for R, but it also has a Python version. However, I don't see anything in the docs for applying a user defined function over a datatable.
Here's a toy example (in pandas) where a user function is applied over a dataframe to look for po-box addresses:
df = pd.DataFrame({'customer':[101, 102, 103],
'address':['12 main st', '32 8th st, 7th fl', 'po box 123']})
customer | address
----------------------------
101 | 12 main st
102 | 32 8th st, 7th fl
103 | po box 123
# User-defined function:
def is_pobox(s):
rslt = re.search(r'^p(ost)?\.? *o(ffice)?\.? *box *\d+', s)
if rslt:
return True
else:
return False
# Using .apply() for this example:
df['is_pobox'] = df.apply(lambda x: is_pobox(x['address']), axis = 1)
# Expected Output:
customer | address | rslt
----------------------------|------
101 | 12 main st | False
102 | 32 8th st, 7th fl| False
103 | po box 123 | True
Is there a way to do this .apply operation in datatable? Would be nice, because datatable seems to be quite a bit faster than pandas for most operations.

Add prefix to ffill, identifying values which were carried forward

Is there a wayto add a prefix when filling na's with ffill in pandas? I have a dataframe containing, taxonomic information like so:
| Kingdom | Phylum | Class | Order | Family | Genus |
| Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Lactobacillus |
| Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | | |
| Bacteria | Bacteroidetes | | | | |
Since not all of the taxa in my dataframe can be classified fully, I have some empty cells. Replacing the spaces with NA and using ffill I can fill these with the last valid string in each row but I would like to add a string to these (for example "Unknown_Bacteroidales") so I can identify which ones were carried forward.
So far I tried this taxa_formatted = "unknown_" + taxonomy.fillna(method='ffill', axis=1) but this of course adds the "unknown_" prefix to everything in the dataframe.
You can this using boolean masking with df.isna.
df = df.replace("", np.nan) # if already NaN present skip this step
d = df.ffill()
d[df.isna()]+="(Copy)"
d
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales Lactobacillaceae(Copy) Lactobacillus(Copy)
2 Bacteria Bacteroidetes Bacteroidia(Copy) Bacteroidales(Copy) Lactobacillaceae(Copy) Lactobacillus(Copy)
You can use df.add here.
d = df.ffill(axis=1)
df.add("unkown_" + d[df.isna()],fill_value='')
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales unkown_Bacteroidales unkown_Bacteroidales
2 Bacteria Bacteroidetes unkown_Bacteroidetes unkown_Bacteroidetes unkown_Bacteroidetes unkown_Bacteroidetes
You need to use mask and update:
#make true nan's first.
#df = df.replace('',np.nan)
s = df.isnull()
df = df.ffill(axis=1)
df.update('unknown_' + df.mask(~s) )
print(df)
Bacteria Firmicutes Bacilli Lactobacillales \
0 Bacteria Bacteroidetes Bacteroidia Bacteroidales
1 Bacteria Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes
Lactobacillaceae Lactobacillus
0 unknown_Bacteroidales unknown_Bacteroidales
1 unknown_Bacteroidetes unknown_Bacteroidetes

How to enrich dataframe by adding columns in specific condition

I have a two different datasets:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How to get dataset in the following format? So I can get user's taste profile, so I can compare different users by their similarity score?
+-------+---------+--------+---------+---------+-----+
|user_id| Action |Adventure|Animation|Children|Drama|
+-------+---------+--------+---------+---------+-----+
| 100 | 0 | 1 | 1 | 1 | 0 |
| 101 | 1 | 1 | 0 | 1 | 0 |
+-------+---------+---------+---------+--------+-----+
Where df is the movies dataframe and dfu is the users dataframe
The 'genre' column needs to be split into a list with pandas.Series.str.split, and then using pandas.DataFrame.explode, transform each element of the list into a row, replicating index values.
pandas.merge the two dataframes on 'movie_id'
Use pandas.DataFrame.groupby on 'user_id' and 'genre' and aggregate by count.
Shape final
.unstack converts the groupby dataframe from long to wide format
.fillna replace NaN with 0
.astype changes the numeric values from float to int
Tested in python 3.10, pandas 1.4.3
import pandas as pd
# data
movies = {'movie_id': [1000, 1001, 1002],
'title': ['Toy Story', 'Jumanji', 'Iron Man'],
'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy', 'Action|Adventure|Sci-Fi']}
users = {'user_id': [100, 101, 101],
'movie_id': [1000, 1001, 1002],
'timestep': [20200728, 20200727, 20200726]}
# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)
# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')
# explode the lists in genre
df = df.explode('genre', ignore_index=True)
# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')
# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)
# display(final)
genre Action Adventure Animation Children Fantasy Sci-Fi
user_id
100 0 1 1 1 0 0
101 1 2 0 1 1 1

Categories

Resources