Datatable is popular in R, but it also has a Python version. However, I don't see anything in the docs about applying a user-defined function over a datatable.
Here's a toy example (in pandas) where a user function is applied over a dataframe to look for po-box addresses:
import re
import pandas as pd

df = pd.DataFrame({'customer': [101, 102, 103],
                   'address': ['12 main st', '32 8th st, 7th fl', 'po box 123']})
customer | address
---------|------------------
101      | 12 main st
102      | 32 8th st, 7th fl
103      | po box 123
# User-defined function:
def is_pobox(s):
    # True when the address matches a PO box pattern, e.g. "po box 123" or "P.O. Box 4"
    rslt = re.search(r'^p(ost)?\.? *o(ffice)?\.? *box *\d+', s)
    if rslt:
        return True
    else:
        return False
# Using .apply() for this example:
df['is_pobox'] = df.apply(lambda x: is_pobox(x['address']), axis=1)
# Expected Output:
customer | address           | is_pobox
---------|-------------------|---------
101      | 12 main st        | False
102      | 32 8th st, 7th fl | False
103      | po box 123        | True
Is there a way to do this .apply() operation in datatable? It would be nice, because datatable seems to be quite a bit faster than pandas for most operations.
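One workaround sketch (my addition, not an official datatable API for user-defined functions): pull the column out with to_list(), run the UDF in plain Python, and assign the result back as a new column. Note that this step drops out of datatable's native engine, so it won't be faster than pandas here:

import datatable as dt

DT = dt.Frame(customer=[101, 102, 103],
              address=['12 main st', '32 8th st, 7th fl', 'po box 123'])

# to_list() returns one list per column, hence the [0];
# the UDF from above runs row by row in plain Python.
DT['is_pobox'] = dt.Frame([is_pobox(a) for a in DT[:, 'address'].to_list()[0]])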
I am being provided with a data set and I am writing a function. My objective is quite simple: I have an Airbnb database with various columns. I am using a for loop over a neighbourhood-group list (that I created) and trying to extract (append) the data related to each element into an empty dataframe.
Example:
import pandas as pd
import numpy as np
dict1 = {'id': [2539, 2595, 3647, 3831, 12937, 18198, 258838, 258876, 267535, 385824],
         'name': ['Clean & quiet apt home by the park', 'Skylit Midtown Castle',
                  'THE VILLAGE OF HARLEM....NEW YORK !', 'Cozy Entire Floor of Brownstone',
                  '1 Stop fr. Manhattan! Private Suite,Landmark Block', 'Little King of Queens',
                  'Oceanview,close to Manhattan', 'Affordable rooms,all transportation',
                  'Home Away From Home-Room in Bronx', 'New York City- Riverdale Modern two bedrooms unit'],
         'price': [149, 225, 150, 89, 130, 70, 250, 50, 50, 120],
         'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan', 'Brooklyn', 'Queens',
                                 'Queens', 'Staten Island', 'Staten Island', 'Bronx', 'Bronx']}
df = pd.DataFrame(dict1)
df
I created a function as follows:
nbd_grp = ['Bronx', 'Queens', 'Staten Island', 'Brooklyn', 'Manhattan']
# Creating a function to find the cheapest place in each neighbourhood group
dfdf = pd.DataFrame(columns=['id', 'name', 'price', 'neighbourhood_group'])
def cheapest_place(neighbourhood_group):
    for elem in nbd_grp:
        data = df.loc[df['neighbourhood_group'] == elem]
        cheapest = data.loc[data['price'] == min(data['price'])]
        dfdf = cheapest.copy()
cheapest_place(nbd_grp)
My expected output is:
+--------+-------------------------------------+-------+---------------------+
| id     | name                                | price | neighbourhood_group |
+--------+-------------------------------------+-------+---------------------+
| 267535 | Home Away From Home-Room in Bronx   | 50    | Bronx               |
| 18198  | Little King of Queens               | 70    | Queens              |
| 258876 | Affordable rooms,all transportation | 50    | Staten Island       |
| 3831   | Cozy Entire Floor of Brownstone     | 89    | Brooklyn            |
| 3647   | THE VILLAGE OF HARLEM....NEW YORK ! | 150   | Manhattan           |
+--------+-------------------------------------+-------+---------------------+
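As an aside (my addition, not part of the question or answer): the loop version above never fills dfdf, because the assignment inside the function creates a new local variable that shadows the outer dfdf, and each iteration overwrites the previous result anyway. A minimal corrected sketch would collect the per-group results and concatenate them:

def cheapest_place(groups):
    frames = []
    for elem in groups:
        data = df.loc[df['neighbourhood_group'] == elem]
        # keep the row(s) with the minimum price in this group
        frames.append(data.loc[data['price'] == data['price'].min()])
    return pd.concat(frames, ignore_index=True)

dfdf = cheapest_place(nbd_grp)

That said, the answer below takes a better, set-based approach.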
My advice is that anytime you are working in a database or a dataframe and you think "I need to loop", you should think again.
In a dataframe you are in a world of set-based logic, and there is likely a better set-based way of solving the problem. In your case you can groupby() your neighbourhood_group, get the min() of the price column, and then merge or join that result back to your original dataframe to get your id and name columns.
That would look something like:
df_min_price = (df.groupby('neighbourhood_group').price.agg(min)
                  .reset_index()
                  .merge(df, on=['neighbourhood_group', 'price']))
+-----+---------------------+-------+--------+-------------------------------------+
| idx | neighbourhood_group | price | id | name |
+-----+---------------------+-------+--------+-------------------------------------+
| 0 | Bronx | 50 | 267535 | Home Away From Home-Room in Bronx |
| 1 | Brooklyn | 89 | 3831 | Cozy Entire Floor of Brownstone |
| 2 | Manhattan | 150 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! |
| 3 | Queens | 70 | 18198 | Little King of Queens |
| 4 | Staten Island | 50 | 258876 | Affordable rooms,all transportation |
+-----+---------------------+-------+--------+-------------------------------------+
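An equivalent set-based alternative (a sketch of my own, not part of the original answer): idxmin() returns the row label of the minimum price within each group, which loc can then use to pull the full rows directly, skipping the merge. Unlike the merge approach, it keeps only one row per group when prices tie:

# one row per neighbourhood_group, at the index of its minimum price
cheapest = df.loc[df.groupby('neighbourhood_group')['price'].idxmin()]
print(cheapest[['id', 'name', 'price', 'neighbourhood_group']])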
I have a pandas dataframe that consists of 4 rows. The English rows contain news titles; some rows contain non-English words, like this one:
**She’s the Hollywood Power Behind Those ...**
I want to remove all rows like this one, i.e. all rows in the pandas dataframe that contain at least one non-English character.
If using Python >= 3.7:
df[df['col'].map(lambda x: x.isascii())]
where col is your target column.
Data:
import pandas as pd

df = pd.DataFrame({
    'colA': ['**She’s the Hollywood Power Behind Those ...**',
             'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})
print(df.to_markdown())
| | colA |
|---:|:------------------------------------------------------|
| 0 | **She’s the Hollywood Power Behind Those ...** |
| 1 | Hello, world! |
| 2 | Cainã |
| 3 | another value |
| 4 | test123* |
| 5 | âbc |
Identifying and filtering strings with non-English characters (see the ASCII printable characters):
df[df.colA.map(lambda x: x.isascii())]
Output:
colA
1 Hello, world!
3 another value
4 test123*
The original approach was to use a user-defined function like this:
def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
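For completeness, a minimal usage sketch of this function (assuming the same df and column colA as above):

# keep only the rows whose colA value survives the ASCII round-trip
print(df[df['colA'].map(is_ascii)])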
You can use a regex to do that. No installation is needed: Python's built-in re module covers this (the third-party regex package, a simple pip install regex, is only required for extended features).
import re
and use [^a-zA-Z] to filter it.
To break the pattern down:
^ : negation, when it appears first inside a character class
a-z : lowercase letters
A-Z : uppercase letters
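A minimal filtering sketch along these lines (my addition: the character class is widened to the full ASCII range, since a bare [^a-zA-Z] would also flag spaces, digits, and punctuation):

import pandas as pd

df = pd.DataFrame({'colA': ['Hello, world!', 'Cainã', 'test123*', 'âbc']})

# [^\x00-\x7F] matches any character outside the ASCII range;
# rows containing at least one such character are dropped.
print(df[~df['colA'].str.contains(r'[^\x00-\x7F]')])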
I am having trouble with a pandas split. So I have a column of data that looks something like this:
Initial Dataframe
index | Address
0 | [123 New York St]
1 | [Amazing Building, 23 New Jersey St, 2F]
2 | [98 New Mexico Ave, 16F]
3 | [White House, 1600 Pennsylvania Ave, PH]
4 | [221 Baker Street]
5 | [Hogwarts]
As you can see, the lists contain varying categories and numbers of elements. Some have building names along with addresses; some only have addresses with building floors. I want to sort them out by category (building name, address, unit/floor number), but I'm having trouble coming up with a solution, as I'm a beginner Python & pandas learner.
How do I split the addresses into different categories to get an output that looks like this, assuming the building names ALL start with a letter, and I can put Null for categories with missing values?
Desired Output:
index | Building Name    | Address               | Unit Number
0     | Null             | 123 New York St       | Null
1     | Amazing Building | 23 New Jersey St.     | 2F
2     | Null             | 98 New Mexico Ave.    | 16F
3     | White House      | 1600 Pennsylvania Ave | PH
4     | Null             | 221B Baker St         | Null
5     | Hogwarts         | Null                  | Null
The main thing I need is for all addresses to be in the Address Column. Thanks for any help!
Precondition: the building name starts with a letter, not a number.
If a building name starts with a number, this can produce the wrong result.
import pandas as pd

df = pd.DataFrame({'addr': ['123 New York St',
                            'Amazing Building, 23 New Jersey St, 2F',
                            '98 New Mexico Ave, 16F']})

# Split each address and record the number of items in it
df['addr'] = df['addr'].str.split(',')
df['cnt'] = df['addr'].apply(lambda x: len(x)).values

# Helper: does the string start with a digit (i.e. not a building name)?
def CheckInt(s):
    try:
        int(s[0])
        return True
    except ValueError:
        return False
for i, v in df.iterrows():
    # One item: address only
    if v.cnt == 1:
        df.loc[i, 'Address'] = v.addr[0]
    # Three items: building, address, unit
    elif v.cnt == 3:
        df.loc[i, 'Building'] = v.addr[0]
        df.loc[i, 'Address'] = v.addr[1]
        df.loc[i, 'Unit'] = v.addr[2]
    # Two items: decide by whether the first item starts with a digit
    else:
        if CheckInt(v.addr[0]):
            df.loc[i, 'Address'] = v.addr[0]
            df.loc[i, 'Unit'] = v.addr[1]
        else:
            df.loc[i, 'Building'] = v.addr[0]
            df.loc[i, 'Address'] = v.addr[1]
We can get the output for your input dataframe as below.
If the data is different, you may have to tinker around.
import numpy as np

df['com_Address'] = df['Address'].apply(lambda x: x.replace('[', '').replace(']', '')).str.split(',')
st_list = ['St', 'Ave']
df['St_Address'] = df.apply(lambda x: [a if st in a else '' for st in st_list for a in x['com_Address']], axis=1)
df['St_Address'] = df['St_Address'].apply(lambda x: [i for i in x if i]).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: [x['com_Address'][0] if len(x['com_Address']) == 3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: np.where((len(x['com_Address']) == 1) & (x['St_Address'] == ''), x['com_Address'][0], x['Building Name']), axis=1)
df['Unit Number'] = df.apply(lambda x: [x['com_Address'][2] if len(x['com_Address']) == 3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Unit Number'] = df.apply(lambda x: np.where((len(x['com_Address']) == 2) & (x['St_Address'] != ''), x['com_Address'][-1], x['Unit Number']), axis=1)
df
Column "com_Address" is optional. I had to create it because the 'Address' column from your input came to me as a string, not as a list. If you already have it as a list, you don't need this step, and you should replace "com_Address" with 'Address' in the code.
Output
  index  Address                                    com_Address                                  Building Name     St_Address             Unit Number
0 0      [123 New York St]                          [ 123 New York St]                           Null              123 New York St        Null
1 1      [Amazing Building, 23 New Jersey St, 2F]   [ Amazing Building, 23 New Jersey St, 2F]    Amazing Building  23 New Jersey St       2F
2 2      [98 New Mexico Ave, 16F]                   [ 98 New Mexico Ave, 16F]                    Null              98 New Mexico Ave      16F
3 3      [White House, 1600 Pennsylvania Ave, PH]   [ White House, 1600 Pennsylvania Ave, PH]    White House       1600 Pennsylvania Ave  PH
4 4      [221 Baker Street]                         [ 221 Baker Street]                          Null              221 Baker Street       Null
5 5      [Hogwarts]                                 [ Hogwarts]                                  Hogwarts                                 Null
I'd like to know if it's possible to display a pandas dataframe in VS Code while debugging (first picture) the way it is displayed in PyCharm (second picture).
Thanks for any help.
df print in VS Code: [screenshot]
df print in PyCharm: [screenshot]
As of the January 2021 release of the Python extension, you can view pandas dataframes with the built-in data viewer when debugging native Python programs. When the program is halted at a breakpoint, right-click the dataframe variable in the Variables list and select "View Value in Data Viewer".
Tabulate is an excellent library to achieve fancy/pretty printing of a pandas df.
Info link: https://pypi.org/project/tabulate/
Please follow these steps to achieve pretty printing:
(Note: for easy illustration I will create a simple dataframe in Python)
1) install tabulate
pip install --upgrade tabulate
This statement will always install latest version of the tabulate library.
2) import statements
import pandas as pd
from tabulate import tabulate
3) create simple temporary dataframe
temp_data = {'Name': ['Sean', 'Ana', 'KK', 'Kelly', 'Amanda'],
             'Age': [42, 52, 36, 24, 73],
             'Maths_Score': [67, 43, 65, 78, 97],
             'English_Score': [78, 98, 45, 67, 64]}
df = pd.DataFrame(temp_data, columns=['Name', 'Age', 'Maths_Score', 'English_Score'])
4) Without tabulate, our dataframe print will be:
print(df)
Name Age Maths_Score English_Score
0 Sean 42 67 78
1 Ana 52 43 98
2 KK 36 65 45
3 Kelly 24 78 67
4 Amanda 73 97 64
5) After using tabulate, your pretty print will be:
print(tabulate(df, headers='keys', tablefmt='psql'))
+----+--------+-------+---------------+-----------------+
| | Name | Age | Maths_Score | English_Score |
|----+--------+-------+---------------+-----------------|
| 0 | Sean | 42 | 67 | 78 |
| 1 | Ana | 52 | 43 | 98 |
| 2 | KK | 36 | 65 | 45 |
| 3 | Kelly | 24 | 78 | 67 |
| 4 | Amanda | 73 | 97 | 64 |
+----+--------+-------+---------------+-----------------+
Nice and crisp print, enjoy! Please add comments if you like my answer!
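As a side note (my addition, not part of the original answer): tabulate ships several other table formats, and swapping the tablefmt argument changes only the border style:

# 'github' prints a GitHub-flavored markdown table;
# other built-in formats include 'grid' and 'fancy_grid'.
print(tabulate(df, headers='keys', tablefmt='github'))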
Use VS Code's Jupyter notebook support:
choose between attach-to-local-script or launch mode, up to you.
include a breakpoint() where you want to break, if using attach mode.
when debugging, use the Debug Console to run:
display(df_consigne_errors)
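A minimal end-to-end sketch of that attach-mode workflow (the dataframe name and contents here are placeholders of my own):

import pandas as pd

df_consigne_errors = pd.DataFrame({'error_code': [404, 500], 'count': [3, 1]})

breakpoint()  # the debugger pauses here; in the Debug Console, run:
              #   display(df_consigne_errors)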
I have not found a similar feature for VS Code. If you require this feature you might consider using Spyder IDE. Spyder IDE Homepage
In addition to Shantanu's answer, pandas' to_markdown function, which requires the tabulate library to be installed, provides various plain-text table formats that display well in the VS Code editor, such as:
df = pd.DataFrame(data={"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})
print(df.to_markdown())
| | animal_1 | animal_2 |
|---:|:-----------|:-----------|
| 0 | elk | dog |
| 1 | pig | quetzal |
Trying to implement pandas' pivot_table to produce a table that, for each party and each state, shows how much the party received in total contributions from that state.
Is this the right way to do it, or do I have to go into the database and fetch the data? The code below gives an error.
party_and_state = candidates.merge(contributors, on='id')
party_and_state.pivot_table(df, index=["party", "state"], values=["amount"], aggfunc=[np.sum])
The expected result could be something like the table below.
The first column is the state name, then party D; underneath party D are the total contributions from each state, and the same applies to party R.
+-----------------+---------+--------+
| state | D | R |
+-----------------+---------+--------+
| AK | 500 | 900 |
| IL | 600 | 877 |
| FL | 200 | 400 |
| UT | 300 | 300 |
| CA | 109 | 90 |
| MN              | 800     | 888    |
+-----------------+---------+--------+
Consider the generalized pandas merge, with pd as the qualifier instead of a dataframe, since the join fields are named differently and hence require the left_on and right_on args. Additionally, do not pass df into pivot_table when calling it as a method of a dataframe, since the calling df is already passed into the function.
Below uses the contributors_with_candidate_id and candidates text files. Also, per your desired result, you may want to use the columns arg of pivot_table:
import numpy as np
import pandas as pd
contributors = pd.read_table('contributors_with_candidate_id.txt', sep="|")
candidates = pd.read_table('candidates.txt', sep="|")
party_and_state = pd.merge(contributors, candidates,
                           left_on=['candidate_id'], right_on=['id'])
party_and_state.pivot_table(index=["party", "state"],
                            values=["amount"], aggfunc=np.sum)
# amount
# party state
# D CA 1660.80
# DC 200.09
# FL 4250.00
# IL 200.00
# MA 195.00
# ...
# R AK 1210.00
# AR 14200.00
# AZ 120.00
# CA -6674.53
# CO -5823.00
party_and_state.pivot_table(index=["state"], columns=["party"],
values=["amount"], aggfunc=np.sum)
# amount
# party D R
# state
# AK NaN 1210.00
# AR NaN 14200.00
# AZ NaN 120.00
# CA 1660.80 -6674.53
# CO NaN -5823.00
# CT NaN 2300.00
Do note, you can do the merge as an inner join in SQL with read_sql:
party_and_state = pd.read_sql("SELECT c.*, n.* FROM contributors c " +
                              "INNER JOIN candidates n ON c.candidate_id = n.id",
                              con=db)  # db is an existing DBAPI/SQLAlchemy connection

party_and_state.pivot_table(index=["state"], columns=["party"],
                            values=["amount"], aggfunc=np.sum)