Extract last value of list in each row [duplicate] - python

I have a data frame with a column location which contains a lot of text followed by an actual location code. I'm trying to extract the location code, and I figured out that if I split the string on spaces, I can just grab the last item in each list. For example, in the df below, location is the original column, and after applying my split I now have location_split.
  | location             | location_split
0 | Town (123)           | ['Town', '(123)']
1 | Town Town (123AB)    | ['Town', 'Town', '(123AB)']
2 | Town (40832) (123BC) | ['Town', '(40832)', '(123BC)']
3 | Town (987)           | ['Town', '(987)']
But how do I pull out the last item in each list and have that be the value, something like df['location'] = df['location_split'][-1], so that I end up with the location column below? I did attempt regex, but since some rows have multiple parentheses containing numbers it couldn't differentiate; splitting and then grabbing the last item in the list seems the most foolproof.
| location
0 | (123)
1 | (123AB)
2 | (123BC)
3 | (987)

You can use the .str accessor:
df['location'] = df['location_split'].str[-1]
# or
df['location'] = df['location_split'].str.get(-1)
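A runnable sketch of the whole flow, using the sample rows from the question:
import pandas as pd

df = pd.DataFrame({'location': ['Town (123)', 'Town Town (123AB)',
                                'Town (40832) (123BC)', 'Town (987)']})
df['location_split'] = df['location'].str.split()  # split on whitespace
df['location'] = df['location_split'].str[-1]      # last element of each list
print(df['location'].tolist())  # ['(123)', '(123AB)', '(123BC)', '(987)']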

You can use a regex, anchored to the last parenthesised group (a greedy pattern like (\(.*\)) would span from the first "(" to the last ")" on rows with several groups):
df['location'] = df['location'].astype(str).str.extract(r'(\([^()]*\))\s*$')
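A quick check on the tricky row:
import pandas as pd

pd.Series(['Town (40832) (123BC)']).str.extract(r'(\([^()]*\))\s*$')
#          0
# 0  (123BC)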

Related

Removing rows containing non-English words in a Pandas dataframe

I have a pandas data frame that consists of 4 rows. The English rows contain news titles, but some rows contain non-English words, like this one:
**She’s the Hollywood Power Behind Those ...**
I want to remove all rows like this one, i.e. all rows that contain at least one non-English character, from the Pandas data frame.
If using Python >= 3.7:
df[df['col'].map(lambda x: x.isascii())]
where col is your target column.
Data:
import pandas as pd

df = pd.DataFrame({
    'colA': ['**She’s the Hollywood Power Behind Those ...**',
             'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})
print(df.to_markdown())
| | colA |
|---:|:------------------------------------------------------|
| 0 | **She’s the Hollywood Power Behind Those ...** |
| 1 | Hello, world! |
| 2 | Cainã |
| 3 | another value |
| 4 | test123* |
| 5 | âbc |
Identifying and filtering strings with non-English characters (see the ASCII printable characters):
df[df.colA.map(lambda x: x.isascii())]
Output:
colA
1 Hello, world!
3 another value
4 test123*
The original approach was to use a user-defined function like this:
def is_ascii(s):
    try:
        # round-trip through ASCII; non-ASCII characters raise UnicodeDecodeError
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
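Applied to the sample frame above, it gives the same filter:
df[df['colA'].map(is_ascii)]  # keeps rows 1, 3 and 4, same as isascii()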
You can use regex to do that.
The re module ships with Python, so nothing needs to be installed (the regex package on PyPI is a separate, extended implementation):
import re
and use [^a-zA-Z] to match the non-letter characters.
To break it down:
^ (inside a character class): negation
a-z: lowercase letters
A-Z: uppercase letters
Note that this class also matches digits, spaces and punctuation, so for non-English characters specifically, a non-ASCII class such as [^\x00-\x7F] is usually closer to the goal.
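A minimal sketch of that idea using pandas' built-in regex support on the sample frame above ([^\x00-\x7F] matches any non-ASCII character):
df[~df['colA'].str.contains(r'[^\x00-\x7F]', regex=True)]
# keeps 'Hello, world!', 'another value' and 'test123*'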

Merging one column in pandas - only keep one row with all values instead of a new row for each match

I want to merge two pandas dataframes:
df1
City      | Attraction | X | Z | Y
Somewhere | Rainbows   | 1 | 2 | 3
Somewhere | Trees      | 4 | 4 | 4
Somewhere | Unicorns   |   |   |
df2
City      | Other Column | Also another column
Somewhere | Something    | Something else
Normally this would be done like so:
df2.merge(df1[['City', 'Attraction']], on='City', how='left')
City      | Other Column | Also another column | Attraction
Somewhere | Something    | Something else      | Rainbows
Somewhere | Something    | Something else      | Trees
Somewhere | Something    | Something else      | Unicorns
However, I would like to group the results of the join into a comma separated list (or whatever):
City      | Other Column | Also another column | Attraction
Somewhere | Something    | Something else      | Rainbows, Trees, Unicorns
groupby() and map:
df2['Attraction'] = df2['City'].map(df1.groupby('City').Attraction.agg(', '.join))
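A runnable sketch with frames matching the sample data:
import pandas as pd

df1 = pd.DataFrame({'City': ['Somewhere'] * 3,
                    'Attraction': ['Rainbows', 'Trees', 'Unicorns']})
df2 = pd.DataFrame({'City': ['Somewhere'],
                    'Other Column': ['Something'],
                    'Also another column': ['Something else']})
df2['Attraction'] = df2['City'].map(df1.groupby('City').Attraction.agg(', '.join))
print(df2['Attraction'][0])  # Rainbows, Trees, Unicorns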

Find a string in a column but return the value in a different column in the same row

I have a df like this:
Car  | Color | Year | Price
VW   | blue  | 2014 | 20,000
Audi | red   | 2017 | 30,000
I am using IF for this, as this table is not exactly my real df; I just need an idea.
I need the price of the vehicle if the selection is, let's say, Audi.
I am looking for a substring in a string, but I just need the cost from one specific column in the exact row where the substring was found.
I am using:
for x in cars['Car']:
    if "Audi" in x:
        # Just need the row in column 'Price'
        print(??)
You can use boolean indexing with pandas.
df[df['Car'].str.contains('Audi')]['Price']
To break it down:
df['Car'] # Select Car column
df['Car'].str.contains('Audi') # Check if the value for each row contains Audi
df[df['Car'].str.contains('Audi')] # Select rows where contains Audi is true
df[df['Car'].str.contains('Audi')]['Price'] # Select the price column only
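Equivalently, .loc selects the rows and the column in one step, which also makes assignment back to those rows safe:
df.loc[df['Car'].str.contains('Audi'), 'Price']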
Don't go for string comparison; you can directly use the DataFrame's groupby() function, like this:
import pandas as pd

listt = [['VW', 'blue', 2014, 20000],
         ['Audi', 'red', 2015, 30000],
         ['BMW', 'black', 2019, 90000],
         ['Audi', 'white', 2011, 70000]]
my_df = pd.DataFrame(listt)
my_df.columns = ['Car', 'Color', 'Year', 'Price']
grouped_data = my_df.groupby('Car')
grouped_data.get_group("Audi")
Output:
    Car  Color  Year  Price
1  Audi    red  2015  30000
3  Audi  white  2011  70000
You can use a for loop with iterrows to check if any row's Car contains Audi.
for index, row in cars.iterrows():
    if 'Audi' in row.Car:
        print(row.Price)

Updating DataFrame Column Conditionally Based on Another DataFrame Column Containing Lists

Say I have a DataFrame similar to this:
| Cost | Combo                            |
|------|----------------------------------|
| 12   | ['apples', 'bananas', 'carrots'] |
| 7    | ['apples', 'carrots']            |
The 'Cost' column is a function of the costs of the individual 'Combo' items, and 'Combo' is a list of items. If the price of 'bananas' changes, then I want to modify the 'Cost' of any Combo containing bananas accordingly.
While I can step through each row using iterrows, checking each Combo to see if it contains bananas, I was wondering if there is a faster method to achieve the same effect.
And what if I were to update more than one item in the combo, such as bananas and carrots?
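One vectorized alternative to iterrows is to build the per-row adjustment by mapping over the lists; a minimal sketch (the frame and the deltas dict here are hypothetical), which also handles several changed items at once:
import pandas as pd

df = pd.DataFrame({'Cost': [12, 7],
                   'Combo': [['apples', 'bananas', 'carrots'],
                             ['apples', 'carrots']]})

# hypothetical price changes; items not listed are unchanged
deltas = {'bananas': 2, 'carrots': -1}

# sum the deltas of the changed items present in each combo
df['Cost'] += df['Combo'].map(lambda combo: sum(deltas.get(item, 0) for item in combo))
print(df['Cost'].tolist())  # [13, 6]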

How do I create a one-column SQL table from an input list/array?

I would like to know how to pass a list/set/tuple from python (via psycopg2) to a postgres query as a one-column table. For example, if the list is ['Alice', 'Bob'], I want the table to be:
| Temp |
+-------+
| Alice |
| Bob |
If anybody has alternate suggestions to achieve my result after reading the section below, that would be fine as well.
Background
I have an SQL table which has three columns of interest:
ID | Members | Group
---+---------+----------
1 | Alice | 1
2 | Alice | 1
3 | Bob | 1
4 | Charlie | 1
5 | Alice | 2
6 | Bob | 2
7 | Alice | 3
8 | Bob | 4
9 | Charlie | 3
I want a table of groups with certain combinations of members. Note that a member may have multiple items in a group (e.g. IDs 1 and 2).
For an input of ['Alice'] I would want which groups she is in (present) and which contain only her (unique), as below:
Group | Type
------+--------
1 | present
2 | present
3 | present
For an input of ['Alice', 'Bob']:
Group | Type
------+--------
1 | present
2 | unique
From reading it looks like I am looking for relational division as described here, for which I need to do what the original question asks as the input is taken from a web form processed in python. Again, alternative solutions are also welcome.
You need to make a subquery that computes the member counts, then run a simple divisor query with a GROUP BY statement, but against an IN clause over the static set instead of against another table. Because this is Python, you already know the size of the static set. Note that group is a reserved word in PostgreSQL, so the column has to be double-quoted everywhere it appears.
I'll assume you already have a database cursor, and the table is called GroupMembers:
MEMBERSHIP_QUERY = '''
SELECT gm."group", mc.memberscount = %(len)s AS type
FROM groupmembers gm
JOIN (SELECT "group", COUNT(DISTINCT members) AS memberscount
      FROM groupmembers
      GROUP BY "group") mc
ON gm."group" = mc."group"
WHERE gm.members IN %(set)s
GROUP BY gm."group", mc.memberscount
HAVING COUNT(DISTINCT gm.members) = %(len)s;
'''
def membership(cursor, members):
    # psycopg2 adapts a tuple to a parenthesised value list, so IN %(set)s works
    cursor.execute(MEMBERSHIP_QUERY,
                   dict(len=len(members), set=tuple(members)))
    for row in cursor:  # execute() returns None; iterate the cursor instead
        yield dict(group=row[0], type=row[1])
There is thus no need to use a TEMP table to execute this query.
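Hypothetical usage, with cursor obtained from a psycopg2 connection; given the sample table above, type is True exactly when the group contains only the given members ('unique'):
for result in membership(cursor, ['Alice', 'Bob']):
    print(result)
# {'group': 1, 'type': False}  -- present
# {'group': 2, 'type': True}   -- unique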
If you do need a TEMP table for other purposes, inserting a set of rows is easiest with .executemany():
members = ['Alice', 'Bob']
cursor.execute('CREATE TEMP TABLE tmp_members (member CHAR(255)) ON COMMIT DROP;')
cursor.executemany('INSERT INTO tmp_members VALUES (%s);',
                   [(name,) for name in members])
Note that .executemany() expects a sequence of sequences; each entry is a sequence of row data, which in this case only holds one name each. I generate a list of single-item tuples to fill the table.
Alternatively, you can use a sequence of mappings too and use the %(name)s parameter syntax (so the row data sequence becomes [dict(name=name) for name in members]).
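That variant would look like:
cursor.executemany('INSERT INTO tmp_members VALUES (%(name)s);',
                   [dict(name=name) for name in members])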
