List search in pandas text column - python

I have a dataset with two columns: date and text. The text column contains unstructured information, and I have a list of city names to search for in the text column.
I need to get two sets of data:
list_city = ['New York', 'Los Angeles', 'Chicago']
Case 1: keep the rows whose text contains every city in the list.
Example:
df_1
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
05-02-2022 New York, Houston, Phoenix
04-02-2022 San Antonio, San Diego, Jacksonville
Need result df_1_res:
df_1_res
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
I tried this construction; it works, but it doesn't look very nice:
df_1_res = df_1.loc[df_1["text"].str.contains(list_city[0])
                    & df_1["text"].str.contains(list_city[1])
                    & df_1["text"].str.contains(list_city[2])]
Case 2: keep the rows whose text contains at least one city from the list.
Example:
df_2
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
05-02-2022 New York, Houston, Phoenix
04-02-2022 San Antonio, San Diego, Jacksonville
Need result df_2_res:
df_2_res
data text
06-02-2022 New York, Los Angeles, Chicago, Phoenix
05-02-2022 New York, Houston, Phoenix
I tried this construction; it works, but it doesn't look very nice:
df_2_res = df_2.loc[df_2["text"].str.contains(list_city[0])
                    | df_2["text"].str.contains(list_city[1])
                    | df_2["text"].str.contains(list_city[2])]
How can this be improved, given that the number of cities in the filtering list is expected to change?
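For reference, both filters can be written so they scale with the length of the list; a sketch, assuming the sample columns shown above (one boolean column per city, reduced with all/any):

```python
import pandas as pd

df_1 = pd.DataFrame({
    "data": ["06-02-2022", "05-02-2022", "04-02-2022"],
    "text": ["New York, Los Angeles, Chicago, Phoenix",
             "New York, Houston, Phoenix",
             "San Antonio, San Diego, Jacksonville"],
})
list_city = ["New York", "Los Angeles", "Chicago"]

# one boolean column per city (regex=False treats the names literally)
matches = pd.concat(
    [df_1["text"].str.contains(city, regex=False) for city in list_city],
    axis=1)

df_1_res = df_1.loc[matches.all(axis=1)]  # every city present
df_2_res = df_1.loc[matches.any(axis=1)]  # at least one city present
```

Adding or removing cities from list_city then requires no change to the filtering code.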

here is one way to do it
Case #1: AND condition
# use re.IGNORECASE to make findall case-insensitive
import re
(df_1.loc[df_1['text']
          .str.findall(r'|'.join(list_city), re.IGNORECASE)
          .str.len()
          .eq(len(list_city))])
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
Case #2: OR condition
# create an OR pattern using join, then filter with loc
df_2.loc[df_2['text'].str.contains(r'|'.join(list_city))]
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
1 05-02-2022 New York, Houston, Phoenix
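One caveat with any approach that compares the number of matches to len(list_city): a city mentioned twice can stand in for a missing one. A sketch contrasting the length-based check with a set-based one, on made-up sample data:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "data": ["06-02-2022", "05-02-2022"],
    "text": ["New York, Los Angeles, Chicago, Phoenix",
             "New York, New York, Chicago"],   # "New York" repeated
})
list_city = ["New York", "Los Angeles", "Chicago"]

pattern = "|".join(map(re.escape, list_city))

# length-based check: the repeated "New York" pushes the count to 3
count_ok = df["text"].str.findall(pattern).str.len().eq(len(list_city))

# set-based check: repeats cannot stand in for a missing city
set_ok = df["text"].str.findall(pattern).apply(
    lambda found: set(found) >= set(list_city))
```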


Solution with pandas.Series.str.count
list_city = ['New York', 'Los Angeles', 'Chicago']
print('\nfirst task - ALL')
df1 = df[df['text'].str.count(r'|'.join(list_city)).eq(len(list_city))]
print(df1)
print('\nsecond task - ANY')
df1 = df[df['text'].str.count(r'|'.join(list_city)).gt(0)]
print(df1)
first task - ALL
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
second task - ANY
data text
0 06-02-2022 New York, Los Angeles, Chicago, Phoenix
1 05-02-2022 New York, Houston, Phoenix


Creating unique dataframes based on one with adding a value to the column

I have a dataset with statistics by region, and I would like to build several per-city datasets from it, adding to each one a column with the name of the city.
That is, from one dataset I would like to get three.
I'll give you an example. Initial dataset:
df
date name_region
2022-01-01 California
2022-01-02 California
2022-01-03 California
Next, I have a list with cities: city_list = ['Los Angeles', 'San Diego', 'San Jose']
As an output, I want to have 3 datasets (or more, depending on the number of items in the list):
df_city_1
date name_region city
2022-01-01 California Los Angeles
2022-01-02 California Los Angeles
2022-01-03 California Los Angeles
df_city_2
date name_region city
2022-01-01 California San Diego
2022-01-02 California San Diego
2022-01-03 California San Diego
df_city_3
date name_region city
2022-01-01 California San Jose
2022-01-02 California San Jose
2022-01-03 California San Jose
Ideally, each resulting data set could be accessed by a key determined by an element of the list:
df_city['Los Angeles']
date name_region city
2022-01-01 California Los Angeles
2022-01-02 California Los Angeles
2022-01-03 California Los Angeles
How can I do that? I only found a way to split into several data sets when the original set already contains a column with the corresponding unique values (in this case, the cities), but that does not suit me very well.
Another possible solution:
dfs = []
for city in city_list:
    dfs.append(df.assign(city=city))
cities = dict(zip(city_list, dfs))
cities['Los Angeles']
Output:
date name_region city
0 2022-01-01 California Los Angeles
1 2022-01-02 California Los Angeles
2 2022-01-02 California Los Angeles
Thanks to @ouroboros1, who suggested a very nice way of shortening my code:
cities = dict(zip(city_list, [df.assign(city = city) for city in city_list]))
You can use a dictionary comprehension, and add the column city each time using df.assign.
import pandas as pd
data = {'date': {0: '2022-01-01', 1: '2022-01-02', 2: '2022-01-02'},
'name_region': {0: 'California', 1: 'California', 2: 'California'}}
df = pd.DataFrame(data)
city_list = ['Los Angeles', 'San Diego', 'San Jose']
# "df_city" as a `dict`
df_city = {city: df.assign(city=city) for city in city_list}
# accessing each `df` by key (i.e. a `list` element)
print(df_city['Los Angeles'])
date name_region city
0 2022-01-01 California Los Angeles
1 2022-01-02 California Los Angeles
2 2022-01-02 California Los Angeles
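If a single frame is ever more convenient than a dict of frames, the same dict can be stacked with pd.concat, whose keys become an extra index level; a sketch using the sample data above:

```python
import pandas as pd

data = {"date": ["2022-01-01", "2022-01-02", "2022-01-02"],
        "name_region": ["California", "California", "California"]}
df = pd.DataFrame(data)
city_list = ["Los Angeles", "San Diego", "San Jose"]

df_city = {city: df.assign(city=city) for city in city_list}

# stack all per-city frames into one; dict keys become an index level
combined = pd.concat(df_city, names=["city_key"])
la = combined.loc["Los Angeles"]
```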

After updating a df from a filtered df, old values reappear

I have a df that I need to filter into a new df, work on, and then update back into the original df, like this:
Street                                    City         State  Zip
4210 Nw Lake Dr                           Lees Summit  Mo     64064
9810 Scripps Lake Dr. Suite A San Diego   Ca - 92131
1124 Ethel St                             Glendale     Ca     91207
4000 E Bristol St Ste 3 Elkhart           In-46514
My intended output is:
Street                                    City         State  Zip
4210 Nw Lake Dr                           Lees Summit  Mo     64064
9810 Scripps Lake Dr. Suite A San Diego                Ca     92131
1124 Ethel St                             Glendale     Ca     91207
4000 E Bristol St Ste 3 Elkhart                        In     46514
So first I filtered the original dataframe into a new df and worked on it with the following code:
Filter3_df = Final[Final['State'].isnull()]
# expand=False makes extract return a Series rather than a DataFrame
Filter3_df['temp'] = Filter3_df['City'].str.extract('([A-Za-z]+)', expand=False)
mask2 = Filter3_df['temp'].notnull()
Filter3_df.loc[mask2, 'Zip'] = Filter3_df.loc[mask2, 'City'].str[-5:]
Filter3_df.loc[mask2, 'State'] = Filter3_df.loc[mask2, 'temp']
del Filter3_df['temp']
Filter3_df['City'] = float('NaN')
After this, Filter3_df looks like this:
Street                                    City  State  Zip
9810 Scripps Lake Dr. Suite A San Diego         Ca     92131
4000 E Bristol St Ste 3 Elkhart                 In     46514
But when I update this filtered df back into the original df using
Final.update(Filter3_df)
I am not getting the intended output. Instead I get:
Street                                    City         State  Zip
4210 Nw Lake Dr                           Lees Summit  Mo     64064
9810 Scripps Lake Dr. Suite A San Diego   Ca - 92131   Ca     92131
1124 Ethel St                             Glendale     Ca     91207
4000 E Bristol St Ste 3 Elkhart           In-46514     In     46514
Kindly let me know where I am going wrong.
From the docs, pandas.DataFrame.update:
Modify in place using non-NA values from another DataFrame.
Replace Filter3_df['City'] = float('NaN'), which is NA for floats, with the value you really want:
Filter3_df['City'] = ""
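A minimal reproduction, with made-up column values: update() skips NA values in the incoming frame, so a NaN City never overwrites the old text, while an empty string does:

```python
import pandas as pd

final = pd.DataFrame({"City": ["Ca - 92131"], "State": [None], "Zip": [None]})

fixed = final.copy()
fixed["State"] = "Ca"
fixed["Zip"] = "92131"

fixed["City"] = float("NaN")          # NA values are ignored by update()
final.update(fixed)
city_after_nan = final.loc[0, "City"]  # old text survives

fixed["City"] = ""                     # an empty string is a real value
final.update(fixed)                    # now City is overwritten
```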

Getting a word from a set in a dataframe?

I have a dataframe column 'address' with values like this in each row:
3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)
Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)
I need only to keep the value Bronx / Queens / Manhattan / Staten Island from each row.
Is there any way to do this?
Thanks in advance.
One option, assuming the values are always in the same position, is .split(', ')[2]:
"3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)".split(', ')[2]
If the source file is a CSV (Comma-separated values), I would have a look at pandas and pandas.read_csv('filename.csv') and leverage all the nice features that are in pandas.
If the values are not always at the same position and you only need to know whether a cell is in a set of values:
import pandas as pd
df = pd.DataFrame(["The Bronx", "Queens", "Man"])
df.isin(["Queens", "The Bronx"])
You could add a column, let's call it 'district' and then populate it like this.
import pandas as pd
df = pd.DataFrame({'address':["3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
"Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)"]})
districts = ['Bronx','Queens','Manhattan', 'Staten Island']
df['district'] = ''
for district in districts:
    df.loc[df['address'].str.contains(district), 'district'] = district
print(df)
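If only one borough name ever appears per address, the loop above can also be written as a single str.extract call; a sketch (note that the alternative 'Bronx' also matches inside 'The Bronx'):

```python
import pandas as pd

df = pd.DataFrame({"address": [
    "3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States",
    "Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, "
    "Queens County, New York, 11372, United States",
]})

# first borough name found anywhere in the string; NaN if none is present
df["district"] = df["address"].str.extract(
    r"(Bronx|Queens|Manhattan|Staten Island)", expand=False)
```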

Pandas - Create a new column (Branch name) based on another column (City name)

I have the following Python Pandas Dataframe (8 rows):
City Name
New York
Long Beach
Jamestown
Chicago
Forrest Park
Berwyn
Las Vegas
Miami
I would like to add a new Column (Branch Name) based on City Name as below:
City Name Branch Name
New York New York
Long Beach New York
Jamestown New York
Chicago Chicago
Forrest Park Chicago
Berwyn Chicago
Las Vegas Las Vegas
Miami Miami
How do I do that?
You can use .map(). City names not found in the dictionary come back as NaN and are then restored with fillna:
df["Branch Name"] = df["City Name"].map({"Long Beach": "New York",
                                         "Jamestown": "New York",
                                         "Forrest Park": "Chicago",
                                         "Berwyn": "Chicago"}, na_action='ignore')
df["Branch Name"] = df["Branch Name"].fillna(df["City Name"])
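For completeness, a self-contained version of the snippet above, with the question's sample cities typed in by hand:

```python
import pandas as pd

df = pd.DataFrame({"City Name": ["New York", "Long Beach", "Jamestown", "Chicago",
                                 "Forrest Park", "Berwyn", "Las Vegas", "Miami"]})

branch_map = {"Long Beach": "New York", "Jamestown": "New York",
              "Forrest Park": "Chicago", "Berwyn": "Chicago"}

# unmapped cities come back as NaN, then fall back to the city name itself
df["Branch Name"] = df["City Name"].map(branch_map).fillna(df["City Name"])
```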

Filling out empty cells with lists of values

I have a data frame that looks like below:
City State Country
Chicago IL United States
Boston
San Diego CA United States
Los Angeles CA United States
San Francisco
Sacramento
Vancouver BC Canada
Toronto
And I have 3 lists of values that are ready to fill in the None cells:
city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']
The order of the elements in these lists corresponds: the first items across all three lists match each other, and so forth. How can I fill the empty cells to produce the result below?
City State Country
Chicago IL United States
Boston MA United States
San Diego CA United States
Los Angeles CA United States
San Francisco CA United States
Sacramento CA United States
Vancouver BC Canada
Toronto ON Canada
My code gives me an error and I'm stuck.
if df.loc[df['City'] == 'Boston']:
    'State' = 'MA'
Any solution is welcome. Thank you.
Create two mappings, one for <city : state>, and another for <city : country>.
city_map = dict(zip(city, state))
country_map = dict(zip(city, country))
Then fill only the missing cells by mapping City through each dict and using fillna, so the existing values are kept:
df['State'] = df['State'].fillna(df['City'].map(city_map))
df['Country'] = df['Country'].fillna(df['City'].map(country_map))
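A runnable sketch that fills only the missing cells with map and fillna, leaving rows that already have a State or Country untouched (sample data reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Boston", "San Diego", "Los Angeles",
             "San Francisco", "Sacramento", "Vancouver", "Toronto"],
    "State": ["IL", None, "CA", "CA", None, None, "BC", None],
    "Country": ["United States", None, "United States", "United States",
                None, None, "Canada", None],
})

city = ["Boston", "San Francisco", "Sacramento", "Toronto"]
state = ["MA", "CA", "CA", "ON"]
country = ["United States", "United States", "United States", "Canada"]

state_map = dict(zip(city, state))
country_map = dict(zip(city, country))

# fillna only touches the NA cells, so existing values survive
df["State"] = df["State"].fillna(df["City"].map(state_map))
df["Country"] = df["Country"].fillna(df["City"].map(country_map))
```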
