I have a dataframe in Pandas with a column called 'Campaign' that has values like this:
"UK-Sample-Car Rental-Car-Broad-MatchPost"
I need to be able to detect that the string contains the phrase 'Car Rental' and set another Product column to 'CAR'. The hyphen does not always separate out the word 'Car', so splitting on the hyphen isn't possible.
How can I achieve this in Pandas/Python?
pandas has some sweet string functions you can use, for example like this:
df['vehicle'] = df.Campaign.str.extract('(Car).Rental', expand=False).str.upper()
This sets the vehicle column to whatever is captured inside the parentheses of the regular expression given to extract (expand=False keeps the single capture group as a Series rather than a DataFrame, so str.upper can be applied).
The str.upper then makes it uppercase.
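For example, a minimal demo on the sample string from the question (the second row is made up just to show the no-match case):

import pandas as pd

df = pd.DataFrame({'Campaign': ['UK-Sample-Car Rental-Car-Broad-MatchPost',
                                'UK-Sample-Hotel-Broad-MatchPost']})
df['vehicle'] = df.Campaign.str.extract('(Car).Rental', expand=False).str.upper()
print(df['vehicle'])
# 0    CAR
# 1    NaN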
Extra Bonus:
If you want to assign vehicle something that is not in the original string, you have to take a few more steps, but we still use the string functions, this time str.contains.
is_motorcycle = df.Campaign.str.contains('Motorcycle')
df['vehicle'] = pd.Series(["MC"] * len(df), index=df.index) * is_motorcycle
The second line here creates a series of "MC" strings, then masks it on the entries which we found to be motorcycles.
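A quick check of that masking trick on made-up data:

import pandas as pd

df = pd.DataFrame({'Campaign': ['DE-Motorcycle-Exact', 'UK-Car Rental-Broad']})
is_motorcycle = df.Campaign.str.contains('Motorcycle')
df['vehicle'] = pd.Series(["MC"] * len(df), index=df.index) * is_motorcycle
print(df['vehicle'].tolist())
# ['MC', '']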
If you want to combine multiple, I suggest you use the map function:
vehicle_list = df.Campaign.str.extract('(Car).Rental|(Motorcycle)|(Hotel)')
vehicle = vehicle_list.apply(
    lambda x: x[x.last_valid_index()] if x.last_valid_index() is not None else None,
    axis=1)
df['vehicle'] = vehicle.map({'Car':'Car campaign', 'Hotel':'Hotel campaign'})
This first extracts the data into one captured column per case, the cases being split by |. Be careful not to append a catch-all alternative like (.*): the regex engine would match it at position 0 and swallow every row. Instead, the lambda returns None for rows where no case matched.
The Series.map function is pretty straightforward: if the captured data is 'Car', we set 'Car campaign'; if it is 'Hotel', we set 'Hotel campaign', and so on. Anything not listed in the mapping becomes NaN.
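Putting it together on a couple of sample rows (the second row is made up):

import pandas as pd

df = pd.DataFrame({'Campaign': ['UK-Sample-Car Rental-Car-Broad-MatchPost',
                                'FR-Sample-Hotel-Exact']})
vehicle_list = df.Campaign.str.extract('(Car).Rental|(Motorcycle)|(Hotel)')
vehicle = vehicle_list.apply(
    lambda x: x[x.last_valid_index()] if x.last_valid_index() is not None else None,
    axis=1)
df['vehicle'] = vehicle.map({'Car': 'Car campaign', 'Hotel': 'Hotel campaign'})
print(df['vehicle'].tolist())
# ['Car campaign', 'Hotel campaign']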
I have a bunch of keywords stored in a 620x2 pandas dataframe. I think I need to treat each entry as its own set, where semicolons separate elements. So, we end up with 1240 sets. Then I'd like to be able to search how many times keywords of my choosing appear together. For example, I'd like to figure out how many times 'computation theory' and 'critical infrastructure' appear together as a subset in these sets, in any order. Is there any straightforward way I can do this?
Use .loc to find if the keywords appear together.
Do this after you have split the data into 1240 sets. I don't understand whether you want to make new columns or just want to keep the columns as is.
# create a filter for keyword 1
filter_keyword_1 = (df['column_name'].str.contains('critical infrastructure'))
# create a filter for keyword 2
filter_keyword_2 = (df['column_name'].str.contains('computation theory'))
# you can create more filters with the same construction as above.
# To check the number of times both the keywords appear
len(df.loc[filter_keyword_1 & filter_keyword_2])
# To see the dataframe
subset_df = df.loc[filter_keyword_1 & filter_keyword_2]
.loc selects the rows where the condition holds. You can use subset_df = df[df['column_name'].str.contains('string')] if you have only one condition.
Do the column split, or any other processing, before you build the filters, or re-run the filters after processing.
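For example, with a made-up keyword column (the column name is illustrative):

import pandas as pd

df = pd.DataFrame({'Author Keywords': [
    'computation theory; critical infrastructure; smart grids',
    'critical infrastructure; resilience',
    'computation theory',
]})

filter_keyword_1 = df['Author Keywords'].str.contains('critical infrastructure')
filter_keyword_2 = df['Author Keywords'].str.contains('computation theory')

print(len(df.loc[filter_keyword_1 & filter_keyword_2]))
# 1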
Not sure if this is considered straightforward, but it works. keyword_list is the list of paired keywords you want to search.
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)
df.apply(lambda col: col.apply(lambda kw_set: all(kw in kw_set for kw in keyword_list))).sum().sum()
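For instance, with the two keywords from the question and two made-up rows:

import pandas as pd

df = pd.DataFrame({
    'Author Keywords': ['computation theory; critical infrastructure', 'resilience'],
    'Index Keywords': ['smart grids', 'computation theory; critical infrastructure'],
})
keyword_list = ['computation theory', 'critical infrastructure']

df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)
result = df.apply(lambda col: col.apply(lambda kw_set: all(kw in kw_set for kw in keyword_list)))
print(result.sum().sum())
# 2 -- the pair appears once in each column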
One of the columns in my pandas DataFrame contains very irregular strings. I want to remove everything except the coordinates. However, I cannot just use a replace or remove function, since the parts I want to remove differ from row to row. Is there a way of picking out just the part of the strings which I actually want to use?
One cell looks like this:
{'is_geometry': True, 'configuration': 'technologies', 'additional_translations': {}, 'key': 'Map', 'value': '{"type":"FeatureCollection","features":[{"type":"Feature","id":1549869006355,"geometry":{"type":"Point","coordinates":[67.91225703380735,34.69585762863356]},"properties":null}]}', 'map_url': '/en/technologies/view/technologies_1723/map/', 'template': 'raw'}
where the id and the map_url are always different. I would like to only have [67.91225703380735,34.69585762863356] in this example. Further, is there a way of turning the two values around in order that I have [34.69585762863356,67.91225703380735] instead?
I'm not sure exactly what you want, but assuming your dataframe's column contains dicts that are like your example, this should work:
import ast
import json
df['nums'] = (
    df.loc[df['tech_map'].notna(), 'tech_map']   # skip missing cells
      .astype(str).apply(ast.literal_eval)       # tolerate cells stored as dict strings
      .str['value'].apply(json.loads)            # the 'value' field is itself a JSON string
      .str['features'].str[0]                    # the first (only) feature
      .str['geometry'].str['coordinates']        # [longitude, latitude]
      .str[::-1]                                 # reversed, as requested
)
Two notes:
- The above is basically equivalent to doing json.loads(row['value'])['features'][0]['geometry']['coordinates'][::-1] for each row
- [::-1] reverses the list, which swaps the two coordinates the way you asked for
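Applied to the sample cell from the question (only the keys that matter are kept; the column name 'tech_map' follows the code above):

import json
import pandas as pd

df = pd.DataFrame({'tech_map': [{
    'is_geometry': True,
    'key': 'Map',
    'value': '{"type":"FeatureCollection","features":[{"type":"Feature","id":1549869006355,"geometry":{"type":"Point","coordinates":[67.91225703380735,34.69585762863356]},"properties":null}]}',
    'map_url': '/en/technologies/view/technologies_1723/map/',
}]})

# the plain-Python equivalent of the chained version
def coords(cell):
    value = json.loads(cell['value'])  # the 'value' field is a JSON string
    return value['features'][0]['geometry']['coordinates'][::-1]

print(df['tech_map'].apply(coords).iloc[0])
# [34.69585762863356, 67.91225703380735]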
I have a dataframe with lots of categories. Here list of some of them
Bank
(0827) ОСП
(0283) Банк ВТБ (ПАО)
(0822) ОСИП_ПЕНСЫ
(0260) АО Тинькофф Банк
(0755) ПАО Совкомбанк
I want to filter the dataframe based on string matching. I don't want to pass an entire row name; I want to pass something like ['Совкомбанк', 'Тинькофф']. The expected result is:
(0260) АО Тинькофф Банк
(0755) ПАО Совкомбанк
I tried df = df[df[column_name].isin(values)] but it didn't work.
.isin will check for exact match. What you are looking for is .str.contains:
match_strs = ['Совкомбанк', 'Тинькофф']
df = df[df[column_name].str.contains("|".join(match_strs))]
You can use custom regular expressions within str.contains(...) to search for whatever you want; if your strings may contain regex metacharacters, escape them with re.escape first.
If you want to pass the exact names, you have to clean up the Bank column first. Since the bank name is not always in the same position, test each whitespace-separated token for membership:
df[df['Bank'].str.split().apply(lambda tokens: any(t in values for t in tokens))]
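A quick end-to-end check on the sample rows from the question:

import pandas as pd

df = pd.DataFrame({'Bank': [
    '(0827) ОСП',
    '(0283) Банк ВТБ (ПАО)',
    '(0822) ОСИП_ПЕНСЫ',
    '(0260) АО Тинькофф Банк',
    '(0755) ПАО Совкомбанк',
]})
values = ['Совкомбанк', 'Тинькофф']

print(df[df['Bank'].str.contains('|'.join(values))])
# prints rows 3 and 4: (0260) АО Тинькофф Банк and (0755) ПАО Совкомбанк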
I'm new to Pandas. I want to take some strings returned from pandas series (a bunch of values under a column in a csv named 'lots') and put them in a set. To this end I wrote the following:
setbincsv_df = bincsv_df['lots'].apply(set)
print(setbincsv_df )
But the output resulting from that print statement takes a value in that series like "OP" and displays it as 136 {P, O}. Not only does it split the string apart, it also reverses it.
Bottom 5 items returned:
**"132 {I, F}"
"133 {E, F}"
"134 {W, I}"
"135 {V, H}"
"136 {P, O}"**
I'd expect it to return the value as it was in the series "OP". Why is this happening?
If you use apply, you are applying the set operation to the string in each row.
For example, if you have the word "pull":
print(set("pull"))
{'p','u','l'}
What you probably want is to call set on the series itself:
df = pd.DataFrame({'lots':['ai','cd','ai','drgf']})
print(set(df['lots']))
that outputs
{'cd', 'ai', 'drgf'}
Say I have two dataframes A and B, each containing two columns called x and y.
I want to join these two dataframes, not on rows where the x and y columns are equal across the two dataframes, but on rows where A's x column is a substring of B's x column, and likewise for y.
For example
if A[x][1]='mpla' and B[x][1]='mplampla'
I would want that to be captured.
On sql it would be something like:
select *
from A
join B
on A.x<=B.x and A.y<=B.y.
Can something like this be done in Python?
You can match a single string at a time against all the strings in one column, like this:
import numpy as np
np.char.find(B.x.values.astype(str), 'mpla') >= 0  # True where 'mpla' occurs anywhere
The problem with that is you'll have to loop over all elements of A. But if you can afford that, it should work.
See also: pandas + dataframe - select by partial string
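For example, a minimal sketch of that loop (frame and column names as in the question, data made up):

import numpy as np
import pandas as pd

A = pd.DataFrame({'x': ['mpla', 'foo'], 'y': ['ab', 'cd']})
B = pd.DataFrame({'x': ['mplampla', 'bar'], 'y': ['abab', 'xcdx']})

# one pass per value of A.x, each producing a boolean mask over B.x
for val in A.x:
    mask = np.char.find(B.x.values.astype(str), val) >= 0
    print(val, '->', B.x[mask].tolist())
# mpla -> ['mplampla']
# foo -> []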
you could try something like this. Note that str.contains takes a single pattern rather than a Series, so first collapse A.x into one regex alternation:
import re
pattern = '|'.join(map(re.escape, A.x))  # matches any single value of A.x
mask = B.x.str.contains(pattern)
B.x.where(~mask)  # this gives you the ones that don't match (matches become NaN)
You could also maybe try
np.where(mask, B.index, np.nan)  # B's index where it matches, NaN elsewhere
also you can try:
matchingframe = B[mask]            # rows of B whose x contains some value of A.x
matchingcolumn = B.loc[mask, 'x']  # just the x column of those rows
matchingindex = B.index[mask]      # just their index
Since the mask is computed directly on B, none of these require the two frames to share an index.
You want to look at the string methods: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
You'll also want to read up on regex and the pandas where method: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-where-method-and-masking
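If both frames are small enough, here is a hedged sketch of the SQL-style substring join itself: a cross join followed by a row filter (how='cross' needs pandas 1.2+; the suffixes are illustrative):

import pandas as pd

A = pd.DataFrame({'x': ['mpla'], 'y': ['ab']})
B = pd.DataFrame({'x': ['mplampla', 'other'], 'y': ['abab', 'abab']})

# every pairing of a row of A with a row of B, then keep substring matches
pairs = A.merge(B, how='cross', suffixes=('_A', '_B'))
result = pairs[pairs.apply(lambda r: r['x_A'] in r['x_B'] and r['y_A'] in r['y_B'], axis=1)]
print(result)
# keeps only the (mpla, ab) x (mplampla, abab) pairing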