Regular expression to rename columns by stripping text from the column name - python

I have a df with many columns, and each column has repeated values because it is survey data. As an example, my data looks like this:
df:

   Q36r9: sales platforms - Before purchasing a new car  Q36r32: Advertising letters - Before purchasing a new car
0  Not Selected                                          Selected
So I want to strip the text from the column names. For example, from the first column I want to get the text between ":" and "-", so it should become "sales platforms". And in the second part I want to convert the values of each column: "Selected" should be replaced with the name of the column and "Not Selected" with NaN.
so the desired output would look like this:

   sales platforms  Advertising letters
0  NaN              Advertising letters
Edited: Another problem, if I have a column name like:
Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
I just want to get the text between ":" and the first "-". It should extract "WeChat".

IIUC, we can take advantage of some regex and greedy matching using .*, which matches everything between a defined pattern:
import re

# greedy: capture everything between the first ':' and the last '-'
df.columns = [re.search(':(.*)-', i).group(1) for i in df.columns.str.strip()]
print(df)

  sales platforms  Advertising letters
0    Not Selected                 None
Edit:
To avoid greedy matching we can use +?
+? quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
Given columns like:

Q36r9: sales platforms - Before purchasing a new car
Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
import re

# lazy: capture everything between ':' and the *first* '-'
[re.search(':(.+?)-', i).group(1).strip() for i in df.columns]
['sales platforms', 'WeChat']
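For the question's second part (replacing "Selected" with the column name and "Not Selected" with NaN), a minimal sketch, assuming df already carries the cleaned column names:

import numpy as np

# keep the column's own name where the answer was "Selected",
# put NaN everywhere else ("Not Selected")
for col in df.columns:
    df[col] = np.where(df[col].eq('Selected'), col, np.nan)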

Related

how to spot delete spaces in pandas column

I have a dataframe with a column location which looks like this:
In the worst case there are 5 spaces in the location column, but there are a lot more cells with 3 and 4 spaces, while the most common case is just two spaces: between the city and the state, and between the state and the post-code.
I need to perform str.split() on the location column, but due to the differing number of spaces it will not work, because if I substitute spaces with an empty string or with commas, I'll get a different number of potential splits.
So I need to find a way to turn the spaces that are inside city names into hyphens, so that I can do the split later, but at the same time not touch the other spaces (between city and state, and between state and post-code). Any ideas?
I have written this code with ease of understanding/readability in mind. One way to solve the above query is to split the location column into city & state first, perform the operation on city, and merge it back with state.
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})
# split on the comma: everything before it is the city,
# everything after it is the state + post-code
df[['city', 'state']] = df['location'].str.split(",", expand=True)
# only the spaces inside the city name are left, so replace them all
df['city'] = df['city'].str.replace(" ", '_')
df['location_new'] = df['city'] + ',' + df['state']
df.head()
The final output will look like this, with the required output in column location_new:

                               location                        city       state                          location_new
0        Cape May Court House, NJ 08210        Cape_May_Court_House    NJ 08210        Cape_May_Court_House, NJ 08210
1  Van Buron Charter Township, MI 48111  Van_Buron_Charter_Township    MI 48111  Van_Buron_Charter_Township, MI 48111
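As an alternative sketch (my own assumption, not part of the answer above), a single regex replace can target only the spaces that still have a comma ahead of them, i.e. the spaces inside the city name:

# replace any space that still has a comma somewhere ahead of it;
# spaces after the comma (city/state and state/post-code) are untouched
df['location_new'] = df['location'].str.replace(r' (?=[^,]*,)', '_', regex=True)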

Pandas Multi-index: sort second column by frequency of values

I have a data frame which consists of a series of SVO triples extracted from thousands of texts. For the particular dimension I am seeking to explore, I have reduced the SVO triples strictly to those using the pronouns he, she, and I. So, the data frame looks something like this:
subject  verb       object
he       contacted  parks
i        mentioned  dog
i        said       that
i        said       ruby
she      worked     office
she      had        dog
he       contact    company
While I can use df_.groupby(["subject"]).count() to give me totals for each of the pronouns, what I would like to do is to group by the first column, the subject, and then sort the second column, the verb, by the most frequently occurring verbs, such that I had a result that looked something like:
subject  verb       count [of verb]
i        said       2
i        mentioned  1
...      ...        ...
Is it possible to do this within the pandas dataframe or do I need to use the dictionary of lists output by df_.groupby("subject").groups?
I'm getting stuck on how to value_count the second column, I think.
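A minimal sketch of one way to get there, assuming the frame is named df_ as in the question: count each (subject, verb) pair with a grouped value_counts, then sort the counts within each subject:

# count each (subject, verb) pair, then sort verbs by frequency per subject
counts = (df_.groupby('subject')['verb']
             .value_counts()
             .rename('count')
             .reset_index()
             .sort_values(['subject', 'count'], ascending=[True, False]))
print(counts)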

pandas.Series.str.contains - how to stop at first match?

I have a df with two columns (company_name and sales).
The company_name column includes the name of the company plus a short description (e.g. company X - medical insurance; company Y - travel and medical insurance; company Z - medical and holiday insurance etc.)
I want to add a third column with a binary classification (medical_insurance or travel_insurance) based on the first matching string value included in the company_name.
I have tried using str.contains, but when matching words from different groups are present in the company_name column (e.g., medical and travel), str.contains doesn't necessarily classify by the first instance (which is what I need).
medical_focused = df.loc[df['company_name'].str.contains(
    'medical|hospital', flags=re.IGNORECASE, na=False), 'classification'] = 'medical_focused'
travel_focused = df.loc[df['company_name'].str.contains(
    'travel|holiday', flags=re.IGNORECASE, na=False), 'classification'] = 'travel_focused'
How can I force str.contains to stop at the first instance?
Thanks!
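One possible approach (a sketch, not from the thread; the keyword_to_class mapping below is hypothetical): Series.str.extract returns the leftmost match of an alternation, so whichever keyword appears first in the string wins, and that keyword can then be mapped to its group:

import re

# map each keyword to its classification group (hypothetical names)
keyword_to_class = {
    'medical': 'medical_insurance', 'hospital': 'medical_insurance',
    'travel': 'travel_insurance', 'holiday': 'travel_insurance',
}
pattern = '|'.join(keyword_to_class)  # 'medical|hospital|travel|holiday'

# str.extract keeps only the first (leftmost) match in each string
first = df['company_name'].str.extract(f'({pattern})',
                                       flags=re.IGNORECASE, expand=False)
df['classification'] = first.str.lower().map(keyword_to_class)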

Need to split data into multiple columns based on character length of each row, using Python

df1.head(1)

                              Airline_data
0  CAK ATL 114.47 528 424.56 FL 70.19 ...
The above column named "Airline_data" contains all the information combined into a single column.
This has to be split into multiple columns like ("City1", "City2", "Average Fare", etc.) based on the string index positions below:
Column name : Section of original column to be split
City1 : 1-3
City2 : 5-7
Average Fare : 11-17
and so on.
PLEASE NOTE: Simply splitting based on blank spaces won't work here.
I think the most intuitive way is to apply str.extract to the column of interest (in your case 0).
In order to have proper output column names, use named capturing groups of the respective sizes.
To capture the "wanted" fields, put between them either a space or a dot (matching any char) with the respective repetition count.
So for your example's 3 columns, run:
df[0].str.extract(r'(?P<City1>.{3}) (?P<City2>.{3}) {3}(?P<Average_Fare>.{7})')
Note: the name of a named capturing group cannot include any spaces, so I put "_" instead. If you want to get rid of these underscores, just rename the respective columns.
In the final version, to capture all the remaining columns, add the respective capturing groups to the regex above.
Or, if you have the source as a text file, read it with read_fwf. It reads fixed-width fields (for details see the documentation).
This variant is even better in one detail: read_fwf by default converts input columns to appropriate types (e.g. int or float), whereas str.extract generates only text columns (if you need other types, you have to cast the required columns on your own).
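A minimal sketch of the read_fwf variant, assuming the rows live in a file named airline.txt (the filename is hypothetical; colspecs are 0-based and end-exclusive, derived from the 1-based positions in the question):

import pandas as pd

# 1-based 1-3, 5-7, 11-17 become 0-based, end-exclusive (0, 3), (4, 7), (10, 17)
df = pd.read_fwf('airline.txt',
                 colspecs=[(0, 3), (4, 7), (10, 17)],
                 names=['City1', 'City2', 'Average Fare'])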

Searching for an item within a list in a column and saving that item to a new column

I am very new to Python and need help!
I want to search a column of a data frame for an item in a list and, if found, store that item in a new column. My location column is messy, and I am trying to extract a state abbreviation if there is one.
So far I have been able to find the rows where the search terms are found (I'm not sure if this is 100% correct); how would I take the search term that was found and store it in a new column?
state_search = ('CO', 'CA', 'WI', 'VA', 'NY', 'PA', 'MA', 'TX')
pattern = '|'.join(state_search)
state_jobs_df = jobs_data_df.loc[jobs_data_df['location'].str.contains(pattern), :]
I want to take the state that was found and store that in a new 'state' column. Thanks for any help.
print(jobs_data_df)

                                            location
0                                  Madison, WI 53702
1  Senior Training Leader located in Raynham, MA ...
2                                           Dixon CA
3                   Camphill, PA Weekends and nights
4          Charlottesville, VA Some travel required
5                                        Houston, TX
6                                   Denver, CO 80215
7  Respiratory Therapy Primary Location : TX- Som...
Use Series.str.extract with word boundaries and keep only the non-missing rows with Series.notna or DataFrame.dropna:
# add word boundaries so the abbreviations only match as whole words
pat = '|'.join(r"\b{}\b".format(x) for x in state_search)
jobs_data_df['state'] = jobs_data_df['location'].str.extract('(' + pat + ')', expand=False)
jobs_data_df = jobs_data_df[jobs_data_df['state'].notna()]
Or:
jobs_data_df = jobs_data_df.dropna(subset=['state'])
It's a bit hack-y, but a simpler solution might take a form similar to:
states = []
for row in jobs_data_df['location']:
    found = None
    for state in state_search:
        if state in row:
            found = state  # keep the first matching state
            break          # break only the inner loop
    states.append(found)
jobs_data_df['state'] = states
It's probably helpful to think about how the underlying program would have to approach the problem (checking each row for a string that matches one of your states, then doing something with it), and go at that directly. Unless you're dealing with a huge load of data, it may not be worth going crazy fancy with regular expressions or the like.
