Capture date from Excel sheet with RegEx - python

I know capturing a date is usually a simple enough RegEx task, but I need this to be so specific that I'm struggling.
1 SUSTAINABLE HARVEST SECTOR | QUOTA LISTING JUN 11 2013
2 QUOTA
3 TRADE ID AVAILABLE STOCK AMOUNT PRICE
4 130196 COD GBW 10,000 $0.60
5 130158 HADDOCK GBE 300 $0.60
That is what the beginning of my Excel spreadsheet looks like, and what hundreds more look like, with the date and the data changing but the format staying the same.
My thought was to capture everything that follows LISTING up until the newline, then place the non-numeric part (JUN) in my Trade Month column, the first captured number (11) in my Trade Day column, and the last captured number (2013) in my Trade Year column, but I can't figure out how. Here's what I have so far:
import re
import pandas as pd

pattern = re.compile(r'Listing(.+?)(?=\n)')
df = pd.read_excel(file_path)
print("df is:", df)
a = pattern.findall(str(df))
print("a:", a)
but that returns nothing. Any help solving this problem, which I know is probably super simple, is appreciated. Thanks.

Make your expression case insensitive (i.e. LISTING != Listing):
pattern = re.compile(r'Listing(.+?)(?=\n)', re.IGNORECASE)
Besides, in this situation a lookahead for a newline is equivalent to the simpler expression:
pattern = re.compile(r'Listing(.+)', re.IGNORECASE)
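If you also want to split the match into the three columns from the question, here is a minimal sketch (assuming, as in the original attempt, that the header text survives str(df); the Trade Month/Day/Year column names come from the question):

import re
import pandas as pd

df = pd.read_excel(file_path)  # file_path as in the question
m = re.search(r'LISTING\s+([A-Z]+)\s+(\d{1,2})\s+(\d{4})', str(df), re.IGNORECASE)
if m:
    # broadcast the three captured scalars into new columns
    df['Trade Month'], df['Trade Day'], df['Trade Year'] = m.groups()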

Related

How to isolate part of string in pandas dataframe

I have a dataframe containing a column of strings. I want to take out a part of each string in each row, namely the year, then create a new column and assign the year to it. My problem is isolating the right part of the string. An example could be: 'TON GFR 2018 N'. For this string I could extract the year by running one of the following (note that I want to isolate 18, not 2018):
new_data['Year'] = pd.DataFrame([str(ele[1])[:2] for ele in list(new_data['Name'].str.split('20'))])
new_data['Year'] = new_data['Name'].str.split('20').str[1]
new_data['Year'] = new_data['Year'].str[:2]
However, I also encounter names like 'TON RO20 2018 N' or 'TON 2020 N', and then it does not work. I also encounter a varying number of spaces in different rows of the dataframe, so counting the number of spaces in the string does not work either.
Any smart solutions to my problem?
Use .str.extract() to extract 4 digits string starting with 20 and get the last 2 digits, as follows:
new_data['Year'] = new_data['Name'].str.extract(r'20(\d\d)')
If you want to ensure the 4-digit string is not part of a longer string/number, you can further use regex meta-character \b (word boundary) to enclose the target strings, as follows:
new_data['Year'] = new_data['Name'].str.extract(r'\b20(\d\d)\b')
Demo
Input data:
print(new_data)
              Name
0   TON GFR 2018 N
1  TON RO20 2018 N
2       TON 2020 N
Result:
print(new_data)
              Name Year
0   TON GFR 2018 N   18
1  TON RO20 2018 N   18
2       TON 2020 N   20
If the year is always at the same distance from the end, you could use:
new_data["Year"] = new_data["Name"].str.slice(start=-4, stop=-2)

Regular expression to rename the column by stripping the column name

I have a df with many columns, and each column has repeated values because it's survey data. As an example, my data look like this:
df:
Q36r9: sales platforms - Before purchasing a new car Q36r32: Advertising letters - Before purchasing a new car
Not Selected Selected
So I want to strip the text from the column names. For example, from the first column I want the text between ":" and "-", i.e. "sales platforms". For the second part, I want to convert the values of each column: "Selected" should be replaced with the name of the column and "Not Selected" with NaN.
so desired output would be like this:
sales platforms Advertising letters
NaN Advertising letters
Edit: another problem is if I have a column name like:
Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
If I just want to get what is between the first ":" and the first "-", it should extract "WeChat".
IIUC, we can take advantage of regex and greedy matching using .*, which matches everything between a defined pattern:
import re
df.columns = [re.search(':(.*)-',i).group(1) for i in df.columns.str.strip()]
print(df.columns)
sales platforms Advertising letters
0 Not Selected None
Edit: with lazy (non-greedy) matching we can use +? instead:
+? Quantifier — Matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
Given a df with these two columns (0 and 1):
Q36r9: sales platforms - Before purchasing a new car
Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
import re
[re.search(':(.+?)-',i).group(1).strip() for i in df.columns]
['sales platforms', 'WeChat']
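The question also asks to replace "Selected" with the column's own name and "Not Selected" with NaN; that part is not covered above, so here is a minimal sketch (assuming df has already been renamed as shown):

for col in df.columns:
    # keep only "Selected" cells (everything else becomes NaN), then swap in the column name
    df[col] = df[col].where(df[col] == 'Selected').replace('Selected', col)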

Panel Data Research & Development Capitalisation

I am working with a panel data containing many companies' research and development expenses throughout the years.
What I would like to do is capitalise these expenses as if they were assets. For those not familiar with financial terminology, I am trying to accumulate each year's R&D expense with those of the following years, decaying (or "depreciating") its value every time period by the corresponding depreciation rate.
The dataframe looks something like this:
fyear tic rd_tot rd_dep
0 1979 AMFD 1.345 0.200
1 1980 AMFD 0.789 0.200
.. .. .. .. ..
211339 2017 ACA 3.567 0.340
211340 2018 ACA 2.990 0.340
211341 2018 CTRM 0.054 0.234
Where fyear is the fiscal year, tic is the company specific letter code, rd_tot is the total R&D expenditure for the year and rd_dep is the applicable depreciation rate.
So far I was able to come up with this:
df['r&d_capital'] = [(df['rd_tot'].iloc[:i] * (1 - df['rd_dep'].iloc[:i] * np.arange(i)[::-1])).sum() for i in range(1, len(df) + 1)]
However, the problem is that the code just runs through the entire column without taking into consideration that the R&D expense needs to be capitalised in a company-specific (or tic-specific) way. I also tried using .groupby('tic'), but it did not work.
Therefore, I am looking for help with this problem, so that I can capitalise each year's R&D expenses in a COMPANY-SPECIFIC way.
Thank you very much for your help!
This solution breaks the initial dataframe into separate ones (one for each 'tic' group), and applies the r&d capital calculation formula on each df.
Finally, we re-construct the dataframe using pd.concat.
tic_dfs = [tic_group for _, tic_group in df.groupby('tic')]
for tic_df in tic_dfs:
    tic_df['r&d_capital'] = [
        (tic_df['rd_tot'].iloc[:i] * (1 - tic_df['rd_dep'].iloc[:i] * np.arange(i)[::-1])).sum()
        for i in range(1, len(tic_df) + 1)
    ]
result = pd.concat(tic_dfs).sort_index()
Note: "_" is a placeholder for the group name, e.g. "ACA", "AMFD", etc., while tic_group is the actual data body.
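The same per-company calculation can also be written with groupby().apply(), which handles the splitting and re-assembly for you; a sketch under the same assumptions as above:

import numpy as np

def capitalise(group):
    # same formula as above, applied within one company's rows
    group = group.copy()
    group['r&d_capital'] = [
        (group['rd_tot'].iloc[:i] * (1 - group['rd_dep'].iloc[:i] * np.arange(i)[::-1])).sum()
        for i in range(1, len(group) + 1)
    ]
    return group

result = df.groupby('tic', group_keys=False).apply(capitalise).sort_index()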

Searching for an item within a list in a column and saving that item to a new column

I am very new to Python and need help!
I want to search a column of a data frame for items from a list and, if one is found, store that item in a new column. My location column is messy, and I am trying to extract a state abbreviation if there is one.
So far I have been able to find the rows where the search terms occur (I'm not sure if this is 100% correct). How would I take the search term that was found and store it in a new column?
state_search=('CO', 'CA', 'WI', 'VA', 'NY', 'PA', 'MA', 'TX',)
pattern = '|'.join(state_search)
state_jobs_df=jobs_data_df.loc[jobs_data_df['location'].str.contains(pattern), :]
I want to take the state that was found and store that in a new 'state' column. Thanks for any help.
print (jobs_data_df)
location
0 Madison, WI 53702
1 Senior Training Leader located in Raynham, MA ...
2 Dixon CA
3 Camphill, PA Weekends and nights
4 Charlottesville, VA Some travel required
5 Houston, TX
6 Denver, CO 80215
7 Respiratory Therapy Primary Location : TX- Som...
Use Series.str.extract with word boundaries and filter out rows with missing values using Series.notna or DataFrame.dropna:
pat = '|'.join(r"\b{}\b".format(x) for x in state_search)
jobs_data_df['state'] = jobs_data_df['location'].str.extract('('+ pat + ')', expand=False)
jobs_data_df = jobs_data_df[jobs_data_df['state'].notna()]
Or:
jobs_data_df = jobs_data_df.dropna(subset=['state'])
It's a bit hack-y, but a simpler solution might take a form similar to:
for row in dataRows:  # dataRows: however you iterate over your location strings
    for state in state_search:
        if state in row:
            # put state in the correct column here
            break  # break exits only this inner loop
It's probably helpful to think about how the underlying program would have to approach the problem (checking each row for a string that matches one of your states, then doing something with it) and to go at that directly. Unless you're dealing with a huge load of data, it may not be worth going crazy fancy with regular expressions or the like.
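A runnable version of that idea as a sketch (find_state is a hypothetical helper; the plain substring check is cruder than the word-boundary regex above):

def find_state(location):
    # return the first matching abbreviation, or None if no state is found
    for state in state_search:
        if state in location:  # crude substring check; may produce false positives
            return state
    return None

jobs_data_df['state'] = jobs_data_df['location'].apply(find_state)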

Python data scraping from string

I have lines of text, each containing multiple variables that correspond to a specific entry.
I have been trying to use regular expressions, such as the one below, with mixed success (the lines are quite standardised but do contain typos and inconsistencies):
re.compile('matching factor').findall(input)
I was wondering what the best way to approach this case is, what data structures to use, and how to loop through multiple lines of text. Here is a sample of the text; the data I would like to scrape is listed below:
CHINA: National Grain Trade Centre: in auction of state reserves, govt. sold 70,418 t wheat (equivalent to 3.5% of total volume offered) at an average price of CNY2,507/t ($378.19) and 4,359 t maize (4.7%), at an average price of CNY1,290/t ($194.39). Separately, sold 2,100 t of 2013 wheat imports (1.5%) at CNY2,617/t ($394.25). 23 Oct
I am interested in creating a data set containing variables such as:
VOLUME - COMMODITY - PERCENTAGE SOLD - PRICE - DATE
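No answer was recorded for this question, but as a starting point, here is a minimal sketch with an assumed pattern fitted only to the sample line above; it pulls the volume, commodity, percentage, and price of each sale, plus the trailing date:

import re

line = ("CHINA: National Grain Trade Centre: in auction of state reserves, govt. "
        "sold 70,418 t wheat (equivalent to 3.5% of total volume offered) at an "
        "average price of CNY2,507/t ($378.19) and 4,359 t maize (4.7%), at an "
        "average price of CNY1,290/t ($194.39). Separately, sold 2,100 t of 2013 "
        "wheat imports (1.5%) at CNY2,617/t ($394.25). 23 Oct")

# volume, commodity, percentage, price; fitted to the sample wording only
sale = re.compile(r'([\d,]+)\s+t\s+(\w+)\s+\((?:equivalent to\s+)?([\d.]+)%.*?'
                  r'price of CNY([\d,]+)/t')
date = re.search(r'(\d{1,2}\s+\w{3})\s*$', line)

rows = [(vol, com, pct, price, date.group(1) if date else None)
        for vol, com, pct, price in sale.findall(line)]
print(rows)
# [('70,418', 'wheat', '3.5', '2,507', '23 Oct'), ('4,359', 'maize', '4.7', '1,290', '23 Oct')]
# note: the "2,100 t of 2013 wheat imports" sale does not fit this pattern and is skipped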
