How to calculate the overlapping dates in pyspark - python

I have data on users who have worked with multiple companies. Some users worked at more than one company during the same period. How can I aggregate the overall experience without counting the overlapping experience twice?
I have gone through some links but could not get the right solution. Any help will be appreciated.
EMP CSV DATA
fullName,Experience_datesEmployeed,Experience_expcompany,Experience_expduraation, Experience_position
David,Feb 1999 - Sep 2001, Foothill,2 yrs 8 mos, Marketing Assoicate
David,1994 - 1997, abc,3 yrs,Senior Auditor
David,Jun 2020 - Present, Fellows INC,3 mos,Director Board
David,2017 - Jun 2019, Fellows INC ,2 yrs,Fellow - Class 22
David,Sep 2001 - Present, The John D.,19 yrs, Manager
Expected output:
FullName,Total_Experience
David,24.8 yrs
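One common approach (a minimal sketch, not a full PySpark solution): parse each employment period into start and end dates, merge overlapping intervals per user, and sum the merged durations. The helper below is plain Python and assumes the date strings have already been parsed to pandas Timestamps, with "Present" mapped to a cutoff date; in Spark it could be applied per user, for example via groupBy(...).applyInPandas.
import pandas as pd

def total_experience_years(periods):
    """Sum the lengths of (start, end) periods after merging overlaps."""
    merged = []
    for start, end in sorted(periods, key=lambda p: p[0]):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    months = sum((e.year - s.year) * 12 + (e.month - s.month) for s, e in merged)
    return round(months / 12, 1)

# Hypothetical usage with already-parsed dates ("Present" mapped to Sep 2020)
periods = [
    (pd.Timestamp("1999-02-01"), pd.Timestamp("2001-09-01")),
    (pd.Timestamp("1994-01-01"), pd.Timestamp("1997-01-01")),
    (pd.Timestamp("2001-09-01"), pd.Timestamp("2020-09-01")),
]
print(total_experience_years(periods))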

Related

Calculating maximum/minimum changes using pandas

Suppose I had a data set containing job titles and salaries over the past three years and I want to go about calculating the difference in salary average from the first year to the last.
Using Pandas, how exactly would I go about doing that? I've managed to create a df with the average salaries for each year but I suppose what I'm trying to do is say "for data scientist, subtract the average salary from 2022 with the average salary from 2020" and iterate through all job_titles doing the same thing.
work_year job_title salary_in_usd
0 2020 AI Scientist 45896.000000
1 2020 BI Data Analyst 98000.000000
2 2020 Big Data Engineer 97690.333333
3 2020 Business Data Analyst 117500.000000
4 2020 Computer Vision Engineer 60000.000000
.. ... ... ...
93 2022 Machine Learning Scientist 141766.666667
94 2022 NLP Engineer 37236.000000
95 2022 Principal Data Analyst 75000.000000
96 2022 Principal Data Scientist 162674.000000
97 2022 Research Scientist 105569.000000
Create a function which does the thing you want on each group:
def first_to_last_year_diff(df):
    # Average salary in the latest year minus the earliest year
    diff = (
        df[df.work_year == df.work_year.max()].salary_in_usd.mean()
        - df[df.work_year == df.work_year.min()].salary_in_usd.mean()
    )
    return diff
Then group on job title and apply your function:
df.groupby("job_title").apply(first_to_last_year_diff)
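An equivalent, slightly more compact sketch (same idea, assuming one averaged row per year and job title): sort by year and take last minus first within each group.
out = (
    df.sort_values("work_year")
      .groupby("job_title")["salary_in_usd"]
      .agg(lambda s: s.iloc[-1] - s.iloc[0])
)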

Splitting single text column into multiple columns Pandas

I am working on extracting raw data from various sources. After some processing, I ended up with a dataframe that looks like this.
data
0 ₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16
1 ₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26
2 ₹ 26,00,000\n2016 - 44,000 km\nMercedes Benz C-Class Progressive C 220d, 2016, Diesel\nJAN 03
I want to split this raw dataframe into relevant columns, in the order they occur in the raw data: Price, Year, Mileage, Name, Date.
I have tried df.data.str.split('-', expand=True) with other delimiter options sequentially, along with some lambda functions, but haven't had much success.
I need assistance in splitting this data into relevant columns.
Expected output:
price year mileage name date
16,50,000 2014 49000 Jaguar 2.2 XF Luxury Jan-17
23,60,000 2017 28000 CLA CDI Style Nov-26
26,00,000 2016 44000 Mercedes C-Class C220d Jan-03
Try split on '\n' then on '-':
df[["Price", "Year-Mileage", "Name", "Date"]] = df.data.str.split('\n', expand=True)
df[["Year", "Mileage"]] = df["Year-Mileage"].str.split('-', expand=True)
df.drop(columns=["data", "Year-Mileage"], inplace=True)
print(df)
Price Name Date Year Mileage
0 ₹ 16,50,000 Jaguar XF 2.2 JAN 16 2014 49,000 km
2 ₹ 26,00,000 Mercedes Benz C-Class Progressive C 220d, 2016, Diesel JAN 03 2016 44,000 km
1 ₹ 23,60,000 Mercedes-Benz CLA 200 CDI Style, 2017, Diesel NOV 26 2017 28,000 km
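To get closer to the expected output you can strip the currency symbol, the " km" suffix, and the stray whitespace. A small follow-up sketch, using the column names produced above:
df["Price"] = df["Price"].str.replace("₹", "", regex=False).str.strip()
df["Year"] = df["Year"].str.strip().astype(int)
df["Mileage"] = (
    df["Mileage"]
    .str.replace(" km", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
    .astype(int)
)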

Extract date and sort rows by date

I have a dataset that includes some strings in the following forms:
Text
Jun 28, 2021 — Brendan Moore is p...
Professor of Psychology at University
Aug 24, 2019 — Chemistry (Nobel prize...
by A Craig · 2019 · Cited by 1 — Authors. ...
... 2020 | Volume 8 | Article 330Edited by:
I would like to create a new column that numbers the rows by their dates in ascending order, where a date exists.
To do so, I need to extract the part of the string that contains the date information from each row, if it exists.
Something like this:
Text Numbering
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1
All the rows not starting with a date (one that follows the format Jun 28, 2021 —) are assigned -1.
The first step would be to identify the pattern: xxx xx, xxxx;
then to transform the date string into a datetime (yyyy-mm-dd).
Once this date information is obtained, it needs to be converted into a numerical rank and sorted.
I am having difficulty with the last point, specifically with how to filter only the dates and sort them in an appropriate way.
The expected output would be
Text Numbering (sort by date asc)
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1
Mission accomplished:
import numpy as np
import pandas as pd

# Find rows that start with a date
matches = df['Text'].str.match(r'^\w+ \d+, \d{4}')
# Parse dates out of date rows
df['date'] = pd.to_datetime(df[matches]['Text'], format='%b %d, %Y', exact=False, errors='coerce')
# Assign numbering for dates
df['Numbering'] = df['date'].sort_values().groupby(np.ones(df.shape[0])).cumcount() + 1
# -1 for the non-dates
df.loc[~matches, 'Numbering'] = -1
# Cleanup
df.drop('date', axis=1, inplace=True)
Output:
>>> df
Text Numbering
0 Jun 28, 2021 - Brendan Moore is p... 2
1 Professor of Psychology at University -1
2 Aug 24, 2019 - Chemistry (Nobel prize... 1
3 by A Craig - 2019 - Cited by 1 - Authors. ... -1
4 ... 2020 | Volume 8 | Article 330Edited by: -1
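For the numbering step alone, a possibly simpler equivalent (a sketch, reusing the matches mask and the parsed date column from above before it is dropped) is to rank the matched dates directly:
df['Numbering'] = -1
df.loc[matches, 'Numbering'] = df.loc[matches, 'date'].rank(method='first').astype(int)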

How do I group two time interval columns into discernible months?

I have a housing market dataset categorized by U.S. counties with columns such as total_homes_sold. I'm trying to compare housing sales YoY (e.g. Jan 2020 vs. Jan 2019) and by county (e.g. Aberdeen Mar 2020 vs. Suffolk Mar 2020). However, I'm not sure how to group the dates, as they are not organized by calendar months (Jan, Feb, Mar, etc.) but rather by 4-week intervals: period_begin and period_end.
Intervals between years vary. The period_begin for Aberdeen (around Jan) for 2019 might be 1/7 to 2/3 but 1/6 to 2/2 for 2020 (image shown below).
I tried using count (code below) to label each 4-week period with a number, thinking I could compare Aberdeen 2017-1 with Aberdeen 2020-1 (1 being the first time interval), but realized that some years for some regions have more 4-week periods than others (2017 has 13 whereas 2018 has 14).
df['count'] = df.groupby((everyfourth['region_name'] != df['region_name'].shift(1)).cumsum()).cumcount() + 1
Any ideas on what code I could use to closely categorize these two columns into month-like periods?
Snippet of Dataset here
Let me know if you have any questions. Not sure I made sense! Thanks.
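One way to get month-like buckets (a sketch, assuming period_begin and period_end are parseable date strings and the dataset has region_name and total_homes_sold columns): label each 4-week period by the calendar month containing its midpoint, then compare the same month across years within each county.
import pandas as pd

df["period_begin"] = pd.to_datetime(df["period_begin"])
df["period_end"] = pd.to_datetime(df["period_end"])
midpoint = df["period_begin"] + (df["period_end"] - df["period_begin"]) / 2

# Calendar month of the midpoint (e.g. "Jan") plus the year, for YoY comparison
df["month"] = midpoint.dt.strftime("%b")
df["year"] = midpoint.dt.year

yoy = (
    df.groupby(["region_name", "month", "year"])["total_homes_sold"]
      .sum()
      .unstack("year")
)
A period that straddles two months gets assigned to whichever month contains most of it, which keeps the buckets comparable even when the interval boundaries shift by a few days between years.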

Python: Pandas does not separate columns when reading tabular text file

I have a text file like this:
PERSONAL INFORMATION
First Name: Michael
Last Name: Junior
Birth Date: May 17, 1999
Location: Whitehurst Hall 301. City: Stillwater. State: OK
Taken on July 8, 2000 10:50:30 AM MST
WORK EXPERIENCE
Work type select Part-time
ID number 10124
Company name ABCDFG Inc.
Positions Software Engineer/Research Scientist
Data Analyst/Scientist
As you can see, the first column contains the feature names and the second the values. I read it using this code:
import pandas as pd
import numpy as np
import scipy as sp
df=pd.read_table('personal.txt',skiprows=1)
pd.set_option('display.max_colwidth',10000)
pd.set_option('display.max_rows',1000)
df
But it merges columns and outputs:
PERSONAL INFORMATION
0 First Name: Michael
1 Last Name: Junior
2 Birth Date: May 17, 1999
3 Location: Whitehurst Hall 301. City: Stillwater. State: OK
4 Taken on July 8, 2000 10:50:30 AM MST
5 WORK EXPERIENCE
6 Work type select Part-time
7 ID number 10124
8 Company name ABCDFG Inc.
9 Positions Software Engineer/Research Scientist
10 Data Analyst/Scientist
I also need to skip the section titles PERSONAL INFORMATION and WORK EXPERIENCE. How can I read the file so that the result comes out properly in two columns?
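One rough sketch (not a definitive parser): since some lines use "Name: value" while others separate the field name from the value with only a space, read the file line by line, skip the all-caps section titles, split on ": " where present, and match the remaining lines against a hand-maintained list of known field names. The field list and the handling of the "Taken on ..." line below are assumptions based on the sample shown.
import pandas as pd

# Hypothetical list of field names for lines that have no ":" delimiter
known_fields = ["Work type select", "ID number", "Company name", "Positions"]

rows = []
with open("personal.txt") as fh:
    for line in fh:
        line = line.strip()
        if not line or line.isupper():        # skip blanks and section titles
            continue
        if line.startswith("Taken on"):       # skip the timestamp line (assumption)
            continue
        if ": " in line:                      # "First Name: Michael" style
            key, value = line.split(": ", 1)
            rows.append([key, value])
            continue
        for field in known_fields:            # "Company name ABCDFG Inc." style
            if line.startswith(field):
                rows.append([field, line[len(field):].strip()])
                break
        else:
            # No known field matched: treat as a continuation of the previous value
            if rows:
                rows[-1][1] += "; " + line

df = pd.DataFrame(rows, columns=["feature", "value"])
print(df)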
