Python: Pandas does not separate columns when reading tabular text file

I have a text file like this:
PERSONAL INFORMATION
First Name: Michael
Last Name: Junior
Birth Date: May 17, 1999
Location: Whitehurst Hall 301. City: Stillwater. State: OK
Taken on July 8, 2000 10:50:30 AM MST
WORK EXPERIENCE
Work type select Part-time
ID number 10124
Company name ABCDFG Inc.
Positions Software Engineer/Research Scientist
Data Analyst/Scientist
As you can see, the first column contains the feature names and the second column the values. I read it using this code:
import pandas as pd
import numpy as np
import scipy as sp

df = pd.read_table('personal.txt', skiprows=1)
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_rows', 1000)
df
But it merges columns and outputs:
PERSONAL INFORMATION
0 First Name: Michael
1 Last Name: Junior
2 Birth Date: May 17, 1999
3 Location: Whitehurst Hall 301. City: Stillwater. State: OK
4 Taken on July 8, 2000 10:50:30 AM MST
5 WORK EXPERIENCE
6 Work type select Part-time
7 ID number 10124
8 Company name ABCDFG Inc.
9 Positions Software Engineer/Research Scientist
10 Data Analyst/Scientist
I need to skip the section titles PERSONAL INFORMATION and WORK EXPERIENCE as well. How can I read the file so that the result comes out properly in two columns?
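No answer is shown here, but one common approach is a minimal sketch like the following: since the delimiters are inconsistent, skip pd.read_table and parse the lines yourself, splitting each line on its first ": " and dropping the all-caps section titles. Lines without a colon-space separator (e.g. "Work type select Part-time" or the "Taken on ..." timestamp) would need rules of their own and are simply skipped in this sketch.

import pandas as pd

rows = []
with open('personal.txt') as f:
    for line in f:
        line = line.strip()
        # Drop blank lines and all-caps section titles such as
        # "PERSONAL INFORMATION" and "WORK EXPERIENCE"
        if not line or line.isupper():
            continue
        # Split on the first ": " only, so values that themselves
        # contain colons (e.g. the Location line) stay intact
        if ': ' in line:
            name, value = line.split(': ', 1)
            rows.append((name, value))

df = pd.DataFrame(rows, columns=['feature', 'value'])
print(df)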

Related

Calculating maximum/minimum changes using pandas

Suppose I have a data set containing job titles and salaries over the past three years, and I want to calculate the difference in average salary from the first year to the last.
Using Pandas, how exactly would I go about doing that? I've managed to create a df with the average salaries for each year, but what I'm trying to do is say "for Data Scientist, subtract the 2020 average salary from the 2022 average salary" and iterate through all job_titles doing the same thing.
work_year job_title salary_in_usd
0 2020 AI Scientist 45896.000000
1 2020 BI Data Analyst 98000.000000
2 2020 Big Data Engineer 97690.333333
3 2020 Business Data Analyst 117500.000000
4 2020 Computer Vision Engineer 60000.000000
.. ... ... ...
93 2022 Machine Learning Scientist 141766.666667
94 2022 NLP Engineer 37236.000000
95 2022 Principal Data Analyst 75000.000000
96 2022 Principal Data Scientist 162674.000000
97 2022 Research Scientist 105569.000000
Create a function which does the thing you want on each group:
def first_to_last_year_diff(df):
    # average salary in the latest year minus the earliest year;
    # .mean() reduces each side to a scalar
    diff = (
        df[df.work_year == df.work_year.max()].salary_in_usd.mean()
        - df[df.work_year == df.work_year.min()].salary_in_usd.mean()
    )
    return diff
Then group on job title and apply your function:
df.groupby("job_title").apply(first_to_last_year_diff)
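If you prefer to avoid apply, the same numbers can be cross-checked by pivoting years into columns. This sketch assumes, as in the printout above, one averaged row per (job_title, work_year) pair:

wide = df.pivot(index='job_title', columns='work_year', values='salary_in_usd')
diff = wide[wide.columns.max()] - wide[wide.columns.min()]   # NaN where a title lacks a year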

How to calculate the overlap date in pyspark

I have data on users who have worked for multiple companies. Some users worked at more than one company at the same time. How can I aggregate the overall experience without counting the overlapping experience twice?
I have gone through some related links but could not find the right solution. Any help will be appreciated.
EMP CSV DATA
fullName,Experience_datesEmployeed,Experience_expcompany,Experience_expduraation, Experience_position
David,Feb 1999 - Sep 2001, Foothill,2 yrs 8 mos, Marketing Assoicate
David,1994 - 1997, abc,3 yrs,Senior Auditor
David,Jun 2020 - Present, Fellows INC,3 mos,Director Board
David,2017 - Jun 2019, Fellows INC ,2 yrs,Fellow - Class 22
David,Sep 2001 - Present, The John D.,19 yrs, Manager
Expected output:
FullName,Total_Experience
David,24.8 yrs
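No answer is shown here, but the usual approach is an interval merge: sort each user's employment intervals by start date, fuse any that overlap, and sum the fused lengths. Below is a minimal sketch in plain pandas (not pyspark), with the date parsing left out, David's intervals entered by hand at month precision, and "Present" pinned to Sep 2020:

import pandas as pd

# Hypothetical pre-parsed intervals for David; parsing strings like
# "Feb 1999 - Sep 2001" into timestamps is left out of this sketch
iv = pd.DataFrame({
    'start': pd.to_datetime(['1994-01', '1999-02', '2001-09', '2017-01', '2020-06']),
    'end':   pd.to_datetime(['1997-01', '2001-09', '2020-09', '2019-06', '2020-09']),
}).sort_values('start')

total = pd.Timedelta(0)
cur_start = cur_end = None
for row in iv.itertuples():
    if cur_end is None or row.start > cur_end:
        # gap: close out the previous merged block
        if cur_end is not None:
            total += cur_end - cur_start
        cur_start, cur_end = row.start, row.end
    else:
        # overlap: extend the current merged block
        cur_end = max(cur_end, row.end)
total += cur_end - cur_start

print(round(total.days / 365.25, 1), 'yrs')   # ~24.6 under these assumptions

The same merge can be expressed per user in pyspark with a window ordered by start date, comparing each start against the running maximum of the previous ends.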

Conditional copy of values from one column to another columns

I have a pandas dataframe that looks something like this:
name job jobchange_rank date
Thisguy Developer 1 2012
Thisguy Analyst 2 2014
Thisguy Data Scientist 3 2015
Anotherguy Developer 1 2018
The jobchange_rank represents each individual's (based on name) ranked change in position, where rank 1 represents his/her first position, rank 2 his/her second position, etc.
Now for the fun part. I want to create a new column where I can see a person's previous job, something like this:
name job jobchange_rank date previous_job
Thisguy Developer 1 2012 None
Thisguy Analyst 2 2014 Developer
Thisguy Data Scientist 3 2015 Analyst
Anotherguy Developer 1 2018 None
I've created the following code to get the "None" values where there was no job change:
df.loc[df['jobchange_rank'].sub(df['jobchange_rank'].min()) == 0, 'previous_job'] = 'None'
Sadly, I can't seem to figure out how to get the values from the other column where the needed condition applies.
Any help is more than welcome!
Thanks in advance.
This answer assumes that your DataFrame is sorted by name and jobchange_rank, if that is not the case, sort first.
# df = df.sort_values(['name', 'jobchange_rank'])
m = df['name'].eq(df['name'].shift())
df['job'].shift().where(m)
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Or using a groupby + shift (assuming at least sorted by jobchange_rank)
df.groupby('name')['job'].shift()
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Although the groupby + shift is more concise, on larger inputs that are already sorted like your example it may be faster to avoid the groupby and use the first solution.
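For completeness, attaching the result as the requested column, with the string 'None' for the first job in each group as in the expected output:

df['previous_job'] = df.groupby('name')['job'].shift().fillna('None')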

Creating a user-input filters on csv file that contains large data

I have a program that opens and reads a file in csv format containing a large amount of data, such as:
State Crime type Occurrences Year
CALIFORNIA ROBBERY 12 1999
CALIFORNIA ASSAULT 45 2003
NEW YORK ARSON 9 1999
CALIFORNIA ARSON 21 2000
TEXAS THEFT 30 2000
OREGON ASSAULT 10 2001
I need to create 3 filters by user input. For example:
Enter State:
Enter Crime Type:
Enter Year:
If I enter:
Enter State: CALIFORNIA
Enter Crime: ASSAULT
Enter Year: 2003
Crime Report
State Crime type Occurrences Year
CALIFORNIA ASSAULT 45 2003
This needs to happen.
I have no clue how to tackle this problem. I was only able to open and read the csv data file into a table in Python that just prints out every line. However, I need to incorporate a search filter to narrow the result, as shown above. Is anyone familiar with this? Thank you all for your help.
The Pandas library in Python allows you to view and manipulate csv data. The following solution imports the pandas library, reads the csv with the read_csv() function, and loads it into a dataframe. It then asks for the input values, keeping in mind that State and Crime should be string values cast as str while Year should be an integer cast as int, and finally applies a simple query to filter the results from the dataframe. The query requires all three conditions to be met and tolerates lowercase input strings.
In [125]: import pandas as pd
In [126]: df = pd.read_csv('test.csv')
In [127]: df
Out[127]:
State Crime type Occurrences Year
0 CALIFORNIA ROBBERY 12 1999
1 CALIFORNIA ASSAULT 45 2003
2 NEW YORK ARSON 9 1999
In [128]: state = str(input("Enter State: "))
Enter State: California
In [129]: crime_type = str(input("Enter Crime Type: "))
Enter Crime Type: robbery
In [130]: year = int(input("Enter Year: "))
Enter Year: 1999
In [131]: df.loc[lambda x:(x['State'].str.lower().str.contains(state.lower()))
...: & (x['Crime type'].str.lower().str.contains(crime_type.lower())) & (x
...: ['Year'] == year)]
Out[131]:
State Crime type Occurrences Year
0 CALIFORNIA ROBBERY 12 1999
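One caveat: str.contains interprets its argument as a regular expression by default, so user input containing characters such as '.' or '(' can misfire. Passing regex=False, or comparing with == when an exact match is wanted, is safer:

mask = (df['State'].str.lower() == state.lower()) \
     & (df['Crime type'].str.lower() == crime_type.lower()) \
     & (df['Year'] == year)
df[mask]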

Two Sets of data in CSV file

I have a CSV file which has two sets of data on the same sheet. I did my research and the closest I could find is what I have attached. The issue I am having is that the two are not one table; they are separate data sets, separated by a number of rows. I want to save each of the data sets as a separate CSV. Is this possible in Python? Please provide your kind assistance.
Python CSV module: How can I account for multiple tables within the same file?
First Set:
Presented_By: Source: City:
Chris Realtor Knoxville
John Engineer Lantana
Wade Doctor Birmingham
Second Set:
DriveBy 15
BillBoard 45
Social Media 85
My source is an Excel file which I convert into a CSV file.
import pandas as pd

# raw string so the backslashes in the Windows path are not treated as escapes
data_xls = pd.read_excel(r'T:\DataDump\Matthews\REPORT 11.13.16.xlsm', 'InfoCenterTracker', index_col=None)
data_xls.to_csv('your_csv.csv', encoding='utf-8')
second_set = pd.read_csv('your_csv.csv', skiprows=range(10, 24))
Use skiprows in pandas' read_csv
$ cat d.dat
Presented_By: Source: City:
Chris Realtor Knoxville
John Engineer Lantana
Wade Doctor Birmingham
DriveBy 15
BillBoard 45
Social Media 85
In [1]: import pandas as pd
In [2]: pd.read_csv('d.dat',skiprows=[0,1,2,3])
Out[2]:
DriveBy 15
0 BillBoard 45
1 Social Media 85
In [3]: pd.read_csv('d.dat',skiprows=[4,5,6])
Out[3]:
Presented_By: Source: City:
0 Chris Realtor Knoxv...
1 John Engineer Lantana
2 Wade Doctor Birmi...
You can detect which rows to skip by searching for the point where the csv line has 2 whitespace-separated entries instead of 3:
In [25]: for n, line in enumerate(open('d.dat','r').readlines()):
...: if len(line.split()) !=3:
...: breakpoint = n
...:
In [26]: pd.read_csv('d.dat',skiprows=range(breakpoint-1))
Out[26]:
DriveBy 15
0 BillBoard 45
1 Social Media 85
In [27]: pd.read_csv('d.dat',skiprows=range(breakpoint-1, n+1))
Out[27]:
Presented_By: Source: City:
0 Chris Realtor Knoxv...
1 John Engineer Lantana
2 Wade Doctor Birmi...
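Two caveats on that detection loop: it shadows the breakpoint() builtin (Python 3.7+), and because it keeps overwriting rather than breaking, it ends on the last mismatching line rather than the first; with this file it only lands on the right boundary because "Social Media 85" happens to split into three tokens. A sketch that breaks at the first mismatching line, derives both skiprows ranges from it, and saves each set as the asker wanted:

import pandas as pd

with open('d.dat') as f:
    lines = f.readlines()

# first line whose token count differs from the 3-column header
split_at = next(i for i, line in enumerate(lines) if len(line.split()) != 3)

first = pd.read_csv('d.dat', skiprows=range(split_at, len(lines)))  # top table
second = pd.read_csv('d.dat', skiprows=range(split_at))             # bottom table

first.to_csv('first_set.csv', index=False)
second.to_csv('second_set.csv', index=False)

A blank separator line between the two sets, if the file has one, is an even more reliable boundary marker than token counts.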
