Python read specific value from text file and total sum - python

I have this text file, Masterlist.txt, which looks something like this:
S1234567A|Jan Lee|Ms|05/10/1990|Software Architect|IT Department|98785432|PartTime|3500
S1234567B|Feb Tan|Mr|10/12/1991|Corporate Recruiter|HR Corporate Admin|98766432|PartTime|1500
S1234567C|Mark Lim|Mr|15/07/1992|Benefit Specialist|HR Corporate Admin|98265432|PartTime|2900
S1234567D|Apr Tan|Ms|20/01/1996|Payroll Administrator|HR Corporate Admin|91765432|FullTime|1600
S1234567E|May Ng|Ms|25/05/1994|Training Coordinator|HR Corporate Admin|98767432|Hourly|1200
S1234567Y|Lea Law|Ms|11/07/1994|Corporate Recruiter|HR Corporate Admin|94445432|PartTime|1600
I want to reduce the Salary (the number at the end of each line) by 50%, but only for lines that contain "PartTime" and a year after 1995, and then add the salaries up.
Currently I only know how to select the lines that contain "PartTime", and my code looks like this:
f = open("Masterlist.txt", "r")
for x in f:
    if "PartTime" in x:
        print(x)
How do I extract the Salary, reduce it by 50%, and add it up, but only where the year is after 1995?

Try using the pandas library.
From your question I suppose you want to reduce Salary by 50% when the year is before 1995 and the Type is PartTime, and otherwise increase it by 50%.
import pandas as pd
path = r'../Masterlist.txt' # path to your .txt file
df = pd.read_csv(path, sep='|', names = [0,1,2,'Date',4,5,6,'Type', 'Salary'], parse_dates=['Date'])
# Now column Date is treated as datetime object
print(df.head())
0 1 2 Date 4 \
0 S1234567A Jan Lee Ms 1990-05-10 Software Architect
1 S1234567B Feb Tan Mr 1991-10-12 Corporate Recruiter
2 S1234567C Mark Lim Mr 1992-07-15 Benefit Specialist
3 S1234567D Apr Tan Ms 1996-01-20 Payroll Administrator
4 S1234567E May Ng Ms 1994-05-25 Training Coordinator
5 6 Type Salary
0 IT Department 98785432 PartTime 3500
1 HR Corporate Admin 98766432 PartTime 1500
2 HR Corporate Admin 98265432 PartTime 2900
3 HR Corporate Admin 91765432 FullTime 1600
4 HR Corporate Admin 98767432 Hourly 1200
df.Salary = df.apply(lambda row: row.Salary * 0.5
                     if row['Date'].year < 1995 and row['Type'] == 'PartTime'
                     else row.Salary + row.Salary * 0.5, axis=1)
print(df.Salary.head())
0 1750.0
1 750.0
2 1450.0
3 2400.0
4 1800.0
Name: Salary, dtype: float64
Adjust the if/else expression inside the apply function if you want something different.
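If you prefer the plain-file approach from your question, here is a minimal standard-library sketch that follows the question as literally asked: halve and sum only the salaries of "PartTime" rows whose birth year is after 1995. The field positions are assumptions read off the sample lines above.
total = 0.0
with open("Masterlist.txt") as f:
    for line in f:
        fields = line.strip().split("|")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        birth_year = int(fields[3].split("/")[-1])  # DD/MM/YYYY -> year
        if fields[7] == "PartTime" and birth_year > 1995:
            total += float(fields[8]) * 0.5  # salary is the last field
print(total)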

Related

Getting maximum counts of a column in grouped dataframe

My dataframe df is:
Election Year Votes Party Region
0 2000 50 A a
1 2000 100 B a
2 2000 26 A b
3 2000 180 B b
4 2000 300 A c
5 2000 46 C c
6 2005 149 A a
7 2005 46 B a
8 2005 312 A b
9 2005 23 B b
10 2005 16 A c
11 2005 35 C c
I want to get the Party winning the most regions each year. So the desired output is:
Election Year Party
2000 B
2005 A
I tried this code to get the above output, but it is giving an error:
winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
winner = winner.groupby('Election Year').first().reset_index()
winner = winner[['Election Year', 'Party']].to_string(index=False)
winner
How can I get the desired output?
Here is one approach with a nested groupby. We first count per-party votes in each year-region pair, then use mode to find the party winning the most regions. The mode need not be unique (if two or more parties win the same number of regions).
df.groupby(["Year", "Region"])\
.apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
.unstack().mode(1).rename(columns={0: "Party"})
Party
Year
2000 B
2005 A
To address the comment, you can replace idxmax above with nlargest and diff to find regions where the winning margin is below a given number.
margin = df.groupby(["Election Year", "Region"])\
           .apply(lambda gp: gp.groupby("Party").Votes.sum().nlargest(2).diff()) > -125
print(margin[margin].reset_index()[["Election Year", "Region"]])
#   Election Year Region
# 0          2000      a
# 1          2005      a
# 2          2005      c
You can use GroupBy.idxmax() to get the index of max Votes for each group of Election Year, then use .loc to locate the rows, followed by selection of the required columns, as follows:
df.loc[df.groupby('Election Year')['Votes'].idxmax()][['Election Year', 'Party']]
Result:
Election Year Party
4 2000 A
8 2005 A
Edit
If we want to get the Party winning the most Regions, we can use the following code (without using the slow .apply() with a lambda function):
(df.loc[
    df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]
    [['Election Year', 'Party', 'Region']]
    .pivot(index='Election Year', columns='Region')
    .mode(axis=1)
).rename({0: 'Party'}, axis=1).reset_index()
Result:
Election Year Party
0 2000 B
1 2005 A
Try this
winner = df.groupby(['Election Year','Party'])['Votes'].max().reset_index()
winner.drop('Votes', axis = 1, inplace = True)
winner
Another method (close to @hilberts_drinking_problem's, in fact):
>>> df.groupby(["Election Year", "Region"]) \
...     .apply(lambda x: x.loc[x["Votes"].idxmax(), "Party"]) \
...     .unstack().mode(axis="columns") \
...     .rename(columns={0: "Party"}).reset_index()
Election Year Party
0 2000 B
1 2005 A
I believe the one-liner df.groupby(["Election Year"]).max().reset_index()[['Election Year', 'Party']] solves your problem.

Conditional copy of values from one column to another columns

I have a pandas dataframe that looks something like this:
name job jobchange_rank date
Thisguy Developer 1 2012
Thisguy Analyst 2 2014
Thisguy Data Scientist 3 2015
Anotherguy Developer 1 2018
The jobchange_rank represents each individual's (based on name) ranked change in position, where rank 1 represents his/her first position, rank 2 his/her second position, etc.
Now for the fun part. I want to create a new column where I can see a person's previous job, something like this:
name job jobchange_rank date previous_job
Thisguy Developer 1 2012 None
Thisguy Analyst 2 2014 Developer
Thisguy Data Scientist 3 2015 Analyst
Anotherguy Developer 1 2018 None
I've created the following code to get the "None" values where there was no job change:
df.loc[df['jobchange_rank'].sub(df['jobchange_rank'].min()) == 0, 'previous_job'] = 'None'
Sadly, I can't seem to figure out how to get the values from the other column where the needed condition applies.
Any help is more than welcome!
Thanks in advance.
This answer assumes that your DataFrame is sorted by name and jobchange_rank, if that is not the case, sort first.
# df = df.sort_values(['name', 'jobchange_rank'])
m = df['name'].eq(df['name'].shift())
df['job'].shift().where(m)
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Or use groupby + shift (assuming the data is at least sorted by jobchange_rank):
df.groupby('name')['job'].shift()
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Although the groupby + shift is more concise, on larger inputs that are already sorted like your example it may be faster to avoid the groupby and use the first solution.
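To actually attach the result as the previous_job column, matching the literal 'None' strings from the question, a small sketch based on the groupby approach:
# shift within each name group, then replace NaN with the string 'None'
df['previous_job'] = df.groupby('name')['job'].shift().fillna('None')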

Creating user-input filters on a CSV file that contains large data

I have a program that opens and reads a CSV file containing a large amount of data such as:
State Crime type Occurrences Year
CALIFORNIA ROBBERY 12 1999
CALIFORNIA ASSAULT 45 2003
NEW YORK ARSON 9 1999
CALIFORNIA ARSON 21 2000
TEXAS THEFT 30 2000
OREGON ASSAULT 10 2001
I need to create 3 filters by user input. For example:
Enter State:
Enter Crime Type:
Enter Year:
If I enter:
Enter State: CALIFORNIA
Enter Crime: ASSAULT
Enter Year: 2003
Crime Report
State Crime type Occurrences Year
CALIFORNIA ASSAULT 45 2003
This needs to happen.
I have no clue how to tackle this problem. I was only able to open and read the CSV data file into a table in Python that just prints out every line. However, I need to incorporate a search filter to narrow the results, as shown above. Is anyone familiar with this? Thank you all for your help.
The pandas library in Python allows you to view and manipulate CSV data. The following solution imports pandas, reads the CSV into a DataFrame with read_csv(), and then asks for the input values, keeping in mind that State and Crime should be strings (cast with str) while Year should be an integer (cast with int). It then applies a simple query to filter the required results from the DataFrame. The query is built so that all three conditions must be met and the input strings may be lowercase too.
In [125]: import pandas as pd
In [126]: df = pd.read_csv('test.csv')
In [127]: df
Out[127]:
State Crime type Occurrences Year
0 CALIFORNIA ROBBERY 12 1999
1 CALIFORNIA ASSAULT 45 2003
2 NEW YORK ARSON 9 1999
In [128]: state = str(input("Enter State: "))
Enter State: California
In [129]: crime_type = str(input("Enter Crime Type: "))
Enter Crime Type: robbery
In [130]: year = int(input("Enter Year: "))
Enter Year: 1999
In [131]: df.loc[lambda x:(x['State'].str.lower().str.contains(state.lower()))
...: & (x['Crime type'].str.lower().str.contains(crime_type.lower())) & (x
...: ['Year'] == year)]
Out[131]:
State Crime type Occurrences Year
0 CALIFORNIA ROBBERY 12 1999
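If you would rather stay with the csv module you are already using, below is a minimal sketch of the same three filters. It assumes a comma-separated file named test.csv with a header row containing the columns State, Crime type, Occurrences, and Year.
import csv

state = input("Enter State: ").strip().lower()
crime_type = input("Enter Crime Type: ").strip().lower()
year = input("Enter Year: ").strip()

print("Crime Report")
with open("test.csv", newline="") as f:
    for row in csv.DictReader(f):
        # compare case-insensitively, like the pandas query above
        if (row["State"].lower() == state
                and row["Crime type"].lower() == crime_type
                and row["Year"] == year):
            print(row["State"], row["Crime type"], row["Occurrences"], row["Year"])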

Add new column to dataframe based on an average

I have a dataframe that includes the category of a project, the currency, the number of investors, the goal, etc., and I want to create a new column which will be the "average success rate of their category":
state category main_category currency backers country \
0 0 Poetry Publishing GBP 0 GB
1 0 Narrative Film Film & Video USD 15 US
2 0 Narrative Film Film & Video USD 3 US
3 0 Music Music USD 1 US
4 1 Restaurants Food USD 224 US
usd_goal_real duration year hour
0 1533.95 59 2015 morning
1 30000.00 60 2017 morning
2 45000.00 45 2013 morning
3 5000.00 30 2012 morning
4 50000.00 35 2016 afternoon
I have the average success rates in series format:
Dance 65.435209
Theater 63.796134
Comics 59.141527
Music 52.660558
Art 44.889045
Games 43.890467
Film & Video 41.790649
Design 41.594386
Publishing 34.701650
Photography 34.110847
Fashion 28.283186
Technology 23.785582
And now I want to add a new column, where each row will have the success rate matching its category, i.e. wherever the row's main category is Technology, the new column will contain 23.78 for that row.
df['category_success_rate'] = <the % success that matches the category in the "main_category" column>
I think you need GroupBy.transform with a Boolean mask, df['state'].eq(1) or (df['state'] == 1):
df['category_success_rate'] = (df['state'].eq(1)
                               .groupby(df['main_category'])
                               .transform('mean') * 100)
Alternative:
df['category_success_rate'] = ((df['state'] == 1)
                               .groupby(df['main_category'])
                               .transform('mean') * 100)
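Or, since you already have the average success rates as a Series indexed by category (call it success_rates; the name is assumed here), you can simply map it onto main_category:
# success_rates is the per-category percentage Series shown in the question
df['category_success_rate'] = df['main_category'].map(success_rates)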

How to use pandas to pull out the counties with the largest amount of water used in a given year?

I am new to python and pandas and I am struggling to figure out how to pull out the 10 counties with the most water used for irrigation in 2014.
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('info.csv')  # reads csv
data['Year'] = pd.to_datetime(data['Year'], format='%Y')  # converts string to datetime
data.index = data['Year']  # makes year the index
del data['Year']  # delete the duplicate year column
This is what the data looks like (this is only part of the data):
County WUCode RegNo Year SourceCode SourceID Annual CountyName
1 IR 311 2014 WELL 1 946 Adams
1 IN 311 2014 INTAKE 1 268056 Adams
1 IN 312 2014 WELL 1 48 Adams
1 IN 312 2014 WELL 2 96 Adams
1 IR 312 2014 INTAKE 1 337968 Adams
3 IR 315 2014 WELL 5 81900 Putnam
3 PS 315 2014 WELL 6 104400 Putnam
I have a couple of questions:
I am not sure how to pull out only the "IR" rows in the WUCode column with pandas, and I am not sure how to print out a table with the 10 counties with the highest water usage for irrigation in 2014.
I have been able to use the .loc function to pull out the information I need, with something like this:
data.loc['2014', ['CountyName', 'Annual', 'WUCode']]
From here I am kind of lost. Help would be appreciated!
import numpy as np
import pandas as pd
import string

df = pd.DataFrame(data={"Annual": np.random.randint(20, 1000000, 1000),
                        "Year": np.random.randint(2012, 2016, 1000),
                        "CountyName": np.random.choice(list(string.ascii_letters), 1000)},
                  columns=["Annual", "Year", "CountyName"])
Say df looks like:
Annual Year CountyName
0 518966 2012 s
1 44511 2013 E
2 332010 2012 e
3 382168 2013 c
4 202816 2013 y
For the year 2014...
df[df['Year'] == 2014]
Group by CountyName...
df[df['Year'] == 2014].groupby("CountyName")
Look at Annual...
df[df['Year'] == 2014].groupby("CountyName")["Annual"]
Get the sum...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum()
Sort the result descending...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False)
Take the top 10...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False).head(10)
This example prints out (your actual result may vary since my data was random):
CountyName
Q 5191814
y 4335358
r 4315072
f 3985170
A 3685844
a 3583360
S 3301817
I 3231621
t 3228578
u 3164965
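Note that the random example above has no WUCode column; with your real data you would also keep only the irrigation rows first. A small sketch, assuming Year is still a plain integer column rather than the datetime index your code created:
irrigation = df[(df["WUCode"] == "IR") & (df["Year"] == 2014)]
top10 = (irrigation.groupby("CountyName")["Annual"]
                   .sum()
                   .sort_values(ascending=False)
                   .head(10))
print(top10)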
This may work for you:
res = df[df['WUCode'] == 'IR'].groupby(['Year', 'CountyName'])['Annual'].sum()\
                              .reset_index()\
                              .sort_values('Annual', ascending=False)\
                              .head(10)
#    Year CountyName  Annual
# 0  2014      Adams  338914
# 1  2014     Putnam   81900
Explanation
Filter by WUCode, as required, and group by Year and CountyName.
Use reset_index so your result is a DataFrame rather than a Series.
Use sort_values and extract the top 10 via pd.DataFrame.head.
