Extracting Data from an Unstructured String in Python - python

Hi everyone. I'm working with a Job Board dataset and I want to turn the "Salary Offered" column into something I can use for calculations, comparisons and predictions. There are five different cases for the data in the column:
- Yearly salaries within a range (YSWR), e.g. £15,000 - £17,000 per annum
- Hourly salaries within a range (HSWR), e.g. £22.00 - £26.00 per hour
- Yearly salaries with specific values (YSWSV), e.g. £18,323 per annum
- Hourly salaries with specific values (HSWSV), e.g. £26.00 per hour
- Salary not specified / Salary negotiable / Competitive salary
I need to preprocess this field into:
- One column that indicates whether the salary is yearly or hourly
- Two columns indicating the minimum/maximum salary (0 for unspecified values, and equal values for the cases that are not a range)
Any idea where to start? I am working with Python and pandas, and I am a beginner when it comes to data preprocessing.
Thanks in advance.
Felix

You can use a regular expression to extract the values and then implement the logic with if-else:
import re
import pandas as pd

df = pd.DataFrame([['£16,000 per annum'],
                   ['£25.0 per annum'],
                   ['£19,000 per annum'],
                   ['£26.0 per annum'],
                   ['Salary not specified'],
                   ['Competitive salary']], columns=['salary_offered'])

def apply_conditions(s):
    # extract the numeric part after the '£' sign
    v = re.findall(r'^£(\d+,?\d+\.?\d+)', s)
    if len(v) == 0:  # salary not specified
        return [0, 0, 'not specified']  # [min, max, salary]
    else:
        v = v[0]
        # drop the ',' so that we can parse the number
        v = v.replace(',', '')
        v = float(v)
        if 15000 < v < 17000:
            return [15000, 17000, v]
        elif 22 < v < 26:
            return [22, 26, v]
        elif v == 18323:
            return [18323, 18323, v]
        elif v == 26:
            return [26, 26, v]
        else:
            return [0, 0, 'not in range']

df['salary_offered'] = df['salary_offered'].apply(apply_conditions)
df = pd.DataFrame(df['salary_offered'].to_list(), columns=['minimum', 'maximum', 'value'])
This goes from:
         salary_offered
0     £16,000 per annum
1       £25.0 per annum
2     £19,000 per annum
3       £26.0 per annum
4  Salary not specified
5    Competitive salary
to:
   minimum  maximum          value
0    15000    17000          16000
1       22       26             25
2        0        0   not in range
3       26       26             26
4        0        0  not specified
5        0        0  not specified
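A more general sketch, in case you want to handle the ranges and the yearly/hourly flag from the question directly rather than hard-coding the bands; the regex and the helper name parse_salary are my own assumptions, not part of the original answer:
import re
import pandas as pd

def parse_salary(s):
    """Return [period, minimum, maximum] for one raw salary string (a sketch)."""
    period = 'yearly' if 'annum' in s else 'hourly' if 'hour' in s else 'unspecified'
    nums = [float(n.replace(',', '')) for n in re.findall(r'£([\d,]+(?:\.\d+)?)', s)]
    if not nums:                    # "Salary not specified", "Competitive salary", ...
        return [period, 0.0, 0.0]
    if len(nums) == 1:              # single value: min == max
        return [period, nums[0], nums[0]]
    return [period, min(nums), max(nums)]

df = pd.DataFrame({'salary_offered': ['£15,000 - £17,000 per annum',
                                      '£22.00 - £26.00 per hour',
                                      '£18,323 per annum',
                                      '£26.00 per hour',
                                      'Competitive salary']})
df[['period', 'min_salary', 'max_salary']] = df['salary_offered'].apply(
    lambda s: pd.Series(parse_salary(s)))
print(df)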

Related

How can I search for and extract a particular value from a dataframe in python?

I have a dataframe called "oat" - here is a piece of it:
Name Age Year T Neigh One Neigh Two
0 Carl P 31 1998 0.1 5454 657
1 Tyler A 26 2012 3.9 578 98
2 Antoine G 20 1997 1.7 17 9878
3 Travis A 23 2008 3.2 199 398
4 Geoff H 22 1980 -0.3 901 7650
5 David C 28 2014 4.5 8001 72
6 Antoine G 21 1998 2.3 5678 9800
7 Tyler A 25 2011 3.1 2245 450
I'm trying to run a for loop through each row. The values in the "Neigh One" column refer to the index of another row; from that row, based on particular variables, I want to reach yet another row and extract a value from it.
Here's what I've tried:
for index, row in oat.iterrows():
    indice = row['Neigh One']
    name = oat.iloc[indice]["Name"]
    age = oat.iloc[indice]["Age"]
    age_plus_one = age + 1
    new = oat.loc[(oat.Name == name) & (oat.Age == age_plus_one), 'T'].tolist()[0]
    print(new)
I am getting an error message on the last assignment, to "new". Basically, I am looping through each row and, based on the "Neigh One" value, going to that index and extracting the name and age, then adding 1 to the age. From there, I am looking for the row with that same name but with one added to the age.
Note: either zero rows or exactly one row will match this; it is impossible to have more than one match.
All I want to do is, for each iteration of the loop, return the value of 'T' that comes back from my boolean filter.
I have also tried the following for the final variable, with the error message each returns:
new= oat[(oat['Name'] == name) & (oat['Age'] == age_plus_one)].T.item()
ValueError: can only convert an array of size 1 to a Python scalar
new = oat[(oat['Name'] == name) & (oat['Age'] == age_plus_one),'T'].values[0]
Not an error, but it returns a True or False boolean list for the entire dataframe rather than the actual values.
new = oat.loc[(oat.Name == name) & (oat['Age'] == age_plus_one),'T'].values[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
new = oat.loc[(oat.name == name) & (oat.Age == age_plus_one),'T'].tolist()[0]
IndexError: list index out of range
for index, row in oat.iterrows():
    indice = row['Neigh One']
    name = oat.iloc[indice]["Name"]
    age = oat.iloc[indice]["Age"]
    age_plus_one = age + 1
    # --------below is revised---------
    mask = (oat.Name == name) & (oat.Age == age_plus_one)
    if sum(mask) == 0:
        new = None
    else:
        new = oat.loc[mask, 'T'].tolist()[0]
    print(new)
As you mentioned, there might be no match for (oat.Name == name) & (oat.Age == age_plus_one), so an if-else handles that case.
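If you want to avoid the Python-level loop, the same lookup can also be expressed as a self-merge. This is only a sketch built on the sample frame from the question; the in-range "Neigh One" indices and the helper column names (neigh_name, neigh_age) are made up for illustration:
import pandas as pd

# sample frame as in the question (only the relevant columns);
# the "Neigh One" values here are hypothetical in-range indices
oat = pd.DataFrame({
    'Name': ['Carl P', 'Tyler A', 'Antoine G', 'Travis A',
             'Geoff H', 'David C', 'Antoine G', 'Tyler A'],
    'Age':  [31, 26, 20, 23, 22, 28, 21, 25],
    'T':    [0.1, 3.9, 1.7, 3.2, -0.3, 4.5, 2.3, 3.1],
    'Neigh One': [5, 7, 6, 1, 0, 3, 2, 4],
})

# look up name/age at each neighbour index, then search for that name with age + 1
neigh = oat.loc[oat['Neigh One'], ['Name', 'Age']].reset_index(drop=True)
neigh.columns = ['neigh_name', 'neigh_age']
neigh['neigh_age'] += 1

# left-merge back onto the original frame; rows with no match get NaN instead of an error
result = neigh.merge(oat[['Name', 'Age', 'T']],
                     left_on=['neigh_name', 'neigh_age'],
                     right_on=['Name', 'Age'],
                     how='left')
print(result['T'])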

groupby: trying to group by country and list top 10 varieties per country along with avg price and avg points

I am trying to generate a dataframe which is grouped by country and lists top 10 varieties of wine in each country along with their average price and points.
I have successfully grouped by country and wine and generated average values of price and points.
I can generate the top 10 varieties in each country using value_counts().nlargest(10), but I can't get rid of the remaining varieties in the initial groupby with the averages.
countryGroup = df.groupby(['country', 'variety'])['price','points'].mean().round(2).rename(columns = {'price':'AvgPrice','points':'AvgPoints'})
countryVariety = df.groupby('country')['variety']
countryVariety = countryVariety.apply(lambda x:x.value_counts().nlargest(10))
data link
The actual result is a list of the top 10 varieties in each country, but what I need along with this is the average price and points.
Here's some sample data. For these problems, where a large quantity of data is required, it's useful to generate random test data, which can be done in a few lines:
import pandas as pd
import numpy as np
import string

np.random.seed(123)
n = 1000
df = pd.DataFrame({'country': np.random.choice(list('AB'), n),
                   'variety': np.random.choice(list(string.ascii_lowercase), n),
                   'price': np.random.normal(100, 10, n),
                   'points': np.random.choice(100, n)})
One way to solve this is to groupby twice. The first allows us to calculate the quantities for each country-variety group. The second keeps the top 10 per country (based on size) with .sort_values + tail
df_agg = (df.groupby(['country', 'variety'])
            .agg({'variety': 'size', 'price': 'mean', 'points': 'mean'})
            .rename(columns={'variety': 'size'}))
df_agg = df_agg.sort_values('size').groupby(level=0).tail(10).sort_index()
Output:
size price points
country variety
A c 19 98.606563 45.842105
e 19 102.264391 48.894737
l 23 96.469739 52.913043
n 27 99.532544 55.740741
p 20 98.298753 49.700000
q 21 98.660938 60.666667
u 26 101.330755 63.615385
x 20 102.540790 48.550000
y 23 99.553557 49.869565
z 27 99.968973 44.259259
B b 25 99.375984 56.360000
c 22 100.632402 56.181818
e 25 99.476491 49.520000
k 22 96.991041 40.090909
p 24 99.802004 51.333333
q 26 99.022372 53.884615
u 22 103.063360 49.090909
v 24 101.907610 53.250000
x 22 94.607472 49.227273
z 23 98.984382 44.739130
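As a minimal alternative sketch, assuming the df_agg frame built above, the same top-10-per-country selection can be done with nlargest inside a group-wise apply:
top10 = (df_agg.groupby(level='country', group_keys=False)
               .apply(lambda g: g.nlargest(10, 'size')))
This keeps the same rows as the sort_values/tail approach, just ordered by size within each country.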

Pandas.unique() Returning an Array with a None Value

I've created a new column in a DataFrame that contains the categorical feature 'QD', which describes in which "decile" (the lowest 10%, 20%, 30%, ... of values) the value of another feature of the DataFrame falls. You can see the DF head below:
EPS CPI POC Vendeu Delta QD
1 20692 1 19185.30336 0 -1506.69664 QD07
8 20933 1 20433.27115 0 -499.72885 QD08
10 20393 1 20808.04948 0 415.04948 QD10
18 20503 1 19153.45978 0 -1349.54022 QD07
19 20587 1 20175.31906 1 -411.68094 QD09
Data Frame Head
The 'QD' column was created through the function below:
minimo = DF['EPS'].min()
passo = (DF['EPS'].max() - DF['EPS'].min())/10
def get_q(value):
    for i in range(1, 11):
        if value < (minimo + (i*passo)):
            return str('QD' + str(i).zfill(2))
Function applied on 'Delta'
Analyzing this column, I noticed something strange:
AUX2['QD'].unique()
out:
array(['QD07', 'QD08', 'QD10', 'QD09', 'QD06', 'QD05', 'QD04', 'QD03',
'QD02', 'QD01', None], dtype=object)
'QD' unique values
The .unique() method returns an array with a None value in it. At first I thought there was something wrong with the function, but then I tried to grab the position of the None value, and look:
AUX2['QD'].value_counts()
out:
QD05 852
QD04 848
QD06 685
QD03 578
QD07 540
QD08 377
QD02 318
QD09 209
QD10 68
QD01 61
Name: QD, dtype: int64
.value_counts()
len(AUX2[AUX2['QD'] == None]['QD'])
out:
0
len()
What am I missing here?
When you are using .value_counts(), add dropna=False so the missing values are counted. A comparison like AUX2['QD'] == None will not find them; filter with .isnull() instead:
df[df['name column'].isnull()]
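A minimal sketch of both suggestions, using a small stand-in Series since the original AUX2 frame is not reproduced here:
import pandas as pd

# stand-in for AUX2['QD']; the real column comes from the questioner's DataFrame
qd = pd.Series(['QD07', 'QD08', None, 'QD10', 'QD07'], name='QD')

print(qd.value_counts(dropna=False))  # the missing bucket now shows up in the counts
print(qd[qd.isnull()])                # rows where the value is missing
print((qd == None).sum())             # 0: element-wise comparison with None matches nothing,
                                      # which is why the original filter returned an empty result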

Analyze data using python

I have a csv file in the following format:
30 1964 1 1
30 1962 3 1
30 1965 0 1
31 1959 2 1
31 1965 4 1
33 1958 10 1
33 1960 0 1
34 1959 0 2
34 1966 9 2
34 1958 30 1
34 1960 1 1
34 1961 10 1
34 1967 7 1
34 1960 0 1
35 1964 13 1
35 1963 0 1
The first column denotes the age and the last column denotes the survival status (1 if the patient survived 5 years or longer; 2 if the patient died within 5 years).
I have to calculate which age has the highest survival rate. I am new to Python and I cannot figure out how to proceed. I was able to calculate the most repeated age using the mode function, but I cannot figure out how to check one column and print the corresponding other column. Please help.
I was able to find an answer where I had to analyze just the first row.
import csv
import matplotlib.pyplot as plt
import numpy as np
from statistics import mode

df = open('Dataset.csv')
csv_df = csv.reader(df)
a = []
b = []
for row in csv_df:
    a.append(row[0])
    b.append(row[3])
print('The age that has maximum reported incidents of cancer is ' + mode(a))
I am not entirely sure whether I understood your logic for determining the age with the maximum survival rate. Assuming that the age with the highest number of 1s has the highest survival rate, I wrote the following code.
I did the reading part a little differently, as the dataset acted weird when I used csv. If the csv module works fine in your environment, use it. The idea is to retrieve each value in each row; we are interested in the 0th and 3rd columns.
In the following code, we maintain a dictionary, survival_map, and count the frequency of a particular age being associated with a 1.
import operator

survival_map = {}
with open('Dataset.csv', 'r') as in_f:
    for row in in_f:
        row = row.rstrip()  # remove the end-of-line character
        items = row.split(',')  # I converted the tab spaces to commas, had a problem otherwise
        age = int(items[0])
        survival_rate = int(items[3])
        if survival_rate == 1:
            if age in survival_map:
                survival_map[age] += 1
            else:
                survival_map[age] = 1
Once we build the dictionary, {33: 2, 34: 5, 35: 2, 30: 3, 31: 2}, it is sorted in reverse by the value:
sorted_survival_map = sorted(survival_map.items(), key=operator.itemgetter(1), reverse = True)
max_survival = sorted_survival_map[0]
UPDATE:
For a single max value, OP's suggestion (in a comment) is preferred. Posting it here:
maximum = max(dict, key=dict.get)
print(maximum, dict[maximum])
For multiple max values:
max_keys = []
max_value = 0
for k, v in survival_map.items():
    if v > max_value:
        max_keys = [k]
        max_value = v
    elif v == max_value:
        max_keys.append(k)
print([(x, max_value) for x in max_keys])
Of course, this could be achieved with a comprehension; however, for readability, I am proposing this. Also, it makes a single pass through the dictionary rather than going through it multiple times, so the solution has O(n) time complexity.
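For comparison, here is a minimal pandas sketch of the same counting, under the same assumption (the age with the most status-1 rows). The column names are made up here, since the CSV in the question has no header row:
import pandas as pd

# hypothetical column names for the headerless CSV
cols = ['age', 'year', 'nodes', 'status']
data = pd.read_csv('Dataset.csv', header=None, names=cols)

# count how many status-1 (survived) rows each age has, then take the largest
survived = data[data['status'] == 1].groupby('age').size()
print(survived.idxmax(), survived.max())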

Group rows in a CSV by blocks of 25

I have a csv file with 2 columns, representing a distribution of items per year which looks like this:
A B
1900 10
1901 2
1903 5
1908 8
1910 25
1925 3
1926 4
1928 1
1950 10
etc, about 15000 lines.
When making a distribution diagram from this data, there are too many points on the axis, which is not very pretty. I want to group the rows into blocks of 25 years, so that in the end I have fewer points on the axis.
So, for example, from 1900 to 1925 I would have the sum of produced items, one row in column A and one row in column B:
1925 53
1950 15
So far I only figured how to convert the data in csv file to int:
import csv

o = open('/dates_dist.csv', 'rU')
mydata = csv.reader(o)
def int_wrapper(mydata):
    for v in reader:
        yield map(int, v)
reader = int_wrapper(mydata)
Can't find how to do it further...
You could use itertools.groupby:
import itertools as IT
import csv

def int_wrapper(mydata):
    for v in mydata:
        yield map(int, v)

with open('data', 'rU') as o:
    mydata = csv.reader(o)
    header = next(mydata)
    reader = int_wrapper(mydata)
    for key, group in IT.groupby(reader, lambda row: (row[0]-1)//25+1):
        year = key*25
        total = sum(row[1] for row in group)
        print(year, total)
yields
(1900, 10)
(1925, 43)
(1950, 15)
Note that 1900 to 1925 (inclusive) spans 26 years, not 25. So
if you want to group 25 years, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].
The expression row[0]//25 takes the year and integer divides by 25.
This number will be the same for all numbers in the range [1900, 1925).
To make the range half-open on the left, subtract and add 1: (row[0]-1)//25+1.
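A quick check of the key arithmetic for the boundary years (a small standalone snippet):
for year in (1900, 1901, 1925, 1926, 1950):
    key = (year - 1) // 25 + 1
    print(year, '-> group', key, '-> reported as', key * 25)
# 1900 -> group 76 -> reported as 1900
# 1901 -> group 77 -> reported as 1925
# 1925 -> group 77 -> reported as 1925
# 1926 -> group 78 -> reported as 1950
# 1950 -> group 78 -> reported as 1950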
Here is my approach. It's definitely not the most elegant Python code, but it could be a way to achieve the desired output.
if __name__ == '__main__':
    o = open('dates_dist.csv', 'rU')
    lines = o.read().split("\n")  # create a list holding each line of the file
    out_dict = {}
    curr_date = 0
    curr_count = 0
    chunk_sz = 25  # years
    if len(lines) > 0:
        line_split = lines[0].split(",")
        start_year = int(line_split[0])
        curr_count = 0
    # Iterate over each line of the file
    for line in lines:
        # Split at the comma to get the year and the count.
        # line_split[0] will be the year and line_split[1] will be the count.
        line_split = line.split(",")
        curr_year = int(line_split[0])
        time_delta = curr_year - start_year
        if time_delta < chunk_sz or time_delta == chunk_sz:
            curr_count = curr_count + int(line_split[1])
        else:
            out_dict[start_year + chunk_sz] = curr_count
            start_year = start_year + chunk_sz
            curr_count = int(line_split[1])
        # print(curr_year, curr_count)
    out_dict[start_year + chunk_sz] = curr_count
    print(out_dict)
You could create a dummy column and group by it after doing some integer division:
df['temp'] = df['A'] // 25
>>> df
A B temp
0 1900 10 76
1 1901 2 76
2 1903 5 76
3 1908 8 76
4 1910 25 76
5 1925 3 77
6 1926 4 77
7 1928 1 77
8 1950 10 78
>>> df.groupby('temp').sum()
A B
temp
76 9522 50
77 5779 8
78 1950 10
My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.
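If you want the blocks labelled by the upper bounds the question uses (1925, 1950, ...), a small tweak, sketched here against the same df, is to shift the integer division by one year and label each group with its upper bound:
blocks = ((df['A'] - 1) // 25 + 1) * 25  # 1901-1925 -> 1925, 1926-1950 -> 1950, ...
print(df.groupby(blocks)['B'].sum())
This groups 1900 on its own and 1901-1925 together, matching the totals from the itertools.groupby answer above (1900: 10, 1925: 43, 1950: 15).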
