Conversion of years and months to months with a string column input - python

dataset example:
experience
5 month
nan
1 months
8 month
17 months
8 year
11 years
1.7 year
3.1 years
15.7 months
18 year
2017.2 years
98.3 years
68 year
I have a column with each applicant's experience in years. It is very messy, so I went through it and created the sample above. The values are numbers followed by "month"/"months" or "year"/"years".
There are many NaN entries, and they should be ignored.
The goal is to create a column with the experience in months:
if nan:
    copy nan to the corresponding column
if the row contains month or months:
    copy the number to the corresponding column
if the row contains year or years and the number < 55:
    multiply the number by 12 and copy it to the corresponding column
else:
    copy nan to the corresponding column
How can I achieve this?
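The rules above can be sketched directly with str.extract and np.select (a minimal illustration with made-up sample values, assuming pandas and NumPy):

```python
import numpy as np
import pandas as pd

s = pd.Series(['5 month', np.nan, '1.7 year', '98.3 years'])
# Pull the number and the unit out of each string; 'month|year' also
# matches the plural forms as a prefix.
parts = s.str.extract(r'(?P<val>[\d.]+)\s*(?P<unit>month|year)')
val = parts['val'].astype(float)
months = np.select(
    [parts['unit'].eq('month'),               # months: copy the number as-is
     parts['unit'].eq('year') & val.lt(55)],  # years under 55: convert
    [val, val * 12],
    default=np.nan,                           # NaN entries and >= 55 years
)
```

np.select applies the first matching condition per row, so NaN rows and years of 55 or more fall through to the default.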

import pandas as pd
import numpy as np

my_dict = {'Experience': ['5 month', 'nan', '1 months', '8 month', '17 months', '8 year',
                          '11 years', '1.7 year', '3.1 years', '15.7 months', '18 year',
                          '2017.2 years', '98.3 years', '68 year']}
df = pd.DataFrame(my_dict)
# Create filter for month/months
month_filt = df['Experience'].str.contains('month')
# Filter DataFrame for rows that contain month/months.
# Note: str.strip removes the *set* of characters 'm','o','n','t','h','|','s'
# from both ends, not the literal substring; it works here because the
# numbers never start or end with those characters.
df['Months'] = df.loc[month_filt, 'Experience'].str.strip('month|months')
# Create filter for year/years
year_filt = df['Experience'].str.contains('year')
# Filter DataFrame for rows that contain year/years
df['Years'] = df.loc[year_filt, 'Experience'].str.strip('year|years')
# Convert Years to months for the rows that have no month value
df.loc[df['Months'].isna(), 'Months'] = df['Years'].astype('float') * 12
# Set years greater than 55 to NaN
df.loc[df['Years'].astype('float') > 55, 'Months'] = np.nan
Experience Months Years
0 5 month 5 NaN
1 nan NaN NaN
2 1 months 1 NaN
3 8 month 8 NaN
4 17 months 17 NaN
5 8 year 96 8
6 11 years 132 11
7 1.7 year 20.4 1.7
8 3.1 years 37.2 3.1
9 15.7 months 15.7 NaN
10 18 year 216 18
11 2017.2 years NaN 2017.2
12 98.3 years NaN 98.3
13 68 year NaN 68

Simple solution using regular expressions, keeping the intermediate workings for transparency.
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""experience
5 month
nan
1 months
8 month
17 months
8 year
11 years
1.7 year
3.1 years
15.7 months
18 year
2017.2 years
98.3 years
68 year"""))
df = df.assign(unit=lambda dfa: dfa["experience"].str.extract(r"([a-z]+)", expand=False),
               val=lambda dfa: dfa["experience"].str.extract(r"([0-9.]+)", expand=False).astype(float),
               months=lambda dfa: np.where(dfa["unit"].isin(["month", "months"]), dfa["val"],
                                           np.where(dfa["unit"].isin(["year", "years"])
                                                    & dfa["val"].lt(55), dfa["val"] * 12, np.nan)))
print(df.to_string(index=False))
output
experience unit val months
5 month month 5.0 5.0
NaN NaN NaN NaN
1 months months 1.0 1.0
8 month month 8.0 8.0
17 months months 17.0 17.0
8 year year 8.0 96.0
11 years years 11.0 132.0
1.7 year year 1.7 20.4
3.1 years years 3.1 37.2
15.7 months months 15.7 15.7
18 year year 18.0 216.0
2017.2 years years 2017.2 NaN
98.3 years years 98.3 NaN
68 year year 68.0 NaN

This assumes the formatting is consistent (value, space, time period). You can use split to get the two parts.
df = pd.DataFrame({'experience': ['5 month', np.nan, '1 months', '8 month', '17 months', '8 year', '11 years']})

def get_values(x):
    if pd.notnull(x):
        val = float(x.split(' ')[0])  # float rather than int, so values like '1.7' also work
        prd = x.split(' ')[1]
        if prd in ['month', 'months']:
            return val
        elif prd in ['year', 'years'] and val < 55:
            return val * 12
    else:
        return x

df['months'] = df.apply(lambda x: get_values(x.experience), axis=1)
Output:
experience months
0 5 month 5.0
1 NaN NaN
2 1 months 1.0
3 8 month 8.0
4 17 months 17.0
5 8 year 96.0
6 11 years 132.0
If a high percentage of the rows are NaN, you can filter them out before applying the function:
df.loc[df.experience.notnull(), 'months'] = df[df.experience.notnull()].apply(lambda x: get_values(x.experience), axis=1)

There's probably a nicer way to do this using pandas DataFrames, but is this what you are trying to achieve? You can probably reuse the regexes if nothing else. I've not added the condition for < 55 years, but I'm sure you can work that out.
import re

applicants = [
    {'name': 'Lisa', 'experience': 'nan'},
    {'name': 'Bill', 'experience': '3.1 months'},
    {'name': 'Mandy', 'experience': '1 month'},
    {'name': 'Geoff', 'experience': '6.7 years'},
    {'name': 'Patricia', 'experience': '1 year'},
    {'name': 'Kirsty', 'experience': '2017.2 years'},
]
print(applicants)
month_pattern = r'^([\d]+[.\d]*) month(s*)'
year_pattern = r'^([\d]+[.\d]*) year(s*)'
applicant_output = []
for applicant in applicants:
    if applicant['experience'] == 'nan':
        applicant_output.append(applicant)
    else:
        month = re.search(month_pattern, applicant['experience'])
        if month is not None:
            applicant_output.append({
                'name': applicant['name'],
                'exprience_months': month.group(1)
            })
        else:
            year = re.search(year_pattern, applicant['experience'])
            if year is not None:
                months = str(float(year.group(1)) * 12)
                applicant_output.append({
                    'name': applicant['name'],
                    'exprience_months': months
                })
print(applicant_output)
This gives the output:
[{'name': 'Lisa', 'experience': 'nan'}, {'name': 'Bill', 'experience': '3.1 months'}, {'name': 'Mandy', 'experience': '1 month'}, {'name': 'Geoff', 'experience': '6.7 years'}, {'name': 'Patricia', 'experience': '1 year'}, {'name': 'Kirsty', 'experience': '2017.2 years'}]
with the result:
[{'name': 'Lisa', 'experience': 'nan'}, {'name': 'Bill', 'exprience_months': '3.1'}, {'name': 'Mandy', 'exprience_months': '1'}, {'name': 'Geoff', 'exprience_months': '80.4'}, {'name': 'Patricia', 'exprience_months': '12.0'}, {'name': 'Kirsty', 'exprience_months': '24206.4'}]

Use a temp_df to separate out the month/year part:
import numpy as np
import pandas as pd

temp_df = df['experience'].str.split('([A-Za-z]+)', expand=True)
temp_df = temp_df.loc[:, ~(temp_df == "").any(axis=0)]  # drop the extra empty column produced by the split
temp_df[0] = temp_df[0].astype(float)
temp_df
Get the multiplier for the experience value:
multiplier = pd.Series([1] * len(temp_df), index=temp_df.index)
year_rows = temp_df[1].str.contains('year', case=False).fillna(False)  # rows whose unit is year/years
temp_df.loc[year_rows & (temp_df[0] >= 55), 0] = np.nan  # experience value becomes NaN where it is >= 55 years
multiplier[year_rows] = 12
df['experience_in_months'] = temp_df[0] * multiplier
df

Related

Pandas dataframe mapping one value to another in a row

I have two pandas dataframes: one with columns 'Name', 'Year' and 'Currency Rate', and another with a 'Name' column followed by a long list of year columns (1950, 1951, ..., 2019, 2020).
The 'Year' column stores values 2000, 2001, ..., 2015, and each year column (1950-2020) stores the income for that year.
I want to merge these two dataframes, mapping the income for the year given in the 'Year' column into a new column named 'Income' and dropping all the other year columns. Is there a convenient way to do this?
What I am thinking of is splitting the second dataframe into 16 different years and then joining each to the first dataframe.
UPDATED to reflect OP's comment clarifying the question:
To restate the question based on my updated understanding:
There is a dataframe with columns Name, Year and Currency Rate (3 columns in total). Each row contains a Currency Rate in a given Year for a given Name. Each row is unique by (Name, Year) pair.
There is a second dataframe with columns Name, 1950, 1951, ..., 2020 (72 columns in total). Each cell contains an Income value for the corresponding Name for the row and the Year corresponding to the column name. Each row is unique by Name.
Question: How do we add an Income column to the first dataframe with each row containing the Income value from the second dataframe in the cell with (row, column) corresponding to the (Name, Year) pair of such row in the first dataframe?
Test case assumptions I have made:
Name in the first dataframe is a letter from 'a' to 'n' with some duplicates.
Year in the first dataframe is between 2000 and 2015 (as in the question).
Currency Rate in the first dataframe is arbitrary.
Name in the second dataframe is a letter from 'a' to 'z' (no duplicates).
The values in the second dataframe (which represent Income) are arbitrarily constructed using the ASCII offsets of the characters in the corresponding Name concatenated with the Year of the corresponding column name. This way we can visually "decode" them in the test results to confirm that the value from the correct location in the second dataframe has been loaded into the new Income column in the first dataframe.
table1 = [
{'Name': 'a', 'Year': 2000, 'Currency Rate': 1.1},
{'Name': 'b', 'Year': 2001, 'Currency Rate': 1.2},
{'Name': 'c', 'Year': 2002, 'Currency Rate': 1.3},
{'Name': 'd', 'Year': 2003, 'Currency Rate': 1.4},
{'Name': 'e', 'Year': 2004, 'Currency Rate': 1.5},
{'Name': 'f', 'Year': 2005, 'Currency Rate': 1.6},
{'Name': 'g', 'Year': 2006, 'Currency Rate': 1.7},
{'Name': 'h', 'Year': 2007, 'Currency Rate': 1.8},
{'Name': 'i', 'Year': 2008, 'Currency Rate': 1.9},
{'Name': 'j', 'Year': 2009, 'Currency Rate': 1.8},
{'Name': 'k', 'Year': 2010, 'Currency Rate': 1.7},
{'Name': 'l', 'Year': 2011, 'Currency Rate': 1.6},
{'Name': 'm', 'Year': 2012, 'Currency Rate': 1.5},
{'Name': 'm', 'Year': 2013, 'Currency Rate': 1.4},
{'Name': 'n', 'Year': 2014, 'Currency Rate': 1.3},
{'Name': 'n', 'Year': 2015, 'Currency Rate': 1.2}
]
table2 = [{'Name': name} | {str(year): sum(ord(c) - ord('a') + 1 for c in name) * 10000 + year for year in range(1950, 2021)} for name in set(['x', 'y', 'z']) | set(map(lambda row: row['Name'], table1))]  # dict merge with | requires Python 3.9+
import pandas as pd
df1 = pd.DataFrame(table1)
df2 = pd.DataFrame(table2).sort_values(by='Name')
print(df1)
print(df2)
df1['Income'] = df1.apply(lambda x: df2.loc[df2['Name'] == x['Name'], str(x['Year'])].iloc[0], axis=1)
print(df1.to_string(index=False))
Output:
Name Year Currency Rate
0 a 2000 1.1
1 b 2001 1.2
2 c 2002 1.3
3 d 2003 1.4
4 e 2004 1.5
5 f 2005 1.6
6 g 2006 1.7
7 h 2007 1.8
8 i 2008 1.9
9 j 2009 1.8
10 k 2010 1.7
11 l 2011 1.6
12 m 2012 1.5
13 m 2013 1.4
14 n 2014 1.3
15 n 2015 1.2
Name 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 ... 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
11 a 11950 11951 11952 11953 11954 11955 11956 11957 11958 11959 11960 ... 12009 12010 12011 12012 12013 12014 12015 12016 12017 12018 12019 12020
15 b 21950 21951 21952 21953 21954 21955 21956 21957 21958 21959 21960 ... 22009 22010 22011 22012 22013 22014 22015 22016 22017 22018 22019 22020
0 c 31950 31951 31952 31953 31954 31955 31956 31957 31958 31959 31960 ... 32009 32010 32011 32012 32013 32014 32015 32016 32017 32018 32019 32020
10 d 41950 41951 41952 41953 41954 41955 41956 41957 41958 41959 41960 ... 42009 42010 42011 42012 42013 42014 42015 42016 42017 42018 42019 42020
9 e 51950 51951 51952 51953 51954 51955 51956 51957 51958 51959 51960 ... 52009 52010 52011 52012 52013 52014 52015 52016 52017 52018 52019 52020
1 f 61950 61951 61952 61953 61954 61955 61956 61957 61958 61959 61960 ... 62009 62010 62011 62012 62013 62014 62015 62016 62017 62018 62019 62020
4 g 71950 71951 71952 71953 71954 71955 71956 71957 71958 71959 71960 ... 72009 72010 72011 72012 72013 72014 72015 72016 72017 72018 72019 72020
3 h 81950 81951 81952 81953 81954 81955 81956 81957 81958 81959 81960 ... 82009 82010 82011 82012 82013 82014 82015 82016 82017 82018 82019 82020
2 i 91950 91951 91952 91953 91954 91955 91956 91957 91958 91959 91960 ... 92009 92010 92011 92012 92013 92014 92015 92016 92017 92018 92019 92020
13 j 101950 101951 101952 101953 101954 101955 101956 101957 101958 101959 101960 ... 102009 102010 102011 102012 102013 102014 102015 102016 102017 102018 102019 102020
14 k 111950 111951 111952 111953 111954 111955 111956 111957 111958 111959 111960 ... 112009 112010 112011 112012 112013 112014 112015 112016 112017 112018 112019 112020
12 l 121950 121951 121952 121953 121954 121955 121956 121957 121958 121959 121960 ... 122009 122010 122011 122012 122013 122014 122015 122016 122017 122018 122019 122020
5 m 131950 131951 131952 131953 131954 131955 131956 131957 131958 131959 131960 ... 132009 132010 132011 132012 132013 132014 132015 132016 132017 132018 132019 132020
7 n 141950 141951 141952 141953 141954 141955 141956 141957 141958 141959 141960 ... 142009 142010 142011 142012 142013 142014 142015 142016 142017 142018 142019 142020
8 x 241950 241951 241952 241953 241954 241955 241956 241957 241958 241959 241960 ... 242009 242010 242011 242012 242013 242014 242015 242016 242017 242018 242019 242020
6 y 251950 251951 251952 251953 251954 251955 251956 251957 251958 251959 251960 ... 252009 252010 252011 252012 252013 252014 252015 252016 252017 252018 252019 252020
16 z 261950 261951 261952 261953 261954 261955 261956 261957 261958 261959 261960 ... 262009 262010 262011 262012 262013 262014 262015 262016 262017 262018 262019 262020
[17 rows x 72 columns]
Name Year Currency Rate Income
a 2000 1.1 12000
b 2001 1.2 22001
c 2002 1.3 32002
d 2003 1.4 42003
e 2004 1.5 52004
f 2005 1.6 62005
g 2006 1.7 72006
h 2007 1.8 82007
i 2008 1.9 92008
j 2009 1.8 102009
k 2010 1.7 112010
l 2011 1.6 122011
m 2012 1.5 132012
m 2013 1.4 132013
n 2014 1.3 142014
n 2015 1.2 142015
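For larger frames, a vectorized alternative to the per-row apply above is to melt the wide income table into long form and merge on (Name, Year). A sketch with small made-up frames, not the test data from the answer:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['a', 'b', 'b'],
                    'Year': [2000, 2001, 2002],
                    'Currency Rate': [1.1, 1.2, 1.3]})
# Wide income table: one row per Name, one column per year.
df2 = pd.DataFrame({'Name': ['a', 'b'],
                    '2000': [10, 20],
                    '2001': [11, 21],
                    '2002': [12, 22]})

# Reshape wide -> long so each (Name, Year) pair becomes one row.
income = df2.melt(id_vars='Name', var_name='Year', value_name='Income')
income['Year'] = income['Year'].astype(int)  # match df1's integer years

merged = df1.merge(income, on=['Name', 'Year'], how='left')
```

The left merge keeps every row of the first frame and looks up the matching income, avoiding a Python-level loop over rows.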

Python/increase code efficiency about multiple columns filter

I was wondering if someone could help me find a more efficient way to run my code.
I have a dataset containing 7 columns: country, sector, year, month, week, weekday, value.
The year column has only 3 values: 2019, 2020 and 2021.
What I have to do is subtract every value in 2020 and 2021 from the corresponding 2019 value.
It is more complicated than that because I also need to match on the weekday column. For example, the value for year 2020, month 1, week 1, weekday 0 (Monday) must have the value for year 2019, month 1, week 1, weekday 0 (Monday) subtracted from it; if no match is found, the row is skipped. In other words, the weekday (Monday, Tuesday, ...) must be matched.
Here is my code. It runs, but it takes hours :(
for i in itertools.product(year_list, country_list, sector_list, month_list, week_list, weekday_list):
    try:
        data_2 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == i[0])
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        data_1 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == 2019)
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        co2.append(data_2 - data_1)
        country.append(i[1])
        sector.append(i[2])
        year.append(i[0])
        month.append(i[3])
        week.append(i[4])
        weekday.append(i[5])
    except IndexError:  # no matching 2019 row
        pass
I changed the for loops to itertools.product, but it is still not fast enough. Any other ideas?
Many thanks :)
##############################
Here is a sample of the dataset:
country co2 sector date week weekday year month
Brazil 108.767782 Power 2019-01-01 0 1 2019 1
China 14251.044482 Power 2019-01-01 0 1 2019 1
EU27 & UK 1886.493814 Power 2019-01-01 0 1 2019 1
France 53.856398 Power 2019-01-01 0 1 2019 1
Germany 378.323440 Power 2019-01-01 0 1 2019 1
Japan 21.898788 IA 2021-11-30 48 1 2021 11
Russia 19.773822 IA 2021-11-30 48 1 2021 11
Spain 42.293944 IA 2021-11-30 48 1 2021 11
UK 56.425121 IA 2021-11-30 48 1 2021 11
US 166.425000 IA 2021-11-30 48 1 2021 11
or this
import pandas as pd
pd.DataFrame({
'year': [2019, 2020, 2021],
'co2': [1,2,3],
'country': ['Brazil', 'Brazil', 'Brazil'],
'sector': ['power', 'power', 'power'],
'month': [1, 1, 1],
'week': [0,0,0],
'weekday': [0,0,0]
})
pandas can subtract two dataframes index-by-index, so the idea is to separate your data into a minuend and a subtrahend, set ['country', 'sector', 'month', 'week', 'weekday'] as their indices, subtract them, and remove the rows (with dropna) where no match in year 2019 is found.
df_carbon = pd.DataFrame({
'year': [2019, 2020, 2021],
'co2': [1,2,3],
'country': ['ab', 'ab', 'bc']
})
index = ['country']
# index = ['country', 'sector', 'month', 'week', 'weekday']
df_2019 = df_carbon[df_carbon['year']==2019].set_index(index)
df_rest = df_carbon[df_carbon['year']!=2019].set_index(index)
ans = (df_rest - df_2019).reset_index().dropna()
ans['year'] += 2019
Two additional points:
In this subtraction the year column is also subtracted, so I need to add 2019 back.
I created a small example of df_carbon to test my code. If you had provided a more realistic version in text form, I would have tested my code against your data.
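Applied to the full set of key columns, the same index-alignment idea looks like this (a sketch; the column names are taken from the question, the data is made up):

```python
import pandas as pd

df_carbon = pd.DataFrame({
    'country': ['Brazil'] * 3,
    'sector': ['Power'] * 3,
    'year': [2019, 2020, 2021],
    'month': [1, 1, 1],
    'week': [0, 0, 0],
    'weekday': [1, 1, 1],
    'co2': [100.0, 90.0, 80.0],
})
keys = ['country', 'sector', 'month', 'week', 'weekday']
base = df_carbon[df_carbon['year'] == 2019].set_index(keys)['co2']
rest = df_carbon[df_carbon['year'] != 2019].set_index(keys)
# Rows align on the key columns, so each 2020/2021 value gets the
# matching 2019 baseline subtracted; unmatched rows become NaN.
rest['diff'] = rest['co2'] - base
result = rest.dropna(subset=['diff']).reset_index()
```

This replaces the per-combination filtering loop with a single aligned subtraction, which is where the speedup comes from.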

Pandas trying to find a solution to extract string better with different pattern

I have a pandas dataframe with a column that looks like this:
Period
0 summer 2020
1 winter 2021
2 day
3 March '20
4 June '21
5 12-13 April '20
6 summer 2021
7 12/03/20 base
8 week 8 '20
9 Weekend base
10 Monday base
11 BOM base
12 Year 2021
I want to return a new column that derives a new category. So if column Period contains the string 'summer', return 'season'; if it contains 'March', return 'month'.
However, I have a problem where some strings contain a month name preceded by a date, for example 12-14 April '20. For those that have both a date and a month, I want to return 'weekend'.
I want this output:
Period Time
0 summer 2020 season
1 winter 2021 season
2 day day
3 March '20 month
4 Q1 '21 quarter
5 12-14 April '20 week/weekend
6 summer 2021 season
7 12/03/20 base day
8 week 8 '20 week/weekend
9 Weekend base week/weekend
10 Monday base day
11 BOM base day
12 Year 2021 year
Here is my attempt, where I used '-' as the common character for this type of string, but it doesn't solve the problem: the row still returns 'month' because 'April' is matched first in the example above.
df['Time'] = pd.np.where(df.Period.str.contains("Summer"), "season",
pd.np.where(df.Period.str.contains("Winter"), "season",
pd.np.where(df.Period.str.contains("January"), "month",
pd.np.where(df.Period.str.contains("February"), "month",
pd.np.where(df.Period.str.contains("March"), "month",
pd.np.where(df.Period.str.contains("April"), "month",
pd.np.where(df.Period.str.contains("June"), "month",
pd.np.where(df.Period.str.contains("July"), "month",
pd.np.where(df.Period.str.contains("August"), "month",
pd.np.where(df.Period.str.contains("September"), "month",
pd.np.where(df.Period.str.contains("October"), "month",
pd.np.where(df.Period.str.contains("November"), "month",
pd.np.where(df.Period.str.contains("December"), "month",
pd.np.where(df.Period.str.contains("Q"), "quarter",
pd.np.where(df.Period.str.contains("-"), "week/weekend",
pd.np.where(df.Period.str.contains("Week"), "week/weekend",
pd.np.where(df.Period.str.contains("Year"), "year", "day-ahead")))))))))))))))))
EDITED: added new strings to column Period (index 7-12) and changed the category 'weekend' to 'week/weekend'. If it is not season, month, quarter, week/weekend or year, then I would like to return 'day', as at the very end of my code.
You could use a mapping dictionary to identify all of the matches using pd.Series.str.extract():
import pandas as pd
df = pd.DataFrame({'Period': ['summer 2020','winter 2021','day','March \'20','Q1 \'21','12-14 April \'20','summer 2021']})
mapping = {
'weekend': ['-'],
'season': ['spring','summer','fall','autumn','winter'],
'day': ['day'],
'month': ['January','February','March','April','May','June','July','August','September','October','November','December'],
'quarter': ['Q']
}
df['Time'] = pd.concat([df['Period'].str.extract('({})'.format(')|('.join(v))).bfill(axis=1).iloc[:,0] for k, v in mapping.items()], axis=1).bfill(axis=1).iloc[:,0]
invert_mapping = {i: k for k, v in mapping.items() for i in v}
df['Time'] = df['Time'].map(invert_mapping)
Yields:
Period Time
0 summer 2020 season
1 winter 2021 season
2 day day
3 March '20 month
4 Q1 '21 quarter
5 12-14 April '20 weekend
6 summer 2021 season
This is not much different from your solution but (hopefully) it's more readable and easier to maintain.
import pandas as pd
import numpy as np
seasons = "|".join(["summer", "autumn", "winter", "spring"])
months = "|".join(['January', 'February', 'March', 'April',
'May', 'June', 'July','August', 'September',
'October', 'November', 'December'])
quarters = "|".join([f"Q{i+1}" for i in range(4)])
x = df["Period"]
# np.select returns the choice for the FIRST matching condition, so the
# '-' (date-range) check must come before the month check; otherwise
# "12-14 April '20" would be classified as "month".
cond_list = [x.str.contains("-"),
             x.str.contains(seasons),
             x.str.contains(months),
             x.str.contains(quarters),
             x.str.contains("day")]
choice_list = ["weekend",
               "season",
               "month",
               "quarter",
               "day"]
df["Time"] = np.select(cond_list, choice_list)
The reason for "|".join(...) is given here

How to groupby in Pandas and keep all columns [duplicate]

This question already has answers here:
Pandas DataFrame Groupby two columns and get counts
(8 answers)
Closed 3 years ago.
I have a data frame like this:
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
And I need to group drug names and mean number of ingredients by year like this:
year drug_name avg_number_of_ingredients
0 2019 drug a,b,c.. mean value for column
1 2018 drug a,b,c.. mean value for column
2 2017 drug a,b,c.. mean value for column
If I do df.groupby('year'), I lose drug names. How can I do it?
Let me show you the solution using a simple example. First, I make the same data frame as yours:
>>> df = pd.DataFrame(
[
{'year': 2019, 'drug_name': 'NEXIUM I.V.', 'avg_number_of_ingredients': 8},
{'year': 2016, 'drug_name': 'ZOLADEX', 'avg_number_of_ingredients': 10},
{'year': 2017, 'drug_name': 'PRILOSEC', 'avg_number_of_ingredients': 59},
{'year': 2017, 'drug_name': 'BYDUREON BCise', 'avg_number_of_ingredients': 24},
{'year': 2019, 'drug_name': 'Lynparza', 'avg_number_of_ingredients': 28},
]
)
>>> print(df)
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
Now, I make df_grouped, which still keeps the drug name information.
>>> df_grouped = df.groupby('year', as_index=False).agg({'drug_name': ', '.join, 'avg_number_of_ingredients': 'mean'})
>>> print(df_grouped)
year drug_name avg_number_of_ingredients
0 2016 ZOLADEX 10.0
1 2017 PRILOSEC, BYDUREON BCise 41.5
2 2019 NEXIUM I.V., Lynparza 18.0
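An equivalent spelling uses pandas named aggregation (available since pandas 0.25), which some find more readable; a sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2016, 2017, 2017, 2019],
    'drug_name': ['NEXIUM I.V.', 'ZOLADEX', 'PRILOSEC', 'BYDUREON BCise', 'Lynparza'],
    'avg_number_of_ingredients': [8, 10, 59, 24, 28],
})
# Each keyword becomes an output column: (source column, aggregation).
df_grouped = df.groupby('year', as_index=False).agg(
    drug_name=('drug_name', ', '.join),
    avg_number_of_ingredients=('avg_number_of_ingredients', 'mean'),
)
```

The keyword form makes the output column names explicit instead of reusing the dict keys.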

Pandas Creating Dataframes from Loops

I am trying to make a dataframe so that I can send it to a CSV easily; otherwise I have to do this process manually.
I'd like this to be my final output. Each person has a month-and-year combo that starts at 1/1/2014 and goes to 12/1/2016:
Name date
0 ben 1/1/2014
1 ben 2/1/2014
2 ben 3/1/2014
3 ben 4/1/2014
....
12 dan 1/1/2014
13 dan 2/1/2014
14 dan 3/1/2014
code so far:
import pandas as pd
days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']
df = pd.DataFrame({"Name": listof_people})
for month in months:
    df.append({'date': month}, ignore_index=True)
print(df)
When I try looping to create the dataframe, it either does not work or I get index errors (because of the non-matching list lengths), and I'm at a loss.
I've done a good bit of searching and have found some following links that are similar, but I can't reverse engineer the work to fit my case.
Filling empty python dataframe using loops
How to build and fill pandas dataframe from for loop?
I don't want anyone to feel like they are "doing my homework", so if i'm derping on something simple please let me know.
I think you can use product to get all the combinations, with to_datetime for the date column:
import pandas as pd
from itertools import product
days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']
df1 = pd.DataFrame(list(product(listof_people, months, days, years)))
df1.columns = ['Name', 'month','day','year']
print (df1)
Name month day year
0 ben 1 1 2014
1 ben 1 1 2015
2 ben 1 1 2016
3 ben 2 1 2014
4 ben 2 1 2015
5 ben 2 1 2016
6 ben 3 1 2014
7 ben 3 1 2015
8 ben 3 1 2016
9 ben 4 1 2014
10 ben 4 1 2015
...
...
df1['date'] = pd.to_datetime(df1[['month','day','year']])
df1 = df1[['Name','date']]
print (df1)
Name date
0 ben 2014-01-01
1 ben 2015-01-01
2 ben 2016-01-01
3 ben 2014-02-01
4 ben 2015-02-01
5 ben 2016-02-01
6 ben 2014-03-01
7 ben 2015-03-01
...
...
Alternatively, with a MultiIndex:
mux = pd.MultiIndex.from_product(
    [listof_people, years, months],
    names=['Name', 'Year', 'Month'])
pd.Series(
    1, mux, name='Day'
).reset_index().assign(
    date=lambda d: pd.to_datetime(d[['Year', 'Month', 'Day']])
)[['Name', 'date']]
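Another compact option (a sketch, not from either answer above) builds the month starts with pd.date_range and pairs them with names via MultiIndex.from_product:

```python
import pandas as pd

names = ['ben', 'dan']  # shortened list for illustration
# 'MS' = month-start frequency: 1/1/2014, 2/1/2014, ..., 12/1/2016.
dates = pd.date_range('2014-01-01', '2016-12-01', freq='MS')
idx = pd.MultiIndex.from_product([names, dates], names=['Name', 'date'])
out = idx.to_frame(index=False)
```

This skips the separate day/month/year lists entirely, since date_range already generates the first of each month.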
