Update a dataframe iteratively - python

I have a dataframe:
QID URL Questions Answers Section QType Theme Topics Answer0 Answer1 Answer2 Answer3 Answer4 Answer5 Answer6
1113 1096 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing To what extent are the following factors considerations in your choice of flight? ['Very important consideration', 'Important consideration', 'Neutral', 'Not an important consideration', 'Do not consider'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['extent', 'follow', 'factor', 'consider', 'choic', 'flight'] Very important consideration Important consideration Neutral Not an important consideration Do not consider NaN NaN
1116 1097 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing How far in advance do you typically book your tickets? ['0-2 months in advance', '2-4 months in advance', '4-6 months in advance', '6-8 months in advance', '8-10 months in advance', '10-12 months in advance', '12+ months in advance'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['advanc', 'typic', 'book', 'ticket'] 0-2 months in advance 2-4 months in advance 4-6 months in advance 6-8 months in advance 8-10 months in advance 10-12 months in advance 12+ months in advance
Some of its rows are actually QuestionGrid titles, and I want to replace them with new rows that also carry the answers. I have another source, a pickle file, which contains the information needed to build the rows that will replace the old ones. Each old row is transformed into several new rows (I mention this because I do not know how to do it).
These lines are just the grid titles of questions like the following one:
Expected dataframe
I would like to insert them in the original dataframe, instead of the lines where they match in the 'Questions' column, as in the following dataframe:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1096_S01 'The airline/company you fly with'
1096_S02 'The departure airport'
1096_S03 'Duration of flight/route'
1096_S04 'Baggage policy'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
1097_S01 ...
...
What I tried
import pickle
from functools import reduce

import pandas as pd

qa = pd.read_pickle(r'Python/interns.p')
df = pd.read_csv("QuestionBank.csv")

def isGrid(dic, df):
    '''Check if a row is a row related to a Google Forms grid
    if it is a case update this row'''
    d_answers = dic['answers']
    try:
        answers = d_answers[2]
        if len(answers) > 1:
            # find the line in df and replace the line where it matches by the lines
            update_lines(dic, df)
        return df
    except TypeError:
        return df

def update_lines(dic, df):
    '''find the line in df and replace the line where it matches
    with the question in dic by the new lines'''
    lines_to_replace = df.index[df['Questions'] == dic['question']].tolist()  # might be several rows and maybe they aren't all to replace
    # I check there is at least a row to replace
    if lines_to_replace:
        # I replace all rows where the question matches
        for line_to_replace in lines_to_replace:
            # replace this row and the following by the following dataframe
            questions = reduce(lambda a, b: a + b, [dic['answers'][2][x][3] for x in range(len(dic['answers'][2]))])
            ind_answers = dic["answers"][2][0][1]
            answers = []
            # I get all the potential answers
            for i in range(len(ind_answers)):
                answers.append(reduce(lambda a, b: a + b, [ind_answers[i] for x in range(len(questions))]))  # duplicates as there are many lines with the same answers in a grid, maybe I should have used set
            answers = {f"Answer{i}": answers[i] for i in range(0, len(answers))}  # dynamically allocate to place them in the right columns
            dict_replacing = {'Questions': questions, **answers}  # dictionary used to create the new lines
            df1 = pd.DataFrame(dict_replacing)
            df1.index = df1.index / 10 + line_to_replace
            df = df1.combine_first(df)
    return df
I did a Colaboratory notebook if needed.
What I obtain
But the dataframe is the same size before and after we do this. In effect, I get:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
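One thing visible in the code above: update_lines rebinds df to a new object with df = df1.combine_first(df), and isGrid calls update_lines(dic, df) without using the return value, so the caller never sees the enlarged frame. As a minimal sketch of one way to splice several rows in place of one (the helper name replace_row and the toy data are hypothetical, not from the original notebook), the result can be built with pd.concat and returned to the caller:

```python
import pandas as pd

def replace_row(df, position, new_rows):
    """Return a NEW frame in which the row at integer `position`
    is replaced by all rows of `new_rows` (hypothetical helper)."""
    before = df.iloc[:position]
    after = df.iloc[position + 1:]
    return pd.concat([before, new_rows, after], ignore_index=True)

# Toy data standing in for the question bank (made up for illustration).
df = pd.DataFrame({
    "QID": ["1096", "1097"],
    "Questions": ["Grid title question", "Another question"],
})

# Rows that should replace the grid title row: the title plus its sub-questions.
grid_rows = pd.DataFrame({
    "QID": ["1096", "1096_S01", "1096_S02"],
    "Questions": ["Grid title question",
                  "The airline/company you fly with",
                  "The departure airport"],
})

# The result must be assigned back; the original frame is not modified in place.
df = replace_row(df, 0, grid_rows)
print(df)
```

The same idea applies to the functions above: whatever builds the new frame has to return it, and every caller has to keep the returned value (for example df = update_lines(dic, df)).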

Related

groupby two columns based on third column values [duplicate]

This question already has answers here:
group by two columns based on created column
(2 answers)
Closed 2 years ago.
I have a dataset like this
df = pd.DataFrame({'time': ['13:30', '9:20', '18:12', '19:00', '11:20', '13:30', '15:20', '17:12', '16:00', '8:20'],
                   'item': ['coffee', 'bread', 'pizza', 'rice', 'soup', 'coffee', 'bread', 'pizza', 'rice', 'soup']})
I want to produce this output
First I need to split the time into 3 categories, based on these intervals:
(6,11] for breakfast, (11,15] for lunch and (15,20] for dinner.
This is my code:
df['hour'] = df['time'].apply(lambda x: int(x.split(':')[0]))

def time_period(hour):
    if hour >= 6 and hour < 11:
        return 'breakfast'
    elif hour >= 11 and hour < 15:
        return 'lunch'
    else:
        return 'dinner'

df['meal'] = df['hour'].apply(lambda x: time_period(x))
But I don't know how to do the 'groupby' part.
You can simply group by "meal" and build the list of items for each meal. Just add the following line to do this:
df.groupby(['meal'])["item"].apply(list)
You can then apply further aggregations, such as a count, on top of this to get the result you want.
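As a hedged sketch of the whole pipeline, the binning can also be done with pd.cut, which by default uses right-closed intervals and therefore follows the (6,11], (11,15], (15,20] definition from the question (the if/elif version in the question treats the boundaries slightly differently, e.g. hour 11). The column name count and the variable names are my own choices:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["13:30", "9:20", "18:12", "19:00", "11:20",
             "13:30", "15:20", "17:12", "16:00", "8:20"],
    "item": ["coffee", "bread", "pizza", "rice", "soup",
             "coffee", "bread", "pizza", "rice", "soup"],
})

# Extract the hour as an integer.
df["hour"] = df["time"].str.split(":").str[0].astype(int)

# Right-closed bins: (6, 11], (11, 15], (15, 20].
df["meal"] = pd.cut(df["hour"], bins=[6, 11, 15, 20],
                    labels=["breakfast", "lunch", "dinner"])

# List of items per meal, as in the answer above.
items_per_meal = df.groupby("meal", observed=True)["item"].apply(list)

# Number of times each item was ordered per meal.
summary = (df.groupby(["meal", "item"], observed=True)
             .size()
             .reset_index(name="count"))
print(items_per_meal)
print(summary)
```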

Categorical Nested Lists into Pandas DataFrame

I have three levels of categorical data that I need to convert into a Pandas DataFrame with repeating labels on the upper categories. I have lists for "main", "sub", and "tertiary" as follows:
main_labels = ['Certain infectious and parasitic diseases','Neoplasms']
main_icds = ['A00-B99','C00-D49']
sub_labels = ['Intestinal infectious diseases','Tuberculosis','Malignant neoplasms of lip, oral cavity and pharynx','Malignant neoplasms of digestive organs']
sub_icds = ['A00-A09','A15-A19','C00-C14','C15-C26']
ter_labels = ['Cholera','Typhoid and paratyphoid fevers','Respiratory tuberculosis','Tuberculosis of nervous system','Malignant neoplasm of lip','Malignant neoplasm of base of tongue','Malignant neoplasm of esophagus','Malignant neoplasm of stomach']
ter_icds = ['A00','A01','A15','A17','C00','C01','C15','C16']
For illustration and example purposes, I need them to look like below in a Pandas DataFrame. If I can accomplish this, I can add in the label values.
It seemed like it would be easy but I'm stumped. Any help greatly appreciated. I tried searching historical posts but was having trouble finding the right key words to get anything close to what I'm trying to do. Thanks!
I think the best way is to start with the tertiary category, then find its sub and main classifications. Python compares strings lexicographically, so range checks on these alphanumeric codes should be pretty robust as long as all codes have the same letter-plus-two-digits shape.
import pandas as pd
main_icds = ['A00-B99','C00-D49']
sub_icds = ['A00-A09','A15-A19','C00-C14','C15-C26']
ter_icds = ['A00','A01','A15','A17','C00','C01','C15','C16']
#split on '-' to get bounds for each category
subs = [sub.split('-') for sub in sub_icds]
mains = [main.split('-') for main in main_icds]
df = pd.DataFrame({'ter_icd':ter_icds})
df['sub_icd'] = [sub_icd for ter in ter_icds
                 for sub_icd, sub in zip(sub_icds, subs)
                 if sub[0] <= ter <= sub[1]]
df['main_icd'] = [main_icd for ter in ter_icds
                  for main_icd, main in zip(main_icds, mains)
                  if main[0] <= ter <= main[1]]
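To then attach the label values mentioned in the question, one option is to map each code to its label through a dictionary. The sketch below is only an illustration: the helper find_range and the *_label column names are hypothetical. Unlike the list comprehension above, the helper returns None when a code falls in no range, rather than producing a list of the wrong length.

```python
import pandas as pd

main_labels = ['Certain infectious and parasitic diseases', 'Neoplasms']
main_icds = ['A00-B99', 'C00-D49']
sub_labels = ['Intestinal infectious diseases', 'Tuberculosis',
              'Malignant neoplasms of lip, oral cavity and pharynx',
              'Malignant neoplasms of digestive organs']
sub_icds = ['A00-A09', 'A15-A19', 'C00-C14', 'C15-C26']
ter_labels = ['Cholera', 'Typhoid and paratyphoid fevers',
              'Respiratory tuberculosis', 'Tuberculosis of nervous system',
              'Malignant neoplasm of lip', 'Malignant neoplasm of base of tongue',
              'Malignant neoplasm of esophagus', 'Malignant neoplasm of stomach']
ter_icds = ['A00', 'A01', 'A15', 'A17', 'C00', 'C01', 'C15', 'C16']

def find_range(code, ranges):
    """Return the first 'LOW-HIGH' range containing `code`, or None."""
    for rng in ranges:
        low, high = rng.split('-')
        if low <= code <= high:  # lexicographic string comparison
            return rng
    return None

df = pd.DataFrame({'ter_icd': ter_icds, 'ter_label': ter_labels})
df['sub_icd'] = [find_range(code, sub_icds) for code in df['ter_icd']]
df['main_icd'] = [find_range(code, main_icds) for code in df['ter_icd']]

# Look up the labels for the upper categories; they repeat on every row.
df['sub_label'] = df['sub_icd'].map(dict(zip(sub_icds, sub_labels)))
df['main_label'] = df['main_icd'].map(dict(zip(main_icds, main_labels)))
print(df)
```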

Received 'Boolean Series key will be reindexed to match DataFrame index' warning when creating a new data frame

Is there any potential downside to using the following code to create a new data frame, in which I specify exactly which information from the original data frame I want to see in the new one?
df_workloc = (df[df['WorkLoc'] == 'Home'][df['CareerSat'] == 'Very satisfied'][df['CurrencySymbol'] == 'USD'][df['CompTotal'] >= 50000])
I used the 2019 Stack Overflow survey data. As such:
WorkLoc specifies where a respondent works.
CareerSat specifies a respondent's career satisfaction.
CurrencySymbol specifies what currency a respondent gets paid in.
CompTotal specifies what a respondent's total compensation is.
If anyone has a cleaner, more efficient way of building a data frame with refined / specific information, I'd love to see it. One thing I'd like to do is restrict the total compensation CompTotal to >= 50000 and <= 75000 in the same line. However, I got an error when I tried to include the second boolean condition.
Thanks in advance.
I think you need to chain the conditions with & (bitwise AND) and filter by boolean indexing. For the last condition, use Series.between:
m1 = df['WorkLoc'] == 'Home'
m2 = df['CareerSat'] == 'Very satisfied'
m3 = df['CurrencySymbol'] == 'USD'
m4 = df['CompTotal'].between(50000, 75000)
df_workloc = df[m1 & m2 & m3 & m4]
Or for one line solution:
df_workloc = df[(df['WorkLoc'] == 'Home') &
                (df['CareerSat'] == 'Very satisfied') &
                (df['CurrencySymbol'] == 'USD') &
                df['CompTotal'].between(50000, 75000)]
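A small self-contained illustration of the combined mask, using made-up rows rather than the actual survey data. Series.between is inclusive on both ends by default, so 50000 and 75000 themselves are kept. DataFrame.query is shown as an alternative spelling of the same filter:

```python
import pandas as pd

# Made-up sample rows; only the column names come from the question.
df = pd.DataFrame({
    "WorkLoc": ["Home", "Office", "Home", "Home", "Home"],
    "CareerSat": ["Very satisfied"] * 5,
    "CurrencySymbol": ["USD", "USD", "USD", "EUR", "USD"],
    "CompTotal": [50000, 60000, 80000, 70000, 75000],
})

# One combined boolean mask, evaluated against the full frame.
mask = (
    (df["WorkLoc"] == "Home")
    & (df["CareerSat"] == "Very satisfied")
    & (df["CurrencySymbol"] == "USD")
    & df["CompTotal"].between(50000, 75000)
)
df_workloc = df[mask]

# Equivalent filter written as a query string.
df_query = df.query(
    "WorkLoc == 'Home' and CareerSat == 'Very satisfied' "
    "and CurrencySymbol == 'USD' "
    "and CompTotal >= 50000 and CompTotal <= 75000"
)
print(df_workloc)
```

Because the mask is computed once on the original frame, its index always matches df, which is what avoids the reindexing warning triggered by chaining df[...][...].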

Splitting a cell in pandas DataFrame and counting values

I have an xlsx file with survey data sorted by questions as follows:
df = pd.DataFrame({
'Question 1': ['5-6 hours', '6-7 hours', '9-10 hours'],
'Question 2': ['Very restful', 'Somewhat restful', 'Somewhat restful'],
'Question 3': ['[Home (dorm; apartment)]', '[Vehicle;None of the above; Other]', '[Campus;Home (dorm; apartment);Vehicle]'],
'Question 4': ['[Family;No one; alone]', '[Classmates; students;Family;No one; alone]', '[Family]'],
})
>>> df
Question 1 Question 2 Question 3 Question 4
5-6 hours Very restful [Home (dorm; apartment)] [Family;No one; alone]
6-7 hours Somewhat restful [Vehicle;None of the above; Other] [Classmates; students;Family;No one; alone]
9-10 hours Somewhat restful [Campus;Home (dorm; apartment);Vehicle] [Family]
For Questions 3 and 4, the input was a checkbox style, allowing for multiple answers. How could I approach getting the value counts for specific answer choices, rather than the value counts for the cell as a whole?
e.g
Question 4
Family 3
No one; alone 2
Classmates; students 1
Currently I'm doing this:
import os

import pandas as pd

files = os.listdir()
for filename in files:
    if filename.endswith(".xlsx"):
        df = pd.read_excel(filename)
        for column in df:
            x = pd.Series(df[column].values).value_counts()
            print(x)
However, this doesn't allow me to separate cells that have multiple answers.
Thank you!
This gets you part of the way, but I don't know how to parse your data. For example, if you used the semicolon as the delimiter in Question 3, the first cell would be parsed as ['Home (dorm', ' apartment)'].
>>> pd.Series([choice.strip()
...            for choice in df['Question 4'].str[1:-1].str.split(';').sum()]
...           ).value_counts()
Family 3
alone 2
No one 2
Classmates 1
students 1
dtype: int64
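If the set of possible answer choices is known in advance (an assumption; the question does not say so), one way around the ambiguous semicolons is to count how many cells contain each choice as a literal substring instead of splitting. The choices list below is hypothetical and reconstructed from the example data:

```python
import pandas as pd

df = pd.DataFrame({
    'Question 4': ['[Family;No one; alone]',
                   '[Classmates; students;Family;No one; alone]',
                   '[Family]'],
})

# Assumed list of answer choices for this checkbox question.
choices = ['Family', 'No one; alone', 'Classmates; students']

# For each choice, count the cells that contain it as a literal substring.
counts = pd.Series({
    choice: int(df['Question 4'].str.contains(choice, regex=False).sum())
    for choice in choices
}).sort_values(ascending=False)
print(counts)
```

This will overcount if one choice is a substring of another, so it is only a sketch for choice lists without such overlaps.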
Do you mean groupby? See https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/
df1 = df.groupby('Question 4')
or use groupby('...').agg(...), see
https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

How to combine two sets of data with differences in merge-index strings?

I want to merge two CSV files with soccer data. They hold different data about partially overlapping sets of games. Normally I would do a merge with df.merge, but the problem is that the naming differs for some teams between the two datasets. E.g. "Atletic Bilbao" is called "Club Atletic" in the second set.
Therefore I would like to normalize the team names across the two datasets so that I can do a simple df.merge on dates and team names. At the moment this results in extra rows whenever a team has different names in the two sets.
So my main question is: how can I normalize the team names in the two sets easily, without having to analyse all the differences by hand and hardcode replace operations on one of the sets?
Dataset1 is downloadable here: https://data.fivethirtyeight.com/#soccer-spi
Dataset2 is not available freely, but it looks like this:
hometeam awayteam date homeproba drawproba awayproba homexg awayxg
Manchester United Leicester 2018-08-10 22:00:00 0.2812 0.3275 0.3913 1.5137 1.73813
--Edit after first comments--
So the main question is: how could I automatically analyse the naming differences between the two datasets? Helpful facts:
As the sets hold whole seasons, the overlap per team name is at least 30 games.
Most of the teams have same names, name differences are the smaller part of the team names.
Most name differences have at least a common substring.
Both datasets have date-information of the games.
We know, a team plays only one game a day.
So if Dataset1 says:
1.1.2018 Real - Atletic Club
And Dataset2 says:
1.1.2018 Real - Atletic Bilbao
We should be able to analyse that: {'Atletic Club':'Atletic Bilbao'}
This is how I finally solved it:
import pandas as pd
df_teamnames = pd.merge(dataset1,dataset2,on=['hometeam','date'])
df_teamnames = df_teamnames[['awayteam_x','awayteam_y']]
df_teamnames = df_teamnames.drop_duplicates()
This gives you a dataframe that pairs up the names of each team present in both datasets, like this:
1 Marseille Marseille
2 Atletic Club Atletic Bilbao
...
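As a hedged sketch of the next step, the pairs can be turned into a translation dictionary and applied with replace. The two tiny datasets below are made up for illustration; only the column names hometeam, awayteam and date come from the question. Note that this relies on the home team names already matching, so it only discovers differences among away teams:

```python
import pandas as pd

# Made-up miniature versions of the two datasets.
dataset1 = pd.DataFrame({
    "hometeam": ["Real", "Lyon"],
    "awayteam": ["Atletic Club", "Marseille"],
    "date": ["2018-01-01", "2018-01-02"],
})
dataset2 = pd.DataFrame({
    "hometeam": ["Real", "Lyon"],
    "awayteam": ["Atletic Bilbao", "Marseille"],
    "date": ["2018-01-01", "2018-01-02"],
})

# Same game in both sets: same home team on the same date.
pairs = pd.merge(dataset1, dataset2, on=["hometeam", "date"])
pairs = pairs[["awayteam_x", "awayteam_y"]].drop_duplicates()

# Keep only the names that actually differ and build the mapping.
differing = pairs[pairs["awayteam_x"] != pairs["awayteam_y"]]
mapping = dict(zip(differing["awayteam_x"], differing["awayteam_y"]))

# Rename the teams in dataset1 so both sets use dataset2's naming.
dataset1[["hometeam", "awayteam"]] = dataset1[["hometeam", "awayteam"]].replace(mapping)
print(mapping)
```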
Assuming your dates are compatible (and correct), this should probably work to generate a translation dictionary. This type of thing is always super fragile I think though, and you shouldn't really rely on it.
import pandas as pd

names_1 = dataset1['hometeam'].unique().tolist()
names_2 = dataset2['hometeam'].unique().tolist()

# names that are identical in both datasets map to themselves
mapping_dict = dict()
for common_name in set(names_1).intersection(set(names_2)):
    mapping_dict[common_name] = common_name

unknown_1 = set(names_1).difference(set(names_2))
unknown_2 = set(names_2).difference(set(names_1))

trim_df1 = dataset1.loc[:, ['hometeam', 'awayteam', 'date']]
trim_df2 = dataset2.loc[:, ['hometeam', 'awayteam', 'date']]
# merge (not join) so that both frames are matched on the hometeam and date columns
aligned_data = trim_df1.merge(trim_df2, on=['hometeam', 'date'], how='inner', suffixes=('_1', '_2'))

for unknown_name in unknown_1:
    matching_name = aligned_data.loc[aligned_data['awayteam_1'] == unknown_name, 'awayteam_2'].unique()
    if len(matching_name) != 1:
        raise ValueError("Couldn't find a unique match")
    mapping_dict[unknown_name] = matching_name[0]
    unknown_2.remove(matching_name[0])

if len(unknown_2) != 0:
    raise ValueError("We have extra team names for some reason")
