I have an xlsx file with survey data sorted by questions as follows:
import pandas as pd
df = pd.DataFrame({
    'Question 1': ['5-6 hours', '6-7 hours', '9-10 hours'],
    'Question 2': ['Very restful', 'Somewhat restful', 'Somewhat restful'],
    'Question 3': ['[Home (dorm; apartment)]', '[Vehicle;None of the above; Other]', '[Campus;Home (dorm; apartment);Vehicle]'],
    'Question 4': ['[Family;No one; alone]', '[Classmates; students;Family;No one; alone]', '[Family]'],
})
>>> df
Question 1  Question 2        Question 3                               Question 4
5-6 hours   Very restful      [Home (dorm; apartment)]                 [Family;No one; alone]
6-7 hours   Somewhat restful  [Vehicle;None of the above; Other]       [Classmates; students;Family;No one; alone]
9-10 hours  Somewhat restful  [Campus;Home (dorm; apartment);Vehicle]  [Family]
For Questions 3 and 4, the input was a checkbox style, allowing for multiple answers. How could I approach getting the value counts for specific answer choices, rather than the value counts for the cell as a whole?
e.g.
Question 4
Family 3
No one; alone 2
Classmates; students 1
Currently I'm doing this:
import os
import pandas as pd
files = os.listdir()
for filename in files:
    if filename.endswith(".xlsx"):
        df = pd.read_excel(filename)
        for column in df:
            # counts each whole cell, so multi-answer cells are not split
            x = pd.Series(df[column].values).value_counts()
            print(x)
However, this doesn't allow me to separate cells that have multiple answers.
Thank you!
This gets you part of the way, but I don't know how to parse your data. For example, if you used the semicolon as the delimiter in Question 3, 'Home (dorm; apartment)' would be split into ['Home (dorm', ' apartment)'].
>>> pd.Series([choice.strip()
for choice in df['Question 4'].str[1:-1].str.split(';').sum()]
).value_counts()
Family 3
alone 2
No one 2
Classmates 1
students 1
dtype: int64
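One possible way around the delimiter problem, as a minimal sketch: in the sample data, the semicolons that separate choices are never followed by a space, while the semicolons inside a choice ("No one; alone") always are. Assuming that pattern holds for the whole file and that no cell is missing, splitting only on a semicolon not followed by whitespace keeps each choice intact. The helper name count_choices is made up for this illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Question 4': ['[Family;No one; alone]',
                   '[Classmates; students;Family;No one; alone]',
                   '[Family]'],
})

def count_choices(column):
    """Count individual checkbox choices in a '[a;b;c]'-style column.

    Assumption: a ';' that separates two choices is never followed by
    whitespace, while a ';' inside a choice always is.
    """
    choices = (
        column.str.strip('[]')                                  # drop the surrounding brackets
              .apply(lambda cell: re.split(r';(?!\s)', cell))   # split on ';' not followed by whitespace
              .explode()                                        # one row per choice
              .str.strip()
    )
    return choices.value_counts()

counts = count_choices(df['Question 4'])
print(counts)
```

The same helper can be applied to 'Question 3'. If the real file does not follow the spacing pattern, the split rule would need to change.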
Do you mean groupby? https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/
df1 = df.groupby('Question 4')
or groupby('...').agg(...)
https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
I have some financial information that I've kept in an Excel document for a while and I'd like to run some Python code on it, but I'm having some issues converting the object types to floats. The problem seems to be the '$ -'.
This is how the data looks when loaded in:
import pandas as pd
dfData = {'Item': ['Product 1','Product 2','Product 3'],
'Cost': [14.87,'-9.47','$ -']
}
df = pd.DataFrame(dfData,columns=['Item','Cost'])
df
Item Cost
0 Product 1 14.87
1 Product 2 -9.47
2 Product 3 $ -
I've tried:
df['Cost'] = df['Cost'].str.replace('$','').str.replace(' ','').astype('float')
...as well as other similar str.replace commands, but I keep getting the following error:
ValueError: could not convert string to float: ''
This is my first Stack Overflow post, so go easy on me! I have looked all over for a solution but can't find one that addresses this specific problem. I can't replace the '-' either, because in row 1 it indicates a negative value.
You don't need to chain str.replace, you can just use replace:
df['Cost'] = df['Cost'].replace({r'\$': '', '-': '-0'}, regex=True).astype(float)
print(df)
# Output
Item Cost
0 Product 1 14.87
1 Product 2 -9.47
2 Product 3 -0.00
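If the -0.00 is unwanted, another option is to clean the strings first and treat a lone dash as zero. This is only a sketch, and it assumes that '$ -' is an accounting-style placeholder for zero, which the question does not confirm.

```python
import pandas as pd

df = pd.DataFrame({'Item': ['Product 1', 'Product 2', 'Product 3'],
                   'Cost': [14.87, '-9.47', '$ -']})

cleaned = (
    df['Cost'].astype(str)                  # make every cell a string
    .str.replace('$', '', regex=False)      # literal '$', not the regex end-of-string anchor
    .str.strip()                            # '$ -' has become ' -', so trim it to '-'
    .replace('-', '0')                      # whole-cell match only; '-9.47' is left alone
)
df['Cost'] = pd.to_numeric(cleaned)         # raises if anything unparseable remains
print(df)
```

Passing regex=False matters here: as a regular expression, '$' matches the end of the string rather than the dollar sign.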
I created a map where the key is a string and the value is a list of tuples. I also have a dataframe that looks like this
d = {'comments' : ['This is a bad website', 'The website is slow']}
df = pd.DataFrame(data = d)
The map's value for 'This is a bad website' contains something like this
[("There isn't enough support to our site",
'Staff Not Onsite',
0.7323943661971831),
('I would like to have them on site more frequently',
'Staff Not Onsite',
0.6875)]
What I want to do now is create 6 new columns inside the data frame using the first two tuple entries in the map.
So what I would want is something like this
d = {'comments' : ['This is a bad website', 'The website is slow'],
'comment_match_1' : ["There isn't enough support to our site", ''],
'Negative_category_1' : ['Staff Not Onsite', ''],
'Score_1' : [0.7323, 0],
'comment_match_2' : ['I would like to have them on site more frequently', ''],
'Negative_category_2' : ['Staff Not Onsite', ''],
'Score_2' : [0.6875, 0]}
df = pd.DataFrame(data = d)
Any suggestions on how to achieve this are greatly appreciated.
Here is how I generated the map for reference
d = {}
a = []
for x, y in zip(df['comments'], df['negative_category']):
    for z in unlabeled_df['comments']:
        a.append((x, y, difflib.SequenceMatcher(None, x, z).ratio()))
        d[z] = a
Thus when I execute this line of code
d['This is a bad website']
I get
[("There isn't enough support to our site",
'Staff Not Onsite',
0.7323943661971831),
('I would like to have them on site more frequently',
'Staff Not Onsite',
0.6875), ...]
You can build a mapping dictionary by flattening the list of tuples stored under each key. Then use Series.map to look up each value of the comments column in that mapping, build a new dataframe from the results, and join it to the original dataframe:
mapping = {k: np.hstack(v) for k, v in d.items()}
df.join(pd.DataFrame(df['comments'].map(mapping).dropna().tolist()))
comments 0 1 2 3 4 5
0 This is a bad website There isn't enough support to our site Staff Not Onsite 0.7323943661971831 I would like to have them on site more frequently Staff Not Onsite 0.6875
1 The website is slow NaN NaN NaN NaN NaN NaN
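To get the named columns from the question instead of 0 to 5, and to fill comments without a match with '' and 0, something along these lines might work. It is a sketch only: the helper top_matches and the sample d are made up for the example, and d is assumed to hold lists of (comment, category, score) tuples as shown in the question.

```python
import pandas as pd

df = pd.DataFrame({'comments': ['This is a bad website', 'The website is slow']})

# Sample of the map from the question (only the first two tuples matter here).
d = {
    'This is a bad website': [
        ("There isn't enough support to our site", 'Staff Not Onsite', 0.7323943661971831),
        ('I would like to have them on site more frequently', 'Staff Not Onsite', 0.6875),
    ],
}

NAMES = ['comment_match', 'Negative_category', 'Score']

def top_matches(comment, n=2):
    """Return the first n tuples for a comment as a flat dict of named columns."""
    matches = d.get(comment, [])[:n]
    row = {}
    for i in range(n):
        # Fall back to empty values when there are fewer than n matches.
        values = matches[i] if i < len(matches) else ('', '', 0)
        for name, value in zip(NAMES, values):
            row[f'{name}_{i + 1}'] = value
    return row

wide = pd.DataFrame([top_matches(c) for c in df['comments']], index=df.index)
result = df.join(wide)
print(result)
```

Building wide with index=df.index keeps the new columns aligned with the original rows even when some comments have no entry in the map.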
I have a dataframe:
QID URL Questions Answers Section QType Theme Topics Answer0 Answer1 Answer2 Answer3 Answer4 Answer5 Answer6
1113 1096 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing To what extent are the following factors considerations in your choice of flight? ['Very important consideration', 'Important consideration', 'Neutral', 'Not an important consideration', 'Do not consider'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['extent', 'follow', 'factor', 'consider', 'choic', 'flight'] Very important consideration Important consideration Neutral Not an important consideration Do not consider NaN NaN
1116 1097 https://docs.google.com/forms/d/1hIkfKc2frAnxsQzGw_h4bIqasPwAPzzLFWqzPE3B_88/edit?usp=sharing How far in advance do you typically book your tickets? ['0-2 months in advance', '2-4 months in advance', '4-6 months in advance', '6-8 months in advance', '8-10 months in advance', '10-12 months in advance', '12+ months in advance'] When choosing an airline to fly with, which factors are most important to you? (Please list 3.) Multiple Choice Airline XYZ ['advanc', 'typic', 'book', 'ticket'] 0-2 months in advance 2-4 months in advance 4-6 months in advance 6-8 months in advance 8-10 months in advance 10-12 months in advance 12+ months in advance
Some of its rows are actually question-grid titles, and I want to replace each of them with new rows that also represent the answers. I have another file, a pickle, which contains the information needed to build the rows that will replace the old ones. Each old row will be turned into several new rows (I mention this because I do not know how to do it).
These lines are just the grid titles of questions like the following one:
Expected dataframe
I would like to insert them in the original dataframe, instead of the lines where they match in the 'Questions' column, as in the following dataframe:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1096_S01 'The airline/company you fly with'
1096_S02 'The departure airport'
1096_S03 'Duration of flight/route'
1096_S04 'Baggage policy'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
1097_S01 ...
...
What I tried
import pickle
from functools import reduce
import pandas as pd
qa = pd.read_pickle(r'Python/interns.p')
df = pd.read_csv("QuestionBank.csv")
def isGrid(dic, df):
    '''Check whether a row relates to a Google Forms grid;
    if it does, update that row'''
    d_answers = dic['answers']
    try:
        answers = d_answers[2]
        if len(answers) > 1:
            # find the line in df and replace the line where it matches by the lines
            update_lines(dic, df)
        return df
    except TypeError:
        return df
def update_lines(dic, df):
    '''find the line in df and replace the line where it matches
    with the question in dic by the new lines'''
    lines_to_replace = df.index[df['Questions'] == dic['question']].tolist() # might be several rows and maybe they aren't all to replace
    # I check there is at least a row to replace
    if lines_to_replace:
        # I replace all rows where the question matches
        for line_to_replace in lines_to_replace:
            # replace this row and the following by the following dataframe
            questions = reduce(lambda a,b: a + b,[dic['answers'][2][x][3] for x in range(len(dic['answers'][2]))])
            ind_answers = dic["answers"][2][0][1]
            answers = []
            # I get all the potential answers
            for i in range(len(ind_answers)):
                answers.append(reduce(lambda a,b: a+b,[ind_answers[i] for x in range(len(questions))])) # duplicates as there are many lines with the same answers in a grid, maybe I should have used set
            answers = {f"Answer{i}": answers[i] for i in range(0, len(answers))} # dynamically assign them to the right columns
            dict_replacing = {'Questions': questions, **answers} # dictionary used to create the new lines
            df1 = pd.DataFrame(dict_replacing)
            df1.index = df1.index / 10 + line_to_replace
            df = df1.combine_first(df)
    return df
I did a Colaboratory notebook if needed.
What I obtain
However, the dataframe is the same size before and after running this. In fact, I get:
QID Questions QType Answer1 Answer2 Answer3 Answer4 Answer5
1096 'To what extent are the following factors considerations in your choice of flight?' Question Grid 'Very important consideration' 'Important consideration' 'Neutral' 'Not an important consideration' 'Do not consider'
1097 'To what extent are the following factors considerations in your choice of flight?' Question Grid ...
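One thing visible in the code above is that isGrid calls update_lines(dic, df) without using its return value. Rebinding df inside update_lines does not change the caller's dataframe, which would explain why the size stays the same. As a rough sketch of the row-replacement step itself, one row can be swapped for several by slicing around it and concatenating. The helper replace_row and the two-column sample frame are assumptions made for this illustration, not the asker's real data.

```python
import pandas as pd

def replace_row(df, position, new_rows):
    """Return a new frame in which the row at integer position `position`
    is replaced by all rows of `new_rows` (the input frame is not modified)."""
    return pd.concat(
        [df.iloc[:position], new_rows, df.iloc[position + 1:]],
        ignore_index=True,
    )

df = pd.DataFrame({
    'QID': ['1096', '1097'],
    'Questions': ['To what extent are the following factors considerations in your choice of flight?',
                  'How far in advance do you typically book your tickets?'],
})

# The grid title row followed by its sub-question rows.
new_rows = pd.DataFrame({
    'QID': ['1096', '1096_S01', '1096_S02'],
    'Questions': ['To what extent are the following factors considerations in your choice of flight?',
                  'The airline/company you fly with',
                  'The departure airport'],
})

df = replace_row(df, 0, new_rows)   # the result must be assigned back
print(df)
```

When replacing several rows in a loop, going from the last position to the first avoids shifting the positions that are still to be processed.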
I have text in one column and a corresponding dictionary in another column. I have tokenized the text and want to replace each token that matches a key in the corresponding dictionary with that key's value. The text and the dictionary are specific to each record of a pandas dataframe.
import pandas as pd
data =[['1','i love mangoes',{'love':'hate'}],['2', 'its been a long time we have not met',{'met':'meet'}],['3','i got a call from one of our friends',{'call':'phone call','one':'couple of'}]]
df = pd.DataFrame(data, columns = ['id', 'text','dictionary'])
The final dataframe should be:
data =[['1','i hate mangoes'],['2', 'its been a long time we have not meet'],['3','i got a phone call from couple of of our friends']]
df = pd.DataFrame(data, columns =['id', 'modified_text'])
I am using Python 3 on a Windows machine.
You can use dict.get method after zipping the 2 cols and splitting the sentence:
df['modified_text']=([' '.join([b.get(i,i) for i in a.split()])
for a,b in zip(df['text'],df['dictionary'])])
print(df)
Output:
id text \
0 1 i love mangoes
1 2 its been a long time we have not met
2 3 i got a call from one of our friends
dictionary \
0 {'love': 'hate'}
1 {'met': 'meet'}
2 {'call': 'phone call', 'one': 'couple of'}
modified_text
0 i hate mangoes
1 its been a long time we have not meet
2 i got a phone call from couple of of our friends
I added spaces to the key and values to distinguish a whole word from part of it:
def replace(text, mapping):
    new_s = text
    for key in mapping:
        k = ' '+key+' '
        val = ' '+mapping[key]+' '
        new_s = new_s.replace(k, val)
    return new_s
df_out = (df.assign(modified_text=lambda f:
                    f.apply(lambda row: replace(row.text, row.dictionary), axis=1))
          [['id', 'modified_text']])
print(df_out)
id modified_text
0 1 i hate mangoes
1 2 its been a long time we have not met
2 3 i got a phone call from couple of of our friends
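Note that in the output above row 1 still ends in "met": padding the key with spaces misses a word at the very start or end of the text. A possible alternative, sketched here with a hypothetical helper replace_words, is a single regular expression with word boundaries. Doing all substitutions in one pass also means a replacement such as "phone call" is not matched again by the "call" key.

```python
import re
import pandas as pd

data = [['1', 'i love mangoes', {'love': 'hate'}],
        ['2', 'its been a long time we have not met', {'met': 'meet'}],
        ['3', 'i got a call from one of our friends',
         {'call': 'phone call', 'one': 'couple of'}]]
df = pd.DataFrame(data, columns=['id', 'text', 'dictionary'])

def replace_words(text, mapping):
    """Replace whole words in `text` that are keys of `mapping`, in one pass."""
    if not mapping:
        return text
    # \b anchors restrict matches to whole words; re.escape guards special characters.
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, mapping)) + r')\b')
    return pattern.sub(lambda m: mapping[m.group(0)], text)

df['modified_text'] = [replace_words(text, mapping)
                       for text, mapping in zip(df['text'], df['dictionary'])]
print(df[['id', 'modified_text']])
```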
I am new to Python/pandas and have a dataframe with two columns, one a series and the other a string.
I would like to split the contents of one column (a Series) into multiple columns. I would appreciate your input on this.
This is my current dataframe content
Songdetails Density
0 ["'t Hof Van Commerce", "Chance", "SORETGR12AB... 4.445323
1 ["-123min.", "Try", "SOERGVA12A6D4FEC55"] 3.854437
2 ["10_000 Maniacs", "Please Forgive Us (LP Vers... 3.579846
3 ["1200 Micrograms", "ECSTACY", "SOKYOEA12AB018... 5.503980
4 ["13 Cats", "Please Give Me Something", "SOYLO... 2.964401
5 ["16 Bit Lolitas", "Tim Likes Breaks (intermez... 5.564306
6 ["23 Skidoo", "100 Dark", "SOTACCS12AB0185B85"] 5.572990
7 ["2econd Class Citizen", "For This We'll Find ... 3.756746
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
The desired output is SONG, ARTIST, SONG ID, DENSITY, i.e. the song details split into columns.
For example, for the sample data:
SONG DETAILS DENSITY
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
SONG ARTIST SONG ID DENSITY
2tall Demonstration SOYYQZR12A8C144F9D 5.472524
Thanks
The following worked for me:
In [275]:
pd.DataFrame(data = list(df['Song details'].values), columns = ['Song', 'Artist', 'Song Id'])
Out[275]:
Song Artist Song Id
0 2tall Demonstration SOYYQZR12A8C144F9D
1 2tall Demonstration SOYYQZR12A8C144F9D
For you please try: pd.DataFrame(data = list(df['Songdetails'].values), columns = ['SONG', 'ARTIST', 'SONG ID'])
Thank you. I had to insert a column into the new dataframe and was able to achieve what I needed: df2 = pd.DataFrame(series.apply(lambda x: pd.Series(x.split(','))))
df2.insert(3,'Density',finaldf['Density'])
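If the Songdetails cells are strings that look like JSON lists, as the use of split(',') suggests, parsing them with json.loads avoids breaking on commas inside a title. This is a sketch under that assumption, using two rows copied from the question; the column labels follow the asker's desired output.

```python
import json
import pandas as pd

df = pd.DataFrame({
    'Songdetails': ['["-123min.", "Try", "SOERGVA12A6D4FEC55"]',
                    '["2tall", "Demonstration", "SOYYQZR12A8C144F9D"]'],
    'Density': [3.854437, 5.472524],
})

# Assumption: every cell is a valid JSON array with exactly three items.
parsed = df['Songdetails'].apply(json.loads)

songs = pd.DataFrame(parsed.tolist(),
                     columns=['SONG', 'ARTIST', 'SONG ID'],
                     index=df.index)            # keep the original row labels
result = songs.join(df['Density'].rename('DENSITY'))
print(result)
```

If the column already holds real Python lists rather than strings, the json.loads step can be dropped and df['Songdetails'].tolist() used directly.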