I want to insert several different values into just one cell
E.g.
Friends' names
ID | Grade | Names
----+--------------+----------------------------
1 | elementary | Kai, Matthew, Grace
2 | guidance | Eli, Zoey, David, Nora, William
3 | High school | Emma, James, Levi, Sophia
Or as a list or dictionary:
ID | Grade | Names
----+--------------+------------------------------
1 | elementary | [Kai, Matthew, Grace]
2 | guidance | [Eli, Zoey, David, Nora, William]
3 | High school | [Emma, James, Levi, Sophia]
or
ID | Grade | Names
----+--------------+---------------------------------------------
1 | elementary | { a:Kai, b:Matthew, c:Grace}
2 | guidance | { a:Eli, b:Zoey, c:David, d:Nora, e:William}
3 | High school | { a:Emma, b:James, c:Levi, d:Sophia}
Is there a way?
Yes, there is a way, but that doesn't mean you should do it this way.
You could, for example, serialize your values as a JSON string and store that in the column. If you later want to add a value, you can parse the JSON, append the value, and write the string back to the database. (This might also work with a BLOB, but I'm not sure.)
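A minimal sketch of that approach in Python with sqlite3; the database file, table, and column names here are made up for illustration:
import json
import sqlite3

conn = sqlite3.connect("school.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS grades (id INTEGER PRIMARY KEY, grade TEXT, names TEXT)")

# Store the list serialized as a JSON string.
conn.execute("INSERT INTO grades (grade, names) VALUES (?, ?)",
             ("elementary", json.dumps(["Kai", "Matthew", "Grace"])))

# To add a value later: read, parse, append, write back.
row = conn.execute("SELECT names FROM grades WHERE grade = ?", ("elementary",)).fetchone()
names = json.loads(row[0])
names.append("Ava")
conn.execute("UPDATE grades SET names = ? WHERE grade = ?",
             (json.dumps(names), "elementary"))
conn.commit()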
However, I would not recommend storing a list inside a column, as SQL is not meant to be used like that.
What I would recommend instead is a separate table for the grades, each with its own primary key. Like this:
 ID | Grade
----+-------------
  1 | Elementary
  2 | Guidance
  3 | High school
And then another table containing all the names, with its own primary key and the GradeID as a foreign key. E.g.:
 ID | GradeID | Name
----+---------+---------
  1 |       1 | Kai
  2 |       1 | Matthew
  3 |       1 | Grace
  4 |       2 | Eli
  5 |       2 | Zoey
  6 |       2 | David
  7 |       2 | Nora
  8 |       2 | William
  9 |       3 | Emma
 10 |       3 | James
 11 |       3 | Levi
 12 |       3 | Sophia
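A minimal sketch of this normalized schema in Python with sqlite3; the table and column names are illustrative, not prescribed:
import sqlite3

conn = sqlite3.connect("school.db")  # hypothetical database file
conn.executescript("""
CREATE TABLE IF NOT EXISTS grade (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS student (
    id INTEGER PRIMARY KEY,
    grade_id INTEGER NOT NULL REFERENCES grade(id),
    name TEXT NOT NULL
);
""")
conn.execute("INSERT INTO grade (name) VALUES (?)", ("Elementary",))
grade_id = conn.execute("SELECT id FROM grade WHERE name = ?", ("Elementary",)).fetchone()[0]
conn.executemany("INSERT INTO student (grade_id, name) VALUES (?, ?)",
                 [(grade_id, n) for n in ("Kai", "Matthew", "Grace")])
conn.commit()

# All names for a grade come back with a simple join.
rows = conn.execute("""
    SELECT student.name FROM student
    JOIN grade ON grade.id = student.grade_id
    WHERE grade.name = ?
""", ("Elementary",)).fetchall()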
If you want to know more about this, you should read about Normalization in SQL.
Related
I am being provided with a dataset and I am writing a function.
My objective is quite simple: I have an Airbnb database with various columns. I am using a for loop over a neighbourhood-group list (that I created) and I am trying to extract (append) the data related to each particular element into an empty dataframe.
Example:
import pandas as pd
import numpy as np
dict1 = {
    'id': [2539, 2595, 3647, 3831, 12937, 18198, 258838, 258876, 267535, 385824],
    'name': ['Clean & quiet apt home by the park', 'Skylit Midtown Castle',
             'THE VILLAGE OF HARLEM....NEW YORK !', 'Cozy Entire Floor of Brownstone',
             '1 Stop fr. Manhattan! Private Suite,Landmark Block', 'Little King of Queens',
             'Oceanview,close to Manhattan', 'Affordable rooms,all transportation',
             'Home Away From Home-Room in Bronx', 'New York City- Riverdale Modern two bedrooms unit'],
    'price': [149, 225, 150, 89, 130, 70, 250, 50, 50, 120],
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan', 'Brooklyn', 'Queens',
                            'Queens', 'Staten Island', 'Staten Island', 'Bronx', 'Bronx']
}
df = pd.DataFrame(dict1)
df
I created a function as follows:
nbd_grp = ['Bronx', 'Queens', 'Staten Island', 'Brooklyn', 'Manhattan']

# Creating a function to find the cheapest place in each neighbourhood group
dfdf = pd.DataFrame(columns=['id', 'name', 'price', 'neighbourhood_group'])

def cheapest_place(neighbourhood_group):
    for elem in nbd_grp:
        data = df.loc[df['neighbourhood_group'] == elem]
        cheapest = data.loc[data['price'] == min(data['price'])]
        dfdf = cheapest.copy()

cheapest_place(nbd_grp)
My expected output is:
     id | name                                | price | neighbourhood_group
--------+-------------------------------------+-------+---------------------
 267535 | Home Away From Home-Room in Bronx   |    50 | Bronx
  18198 | Little King of Queens               |    70 | Queens
 258876 | Affordable rooms,all transportation |    50 | Staten Island
   3831 | Cozy Entire Floor of Brownstone     |    89 | Brooklyn
   3647 | THE VILLAGE OF HARLEM....NEW YORK ! |   150 | Manhattan
My advice is that anytime you are working in a database or in a dataframe and you think "I need to loop", you should think again.
When working in a dataframe you are in a world of set-based logic, and there is likely a better set-based way of solving the problem. In your case you can groupby() your neighbourhood_group, take the min() of the price column, and then merge or join that result set back to your original dataframe to pick up your id and name columns.
That would look something like:
df_min_price = df.groupby('neighbourhood_group').price.agg(min).reset_index().merge(df, on=['neighbourhood_group','price'])
+-----+---------------------+-------+--------+-------------------------------------+
| idx | neighbourhood_group | price | id | name |
+-----+---------------------+-------+--------+-------------------------------------+
| 0 | Bronx | 50 | 267535 | Home Away From Home-Room in Bronx |
| 1 | Brooklyn | 89 | 3831 | Cozy Entire Floor of Brownstone |
| 2 | Manhattan | 150 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! |
| 3 | Queens | 70 | 18198 | Little King of Queens |
| 4 | Staten Island | 50 | 258876 | Affordable rooms,all transportation |
+-----+---------------------+-------+--------+-------------------------------------+
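If you only need one row per group, a common alternative is idxmin(), though note it keeps only the first row on a price tie, whereas the merge above keeps all tied rows:
df_min_price = df.loc[df.groupby('neighbourhood_group')['price'].idxmin()]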
I'm working with a huge set of data that I can't work with in Excel, so I'm using pandas/Python, but I'm relatively new to it. I have a column of book titles that also includes genres, both before and after the title. I only want the column to contain book titles, so what would be the easiest way to remove the genres?
Here is an example of what the column contains:
Book Labels
Science Fiction | Drama | Dune
Thriller | Mystery | The Day I Died
Thriller | Razorblade Tears | Family | Drama
Comedy | How To Marry Keanu Reeves In 90 Days | Drama
...
So above, the book titles would be Dune, The Day I Died, Razorblade Tears, and How To Marry Keanu Reeves In 90 Days, but as you can see the genres precede as well as succeed the titles.
I was thinking I could create a list of all the genres (as there are only so many) and remove those from the column along with the "|" separators, but if anyone has suggestions on a simpler way to remove the genres and the "|" characters, please help me out.
This is an enhancement to @tdy's regex solution. The original regex Family|Drama will match the words "Family" and "Drama" anywhere in the string, so if a book title itself contains one of the genre words, that word will be removed as well.
Assuming the labels are separated by " | ", there are three match positions we want to remove:
Genre at the start of the string, e.g. Drama | ...
Genre in the middle, e.g. ... | Drama | ...
Genre at the end of the string, e.g. ... | Drama
Use the regex (^|\| )(?:Family|Drama)(?=( \||$)) to match any of the three conditions. Note that | Drama | Family contains two overlapping matches; I use the lookahead (?=( \||$)) instead of consuming the trailing separator, so that adjacent genres are both matched rather than only the first. See this question [Use regular expressions to replace overlapping subpatterns] for more details.
>>> genres = ["Family", "Drama"]
>>> df
# Book Labels
# 0 Drama | Drama 123 | Family
# 1 Drama 123 | Drama | Family
# 2 Drama | Family | Drama 123
# 3 123 Drama 123 | Family | Drama
# 4 Drama | Family | 123 Drama
>>> re_str = r"(^|\| )(?:{})(?=( \||$))".format("|".join(genres))
>>> df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True)
# 0 | Drama 123
# 1 Drama 123
# 2 | Drama 123
# 3 123 Drama 123
# 4 | 123 Drama
>>> df["Book Labels"] = df["Book Labels"].str.strip("| ")
# 0 Drama 123
# 1 Drama 123
# 2 Drama 123
# 3 123 Drama 123
# 4 123 Drama
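One caveat worth noting: if any genre name contains regex metacharacters (for example + or parentheses), escape the names before joining them into the pattern:
>>> import re
>>> re_str = r"(^|\| )(?:{})(?=( \||$))".format("|".join(map(re.escape, genres)))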
So I'm trying to take a dataframe like this (for example):
ID | reason_for_rejection
--------------------------
1 | invalid insurance
2 | behavior issues
3 | not enough money
4 | no space in hospital
5 | anger issues
...
and, using a hand-written mapping (for example {financial: [invalid insurance, not enough money], patient problems: [behavior issues, anger issues], ...}), create a new column containing the mapped values and turn this into:
ID | reason_for_rejection | reason_for_rejection_grouped
---------------------------------------------------------------
1 | invalid insurance | financial
2 | behavior issues | patient problems
3 | not enough money | financial
4 | no space in hospital | occupancy
5 | anger issues | patient problems
...
So while the 'reason_for_rejection' column will have a lot of unique values, I want to use some kind of a mapping that maps those unique values into 7 or 8 unique values in 'reason_for_rejection_grouped'.
I considered using a dictionary here, but the key would be a value in 'reason_for_rejection_grouped' and the values would be values in 'reason_for_rejection', so I'd have to get the key based off the value, which would be computationally expensive (and I have a really big dataset to look at).
Any guidance or suggestions would be super helpful!
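A sketch of one way around that concern, assuming the hand-written mapping above: invert the dictionary once up front (a single pass over all the values), then use the vectorized .map() on the column. The dictionary contents here just echo the example:
# Hypothetical grouping dictionary, echoing the example above.
grouping = {
    'financial': ['invalid insurance', 'not enough money'],
    'patient problems': ['behavior issues', 'anger issues'],
    'occupancy': ['no space in hospital'],
}

# Invert once: reason -> group. Done a single time, not per row.
reason_to_group = {reason: group
                   for group, reasons in grouping.items()
                   for reason in reasons}

df['reason_for_rejection_grouped'] = df['reason_for_rejection'].map(reason_to_group)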
I have a pandas dataframe that looks like this:
df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'],
                   'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'],
                                ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many favefood's of other people overlap with their own.
I.e., for each person I want to find out how many other people have a non-empty intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be through a nested for loop, i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Would there be any way to do this more efficiently using pandas notation? Thanks!
Logic behind it:
# one row per person, one indicator column per food
s = df['favefood'].explode().str.get_dummies().groupby(level=0).sum()
# s.dot(s.T) counts shared foods for every pair of people;
# subtract 1 to drop each person's overlap with themselves
s.dot(s.T).ne(0).sum(axis=1) - 1
Out[84]:
0 3
1 2
2 2
3 1
4 0
dtype: int64
df['overlap'] = s.dot(s.T).ne(0).sum(axis=1) - 1
Method from sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = pd.DataFrame(mlb.fit_transform(df['favefood']), columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1)-1
0 3
1 2
2 2
3 1
4 0
dtype: int64
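One caveat about scale: s.dot(s.T) materializes an n x n matrix, which will not fit in memory for the ~2 million rows mentioned in the question. A sparse sketch of the same idea (the intermediate product can still be large, so treat this as a starting point rather than a guaranteed fit):
from scipy import sparse

# sparse indicator matrix: one row per person, one column per food
m = sparse.csr_matrix(mlb.fit_transform(df['favefood']))
# pairwise shared-food counts, kept sparse; count nonzero pairs per row,
# then subtract 1 for each person's overlap with themselves
df['overlap'] = (m @ m.T > 0).sum(axis=1).A1 - 1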
First of all, I apologize! It's my first time using Stack Overflow, so I hope I'm doing it right! I searched but couldn't find what I'm looking for.
I'm also quite new to pandas and Python :)
I am going to try to use an example, and I will try to be clear.
I have a dataframe with 30 columns that contains information about a shopping cart. One of the columns (Order) has two values: either completed or in progress.
I also have about 20 columns with items, let's say apple, orange, bananas... and I need to know how many times there is an apple in a completed order and how many in an in-progress order. I decided to use a pivot table with the count aggregate function.
This would be a small example of the dataframe:
Order | apple | orange | banana | pear | pineapple | ... |
-----------|-------|--------|--------|------|-----------|------|
completed | 2 | 4 | 10 | 5 | 1 | |
completed | 5 | 4 | 5 | 8 | 3 | |
iProgress | 3 | 7 | 6 | 5 | 2 | |
completed | 6 | 3 | 1 | 7 | 1 | |
iProgress | 10 | 2 | 2 | 2 | 2 | |
completed | 2 | 1 | 4 | 8 | 1 | |
I have the output I want but what I'm looking for is a more elegant way of selecting lots of columns without having to type them manually.
df.pivot_table(index=['Order'],
               values=['apple', 'bananas', 'orange', 'pear', 'strawberry', 'mango'],
               aggfunc='count')
But I want to select around 15 columns, so instead of typing them one by one, I'm sure there is an easy way of doing it by using column numbers or something. Let's say I want to select columns 6 through 15.
I have tried things like values=[df.columns[6:15]], and I have also tried using df.iloc, but as I said, I'm pretty new, so I'm probably using things wrong or doing something silly!
Is there also a way to get them in their original order? In my output they seem to have been sorted alphabetically, and I want to keep the order of the columns, so it should be apple, orange, banana...
Order Completed In progress
apple 92 221
banana 102 144
mango 70 55
I'm just looking for a way of improving my code and I hope I have not made much mess. Thank you!
I think you can use:
# if you need to select only a few columns, slice df.columns, e.g. df.columns[1:3]
df = df.pivot_table(columns=['Order'], values=df.columns[1:3], aggfunc='count')
print (df)
Order completed iProgress
apple 4 2
orange 4 2
# if you need all columns, the values parameter can be omitted
df = df.pivot_table(columns=['Order'], aggfunc='count')
print (df)
Order completed iProgress
apple 4 2
banana 4 2
orange 4 2
pear 4 2
pineapple 4 2
For the difference between count and len/size, see: What is the difference between size and count in pandas?
df = df.pivot_table(columns=['Order'], aggfunc=len)
print (df)
Order completed iProgress
apple 4 2
banana 4 2
orange 4 2
pear 4 2
pineapple 4 2
# solution with groupby and transpose
df = df.groupby('Order').count().T
print (df)
Order completed iProgress
apple 4 2
orange 4 2
banana 4 2
pear 4 2
pineapple 4 2
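If you prefer the pivot_table version but want the items back in their original column order (pivot_table sorts them alphabetically), one way is to reindex the result by the original columns; a sketch, assuming Order is among the original columns:
out = df.pivot_table(columns=['Order'], aggfunc='count')
out = out.reindex(df.columns.drop('Order'))  # restore original item order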
Your example doesn't show an item that is absent from the cart. I'm assuming it comes up as None or 0. If that's correct, then I fill the NA values and count how many are greater than 0:
df.set_index('Order').fillna(0).gt(0).groupby(level='Order').sum().T