Why does my df save one dictionary as two duplicate rows? - python

I have a dictionary:
import pandas as pd
d = {'id': 1, 'name': 'Pizza', 'calories': 234}
print(pd.DataFrame(d))
When I try to turn it into a dataframe using pd.DataFrame(d), I get a dataframe with two duplicate rows of the same entry:
   id   name  calories
0   1  Pizza       234
1   1  Pizza       234
I want the outcome to be only one row for each entry, not two.
I have tried using pd.DataFrame(d) and pd.DataFrame.from_dict(d). I know I can just use df.iloc[0] or remove duplicates to solve this issue, but why is the duplicate saved at all?
Pandas version is 1.4.2

Not sure if this is version dependent, but I can't even make a DataFrame using just the two lines you've mentioned. That said, you should be able to resolve this by making each value a list:
d = {'id': [1], 'name': ['Pizza'], 'calories': [234]}
pd.DataFrame(d)

I don't know if it's this obvious to everyone, but personally I feel like a doofus finding this out only now. The solution was to put the dict inside list brackets:
pd.DataFrame([d])
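To spell out both fixes side by side, here is a minimal sketch: wrapping the dict in a list tells pandas "one dict = one row", while making every value a one-element list tells it "one element = one row". Either way you get a single row:

```python
import pandas as pd

d = {'id': 1, 'name': 'Pizza', 'calories': 234}

# Fix 1: wrap the dict in a list -- each dict in the list becomes one row
df1 = pd.DataFrame([d])

# Fix 2: make every value a one-element list -- each list element becomes one row
df2 = pd.DataFrame({k: [v] for k, v in d.items()})

print(df1)  # one row: id=1, name=Pizza, calories=234
```

Both produce identical single-row frames; the list-of-dicts form is usually the less typing.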

Related

Python Pandas: Sorting Pivot Table column by another column

I am trying to pivot some data in Python pandas package by using the pivot_table feature but as part of this I have a specific, bespoke order that I want to see my columns returned in - determined by a Sort_Order field which is already in the dataframe. So for test example with:
raw_data = {'Support_Reason': ['LD', 'Mental Health', 'LD', 'Mental Health', 'LD', 'Physical', 'LD'],
            'Setting': ['Nursing', 'Nursing', 'Residential', 'Residential', 'Community', 'Prison', 'Residential'],
            'Setting_Order': [1, 1, 2, 2, 3, 4, 2],
            'Patient_ID': [6789, 1234, 4567, 5678, 7890, 1235, 3456]}
Data = pd.DataFrame(raw_data, columns=['Support_Reason', 'Setting', 'Setting_Order', 'Patient_ID'])
Data
Then pivot:
pivot = pd.pivot_table(Data, values='Patient_ID', index=['Support_Reason'],
                       columns=['Setting'], aggfunc='count', dropna=False)
pivot = pivot.reset_index()
pivot
This is exactly how I want my table to look, except that the columns have defaulted to A-Z ordering. I would like them ordered ascending as per the Setting_Order column, i.e. Nursing, Residential, Community, then Prison. Is there some additional syntax I could add to my pd.pivot_table code to make this possible, please?
I realise there are a few different work-arounds for this, the simplest being re-ordering the columns afterwards(!), but I want to avoid hard-coding column names, as these will change over time (both the headings and their order) and the Setting and Setting_Order fields will be managed in a separate reference table. So any answer that avoids listing Settings in code would be ideal.
Try:
ordered = Data.sort_values("Setting_Order")["Setting"].drop_duplicates().tolist()
pivot = pivot[list(pivot.columns.difference(ordered)) + ordered]
or, equivalently:
col_order = list(Data.sort_values('Setting_Order')['Setting'].unique())
pivot[['Support_Reason'] + col_order]
Does this help?
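Putting it together with the question's own sample data, a runnable sketch: derive the column order from the data itself (sort by Setting_Order, keep the first occurrence of each Setting), then select the pivot's columns in that order. Nothing is hard-coded, so it survives changes to the reference table:

```python
import pandas as pd

raw_data = {'Support_Reason': ['LD', 'Mental Health', 'LD', 'Mental Health', 'LD', 'Physical', 'LD'],
            'Setting': ['Nursing', 'Nursing', 'Residential', 'Residential', 'Community', 'Prison', 'Residential'],
            'Setting_Order': [1, 1, 2, 2, 3, 4, 2],
            'Patient_ID': [6789, 1234, 4567, 5678, 7890, 1235, 3456]}
Data = pd.DataFrame(raw_data)

pivot = pd.pivot_table(Data, values='Patient_ID', index=['Support_Reason'],
                       columns=['Setting'], aggfunc='count', dropna=False)

# Column order is read from the data, not hard-coded
col_order = Data.sort_values('Setting_Order')['Setting'].drop_duplicates().tolist()
pivot = pivot[col_order].reset_index()

print(pivot.columns.tolist())
# ['Support_Reason', 'Nursing', 'Residential', 'Community', 'Prison']
```

If the Setting/Setting_Order mapping lives in a separate reference table, build `col_order` from that table instead of from `Data`; the selection step is the same.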

Changing column values for a value in an adjacent column in the same dataframe using Python

I am quite new to Python programming.
I am working with the following dataframe:
[screenshot: dataframe before]
Note that in column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the FBgn values in column "FBgn". Maybe keep in mind that I am showing only a portion of the dataframe (reality: 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
[screenshot: dataframe after]
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
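A self-contained sketch of that one-liner, using made-up FBtr/FBgn values (only the column names "FBgn" and "## FlyBase_FBgn" come from the question): build a boolean mask of the rows whose FBgn value contains "FBtr", then overwrite just those cells from the adjacent column:

```python
import pandas as pd

# Sample values are invented for illustration; column names match the question
df = pd.DataFrame({'FBgn': ['FBtr0300689', 'FBgn0031208', 'FBtr0078763'],
                   '## FlyBase_FBgn': ['FBgn0002121', 'FBgn0031208', 'FBgn0026619']})

# Mask selects only the FBtr rows; FBgn rows are left untouched
mask = df['FBgn'].str.contains('FBtr')
df.loc[mask, 'FBgn'] = df.loc[mask, '## FlyBase_FBgn']

print(df['FBgn'].tolist())
# ['FBgn0002121', 'FBgn0031208', 'FBgn0026619']
```

Because `.loc` aligns on the index, the replacement values land on exactly the rows the mask selects, however many of the 1432 rows match.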
Welcome to Stack Overflow. Next time, please provide more info, including your code; it is always helpful.
Please see the code below, I think you need something similar
import pandas as pd

# ignore dict1, I just wanted to recreate your df
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
         "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)  # recreating your dataframe

# print df
print(df)

# function to replace the values
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df['FBgn'][i]:
            # use .loc rather than chained indexing so the write is reliable
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df

df = replace_values(df)

# print new df
print(df)

How to replace a string that is a part of a dataframe with a list in pandas?

I am a beginner at coding, and since this is a very simple question, I know there must be answers out there. However, I've searched for about a half hour, typing countless queries in google, and all has flown over my head.
Lets say I have a dataframe with columns "Name", "Hobbies" and 2 people, so 2 rows. Currently, I have the hobbies as strings in the form "hobby1, hobby2". I would like to change this into ["hobby1", "hobby2"]
hobbies_as_string = df.iloc[0, 2]
hobbies_as_list = hobbies_as_string.split(',')
df.iloc[0, -2] = hobbies_as_list
However, this fails with an error: ValueError: Must have equal len keys and value when setting with an iterable. I don't understand why, if I get hobbies_as_string out as a copy, I'm able to split it into a list with no problem. I'm also able to assign df.iloc[0, -2] a string, such as "Hey", and that works fine. I guess it has to do with the ValueError. Why won't pandas let me assign it as a list??
Thank you very much for your help and explanation.
Use the "at" method to replace a value with a list
import pandas as pd

# create a dataframe
df = pd.DataFrame(data={'Name': ['Stinky', 'Lou'],
                        'Hobbies': ['Shooting Sports', 'Poker']})

# replace Lou's hobby of poker with a list of degen hobbies with the at method
df.at[1, 'Hobbies'] = ['Poker', 'Ponies', 'Dice']
Are you looking to apply a split row-wise to each value into a list?
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Kate'],
                   'Hobbies': ["Hobby1, Hobby2", "Hobby2, Hobby3"]})
df['Hobbies'] = df['Hobbies'].apply(lambda x: x.split(','))
df
OR, if you are not a big lambda user, you can do str.split() on the entire column, which is easier:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Kate'],
                   'Hobbies': ["Hobby1, Hobby2", "Hobby2, Hobby3"]})
df['Hobbies'] = df['Hobbies'].str.split(",")
df
Output:
Name Hobbies
0 John [Hobby1, Hobby2]
1 Kate [Hobby2, Hobby3]
Another way of doing it:
df = pd.DataFrame({'hobbiesStrings': ['"hobby1, hobby2"']})
df
Replace comma-plus-whitespace with "," and put the hobbiesStrings values in a list:
x = df.hobbiesStrings.str.replace(r',\s+', '","', regex=True).values.tolist()
x
Here I use a regular expression: a comma \, followed by whitespace \s+ is replaced with ",".
Rewrite the column using df.assign:
df = df.assign(hobbies_stringsnes=[x])
Chained together:
df = df.assign(hobbies_stringsnes=[df.hobbiesStrings.str.replace(r',\s+', '","', regex=True).values.tolist()])
df

How to assign a dataframe variable based on a dictionary key in a loop

My searching was unable to find a solution for this one. I hope it is simple and I just missed it.
I am trying to assign a dataframe variable based on a dictionary key. I want to loop through a dictionary of keys 0, 1, 2 3... and save the dataframe as df_0, df_1, df_2 ... I am able to get the key and values working and can assign one dataframe, but cannot find a way to assign new dataframes based on the keys.
I tried How to create a new dataframe with every iteration of for loop in Python but it didn't seem to work.
Here is what I tried:
docs_dict = {0: '2635_base', 1: '2635_tri'}
for keys, docs in docs_dict.items():
    print(keys, docs)
    df = pd.read_excel(Path(folder_loc[docs]) / file_name[docs], sheet_name=sheet_name[docs], skiprows=3)
Output: 0 2635_base 1 2635_tri from the print statement, and %whos DataFrame > df as expected.
What I would like to get is: df_0 and df_1 based on the excel files in other dictionaries which work fine.
df[keys] = pd.read_excel(Path(folder_loc[docs]) / file_name[docs], sheet_name=sheet_name[docs], skiprows=3)
produces a ValueError: Wrong number of items passed 26, placement implies 1
SOLVED thanks to RubenB for pointing me to How do I create a variable number of variables? and the answer by #rocky-li using globals():
for keys, docs in docs_dict.items():
    print(keys, docs)
    globals()['df_{}'.format(keys)] = pd.read_excel(...)
>> Output: dataframes df_0, df_1, ...
You might want to try a dict comprehension as such (substitute pd.read_excel(...docs...) with whatever you need to read the dataframe from disc):
docs_dict = {0: '2635_base', 1: '2635_tri'}
dfs_dict = {k: pd.read_excel(...docs...) for k, docs in docs_dict.items()}
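To make the dict-comprehension pattern concrete, here is a runnable sketch where a hypothetical `load_doc` function stands in for the `pd.read_excel(...)` call (which needs files on disk). You then index the dict by key instead of juggling `df_0`/`df_1` globals:

```python
import pandas as pd

docs_dict = {0: '2635_base', 1: '2635_tri'}

# Hypothetical stand-in for pd.read_excel(...): builds a tiny frame per doc name
def load_doc(doc_name):
    return pd.DataFrame({'doc': [doc_name], 'value': [len(doc_name)]})

# One DataFrame per key, all held in a single dict
dfs = {k: load_doc(docs) for k, docs in docs_dict.items()}

print(dfs[0]['doc'].iloc[0])  # 2635_base
print(dfs[1]['doc'].iloc[0])  # 2635_tri
```

Keeping the frames in a dict (rather than `globals()`) means you can iterate over them, count them, and pass them around as one object, which is usually what the loop wanted in the first place.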

Any way to create pandas dataframe by parsing/splitting list of urls?

I want to create a pandas dataframe from a list of urls, where I split each url by hierarchy and create new columns for the pieces. More specifically, I want to break up each url into domain, protocol, query, fragment, and paths. I think it's doable with pandas, and I learned of this solution, but didn't get the expected result.
example data snippet
Here is an example data snippet in a csv file, and here is my attempt to do this:
import pandas as pd
df=pd.read_csv('example data snippet.csv')
df['protocol'],df['domain'],df['path'],df['query'],df['fragment'] = zip(*df['url'].map(urlparse.urlsplit))
The above attempt wasn't successful because its output doesn't meet my expectation, so I am wondering whether there is a better way to make this happen with pandas. Can anyone point out how to make this work? Any way to get this done easily? Any idea?
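For reference, the urlsplit approach does run once the import points at Python 3's urllib.parse; a sketch (the https:// prefix is my addition, since urlsplit needs a scheme to populate the domain):

```python
import pandas as pd
from urllib.parse import urlsplit  # Python 3 home of urlsplit

df = pd.DataFrame({'url': ['https://variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
                           'https://variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']})

# urlsplit returns a 5-tuple: (scheme, netloc, path, query, fragment)
df['protocol'], df['domain'], df['path'], df['query'], df['fragment'] = zip(*df['url'].map(urlsplit))

print(df['domain'].tolist())  # ['variety.com', 'variety.com']
```

One likely reason the original attempt looked wrong: with scheme-less urls such as 'variety.com/2017/...', urlsplit puts everything into `path` and leaves `netloc` empty, so either prepend a scheme first or fall back to plain string splitting.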
desired output
I want to split url and create new column for each component, the columns of my final pandas dataframe would be like this:
df.columns=['id', 'title', 'news source', 'topic', 'news category']
for example, in this url, I could say:
'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/'
news source =['variety.com','variety.com']
topic = ['tax-march-donald-trump-protest','list-2018-oscar-nominations']
news category = ['biz', 'film']
How can I do this kind of parsing for a given list of urls and add the pieces as new columns in a pandas dataframe? Any way to get this done? Thanks in advance.
how many do you have?
I think I would go 1 by 1 because you're ignoring a random amount of stuff and you'll need to write rules for what to ignore.
if you use url.split("/") you'll get a list, but then you need to remove what you don't need to keep what you want.
once you have what you want, it will be in a nice shape where you can put it into a dataframe:
import pandas as pd

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']
cols = ['c1', 'c2', 'c3', 'c4']
make_me = []
for url in urls:
    lst = url.split("/")
    # your business rules go here
    make_me.append([x for x in lst if not x.isdigit() and not x == ""])
df = pd.DataFrame(make_me, columns=cols)
df
c1 c2 c3 c4
0 variety.com biz news tax-march-donald-trump-protest-1202031487
1 variety.com film news list-2018-oscar-nominations-1202668757
Then you could reference each column as you like:
df.c1
>
0 variety.com
1 variety.com
Name: c1, dtype: object
and still have it all together and indexed. I think the rules might get tough and you might need to make them domain specific.
