This question already has an answer here:
list of columns in common in two pandas dataframes
(1 answer)
Closed 4 years ago.
I am trying to find all column titles from a list of column titles which exist in a dataframe (and output all which exist as a dictionary for use in a tkinter dropdown menu).
For example, say I have a list of columns:
Options = ['title3', 'title5', 'title6']
and the dataframe has columns:
title1 title4 title3 title6
I would need the output to be:
choices = {'title3', 'title6'}.
The only way I currently have this working is inelegant:
if 'title1' in df1:
    choices = {'title1'}
if 'title1' in df1 and 'title5' in df1:
    choices = {'title1', 'title5'}
etc.
If anyone knows of a better way for me to get the result I would greatly appreciate any help!
Thanks
I think you need intersection:
df = pd.DataFrame(columns=['title1','title4','title3','title6'])
Options = ['title3', 'title5', 'title6']
choices = df.columns.intersection(Options).tolist()
print(choices)
['title3', 'title6']
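If you want the result to keep the order of Options (rather than the column order, which is what intersection returns), a plain list comprehension is an alternative sketch:

```python
import pandas as pd

df = pd.DataFrame(columns=['title1', 'title4', 'title3', 'title6'])
Options = ['title3', 'title5', 'title6']

# keep only the options that actually exist as columns, in Options order
choices = [c for c in Options if c in df.columns]
print(choices)  # ['title3', 'title6']
```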
This question already has answers here:
How do I get a list of all the duplicate items using pandas in python?
(13 answers)
Closed 1 year ago.
In preps for data analyst interview questions, I came across "find all duplicate emails (not unique emails) in "one-liner" using pandas."
The best I've got is not a single line but rather three:
# initialize dataframe
import pandas as pd
d = {'email':['a','b','c','a','b']}
df = pd.DataFrame(d)
# select emails having duplicate entries
results = pd.DataFrame(df.value_counts())
results.columns = ['count']
results[results['count'] > 1]
>>>
count
email
b 2
a 2
Could the second block, under the second comment, be condensed into a one-liner that avoids the temporary variable results?
Just use duplicated:
>>> df[df.duplicated()]
email
3 a
4 b
Or if you want a list:
>>> df[df["email"].duplicated()]["email"].tolist()
['a', 'b']
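If you want to stay closer to the original value_counts approach, the temporary results variable can be avoided with a callable filter passed to .loc; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'email': ['a', 'b', 'c', 'a', 'b']})

# value_counts plus a callable filter: counts for emails appearing more than once
dupes = df['email'].value_counts().loc[lambda s: s > 1]
print(dupes)
```

This keeps the counts, which df.duplicated() discards.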
This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
(13 answers)
Closed 2 years ago.
I have a pandas DataFrame df with a column latlng.
The rows in this column have the format
{'latitude': '34.041005', 'longitude': '-118.249569'}.
In order to be able to add markers to a map (using the folium library), I would like to create two columns 'latitude' and 'longitude', which in this example would contain respectively 34.041005 and -118.249569.
EDIT:
Managed to have it working with this first step:
df['latlng'] = df['latlng'].map(eval)
You can use pd.json_normalize to avoid apply, which is costly:
In [684]: df_out = pd.json_normalize(df.latlng)
In [686]: df_out
Out[686]:
latitude longitude
0 34.041005 -118.249569
1 30.041005 -120.249569
Then you can concat these columns back to df like below:
pd.concat([df.drop('latlng', axis=1), df_out], axis=1)
The following should work, once the latlng values are dicts (e.g. after the eval step in the EDIT):
df['latitude'] = [d['latitude'] for d in df['latlng']]
df['longitude'] = [d['longitude'] for d in df['latlng']]
This should do the job for you:
df['latlng'].apply(pd.Series)
Try this:
df_new = pd.DataFrame(df['latlng'].values.tolist(), index=df.index)
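Putting the pieces together, here is a minimal sketch, assuming latlng holds string representations of dicts as in the question (ast.literal_eval is a safer alternative to eval for parsing them):

```python
import ast
import pandas as pd

df = pd.DataFrame({'latlng': ["{'latitude': '34.041005', 'longitude': '-118.249569'}",
                              "{'latitude': '30.041005', 'longitude': '-120.249569'}"]})

# parse the dict strings safely, then flatten the dicts into columns
parsed = df['latlng'].map(ast.literal_eval)
out = pd.concat([df.drop(columns='latlng'),
                 pd.json_normalize(parsed.tolist())], axis=1)
print(out)
```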
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
Considering two dataframes like the ones below:
import pandas as pd
df = pd.DataFrame({'id_emp' : [1,2,3,4,5],
'name_emp': ['Cristiano', 'Gaúcho', 'Fenômeno','Angelin', 'Souza']})
df2 = pd.DataFrame({'id_emp': [1,2,3,6,7],
'name_emp': ['Cristiano', 'Gaúcho', 'Fenômeno', 'Kaká', 'Sérgio'],
'Description': ['Forward', 'Middle', 'Forward', 'back', 'winger']})
I have to create a third dataframe from the union of them. I need to compare the id_emp values of the two dataframes; where they match, the third dataframe should receive the columns name_emp and Description, in addition to id_emp. The expected output is as follows:
id_emp|name_emp |Description
1     |Cristiano|Forward
2     |Gaúcho   |Middle
3     |Fenômeno |Forward
All you need is merge:
df.merge(df2)
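A runnable sketch on the sample data: with no arguments, merge performs an inner join on all shared columns (here id_emp and name_emp), keeping only the rows present in both frames.

```python
import pandas as pd

df = pd.DataFrame({'id_emp': [1, 2, 3, 4, 5],
                   'name_emp': ['Cristiano', 'Gaúcho', 'Fenômeno', 'Angelin', 'Souza']})
df2 = pd.DataFrame({'id_emp': [1, 2, 3, 6, 7],
                    'name_emp': ['Cristiano', 'Gaúcho', 'Fenômeno', 'Kaká', 'Sérgio'],
                    'Description': ['Forward', 'Middle', 'Forward', 'back', 'winger']})

# inner join on the common columns id_emp and name_emp
df3 = df.merge(df2)
print(df3)
```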
I have a dataframe that counts the number of questions according to their state:
question_count_data.columns = ['date', 'curriculum_name_en', 'concept', 'language',
'concept_name_en', 'concept_name_tc', 'state', 'question_count']
question_count_data['state'] = question_count_data['state']\
.map({10: 'DRAFT', 20: 'REVIEW', 30: 'PUBLISHED', 40: 'ERROR', 50: 'DISABLED'})
I have used the following method to create this dataframe:
question_count_data = df_question.groupby(['date', 'concept__curriculum__name_en', 'concept',
'language', 'concept_name_en', 'concept_name_tc', 'state', ],
as_index=False)['question_count'].sum()
I now want to create separate columns for each state (DRAFT, REVIEW, PUBLISHED, etc.) and provide the question count in rows.
What's the best possible way to do this using my question_count_data dataframe? I don't want to change the groupby method already implemented, because that's what provides me the question count.
I don't think another groupby would be a possible solution, because what I ultimately want is to take the row values of the state column, turn them into separate columns like Draft, Review, Published, etc., and then provide the count for each date.
A detailed explanation would be helpful, please.
You are really close: you need to remove as_index=False to get a Series with a MultiIndex, and then reshape with Series.unstack:
cols = ['date', 'concept__curriculum__name_en', 'concept',
'language', 'concept_name_en', 'concept_name_tc', 'state']
question_count_data = (df_question.groupby(cols)['question_count']
.sum()
.unstack(fill_value=0))
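A minimal runnable sketch of the same idea, using a simplified stand-in for df_question with only the date and state columns (the real frame has more grouping keys, but the mechanics are identical):

```python
import pandas as pd

# simplified stand-in for df_question
df_question = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02'],
    'state': ['DRAFT', 'DRAFT', 'PUBLISHED', 'REVIEW'],
    'question_count': [1, 1, 1, 1],
})

# group without as_index=False, then pivot the last index level into columns
wide = (df_question.groupby(['date', 'state'])['question_count']
                   .sum()
                   .unstack(fill_value=0))
print(wide)
```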
This question already has answers here:
Select columns using pandas dataframe.query()
(5 answers)
Closed 4 years ago.
I'm trying to use query on a MultiIndex column. It works on a MultiIndex row, but not the column. Is there a reason for this? The documentation shows examples like the first one below, but it doesn't indicate that it won't work for a MultiIndex column.
I know there are other ways to do this, but I'm specifically trying to do it with the query function.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((4,4)))
df.index = pd.MultiIndex.from_product([[1,2],['A','B']])
df.index.names = ['RowInd1', 'RowInd2']
# This works
print(df.query('RowInd2 in ["A"]'))
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
df.columns.names = ['ColInd1', 'ColInd2']
# query on index works, but not on the multiindexed column
print(df.query('index < 2'))
print(df.query('ColInd2 in ["A"]'))
To answer my own question, it looks like query shouldn't be used at all (regardless of using MultiIndex columns) for selecting certain columns, based on the answer(s) here:
Select columns using pandas dataframe.query()
You can use ilevel_0 in query for the row index, and IndexSlice for the columns:
df.query('ilevel_0>2')
Out[327]:
ColInd1 1 2
ColInd2 A B A B
3 0.652576 0.639522 0.52087 0.446931
df.loc[:,pd.IndexSlice[:,'A']]
Out[328]:
ColInd1 1 2
ColInd2 A A
0 0.092394 0.427668
1 0.326748 0.383632
2 0.717328 0.354294
3 0.652576 0.520870
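If the goal is just to select the 'A' columns, DataFrame.xs is another query-free option; a sketch on the same setup (seeded so the values are reproducible):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random((4, 4)))
df.columns = pd.MultiIndex.from_product([[1, 2], ['A', 'B']])
df.columns.names = ['ColInd1', 'ColInd2']

# select along the ColInd2 column level; drop_level=False keeps both levels
sub = df.xs('A', axis=1, level='ColInd2', drop_level=False)
print(sub.columns.tolist())  # [(1, 'A'), (2, 'A')]
```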