Pandas: Union of Dataframes [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
Considering two dataframes like the ones below:
import pandas as pd
df = pd.DataFrame({'id_emp': [1, 2, 3, 4, 5],
                   'name_emp': ['Cristiano', 'Gaúcho', 'Fenômeno', 'Angelin', 'Souza']})
df2 = pd.DataFrame({'id_emp': [1, 2, 3, 6, 7],
                    'name_emp': ['Cristiano', 'Gaúcho', 'Fenômeno', 'Kaká', 'Sérgio'],
                    'Description': ['Forward', 'Middle', 'Forward', 'back', 'winger']})
I have to create a third dataframe by combining them. I need to compare the id_emp values of the two dataframes; where they match, the third dataframe should receive the name_emp and Description columns in addition to id_emp. The expected output is as follows:
id_emp|name_emp|Description
1 |Cristiano|Forward
2 |Gaúcho |Middle
3 |Fenômeno |Forward

All you need is merge:
df.merge(df2)
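A minimal sketch of what this call does, assuming the two dataframes from the question. With no arguments, merge performs an inner join on every column the two frames share (here id_emp and name_emp). The second call, which joins on id_emp alone, is an illustrative alternative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'id_emp': [1, 2, 3, 4, 5],
                   'name_emp': ['Cristiano', 'Gaúcho', 'Fenômeno', 'Angelin', 'Souza']})
df2 = pd.DataFrame({'id_emp': [1, 2, 3, 6, 7],
                    'name_emp': ['Cristiano', 'Gaúcho', 'Fenômeno', 'Kaká', 'Sérgio'],
                    'Description': ['Forward', 'Middle', 'Forward', 'back', 'winger']})

# Default behaviour: inner join on all shared columns (id_emp and name_emp).
df3 = df.merge(df2)

# Illustrative alternative: match on id_emp only and pull in Description.
df3_by_id = df.merge(df2[['id_emp', 'Description']], on='id_emp', how='inner')

print(df3)
```

Both calls keep only the rows whose id_emp appears in both frames, so the result is an intersection of the keys rather than a union.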

Related

DataFrame: from string dictionary in one column to floats in two columns {'latitude': '34.04', 'longitude': '-118.24'} [duplicate]

This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
(13 answers)
Closed 2 years ago.
I have a pandas DataFrame df with a column latlng.
The rows in this column have the format
{'latitude': '34.041005', 'longitude': '-118.249569'}.
In order to add markers to a map (using the folium library), I would like to create two columns, 'latitude' and 'longitude', which in this example would contain 34.041005 and -118.249569 respectively.
EDIT:
Managed to have it working with this first step:
df['latlng'] = df['latlng'].map(eval)
You can use pd.json_normalize to avoid apply, which is costly:
In [684]: df_out = pd.json_normalize(df['latlng'])
In [686]: df_out
Out[686]:
    latitude    longitude
0  34.041005  -118.249569
1  30.041005  -120.249569
Then you can concat these columns back to df like this:
pd.concat([df.drop('latlng', axis=1), df_out], axis=1)
The following should work:
# Assumes df['latlng'] already holds dicts (e.g. after the map(eval) step above)
df['latitude'] = [float(i['latitude']) for i in df['latlng']]
df['longitude'] = [float(i['longitude']) for i in df['latlng']]
This should do the job for you:
df['latlng'].apply(pd.Series)
Try this:
df_new = pd.DataFrame(df['latlng'].values.tolist(), index=df.index)
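Putting the pieces together, here is a hedged end-to-end sketch. It assumes the latlng column holds dict-like strings as in the question. The 'place' column is hypothetical sample data, and ast.literal_eval is swapped in for eval because it only parses Python literals. The values inside each dict are strings, so an explicit astype(float) is added to get the floats the question asks for.

```python
import ast

import pandas as pd

# Hypothetical sample frame; 'place' is made up for illustration.
df = pd.DataFrame({
    'place': ['a', 'b'],
    'latlng': ["{'latitude': '34.041005', 'longitude': '-118.249569'}",
               "{'latitude': '30.041005', 'longitude': '-120.249569'}"],
})

# Parse each string into a dict without executing arbitrary code.
parsed = df['latlng'].map(ast.literal_eval)

# Expand the dicts into columns and convert the string values to floats.
coords = pd.DataFrame(parsed.tolist(), index=df.index).astype(float)

# Replace the original column with the two new ones.
result = pd.concat([df.drop(columns='latlng'), coords], axis=1)
print(result.dtypes)
```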

python split pandas numeric vector column into multiple columns [duplicate]

This question already has answers here:
Split a Pandas column of lists into multiple columns
(11 answers)
Closed 4 years ago.
I have a dataframe in pandas, with a column which is a vector:
df = pd.DataFrame({'ID':[1,2], 'Averages':[[1,2,3],[4,5,6]]})
and I wish to split and divide it into elements which would look like this:
df2 = pd.DataFrame({'ID':[1,2], 'A':[1,4], 'B':[2,5], 'C':[3,6]})
I have tried
df['Averages'].astype(str).str.split(' '), but with no luck. Any help would be appreciated.
pd.concat([df['ID'], df['Averages'].apply(pd.Series)], axis = 1).rename(columns = {0: 'A', 1: 'B', 2: 'C'})
This will work:
df[['A','B','C']] = pd.DataFrame(df['Averages'].values.tolist(), index=df.index)
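A self-contained sketch of the tolist approach, assuming every list in Averages has the same length (three here). It builds a new frame instead of modifying df in place, so the result has the same shape as the df2 shown in the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Averages': [[1, 2, 3], [4, 5, 6]]})

# Turn the column of lists into a 2-D structure, one new column per element.
expanded = pd.DataFrame(df['Averages'].tolist(),
                        index=df.index,
                        columns=['A', 'B', 'C'])

# Keep ID and append the expanded columns.
result = pd.concat([df[['ID']], expanded], axis=1)
print(result)
```

Building the frame from tolist() is generally faster than apply(pd.Series), which constructs one Series per row.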

Python/Pandas - Query a MultiIndex Column [duplicate]

This question already has answers here:
Select columns using pandas dataframe.query()
(5 answers)
Closed 4 years ago.
I'm trying to use query on a MultiIndex column. It works on a MultiIndex row, but not the column. Is there a reason for this? The documentation shows examples like the first one below, but it doesn't indicate that it won't work for a MultiIndex column.
I know there are other ways to do this, but I'm specifically trying to do it with the query function.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((4,4)))
df.index = pd.MultiIndex.from_product([[1,2],['A','B']])
df.index.names = ['RowInd1', 'RowInd2']
# This works
print(df.query('RowInd2 in ["A"]'))
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
df.columns.names = ['ColInd1', 'ColInd2']
# query on index works, but not on the multiindexed column
print(df.query('index < 2'))
print(df.query('ColInd2 in ["A"]'))
To answer my own question, it looks like query shouldn't be used at all (regardless of using MultiIndex columns) for selecting certain columns, based on the answer(s) here:
Select columns using pandas dataframe.query()
You can use IndexSlice:
df.query('ilevel_0>2')
Out[327]:
ColInd1         1                  2
ColInd2         A         B        A         B
3        0.652576  0.639522  0.52087  0.446931
df.loc[:,pd.IndexSlice[:,'A']]
Out[328]:
ColInd1         1         2
ColInd2         A         A
0        0.092394  0.427668
1        0.326748  0.383632
2        0.717328  0.354294
3        0.652576  0.520870
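Since query filters rows, column selection on a MultiIndex has to go through other accessors. The sketch below, which assumes the same column layout as in the question, shows three ways to pick the columns whose ColInd2 level equals 'A'. The mask and xs variants are additions for comparison, not taken from the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 4)))
df.columns = pd.MultiIndex.from_product([[1, 2], ['A', 'B']],
                                        names=['ColInd1', 'ColInd2'])

# 1. IndexSlice: any value on the first level, 'A' on the second.
by_slice = df.loc[:, pd.IndexSlice[:, 'A']]

# 2. Boolean mask built from the values of one column level.
by_mask = df.loc[:, df.columns.get_level_values('ColInd2') == 'A']

# 3. Cross-section on the column axis, keeping both levels in the result.
by_xs = df.xs('A', axis=1, level='ColInd2', drop_level=False)

print(by_slice.columns.tolist())
```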

Python Pandas finding multiple column titles which exist in a dataframe [duplicate]

This question already has an answer here:
list of columns in common in two pandas dataframes
(1 answer)
Closed 4 years ago.
I am trying to find all column titles from a list of column titles which exist in a dataframe (and output all which exist as a dictionary for use in a tkinter dropdown menu).
For example, say I have a list of columns:
Options = ['title3', 'title5', 'title6']
and the dataframe has columns:
title1 title4 title3 title6
I would need the output to be:
choices = {'title3', 'title6'}.
The only way I currently have this working is inelegant:
if 'title1' in df1:
    choices = {'title1'}
if 'title1' in df1 and 'title5' in df1:
    choices = {'title1', 'title5'}
etc.
If anyone knows of a better way for me to get the result I would greatly appreciate any help!
Thanks
I think you need intersection:
df = pd.DataFrame(columns=['title1','title4','title3','title6'])
Options = ['title3', 'title5', 'title6']
choices = df.columns.intersection(Options).tolist()
print (choices)
['title3', 'title6']
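For comparison, a small sketch of two plain-Python alternatives, assuming the same column names and option list as above. The list comprehension keeps the order of the option list, while the set intersection gives the set form shown in the question. The lowercase variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(columns=['title1', 'title4', 'title3', 'title6'])
options = ['title3', 'title5', 'title6']

# Ordered list: keep each option that is also a column of df.
choices_list = [col for col in options if col in df.columns]

# Unordered set: intersection of the two collections of names.
choices_set = set(options) & set(df.columns)

print(choices_list)
print(choices_set)
```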

Python data frames - how to select all columns that have a specific substring in their name [duplicate]

This question already has answers here:
Find column whose name contains a specific string
(8 answers)
Closed 7 years ago.
In Python, I have a dataframe (df) that contains columns with the following names: A_OPEN, A_CLOSE, B_OPEN, B_CLOSE, C_OPEN, C_CLOSE, D_, etc.
How can I easily select only the columns that contain _CLOSE in their name? A, B, C, D, E, F, etc. can have any value, so I do not want to use the specific column names.
In SQL this would be done with the like operator: df[like'%_CLOSE%']
What's the Python way?
You could use a list comprehension, e.g.:
df[[x for x in df.columns if "_CLOSE" in x]]
Example:
df = pd.DataFrame(
    columns=['_CLOSE_A', '_CLOSE_B', 'C'],
    data=[[2, 3, 4], [3, 4, 5]]
)
Then,
>>> print(df[[x for x in df.columns if "_CLOSE" in x]])
   _CLOSE_A  _CLOSE_B
0         2         3
1         3         4
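pandas also has built-in helpers for this kind of selection. The sketch below assumes column names shaped like those in the question (A_OPEN, A_CLOSE, ...) and made-up numeric data. It uses DataFrame.filter with like for a substring match and a string mask on df.columns for a suffix match.

```python
import pandas as pd

# Made-up data with column names following the question's pattern.
df = pd.DataFrame(
    columns=['A_OPEN', 'A_CLOSE', 'B_OPEN', 'B_CLOSE'],
    data=[[1, 2, 3, 4], [5, 6, 7, 8]],
)

# Substring match on column labels, similar in spirit to SQL's LIKE '%_CLOSE%'.
by_filter = df.filter(like='_CLOSE')

# Stricter variant: only labels that end with '_CLOSE'.
by_mask = df.loc[:, df.columns.str.endswith('_CLOSE')]

print(by_filter)
```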
