Can sub-columns be created in a pandas data frame? - python

I am working with a data frame in Jupyter Notebook and I am having some difficulty with it. The data frame consists of locations represented by coordinates. These points represent the route taken by a driver on a given day.
There are three columns at the moment: Start, Intermediary, and End.
A driver begins the day at the Start point, visits one or more Intermediary points and returns to the End point at the end of the day. The Start point is a base location, so the End point is identical to the Start point.
It's very basic, but I am having trouble visualising this data. I was thinking of something like the layout below to improve my situation:
|     Start     |  Intermediary |      End      |
| s_lat | s_lng | i_lat | i_lng | e_lat | e_lng |
Or would it be best if I scrap the top 3 columns (Start, Intermediary, End)?
As per the guidelines I don't want to start a discussion here; I am just keen to learn something new about pandas and whether there is a way to improve my current method.

I think you need a MultiIndex here, created by MultiIndex.from_product:
mux = pd.MultiIndex.from_product([['Start','Intermediary','End'], ['lat','lng']])
df = pd.DataFrame(data, columns=mux)
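For illustration, a minimal sketch with random numbers standing in for the coordinates (the data variable here is an assumption, just to make it runnable):
import numpy as np
import pandas as pd
#hypothetical data: 2 routes x 6 coordinate values
data = np.random.rand(2, 6)
mux = pd.MultiIndex.from_product([['Start','Intermediary','End'], ['lat','lng']])
df = pd.DataFrame(data, columns=mux)
#sub-columns are then addressable via the top level
print (df['Start'])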
EDIT:
Setup:
temp=u""" start intermediary end
('54.957055',' -7.740156') ('54.956915136264', ' -7.753690062122') ('54.957055','-7.740156')
('54.8913208', '-7.5740475') ('54.864402885577', '-7.653445692445'),('54','0') ('54.8913208','-7.5740475')
('55.2375819', '-7.2357427') ('55.253936739337', '-7.259624609577'), ('54','2'),('54','1') ('55.2375819','-7.2357427')
('54.5298806', '-8.1350247') ('54.504374314741', '-8.188334960168') ('54.5298806','-8.1350247')
('54.2810187', ' -7.896937') ('54.303836850038', '-8.180136033695'), ('54','3') ('54.2810187','-7.896937')
"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
#pd.compat.StringIO was removed in newer pandas, so use io.StringIO instead
from io import StringIO
df = pd.read_csv(StringIO(temp), sep="\s{3,}", engine='python')
print (df)
                          start  \
0    ('54.957055',' -7.740156')
1  ('54.8913208', '-7.5740475')
2  ('55.2375819', '-7.2357427')
3  ('54.5298806', '-8.1350247')
4  ('54.2810187', ' -7.896937')
                                        intermediary  \
0            ('54.956915136264', ' -7.753690062122')
1  ('54.864402885577', '-7.653445692445'),('54','0')
2  ('55.253936739337', '-7.259624609577'), ('54',...
3             ('54.504374314741', '-8.188334960168')
4  ('54.303836850038', '-8.180136033695'), ('54',...
                           end
0    ('54.957055','-7.740156')
1  ('54.8913208','-7.5740475')
2  ('55.2375819','-7.2357427')
3  ('54.5298806','-8.1350247')
4   ('54.2810187','-7.896937')
import ast
#convert string values to tuples
df = df.applymap(lambda x: ast.literal_eval(x))
#wrap single pairs in a list so every row of intermediary holds a list of points
df['intermediary'] = df['intermediary'].apply(lambda x: list(x) if isinstance(x[1], tuple) else [x])
#DataFrame from the start column
df1 = pd.DataFrame(df['start'].values.tolist(), columns=['lat','lng'])
#DataFrame from the intermediary column, reshaped to 2 columns
df2 = (pd.concat([pd.DataFrame(x, columns=['lat','lng']) for x in df['intermediary']], keys=df.index)
.reset_index(level=1, drop=True)
.add_prefix('intermediary_'))
print (df2)
#join all DataFrames together (end equals start in this data, so df1 is reused for both)
df3 = df1.add_prefix('start_').join(df2).join(df1.add_prefix('end_'))
#create MultiIndex by split
df3.columns = df3.columns.str.split('_', expand=True)
print (df3)
        start             intermediary                         end  \
          lat         lng              lat              lng         lat
0   54.957055   -7.740156  54.956915136264  -7.753690062122   54.957055
1  54.8913208  -7.5740475  54.864402885577  -7.653445692445  54.8913208
1  54.8913208  -7.5740475               54                0  54.8913208
2  55.2375819  -7.2357427  55.253936739337  -7.259624609577  55.2375819
2  55.2375819  -7.2357427               54                2  55.2375819
2  55.2375819  -7.2357427               54                1  55.2375819
3  54.5298806  -8.1350247  54.504374314741  -8.188334960168  54.5298806
4  54.2810187   -7.896937  54.303836850038  -8.180136033695  54.2810187
4  54.2810187   -7.896937               54                3  54.2810187
          lng
0   -7.740156
1  -7.5740475
1  -7.5740475
2  -7.2357427
2  -7.2357427
2  -7.2357427
3  -8.1350247
4   -7.896937
4   -7.896937

To add a top column to a pd.DataFrame, run:
def add_top_column(df, top_col, inplace=False):
    if not inplace:
        df = df.copy()
    df.columns = pd.MultiIndex.from_product([[top_col], df.columns])
    return df
orig_df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
new_df = add_top_column(orig_df, "new column")
In order to combine 3 DataFrames each with its own new top column:
new_df2 = add_top_column(orig_df, "new column2")
new_df3 = add_top_column(orig_df, "new column3")
print(pd.concat([new_df, new_df2, new_df3], axis=1))
"""
# And this is the expected output:
  new column    new column2    new column3
           a  b           a  b           a  b
0          1  2           1  2           1  2
1          3  4           3  4           3  4
"""
Note that if the DataFrames' indexes do not match, you might need to reset the index first.
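For example, a quick way to align mismatched indexes before the concat (same three frames as above):
pd.concat([d.reset_index(drop=True) for d in (new_df, new_df2, new_df3)], axis=1)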

You can read an Excel file with 2 headers (2 levels of columns).
df = pd.read_excel(
    sourceFilePath,
    index_col=[0],
    header=[0, 1]
)
You can reshape your df like this in order to keep just 1 header (it's easier to work with only 1 header):
df = df.stack([0,1], dropna=False).to_frame('Valeur').reset_index()
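Since the Excel file itself is not available here, a minimal sketch with a constructed two-level frame shows the same reshape (the sample labels and values are assumptions):
import pandas as pd
cols = pd.MultiIndex.from_product([['2021','2022'], ['Q1','Q2']])
df = pd.DataFrame([[1, 2, 3, 4]], index=['row1'], columns=cols)
#stack both column levels into the index, keeping the values in one column
flat = df.stack([0, 1], dropna=False).to_frame('Valeur').reset_index()
print (flat)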

Related

How to Pivot/Stack for multi header column dataframe

np.random.seed(2022) # added to make the data the same each time
cols = pd.MultiIndex.from_arrays([['A','A' ,'B','B'], ['min','max','min','max']])
df = pd.DataFrame(np.random.rand(3,4),columns=cols)
df.index.name = 'item'
             A                   B
           min       max       min       max
item
0     0.009359  0.499058  0.113384  0.049974
1     0.685408  0.486988  0.897657  0.647452
2     0.896963  0.721135  0.831353  0.827568
There are two column headers, and when working with CSV I get a blank column name for every other column after unmerging.
I want a result that looks like this. How can I do it?
I tried to use a pivot table but couldn't do it.
Try:
df = (
    df.stack(level=0)
    .reset_index()
    .rename(columns={"level_1": "title"})
    .sort_values(by=["title", "item"])
)
print(df)
Prints:
item title max min
0 0 A 0.762221 0.737758
2 1 A 0.930523 0.275314
4 2 A 0.746246 0.123621
1 0 B 0.044137 0.264969
3 1 B 0.577637 0.699877
5 2 B 0.601034 0.706978
Then to CSV:
df.to_csv('out.csv', index=False)

assign one column value to another column based on condition in pandas

I want to know how to assign one column's value to another column if that column has a null or 0 value.
I have a dataframe like this:
id column1 column2
5263 5400 5400
4354 6567 Null
5656 5456 5456
5565 6768 3489
4500 3490 Null
The Expected Output is
id column1 column2
5263 5400 5400
4354 6567 6567
5656 5456 5456
5565 6768 3489
4500 3490 3490
that is,
if df['column2'] is Null/0, then it should take df['column1']'s value.
Can someone explain, how can I achieve my desired output?
Based on the answers to this similar question, you can do the following:
Using np.where:
df['column2'] = np.where((df['column2'] == 'Null') | (df['column2'] == 0), df['column1'], df['column2'])
Instead, using only pandas indexing (with .loc, which avoids the chained-assignment warning that the df['column2'][...] pattern triggers):
df.loc[(df['column2'] == 0) | (df['column2'] == 'Null'), 'column2'] = df['column1']
Here's my suggestion. Not sure whether it is the fastest, but it should work here ;)
#we start by creating an empty list
column2 = []
#for each row in the dataframe
for i in df.index:
    #if the value in column2 is null or 0, take the value of column1
    if df.loc[i, 'column2'] in ['null', 0]:
        column2.append(df.loc[i, 'column1'])
    #else keep the value of column2
    else:
        column2.append(df.loc[i, 'column2'])
#we replace the current column2 with the new one!
df['column2'] = column2
Update using only native pandas functionality
#Create a boolean array conditionCheck, checking the condition for each row in df
#where() keeps values where conditionCheck is True and replaces the rest, hence the inversion with "~"
conditionCheck = ~(df['column2'].isna() | (df['column2'] == 0))
df["column2"] = df["column2"].where(conditionCheck, df["column1"])
print(df)
Code to Generate Sample DataFrame
Changed row 3 of column2 to 0 to test all scenarios
import numpy as np
import pandas as pd
data = [
    [5263, 5400, 5400],
    [4354, 6567, None],
    [5656, 5456, 0],
    [5565, 6768, 3489],
    [4500, 3490, None],
]
df = pd.DataFrame(data, columns=["id", "column1", "column2"], dtype=pd.Int64Dtype())
A similar question was already solved here.
There is no "Null" keyword in Python; empty cells in pandas hold np.nan. So, assuming you mean np.nan, one good way to achieve your desired output is:
Create a boolean mask selecting rows where column2 is np.nan or 0, then copy from column1 where the mask is True.
mask = (df['column2'].isna()) | (df['column2']==0)
df.loc[mask, "column2"] = df.loc[mask, "column1"]
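If only missing values (and not zeros) needed replacing, fillna alone would do it, since it aligns on the index:
df['column2'] = df['column2'].fillna(df['column1'])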
Just use ffill(); go through the example.
import numpy as np
import pandas as pd
items = [1,2,3,4,5]
place = [6,7,8,9,10]
quality = [11,np.nan,12,13,np.nan]
df = pd.DataFrame({"A":items, "B":place, "C":quality})
print(df)
"""
A B C
0 1 6 11.0
1 2 7 NaN
2 3 8 12.0
3 4 9 13.0
4 5 10 NaN
"""
aa = df.ffill(axis=1).astype(int)
print(aa)
"""
A B C
0 1 6 11
1 2 7 7
2 3 8 12
3 4 9 13
4 5 10 10
"""

Grouping data from multiple columns in data frame into summary view

I have a data frame as below and would like to create summary information as shown. Can you please help with how this can be done in pandas?
Data-frame:
import pandas as pd
ds = pd.DataFrame([
    {"id": "1", "owner": "A", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
    {"id": "2", "owner": "A", "delivery": "2-Jan", "priority": "Medium", "exception": ""},
    {"id": "3", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
    {"id": "4", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
    {"id": "5", "owner": "C", "delivery": "1-Jan", "priority": "High", "exception": ""},
    {"id": "6", "owner": "C", "delivery": "2-Jan", "priority": "High", "exception": ""},
    {"id": "7", "owner": "C", "delivery": "", "priority": "High", "exception": ""},
])
Result:
Use:
#crosstab and rename empty string column
df = pd.crosstab(ds['owner'], ds['delivery']).rename(columns={'':'No delivery Date'})
#change positions of columns - first one to last one
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
#get counts by comparing and sum of True values
df['high_count'] = ds['priority'].eq('High').groupby(ds['owner']).sum().astype(int)
df['exception_count'] = ds['exception'].eq('No Bill').groupby(ds['owner']).sum().astype(int)
#convert id to string and join with ,
df['ids'] = ds['id'].astype(str).groupby(ds['owner']).agg(','.join)
#index to column
df = df.reset_index()
#remove index name 'delivery'
df.columns.name = None
print (df)
owner 1-Jan 2-Jan No delivery Date high_count exception_count ids
0 A 1 1 0 1 1 1,2
1 B 2 0 0 2 2 3,4
2 C 1 1 1 3 0 5,6,7

Replace values in a pandas column using another pandas df which has the corresponding replacements

I have a pandas df named inventory, which has a column containing Part Numbers (AlphaNumeric). Some of those part numbers have been superseded and I have another df named replace_with containing two columns, 'old part numbers' and 'new part numbers'.
For example:
Inventory has values like:
123AAA
123BBB
123CCC
...
and replace_with has values like:
oldPartnumbers    newPartnumbers
123AAA            123ABC
123CCC            123DEF
So, I need to replace the corresponding values in inventory with the new numbers. After replacement, inventory will look as follows:
123ABC
123BBB
123DEF
Is there a simple way to do that in python? Thanks!
Setup
Consider the dataframes inventory and replace_with
inventory = pd.DataFrame(dict(Partnumbers=['123AAA', '123BBB', '123CCC']))
replace_with = pd.DataFrame(dict(
    oldPartnumbers=['123AAA', '123BBB', '123CCC'],
    newPartnumbers=['123ABC', '123DEF', '123GHI']
))
Option 1
map
d = replace_with.set_index('oldPartnumbers').newPartnumbers
inventory['Partnumbers'] = inventory['Partnumbers'].map(d)
inventory
Partnumbers
0 123ABC
1 123DEF
2 123GHI
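One caveat with map: any part number missing from d becomes NaN. If inventory can contain numbers that were never superseded, a fillna fallback keeps them (a small sketch):
inventory['Partnumbers'] = inventory['Partnumbers'].map(d).fillna(inventory['Partnumbers'])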
Option 2
replace
d = replace_with.set_index('oldPartnumbers').newPartnumbers
inventory['Partnumbers'].replace(d, inplace=True)
inventory
Partnumbers
0 123ABC
1 123DEF
2 123GHI
Let's say you have 2 DataFrames as follows:
import pandas as pd
df1 = pd.DataFrame([[1,3],[5,4],[6,7]], columns = ['PN','name'])
df2 = pd.DataFrame([[2,22],[3,33],[4,44],[5,55]], columns = ['oldname','newname'])
df1:
PN  name
0 1 3
1 5 4
2 6 7
df2:
oldname newname
0 2 22
1 3 33
2 4 44
3 5 55
Run a left join between them:
temp = df1.merge(df2,'left',left_on='name',right_on='oldname')
temp:
PN name oldname newname
0 1 3 3.0 33.0
1 5 4 4.0 44.0
2 6 7 NaN NaN
then calculate the new name column and replace it:
df1['name'] = temp.apply(lambda row: row['newname'] if pd.notnull(row['newname']) else row['name'], axis=1)
df1:
PN name
0 1 33.0
1 5 44.0
2 6 7.0
Or, as a one-liner:
df1['name'] = df1.merge(df2,'left',left_on='name',right_on='oldname').apply(lambda row: row['newname'] if pd.notnull(row['newname']) else row['name'], axis=1)
This solution is relatively fast - it uses pandas data alignment and the numpy "copyto" function.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'partNumbers': ['123AAA', '123BBB', '123CCC', '123DDD']})
df2 = pd.DataFrame({'oldPartnumbers': ['123AAA', '123BBB', '123CCC'],
                    'newPartnumbers': ['123ABC', '123DEF', '123GHI']})
# assign index in each dataframe to original part number columns
# (faster than set_index method, but use set_index if original index must be preserved)
df1.index = df1.partNumbers
df2.index = df2.oldPartnumbers
# use pandas index data alignment
df1['updatedPartNumbers'] = df2.newPartnumbers
# use numpy to copy in old part num when a new part num is not found
np.copyto(df1.updatedPartNumbers.values,
          df1.partNumbers.values,
          where=pd.isnull(df1.updatedPartNumbers))
# reset index
df1.reset_index(drop=True, inplace=True)
df1:
partNumbers updatedPartNumbers
0 123AAA 123ABC
1 123BBB 123DEF
2 123CCC 123GHI
3 123DDD 123DDD

frequency table as a data frame in pandas

I'm looking for a more efficient way to do this as I am new to Python. I want a data frame of the cyl values and their counts, ideally without having to go and rename the column. I'm coming from R.
What is happening is that 'cyl' is the index if I don't use the to_frame().reset_index() piece of code, and when I do use reset_index() it becomes a column called 'index' (which really holds the cyl values), while the 2nd column 'cyl' really holds the frequency counts.
import pandas as pd
new_df = pd.value_counts(mtcars.cyl).to_frame().reset_index()
new_df.columns = ['cyl', 'frequency']
I think you can omit to_frame():
new_df = pd.value_counts(mtcars.cyl).reset_index()
new_df.columns = ['cyl', 'frequency']
Sample:
mtcars = pd.DataFrame({'cyl':[1, 2, 2, 4, 4]})
print (mtcars)
cyl
0 1
1 2
2 2
3 4
4 4
new_df = pd.value_counts(mtcars.cyl).reset_index()
new_df.columns = ['cyl', 'frequency']
print (new_df)
cyl frequency
0 4 2
1 2 2
2 1 1
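In pandas 2.0 and later, value_counts is called as a Series method and reset_index already yields named columns ('cyl' and 'count'), so only the count column may need renaming (a sketch, assuming a recent pandas version):
new_df = mtcars.cyl.value_counts().reset_index().rename(columns={'count': 'frequency'})
print (new_df)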
