What is the pandas version of tidyr::separate? [duplicate]

This question already has answers here:
Pandas split column into multiple columns by comma
(8 answers)
Closed 3 years ago.
The R package tidyr has a nice separate function to "Separate one column into multiple columns."
What is the pandas version?
For example here is a dataset:
import pandas
from six import StringIO
df = """ i | j | A
AR | 5 | Paris,Green
For | 3 | Moscow,Yellow
For | 4 | New York,Black"""
df = StringIO(df.replace(' ', ''))  # note: this strips every space, including the one in "New York"
df = pandas.read_csv(df, sep="|", header=0)
I'd like to separate the A column into 2 columns, each holding one of the comma-separated values.
This question is related: Accessing every 1st element of Pandas DataFrame column containing lists

The equivalent of tidyr::separate is str.split with a special assignment:
df['Town'], df['Color'] = df['A'].str.split(',', 1).str
print(df)
# i j A Town Color
# 0 AR 5 Paris,Green Paris Green
# 1 For 3 Moscow,Yellow Moscow Yellow
# 2 For 4 NewYork,Black NewYork Black
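In recent pandas releases this unpacking idiom no longer works (n became keyword-only and iterating over .str was removed), so a modern equivalent uses expand=True:
# split 'A' on the first comma into two new columns
df[['Town', 'Color']] = df['A'].str.split(',', n=1, expand=True)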
The equivalent of tidyr::unite is a simple concatenation of the character vectors:
df["B"] = df["i"] + df["A"]
df
# i j A B
# 0 AR 5 Paris,Green ARParis,Green
# 1 For 3 Moscow,Yellow ForMoscow,Yellow
# 2 For 4 NewYork,Black ForNewYork,Black
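Note that tidyr::unite places a separator (default "_") between the united columns; to mirror that, str.cat takes a sep argument:
# unite with an explicit separator, like tidyr::unite's default "_"
df["B"] = df["i"].str.cat(df["A"], sep="_")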

Related

reshape data to split one column into multiple columns based on delimiter in pandas or otherwise in python [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed 3 years ago.
I have the following dataframe
df_in = pd.DataFrame({
    'State': ['C','B','D','A','C','B'],
    'Contact': ['alpha a. theta| beta','beta| alpha a. theta| delta','Theta','gamma| delta','alpha|Eta| gamma| delta','beta'],
    'Timestamp': [911583000000,912020000000,912449000000,912742000000,913863000000,915644000000]})
How do I transform it so that the second column, which has pipe-separated data, is broken out into separate rows as follows:
df_out = pd.DataFrame({
    'State': ['C','C','B','B','B','D','A','A','C','C','C','C','B'],
    'Contact': ['alpha a. theta','beta','beta','alpha a. theta','delta','Theta','gamma','delta','alpha','Eta','gamma','delta','beta'],
    'Timestamp': [911583000000,911583000000,912020000000,912020000000,912020000000,912449000000,912742000000,912742000000,913863000000,913863000000,913863000000,913863000000,915644000000]})
print(df_in)
print(df_out)
I can use pd.melt but for that I already need to have the 'Contact' column broken out into multiple columns and not have all the contacts in one column separated by a delimiter.
You could split the column, then merge on the index:
df_in.Contact.str.split('|', expand=True).stack().reset_index()\
    .merge(df_in.reset_index(), left_on='level_0', right_on='index')\
    .drop(columns=['level_0', 'level_1', 'index', 'Contact'])
Out:
0 State Timestamp
0 alpha a. theta C 911583000000
1 beta C 911583000000
2 beta B 912020000000
3 alpha a. theta B 912020000000
4 delta B 912020000000
5 Theta D 912449000000
6 gamma A 912742000000
7 delta A 912742000000
8 alpha C 913863000000
9 Eta C 913863000000
10 gamma C 913863000000
11 delta C 913863000000
12 beta B 915644000000
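Since pandas 0.25 the same reshape can be written with DataFrame.explode; a minimal sketch:
# split on the literal pipe, then give each list element its own row
df_out = (df_in.assign(Contact=df_in['Contact'].str.split('|'))
                .explode('Contact')
                .reset_index(drop=True))
# trim the whitespace left around the separators
df_out['Contact'] = df_out['Contact'].str.strip()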

Can sub-columns be created in a pandas data frame?

Data frame
I am working with a data frame in Jupyter Notebooks and I am having some difficulty with it. The data frame consists of locations and these are represented by coordinates. These points represent a route taken by a driver on a given day.
There are 3 columns at the moment: Start, Intermediary, and End.
A driver begins the day at the Start point, visits 1 or more Intermediary points and returns to the End point at the end of the day. The Start point is like a base location so the End point is identical to the Start point.
It's very basic, but I am having trouble visualising this data. I was thinking of something like the layout below to improve my situation:
|     Start     |  Intermediary |      End      |
| s_lat | s_lng | i_lat | i_lng | e_lat | e_lng |
Or would it be best if I scrap the top 3 columns (Start, Intermediary, End)?
I don't want to start an open-ended discussion here, as per the guidelines; I'd just like to learn whether pandas offers a better way to structure this than my current method.
I think you need a MultiIndex created by MultiIndex.from_product:
mux = pd.MultiIndex.from_product([['Start','Intermediary','End'], ['lat','lng']])
df = pd.DataFrame(data, columns=mux)
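For example, a minimal sketch with made-up coordinate values (the data here is hypothetical):
import pandas as pd

# outer level = leg of the trip, inner level = coordinate component
mux = pd.MultiIndex.from_product([['Start', 'Intermediary', 'End'], ['lat', 'lng']])
data = [[54.957055, -7.740156, 54.956915, -7.753690, 54.957055, -7.740156]]
df = pd.DataFrame(data, columns=mux)
print(df['Start'])           # the lat/lng sub-frame for the Start leg
print(df[('Start', 'lat')])  # a single sub-column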
EDIT:
Setup:
temp=u""" start intermediary end
('54.957055',' -7.740156') ('54.956915136264', ' -7.753690062122') ('54.957055','-7.740156')
('54.8913208', '-7.5740475') ('54.864402885577', '-7.653445692445'),('54','0') ('54.8913208','-7.5740475')
('55.2375819', '-7.2357427') ('55.253936739337', '-7.259624609577'), ('54','2'),('54','1') ('55.2375819','-7.2357427')
('54.5298806', '-8.1350247') ('54.504374314741', '-8.188334960168') ('54.5298806','-8.1350247')
('54.2810187', ' -7.896937') ('54.303836850038', '-8.180136033695'), ('54','3') ('54.2810187','-7.896937')
"""
import io
# after testing, replace 'io.StringIO(temp)' with 'filename.csv'
# (pd.compat.StringIO was removed from pandas, so use io.StringIO)
df = pd.read_csv(io.StringIO(temp), sep=r"\s{3,}", engine="python")
print (df)
start \
0 ('54.957055',' -7.740156')
1 ('54.8913208', '-7.5740475')
2 ('55.2375819', '-7.2357427')
3 ('54.5298806', '-8.1350247')
4 ('54.2810187', ' -7.896937')
intermediary \
0 ('54.956915136264', ' -7.753690062122')
1 ('54.864402885577', '-7.653445692445'),('54','0')
2 ('55.253936739337', '-7.259624609577'), ('54',...
3 ('54.504374314741', '-8.188334960168')
4 ('54.303836850038', '-8.180136033695'), ('54',...
end
0 ('54.957055','-7.740156')
1 ('54.8913208','-7.5740475')
2 ('55.2375819','-7.2357427')
3 ('54.5298806','-8.1350247')
4 ('54.2810187','-7.896937')
import ast
#convert string values to tuples
df = df.applymap(lambda x: ast.literal_eval(x))
#wrap each cell so every value is a list of (lat, lng) tuples
df['intermediary'] = df['intermediary'].apply(lambda x: list(x) if isinstance(x[1], tuple) else [x])
#DataFrame from the start column
df1 = pd.DataFrame(df['start'].values.tolist(), columns=['lat','lng'])
#DataFrame from the intermediary column, one row per (lat, lng) pair
df2 = (pd.concat([pd.DataFrame(x, columns=['lat','lng']) for x in df['intermediary']], keys=df.index)
.reset_index(level=1, drop=True)
.add_prefix('intermediary_'))
print (df2)
#join all DataFrames together (end equals start in this data, so df1 is reused)
df3 = df1.add_prefix('start_').join(df2).join(df1.add_prefix('end_'))
#create MultiIndex by split
df3.columns = df3.columns.str.split('_', expand=True)
print (df3)
start intermediary end \
lat lng lat lng lat
0 54.957055 -7.740156 54.956915136264 -7.753690062122 54.957055
1 54.8913208 -7.5740475 54.864402885577 -7.653445692445 54.8913208
1 54.8913208 -7.5740475 54 0 54.8913208
2 55.2375819 -7.2357427 55.253936739337 -7.259624609577 55.2375819
2 55.2375819 -7.2357427 54 2 55.2375819
2 55.2375819 -7.2357427 54 1 55.2375819
3 54.5298806 -8.1350247 54.504374314741 -8.188334960168 54.5298806
4 54.2810187 -7.896937 54.303836850038 -8.180136033695 54.2810187
4 54.2810187 -7.896937 54 3 54.2810187
lng
0 -7.740156
1 -7.5740475
1 -7.5740475
2 -7.2357427
2 -7.2357427
2 -7.2357427
3 -8.1350247
4 -7.896937
4 -7.896937
To add a top column level to a pd.DataFrame run:
def add_top_column(df, top_col, inplace=False):
    if not inplace:
        df = df.copy()
    df.columns = pd.MultiIndex.from_product([[top_col], df.columns])
    return df
orig_df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
new_df = add_top_column(orig_df, "new column")
In order to combine 3 DataFrames each with its own new top column:
new_df2 = add_top_column(orig_df, "new column2")
new_df3 = add_top_column(orig_df, "new column3")
print(pd.concat([new_df, new_df2, new_df3], axis=1))
"""
# And this is the expected output:
new column new column2 new column3
a b a b a b
0 1 2 1 2 1 2
1 3 4 3 4 3 4
"""
Note that if the DataFrames' indexes do not match, you might need to reset them first.
You can read an Excel file with 2 headers (2 levels of columns).
df = pd.read_excel(
    sourceFilePath,
    index_col=[0],
    header=[0, 1],
)
You can reshape your df like this in order to keep just one header (it's easier to work with a single header):
df = df.stack([0,1], dropna=False).to_frame('Valeur').reset_index()
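A minimal sketch of that reshape on a toy two-level frame (note that stack's dropna argument is deprecated in recent pandas):
import pandas as pd

# toy frame with a two-level header, standing in for the Excel data
df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b']]))
# move both column levels into the row index, keeping one value column
flat = df.stack([0, 1], dropna=False).to_frame('Valeur').reset_index()
print(flat)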

How do I rename an index row in Python Pandas? [duplicate]

This question already has answers here:
How can I change a specific row label in a Pandas dataframe?
(2 answers)
Closed 5 years ago.
I see how to rename columns, but I want to rename an index (row name) that I have in a data frame.
I had a table with 350 rows in it, I then added a total to the bottom. I then removed every row except the last row.
-------------------------------------------------
| | A | B | C |
-------------------------------------------------
| TOTAL | 1243 | 423 | 23 |
-------------------------------------------------
So I have the row called 'Total', and then several columns. I want to rename the word 'Total' to something else.
Is this even possible?
Many thanks
You could pass a dictionary to rename(), for example:
In [1]: import pandas as pd
   ...: df = pd.Series([1, 2, 3])
   ...: df
Out[1]:
0    1
1    2
2    3
dtype: int64

In [2]: df.rename({1: 3, 2: 'total'})
Out[2]:
0        1
3        2
total    3
dtype: int64
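The same idea works on a DataFrame index, so for the TOTAL table above something like this should do (the new label is arbitrary):
# rename a single row label on the index
df = df.rename(index={'TOTAL': 'Grand total'})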
Easy as this...
df.index.name = 'Name'
Note that this sets the name of the index axis itself, not an individual row label; use rename as shown above to change a single label.

In pandas, How to select the rows that contains NaN? [duplicate]

This question already has answers here:
How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?
(6 answers)
Closed 6 years ago.
Suppose I have the following dataframe in df:
a | b | c
------+-------+-------
5 | 2 | 4
NaN | 6 | 8
5 | 9 | 0
3 | 7 | 1
If I do df.loc[df['a'] == 5] it will correctly return the first and third row, but then if I do a df.loc[df['a'] == np.NaN] it returns nothing.
I think this is more a Python thing than a pandas one. If I compare np.nan against anything, even np.nan == np.nan evaluates as False, so the question is: how should I test for np.nan?
Try using isnull like so:
import pandas as pd
import numpy as np

a = [1, 2, 3, np.nan, 5, 6, 7]
df = pd.DataFrame(a)
# isnull() returns a boolean mask that is True where values are NaN
df[df[0].isnull()]
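To select rows containing a NaN in any column (the case in the linked duplicate), the same mask generalises:
# keep rows where at least one column is null
df[df.isnull().any(axis=1)]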

Looping a merge for a amount of csv files using pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
Right now I'm creating a program which combines csv files into one, without duplicating shared columns. Each new column should be added next to the adjacent column.
As of right now I'm able to get the files, but I haven't found a way to read each csv into a data frame, merge all of those data frames together, and write out a single csv file.
Right now I'm testing this out with three csv files that share a common ID column.
What I have right now is as follows:
os.chdir(filedname)
data = pd.merge([pd.DataFrame.from_csv(file)
                 for file in glob.glob("*.csv")], on='ID')
data.to_csv('merged.csv')
The files look like this:
(File 1) (File 2)
ID BLA ID X
1 2 1 55
2 3 2 2
3 4 3 12
4 5 4 52
Each column besides the shared ID column in each csv file in the directory should be merged together to create one csv file like this:
ID BLA X
1 2 55
2 3 2
3 4 12
4 5 52
Any advice would be great in helping me solve this problem.
A simple example:
# Demo DataFrames
df1 = pd.DataFrame([[1,2,3],[2,3,4],[3,1,3]], columns=['ID','BLA','X'])
df2 = pd.DataFrame([[1,2,3],[2,5,4],[3,10,100]], columns=['ID','X','BLA'])
df3 = pd.DataFrame([[1,2,3],[2,8,7],[3,0,0]], columns=['ID','BLA','D'])
# Demo DataFrames sequence
dfs = [df1,df2,df3]
# Merge DataFrames
df = pd.DataFrame(columns=['ID'])
for d in dfs:
    cols = [x for x in d.columns if x not in df.columns or x == 'ID']
    df = pd.merge(df, d[cols], on='ID', how='outer', suffixes=['',''])
# result
ID BLA X D
0 1 2 3 3
1 2 3 4 7
2 3 1 3 0
in your case it could be something like this (pd.DataFrame.from_csv has since been removed from pandas, so use pd.read_csv):
data = [pd.read_csv(f) for f in glob.glob("*.csv")]
df = pd.DataFrame(columns=['ID'])
for d in data:
    cols = [x for x in d.columns if x not in df.columns or x == 'ID']
    df = pd.merge(df, d[cols], on='ID', how='outer', suffixes=['',''])
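A common alternative (see Pandas Merging 101) is to fold the frames together with functools.reduce; a sketch assuming every file has the ID column:
import glob
from functools import reduce

import pandas as pd

# read every csv in the working directory
frames = [pd.read_csv(f) for f in glob.glob("*.csv")]
# merge pairwise on the shared ID column
merged = reduce(lambda left, right: pd.merge(left, right, on='ID', how='outer'),
                frames)
merged.to_csv('merged.csv', index=False)
Unlike the loop above, this keeps overlapping non-ID columns under _x/_y suffixes instead of dropping them.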
