Break Pandas series into multiple DataFrame columns based on string position - python

Given a Pandas Series with strings, I'd like to create a DataFrame with columns for each section of the Series based on position.
For example, given this input:
s = pd.Series(['abcdef', '123456'])
ind = [2, 3, 1]
Ideally I'd get this:
target_df = pd.DataFrame({
    'col1': ['ab', '12'],
    'col2': ['cde', '345'],
    'col3': ['f', '6']
})
One way is creating them one-by-one, e.g.:
df['col1'] = s.str[:2]
df['col2'] = s.str[2:5]
df['col3'] = s.str[5]
But I'm guessing this is slower than a single split.
I tried a regex, but I'm not sure how to parse the result:
pd.DataFrame(s.str.split(r"(^(\w{2})(\w{3})(\w{1}))"))
# 0
# 0 [, abcdef, ab, cde, f, ]
# 1 [, 123456, 12, 345, 6, ]

Your regex is almost there (note Series.str.extract(expand=True) returns a DataFrame):
df = s.str.extract(r"^(\w{2})(\w{3})(\w{1})", expand=True)
df.columns = ['col1', 'col2', 'col3']
# col1 col2 col3
# 0 ab cde f
# 1 12 345 6
Here's a function to generalize this:
def split_series_by_position(s, ind, cols):
    # Construct the regex, e.g. ind=[2, 3, 1] gives ^(\w{2})(\w{3})(\w{1}).
    regex = r"^(\w{" + r"})(\w{".join(map(str, ind)) + r"})"
    df = s.str.extract(regex, expand=True)
    df.columns = cols
    return df
# Example which will produce the result above.
split_series_by_position(s, ind, ['col1', 'col2', 'col3'])
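If you'd rather avoid regex entirely, the same generalization can be done with plain .str slicing over cumulative boundaries. A minimal sketch (the helper name split_series_by_slices is my own, not from the answer above):

```python
import pandas as pd
from itertools import accumulate

def split_series_by_slices(s, ind, cols):
    # Cumulative boundaries, e.g. [2, 3, 1] -> [0, 2, 5, 6].
    bounds = [0] + list(accumulate(ind))
    # Slice each section with the vectorized .str accessor.
    return pd.DataFrame(
        {col: s.str[lo:hi] for col, lo, hi in zip(cols, bounds, bounds[1:])}
    )

s = pd.Series(['abcdef', '123456'])
print(split_series_by_slices(s, [2, 3, 1], ['col1', 'col2', 'col3']))
#   col1 col2 col3
# 0   ab  cde    f
# 1   12  345    6
```

This trades the single regex pass for one vectorized slice per column, which is usually fast enough and arguably easier to read.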

Related

Expand pandas column list of string values into multiple columns

I have data in a dataframe; the second column is a list of strings:
The desired outcome would have two separate columns for content.images...:
You could try something like:
df[['content.images.0','content.images.1']] = df['content.images'].str.split(', ', expand=True)
Assuming this example input:
df = pd.DataFrame({'col1': ['X', 'Y'],
                   'col2': [['ABC', 'DEF'], ['GHI', 'JLK', 'MNO']]})
#   col1             col2
# 0    X       [ABC, DEF]
# 1    Y  [GHI, JLK, MNO]
You could use apply(pd.Series) and add a custom prefix with add_prefix before doing a join with the original dataframe:
out = (df.drop(columns=['col2'])
         .join(df['col2'].apply(pd.Series).add_prefix('col2_'))
         .fillna('')  # optional
      )
output:
col1 col2_0 col2_1 col2_2
0 X ABC DEF
1 Y GHI JLK MNO
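As a side note, apply(pd.Series) constructs a Series per row and tends to be slow on large frames; building the expanded columns directly from the underlying lists is usually faster. A sketch using the same example input:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['X', 'Y'],
                   'col2': [['ABC', 'DEF'], ['GHI', 'JLK', 'MNO']]})

# Build the expanded columns straight from the underlying lists;
# pd.DataFrame pads ragged rows with NaN, which we then blank out.
expanded = pd.DataFrame(df['col2'].tolist(), index=df.index).add_prefix('col2_')
out = df.drop(columns=['col2']).join(expanded).fillna('')
print(out)
```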

Converting Pandas header to string type

I have a dataframe which I read from a CSV as:
df = pd.read_csv(csv_path, header = None)
By default, Pandas assigns the header (df.columns) to be [0, 1, 2, ...] of type int64
What's the best way to convert this to type str, such that df.columns results in ['0', '1', '2', ...] (i.e. type str)?
Currently, the best way I can think of doing this is df.columns = list(map(str, df.columns))
Unfortunately, df.astype(str) only affects the values and not the column names
You can use astype(str) with column names like this:
df.columns = df.columns.astype(str)
Example:
In [2472]: l = [1,2]
In [2473]: l1 = [2,3]
In [2475]: df = pd.DataFrame([l, l1])
In [2476]: df
Out[2476]:
0 1
0 1 2
1 2 3
In [2480]: df.columns = df.columns.astype(str)
In [2482]: df.columns
Out[2482]: Index(['0', '1'], dtype='object')
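For completeness, the same conversion can be done without assigning to df.columns at all, since rename accepts a callable mapper. A small sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3]])

# rename applies the callable to every column label, so str converts them all.
df = df.rename(columns=str)
print(df.columns)
# Index(['0', '1'], dtype='object')
```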

Join two same columns from two dataframes, pandas

I am looking for the fastest way to join columns with the same names using a separator.
my dataframes:
df1:
A,B,C,D
my,he,she,it
df2:
A,B,C,D
dog,cat,elephant,fish
expected output:
df:
A,B,C,D
my:dog,he:cat,she:elephant,it:fish
As you can see, I want to merge columns with same names, two cells in one.
I can use this code for A column:
df = df1.merge(df2)
df['A'] = df[['A_x', 'A_y']].apply(lambda x: ':'.join(x), axis=1)
In my real dataset I have over 30 columns, and I don't want to write the same lines for each of them. Is there a faster way to get my expected output?
How about concat and groupby ?
df3 = pd.concat([df1, df2], axis=0)
df3 = df3.groupby(df3.index).transform(lambda x: ':'.join(x)).drop_duplicates()
print(df3)
A B C D
0 my:dog he:cat she:elephant it:fish
How about this?
df3 = df1 + ':' + df2
print(df3)
A B C D
0 my:dog he:cat she:elephant it:fish
This is also convenient because columns that don't match produce NaN, so you can filter them out later if you want:
df1 = pd.DataFrame({'A': ['my'], 'B': ['he'], 'C': ['she'], 'D': ['it'], 'E': ['another'], 'F': ['and another']})
df2 = pd.DataFrame({'A': ['dog'], 'B': ['cat'], 'C': ['elephant'], 'D': ['fish']})
df1 + ':' + df2
A B C D E F
0 my:dog he:cat she:elephant it:fish NaN NaN
You can do this by simply adding the two dataframes with a separator.
import pandas as pd
df1 = pd.DataFrame({"A": ["my"], "B": ["he"], "C": ["she"], "D": ["it"]})
df2 = pd.DataFrame({"A": ["dog"], "B": ["cat"], "C": ["elephant"], "D": ["fish"]})
print(df1)
print(df2)
df3 = df1 + ':' + df2
print(df3)
This will give you a result like:
A B C D
0 my he she it
A B C D
0 dog cat elephant fish
A B C D
0 my:dog he:cat she:elephant it:fish
Is this what you're trying to achieve? Note that this only works cleanly if both dataframes have the same columns; any extra columns will contain NaN. What do you want to do with the columns that differ between df1 and df2? Please comment below so I can understand your problem better.
You can simply do:
df = df1 + ':' + df2
print(df)
which is simple and effective.
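If the frames don't share exactly the same columns and you'd rather avoid the NaN columns entirely, one option (my own sketch, not from the answers above) is to restrict the addition to the shared columns first:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['my'], 'B': ['he'], 'E': ['extra']})
df2 = pd.DataFrame({'A': ['dog'], 'B': ['cat']})

# Join only the columns present in both frames, so no NaN columns appear.
common = df1.columns.intersection(df2.columns)
df3 = df1[common] + ':' + df2[common]
print(df3)
#         A       B
# 0  my:dog  he:cat
```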

How to concat item that is in list format in columns in dataframe

I want to concatenate items that are stored as lists in a dataframe column.
I have the data frame below; when I print DataFrame.head(), it shows:
A B
1 [1,2,3,4]
2 [5,6,7,8]
Expect Result (convert it from list to string separate by comma)
A B
1 1,2,3,4
2 5,6,7,8
You could do:
import pandas as pd
data = [[1, [1, 2, 3, 4]],
        [2, [5, 6, 7, 8]]]
df = pd.DataFrame(data=data, columns=['A', 'B'])
df['B'] = [','.join(map(str, lst)) for lst in df.B]
print(df.head(2))
Output
A B
0 1 1,2,3,4
1 2 5,6,7,8
You can use the map or apply methods for this:
import pandas as pd
data = [[1, [1, 2, 3, 4]],
        [2, [5, 6, 7, 8]]]
df = pd.DataFrame(data=data, columns=['A', 'B'])
df['B'] = df['B'].map(lambda x: ",".join(map(str,x)))
# or
# df['B'] = df['B'].apply(lambda x: ",".join(map(str,x)))
print(df.head(2))
df = pd.DataFrame([['1',[1,2,3,4]],['2',[5,6,7,8]]], columns=list('AB'))
A generic way to convert lists to strings: in your example the list elements are of type int, but this works for any type that can be represented as a string. Join the elements with ','.join(map(str, a_list)), then iterate through the rows of the column that contains the lists:
for i, row in df.iterrows():
    df.loc[i, 'B'] = ','.join(map(str, row['B']))
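Another option (my own sketch, not from the answers above) avoids the row loop by exploding the lists and re-aggregating per row label:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2, 3, 4], [5, 6, 7, 8]]})

# Flatten the lists to one element per row, cast to str,
# then re-join per original row label.
df['B'] = df['B'].explode().astype(str).groupby(level=0).agg(','.join)
print(df)
#    A        B
# 0  1  1,2,3,4
# 1  2  5,6,7,8
```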

How to convert data of type Panda to Panda.Dataframe?

I have an object whose type is Pandas, and print(object) gives the output below:
print(type(recomen_total))
print(recomen_total)
Output is
<class 'pandas.core.frame.Pandas'>
Pandas(Index=12, instrument_1='XXXXXX', instrument_2='XXXX', trade_strategy='XXX', earliest_timestamp='2016-08-02T10:00:00+0530', latest_timestamp='2016-08-02T10:00:00+0530', xy_signal_count=1)
I want to convert this object to a pd.DataFrame. How can I do it?
I tried pd.DataFrame(object) and from_dict as well; they throw errors.
Interestingly, it will not convert to a dataframe directly but to a series. Once it is converted to a series, use the to_frame method of the series to convert it to a DataFrame:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])
for row in df.itertuples():
    print(pd.Series(row).to_frame())
Hope this helps!!
EDIT
In case you want to save the column names use the _asdict() method like this:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])
for row in df.itertuples():
    d = dict(row._asdict())
    print(pd.Series(d).to_frame())
Output:
0
Index a
col1 1
col2 0.1
0
Index b
col1 2
col2 0.2
To create a new DataFrame from the itertuples namedtuples, you can use list() or Series too:
import pandas as pd
# source DataFrame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# empty DataFrame
df_new_fromAppend = pd.DataFrame(columns=['x', 'y'], data=None)

for r in df.itertuples():
    # create a new DataFrame from itertuples() via list() ([1:] skips the index):
    df_new_fromList = pd.DataFrame([list(r)[1:]], columns=['c', 'd'])
    # or create a new DataFrame from itertuples() via Series
    # (drop(0) removes the index, T transposes the column to a row):
    df_new_fromSeries = pd.DataFrame(pd.Series(r).drop(0)).T
    # or use .loc to insert the row into an existing DataFrame ([1:] skips the index):
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)[1:]

print('df_new_fromList:')
print(df_new_fromList, '\n')
print('df_new_fromSeries:')
print(df_new_fromSeries, '\n')
print('df_new_fromAppend:')
print(df_new_fromAppend, '\n')
Output:
df_new_fromList:
c d
0 2 4
df_new_fromSeries:
1 2
0 2 4
df_new_fromAppend:
x y
0 1 3
1 2 4
To omit the index, use the param index=False (but I mostly need the index for the iteration):
for r in df.itertuples(index=False):
    # the [1:] needn't be used, for example:
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)
The following works for me:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
for row in df.itertuples():
    row_as_df = pd.DataFrame.from_records([row], columns=row._fields)
    print(row_as_df)
The result is:
Index col1 col2
0 a 1 0.1
Index col1 col2
0 b 2 0.2
Sadly, AFAIU, there's no simple way to keep column names, without explicitly utilizing "protected attributes" such as _fields.
With some tweaks to #Igor's answer, I arrived at this satisfactory code, which preserves column names and uses as little pandas code as possible.
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]})
# Or initialize another dataframe above

# Get the list of column names
column_names = df.columns.values.tolist()

filtered_rows = []
for row in df.itertuples(index=False):
    # Some code logic to filter rows
    filtered_rows.append(row)

# Convert the pandas.core.frame.Pandas rows back to a DataFrame:
# combine the filtered rows into a single dataframe
concatenated_df = pd.DataFrame.from_records(filtered_rows, columns=column_names)
concatenated_df.to_csv("path_to_csv", index=False)
The result is a csv containing:
col1 col2
1 0.1
2 0.2
To convert a list of objects returned by Pandas .itertuples to a DataFrame, while preserving the column names:
# Example source DF
data = [['cheetah', 120], ['human', 44.72], ['dragonfly', 54]]
source_df = pd.DataFrame(data, columns=['animal', 'top_speed'])
animal top_speed
0 cheetah 120.00
1 human 44.72
2 dragonfly 54.00
Since pandas does not recommend building DataFrames by adding single rows in a for loop, we iterate and build the DataFrame at the end:
WOW_THAT_IS_FAST = 50
list_ = list()
for animal in source_df.itertuples(index=False, name='animal'):
    if animal.top_speed > WOW_THAT_IS_FAST:
        list_.append(animal)
Now build the DF in a single command and without manually recreating the column names.
filtered_df = pd.DataFrame(list_)
      animal  top_speed
0    cheetah     120.00
1  dragonfly      54.00
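If you don't need attribute access at all, passing name=None makes itertuples yield plain tuples, which feed straight into the DataFrame constructor. A sketch on the same example data:

```python
import pandas as pd

source_df = pd.DataFrame({'animal': ['cheetah', 'human', 'dragonfly'],
                          'top_speed': [120.0, 44.72, 54.0]})

# With name=None, itertuples yields plain tuples, so no namedtuple
# attributes (or "protected" fields like _fields) are involved at all.
rows = [t for t in source_df.itertuples(index=False, name=None) if t[1] > 50]
filtered_df = pd.DataFrame(rows, columns=source_df.columns)
print(filtered_df)
```

The trade-off is positional access (t[1]) instead of the more readable animal.top_speed.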
