How to transform a CSV column in dataframe to rows - python

I have a dataframe with comma-separated values in the Language column:
Name Language
0 A French,Espanol
1 B Deutsch,English
I wish to transform the above dataframe as below
Name Language
0 A French
1 A Espanol
2 B Deutsch
3 B English
I tried the code below, but it did not produce the desired result:
df=df.join(df.pop('Language').str.extractall(',$')[0] .reset_index(level=1,drop=True) .rename('Language')) .reset_index(drop=True)

pandas.DataFrame.explode should be suited for that task. Combine it with pandas.DataFrame.assign to get the desired column:
import pandas as pd
df = pd.DataFrame({'Name':['A', 'B'], 'Language': ['French,Espanol', 'Deutsch,English']})
df = df.assign(Language=df['Language'].str.split(',')).explode('Language')
# Name Language
# 0 A French
# 0 A Espanol
# 1 B Deutsch
# 1 B English
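The exploded frame keeps the original row labels (0, 0, 1, 1), while the question asks for a fresh 0..3 index. A minimal sketch of one way to get that, assuming pandas 1.1 or later (where explode accepts ignore_index); the variable name out is only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'],
                   'Language': ['French,Espanol', 'Deutsch,English']})

# Split each string into a list, then emit one row per list element.
# ignore_index=True renumbers the rows 0..n-1 (assumes pandas >= 1.1).
out = (df.assign(Language=df['Language'].str.split(','))
         .explode('Language', ignore_index=True))
print(out)
```

On older pandas versions, calling .reset_index(drop=True) after explode should give the same index.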

Split the second column of each row, collect one record per value, then build a new dataframe with the same columns. (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first.)
import pandas as pd
csv_df = pd.DataFrame([['1', '2,3'], ['2', '4,5']], columns=['Name', 'Language'])
rows = []
for index, row in csv_df.iterrows():
    name = row['Name']
    # one output row per comma-separated value
    for x in row['Language'].split(','):
        rows.append({'Name': name, 'Language': x})
df = pd.DataFrame(rows, columns=['Name', 'Language'])
print(df)

Related

Pandas replace dataframe value with a variable at a variable row

I want to replace a row in a csv file with a variable. The row itself also has to be a variable. The following code is an example:
import pandas as pd
# sample dataframe
df = pd.DataFrame({'A': ['a','b','c'], 'B':['b','c','d']})
print("Original DataFrame:\n", df)
x = 1
y = 12698
df_rep = df.replace([int(x),1], y)
print("\nAfter replacing:\n", df_rep)
This can be done with pandas positional indexing, e.g. df.iloc[row_num, col_num].
#update df
df.iloc[x,1]=y
#print df
print(df)
A B
0 a b
1 b 12698
2 c d
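If the column is known by name rather than by position, label-based indexing is an alternative. This is a minimal sketch, not part of the original answer: the names row_label, col_label and new_value are illustrative, and dtype=object is an assumption made so the column can hold both strings and an integer.

```python
import pandas as pd

# dtype=object is assumed here so a column can mix str and int values
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['b', 'c', 'd']}, dtype=object)

row_label = 1      # row to change (an index label)
col_label = 'B'    # column to change (a column name)
new_value = 12698  # replacement value

# .loc selects by label; .iloc (used above) selects by integer position
df.loc[row_label, col_label] = new_value
print(df)
```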

How to extract entire rows from pandas data frame, if a column's string value contains a specific pattern

I have the following data frame with column 'Name' having a pattern '///' in its values
data = [['a1','yahoo', 'apple'], ['a2','gma///il', 'mango'], ['a3','amazon', 'papaya'],
['a4','bi///ng', 'guava']]
df = pd.DataFrame(data, columns = ['ID', 'Name', 'Info'])
I need to extract the entire row from this data frame if the value in column 'Name' contains the pattern '///'. I have tried the following code, but I get an empty dataframe.
new_df = df.loc[df['Name'] == '///']
My expected output should give me a data frame like this:
data_new = [['a2','gma///il', 'mango'],['a4','bi///ng', 'guava']]
new_df = pd.DataFrame(data_new, columns = ['ID', 'Name', 'Info'])
print(new_df)
Use Series.str.contains:
import pandas as pd
data = [['a1','yahoo', 'apple'], ['a2','gma///il', 'mango'],
['a3','amazon', 'papaya'],['a4','bi///ng', 'guava']]
df = pd.DataFrame(data, columns = ['ID', 'Name', 'Info'])
print (df[df["Name"].str.contains("///")])
#
ID Name Info
1 a2 gma///il mango
3 a4 bi///ng guava
If you want to filter on one particular column, use this solution:
import numpy as np
import pandas as pd
data = [['a1','yahoo', 'apple'], ['a2','gma///il', 'mango'], ['a3','amazon', 'papaya'],
        ['a4','bi///ng', 'guava']]
df = pd.DataFrame(data, columns = ['ID', 'Name', 'Info'])
mask = np.column_stack([df['Name'].str.contains(r"\///", na=False)])
df.loc[mask.any(axis=1)]
Output:
ID Name Info
1 a2 gma///il mango
3 a4 bi///ng guava
If you need to filter on all columns for some pattern, see the solution below:
import numpy as np
mask = np.column_stack([df[col].str.contains(r"\///", na=False) for col in df])
df.loc[mask.any(axis=1)]
Output:
ID Name Info
1 a2 gma///il mango
3 a4 bi///ng guava
DataFrame has string function contains() for this
new_df = df[ df['Name'].str.contains('///') ]
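Since '///' is a fixed substring rather than a regular expression, the match can also be done literally. A minimal sketch, assuming the same sample data as in the question: regex=False treats the pattern as plain text, and na=False treats missing values as non-matches.

```python
import pandas as pd

data = [['a1', 'yahoo', 'apple'], ['a2', 'gma///il', 'mango'],
        ['a3', 'amazon', 'papaya'], ['a4', 'bi///ng', 'guava']]
df = pd.DataFrame(data, columns=['ID', 'Name', 'Info'])

# Literal substring match; rows where 'Name' is missing count as False
mask = df['Name'].str.contains('///', regex=False, na=False)
new_df = df[mask]
print(new_df)
```

The original attempt with == returned nothing because it tests for equality with the whole string, not for containment.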

Fill in empty header with previous column name - pandas

I have a dataframe where each second column name is skipped:
eg
Step_1.
The idea is to fill unnamed columns with previous name to get:
Step_2.
To sum up "in" and "out" in each class, to get final result like this
The intermediary Step_1 is important and cannot be skipped to get the final result.
I appreciate any help and apologize for not being clear enough when I first asked the question.
Thank you
The idea is to convert the column names to a Series, replace the names starting with Unnamed by missing values, and then forward fill them:
df.columns = df.columns.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill()
print (df)
Column_1 Column_1 Column_2 Column_2
0 a d f g
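For reference, a self-contained sketch of the same renaming step. The sample frame is an assumption: it uses the 'Unnamed: N' names that read_csv assigns to empty header cells, since the original data was not shown.

```python
import pandas as pd

# Hypothetical input: every second header cell was empty in the file
df = pd.DataFrame([['a', 'd', 'f', 'g']],
                  columns=['Column_1', 'Unnamed: 1', 'Column_2', 'Unnamed: 3'])

names = df.columns.to_series()
# Turn the 'Unnamed...' names into NaN, then copy the previous name forward
names = names.mask(names.str.startswith('Unnamed')).ffill()
df.columns = names.to_numpy()
print(df.columns.tolist())
```

Note that the result has duplicate column names, so df['Column_1'] then returns a two-column DataFrame rather than a Series.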
EDIT:
If the column names already contain missing values:
df.columns = df.columns.to_series().ffill()
A MultiIndex solution is necessary if the second row is a header too. First use header=[0,1] to read the columns as a MultiIndex:
import pandas as pd
from io import StringIO
temp=u"""Column_1;Unnamed_column;Column_2;Unnamed_column
a;d;f;g
1;5;5;6
7;8;9;4"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", header=[0,1])
print (df)
Column_1 Unnamed_column Column_2 Unnamed_column
a d f g
0 1 5 5 6
1 7 8 9 4
a = df.columns.get_level_values(0)
b = df.columns.get_level_values(1)
df.columns = [a.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill(), b]
print (df)
Column_1 Column_2
a d f g
0 1 5 5 6
1 7 8 9 4
I tried this:
t = pd.DataFrame(df.columns)
t.loc[t[0].str.startswith('Unnamed: '), 0] = np.nan
t[0] = t[0].ffill()
df.columns = t[0].values
Create a temporary dataframe from the columns of the original dataframe, apply ffill or bfill as needed (ffill fills with the previous name), and assign the values back to the original dataframe.
You can rewrite df.columns with a list comprehension.
from itertools import chain
df = pd.DataFrame(
{"Column_1": [1], "Unnamed_column1": [2], "Column_2": [3], "Unnamed_column2": [4]})
cols = [[c, c] for c in df.columns[::2]]
df.columns = [_ for _ in chain(*cols)]
Having said that, it might be better to assign unique names to the columns, as they will be used as keys/indices, e.g.:
cols = [[c, c+"_new"] for c in df.columns[::2]]

pandas convert grouped rows into columns

I have a dataframe such as:
label column1
a 1
a 2
b 6
b 4
I would like to make a dataframe with a new column, with the opposite value from column1 where the labels match. Such as:
label column1 column2
a 1 2
a 2 1
b 6 4
b 4 6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
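import numpy as np
import pandas as pd
x = pd.DataFrame({ 'label': ['a','a','b','b'],
                   'column1': [1,2,6,4] })
# within each label group, add column1 in reversed order as column2
y = x.groupby('label').apply(
    lambda g: g.assign(column2 = np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True) # optional: drop the group-level index
print(y)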
you can try the code block below:
#create the Dataframe
df = pd.DataFrame({'label':['a','a','b','b'],
'column1':[1,2,6,4]})
#Group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()
#Concat those groups to create columns2
df2 = (pd.concat([b,a])
         .sort_values(by='label', kind='stable')
         .rename(columns={'column1':'column2'})
         .reset_index(drop=True))
#Join with the original Dataframe on the row index
#(merge raises an error when 'on' is combined with left_index/right_index)
df = df.join(df2['column2'])
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# Create dataframe
df = pd.DataFrame(data = {'label' :['a', 'a', 'b', 'b'],
                          'column1' :[1,2, 6,4]})
# iterate over dataframe, identify matching label and opposite value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # set value in the new column (df.set_value was removed in pandas 1.0)
    df.loc[index, 'column2'] = newvalue
df.head()
You can use groupby with apply, creating a new Series in reversed order for each group:
df['column2'] = df.groupby('label')["column1"] \
.apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
column1 label column2
0 1 a 2
1 2 a 1
2 6 b 4
3 4 b 6
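A variation on the same idea is groupby with transform, which keeps the result aligned with the original rows so no reset_index is needed. This is a sketch under the same assumption as the answers above: reversing each group gives the "opposite" value only when every label appears exactly twice.

```python
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})

# For each label, return that group's column1 values in reversed order;
# transform places them back on the group's original row positions.
df['column2'] = (df.groupby('label')['column1']
                   .transform(lambda s: s.to_numpy()[::-1]))
print(df)
```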

Appending to an empty DataFrame in Pandas?

Is it possible to append to an empty data frame that doesn't contain any indices or columns?
I have tried to do this, but keep getting an empty dataframe at the end.
e.g.
import pandas as pd
df = pd.DataFrame()
data = ['some kind of data here' --> I have checked the type already, and it is a dataframe]
df.append(data)
The result looks like this:
Empty DataFrame
Columns: []
Index: []
This should work:
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df = df.append(data)
>>> df
A
0 0
1 1
2 2
Since append doesn't happen in place, you'll have to store the output if you want it:
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df.append(data) # without storing
>>> df
Empty DataFrame
Columns: []
Index: []
>>> df = df.append(data)
>>> df
A
0 0
1 1
2 2
And if you want to add a row, you can use a dictionary:
df = pd.DataFrame()
df = df.append({'name': 'Zed', 'age': 9, 'height': 2}, ignore_index=True)
which gives you:
age height name
0 9 2 Zed
You can concat the data in this way:
InfoDF = pd.DataFrame()
tempDF = pd.DataFrame(rows,columns=['id','min_date'])
InfoDF = pd.concat([InfoDF,tempDF])
The answers are very useful, but since pandas.DataFrame.append was deprecated (as already mentioned by various users), and the answers using pandas.concat are not "Runnable Code Snippets" I would like to add the following snippet:
import pandas as pd
df = pd.DataFrame(columns =['name','age'])
row_to_append = pd.DataFrame([{'name':"Alice", 'age':"25"},{'name':"Bob", 'age':"32"}])
df = pd.concat([df,row_to_append])
So df is now:
name age
0 Alice 25
1 Bob 32
pandas.DataFrame.append is deprecated since version 1.4.0 (and was removed in pandas 2.0): use concat() instead.
Therefore:
df = pd.DataFrame() # empty dataframe
df2 = pd.DataFrame(...) # some dataframe with data
df = pd.concat([df, df2])
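When the data arrives piece by piece, for example in a loop, a common pattern is to collect the pieces in a list and call concat once at the end instead of growing a dataframe repeatedly. A minimal runnable sketch; the frames list and the sample ranges are made up for illustration:

```python
import pandas as pd

frames = []  # collect the pieces here instead of appending to a DataFrame
for start in (0, 3):
    frames.append(pd.DataFrame({'A': range(start, start + 3)}))

# One concat at the end; ignore_index=True renumbers the rows 0..n-1
df = pd.concat(frames, ignore_index=True)
print(df)
```

Like append, concat returns a new dataframe, so the result has to be assigned to a name to be kept.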
