This works:
import pandas as pd
data = [["aa", 1, 2], ["bb", 3, 4]]
df = pd.DataFrame(data, columns=['id', 'a', 'b'])
df = df.set_index('id')
print(df)
"""
a b
id
aa 1 2
bb 3 4
"""
but is it possible in just one call of pd.DataFrame(...) directly with a parameter, without using set_index after?
Convert the values to a 2D array (note that np.array on mixed types coerces everything to strings, as the Details below show):
import numpy as np

data = [["aa", 1, 2], ["bb", 3, 4]]
arr = np.array(data)
df = pd.DataFrame(arr[:, 1:], columns=['a', 'b'], index=arr[:, 0])
print (df)
a b
aa 1 2
bb 3 4
Details:
print (arr)
[['aa' '1' '2']
['bb' '3' '4']]
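Because np.array turns the integers into strings here, a dtype-preserving variant (a sketch, using plain list comprehensions instead of NumPy) may be preferable:

```python
import pandas as pd

data = [["aa", 1, 2], ["bb", 3, 4]]
# split each row into its label (the index) and its values, keeping ints as ints
df = pd.DataFrame([row[1:] for row in data], columns=['a', 'b'],
                  index=[row[0] for row in data])
```

This is still a single pd.DataFrame call, and the 'a' and 'b' columns stay integer-typed.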
Another solution:
data = [["aa", 1, 2], ["bb", 3, 4], ["cc", 30, 40]]
cols = ['a','b']
L = list(zip(*data))
print (L)
[('aa', 'bb', 'cc'), (1, 3, 30), (2, 4, 40)]
df = pd.DataFrame(dict(zip(cols, L[1:])), index=L[0])
print (df)
a b
aa 1 2
bb 3 4
cc 30 40
Related
I am trying to create a new column in a dataframe and populate it with a value from another dataframe's column, matched on a column common to both dataframes.
DF1         DF2
A  B        W  B
-----       -----
Y  2        X  2
N  4        F  4
Y  5        T  5
I thought the following could do the trick:
df2['new_col'] = df1['A'] if df1['B'] == df2['B'] else "Not found"
So the result should be:
DF2
W  B  new_col
X  2  Y     <- because DF1['B'] == 2 and the value in the same row is Y
F  4  N
T  5  Y
but I get the error below; I believe that is because the dataframes are different sizes?
raise ValueError("Can only compare identically-labeled Series objects")
Can you help me understand what I am doing wrong, and what is the best way to achieve what I am after?
Thank you in advance.
UPDATE 1
Trying Corralien's solution, I still get the below:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
This is the code I wrote
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2.reset_index().merge(df1.reset_index(), on=['b'], how='left') \
   .drop(columns='index').rename(columns={'One': 'new_col'})
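(For what it's worth, this ValueError is a dtype mismatch: np.array on mixed types turns every value in df1 into a string, so df1['b'] holds strings while df2['b'] holds ints. Casting 'b' back to int before the merge makes it go through — a sketch:)

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   columns=['a', 'b', 'c'])

# np.array coerced everything in df1 to strings; cast 'b' back to int
df1['b'] = df1['b'].astype(int)
out = df2.merge(df1[['b', 'One']], on='b', how='left') \
         .rename(columns={'One': 'new_col'})
```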
UPDATE 2
Here is the second option, but it does not seem to add columns in df2.
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2 = df2.set_index('b', append=True).join(df1.set_index('b', append=True)) \
         .reset_index('b').rename(columns={'One': 'new_col'})
print(df2)
b a c new_col Three
0 2 1 3 NaN NaN
1 5 4 6 NaN NaN
2 8 7 9 NaN NaN
Why is the code above not working?
Your question is unclear: why is F associated with N, and T with Y? Why not F with Y and T with N?
Using merge:
>>> df2.merge(df1, on='B', how='left')
W B A
0 X 2 Y
1 F 4 N # What you want
2 F 4 Y # Another solution
3 T 4 N # What you want
4 T 4 Y # Another solution
How do you decide on the right value? With row index?
Update
So you need to use the index position:
>>> df2.reset_index().merge(df1.reset_index(), on=['index', 'B'], how='left') \
.drop(columns='index').rename(columns={'A': 'new_col'})
W B new_col
0 X 2 Y
1 F 4 N
2 T 4 Y
In fact you can consider the column B as an additional index of each dataframe.
Using join
>>> df2.set_index('B', append=True).join(df1.set_index('B', append=True)) \
.reset_index('B').rename(columns={'A': 'new_col'})
B W new_col
0 2 X Y
1 4 F N
2 4 T Y
Setup:
df1 = pd.DataFrame([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]],
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
columns=['a', 'b', 'c'])
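With this setup (plain lists, so 'b' already has the same int dtype in both frames, unlike the np.array versions in the updates), the index-aware merge translates to the lowercase column names like so — a sketch:

```python
import pandas as pd

df1 = pd.DataFrame([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]],
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   columns=['a', 'b', 'c'])

# merge on both the row position and 'b', then tidy up
out = df2.reset_index().merge(df1.reset_index(), on=['index', 'b'], how='left') \
         .drop(columns='index').rename(columns={'One': 'new_col'})
```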
I have a dataframe below:
df = {'a': [1, 2, 3],
'b': [77, 88, 99],
'c1': [1, 1, 1],
'c2': [2, 2, 2],
'c3': [3, 3, 3]}
df = pd.DataFrame(df)
and a function:
def test_function(row):
return row['b']
How can I apply this function on the 'c' columns (i.e. c1, c2 and c3), BUT only for specific rows whose 'a' value matches the 2nd character of the 'c' columns?
For example, for the first row, the value of 'a' is 1, so for the first row, I would like to apply this function on column 'c1'.
For the second row, the value of 'a' is 2, so for the second row, I would like to apply this function on column 'c2'. And so forth for the rest of the rows.
The desired end result should be:
df_final = {'a': [1, 2, 3],
'b': [77, 88, 99],
'c1': [77, 1, 1],
'c2': [2, 88, 2],
'c3': [3, 3, 99]}
df_final = pd.DataFrame(df_final)
Use Series.mask: compare the c columns (selected with DataFrame.filter) and, where they match, replace the values with those from column b:
c_cols = df.filter(like='c').columns
def test_function(row):
    # for single-digit suffixes (0-9):
    # m = c_cols.str[1].astype(int) == row['a']
    # for multi-digit suffixes:
    m = c_cols.str.extract(r'(\d+)', expand=False).astype(int) == row['a']
    row[c_cols] = row[c_cols].mask(m, row['b'])
    return row
df = df.apply(test_function, axis=1)
print (df)
a b c1 c2 c3
0 1 77 77 2 3
1 2 88 1 88 3
2 3 99 1 2 99
Non-loop, faster alternative with broadcasting:
arr = c_cols.str.extract(r'(\d+)', expand=False).astype(int).to_numpy()
m = df['a'].to_numpy()[:, None] == arr
df[c_cols] = df[c_cols].mask(m, df['b'], axis=0)
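Since df above was already modified by the apply-based solution, here is the broadcasting approach as a self-contained sketch (the mask is built rows-by-c-columns so it lines up with df[c_cols]):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [77, 88, 99],
                   'c1': [1, 1, 1], 'c2': [2, 2, 2], 'c3': [3, 3, 3]})
c_cols = df.filter(like='c').columns

# digit parsed from each 'c' column name: [1, 2, 3]
digits = c_cols.str.extract(r'(\d+)', expand=False).astype(int).to_numpy()
# (rows, c-columns) mask: True where the row's 'a' matches the column's digit
m = df['a'].to_numpy()[:, None] == digits
df[c_cols] = df[c_cols].mask(m, df['b'], axis=0)
```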
I have 3 arrays of equal length (e.g.):
[a, b, c]
[1, 2, 3]
[i, ii, iii]
I would like to combine them into a matrix:
|a, 1, i |
|b, 2, ii |
|c, 3, iii|
The problem I have is that when I use functions such as dstack, hstack or concatenate, I get them numerically added or stacked in a fashion that I can't work with.
You could use zip(), which pairs up the elements at the same index across multiple containers so that they can be handled as a single entity:
a1 = ['a', 'b', 'c']
b1 = ['1', '2', '3']
c1 = ['i', 'ii', 'iii']
print(list(zip(a1,b1,c1)))
OUTPUT:
[('a', '1', 'i'), ('b', '2', 'ii'), ('c', '3', 'iii')]
EDIT:
Taking it a step further: how about flattening the list afterwards and then using numpy.reshape?
res = list(zip(a1, b1, c1))
flattened_list = []
# flatten the list
for x in res:
    for y in x:
        flattened_list.append(y)
#print(flattened_list)
import numpy as np
data = np.array(flattened_list)
shape = (3, 3)
print(data.reshape( shape ))
OUTPUT:
[['a' '1' 'i']
['b' '2' 'ii']
['c' '3' 'iii']]
OR
for the one-liner fans out there:
res = list(zip(a1, b1, c1))
# flatten the list in one pass
flattened_list = [y for x in res for y in x]
print([flattened_list[i:i+3] for i in range(0, len(flattened_list), 3)])
OUTPUT:
[['a', '1', 'i'], ['b', '2', 'ii'], ['c', '3', 'iii']]
OR
As suggested by #norok2
print(list(zip(*zip(a1, b1, c1))))
OUTPUT:
[('a', 'b', 'c'), ('1', '2', '3'), ('i', 'ii', 'iii')]
Assuming that you have 3 numpy arrays:
>>> a, b, c = np.random.randint(0, 9, 9).reshape(3, 3)
>>> print(a, b, c)
[4 1 4] [5 8 5] [3 0 2]
then you can stack them vertically (i.e. along the first dimension), and then transpose the resulting matrix to get the order you need:
>>> np.vstack((a, b, c)).T
array([[4, 5, 3],
[1, 8, 0],
[4, 5, 2]])
A slightly more verbose example is to instead stack horizontally, but this requires that your arrays are made into 2D using reshape:
>>> np.hstack((a.reshape(3, 1), b.reshape(3, 1), c.reshape(3, 1)))
array([[4, 5, 3],
[1, 8, 0],
[4, 5, 2]])
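np.column_stack does that reshaping for you and gives the same result — a sketch, using fixed arrays in place of the random ones above:

```python
import numpy as np

a = np.array([4, 1, 4])
b = np.array([5, 8, 5])
c = np.array([3, 0, 2])

# stacks 1-D arrays as columns of a single 2-D array
out = np.column_stack((a, b, c))
```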
this gives you a list of tuples, which might not be what you want:
>>> list(zip([1,2,3],[4,5,6],[7,8,9]))
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
this gives you a numpy array:
>>> from numpy import array
>>> array([[1,2,3],[4,5,6],[7,8,9]]).transpose()
array([[1, 4, 7],
[2, 5, 8],
[3, 6, 9]])
If you have different data types in each array, then it would make sense to use pandas for this:
# Iterative approach, using concat
import pandas as pd
my_arrays = [['a', 'b', 'c'], [1, 2, 3], ['i', 'ii', 'iii']]
df1 = pd.concat([pd.Series(array) for array in my_arrays], axis=1)
# Named arrays
array1 = ['a', 'b', 'c']
array2 = [1, 2, 3]
array3 = ['i', 'ii', 'iii']
df2 = pd.DataFrame({'col1': array1,
'col2': array2,
'col3': array3})
Now you have the structure you desired, with appropriate data types for each column:
print(df1)
# 0 1 2
# 0 a 1 i
# 1 b 2 ii
# 2 c 3 iii
print(df2)
# col1 col2 col3
# 0 a 1 i
# 1 b 2 ii
# 2 c 3 iii
print(df1.dtypes)
# 0 object
# 1 int64
# 2 object
# dtype: object
print(df2.dtypes)
# col1 object
# col2 int64
# col3 object
# dtype: object
You can extract the NumPy array with the .to_numpy() method (.values also works, but .to_numpy() is preferred in modern pandas):
df1.to_numpy()
# array([['a', 1, 'i'],
# ['b', 2, 'ii'],
# ['c', 3, 'iii']], dtype=object)
I have a dataframe with a column 'name' of the form ['A','B','C',A','B','B'....] and a set of arrays: one corresponding to 'A', say array_A = [0, 1, 2 ...] and array_B = [3, 1, 0 ...], array_C etc...
I want to create a new column 'value' by assigning array_A where the row name in the dataframe is 'A', and similarly for 'B' and 'C'.
The expression df['value'] = np.where(df['name']=='A', array_A, df['value']) won't do it, because it would overwrite the values for other names or run into dimensionality issues.
For example:
arrays = {'A': np.array([0, 1, 2]),
'B': np.array([3, 1])}
Desired output:
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})
name value
0 A 0
1 B 3
2 A 1
3 A 2
4 B 1
You can use a for loop with a dictionary:
arrays = {'A': np.array([0, 1, 2]),
'B': np.array([3, 1])}
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})
for k, v in arrays.items():
df.loc[df['name'] == k, 'value'] = v
df['value'] = df['value'].astype(int)
print(df)
name value
0 A 0
1 B 3
2 A 1
3 A 2
4 B 1
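An alternative sketch, if you'd rather avoid mutating df inside a loop: number each row within its own name group with groupby.cumcount and use that as the position into the matching array:

```python
import numpy as np
import pandas as pd

arrays = {'A': np.array([0, 1, 2]),
          'B': np.array([3, 1])}
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})

# position of each row within its own 'name' group: [0, 0, 1, 2, 1]
pos = df.groupby('name').cumcount()
df['value'] = [arrays[n][i] for n, i in zip(df['name'], pos)]
```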
Let's say I have two dataframes: df with columns ('a', 'b', 'c') and tf with columns ('a', 'b'). I do a group-combine on the two common columns in df:
grouped_sum = df.groupby(['a', 'b']).sum()
How can I "add" the column c to tf according to grouped_sum, i.e.
tf[i]['c'] = grouped_sum[tf[i]['a'], tf[i]['b']]
for all rows i of the second data frame? For a groupby with a single level it works simply by indexing the group with the corresponding column of tf.
If you groupby with as_index=False you can merge with tf:
In [11]: tf = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 4], [3, 4, 5]], columns=list('abc'))
In [13]: grouped_sum = df.groupby(['a', 'b'], as_index=False).sum()
In [14]: grouped_sum
Out[14]:
a b c
0 1 2 7
1 3 4 5
In [15]: tf.merge(grouped_sum) # this won't always be the same as grouped_sum!
Out[15]:
a b c
0 1 2 7
1 3 4 5
Another option is to set a and b as the index of tf.
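That index-based option might look like this (a sketch, using the same df and tf as above):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 2, 4], [3, 4, 5]], columns=list('abc'))
tf = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))

grouped_sum = df.groupby(['a', 'b']).sum()
# the (a, b) MultiIndex on both sides lets join line the rows up
tf = tf.set_index(['a', 'b']).join(grouped_sum).reset_index()
```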