Pandas/NumPy: converting data to a column - python

I have a data result that when I print it looks like
>>>print(result)
[[0]
[1]
[0]
[0]
[1]
[0]]
I guess that's about the same as [[0], [1], [0], [0], [1], [0]], which seems a bit odd; [0, 1, 0, 0, 1, 0] would be a more logical representation, but somehow it's not like that.
I would like to add these values as a single column to a pandas DataFrame df.
I tried several ways to join it to my dataframe:
df = pd.concat(df,result)
df = pd.concat(df,{'result' =result})
df['result'] =pd.aply(result, axis=1)
with no luck. How can I do it?

There are multiple ways to flatten your data:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.rand(6,2))
result = np.array([0,1,0,0,1,0])[:, None]
print (result)
[[0]
[1]
[0]
[0]
[1]
[0]]
df['result'] = result[:,0]
df['result1'] = result.ravel()
#df['result1'] = np.concatenate(result)
print (df)
          0         1  result  result1
0  0.098767  0.933861       0        0
1  0.532177  0.610121       1        1
2  0.288742  0.718452       0        0
3  0.520980  0.367746       0        0
4  0.253658  0.011994       1        1
5  0.662878  0.846113       0        0
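To see why the printed result looks like a stack of one-element rows, it can help to check its shape. The sketch below assumes the asker's result is a NumPy array of shape (6, 1); the frame with column "x" is only a placeholder for a real dataframe with six rows.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the asker's data: a column vector of shape (6, 1).
result = np.array([[0], [1], [0], [0], [1], [0]])
print(result.shape)  # (6, 1): six rows, one column

# reshape(-1) returns the same values as a 1-D array.
flat = result.reshape(-1)
print(flat.shape)  # (6,)

# Placeholder frame; any frame with six rows would do.
df = pd.DataFrame({"x": range(6)})
df["result"] = flat
print(df["result"].tolist())  # [0, 1, 0, 0, 1, 0]
```

Once the array is 1-D and its length equals len(df), a plain column assignment is enough.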

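If you want to put that data into a pandas DataFrame column in flat form, the following is a simple way. Note that sum(..., []) only concatenates plain nested lists, so a NumPy array has to be converted with tolist() first:
df["result"] = sum(result.tolist(), [])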

As long as the number of data points in this list is the same as the number of rows of the dataframe this should work:
import pandas as pd
your_data = [[0],[1],[0],[0],[1],[0]]
df = pd.DataFrame() # skip and use your own dataframe with len(df) == len(your_data)
df['result'] = [i[0] for i in your_data]

Related

How to convert a nested list of keys to a dummies-like dataframe

How to convert following list to a pandas dataframe?
my_list = [["A","B","C"],["A","B","D"]]
And as an output I would like to have a dataframe like:
Index  A  B  C  D
1      1  1  1  0
2      1  1  0  1
You can craft Series and concatenate them:
import pandas as pd
my_list = [["A","B","C"],["A","B","D"]]
df = (pd.concat([pd.Series(1, index=l, name=i+1)
                 for i,l in enumerate(my_list)], axis=1)
        .T
        .fillna(0).astype(int) # optional: fill missing keys with 0 and restore integer dtype
     )
or with get_dummies:
df = pd.get_dummies(pd.DataFrame(my_list), dtype=int)
# group the prefixed dummy columns (0_A, 1_B, ...) by the part after the first underscore
df = df.T.groupby(df.columns.str.split('_', n=1).str[-1]).max().T
output:
A B C D
1 1 1 1 0
2 1 1 0 1
I'm unsure how those two structures relate. my_list is a list of two lists, ["A","B","C"] and ["A","B","D"].
If you want a data frame like the table you show, I would suggest building a dictionary of the values first, then converting it into a pandas DataFrame.
my_dict = {"A":[1,1], "B":[1,1], "C": [1,0], "D":[0,1]}
my_df = pd.DataFrame(my_dict)
print(my_df)
Output:
   A  B  C  D
0  1  1  1  0
1  1  1  0  1
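If writing the dictionary by hand is not practical, it could also be derived from my_list. This is only a sketch: the names keys and my_dict and the 1-based index are illustrative choices made to match the table in the question.

```python
import pandas as pd

my_list = [["A", "B", "C"], ["A", "B", "D"]]

# Collect every distinct key, sorted so the column order is stable.
keys = sorted({key for row in my_list for key in row})

# One 0/1 list per key: 1 if the key occurs in that row, else 0.
my_dict = {key: [int(key in row) for row in my_list] for key in keys}

# Use a 1-based index to mirror the desired table.
my_df = pd.DataFrame(my_dict, index=range(1, len(my_list) + 1))
print(my_df)
```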

How to Merge Multiple pandas DataFrames into an Array for each Column Value Based on Another Column Value

I have several pandas DataFrames that I would like to merge. When I merge them, I would like the values that share the same key to be collected into an array.
For example, I would like to merge two data frames on rows that have the same value in a specified column, so that the merged data becomes an array of values.
df1 =
A Value
0 x 0
1 y 0
df2 =
A Value
0 x 1
1 y 1
2 z 1
After Combining:
df =
A Number_Value
0 x [0, 1]
1 y [0, 1]
2 z [, 1]
I do not believe the merge() or concat() call would be appropriate. I thought calling .to_numpy() would be able to do this, if I were to convert each value in each row to an array, but that does not seem to work.
Use concat with aggregate list:
df = pd.concat([df1, df2]).groupby('A', as_index=False).agg(list)
print (df)
A Value
0 x [0, 1]
1 y [0, 1]
2 z [1]
To check which DataFrames lack an A column:
L = [df1, df2]
print ([x for x in L if 'A' not in x.columns])
EDIT: To get '' for missing values, pass it as the fill_value parameter:
L = [df1, df2]
df = pd.concat(L, keys=range(len(L))).reset_index(level=1, drop=True).set_index('A', append=True)
mux = pd.MultiIndex.from_product(df.index.levels)
df = df.reindex(mux, fill_value='').groupby('A').agg(list).reset_index()
print (df)
A Value
0 x [0, 1]
1 y [0, 1]
2 z [, 1]
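For reference, here is the concat-and-aggregate approach as a self-contained sketch on the question's sample frames. The rename to Number_Value is an assumption taken from the desired output in the question, and this version does not pad missing entries.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["x", "y"], "Value": [0, 0]})
df2 = pd.DataFrame({"A": ["x", "y", "z"], "Value": [1, 1, 1]})

combined = (
    pd.concat([df1, df2])              # stack the frames on top of each other
      .groupby("A", as_index=False)    # one group per key in column A
      .agg(list)                       # collect each group's values into a list
      .rename(columns={"Value": "Number_Value"})
)
print(combined)
```

Keys that appear in only one frame, such as z, simply end up with a shorter list.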

Get index of DataFrame for rows that are identical to elements of an array

I have a DataFrame (temp) and an array (x), whose elements correspond to some of the lines of the DataFrame. I want to get the indexes of the DataFrame whose corresponding records are identical to the elements of the array:
For example:
temp = pd.DataFrame({"A": [1,2,3,4], "B": [4,5,6,7], "C": [7,8,9,10]})
A B C
0 1 4 7
1 2 5 8
2 3 6 9
3 4 7 10
x = np.array([[1,4,7], [3,6,9]])
It should return the indexes: 0 and 2.
I was trying unsuccessfully with this:
temp.loc[temp.isin(x[0])].index
Using numpy broadcasting:
array = temp.to_numpy()[:, None]
mask = (array == x).all(axis=-1).any(axis=-1)
temp.index[mask]
I would convert to a MultiIndex and then use isin with np.where:
i = pd.MultiIndex.from_frame(temp[['A','B','C']])
out = np.where(i.isin(pd.MultiIndex.from_arrays(x.T)))[0]
print(out)
#[0 2]
Or with merge:
cols = ['A','B','C']
out = temp.reset_index().merge(pd.DataFrame(x,columns=cols)).loc[:,'index'].tolist()
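Or with np.isin and all (note that this tests each element for membership anywhere in x, so a row that mixes values from different rows of x would also match):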
out = temp.index[np.isin(temp[['A','B','C']],x).all(1)]
Since you need to match entire rows of the DataFrame to rows in the numpy array, you can convert the DataFrame to an array and then use enumerate to loop and return the indices:
temp_arr = temp.to_numpy()
for idx, row in enumerate(temp_arr):
    # compare the whole row against each row of x ("row in x" would only test single elements)
    if (x == row).all(axis=1).any():
        print(idx)
Output:
0
2
A more concise way, using a list comprehension, would be:
idx_list = [i for i, row in enumerate(temp_arr) if (x == row).all(axis=1).any()]
print(idx_list)
Output:
[0, 2]
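The whole-row comparison can be wrapped in a small helper. This is a sketch using the question's data; the function name matching_index and the extra frame mixed are made up for illustration. The second call shows that a row combining values from different rows of x is not reported.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({"A": [1, 2, 3, 4], "B": [4, 5, 6, 7], "C": [7, 8, 9, 10]})
x = np.array([[1, 4, 7], [3, 6, 9]])

def matching_index(frame, rows):
    """Return the index labels of frame rows equal to some row of `rows`."""
    values = frame.to_numpy()[:, None, :]      # shape (n, 1, k)
    # Compare every frame row with every candidate row: shape (n, m, k).
    equal = values == rows[None, :, :]
    # A match needs all k columns equal for at least one candidate row.
    mask = equal.all(axis=2).any(axis=1)
    return frame.index[mask]

print(matching_index(temp, x).tolist())   # [0, 2]

# Hypothetical row mixing values from both rows of x: must not match.
mixed = pd.DataFrame({"A": [1], "B": [6], "C": [9]})
print(matching_index(mixed, x).tolist())  # []
```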

Pandas Dataframe convert column of lists to multiple columns

I am trying to convert a dataframe with a column that holds lists of various sizes, for example something like this:
d={'A':[1,2,3],'B':[[1,2,3],[3,5],[4]]}
df = pd.DataFrame(data=d)
df
to something like this:
d1={'A':[1,2,3],'B-1':[1,0,0],'B-2':[1,0,0],'B-3':[1,1,0],'B-4':[0,0,1],'B-5':[0,1,0]}
df1 = pd.DataFrame(data=d1)
df1
Thank you for the help
Explode the lists, then call get_dummies and sum over the original index (use max [credit to @JonClements] if you want true dummies rather than counts, in case a value can appear more than once in a list). Then join the result back:
dfB = pd.get_dummies(df['B'].explode(), dtype=int).groupby(level=0).sum().add_prefix('B-')
#dfB = pd.get_dummies(df['B'].explode(), dtype=int).groupby(level=0).max().add_prefix('B-')
df = pd.concat([df['A'], dfB], axis=1)
# A B-1 B-2 B-3 B-4 B-5
#0 1 1 1 1 0 0
#1 2 0 0 1 0 1
#2 3 0 0 0 1 0
You can use pop to remove the column you explode, so you don't need to specify df[list_of_all_columns_except_B] in the concat:
df = pd.concat([df, pd.get_dummies(df.pop('B').explode(), dtype=int).groupby(level=0).sum().add_prefix('B-')],
               axis=1)
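Put together as a runnable sketch on the question's data, the explode and get_dummies steps might look like this. It uses max so that the result holds 0/1 indicators, and the names dummies and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [[1, 2, 3], [3, 5], [4]]})

dummies = (
    pd.get_dummies(df["B"].explode(), dtype=int)  # one row per list element
      .groupby(level=0)                           # regroup by the original row label
      .max()                                      # 1 if the value occurs in that row
      .add_prefix("B-")
)

out = pd.concat([df[["A"]], dummies], axis=1)
print(out)
```

Grouping on level=0 works because explode repeats the original index label for every element of a list.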

Python : How do you filter out columns from a dataset based on substring match in Column names

df_train = pd.read_csv('../xyz.csv')
headers = df_train.columns
I want to filter out the columns in headers whose names contain the substring _pct.
Use df.filter
df = pd.DataFrame({'a':[1,2,3], 'b_pct':[1,2,3],'c_pct':[1,2,3],'d':[1]*3})
print(df.filter(items=[i for i in df.columns if '_pct' not in i]))
## or as jezrael suggested
# print(df[[i for i in df.columns if '_pct' not in i]])
Output:
a d
0 1 1
1 2 1
2 3 1
Use:
#data from AkshayNevrekar answer
df = df.loc[:, ~df.columns.str.contains('_pct')]
print (df)
A solution with filter is less straightforward, because it needs a negative lookahead:
df = df.filter(regex=r'^(?!.*_pct).*$')
a d
0 1 1
1 2 1
2 3 1
Thank you, @IanS, for the other solutions:
df[df.columns.difference(df.filter(like='_pct').columns).tolist()]
df.drop(df.filter(like='_pct').columns, axis=1)
Since df.columns is an iterable Index of the column names, you can use a list comprehension and build your new list with a simple condition:
new_headers = [x for x in headers if '_pct' not in x]
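The list of kept names can then be used to select the columns. A minimal sketch, assuming a small made-up frame in place of the CSV from the question:

```python
import pandas as pd

# Hypothetical frame standing in for the data read from the CSV file.
df_train = pd.DataFrame(
    {"a": [1, 2, 3], "b_pct": [1, 2, 3], "c_pct": [1, 2, 3], "d": [1, 1, 1]}
)

# Keep only the column names that do not contain the substring "_pct".
new_headers = [col for col in df_train.columns if "_pct" not in col]

# Selecting with a list of names returns a frame with just those columns.
df_filtered = df_train[new_headers]
print(df_filtered.columns.tolist())  # ['a', 'd']
```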
