I have a single-column DataFrame that is indexed with keys of the form i_n, where i and n are strings (for the sake of this example, i is an integer number and n is a character). This would be a simple example:
values
0_a 0.583772
1_a 0.782358
2_a 0.766844
3_a 0.072565
4_a 0.576667
0_b 0.503876
1_b 0.352815
2_b 0.512834
3_b 0.070908
4_b 0.074875
0_c 0.361226
1_c 0.526089
2_c 0.299183
3_c 0.895878
4_c 0.874512
Now I would like to re-arrange this DataFrame to be 2D such that the number (the part of the index name before the underscore) serves as column name and the character (the part of the index after the underscore) serves as index:
0 1 2 3 4
a 0.583772 0.782358 0.766844 0.0725654 0.576667
b 0.503876 0.352815 0.512834 0.0709081 0.0748752
c 0.361226 0.526089 0.299183 0.895878 0.874512
I have a solution for the problem (the function convert_2d below), but I was wondering, whether there would be a more idiomatic way to achieve this. Here the code that was used to generate the original DataFrame and to convert it to the desired form:
import pandas as pd
import numpy as np
def convert_2d(df):
    # target frame: rows 'a'..'c', columns '0'..'4' (as strings, matching the keys)
    df2 = pd.DataFrame(columns=['a', 'b', 'c'], index=[str(i) for i in range(5)]).T
    names = set(idx.split('_')[1] for idx in df.index)
    numbers = set(idx.split('_')[0] for idx in df.index)
    for i in numbers:
        for n in names:
            df2.loc[n, i] = df['values']['{}_{}'.format(i, n)]
    return df2
## generating 1d example data:
data = np.random.rand(15)
indices = ['{}_{}'.format(i, n) for n in ['a', 'b', 'c'] for i in range(5)]
df = pd.DataFrame(data, columns=['values'], index=indices)
print(df)

## converting to 2d
print(convert_2d(df))
Some notes about the index keys: it can be assumed (like in my function) that there are no 'missing keys' (i.e. a 2d array can always be achieved) and the only thing that can be taken for granted about the keys is the (single) underscore (i.e. the numbers and letters were only chosen for explanatory reasons, in reality there would be just two arbitrary strings connected by the underscore).
IIUC, create a MultiIndex from the split keys, then unstack:
df.index=pd.MultiIndex.from_tuples(df.index.str.split('_').map(tuple))
df['values'].unstack(level=0)
Out[65]:
0 1 2 3 4
a 0.583772 0.782358 0.766844 0.072565 0.576667
b 0.503876 0.352815 0.512834 0.070908 0.074875
c 0.361226 0.526089 0.299183 0.895878 0.874512
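For completeness, a self-contained sketch of this approach with made-up values (the actual numbers in the question come from np.random.rand):

```python
import pandas as pd

# single-column frame indexed by 'number_letter' keys (made-up values)
df = pd.DataFrame(
    {'values': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
    index=['0_a', '1_a', '2_a', '0_b', '1_b', '2_b'],
)

# split each key on the underscore into a (number, letter) MultiIndex
df.index = pd.MultiIndex.from_tuples(df.index.str.split('_').map(tuple))

# move the first index level (the number) into the columns
result = df['values'].unstack(level=0)
print(result)
```

Note that the resulting column labels are the strings '0'..'2', since they come from splitting the string index.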
I have a 1st DataFrame with column 'X' as:
X
A468593-3
A697269-2
A561044-2
A239882 04
A 2nd DataFrame with column 'Y' as:
Y
000A561044
000A872220
I would like to match substrings from both columns with a minimum number of characters (for example, 7 characters; only alphanumeric characters should be considered for matching, and all special characters excluded).
so, my output DataFrame should be like this
X
A561044-2
Any possible solution would be highly appreciated.
Thanks in advance.
IIUC, and assuming that the first three characters of each Y value are zeros, you can slice Y with [3:] to remove those leading zeros. Then join the resulting values with |. Finally, create a mask using str.contains, which checks whether a series contains a specified pattern (here something like 'A|B', matching values that contain 'A' or 'B'). This mask can then be used to filter your other data frame.
Code:
import pandas as pd
df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})
mask = df1["X"].str.contains(f'({"|".join(df2["Y"].str[3:])})')
df1.loc[mask]
Output:
X
2 A561044-2
If you have values in Y that do not start with exactly three zeros, you can use this function to transform your column and remove all leading numeric characters.
def remove_first_numerics(s):
    # advance past leading numeric characters, then return the rest
    counter = 0
    while s[counter].isnumeric():
        counter += 1
    return s[counter:]
df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
df_test["A"].apply(lambda s: remove_first_numerics(s))
Output:
0 Abd3Dc
1 Adv3Dc
2 d31oVgZ
3 dZb1B
4 CCcDx10
Name: A, dtype: object
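If I'm not mistaken, the same stripping of leading digits can also be done in a vectorized way with str.lstrip (a sketch of an alternative, not the approach above):

```python
import pandas as pd

df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})

# str.lstrip with a set of digit characters removes only *leading* digits,
# so trailing digits like the "10" in "CCcDx10" are kept
stripped = df_test["A"].str.lstrip("0123456789")
print(stripped.tolist())
```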
I have two numpy matricies of the same shape.
In one of them each column contains all 0's except for a 1.
In the other matrix each column contains random numbers.
My goal is to count the number of columns for which the position of the 1 in the column of the first matrix corresponds with the position of the highest element in the column of the second matrix.
For example:
a = [[1,0],
[0,1]]
b = [[2,3],
[3,5]]
myFunc(a,b)
would yield 1 since the argmax of the first column in b is not the same as in a but it is the same in the second column.
My solution was to iterate over the columns, check whether the argmax was the same, store that in a list, and then sum it at the end, but this doesn't take advantage of numpy's speed. Is there a faster way to do this? Thanks!
This checks the indices of max in each column of b against indices of 1s in corresponding column of a and counts the matches:
(a.T.nonzero()[1]==b.argmax(axis=0)).sum()
output in your example:
1
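Put together as a runnable snippet with the example arrays from the question:

```python
import numpy as np

a = np.array([[1, 0],
              [0, 1]])
b = np.array([[2, 3],
              [3, 5]])

# a.T.nonzero()[1] gives, in column order, the row index of the 1
# in each column of a
ones_rows = a.T.nonzero()[1]
# row index of the per-column maximum of b
max_rows = b.argmax(axis=0)

matches = (ones_rows == max_rows).sum()
print(matches)  # 1
```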
Given that there will only be a single 1 in each column of the first array, you should just be able to compare whether the argmax is at the same position:
def myfunc(binary_array, value_array):
    return np.sum(binary_array.argmax(axis=1) == value_array.argmax(axis=1))
a = np.array([[1,0],
[0,1]])
b = np.array([[2,3],
[3,5]])
myfunc(a,b)
1
c=np.array([[0,1,0],[1,0,0],[0,0,1]])
d=np.array([[1,2,3],[2,2,3],[1,3,4]])
myfunc(c,d)
1
e=np.array([[0,1,0],[0,0,1],[0,0,1]])
f=np.array([[1,2,3],[2,2,3],[1,3,4]])
myfunc(e,f)
2
import numpy as np
import pandas as df
from numpy import asarray
from numpy import save
files=np.load('arr.npy',allow_pickle=True)
#print(files)
data=df.DataFrame(files)
type(data)
rr=data.shape[0]
for i in range(0, rr):
    res = data[0][i]
After running this, the res variable contains only the last element, but I want all the values. So tell me, how do I store all the 2D matrix values in Python?
The data variable is the DataFrame; it contains 9339 rows and 2 columns, but I want the 1st column, where each entry is a 32x32 matrix. How do I store those values in the res variable?
Notice that res = data[0][i] initializes a new variable res on the first iteration of the loop (when i is 0), but then keeps reassigning its value to the value in the next row (staying on column 0).
I'm not sure exactly what you want, but it sounds like you just want the first column in a separate variable? Here is how to get the first column, as a pandas Series and/or a plain list, using a smaller example (9 rows and 2 columns):
import pandas as pd
import numpy as np

random_data = np.random.rand(9, 2)
data_df = pd.DataFrame(random_data)
print(data_df)
# this gets the first column as a pandas series. Change index from 0 to get another column.
print('\nfirst column:')
first_col = data_df[data_df.columns[0]]
print(first_col)
# if you want a plain list instead of a series
print('\nfirst column as list:')
print(first_col.tolist())
Output:
0 1
0 0.218237 0.323922
1 0.806697 0.371456
2 0.526571 0.993491
3 0.403947 0.299652
4 0.753333 0.542269
5 0.365885 0.534462
6 0.404583 0.514687
7 0.298897 0.637910
8 0.453891 0.234333
first column:
0 0.218237
1 0.806697
2 0.526571
3 0.403947
4 0.753333
5 0.365885
6 0.404583
7 0.298897
8 0.453891
Name: 0, dtype: float64
first column as list:
[0.21823726509923325, 0.8066974875381492, 0.526571422644495, 0.40394686954663594, 0.7533330239460391, 0.36588470364914194, 0.4045827678891364, 0.2988970490642284, 0.45389073978613426]
I'm confused about the syntax regarding the following line of code:
x_values = dataframe[['Brains']]
The dataframe object consists of 2 columns (Brains and Bodies)
Brains Bodies
42 34
32 23
When I print x_values I get something like this:
Brains
0 42
1 32
I'm aware of the pandas documentation as far as attributes and methods of the dataframe object are concerned, but the double bracket syntax is confusing me.
Consider this:
Source DF:
In [79]: df
Out[79]:
Brains Bodies
0 42 34
1 32 23
Selecting one column - results in Pandas.Series:
In [80]: df['Brains']
Out[80]:
0 42
1 32
Name: Brains, dtype: int64
In [81]: type(df['Brains'])
Out[81]: pandas.core.series.Series
Selecting subset of DataFrame - results in DataFrame:
In [82]: df[['Brains']]
Out[82]:
Brains
0 42
1 32
In [83]: type(df[['Brains']])
Out[83]: pandas.core.frame.DataFrame
Conclusion: the second approach allows us to select multiple columns from the DataFrame; the first one only selects a single column.
Demo:
In [84]: df = pd.DataFrame(np.random.rand(5,6), columns=list('abcdef'))
In [85]: df
Out[85]:
a b c d e f
0 0.065196 0.257422 0.273534 0.831993 0.487693 0.660252
1 0.641677 0.462979 0.207757 0.597599 0.117029 0.429324
2 0.345314 0.053551 0.634602 0.143417 0.946373 0.770590
3 0.860276 0.223166 0.001615 0.212880 0.907163 0.437295
4 0.670969 0.218909 0.382810 0.275696 0.012626 0.347549
In [86]: df[['e','a','c']]
Out[86]:
e a c
0 0.487693 0.065196 0.273534
1 0.117029 0.641677 0.207757
2 0.946373 0.345314 0.634602
3 0.907163 0.860276 0.001615
4 0.012626 0.670969 0.382810
and if we specify only one column in the list we will get a DataFrame with one column:
In [87]: df[['e']]
Out[87]:
e
0 0.487693
1 0.117029
2 0.946373
3 0.907163
4 0.012626
There is no special syntax in Python for [[ and ]]. Rather, a list is being created, and then that list is being passed as an argument to the DataFrame indexing function.
As per #MaxU's answer, if you pass a single string to a DataFrame a series that represents that one column is returned. If you pass a list of strings, then a DataFrame that contains the given columns is returned.
So, when you do the following
# Print "Brains" column as Series
print(df['Brains'])
# Return a DataFrame with only one column called "Brains"
print(df[['Brains']])
It is equivalent to the following
# Print "Brains" column as Series
column_to_get = 'Brains'
print(df[column_to_get])
# Return a DataFrame with only one column called "Brains"
subset_of_columns_to_get = ['Brains']
print(df[subset_of_columns_to_get])
In both cases, the DataFrame is being indexed with the [] operator.
Python uses the [] operator for both indexing and for constructing list literals, and ultimately I believe this is your confusion. The outer [ and ] in df[['Brains']] is performing the indexing, and the inner is creating a list.
>>> some_list = ['Brains']
>>> some_list_of_lists = [['Brains']]
>>> ['Brains'] == [['Brains']][0]
True
>>> 'Brains' == [['Brains']][0][0] == [['Brains'][0]][0]
True
What I am illustrating above is that at no point does Python ever see [[ and interpret it specially. In the last convoluted example ([['Brains'][0]][0]) there is no special ][ operator or ]][ operator... what happens is
A single-element list is created (['Brains'])
The first element of that list is indexed (['Brains'][0] => 'Brains')
That is placed into another list ([['Brains'][0]] => ['Brains'])
And then the first element of that list is indexed ([['Brains'][0]][0] => 'Brains')
Other solutions demonstrate the difference between a series and a dataframe. For the Mathematically minded, you may wish to consider the dimensions of your input and output. Here's a summary:
Object                        Series              DataFrame
Dimensions (obj.ndim)         1                   2
Syntax arg dim                0                   1
Syntax                        df['col']           df[['col']]
Max indexing dim              1                   2
Label indexing                df['col'].loc[x]    df.loc[x, 'col']
Label indexing (scalar)       df['col'].at[x]     df.at[x, 'col']
Integer indexing              df['col'].iloc[x]   df.iloc[x, 'col']
Integer indexing (scalar)     df['col'].iat[x]    df.iat[x, 'col']
When you specify a scalar or list argument to pd.DataFrame.__getitem__, for which [] is syntactic sugar, the dimension of your argument is one less than the dimension of your result. So a scalar (0-dimensional) gives a 1-dimensional series. A list (1-dimensional) gives a 2-dimensional dataframe. This makes sense since the additional dimension is the dataframe index, i.e. rows. This is the case even if your dataframe happens to have no rows.
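This dimension relationship can be checked directly with ndim, using a small frame like the one in the question:

```python
import pandas as pd

df = pd.DataFrame({'Brains': [42, 32], 'Bodies': [34, 23]})

print(df['Brains'].ndim)    # scalar key -> 1-dimensional Series
print(df[['Brains']].ndim)  # list key   -> 2-dimensional DataFrame
```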
[ ] and [[ ]] are also a NumPy concept.
Try to understand the basics of creating an np.array, use reshape, and check the result with ndim; then you'll understand.
Check my answer here: https://stackoverflow.com/a/70194733/7660981
In some transformations, I seem to be forced to break from the Pandas dataframe grouped object, and I would like a way to return to that object.
Given a dataframe of time series data, if one groups by one of the values in the dataframe, we are given an underlying dictionary from key to dataframe.
If you are forced to make a Python dict from this, the structure cannot be converted back into a DataFrame using .from_dict(), because the structure maps keys to DataFrames rather than to values.
The only way to go back to Pandas without some hacky column renaming is, to my knowledge, by converting it back to a grouped object.
Is there any way to do this?
If not, how would I convert a dictionary of key to DataFrame back into a Pandas data structure?
EDIT, adding a sample:
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a': pd.Series(np.random.randn(len(rng)), index=rng),
                   'b': pd.Series(np.random.randn(len(rng)), index=rng)})
# now we have a dataframe with 'a's and 'b's in time series

df_dict = {}
for k, v in df.groupby('a'):
    df_dict[k] = v
# now we apply some transformation that cannot be applied via aggregate, transform, or apply
# how do we get this back into a groupby object?
If I understand OP's question correctly, you want to group a dataframe by some key(s), do different operations on each group (possibly generating new columns, etc.) and then go back to the original dataframe.
Modifying your example (grouping by random integers instead of floats, which are usually unique):
np.random.seed(200)
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a':pd.Series(np.random.randn(len(rng)), index=rng), 'b':pd.Series(np.random.randn(len(rng)), index=rng)})
df['group'] = np.random.randint(3,size=(len(df)))
Usually, if I need a single value per column for each group, I'll do this (for example, sum of 'a', mean of 'b'):
In [10]: df.groupby('group').aggregate({'a':np.sum, 'b':np.mean})
Out[10]:
a b
group
0 -0.214635 -0.319007
1 0.711879 0.213481
2 1.111395 1.042313
[3 rows x 2 columns]
However, if I need a series for each group,
In [19]: def func(sub_df):
   ....:     sub_df['c'] = sub_df['a'] * sub_df['b'].shift(1)
   ....:     return sub_df
   ....:
In [20]: df.groupby('group').apply(func)
Out[20]:
a b group c
2000-01-31 -1.450948 0.073249 0 NaN
2000-11-30 1.910953 1.303286 2 NaN
2001-09-30 0.711879 0.213481 1 NaN
2002-07-31 -0.247738 1.017349 2 -0.322874
2003-05-31 0.361466 1.911712 2 0.367737
2004-03-31 -0.032950 -0.529672 0 -0.002414
2005-01-31 -0.221347 1.842135 2 -0.423151
2005-11-30 0.477257 -1.057235 0 -0.252789
2006-09-30 -0.691939 -0.862916 2 -1.274646
2007-07-31 0.792006 0.237631 0 -0.837336
[10 rows x 4 columns]
I'm guessing you want something like the second example, but the original question wasn't very clear even with your sample.
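If it helps, pd.concat also accepts a dict of DataFrames, which gives one way back from the dict to a single frame; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(6.0),
                   'b': np.arange(6.0) * 2,
                   'group': [0, 1, 0, 2, 1, 0]})

# break the frame apart into a dict of per-group frames,
# as in the question
df_dict = {k: v for k, v in df.groupby('group')}

# ...arbitrary per-group transformations could happen here...

# pd.concat on a dict stitches the pieces back together;
# the dict keys become the outer level of a MultiIndex
restored = pd.concat(df_dict)

# dropping the outer level and sorting recovers the original shape,
# and the result can be grouped again if needed
flat = restored.droplevel(0).sort_index()
regrouped = flat.groupby('group')
```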