How to create a dataframe in pandas? - python

I have two lists. One is called 'Region' and another is called 'Products'. Region has 111 distinct values and Products has 1181 distinct values. I want to create a dataframe of each combination of products and region from these two lists.
For example, I want this type of dataframe made up of two list. Here, product has 2 values and region has 3 values.
Pdts Region
0 A X
1 B X
2 A Y
3 B Y
4 A Z
5 B Z
I want this type of dataframe but my 'Region' list has 111 distinct values and 'Products' list has 1181 distinct values. How can I achieve this?

That's a Cartesian product
import pandas as pd
df1 = pd.DataFrame({'dataframe1': ['A', 'B']})
df2 = pd.DataFrame({'dataframe2': ['X', 'Y', 'Z']})
# Assign new columns to a DataFrame
# Merge with a database-style join
# Drop specified labels from rows or columns
product_df = df1.assign(key=1)\
.merge(df2.assign(key=1), on='key')\
.drop('key', 1)
print(product_df)
Output
dataframe1 dataframe2
0 A X
1 A Y
2 A Z
3 B X
4 B Y
5 B Z

You can do something like that:
import pandas as pd
d = {'Region': first_list, 'Products': second_list}
df = pd.DataFrame(data=d)

Ya mean:
>>> df['Region'] = ['X', 'Y', 'Z'] * (len(df) // 3)
>>> df
Pdts Region
0 A X
1 B Y
2 A Z
3 B X
4 A Y
5 B Z
>>>

Related

Assign multiple columns different values based on conditions in Panda dataframe

I have dataframe where new columns need to be added based on existing column values conditions and I am looking for an efficient way of doing.
For Ex:
df = pd.DataFrame({'a':[1,2,3],
'b':['x','y','x'],
's':['proda','prodb','prodc'],
'r':['oz1','0z2','oz3']})
I need to create 2 new columns ['c','d'] based on following conditions
If df['b'] == 'x':
df['c'] = df['s']
df['d'] = df['r']
elif df[b'] == 'y':
#assign different values to c, d columns
We can use numpy where and apply conditions on new column like
df['c] = ny.where(condition, value)
df['d'] = ny.where(condition, value)
But I am looking if there is a way to do this in a single statement or without using for loop or multiple numpy or panda apply.
The exact output is unclear, but you can use numpy.where with 2D data.
For example:
cols = ['c', 'd']
df[cols] = np.where(df['b'].eq('x').to_numpy()[:,None],
df[['s', 'r']], np.nan)
output:
a b s r c d
0 1 x proda oz1 proda oz1
1 2 y prodb 0z2 NaN NaN
2 3 x prodc oz3 prodc oz3
If you want multiple conditions, use np.select:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq('x').to_numpy()[:,None],
df['b'].eq('y').to_numpy()[:,None]
],
[df[['s', 'r']],
df[['r', 'a']]
], np.nan)
it is however easier here to use a loop for the conditions if you have many:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq(c).to_numpy()[:,None] for c in ['x', 'y']],
[df[repl] for repl in (['s', 'r'], ['r', 'a'])],
np.nan)
output:
a b s r c d
0 1 x proda oz1 proda oz1
1 2 y prodb 0z2 0z2 2
2 3 x prodc oz3 prodc oz3

How to remove all rows that have values above the 99th percentile for certain columns in pandas efficiently?

I have a pandas dataframe
a b c d e f
0 0.025641 0.554686 0.988809 0.176905 0.050028 0.333333
1 0.027151 0.520914 0.985590 0.409572 0.163980 0.424242
2 0.028788 0.478810 0.970480 0.288557 0.095053 0.939394
3 0.018692 0.450573 0.985910 0.178048 0.118399 0.484848
4 0.023256 0.787253 0.865287 0.217591 0.205670 0.303030
And a list of columns
cols_list = ['a', 'd', 'f']
I want to filter out all rows which have have values above the 99th percentile for all of these columns.
I could do something like:
for col in cols_list:
df[f'q_{col}'] = df[col].quantile([0.99]).values[0]
for col in cols_list:
df = df[df[col] <= df[f'q_{col}']]
Is there a more efficient way to do this ?
You can use the operator le to compare the dataframe with the quantiles, then use all/any to check for values along the rows:
valids = df[cols_list].le(df[cols_list].quantile(0.99)).all(1)
df[valids]
Output:
a b c d e f
0 0.025641 0.554686 0.988809 0.176905 0.050028 0.333333
3 0.018692 0.450573 0.985910 0.178048 0.118399 0.484848
4 0.023256 0.787253 0.865287 0.217591 0.205670 0.303030

Slicing a DataFrameGroupBy object

Is there a way to slice a DataFrameGroupBy object?
For example, if I have:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3], 'B': ['x', 'y', 'z', 'r', 'p']})
A B
0 2 x
1 1 y
2 1 z
3 3 r
4 3 p
dfg = df.groupby('A')
Now, the returned GroupBy object is indexed by values from A, and I would like to select a subset of it, e.g. to perform aggregation. It could be something like
dfg.loc[1:2].agg(...)
or, for a specific column,
dfg['B'].loc[1:2].agg(...)
EDIT. To make it more clear: by slicing the GroupBy object I mean accessing only a subset of groups. In the above example, the GroupBy object will contain 3 groups, for A = 1, A = 2, and A = 3. For some reasons, I may only be interested in groups for A = 1 and A = 2.
It seesm you need custom function with iloc - but if use agg is necessary return aggregate value:
df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[0:3]))
print (df)
A
1 y,z
2 x
3 r,p
Name: B, dtype: object
df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[1:3]))
print (df)
A
1 z
2
3 p
Name: B, dtype: object
For multiple columns:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3],
'B': ['x', 'y', 'z', 'r', 'p'],
'C': ['g', 'y', 'y', 'u', 'k']})
print (df)
A B C
0 2 x g
1 1 y y
2 1 z y
3 3 r u
4 3 p k
df = df.groupby('A').agg(lambda x: ','.join(x.iloc[1:3]))
print (df)
B C
A
1 z y
2
3 p k
If I understand correctly, you only want some groups, but those are supposed to be returned completely:
A B
1 1 y
2 1 z
0 2 x
You can solve your problem by extracting the keys and then selecting groups based on those keys.
Assuming you already know the groups:
pd.concat([dfg.get_group(1),dfg.get_group(2)])
If you don't know the group names and are just looking for random n groups, this might work:
pd.concat([dfg.get_group(n) for n in list(dict(list(dfg)).keys())[:2]])
The output in both cases is a normal DataFrame, not a DataFrameGroupBy object, so it might be smarter to first filter your DataFrame and only aggregate afterwards:
df[df['A'].isin([1,2])].groupby('A')
The same for unknown groups:
df[df['A'].isin(list(set(df['A']))[:2])].groupby('A')
I believe there are some Stackoverflow answers refering to this, like How to access pandas groupby dataframe by key

Pandas flatten hierarchical index on non overlapping columns

I have a dataframe, and I set the index to a column of the dataframe. This creates a hierarchical column index. I want to flatten the columns to a single level. Similar to this question - Python Pandas - How to flatten a hierarchical index in columns, however, the columns do not overlap (i.e. 'id' is not at level 0 of the hierarchical index, and other columns are at level 1 of the index).
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
A B
id
101 3 x
102 5 y
Desired output is flattened columns, like this:
id A B
101 3 x
102 5 y
You are misinterpreting what you are seeing.
A B
id
101 3 x
102 5 y
Is not showing you a hierarchical column index. id is the name of the row index. In order to show you the name of the index, pandas is putting that space there for you.
The answer to your question depends on what you really want or need.
As the df is, you can dump it to a csv just the way you want:
print(df.to_csv(sep='\t'))
id A B
101 3 x
102 5 y
print(df.to_csv())
id,A,B
101,3,x
102,5,y
Or you can alter the df so that it displays the way you'd like
print(df.rename_axis(None))
A B
101 3 x
102 5 y
please do not do this!!!!
I'm putting it to demonstrate how to manipulate
I could also keep the index as it is but manipulate both column and row index names to print how you would like.
print(df.rename_axis(None).rename_axis('id', 1))
id A B
101 3 x
102 5 y
But this has named the columns' index id which makes no sense.
there will always be an index in your dataframes. if you don't set 'id' as index, it will be at the same level as other columns and pandas will populate an increasing integer for your index starting from 0.
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
In[52]: df
Out[52]:
id A B
0 101 3 x
1 102 5 y
the index is there so you can slice the original dataframe. such has
df.iloc[0]
Out[53]:
id 101
A 3
B x
Name: 0, dtype: object
so let says you want ID as index and ID as a column, which is very redundant, you could do:
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df['id'] = df.index
df
Out[55]:
A B id
id
101 3 x 101
102 5 y 102
with this you can slice by 'id' such has:
df.loc[101]
Out[57]:
A 3
B x
id 101
Name: 101, dtype: object
but it would the same info has :
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df.loc[101]
Out[58]:
A 3
B x
Name: 101, dtype: object
Given:
>>> df2=pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
>>> df2.set_index('id', inplace=True)
>>> df2
A B
id
101 3 x
102 5 y
For printing purdy, you can produce a copy of the DataFrame with a reset the index and use .to_string:
>>> print df2.reset_index().to_string(index=False)
id A B
101 3 x
102 5 y
Then play around with the formatting options so that the output suites your needs:
>>> fmts=[lambda s: u"{:^5}".format(str(s).strip())]*3
>>> print df2.reset_index().to_string(index=False, formatters=fmts)
id A B
101 3 x
102 5 y

Element-wise average and standard deviation across multiple dataframes

Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
A B C
0 -1.624722 -1.160731 0.016726
1 -1.565694 0.989333 1.040820
2 -0.484945 0.718596 -0.180779
3 0.388798 -0.997036 1.211787
4 -0.249211 1.604280 -1.100980
5 0.062425 0.925813 -1.810696
6 0.793244 -1.860442 -1.196797
A B C
0 1.016386 1.766780 0.648333
1 -1.101329 -1.021171 0.830281
2 -1.133889 -2.793579 0.839298
3 1.134425 0.611480 -1.482724
4 -0.066601 -2.123353 1.136564
5 -0.167580 -0.991550 0.660508
6 0.528789 -0.483008 1.472787
You can create a panel of your DataFrames and then compute the mean and SD along the items axis:
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
A B C
0 -0.024284 -0.622337 0.581292
1 0.186271 0.596634 -0.498755
2 0.084591 -0.760567 -0.334429
3 -0.833688 0.403628 0.013497
4 0.402502 -0.017670 -0.369559
5 0.733305 -1.311827 0.463770
6 -0.941334 0.843020 -1.366963
7 0.134700 0.626846 0.994085
8 -0.783517 0.703030 -1.187082
9 -0.954325 0.514671 -0.370741
>>> p.std(axis=0)
A B C
0 0.196526 1.870115 0.503855
1 0.719534 0.264991 1.232129
2 0.315741 0.773699 1.328869
3 1.169213 1.488852 1.149105
4 1.416236 1.157386 0.414532
5 0.554604 1.022169 1.324711
6 0.178940 1.107710 0.885941
7 1.270448 1.023748 1.102772
8 0.957550 0.355523 1.284814
9 0.582288 0.997909 1.566383
One simple solution here is to simply concatenate the existing dataframes into a single dataframe while adding an ID variable to track the original source:
dfa = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='a')
dfb = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='b')
df = pd.concat([df1,df2])
a b id
0 -0.542652 1.609213 a
1 -0.192136 0.458564 a
0 -0.231949 -0.000573 b
1 0.245715 -0.083786 b
So now you have two 2x2 dataframes combined into a single 4x2 dataframe. The 'id' columns identifies the source dataframe so you haven't lost any generality, and can select on 'id' to do the same thing you would to any single dataframe. E.g. df[ df['id'] == 'a' ].
But now you can also use groupby to do any pandas method such as mean() or std() on an element by element basis:
df.groupby('id').mean()
a b
index
0 0.198164 -0.811475
1 0.639529 0.812810
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2 ) / 2
Or, if you have more than two dataframes, say n, then
average_data_frame = dataframe1
for i in range(1,n):
average_data_frame = average_data_frame + i_th_dataframe
average_data_frame = average_data_frame / n
Once you have the average, you can go for the standard deviation. If you are looking for a "true Pythonic" approach, you should follow other answers. But if you are looking for a working and quick solution, this is it.

Categories

Resources