How to filter this Python DataFrame

Greetings,
I'm trying to get the smallest-sized DataFrame that contains only valid rows.
import pandas as pd
import random

columns = ['x0', 'y0']
df_ = pd.DataFrame(index=range(0, 30), columns=columns)
df_ = df_.fillna(0)

columns1 = ['x1', 'y1']
df = pd.DataFrame(index=range(0, 11), columns=columns1)
for index, row in df.iterrows():
    df.loc[index, "x1"] = random.randint(1, 100)
    df.loc[index, "y1"] = random.randint(1, 100)
df_ = df_.combine_first(df)

df = pd.DataFrame(index=range(0, 17), columns=columns1)
for index, row in df.iterrows():
    df.loc[index, "x2"] = random.randint(1, 100)
    df.loc[index, "y2"] = random.randint(1, 100)
df_ = df_.combine_first(df)
From the example, the output DataFrame should keep rows 0 to 10 and the rest should be filtered out.
I thought of keeping a counter to track the minimum row count,
or using pandasql,
or perhaps there is a trick to get this info (the size) from the DataFrame itself.
In practice I will be appending 500+ files of various sizes
and using the result for some analysis, so performance is a consideration.
-student of python

If you want to drop the rows which have NaNs, use dropna (here, that keeps rows 0 through 10):
In [11]: df_.dropna()
Out[11]:
     x0  x1  x2  y0  y1  y2
0     0  49  58   0  68   2
1     0   2  37   0  19  71
2     0  26  95   0  12  17
3     0  87   5   0  70  69
4     0  84  77   0  70  92
5     0  71  98   0  22   5
6     0  28  95   0  70  15
7     0  31  19   0  24  31
8     0   9  37   0  55  29
9     0  30  53   0  15  45
10    0   8  61   0  74  41
However, a cleaner, more efficient, and faster way to do this entire process is to update just those first rows (I'm assuming the random-integer stuff is just you generating some example DataFrames).
Let's store your DataFrames in a list:
In [20]: import numpy as np; from functools import reduce  # needed below for np.nan and reduce
In [21]: df1 = pd.DataFrame([[1, 2], [np.nan, 4]], columns=['a', 'b'])
In [22]: df2 = pd.DataFrame([[1, 2], [5, 6], [7, 8]], columns=['a', 'c'])
In [23]: dfs = [df1, df2]
Take the minimum length:
In [24]: m = min(len(df) for df in dfs)
First create an empty DataFrame with the desired rows and columns:
In [25]: columns = reduce(lambda x, y: y.columns.union(x), dfs, [])
In [26]: res = pd.DataFrame(index=np.arange(m), columns=columns)
To do this efficiently we're going to use update, making these changes in place, on just this one DataFrame*:
In [27]: for df in dfs:
   ....:     res.update(df)
In [28]: res
Out[28]:
   a  b  c
0  1  2  2
1  5  4  6
*If we didn't do this, or were using combine_first or similar, we'd most likely have lots of copying (new DataFrames being created), which will slow things down.
Note: combine_first doesn't offer an inplace flag... you could use combine but this is also more complicated (as well as less efficient). It's also quite straightforward to use where (and manually update), which IIRC is what combine does under the hood.
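To make the where remark concrete, here is a minimal sketch (toy frames of my own, not from the question) of reproducing combine_first with where; for identically-labelled frames this is roughly what happens under the hood:
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})
b = pd.DataFrame({'x': [8.0, 9.0], 'y': [6.0, 7.0]})

# combine_first keeps a's values and fills its NaNs from b...
res1 = a.combine_first(b)

# ...which, when the frames share index and columns, is just where on the NaN mask
res2 = a.where(a.notna(), b)

assert res1.equals(res2)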

Related

Outer product on Pandas DataFrame rows

I have two DataFrames with identical column labels. The columns are label, data1, data2, ..., dataN.
I need to take the product of the DataFrames, multiplying data1 * data1, data2 * data2, etc. for every possible combination of rows in DataFrame1 with the rows in DataFrame2. As such, I want the resulting DataFrame to maintain the label column of both frames in some way.
Example:
Frame 1:
label  d1  d2  d3
a       1   2   3
b       4   5   6
Frame 2:
label  d1  d2  d3
c       7   8   9
d      10  11  12
Result:
label_1  label_2  d1  d2  d3
a        c         7  16  27
a        d        10  22  36
b        c        28  40  54
b        d        40  55  72
I feel like there is a nice way to do this, but all I can come up with is gross loops with lots of memory reallocation.
Let's do a cross merge first, then multiply the dn_x columns by the dn_y columns:
out = df1.merge(df2, how='cross')
out = (out.filter(like='label')
          .join(out.filter(regex='d.*_x')
                   .mul(out.filter(regex='d.*_y').values)
                   .rename(columns=lambda col: col.split('_')[0])))
print(out)
label_x label_y d1 d2 d3
0 a c 7 16 27
1 a d 10 22 36
2 b c 28 40 54
3 b d 40 55 72
First idea: DataFrame.reindex with a MultiIndex created by MultiIndex.from_product:
mux = pd.MultiIndex.from_product([df1['label'], df2['label']])
df = (df1.set_index('label').reindex(mux, level=0)
         .mul(df2.set_index('label').reindex(mux, level=1))
         .rename_axis(['label1', 'label2'])
         .reset_index())
print (df)
label1 label2 d1 d2 d3
0 a c 7 16 27
1 a d 10 22 36
2 b c 28 40 54
3 b d 40 55 72
Or a solution with a cross join:
df = (df1.rename(columns={'label': 'label1'})
         .merge(df2.rename(columns={'label': 'label2'}),
                how='cross',
                suffixes=('_', '')))
For the data columns: get the columns ending with _, multiply them into the matching columns without the _ suffix, and finally drop the suffixed columns:
cols = df.filter(regex='_$').columns
no_ = cols.str.rstrip('_')
df[no_] *= df[cols].to_numpy()
df = df.drop(cols, axis=1)
print (df)
label1 label2 d1 d2 d3
0 a c 7 16 27
1 a d 10 22 36
2 b c 28 40 54
3 b d 40 55 72
One option is with a cross join, using expand_grid from pyjanitor, before computing the products:
# pip install pyjanitor
import pandas as pd
import janitor as jn
others = {'df1':df1, 'df2':df2}
out = jn.expand_grid(others=others)
numbers = out.select_dtypes('number')
numbers = numbers['df1'] * numbers['df2']
labels = out.select_dtypes('object')
labels.columns = ['label_1', 'label_2']
pd.concat([labels, numbers], axis=1)
label_1 label_2 d1 d2 d3
0 a c 7 16 27
1 a d 10 22 36
2 b c 28 40 54
3 b d 40 55 72
OP here. Ynjxsjmh's answer allowed me to write the code to solve my problem, but I just wanted to post a function which is a little more general in the end and includes a little more explanation for anyone who stumbles here in the future.
Hit me with suggestions if you think of anything.
def exhaustive_df_operation(
    self,
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    func: callable,
    label_cols: list,
    suffixes: tuple = ("_x", "_y"),
):
    """
    Given DataFrames with multiple rows, executes the given
    function on all row combinations, i.e. in an exhaustive manner.
    DataFrame column names must be the same. Label cols are the
    columns which label the input/output and should not be used in
    the computation.

    Arguments:
        df1: pd.DataFrame
            First DataFrame to act on.
        df2: pd.DataFrame
            Second DataFrame to act on.
        func: callable
            numpy function to call as the operation on the DataFrames.
        label_cols: list
            The column names corresponding to columns that label the
            rows as distinct. Must be common to the DataFrames, but
            several may be passed.
        suffixes: tuple
            The suffixes to use when calculating the cross merge.

    Returns:
        result: pd.DataFrame
            DataFrame that results from the operation; will have
            len(df1)*len(df2) rows. label_cols will label the DataFrame
            from which each row was sourced.

    eg. df1              df2
        label  a  b      label  a  b
        i      1  2      k      5  6
        j      3  4      l      7  8

        func = np.add
        label_cols = ['label']
        suffixes = ("_x", "_y")

        result =
        label_x  label_y   a   b
        i        k         6   8
        i        l         8  10
        j        k         8  10
        j        l        10  12
    """
    # Creating a merged DataFrame with an exhaustive "cross" product
    merged = df1.merge(df2, how="cross", suffixes=suffixes)
    # The names of the columns that will identify result rows
    label_col_names = [col + suf for col in label_cols for suf in suffixes]
    # The actual identifying columns
    label_cols = merged[label_col_names]
    # Non-label columns ending with suffixes[0]
    data_col_names = [
        col
        for col in merged.columns
        if (suffixes[0] in col and col not in label_col_names)
    ]
    data_1 = merged[data_col_names]
    # Needed for the rename later - strips the suffix from the data column names
    name_fix_dict = {old: old[: -len(suffixes[0])] for old in data_col_names}
    # Non-label columns ending with suffixes[1]
    data_col_names = [
        col
        for col in merged.columns
        if (suffixes[1] in col and col not in label_col_names)
    ]
    data_2 = merged[data_col_names]
    # Need .values because data_1 and data_2 have different column
    # labels, which confuses pandas/numpy.
    result = label_cols.join(func(data_1, data_2.values))
    # Removing suffixes from data columns
    result.rename(columns=name_fix_dict, inplace=True)
    return result
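For reference, here is a quick usage sketch built from the docstring example (my own toy frames; since the function is written as a method, I pass None for self when calling it standalone):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'label': ['i', 'j'], 'a': [1, 3], 'b': [2, 4]})
df2 = pd.DataFrame({'label': ['k', 'l'], 'a': [5, 7], 'b': [6, 8]})

# self is unused in the body, so None is fine for a standalone call
result = exhaustive_df_operation(None, df1, df2, np.add, ['label'])
print(result)
#   label_x label_y   a   b
# 0       i       k   6   8
# 1       i       l   8  10
# 2       j       k   8  10
# 3       j       l  10  12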

How to multiply combinations of two sets of pandas dataframe columns

I would like to multiply the combinations of two sets of columns.
Let's say there is a dataframe below:
import pandas as pd
df = {'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9], 'D':[0,1,2]}
df = pd.DataFrame(df)
Now, I want to multiply AC, AD, BC, BD
This is like multiplying the combination of [A,B] and [C,D]
I tried to use itertools but failed to figure it out.
So, the desired output will be like:
output = {'AC':[7,16,27], 'AD':[0,2,6], 'BC':[28,40,54], 'BD':[0,5,12]}
output = pd.DataFrame(output)
IIUC, you can try
import itertools

cols1 = ['A', 'B']
cols2 = ['C', 'D']
for col1, col2 in itertools.product(cols1, cols2):
    df[col1 + col2] = df[col1] * df[col2]
print(df)
A B C D AC AD BC BD
0 1 4 7 0 7 0 28 0
1 2 5 8 1 16 2 40 5
2 3 6 9 2 27 6 54 12
Or, creating a new dataframe:
out = pd.concat([df[col1].mul(df[col2]).to_frame(col1 + col2)
                 for col1, col2 in itertools.product(cols1, cols2)], axis=1)
print(out)
AC AD BC BD
0 7 0 28 0
1 16 2 40 5
2 27 6 54 12
You can directly multiply multiple columns if you convert them to NumPy arrays first with .to_numpy()
>>> df[["A","B"]].to_numpy() * df[["C","D"]].to_numpy()
array([[ 7,  0],
       [16,  5],
       [27, 12]])
You can also unzip a collection of wanted pairs and use them to get a new view of your DataFrame (indexing the same column multiple times is fine), then multiply the two new NumPy arrays together!
>>> import math # standard library for prod()
>>> pairs = ["AC", "AD", "BC", "BD"] # wanted pairs
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs) # new dataframe
AC AD BC BD
0 7 0 28 0
1 16 2 40 5
2 27 6 54 12
This extends to any number of pairs (triples, octuples of columns..) as long as they're the same length (beware: zip() will silently drop extra columns beyond the shortest group)
>>> pairs = ["ABD", "BCD"]
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs)
ABD BCD
0 0 0
1 10 40
2 36 108

Python - initiate empty dataframe and populate from another dataframe

Working with python pandas 0.19.
I want to create a new dataframe (df2) as a subset of an existing dataframe (df1). df1 looks like this:
In [1]: df1.head()
Out [1]:
col1_name col2_name col3_name
0 23 42 55
1 27 55 57
2 52 20 52
3 99 18 53
4 65 32 51
The logic is:
df2 = []
for i in range(0, N):
    loc = some complicated logic
    df1_sub = df1.ix[loc, ]
    df2.append(df1_sub)
df2 = pd.DataFrame.from_records(df2)
The resulting df2 is indeed a dataframe, but its content consists entirely of the column names of df1. It looks like this:
In [2]: df2.head()
Out [2]:
col1_name col2_name col3_name
0 col1_name col2_name col3_name
1 col1_name col2_name col3_name
2 col1_name col2_name col3_name
3 col1_name col2_name col3_name
4 col1_name col2_name col3_name
I know it's probably related to the conversion from list to dataframe but I'm not sure what exactly I'm missing here. Or is there a better way of doing this?
As per Ted Petrou, the solution is simply:
pd.concat(df2)
I was confused by the data type of df2.
It is impossible, given the logic within the for loop, to directly select df1 using some index.
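A minimal sketch of that fix (my own toy frame; I use .loc with a list selector so each piece stays a DataFrame, and avoid the long-deprecated .ix):
import pandas as pd

df1 = pd.DataFrame({'col1': [23, 27, 52], 'col2': [42, 55, 20]})

pieces = []
for loc in ([0], [2]):             # stand-in for "some complicated logic"
    pieces.append(df1.loc[loc])    # a list selector keeps the DataFrame shape
df2 = pd.concat(pieces)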
How about just slicing the dataframe?
import pandas as pd
DF1 = pd.DataFrame()
DF1['x'] = ['a','b','c','a','c','b']
DF1['y'] = [1,3,2,-1,-2,-3]
DF2 = DF1[[(x == 'a' and y > 0) for x,y in zip(DF1['x'], DF1['y'])]]
This should be way more efficient than appending. DF1[Complicated Condition] takes any Boolean argument.
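As an aside (not part of the original answer), the Python-level list comprehension can be replaced by a vectorized boolean mask, which scales much better on large frames:
DF2 = DF1[(DF1['x'] == 'a') & (DF1['y'] > 0)]  # element-wise AND via &, each comparison vectorized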
You can take advantage of pandas' (actually NumPy's) boolean masks.
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': ['a', 'b', 'c', 'd', 'e'],
                    'c': [10, 11, 12, 13, 14]})
# a b c
# 0 1 a 10
# 1 2 b 11
# 2 3 c 12
# 3 4 d 13
# 4 5 e 14
Let's assume that df2 should be a subset of df1: it should have columns b and c and only the rows where column a has an even value:
df2 = df1[df1['a'] % 2 == 0][['b', 'c']]
# b c
# 1 b 11
# 3 d 13
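One small addition of mine: the same subset can be taken in a single .loc call, which avoids chained indexing and the SettingWithCopy warnings it can trigger if you later assign into df2:
df2 = df1.loc[df1['a'] % 2 == 0, ['b', 'c']]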

How to replace the first n elements in each row of a dataframe that are larger than a certain threshold

I have a huge dataframe that contains only numbers (the one I show below is just for demonstration purposes). My goal is to replace, in each row of the dataframe, the first n numbers that are larger than a certain value val with 0.
To give an example:
My dataframe could look like this:
c1 c2 c3 c4
0 38 10 1 8
1 44 12 17 46
2 13 6 2 7
3 9 16 13 26
If I now choose n = 2 (number of replacements) and val = 10, my desired output would look like this:
c1 c2 c3 c4
0 0 10 1 8
1 0 0 17 46
2 0 6 2 7
3 9 0 0 26
In the first row, only one value is larger than val, so only one gets replaced; in the second row all values are larger than val, but only the first two can be replaced. The same applies to rows 3 and 4 (please note that it is not simply the first two columns that are affected, but the first two qualifying values in a row, which can be in any column).
A straightforward and very ugly implementation could look like this:
import numpy as np
import pandas as pd
np.random.seed(1)
col1 = [np.random.randint(1, 50) for ti in range(4)]
col2 = [np.random.randint(1, 50) for ti in range(4)]
col3 = [np.random.randint(1, 50) for ti in range(4)]
col4 = [np.random.randint(1, 50) for ti in range(4)]
df = pd.DataFrame({'c1': col1, 'c2': col2, 'c3': col3, 'c4': col4})
val = 10
n = 2
for ind, row in df.iterrows():
    # number of replacements
    re = 0
    for indi, vali in enumerate(row):
        if vali > val:
            df.iloc[ind, indi] = 0
            re += 1
        if re == n:
            break
That works but I am sure that there are much more efficient ways of doing this. Any ideas?
You could write your own, slightly unusual, function and use apply with axis=1:
def f(x, n, m):
    y = x.copy()
    y[y[y > m].iloc[:n].index] = 0
    return y
In [380]: df
Out[380]:
c1 c2 c3 c4
0 38 10 1 8
1 44 12 17 46
2 13 6 2 7
3 9 16 13 26
In [381]: df.apply(f, axis=1, n=2, m=10)
Out[381]:
c1 c2 c3 c4
0 0 10 1 8
1 0 0 17 46
2 0 6 2 7
3 9 0 0 26
Note: y = x.copy() makes a copy of the Series. If you want to change your values in place you could omit that line. The extra y is needed because slicing gives you a copy, not the original object.
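As a further aside (my sketch, not part of the answer above), the whole operation can be vectorized with a cumulative count of threshold hits, avoiding apply entirely; df, val and n are as defined in the question:
import numpy as np

gt = df.to_numpy() > val                 # True where a value exceeds val
first_n = gt & (gt.cumsum(axis=1) <= n)  # keep only the first n hits per row
out = df.mask(first_n, 0)                # zero them out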

Pandas parse json in column and expand to new rows in dataframe

I have a dataframe containing (record formatted) json strings as follows:
In [9]: pd.DataFrame({'col1': ['A', 'B'],
   ...:               'col2': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
   ...:                        '[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})
Out[9]:
col1 col2
0 A [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1 B [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
I would like to extract the json and for each record add a new row to the dataframe:
  col1         t   v
0    A  05:15:00  20
1    A  05:20:00  25
2    B  05:15:00  10
3    B  05:20:00  15
I've been experimenting with the following code:
def json_to_df(x):
    df2 = pd.read_json(x.col2)
    return df2

df.apply(json_to_df, axis=1)
but the resulting dataframes are assigned as tuples, rather than creating new rows. Any advice?
The problem with apply is that you need to return multiple rows and it expects only one. A possible solution:
from functools import reduce  # Python 3: reduce lives in functools

def json_to_df(row):
    _, row = row
    df_json = pd.read_json(row.col2)
    col1 = pd.Series([row.col1] * len(df_json), name='col1')
    return pd.concat([col1, df_json], axis=1)

df = map(json_to_df, df.iterrows())        # an iterable of dataframes
df = reduce(lambda x, y: x.append(y), df)  # glues them together
df
col1 t v
0 A 05:15 20
1 A 05:20 25
0 B 05:15 10
1 B 05:20 15
Ok, taking a little inspiration from hellpanderrr's answer above, I came up with the following:
In [92]:
pd.DataFrame({'X': ['A', 'B'], 'Y': ['fdsfds', 'fdsfds'],
              'json': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
                       '[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})
Out[92]:
X Y json
0 A fdsfds [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1 B fdsfds [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
In [93]:
dfs = []

def json_to_df(row, json_col):
    json_df = pd.read_json(row[json_col])
    dfs.append(json_df.assign(**row.drop(json_col)))

_.apply(json_to_df, axis=1, json_col='json')
pd.concat(dfs)
Out[93]:
t v X Y
0 05:15 20 A fdsfds
1 05:20 25 A fdsfds
0 05:15 10 B fdsfds
1 05:20 15 B fdsfds
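One closing note of mine (not from the answers above): on recent pandas versions the same expansion can be sketched with json.loads, explode, and json_normalize, assuming col2 really holds record-formatted JSON strings:
import json
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B'],
                   'col2': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
                            '[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})

df['col2'] = df['col2'].apply(json.loads)   # parse each string into a list of dicts
df = df.explode('col2', ignore_index=True)  # one row per dict
out = pd.concat([df['col1'],
                 pd.json_normalize(df['col2'].tolist())], axis=1)
print(out)
#   col1      t     v
# 0    A  05:15  20.0
# 1    A  05:20  25.0
# 2    B  05:15  10.0
# 3    B  05:20  15.0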
