How to map nested dictionaries to dataframe columns in Python?

I have nested dictionaries like this:
X = {'A': {'col1': 12, 'col-2': 13}, 'B': {'col1': 12, 'col-2': 13}, 'C': {'col1': 12, 'col-2': 13}, 'D': {'col1': 12, 'col-2': 13}}
Y = {'A': {'col1': 3, 'col-2': 5}, 'B': {'col1': 1, 'col-2': 2}, 'C': {'col1': 4, 'col-2': 7}, 'D': {'col1': 8, 'col-2': 7}}
Z = {'A': {'col1': 6, 'col-2': 7}, 'B': {'col1': 4, 'col-2': 7}, 'C': {'col1': 5, 'col-2': 7}, 'D': {'col1': 4, 'col-2': 9}}
I also have a data frame with a single column like this:
df = pd.DataFrame(['A', 'B', 'C', 'D'], columns=['Names'])
For every row in the data frame, I want to map the given dictionaries in this format:
Names Col-1_X Col-1_Y Col-1_Z Col-2_X Col-2_Y Col-2_Z
A 12 3 6 13 5 7
B 12 1 4 13 2 7
C 12 4 5 13 7 7
D 12 8 4 13 7 9
Can anyone help me get the data in this format?

Transpose and concat them:
dfX = pd.DataFrame(X).T.add_suffix('_X')
dfY = pd.DataFrame(Y).T.add_suffix('_Y')
dfZ = pd.DataFrame(Z).T.add_suffix('_Z')
output = pd.concat([dfX, dfY, dfZ], axis=1)
Output:
col1_X col-2_X col1_Y col-2_Y col1_Z col-2_Z
A 12 13 3 5 6 7
B 12 13 1 2 4 7
C 12 13 4 7 5 7
D 12 13 8 7 4 9
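If you also want the combined table aligned with the existing single-column frame and in the column order shown in the question, a small sketch (assuming the single-column frame is named df and has a 'Names' column, as above):
order = ['col1_X', 'col1_Y', 'col1_Z', 'col-2_X', 'col-2_Y', 'col-2_Z']
result = df.join(output[order], on='Names')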


Compare even and odd rows in a Pandas Data Frame

I have a data frame like this:
Index  Time      Id
0      10:10:00  11
1      10:10:01  12
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
8      10:10:12  11
9      10:10:14  13
I want to compare the Id column for each pair of rows: row 0 with row 1, row 2 with row 3, and so on.
In other words, I want to compare even rows with odd rows and keep only the pairs that share the same Id.
My ideal output would be:
Index  Time      Id
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
I tried this, but it did not work:
df = df[df[::2]["id"] == df[1::2]["id"]]
You can use a GroupBy.transform approach:
import numpy as np

# for each pair, is there only one kind of Id?
out = df[df.groupby(np.arange(len(df)) // 2)['Id'].transform('nunique').eq(1)]
Or, more efficiently, using the underlying numpy array:
# convert to numpy
a = df['Id'].to_numpy()
# are the odds equal to evens?
out = df[np.repeat((a[::2]==a[1::2]), 2)]
output:
Index Time Id
2 2 10:10:02 12
3 3 10:10:04 12
4 4 10:10:06 13
5 5 10:10:07 13
6 6 10:10:08 11
7 7 10:10:10 11
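Note that the np.repeat trick assumes an even number of rows. If the frame can end with an unpaired row, a small sketch that trims it first (same df and Id column as above):
a = df['Id'].to_numpy()
n = len(a) - (len(a) % 2)  # ignore a trailing unpaired row, if any
out = df.iloc[:n][np.repeat(a[:n:2] == a[1:n:2], 2)]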

How to union multiple columns from one pandas data frame into one series? [duplicate]

This question already has answers here:
How do I melt a pandas dataframe?
(3 answers)
Closed 11 months ago.
I have a data frame that actually has more than 20 columns; the example below has 4. Each column has the same number of rows. How do I convert it to a new dataframe (example shown below) that has only one column? I will use the new combined dataframe to calculate some metrics. How do I write neat and efficient code for this? Thank you so much!
data={"col1":[1,2,3,5], "col_2":[6,7,8,9], "col_3":[10,11,12,14], "col_4":[7,8,9,10]}
pd.DataFrame.from_dict(data)
You can convert the DataFrame to a numpy array and flatten it using the ravel method. Finally, construct a Series (or a DataFrame) with the result.
data = {"col1":[1,2,3,5], "col_2":[6,7,8,9], "col_3":[10,11,12,14], "col_4":[7,8,9,10]}
df = pd.DataFrame(data)
new_col = pd.Series(df.to_numpy().ravel(order='F'), name='new_col')
Output:
>>> new_col
0 1
1 2
2 3
3 5
4 6
5 7
6 8
7 9
8 10
9 11
10 12
11 14
12 7
13 8
14 9
15 10
Name: new_col, dtype: int64
If you start from your dictionary, use itertools.chain:
data={"col1":[1,2,3,5], "col_2":[6,7,8,9], "col_3":[10,11,12,14], "col_4":[7,8,9,10]}
from itertools import chain
pd.DataFrame({'col': chain.from_iterable(data.values())})
Else, ravel the underlying numpy array:
df = pd.DataFrame.from_dict(data)
pd.Series(df.to_numpy().ravel('F'))
Output:
0 1
1 2
2 3
3 5
4 6
5 7
6 8
7 9
8 10
9 11
10 12
11 14
12 7
13 8
14 9
15 10
dtype: int64
Depending on the computation you need to perform, you might not even need to instantiate a DataFrame/Series and can stick to the array:
a = df.to_numpy().ravel('F')
Output: array([ 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 7, 8, 9, 10])
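For what it's worth, ravel('F') (column-major order) is what stacks the values column after column; the default 'C' (row-major) order would interleave the rows instead:
df.to_numpy().ravel('C')
# array([ 1,  6, 10,  7,  2,  7, 11,  8,  3,  8, 12,  9,  5,  9, 14, 10])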
Try with melt:
out = pd.DataFrame.from_dict(data).melt().drop(['variable'], axis=1)
Output:
value
0 1
1 2
2 3
3 5
4 6
5 7
6 8
7 9
8 10
9 11
10 12
11 14
12 7
13 8
14 9
15 10
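Another option worth mentioning is a plain concat of the columns, which produces the same column-by-column order as ravel('F'); a minimal sketch:
out = pd.concat([df[c] for c in df.columns], ignore_index=True)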

How to melt multiple columns into one column?

I have this table:
a b c d e f 19-08-06 19-08-07 19-08-08 g h i
1 2 3 4 5 6 7 8 9 10 11 12
I have 34 date columns in total, so I want to melt the date columns into one column only.
How can I do this in Python?
Thanks in advance.
You can use .str.fullmatch on the column index to create a boolean mask that selects the date columns, then use df.melt:
m = df.columns.str.fullmatch(r"\d{2}-\d{2}-\d{2}")
cols = df.columns[m]
df.melt(value_vars=cols, var_name='date', value_name='vals')
date vals
0 19-08-06 7
1 19-08-07 8
2 19-08-08 9
If you want to melt while keeping the other columns, try this:
df.melt(
    id_vars=df.columns.difference(cols), var_name="date", value_name="vals"
)
a b c d e f g h i date vals
0 1 2 3 4 5 6 10 11 12 19-08-06 7
1 1 2 3 4 5 6 10 11 12 19-08-07 8
2 1 2 3 4 5 6 10 11 12 19-08-08 9
Here I did not use value_vars=cols as it's done implicitly
value_vars: tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are
not set as id_vars.
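If the melted date column should hold actual datetimes rather than strings, you can convert it afterwards. A sketch assuming the labels are year-month-day (adjust the format string if they are day-month-year):
out = df.melt(id_vars=df.columns.difference(cols), var_name="date", value_name="vals")
out["date"] = pd.to_datetime(out["date"], format="%y-%m-%d")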

Why does pd.rolling and .apply() return multiple outputs from a function returning a single value?

I'm trying to create a rolling function that:
Divides two DataFrames with 3 columns in each df.
Calculate the mean of each row from the output in step 1.
Sums the averages from step 2.
This could be done with df.iterrows(), i.e., by looping through each row. However, this would be inefficient when working with larger datasets. Therefore, my objective is to create a pd.rolling function that could do this much faster.
What I would need help with is to understand why my approach below returns multiple values while the function I'm using only returns a single value.
EDIT : I have updated the question with the code that produces my desired output.
This is the test dataset I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
One method to achieve my desired output is by looping through each row:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values) - 1) * 100))
        Average = Div.mean(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)

# printing my desired output values
print(RunningSum)
[330.42328042328046,
212.0899470899471,
152.06349206349208,
205.55555555555554,
311.9047619047619,
209.1269841269841,
197.61904761904765,
116.94444444444444,
149.72222222222223,
430.0,
219.51058201058203,
215.34391534391537,
199.15343915343914,
159.6031746031746,
127.6984126984127,
326.85185185185185,
204.16666666666669]
However, this would be slow when working with large datasets. Therefore, I've tried to create a function that applies to a pd.rolling() object.
def SumOfAverageFunction(vals):
    Div = df2 / vals.reset_index(drop=True)
    Average = Div.mean(axis=0)
    SumOfAverages = np.sum(Average)
    return SumOfAverages

RunningSum = df1.rolling(window=3, axis=0).apply(SumOfAverageFunction)
The problem here is that my function returns multiple outputs. How can I solve this?
print(RunningSum)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.214286 4.533333 2.277778
3 4.777778 3.200000 2.111111
4 5.888889 4.416667 1.656085
5 5.111111 5.400000 2.915344
6 3.455556 3.933333 5.714286
7 2.866667 2.066667 5.500000
8 2.977778 3.977778 3.063492
9 3.555556 5.622222 1.907937
10 2.750000 4.200000 1.747619
11 1.638889 2.377778 3.616667
12 2.986111 2.005556 5.500000
13 5.333333 3.075000 4.750000
14 4.396825 5.000000 3.055556
15 2.174603 3.888889 2.148148
16 2.111111 2.527778 1.418519
17 2.507937 3.500000 3.311111
18 2.880952 3.000000 5.366667
19 2.722222 3.370370 5.750000
20 2.138889 5.129630 5.666667
After reordering the operations, your calculation can be simplified:
BASE = df2.sum(axis=0) /3
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)
print(np.around(result[4:], 3))
Outputs:
4 5.508
5 4.200
6 2.400
7 3.000
...
If you don't want to calculate anything before index 4, then change:
df1.iloc[4:].rdiv(...
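As for the question in the title: DataFrame.rolling(...).apply() hands each column's window to the function as a separate 1-D Series, so SumOfAverageFunction is called once per column and the result has one value per column. If you need a single value per 3-row block without iterrows, one option is to build the windows explicitly; a sketch using NumPy's sliding_window_view (requires NumPy >= 1.20):
from numpy.lib.stride_tricks import sliding_window_view

a1 = df1.to_numpy(dtype=float)  # shape (21, 3)
a2 = df2.to_numpy(dtype=float)  # shape (3, 3)

# every 3-row block of df1; transpose so that blocks[i] == df1.iloc[i:i+3].values
blocks = sliding_window_view(a1, 3, axis=0).transpose(0, 2, 1)

# same per-window computation as the iterrows loop in the question
per_window = np.abs((a2 / blocks - 1) * 100).mean(axis=1).sum(axis=1)

# the loop in the question only keeps windows ending at index 4 or later
RunningSum = per_window[2:]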

Dask.dataframe or Alternative: Scalable way of dropping rows of low frequency items

I am looking for a way to remove rows from a dataframe that contain low frequency items. I adapted the following snippet from this post:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])
threshold = 10 # Anything that occurs less than this will be removed.
value_counts = df.stack().value_counts() # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
The problem is that this code does not seem to scale.
The line to_remove = value_counts[value_counts <= threshold].index has now been running for several hours on my data (a 2 GB compressed HDFStore). I therefore need a better solution, ideally out-of-core. I suspect dask.dataframe is suitable, but I fail to express the above code in terms of dask. The key functions stack and replace are absent from dask.dataframe.
I tried the following (works in normal pandas) to work around the lack of these two functions:
value_countss = [df[col].value_counts() for col in df.columns]
infrequent_itemss = [value_counts[value_counts < 3] for value_counts in value_countss]
rows_to_drop = set(
    i for indices in [df.loc[df[col].isin(infrequent_items.keys())].index.values
                      for col, infrequent_items in zip(df.columns, infrequent_itemss)]
    for i in indices
)
df.drop(rows_to_drop)
That does not actually work with dask though. It errors at infrequent_items.keys().
Even if it did work, given that this is the opposite of elegant, I suspect there must be a better way.
Can you suggest something?
Not sure if this will help you out, but it's too big for a comment:
df = pd.DataFrame(np.random.randint(0, high=20, size=(30,2)), columns = ['A', 'B'])
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))
threshold = 10
to_remove = [k for k, v in d.items() if v < threshold]
df.replace(to_remove, np.nan, inplace=True)
See:
How to count the occurrence of certain item in an ndarray in Python?
how to count occurrence of each unique value in pandas
A toy problem showed a 40x speedup (from 400 µs to 10 µs) in the step you mentioned.
The following code, which incorporates Evan's improvement, solves my issue:
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))
to_remove = {k for k, v in d.items() if v < threshold}
mask = df.isin(to_remove)
column_mask = (~mask).all(axis=1)
df = df[column_mask]
demo:
def filter_low_frequency(df, threshold=4):
    unique, counts = np.unique(df.values.ravel(), return_counts=True)
    d = dict(zip(unique, counts))
    to_remove = {k for k, v in d.items() if v < threshold}
    mask = df.isin(to_remove)
    column_mask = (~mask).all(axis=1)
    df = df[column_mask]
    return df
df = pd.DataFrame(np.random.randint(0, high=20, size=(10,10)))
print(df)
print(df.stack().value_counts())
df = filter_low_frequency(df)
print(df)
0 1 2 3 4 5 6 7 8 9
0 3 17 11 13 8 8 15 14 7 8
1 2 14 11 3 16 10 19 19 14 4
2 8 13 13 17 3 13 17 18 5 18
3 7 8 14 9 15 12 0 15 2 19
4 6 12 13 11 16 6 19 16 2 17
5 2 1 2 17 1 3 12 10 2 16
6 0 19 9 4 15 3 3 3 4 0
7 18 8 15 9 1 18 15 17 9 0
8 17 15 9 11 13 9 11 4 19 8
9 13 6 7 8 8 10 0 3 16 13
8 9
3 8
13 8
17 7
15 7
19 6
2 6
9 6
11 5
16 5
0 5
18 4
4 4
14 4
10 3
12 3
7 3
6 3
1 3
5 1
dtype: int64
0 1 2 3 4 5 6 7 8 9
6 0 19 9 4 15 3 3 3 4 0
8 17 15 9 11 13 9 11 4 19 8
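As for an out-of-core version of this: the per-column value counts stay small even when the data itself does not fit in memory, so a dask.dataframe sketch might look roughly like the following (the store path, key, and threshold below are hypothetical placeholders):
import dask
import dask.dataframe as dd
import pandas as pd

threshold = 10

# hypothetical store path and key
ddf = dd.read_hdf('store.h5', key='/data')

# per-column value counts are small, so bring them back to pandas and combine them
per_col = dask.compute(*[ddf[col].value_counts() for col in ddf.columns])
counts = pd.concat(per_col).groupby(level=0).sum()
to_remove = counts[counts < threshold].index.tolist()

# keep only rows in which no column holds an infrequent value
filtered = ddf[~ddf.isin(to_remove).any(axis=1)]
result = filtered.compute()  # or write it back out in chunks to stay out of core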
