transpose data based on multiple tables with the help of pandas - python

Looking for some help with the below. I have 2 big CSV files and need to get the data based on a few conditions. Here is a sample of the first file:
a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
2,15,12.99,0.0,34.33,0
2,169,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
2,148,12.99,0.0,34.33,0
4,154,12.99,0.0,34.33,0
And the second file, a lookup from a1 to a v* column:
a1,k1
1,v1
2,v2
3,v3
4,v4
The values under a1 and k1 need to be matched; if the v* column that an a1 value points at is zero, those rows should be dropped from the final CSV file. Expected output:
a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
The values of v2 and v4 are zeroes, so the rows whose a1 values are 2 and 4 are dropped.
Thanks in advance.

IIUC:
# Find the v* columns whose values are all 0 (use .any() instead of .all() if a single 0 should count)
vx_idx = df1.filter(regex=r'^v\d+').eq(0).all().loc[lambda x: x].index
# Find the a1 values that map to those v* columns
a1_val = df2.loc[df2['k1'].isin(vx_idx), 'a1'].tolist()
# Filter your final dataframe
out = df1[~df1['a1'].isin(a1_val)]
Output:
>>> out
a1 a2 v1 v2 v3 v4
0 1 12 12.99 0.0 34.33 0
1 1 13 12.99 0.0 34.33 0
2 1 145 12.99 0.0 34.33 0
5 3 164 12.99 0.0 34.33 0
6 3 147 12.99 0.0 34.33 0
7 1 174 12.99 0.0 34.33 0
>>> print(out.to_csv(index=False))
a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
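End to end, the same idea can read both files and write the filtered result; a minimal sketch, where the file names data.csv, keys.csv and filtered.csv are assumptions rather than names from the question:
import pandas as pd

# Hypothetical file names; substitute the real CSV paths.
df1 = pd.read_csv('data.csv')   # columns a1, a2, v1..v4
df2 = pd.read_csv('keys.csv')   # columns a1, k1

# v* columns that are entirely zero
zero_cols = df1.filter(regex=r'^v\d+').eq(0).all().loc[lambda x: x].index

# a1 values whose k1 entry points at one of those columns
drop_a1 = df2.loc[df2['k1'].isin(zero_cols), 'a1']

df1[~df1['a1'].isin(drop_a1)].to_csv('filtered.csv', index=False)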

Related

how to get min values of columns by rolling another columns?

GROUP_NAV_DATE GROUP_REH_VALUE target
0 2018/11/29 1 1.06
1 2018/11/30 1.0029 1.063074
2 2018/12/3 1.03 1.0918
3 2018/12/4 1.032 1.09392
4 2018/12/5 1.0313 1.093178
5 2020/12/6 1.034 1.09604
6 2020/12/8 1.062 1.12572
7 2020/12/9 1.07 1.1342
8 2020/12/10 1 1.06
9 2020/12/11 0.99 1.0494
10 2020/12/12 0.96 1.0176
11 2020/12/13 1.062 1.12572
goal
Create a first_date column whose value comes from GROUP_NAV_DATE. The logic: for each row, find the first later row whose GROUP_REH_VALUE exceeds that row's target, and take its GROUP_NAV_DATE; the matched date must be later than the row's own date.
For example, for index 0 with GROUP_REH_VALUE=1, the first match is 2020/12/8. For index 9, the first match is 2020/12/13, not 2020/12/8 (earlier dates do not count).
Note: for each row, target is 1.06*GROUP_REH_VALUE.
Expect
GROUP_NAV_DATE GROUP_REH_VALUE target first_date
0 2018/11/29 1 1.06 2020/12/8
1 2018/11/30 1.0029 1.063074 2020/12/9
2 2018/12/3 1.03 1.0918 NA
3 2018/12/4 1.032 1.09392 NA
4 2018/12/5 1.0313 1.093178 NA
5 2020/12/6 1.034 1.09604 NA
6 2020/12/8 1.062 1.12572 NA
7 2020/12/9 1.07 1.1342 NA
8 2020/12/10 1 1.06 2020/12/13
9 2020/12/11 0.99 1.0494 2020/12/13
10 2020/12/12 0.96 1.0176 2020/12/13
11 2020/12/13 1.062 1.12572 NA
Try
I tried rolling and idxmin, but since the condition depends on another column, I could not get the answer.
You can use expanding, but this code only works because:
There is a direct relation between GROUP_REH_VALUE and target (target = 1.06*GROUP_REH_VALUE), so the target column is not needed.
You have a numeric index: expanding checks that the return value is numeric, so you would get TypeError: must be real number, not str if GROUP_NAV_DATE were the index.
import numpy as np

def f(sr):
    # sr is the reversed GROUP_REH_VALUE series seen so far; the current row is sr.iloc[-1]
    m = sr.iloc[-1] * 1.06 < sr
    return sr[m].last_valid_index() if sum(m) else np.nan

# Need to reverse the dataframe because you are looking forward.
idx = df.loc[::-1, 'GROUP_REH_VALUE'].expanding().apply(f).dropna()

# Set the dates
df.loc[idx.index, 'first_time'] = df.loc[idx, 'GROUP_NAV_DATE'].tolist()
Output:
>>> df
GROUP_NAV_DATE GROUP_REH_VALUE target first_time
0 2018/11/29 1.0000 1.060000 2020/12/8
1 2018/11/30 1.0029 1.063074 2020/12/9
2 2018/12/3 1.0300 1.091800 NaN
3 2018/12/4 1.0320 1.093920 NaN
4 2018/12/5 1.0313 1.093178 NaN
5 2020/12/6 1.0340 1.096040 NaN
6 2020/12/8 1.0620 1.125720 NaN
7 2020/12/9 1.0700 1.134200 NaN
8 2020/12/10 1.0000 1.060000 2020/12/13
9 2020/12/11 0.9900 1.049400 2020/12/13
10 2020/12/12 0.9600 1.017600 2020/12/13
11 2020/12/13 1.0620 1.125720 NaN
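If readability matters more than speed, a plain forward-looking loop gives the same first_time values; a minimal sketch, assuming the sample frame above is named df:
import numpy as np

vals = df['GROUP_REH_VALUE'].to_numpy()
dates = df['GROUP_NAV_DATE'].to_numpy()

first = []
for i, target in enumerate(vals * 1.06):
    # positions after row i whose value strictly exceeds this row's target
    later = np.nonzero(vals[i + 1:] > target)[0]
    first.append(dates[i + 1 + later[0]] if len(later) else np.nan)

df['first_time'] = first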

pandas concat compare 2 dataframes with same column names how to build a differ column?

I'm using pandas concat to compare 2 dataframes that have the same columns and rows:
import pandas as pd

df = pd.read_csv(r'C:\Users\compare\T1.csv')
df2 = pd.read_csv(r'C:\Users\compare\T2.csv')

index = ['contract', 'RB', 'payee', 'fund']
df_all = pd.concat([df.set_index(index), df2.set_index(index)],
                   axis='columns', keys=['First', 'Second'])
df_final = df_all.swaplevel(axis='columns')[df.columns[54:56]]
df_final
The output is:
SD1 SD2
First Second First Second
contract RB payee fund
AG72916Z 2 1 W42 15622.9 15622.9 15622.9 15489.2
4 1 W44 14697.8 14697.8 14697.8 14572.1
8 1 W48 7388.56 7388.56 7388.56 7325.37
AL0024AZ C3 1 202 226.585 226.59 220.366 220.37
S3 1 204 804.059 804.06 781.99 781.99
My question is: how can I add a differ column after each Second, so that I can easily tell the comparison result? The output should look like this:
SD1 SD2
First Second differ First Second differ
contract RB payee fund
AG72916Z 2 1 W42 15622.9 15622.9 0 15622.9 15489.2 133.7
4 1 W44 14697.8 14697.8 0 14697.8 14572.1 125.7
8 1 W48 7388.56 7388.56 0 7388.56 7325.37 63.19
AL0024AZ C3 1 202 226.585 226.59 0.05 220.366 220.37 -0.004
S3 1 204 804.059 804.06 0.01 781.99 781.99 0
A bit tricky but necessary to keep ordering:
out = (df_final.stack(level=0)
               .assign(Diff=lambda x: x['First'] - x['Second'])
               .stack()
               .unstack(level=[-2, -1]))
print(out)
# Output
SD1 SD2
First Second Diff First Second Diff
contract RB payee fund
AG72916Z 2 1 W42 15622.900 15622.90 0.000 15622.900 15489.20 133.700
4 1 W44 14697.800 14697.80 0.000 14697.800 14572.10 125.700
8 1 W48 7388.560 7388.56 0.000 7388.560 7325.37 63.190
AL0024AZ C3 1 202 226.585 226.59 -0.005 220.366 220.37 -0.004
S3 1 204 804.059 804.06 -0.001 781.990 781.99 0.000
Update
What if I want to select only the rows where Diff is larger than 100?
Use:
>>> out[out.loc[:, (slice(None), 'Diff')].gt(100).any(1)]
SD1 SD2
First Second Diff First Second Diff
contract RB payee fund
AG72916Z 2 1 W42 15622.9 15622.9 0.0 15622.9 15489.2 133.7
4 1 W44 14697.8 14697.8 0.0 14697.8 14572.1 125.7
# Same result with
# idx = pd.IndexSlice
# out[out.loc[:, idx[:, 'Diff']].gt(100).any(1)]
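A more explicit variant of the same reshaping, as a hedged sketch: compute the Diff columns with xs and put the columns back in First/Second/Diff order, assuming df_final has SD1/SD2 on the outer column level and First/Second on the inner, as above:
import pandas as pd

diff = (df_final.xs('First', axis=1, level=1)
        - df_final.xs('Second', axis=1, level=1))
diff.columns = pd.MultiIndex.from_product([diff.columns, ['Diff']])

both = pd.concat([df_final, diff], axis=1)
order = [(top, sub)
         for top in df_final.columns.get_level_values(0).unique()
         for sub in ['First', 'Second', 'Diff']]
out = both[order]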

What would be the best way to perform operations against a significant number of rows from one df to one row from another?

If I have df1:
A B C D
0 4.51 6.212 3.12 1
1 3.12 3.444 1.12 1
2 6.98 7.413 7.02 0
3 4.51 8.916 5.12 1
....
n1 ~ 2000
and df2
A B C D
0 4.51 6.212 3.12 1
1 3.12 3.444 1.12 1
2 6.98 7.413 7.02 0
3 4.51 8.916 5.12 1
....
n2 = 10000+
And have to perform an operation like:
df12 =
df1[0,A]-df2[0,A] df1[0,B]-df2[0,B] df1[0,C]-df2[0,C]....
df1[0,A]-df2[1,A] df1[0,B]-df2[1,B] df1[0,C]-df2[1,C]
...
df1[0,A]-df2[n2,A] df1[0,B]-df2[n2,B] df1[0,C]-df2[n2,C]
...
df1[1,A]-df2[0,A] df1[1,B]-df2[0,B] df1[1,C]-df2[0,C]....
df1[1,A]-df2[1,A] df1[1,B]-df2[1,B] df1[1,C]-df2[1,C]
...
df1[1,A]-df2[n2,A] df1[1,B]-df2[n2,B] df1[1,C]-df2[n2,C]
...
df1[n1,A]-df2[0,A] df1[n1,B]-df2[0,B] df1[n1,C]-df2[0,C]....
df1[n1,A]-df2[1,A] df1[n1,B]-df2[1,B] df1[n1,C]-df2[1,C]
...
df1[n1,A]-df2[n2,A] df1[n1,B]-df2[n2,B] df1[n1,C]-df2[n2,C]
Where every row in df1 is compared against every row in df2 producing a score.
What would be the best way to perform this operation using either pandas or vaex/equivalent?
Thanks in advance!
Broadcasting is the way to go:
pd.DataFrame((df1.to_numpy()[:, None] - df2.to_numpy()[None, ...]).reshape(-1, df1.shape[1]),
             columns=df2.columns,
             index=pd.MultiIndex.from_product((df1.index, df2.index)))
Output (using only the first three rows of df1 and the first two rows of df2):
A B C D
0 0 0.00 0.000 0.0 0.0
1 1.39 2.768 2.0 0.0
1 0 -1.39 -2.768 -2.0 0.0
1 0.00 0.000 0.0 0.0
2 0 2.47 1.201 3.9 -1.0
1 3.86 3.969 5.9 -1.0
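If len(df1) * len(df2) is too large to materialize in one go, the same broadcasting can be done in chunks of df1 rows; a hedged sketch (the chunk size of 256 is only an illustration):
import pandas as pd

def pairwise_diff_chunked(df1, df2, chunk=256):
    a2 = df2.to_numpy()
    pieces = []
    for start in range(0, len(df1), chunk):
        a1 = df1.iloc[start:start + chunk].to_numpy()
        # broadcast (chunk, 1, ncols) - (1, len(df2), ncols), then flatten the pairs
        block = (a1[:, None, :] - a2[None, :, :]).reshape(-1, a1.shape[1])
        idx = pd.MultiIndex.from_product((df1.index[start:start + chunk], df2.index))
        pieces.append(pd.DataFrame(block, columns=df2.columns, index=idx))
    return pd.concat(pieces)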
I would use openpyxl
A loop like this would do (with min_r, max_r, start_col and end_col set to the ranges you need):
for row in sheet.iter_rows(min_row=min_r, min_col=start_col, max_col=end_col, max_row=max_r):
    for cell in row:
        df1 = cell.value
for row in sheet.iter_rows(min_row=min_r, min_col=start_col, max_col=end_col, max_row=max_r):
    for cell in row:
        df2 = cell.value
From here, what do you want to do: create new values? Put them where? The code above points at the cells you would work with.
An interesting question, which vaex can actually solve quite memory-efficiently (although we should be able to require practically no memory at all in the future).
Let's start by creating the vaex dataframes, and increase the numbers a bit, to 2,000 and 200,000 rows.
import vaex
import numpy as np
names = "ABCD"
N = 2000
M = N * 100
print(f'{N*M:,} rows')
400,000,000 rows
df1 = vaex.from_dict({name + '1': np.random.random(N) * 6 for name in names})
# We add a virtual range column for joining (requires no memory)
df1['i1'] = vaex.vrange(0, N, dtype=np.int32)
print(df1)
# A1 B1 C1 D1 i1
0 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 0
1 5.873731485927979 5.669031702051764 5.696571067838359 1.0310578585207142 1
2 4.513310303419997 4.466469647700519 5.047406986222205 3.4417402924374407 2
3 0.43402400660624174 1.157476656465433 2.179139262842482 1.1666706679131253 3
4 3.3698854360766526 2.203558794966768 0.39649910973621827 2.5576740079630502 4
... ... ... ... ... ...
1,995 4.836227485536714 4.093067389612236 5.992282902119859 1.3549691660861871 1995
1,996 1.1157617217838995 1.1606619796004967 3.2771620798090533 4.249631266421745 1996
1,997 4.628846984287445 4.019449674317169 3.7307713985954947 3.7702606362049362 1997
1,998 1.3196727531762933 2.6758762345410565 3.249315566523623 2.6501467681546123 1998
1,999 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 1999
df2 = vaex.from_dict({name + '2': np.random.random(M) * 6 for name in names})
df2['i2'] = vaex.vrange(0, M, dtype=np.int32)
print(df2)
# A2 B2 C2 D2 i2
0 2.6928366822161234 3.227321076730826 5.944154034728931 3.3584747680090814 0
1 4.824761575636117 2.960087600265437 3.492601702004836 1.054879207993813 1
2 4.33510613806528 0.46883404117516103 5.632361155412736 0.436374137912523 2
3 0.0422709543055384 2.5319705848478855 3.5596949266321216 2.5364151309685354 3
4 2.749335843510271 3.5446979145461146 2.550223710733076 5.02069361871291 4
... ... ... ... ... ...
199,995 5.32205669155252 4.321667991189379 2.1192950613326182 5.937425946574905 199995
199,996 0.10746705113978328 4.104809740632655 0.6282195590464632 3.9603843538752974 199996
199,997 5.74108180127652 3.5863223687990136 4.64031507831471 4.610807268734913 199997
199,998 5.839402924722246 2.630974123991404 4.50411700551054 3.0960758923309983 199998
199,999 1.6954091816701466 1.8054911765387567 4.300317113825266 4.900845720973579 199999
Now we create our 'master' vaex dataframe, which requires no memory at all; it is made of a virtual column and two expressions (stored as virtual columns):
df = vaex.from_arrays(i=vaex.vrange(0, N*M, dtype=np.int64))
df['i1'] = df.i // M # index to df1
df['i2'] = df.i % M # index to df2
print(df)
# i i1 i2
0 0 0 0
1 1 0 1
2 2 0 2
3 3 0 3
4 4 0 4
... ... ... ...
399,999,995 399999995 1999 199995
399,999,996 399999996 1999 199996
399,999,997 399999997 1999 199997
399,999,998 399999998 1999 199998
399,999,999 399999999 1999 199999
Unfortunately vaex cannot use these integer indices as lookups for joining directly; it has to go through a hashmap. So there is room for improvement in vaex here. If vaex could do this, we could scale this idea up to trillions of rows.
print(f"The next two joins require ~{len(df)*8*2//1024**2:,} MB of RAM")
The next two joins require ~6,103 MB of RAM
df_big = df.join(df1, on='i1')
df_big = df_big.join(df2, on='i2')
print(df_big)
# i i1 i2 A1 B1 C1 D1 A2 B2 C2 D2
0 0 0 0 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 2.6928366822161234 3.227321076730826 5.944154034728931 3.3584747680090814
1 1 0 1 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 4.824761575636117 2.960087600265437 3.492601702004836 1.054879207993813
2 2 0 2 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 4.33510613806528 0.46883404117516103 5.632361155412736 0.436374137912523
3 3 0 3 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 0.0422709543055384 2.5319705848478855 3.5596949266321216 2.5364151309685354
4 4 0 4 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 2.749335843510271 3.5446979145461146 2.550223710733076 5.02069361871291
... ... ... ... ... ... ... ... ... ... ... ...
399,999,995 399999995 1999 199995 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.32205669155252 4.321667991189379 2.1192950613326182 5.937425946574905
399,999,996 399999996 1999 199996 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 0.10746705113978328 4.104809740632655 0.6282195590464632 3.9603843538752974
399,999,997 399999997 1999 199997 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.74108180127652 3.5863223687990136 4.64031507831471 4.610807268734913
399,999,998 399999998 1999 199998 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.839402924722246 2.630974123991404 4.50411700551054 3.0960758923309983
399,999,999 399999999 1999 199999 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 1.6954091816701466 1.8054911765387567 4.300317113825266 4.900845720973579
Now we have our big dataframe, and we only need to do the computation, which uses virtual columns and thus requires no extra memory.
# Add virtual columns (which require no memory)
for name in names:
    df_big[name] = df_big[name + '1'] - df_big[name + '2']
print(df_big[['A', 'B', 'C', 'D']])
# A B C D
0 3.17088337394884 0.13557709240744487 -2.6561095686690526 -3.2102593816457916
1 1.038958480528846 0.40281056887283384 -0.20455723594495767 -0.9066638216305232
2 1.5286139180996834 2.8940641279631096 -2.344316689352858 -0.2881587515492332
3 5.821449101859425 0.8309275842903854 -0.2716504605722432 -2.3881997446052456
4 3.1143842126546923 -0.18179974540784372 0.7378207553268026 -4.87247823234962
... ... ... ... ...
399,999,995 -0.30815443055055525 -4.249263637083556 -1.244858612724894 -3.257308187087283
399,999,996 4.906435209862181 -4.032405386526832 0.2462168895612611 -1.2802665943876756
399,999,997 -0.7271795402745553 -3.51391801469319 -3.765878629706985 -1.930689509247291
399,999,998 -0.8255006637202813 -2.5585697698855805 -3.629680556902816 -0.41595813284337657
399,999,999 3.318493079331818 -1.7330868224329334 -3.4258806652175418 -2.220727961485957
If we had to do this all in memory, how much RAM would that have required?
print(f"This would otherwise require {len(df_big) * (4*3*8)//1023**2:,} MB of RAM")
This would otherwise require 36,692 MB of RAM
So, quite efficient I would say, and in the future it would be interesting to see if we can do the join more efficiently, and require practically zero RAM for this problem.

How to insert rows with 0 data for missing quarters into a pandas dataframe?

I have a dataframe with specific Quota values for given quarters (YYYY-Qx format), and need to visualize them with some line charts. However, some of the quarters are missing (as there was no Quota during those quarters).
Period Quota
2017-Q1 500
2017-Q3 600
2018-Q2 700
I want to add them (starting at 2017-Q1 until today, so 2019-Q2) to the dataframe with a default value of 0 in the Quota column. A desired output would be the following:
Period Quota
2017-Q1 500
2017-Q2 0
2017-Q3 600
2017-Q4 0
2018-Q1 0
2018-Q2 700
2018-Q3 0
2018-Q4 0
2019-Q1 0
2019-Q2 0
I tried
df['Period'] = pd.to_datetime(df['Period']).dt.to_period('Q')
And then resampling the df with 'Q' frequency, but I must be doing something wrong, as it doesn't help with anything.
Any help would be much appreciated.
Use:
df.index = pd.to_datetime(df['Period']).dt.to_period('Q')
end = pd.Period(pd.Timestamp.now(), freq='Q')
df = (df['Quota'].reindex(pd.period_range(df.index.min(), end, freq='Q'), fill_value=0)
                 .rename_axis('Period')
                 .reset_index())
df['Period'] = df['Period'].dt.strftime('%Y-Q%q')
print(df)
Period Quota
0 2017-Q1 500
1 2017-Q2 0
2 2017-Q3 600
3 2017-Q4 0
4 2018-Q1 0
5 2018-Q2 700
6 2018-Q3 0
7 2018-Q4 0
8 2019-Q1 0
9 2019-Q2 0
# An alternate solution based on a left join
qtr = ['Q1', 'Q2', 'Q3', 'Q4']
finl = []
for i in range(2017, 2020):
    for j in qtr:
        finl.append(str(i) + '_' + j)

df1 = pd.DataFrame({'year_qtr': finl})

original_value = ['2017_Q1', '2017_Q3', '2018_Q2']
df_original = pd.DataFrame({'year_qtr': original_value,
                            'value': [500, 600, 700]})

final = pd.merge(df1, df_original, how='left', on=['year_qtr'])
final = final.fillna(0)
Output
year_qtr value
0 2017_Q1 500.0
1 2017_Q2 0.0
2 2017_Q3 600.0
3 2017_Q4 0.0
4 2018_Q1 0.0
5 2018_Q2 700.0
6 2018_Q3 0.0
7 2018_Q4 0.0
8 2019_Q1 0.0
9 2019_Q2 0.0
10 2019_Q3 0.0
11 2019_Q4 0.0
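The same quarter scaffold can be built without the nested loops by using period_range up to the current quarter; a hedged sketch, reusing the df_original frame defined above:
import pandas as pd

quarters = pd.period_range('2017Q1', pd.Timestamp.now(), freq='Q')
scaffold = pd.DataFrame({'year_qtr': quarters.strftime('%Y_Q%q')})

final = scaffold.merge(df_original, on='year_qtr', how='left').fillna({'value': 0})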

Data exploration JSON nested data in pandas

How can I get my JSON data into a reasonable data frame? I have a deeply nested file which I aim to get into a large data frame. Everything is described in the GitHub repository below:
http://www.github.com/simongraham/dataExplore.git
With nested JSON, you will need to walk through the levels, extracting the needed segments. For the nutrition segment of the larger JSON, consider iterating through every nutritionPortions level, each time running the pandas normalization and concatenating to the final dataframe:
import pandas as pd
import json

with open('/Users/simongraham/Desktop/Kaido/Data/kaidoData.json') as f:
    data = json.load(f)

# INITIALIZE DF
nutrition = pd.DataFrame()

# ITERATIVELY CONCATENATE
for item in data[0]["nutritionPortions"]:
    if 'ftEnergyKcal' in item.keys():                    # MISSING IN 3 OF 53 LEVELS
        temp = (pd.io
                  .json
                  .json_normalize(item, 'nutritionNutrients',
                                  ['vcNutritionId', 'vcUserId', 'vcPortionId', 'vcPortionName',
                                   'vcPortionSize', 'ftEnergyKcal', 'vcPortionUnit', 'dtConsumedDate']))
        nutrition = pd.concat([nutrition, temp])

nutrition.head()
Output
ftValue nPercentRI vcNutrient vcNutritionPortionId \
0 0.00 0.0 alcohol c993ac30-ecb4-4154-a2ea-d51dbb293f66
1 0.00 0.0 bcfa c993ac30-ecb4-4154-a2ea-d51dbb293f66
2 7.80 6.0 biotin c993ac30-ecb4-4154-a2ea-d51dbb293f66
3 49.40 2.0 calcium c993ac30-ecb4-4154-a2ea-d51dbb293f66
4 1.82 0.0 carbohydrate c993ac30-ecb4-4154-a2ea-d51dbb293f66
vcTrafficLight vcUnit dtConsumedDate \
0 g 2016-04-12T00:00:00
1 g 2016-04-12T00:00:00
2 µg 2016-04-12T00:00:00
3 mg 2016-04-12T00:00:00
4 g 2016-04-12T00:00:00
vcNutritionId ftEnergyKcal \
0 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
1 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
2 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
3 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
4 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
vcUserId vcPortionName vcPortionSize \
0 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
1 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
2 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
3 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
4 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
vcPortionId vcPortionUnit
0 2 ml
1 2 ml
2 2 ml
3 2 ml
4 2 ml
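On current pandas versions json_normalize is available at the top level, so the same normalization can be written as below; a hedged sketch with the file path shortened, keeping the record path and meta fields from the answer above:
import json
import pandas as pd

with open('kaidoData.json') as f:     # shortened path; use the real location
    data = json.load(f)

nutrition = pd.concat(
    (pd.json_normalize(item, 'nutritionNutrients',
                       ['vcNutritionId', 'vcUserId', 'vcPortionId', 'vcPortionName',
                        'vcPortionSize', 'ftEnergyKcal', 'vcPortionUnit', 'dtConsumedDate'])
     for item in data[0]['nutritionPortions']
     if 'ftEnergyKcal' in item),       # skip the few portions missing ftEnergyKcal
    ignore_index=True,
)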
