Get a DataFrame from a confusing dictionary data structure - Python

I have a dictionary like the one below:
{1: ds yhat yhat_lower yhat_upper
30 2015-08-09 49.908927 31.632462 66.742083
31 2015-08-16 49.750056 34.065527 67.069122
32 2015-08-23 49.591185 32.620258 67.403908
33 2015-08-30 49.432314 32.257891 67.541757
34 2015-09-06 72.395618 55.612973 89.711030
35 2015-09-13 49.114572 32.199945 66.255518
36 2015-09-20 48.955701 30.759960 66.118051,
2: ds yhat yhat_lower yhat_upper
30 2015-08-09 38.001931 23.583157 51.291784
31 2015-08-16 37.922999 25.370967 50.504328
32 2015-08-23 37.844068 23.743860 51.143868
33 2015-08-30 37.765136 24.903955 50.309284
34 2015-09-06 39.227773 25.089493 52.719935
35 2015-09-13 37.607273 24.370609 51.313454
36 2015-09-20 37.528341 23.395560 50.499454}
I want to get a DataFrame like this as output:
ProductCode ds yhat yhat_lower yhat_upper
1 2015-08-09 49.908927 31.632462 66.742083
1 2015-08-16 49.750056 34.065527 67.069122
1 2015-08-23 49.591185 32.620258 67.403908
1 2015-08-30 49.432314 32.257891 67.541757
1 2015-09-06 72.395618 55.612973 89.711030
1 2015-09-13 49.114572 32.199945 66.255518
1 2015-09-20 48.955701 30.759960 66.118051
2 2015-08-09 38.001931 23.583157 51.291784
2 2015-08-16 37.922999 25.370967 50.504328
2 2015-08-23 37.844068 23.743860 51.143868
2 2015-08-30 37.765136 24.903955 50.309284
2 2015-09-06 39.227773 25.089493 52.719935
2 2015-09-13 37.607273 24.370609 51.313454
2 2015-09-20 37.528341 23.395560 50.499454
My failed attempt:
new_df = pd.DataFrame(df.items(), columns=['ProductCode', 'yhat'])
print(new_df)
ProductCode yhat
0 1 ds yhat yhat_lower yhat_upp...
1 2 ds yhat yhat_lower yhat_upp...
In the dictionary, the whole block of ds/yhat/yhat_lower/yhat_upper headers and their values was taken as a single entry of dict.values(). How do I turn the header part into DataFrame columns and the numeric part into those columns' values?

Let us try pd.concat:
yourdf = pd.concat(d).reset_index(level=0)
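Here the dict keys become the outermost level of a MultiIndex, and reset_index(level=0) moves them into a column. A slightly fuller sketch (assuming the dictionary is named d and each value is a DataFrame, as in the question):

import pandas as pd

out = (
    pd.concat(d, names=['ProductCode'])  # dict keys -> index level 0
      .reset_index(level=0)              # move ProductCode into a column
      .reset_index(drop=True)            # drop the leftover row labels (30, 31, ...)
)
print(out)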

Related

How could I create sum of column itself dynamically in python

My raw data is:
def f_ST(ST, F, T):
    a = ST/F - 1 - np.log(ST/F)
    return 2*a/T

df = pd.DataFrame(range(50, 140, 5), columns=['K'])
df['f(K0)'] = df.apply(lambda x: f_ST(x.K, 100, 0.25), axis=1)
df['f(K1)'] = df['f(K0)'].shift(-1)
df['dK'] = df['K'].diff(1)
The thing I want to do is: I have a function f(k)
f(k) = (k - 100)/100 - ln(k/100)
I want to calculate w, which goes through the following steps:
get the 1-period forward value of f(k), then calculate
tmpw(k) = (f1_f(k) - f(k)) / dk
w is calculated as
w[0] = tmpw[0]
w[n] = tmpw[n] - (w[0] + w[1] + ... + w[n-1])
And the result look like
nbr date k f(k) f1_f(k) d_k tmpw w
10 2019-02-19 100 0.000000 0.009679 5.0 0.001936 0.001936
11 2019-02-19 105 0.009679 0.037519 5.0 0.005568 0.003632
12 2019-02-19 110 0.037519 0.081904 5.0 0.008877 0.003309
13 2019-02-19 115 0.081904 0.141428 5.0 ...
14 2019-02-19 120 0.141428 0.214852 5.0 ...
15 2019-02-19 125 0.214852 0.301086 5.0
16 2019-02-19 130 0.301086 0.399163 5.0
Question: could anyone help derive quick code for this (not mathematically) without using a loop?
Thanks a lot!
I don't fully understand your question; all that notation was a bit confusing to me.
If I understood you correctly: for every row you want an accumulated value over all previous rows, and then the value of another column in that row is calculated based on this accumulated value.
In that case I would prefer to calculate an accumulated column first and use it afterwards.
For example (note that you need to call list(range(...)) instead of passing range(...) directly; as written, your example throws an error):
import pandas as pd
import numpy as np

def f_ST(ST, F, T):
    a = ST/F - 1 - np.log(ST/F)
    return 2*a/T

df = pd.DataFrame(list(range(50, 140, 5)), columns=['K'])
df['f(K0)'] = df.apply(lambda x: f_ST(x.K, 100, 0.25), axis=1)
df['f(K1)'] = df['f(K0)'].shift(-1)
df['dK'] = df['K'].diff(1)
df['accumulate'] = df['K'].shift(1).cumsum()
df['currentVal-accumulated'] = df['K'] - df['accumulate']
print(df)
prints:
K f(K0) ... accumulate currentVal-accumulated
0 50 1.545177 ... NaN NaN
1 55 1.182696 ... 50.0 5.0
2 60 0.886605 ... 105.0 -45.0
3 65 0.646263 ... 165.0 -100.0
4 70 0.453400 ... 230.0 -160.0
5 75 0.301457 ... 300.0 -225.0
6 80 0.185148 ... 375.0 -295.0
7 85 0.100151 ... 455.0 -370.0
8 90 0.042884 ... 540.0 -450.0
9 95 0.010346 ... 630.0 -535.0
10 100 0.000000 ... 725.0 -625.0
11 105 0.009679 ... 825.0 -720.0
12 110 0.037519 ... 930.0 -820.0
13 115 0.081904 ... 1040.0 -925.0
14 120 0.141428 ... 1155.0 -1035.0
15 125 0.214852 ... 1275.0 -1150.0
16 130 0.301086 ... 1400.0 -1270.0
17 135 0.399163 ... 1530.0 -1395.0
[18 rows x 6 columns]
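A note on the w column itself: if the recursion really is w[0] = tmpw[0] and w[n] = tmpw[n] - (w[0] + ... + w[n-1]), then the running sum of w up to n telescopes to exactly tmpw[n], so w is just the first difference of tmpw and no loop is needed. A minimal sketch, assuming a tmpw column has already been computed:

df['w'] = df['tmpw'].diff().fillna(df['tmpw'])  # w[0] = tmpw[0]; w[n] = tmpw[n] - tmpw[n-1]

This matches the sample output above: 0.005568 - 0.001936 = 0.003632 and 0.008877 - 0.005568 = 0.003309.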

Pandas: how to select rows in data frame based on condition of a specific value on a specific column [duplicate]

This question already has answers here:
Pandas split DataFrame by column value
(5 answers)
Closed 3 years ago.
I have a given data frame as below example:
0 1 2 3 4 5 6 7 8
0 842517 M 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869
1 84300903 M 19.69 21.25 130 1203 0.1096 0.1599 0.1974
2 84348301 M 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414
3 843786 M 12.45 15.7 82.57 477.1 0.1278 0.17 0.1578
4 844359 M 18.25 19.98 119.6 1040 0.09463 0.109 0.1127
And I wrote a function that should split the dataset into two data frames, by comparing the values in a specific column against a given value.
For example, if I have col_idx = 2 and value=18.3 the result should be:
df1 - below the value:
0 1 2 3 4 5 6 7 8
2 84348301 M 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414
3 843786 M 12.45 15.7 82.57 477.1 0.1278 0.17 0.1578
4 844359 M 18.25 19.98 119.6 1040 0.09463 0.109 0.1127
df2 - above the value:
0 1 2 3 4 5 6 7 8
0 842517 M 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869
1 84300903 M 19.69 21.25 130 1203 0.1096 0.1599 0.1974
The function should look like:
def split_dataset(data_set, col_idx, value):
    below_df = ?
    above_df = ?
    return below_df, above_df
Can anybody complete my script please?
below_df = data_set[data_set[col_idx] < value]
above_df = data_set[data_set[col_idx] > value] # you have to deal with data_set[col_idx] == value though
You can use loc:
def split_dataset(data_set, col_idx, value):
    below_df = data_set.loc[data_set[col_idx] <= value]
    above_df = data_set.loc[data_set[col_idx] >= value]
    return below_df, above_df

df1, df2 = split_dataset(df, '2', 18.3)
Output:
df1
0 1 2 3 4 5 6 7 8
2 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.2839 0.2414
3 843786 M 12.45 15.70 82.57 477.1 0.12780 0.1700 0.1578
4 844359 M 18.25 19.98 119.60 1040.0 0.09463 0.1090 0.1127
df2
0 1 2 3 4 5 6 7 8
0 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869
1 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974
Note:
In this function call the column labels are numbers; you have to know their correct type (string or integer) before calling the function.
You should also define what happens when the dividing value (value) actually occurs in the column: with <= and >=, such rows end up in both frames.
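A minimal variant that makes the boundary explicit (assuming rows equal to value should land in the "above" frame; flip the comparison if you want them below):

def split_dataset(data_set, col_idx, value):
    below_df = data_set.loc[data_set[col_idx] < value]   # strictly below
    above_df = data_set.loc[data_set[col_idx] >= value]  # equal values go here
    return below_df, above_df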

How to get rows of a most recent day in the ascending order of time way when reading csv file?

I want to get the rows of the most recent day, in ascending order of time.
My dataframe looks as follows:
label uId adId operTime siteId slotId contentId netType
0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
Because this csv file has about 100 million rows, it is impossible to load it all into my PC's memory.
So I want to extract the rows of the most recent day, in ascending time order, while reading the csv file.
For example, if the most recent day is 2019-04-04, the output would be as follows:
# this is not real data, just an example
label uId adId operTime siteId slotId contentId netType
0 0 u147336431 3887 2019-04-04 00:08:42.315 1 54 2427 2
1 0 u146933269 1462 2019-04-04 01:06:16.417 30 36 1343 6
2 0 u139536523 2084 2019-04-04 02:08:58.079 15 23 1536 7
3 0 u106663472 1460 2019-04-04 03:21:13.050 32 45 1352 2
4 0 u121642861 2295 2019-04-04 04:36:08.653 3 33 3267 4
Could anyone help me?
Thanks in advance.
I'm assuming you can't read the entire file into memory, and the file is in a random order. You can read the file in chunks and iterate through the chunks.
# read 500,000 rows of the file at a time
reader = pd.read_csv(
    'csv_file.csv',
    parse_dates=['operTime'],
    chunksize=500_000,
    header=0
)
recent_day = pd.Timestamp(2019, 4, 4)
next_day = recent_day + pd.Timedelta(days=1)

df_list = []
for chunk in reader:
    # check if any rows match the date range
    date_rows = chunk.loc[
        (chunk['operTime'] >= recent_day) &
        (chunk['operTime'] < next_day)
    ]
    # append the dataframe of matching rows to the list
    if not date_rows.empty:
        df_list.append(date_rows)

final_df = pd.concat(df_list)
final_df = final_df.sort_values('operTime')
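Since the question says the most recent day is not known in advance, this chunked approach can be extended to a two-pass read: first scan the file for the maximum date, then run the filtering loop above with that day. A sketch, reusing the file and column names from the question:

import pandas as pd

# pass 1: find the most recent timestamp in the file
max_ts = pd.Timestamp.min
for chunk in pd.read_csv('csv_file.csv', usecols=['operTime'],
                         parse_dates=['operTime'], chunksize=500_000):
    max_ts = max(max_ts, chunk['operTime'].max())

recent_day = max_ts.normalize()  # midnight of the most recent day
# pass 2: filter with recent_day / next_day exactly as in the loop above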
Seconding what anky_91 said, sort_values() will be helpful here.
import pandas as pd
df = pd.read_csv('file.csv')
# >>> df
# label uId adId operTime siteId slotId contentId netType
# 0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
sub_df = df[(df['operTime']>'2019-03-31') & (df['operTime']<'2019-04-01')]
# >>> sub_df
# label uId adId operTime siteId slotId contentId netType
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
final_df = sub_df.sort_values(by=['operTime'])
# >>> final_df
# label uId adId operTime siteId slotId contentId netType
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
I think you could also use a datetimeindex here; that might be necessary if the file is sufficiently large.
As @anky_91 mentioned, you can use the sort_values function. Here is a short example of how it works:
df = pd.DataFrame({'Symbol': ['A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '08/21/2015']})
df.sort_values(by='Date')
Out:
         Date Symbol
1  01/15/2016      A
0  02/20/2015      A
2  08/21/2015      A
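One caveat: because these dates are MM/DD/YYYY strings, sort_values sorts them lexically, not chronologically (01/15/2016 sorts before 02/20/2015 even though it is a year later). A safer sketch is to convert to datetime first:

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df.sort_values(by='Date')  # now sorted chronologically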

How can I extract only numbers from this column?

Suppose you have a column in Excel with values like this... there are only 5500 numbers present, but it shows length 5602, which means 102 strings are present:
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
and I want to get only the numeric values, like this, in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? I have attached my code below (Python, using pandas).
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}', sle):
        return 1
    else:
        return 0

select['status'] = select['Selection No.'].apply(selection)
and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal to select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or, as another approach, the numbers module plus a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there is a problem when you extract the column: you use ['Selection No.'] but the name actually contains a trailing space, i.e. ['Selection No. ']; that is the reason you get a KeyError when executing it. Try it and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}', sle): tries to find the column value sle IN a match object, which "always has a boolean value of True" (and re.match returns None when there is no match, which is where your "'NoneType' is not iterable" error comes from).
I would suggest proceeding with the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required - use pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
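If the goal is to keep the numeric values themselves rather than just flag them, one more option (a sketch, not from the answers above) is pd.to_numeric with errors='coerce', which turns every non-number into NaN so it can be dropped:

nums = pd.to_numeric(df['Selection No.'], errors='coerce').dropna().astype(int)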

How to add a repeated column using pandas

I am doing my homework and I encountered a problem: I have a large matrix whose first column, Y002, is a nominal variable with 3 levels, encoded as 1, 2, 3. The other two columns, V96 and V97, are numeric.
Now I want to get the group means corresponding to the variable Y002. I wrote code like this:
group = data2.groupby(by=["Y002"]).mean()
Then I index into it to get each group's mean:
group1 = group["V96"]
group2 = group["V97"]
Now I want to append these group means as new columns to the original dataframe, with each mean matched to the corresponding Y002 code (1, 2 or 3). I tried this code, but it only shows NaN.
data2["group1"] = pd.Series(group1, index=data2.index)
Hope someone could help me with this, many thanks :)
PS: Hope this makes sense. In R we can do the same thing with
data2$group1 = with(data2, tapply(V97, Y002, mean))[data2$Y002]
But how can we implement this in Python and pandas?
You can use .transform()
import pandas as pd
import numpy as np
# your data
# ============================
np.random.seed(0)
df = pd.DataFrame({'Y002': np.random.randint(1,4,100), 'V96': np.random.randn(100), 'V97': np.random.randn(100)})
print(df)
V96 V97 Y002
0 -0.6866 -0.1478 1
1 0.0149 1.6838 2
2 -0.3757 0.9718 1
3 -0.0382 1.6077 2
4 0.3680 -0.2571 2
5 -0.0447 1.8098 3
6 -0.3024 0.8923 1
7 -2.2244 -0.0966 3
8 0.7240 -0.3772 1
9 0.3590 -0.5053 1
.. ... ... ...
90 -0.6906 1.5567 2
91 -0.6815 -0.4189 3
92 -1.5122 -0.4097 1
93 2.1969 1.1164 2
94 1.0412 -0.2510 3
95 -0.0332 -0.4152 1
96 0.0656 -0.6391 3
97 0.2658 2.4978 1
98 1.1518 -3.0051 2
99 0.1380 -0.8740 3
# processing
# ===========================
df['V96_mean'] = df.groupby('Y002')['V96'].transform(np.mean)
df['V97_mean'] = df.groupby('Y002')['V97'].transform(np.mean)
df
V96 V97 Y002 V96_mean V97_mean
0 -0.6866 -0.1478 1 -0.1944 0.0837
1 0.0149 1.6838 2 0.0497 -0.0496
2 -0.3757 0.9718 1 -0.1944 0.0837
3 -0.0382 1.6077 2 0.0497 -0.0496
4 0.3680 -0.2571 2 0.0497 -0.0496
5 -0.0447 1.8098 3 0.0053 -0.0707
6 -0.3024 0.8923 1 -0.1944 0.0837
7 -2.2244 -0.0966 3 0.0053 -0.0707
8 0.7240 -0.3772 1 -0.1944 0.0837
9 0.3590 -0.5053 1 -0.1944 0.0837
.. ... ... ... ... ...
90 -0.6906 1.5567 2 0.0497 -0.0496
91 -0.6815 -0.4189 3 0.0053 -0.0707
92 -1.5122 -0.4097 1 -0.1944 0.0837
93 2.1969 1.1164 2 0.0497 -0.0496
94 1.0412 -0.2510 3 0.0053 -0.0707
95 -0.0332 -0.4152 1 -0.1944 0.0837
96 0.0656 -0.6391 3 0.0053 -0.0707
97 0.2658 2.4978 1 -0.1944 0.0837
98 1.1518 -3.0051 2 0.0497 -0.0496
99 0.1380 -0.8740 3 0.0053 -0.0707
[100 rows x 5 columns]
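For the record, the original attempt showed NaN because group1 is indexed by the Y002 levels (1, 2, 3) rather than by data2's row index, so pd.Series(group1, index=data2.index) aligns almost nothing. The closest pandas analogue of the R tapply line is .map(); a sketch using the names from the question:

data2['group1'] = data2['Y002'].map(data2.groupby('Y002')['V97'].mean())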
