Creating dataframe from dict - python

first start by creating a list with some values:
list = ['SBSP3.SA', 'CSMG3.SA', 'CGAS5.SA']
I create an empty dictionary because that's the only way I found it to read several .csv files I want as a dataframe. And then I do a for loop to store my .csv files in the empty dictionary:
d = {}
d = {ticker: pd.read_csv('{}.csv'.format(ticker)) for ticker in list}
after that I can only call the dataframe by passing slices with the dictionary keys:
d['SBSP3.SA'].head(5)
Date High Low Open Close Volume Adj Close
0 2017-01-02 14.70 14.60 14.64 14.66 7525700.0 13.880955
1 2017-01-03 15.65 14.95 14.95 15.50 39947800.0 14.676315
2 2017-01-04 15.68 15.31 15.45 15.50 37071700.0 14.676315
3 2017-01-05 15.91 15.62 15.70 15.75 47586300.0 14.913031
4 2017-01-06 15.92 15.50 15.78 15.66 25592000.0 14.827814
I can't for example:
df = pd.DataFrame(d)
My question is:
Can I merge all these dataframes that I threw in dictionary (d) with axis = 1 to view it as one?
Breaking the head a lot here I managed to put all the dataframes together but I lost their key and I could not distinguish who is who, since the name of the columns is the same.
Can I name these keys in columns?
Example:
Date High_SBSP3.SA Low_SBSP3.SA Open_SBSP3.SA Close_SBSP3.SA Volume_SBSP3.SA Adj Close_SBSP3.SA
0 2017-01-02 14.70 14.60 14.64 14.66 7525700.0 13.880955
1 2017-01-03 15.65 14.95 14.95 15.50 39947800.0 14.676315
2 2017-01-04 15.68 15.31 15.45 15.50 37071700.0 14.676315
3 2017-01-05 15.91 15.62 15.70 15.75 47586300.0 14.913031
4 2017-01-06 15.92 15.50 15.78 15.66 25592000.0 14.827814

Don't use list as a variable name, it shadows the actual built-in list.
You don't need a dictionary, a simple list is enough to store all your dataframes.
Call pd.concat on this list - it should properly concatenate the dataframes one below the other, as long as they have the same column names.
ticker_list = ['SBSP3.SA', 'CSMG3.SA', 'CGAS5.SA']
pd_list = [pd.read_csv('{}.csv'.format(ticker)) for ticker in ticker_list]
df = pd.concat(pd_list)
Use df = pd.concat(pd_list, ignore_index=True) if you want to reset the indices when concatenating.

pd.merge will do what you want (including renaming columns) but since it only allows for merging two frames at a time the column names will not be consistent when repeating the merge. Thus you need to rename the columns manually before.
import pandas as pd
from functools import reduce
ticker_list = ['SBSP3.SA', 'CSMG3.SA', 'CGAS5.SA']
pd_list = [pd.read_csv('{}.csv'.format(ticker)) for ticker in ticker_list]
for idx, df in enumerate(pd_list):
old_names = df.columns[1:]
new_names = list(map(lambda x : x + '_' + ticker_list[idx] , old_names))
zipped = dict(zip(old_names, new_names))
df.rename(zipped, axis=1, inplace=True)
def dfmerge(x, y):
return pd.merge(x, y, on="date")
df = reduce(dfmerge, pd_list)
print(df)
Output (with my data):
date High_SBSP3.SA Low_SBSP3.SA Open_SBSP3.SA High_CSMG3.SA Low_CSMG3.SA Open_CSMG3.SA High_CGAS5.SA Low_CGAS5.SA Open_CGAS5.SA
0 2017-01-02 1 2 3 1 2 3 1 2 3
1 2017-01-03 4 5 6 4 5 6 4 5 6
2 2017-01-04 7 8 9 7 8 9 7 8 9
Hint: you may need to edit/delete your comment. Since I preferred to overwrite my previous answer instead of adding a new one.

Related

pandas groupby by customized year, e.g. a school year

In a pandas data frame I would like to find the mean values of a column, grouped by a 'customized' year.
An example would be to compute the mean values of school marks for a school year (e.g. Sep/YYYY to Aug/YYYY+1).
The pandas docs gives some information on offsets and business year etc., but I can't really make any sense out of that to get a working example.
Here is a minimal example where mean values of school marks are computed per year (Jan-Dec), which is what I do not want.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(low=1, high=5, size=36),
index=pd.date_range('2001-09-01', freq='M', periods=36),
columns=['marks'])
df_yearly = df.groupby(pd.Grouper(freq="A")).mean()
This could yield e.g.:
print(df):
marks
2001-09-30 1
2001-10-31 4
2001-11-30 2
2001-12-31 1
2002-01-31 4
2002-02-28 1
2002-03-31 2
2002-04-30 1
2002-05-31 3
2002-06-30 3
2002-07-31 3
2002-08-31 3
2002-09-30 4
2002-10-31 1
...
2003-11-30 4
2003-12-31 2
2004-01-31 1
2004-02-29 2
2004-03-31 1
2004-04-30 3
2004-05-31 4
2004-06-30 2
2004-07-31 2
2004-08-31 4
print(df_yearly):
marks
2001-12-31 2.000000
2002-12-31 2.583333
2003-12-31 2.666667
2004-12-31 2.375000
My desired output would correspond to something like:
2001-09/2002-08 mean_value
2002-09/2003-08 mean_value
2003-09/2004-08 mean_value
Many thanks!
We can manually compute the school years:
# if month>=9 we move it to the next year
school_years = df.index.year + (df.index.month>8).astype(int)
Another option is to use fiscal year starting from September:
school_years = df.index.to_period('Q-AUG').qyear
And we can groupby:
df.groupby(school_years).mean()
Output:
marks
2002 2.333333
2003 2.500000
2004 2.500000
One more approach
a = (df.index.month == 9).cumsum()
val = df.groupby(a, sort=False)['marks'].mean().reset_index()
dates = df.index.to_series().groupby(a, sort=False).agg(['first', 'last']).reset_index()
dates.merge(val, on='index')
Output
index first last marks
0 1 2001-09-30 2002-08-31 2.750000
1 2 2002-09-30 2003-08-31 2.333333
2 3 2003-09-30 2004-08-31 2.083333

Get data using row / col reference from two column values in another data frame

df1
Date APA AR BP-GB CDEV ... WLL WPX XEC XOM CL00-USA
0 2018-01-01 42.22 19.00 5.227 19.80 ... 26.48 14.07 122.01 83.64 60.42
1 2018-01-02 44.30 19.78 5.175 20.00 ... 27.37 14.31 125.51 85.03 60.37
2 2018-01-03 45.33 19.78 5.242 20.33 ... 27.99 14.39 126.20 86.70 61.63
3 2018-01-04 46.84 19.80 5.300 20.37 ... 28.11 14.44 128.66 86.82 62.01
4 2018-01-05 46.39 19.44 5.296 20.12 ... 27.79 14.24 127.82 86.75 61.44
df2
Date Ticker Event_Type Event_Description Price add
0 2018-11-19 XEC M&A REN 88.03 1
1 2018-03-28 CXO M&A RSPP 143.25 1
2 2018-08-14 FANG M&A EGN 133.75 1
3 2019-05-09 OXY M&A APC 56.33 1
4 2019-08-26 PDCE M&A SRCI 29.65 1
My goal is to update df2.['add'] by using df2['Ticker'] and df2['Date'] to pull the value from df1 ... so for example the first row in df2 is XEC on 2018-11-19... the code needs to first look at df1[XEC] and then pull the value that matches the 2018-11-19 row in df[Date]
My attempt was:
df_Events['add'] = df_Prices.loc[[df_Prices['Date']==df_Events['Date']],[df_Prices.columns==df_Events['Ticker']]]
Try:
df2 = df2.merge(df1.melt(value_vars=df1.columns.tolist()[1:], id_vars='date', value_name="add", var_name='Ticker').reset_index(), how='left')`
This should change df1 Tickers columns to a single column, and than merge the values in that column to df2.
One more approach may be as below (I had started looking at it, so I am putting here even though you have accepted the answer)
First convert dates into datetime object in both dataframes & set it as index ony in the first one (code below)
df1['Date']=pd.to_datetime(df1['Date'])
df1.set_index('Date',inplace=True)
df2['Date']=pd.to_datetime(df2['Date'])
Then use apply to get the values for each of the columns.
df2['add']=df2.apply(lambda x: df1.loc[(x['Date']),(x['Ticker'])], axis=1)
This will work only if dates & values for all tickers exist in both dataframes (hence will throw as 'KeyError'

Sort Values in DataFrame using Categorical Key without groupby Split Apply Combine

So... I have a Dataframe that looks like this, but much larger:
DATE ITEM STORE STOCK
0 2018-06-06 A L001 4
1 2018-06-06 A L002 0
2 2018-06-06 A L003 4
3 2018-06-06 B L001 1
4 2018-06-06 B L002 2
You can reproduce the same DataFrame with the following code:
import pandas as pd
import numpy as np
import itertools as it
lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')
df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0,5, size=len(df.ITEM))
I wanna calculate de STOCK difference between days in every pair ITEM-STORE and iterating over groups in a groupby object is easy using the function .diff() to get something like this:
DATE ITEM STORE STOCK DELTA
0 2018-06-06 A L001 4 NaN
9 2018-06-07 A L001 0 -4.0
18 2018-06-08 A L001 4 4.0
27 2018-06-09 A L001 0 -4.0
36 2018-06-10 A L001 3 3.0
45 2018-06-11 A L001 2 -1.0
54 2018-06-12 A L001 2 0.0
I´ve manage to do so by the following code:
gg = df.groupby([df.ITEM, df.STORE])
lg = []
for (name, group) in gg:
aux = group.copy()
aux.reset_index(drop=True, inplace=True)
aux['DELTA'] = aux.STOCK.diff().fillna(value=0, inplace=Tr
lg.append(aux)
df = pd.concat(lg)
But in a large DataFrame, it gets impracticable. Is there a faster more pythonic way to do this task?
I've tried to improve your groupby code, so this should be a lot faster.
v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
df['DELTA'] = np.where(np.isnan(v), 0, v)
Some pointers/ideas here:
Don't iterate over groups
Don't pass series as the groupers if the series belong to the same DataFrame. Pass string labels instead.
diff can be vectorized
The last line is tantamount to a fillna, but fillna is slower than np.where
Specifying sort=False will prevent the output from being sorted by grouper keys, improving performance further
This can also be re-written as
df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)

Selecting column values of a dataframe which is in a range and put it in appropriate columns of another dataframe in pandas

I have a csv file which is something like below
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select those column values which is > 20 and < 500 that is (20 to 500) and put those values along with date in another column of a dataframe.The other dataframe looks something like this
Date percentage_change location
2018-02-14 23.44 BOM
So I want to get the date, value from the csv and add it into the new dataframe at appropriate columns.Something like
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
Now I am aware of functions like df.max(axis=1) and df.min(axis=1) which gives you the min and max but not sure for finding values based on a range.So how can this be achieved?
Given dataframes df1 and df2, you can achieve this via aligning column names, cleaning numeric data, and then using pd.DataFrame.append.
df_app = df1.loc[:, ['date', 'mean', 'min', 'std']]\
.rename(columns={'date': 'Date'})\
.replace(np.inf, 0)\
.fillna(0)
print(df_app)
df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])
print(df_app)
df_app = df_app[df_app['percentage_change'].between(20, 500)]
res = df2.append(df_app.loc[:, ['Date', 'percentage_change']])
print(res)
# Date location percentage_change
# 0 2018-02-14 BOM 23.440000
# 0 2018-03-15 NaN 100.000000
# 1 2018-03-16 NaN 90.000000
# 2 2018-03-17 NaN 143.219177
# 3 2018-03-18 NaN 100.000000
# 4 2018-03-20 NaN 100.000000
# 5 2018-03-22 NaN 119.053837

Pandas Reindex - Fill Column with Missing Values

I tried several examples of this topic but with no results. I'm reading a DataFrame like:
Code,Counts
10006,5
10011,2
10012,26
10013,20
10014,17
10015,2
10018,2
10019,3
How can I get another DataFrame like:
Code,Counts
10006,5
10007,NaN
10008,NaN
...
10011,2
10012,26
10013,20
10014,17
10015,2
10016,NaN
10017,NaN
10018,2
10019,3
Basically filling the missing values of the 'Code' Column? I tried the df.reindex() method but I can't figure out how it works. Thanks a lot.
I'd set the index to you 'Code' column, then reindex by passing in a new array based on your current index, arange accepts a start and stop param (you need to add 1 to the end) and then reset_index this assumes that your 'Code' values are already sorted:
In [21]:
df.set_index('Code', inplace=True)
df = df.reindex(index = np.arange(df.index[0], df.index[-1] + 1)).reset_index()
df
Out[21]:
Code Counts
0 10006 5
1 10007 NaN
2 10008 NaN
3 10009 NaN
4 10010 NaN
5 10011 2
6 10012 26
7 10013 20
8 10014 17
9 10015 2
10 10016 NaN
11 10017 NaN
12 10018 2
13 10019 3

Categories

Resources