Merging from df1 to df2 on same column - expanding dataset out - python

I am making a small mistake and I'm not sure how to merge two DataFrames correctly. I want to merge on IBES_cusip to get gvkey into df1.
I try the following, but it just expands the dataset out and does not match correctly:
df1 = df1.merge(df2, how='left', on=['IBES_cusip'])
df1
IBES_cusip pends pdicity ... ltg_eps futepsgrowth
0 00036110 1983-05-31 ANN ... NaN NaN
1 00036110 1983-05-31 ANN ... NaN NaN
2 00036110 1983-05-31 ANN ... NaN NaN
3 98970110 1983-05-31 ANN ... NaN NaN
4 98970110 1983-05-31 ANN ... NaN NaN
... ... ... ... ... ...
373472 98970111 2018-12-31 ANN ... 10.00 0.381119
373473 98970111 2018-12-31
df2
gvkey IBES_cusip
0 024538 86037010
1 004678 33791510
2 066367 26357810
3 137024 06985P20
4 137024 06985P20
... ...
833796 028955 33975610
833797 061676 17737610
833798 011096 92035510
833799 005774 44448210
833800 008286 69489010

Your main problem is that your df2 contains duplicate values in the IBES_cusip column.
From the sample you gave I can see that
3 137024 06985P20
4 137024 06985P20
are the same values, which would cause the merge to produce unwanted results (duplicate rows in the output).
Try this:
df1 = df1.merge(df2.drop_duplicates(subset=['IBES_cusip']), how='left', on='IBES_cusip')
which should just add a gvkey column to your df1.
This assumes you are sure there are no rows with the same IBES_cusip matched to different gvkey values; otherwise you need to resolve that first.
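To check that first, a small sketch (using the df2 from the question) counts how many distinct gvkey values each IBES_cusip maps to:
# Cusips listed by this check map to more than one gvkey and need manual review
conflicts = df2.groupby('IBES_cusip')['gvkey'].nunique()
print(conflicts[conflicts > 1])
Alternatively, passing validate='m:1' to the merge makes pandas raise an error if the right-hand frame still contains duplicate keys.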

Pandas Timeseries reindex producing NaNs

I am surprised that my reindex is producing NaNs in the whole dataframe when the original dataframe does have numerical values in it. I don't know why.
Code:
df =
A ... D
Unnamed: 0 ...
2022-04-04 11:00:05 NaN ... 2419.0
2022-04-04 11:00:10 NaN ... 2419.0
## exp start and end times
exp_start, exp_end = '2022-04-04 11:00:00', '2022-04-04 13:00:00'
## one second index
onesec_idx = pd.date_range(start=exp_start, end=exp_end, freq='1s')
## map new index to the df
df = df.reindex(onesec_idx)
Result:
df =
A ... D
2022-04-04 11:00:00 NaN ... NaN
2022-04-04 11:00:01 NaN ... NaN
2022-04-04 11:00:02 NaN ... NaN
2022-04-04 11:00:03 NaN ... NaN
2022-04-04 11:00:04 NaN ... NaN
2022-04-04 11:00:05 NaN ... NaN
From the documentation you can see that df.reindex() "places NA/NaN in locations having no value in the previous index".
However, you can also provide the value you want missing entries filled with (it defaults to NaN):
df.reindex(onesec_idx, fill_value='')
If you want to replace the NaN in a particular column, or in the whole dataframe, you can run something like this after the reindex:
df.fillna('', inplace=True)  # replace NaN in the entire df with ''
df['D'].fillna(0, inplace=True)  # replace all NaN in the D column with 0
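One more thing worth checking here: the 11:00:05 row existed before the reindex yet still came back as NaN, which usually means the original index holds strings (e.g. loaded from csv without parse_dates) while onesec_idx holds Timestamps, so no labels match. A minimal sketch of that fix, assuming the timestamps are in the index:
# Convert a string index to real datetimes so the labels can match onesec_idx
df.index = pd.to_datetime(df.index)
df = df.reindex(onesec_idx)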
Sources:
Documentation for reindex: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
Documentation for fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

How to combine two columns from different csv files into one csv file

I have seen many methods like concat, join, and merge, but I am missing the right technique for my simple dataset.
I have two datasets that look like the ones below.
dates.csv
2020-07-06
2020-07-07
2020-07-08
2020-07-09
2020-07-10
.....
...
...
mydata.csv
Expected,Predicted
12990,12797.578628473471
12990,12860.382061836583
12990,12994.159035827917
12890,13019.073929662367
12890,12940.34108357684
.............
.......
.....
I want to combine these two datasets, which have the same number of rows in both csv files. I tried the concat method but I see NaNs:
delete = pd.read_csv('dates.csv', header=None)  # the dates, no header row
data1 = pd.read_csv('mydata.csv')               # Expected/Predicted columns
result = pd.concat([delete, data1], axis=0, ignore_index=True)
print(result)
Output:
0 Expected Predicted
0 2020-07-06 NaN NaN
1 2020-07-07 NaN NaN
2 2020-07-08 NaN NaN
3 2020-07-09 NaN NaN
4 2020-07-10 NaN NaN
.. ... ... ...
307 NaN 10999.0 10526.433098
308 NaN 10999.0 10911.247147
309 NaN 10490.0 11038.685328
310 NaN 10490.0 10628.204624
311 NaN 10490.0 10632.495169
[312 rows x 3 columns]
I don't want all these NaNs.
Thanks for your help!
You could use the .join() method from pandas:
delete = pd.read_csv('dates.csv', header=None)
data1 = pd.read_csv('mydata.csv')
result = delete.join(data1)
If your two dataframes respect the same order, you can use the join method mentioned by Nik; by default it joins on the index.
Otherwise, if you have a key that you can join your dataframes on, you can specify it like this:
joined_data = first_df.join(second_df.set_index(key), on=key)
Here key is a column in first_df that is matched against the index of second_df; since .join() always joins on the other frame's index, the shared key column is set as second_df's index first.
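For completeness, the pd.concat approach from the question also works once the frames are placed side by side rather than stacked, i.e. axis=1 instead of axis=0:
# Column-wise concatenation; rows are aligned by index
result = pd.concat([delete, data1], axis=1)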

Python Pandas totals and dates

I'm sorry for not posting the data, but it wouldn't really help. The thing is I need to make a graph, and I have a csv file full of information organised by date. It has 'Cases', 'Deaths', 'Recoveries', 'Critical', 'Hospitalized', and 'States' as categories. It goes in order by date and has the number of cases, deaths, and recoveries per day for each state. How do I sum these categories to make a graph that shows how the total is increasing? I really have no idea how to start. Below are some numbers that try to explain what I have.
0 2020-02-20 1 Andalucía NaN NaN NaN
1 2020-02-20 2 Aragón NaN NaN NaN
2 2020-02-20 3 Asturias NaN NaN NaN
3 2020-02-20 4 Baleares 1.0 NaN NaN
4 2020-02-20 5 Canarias 1.0 NaN NaN
.. ... ... ... ... ... ...
888 2020-04-06 19 Melilla 92.0 40.0 3.0
889 2020-04-06 14 Murcia 1283.0 500.0 84.0
890 2020-04-06 15 Navarra 3355.0 1488.0 124.0
891 2020-04-06 16 País Vasco 9021.0 4856.0 417.0
892 2020-04-06 17 La Rioja 2846.0 918.0 66.0
It's unclear exactly what you mean by "sum these categories". I'm assuming you mean that for each date, you want to sum the values across all the different regions to come up with the total values for Spain?
In which case, you will want to groupby date, then .sum() the columns (you can drop the States category).
grouped_df = df.groupby("date")[["Cases", "Deaths", ...]].sum()
grouped_df.plot()
This snippet will probably not work directly; you may need to reformat the dates etc., but it should be enough to get you started.
I think you are looking for a groupby followed by a cumsum over the non-date columns.
columns_to_group = ['Cases', 'Deaths',
                    'Recoveries', 'Critical', 'Hospitalized', 'date']
new_columns = ['Cases_sum', 'Deaths_sum',
               'Recoveries_sum', 'Critical_sum', 'Hospitalized_sum']
df_grouped = df[columns_to_group].groupby('date').sum().reset_index()
For plotting, seaborn provides an easy function:
import seaborn as sns
df_melted = df_grouped.melt(id_vars=["date"])
sns.lineplot(data=df_melted, x='date', y='value', hue='variable')
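Since the question asks for a graph of how the totals increase, a running total can be taken after the groupby. A small sketch, assuming the df_grouped from above and that the daily figures are not already cumulative:
# Cumulative totals over time (skip if the source figures are already cumulative)
df_cum = df_grouped.set_index('date').cumsum().reset_index()
df_cum can then be melted and plotted exactly as above.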

Selecting column values of a dataframe which are in a range and putting them in appropriate columns of another dataframe in pandas

I have a csv file which is something like below
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select those column values which are > 20 and < 500, that is in the range (20, 500), and put those values along with the date in another column of a dataframe. The other dataframe looks something like this:
Date percentage_change location
2018-02-14 23.44 BOM
So I want to get the date and value from the csv and add them to the new dataframe in the appropriate columns. Something like:
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
Now I am aware of functions like df.max(axis=1) and df.min(axis=1), which give you the min and max, but I am not sure how to find values based on a range. So how can this be achieved?
Given dataframes df1 and df2, you can achieve this by aligning column names, cleaning the numeric data, and then using pd.DataFrame.append.
import numpy as np

df_app = df1.loc[:, ['date', 'mean', 'min', 'std']]\
            .rename(columns={'date': 'Date'})\
            .replace(np.inf, 0)\
            .fillna(0)
print(df_app)
df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])
print(df_app)
df_app = df_app[df_app['percentage_change'].between(20, 500)]
res = df2.append(df_app.loc[:, ['Date', 'percentage_change']])
print(res)
# Date location percentage_change
# 0 2018-02-14 BOM 23.440000
# 0 2018-03-15 NaN 100.000000
# 1 2018-03-16 NaN 90.000000
# 2 2018-03-17 NaN 143.219177
# 3 2018-03-18 NaN 100.000000
# 4 2018-03-20 NaN 100.000000
# 5 2018-03-22 NaN 119.053837
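One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the last step can be written with pd.concat instead:
# pd.concat replaces the removed DataFrame.append (pandas >= 2.0)
res = pd.concat([df2, df_app.loc[:, ['Date', 'percentage_change']]])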

How should I subtract two dataframes in pandas and display the required output?

My table looks like this:
In [82]:df.head()
Out[82]:
MatDoc MatYr MvT Material Plnt SLoc Batch Customer AmountLC Amount ... PO MatYr.1 MatDoc.1 Order ProfitCtr SLED/BBD PstngDate EntryDate Time Username
0 4912693062 2015 551 100062 HDC2 0001 5G30MC1A11 NaN 9.03 9.06 ... NaN NaN NaN NaN IN1165B085 26.01.2016 01.08.2015 01.08.2015 01:13:16 O33462
1 4912693063 2015 501 166 HDC2 0004 NaN NaN 0.00 0.00 ... NaN NaN NaN NaN IN1165B085 NaN 01.08.2015 01.08.2015 01:13:17 O33462
2 4912693320 2015 551 101343 HDC2 0001 5G28MC1A11 NaN 53.73 53.72 ... NaN NaN NaN NaN IN1165B085 25.01.2016 01.08.2015 01.08.2015 01:16:30 O33462
Here, I need to group the data on the Order column and sum only the AmountLC column. Then I need to check that an Order value is present in both MvT101group and MvT102group, and if an Order matches in both sets of data, I need to subtract MvT102group from MvT101group and display:
Order|Plnt|Material|Batch|Sum101=SumofMvt101ofAmountLC|Sum102=SumofMvt102ofAmountLC|(Sum101-Sum102)/100
What I have done is first make new dataframes containing only 101 and 102: MvT101 and MvT102
MvT101 = df.loc[df['MvT'] == 101]
MvT102 = df.loc[df['MvT'] == 102]
Then I grouped it by Order and got the sum value for the column
MvT101group = MvT101.groupby('Order', sort=True)
In [76]:
MvT101group[['AmountLC']].sum()
Out[76]:
Order AmountLC
1127828 16348566.88
1127829 22237710.38
1127830 29803745.65
1127831 30621381.06
1127832 33926352.51
MvT102group = MvT102.groupby('Order', sort=True)
In [77]:
MvT102group[['AmountLC']].sum()
Out[77]:
Order AmountLC
1127830 53221.70
1127831 651475.13
1127834 67442.16
1127835 2477494.17
1128622 218743.14
After this I am not able to understand how I should write my query.
Please ask me for any further details if you want. Here is the CSV file I am working from: Link
Hope I understood the question correctly. After grouping both groups as you did:
MvT101group = MvT101.groupby('Order', sort=True).sum()
MvT102group = MvT102.groupby('Order', sort=True).sum()
You can update the columns' names for both groups:
MvT101group.columns = MvT101group.columns.map(lambda x: str(x) + '_101')
MvT102group.columns = MvT102group.columns.map(lambda x: str(x) + '_102')
Then merge all 3 tables so that you will have all 3 columns in the main table:
df = df.merge(MvT101group, left_on=['Order'], right_index=True, how='left')
df = df.merge(MvT102group, left_on=['Order'], right_index=True, how='left')
And then you can add the calculated column (after the groupby, the summed amounts live in AmountLC_101 / AmountLC_102; Order itself is the index, so there is no Order_101 column):
df['calc'] = (df['AmountLC_101'] - df['AmountLC_102']) / 100
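If you then want one row per Order in the layout from the question (Order|Plnt|Material|Batch|Sum101|Sum102|calc), a rough sketch, assuming the merged df from above and taking Plnt/Material/Batch from each Order's first row:
# Keep only Orders present in both groups, then reduce to one row per Order
summary = (df.dropna(subset=['AmountLC_101', 'AmountLC_102'])
             .drop_duplicates(subset=['Order'])
             .loc[:, ['Order', 'Plnt', 'Material', 'Batch',
                      'AmountLC_101', 'AmountLC_102', 'calc']])
print(summary)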
