Pandas MultiIndex: Divide all columns by one column - python

I have a data frame results of the form
                  TOTEXPPQ      TOTEXPCQ     FINLWT21
year quarter
13   1        9.183392e+09  5.459961e+09  1271559.398
     2        2.907887e+09  1.834126e+09   481169.672
and I was trying to divide all (the first two) columns by the last one. My attempt was
weights = results.pop('FINLWT21')
results/weights
But I get
ValueError: cannot join with no level specified and no overlapping names
Which I don't get: There are overlapping names in the index:
weights.head()
year  quarter
13    1          1271559.398
      2           481169.672
Is there perhaps a better way to do this division? Do I need to reset the index?

You have to specify the axis for the divide (with the div method):
In [11]: results.div(weights, axis=0)
Out[11]:
                 TOTEXPPQ     TOTEXPCQ
year quarter
13   1        7222.149445  4293.909517
     2        6043.371329  3811.807158
The default is axis=1, and the columns of results do not overlap with the index names of weights, hence the error message.
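As a self-contained sketch of the same fix (the frame below is rebuilt from the question's sample, so the values are only illustrative):

import pandas as pd

# Rebuild a small frame resembling the question's data
idx = pd.MultiIndex.from_tuples([(13, 1), (13, 2)], names=["year", "quarter"])
results = pd.DataFrame(
    {"TOTEXPPQ": [9.183392e09, 2.907887e09],
     "TOTEXPCQ": [5.459961e09, 1.834126e09],
     "FINLWT21": [1271559.398, 481169.672]},
    index=idx,
)

weights = results.pop("FINLWT21")    # remove the divisor column
print(results.div(weights, axis=0))  # align on the row index, not the columns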

Related

get before and after days based on 2 dataframes (date time )

I have a data frame in the format (a date range from start="2018-09-09" to end="2020-02-02") with values from 1 to 513.
I have another data frame in the same format (only 3 dates).
Based on the second data frame, I want the 2 dates before and the 1 date after each of them; what I mean is this:
Edited: Corrected answer as per the question
If you do this:
keep = []
for val in df2['value']:
    keep += [val-3, val-2, val-1, val]
df_final = df1.take(keep)
Assumption: your value column always starts from 1 and is sequential. Also, its datatype is integer, not string.
What it does:
The row number (index) of every date = the value of that row - 1, since indices start from 0.
So this keeps only the indices value-3 (2 days before), value-2 (1 day before), value-1 (the day present in df2) and value (1 day after) in the keep list.
Then DataFrame.take(indices) does the work for us: it takes from df1 only the rows at the positions listed in the indices argument, a list.
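A minimal runnable sketch of this approach, assuming the sequential 1-based value column described above (the sample data is invented):

import pandas as pd

# df1: one row per date, value runs 1..N (N=10 here for brevity)
df1 = pd.DataFrame({"date": pd.date_range("2018-09-09", periods=10),
                    "value": range(1, 11)})
# df2: the dates of interest, identified by their value
df2 = pd.DataFrame({"value": [4, 8]})

keep = []
for val in df2["value"]:
    # positions: 2 days before, 1 day before, the day itself, 1 day after
    keep += [val - 3, val - 2, val - 1, val]
df_final = df1.take(keep)
print(df_final)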

To get subset of dataframe based on index of a label

I have a dataframe from yahoo finance
import pandas as pd
import yfinance
ticker = yfinance.Ticker("INFY.NS")
df = ticker.history(period = '1y')
print(df)
This gives me df as,
If I specify,
date = "2021-04-23"
I need a subset of df containing:
the row with index label "2021-04-23"
the rows of 2 days before the date
the row of 1 day after the date
The important thing here is that we cannot compute "before" and "after" from date strings, because df may be missing some dates; the rows to be printed must be selected by index position (i.e. the 2 previous rows and the 1 next row).
For example, in df there is no "2021-04-21", but there is a "2021-04-20".
How can we implement this?
You can go for integer-based indexing. First find the integer location of the desired date and then take the desired subset with iloc:
import numpy as np

def get_subset(df, date):
    # get the integer positions of the matching date(s)
    matching_dates_inds, = np.nonzero(df.index == date)
    # and take the first one (works in case of duplicates)
    first_matching_date_ind = matching_dates_inds[0]
    # take the 4-element subset: 2 rows before, the date itself, 1 row after
    desired_subset = df.iloc[first_matching_date_ind - 2: first_matching_date_ind + 2]
    return desired_subset
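A possible usage sketch (the index and values below are invented, and the function assumes the date has at least two earlier rows):

import pandas as pd

df = pd.DataFrame({"Close": [10, 11, 12, 13, 14]},
                  index=pd.to_datetime(["2021-04-19", "2021-04-20", "2021-04-23",
                                        "2021-04-26", "2021-04-27"]))
print(get_subset(df, "2021-04-23"))  # rows 2021-04-19 through 2021-04-26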
If you need the before and after values by position (assuming the date always exists in the DatetimeIndex), use DataFrame.iloc with the position from Index.get_loc, clamped with min and max so the selection does not fail when there are fewer than 2 rows before or 1 row after, as in this sample data:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]},
                  index=pd.to_datetime(['2021-04-21', '2021-04-23', '2021-04-25']))
date = "2021-04-23"
pos = df.index.get_loc(date)
df = df.iloc[max(0, pos-2):min(len(df), pos+2)]
print(df)
a
2021-04-21 1
2021-04-23 2
2021-04-25 3
Notice:
min and max are added so the selection does not fail when the date is the first row (no 2 values before it), the second row (only 1 value before it), or the last row (no value after it).

Select rows from Dataframe with variable number of conditions

I'm trying to write a function that takes as inputs a DataFrame with a column 'timestamp' and a list of tuples. Every tuple will contain a beginning and end time.
What I want to do is "split" the dataframe into two new ones, where the first contains the rows whose timestamp value is not contained between the extremes of any tuple, and the other is its complement.
The number of filter tuples is not known a priori though.
df = pd.DataFrame({'timestamp': [0, 1, 2, 5, 6, 7, 11, 22, 33, 100], 'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1]})
filt = [(1,4), (10,40)]
left, removed = func(df, filt)
This should give me two dataframes
left: with rows with timestamp [0,5,6,7,100]
removed: with rows with timestamp [1,2,11,22,33]
I believe the right approach is to write a custom function that can be used as a filter, and then call it somehow to filter/mask the dataframe, but I could not find a proper example of how to implement this.
Check
out = df[~pd.concat([df.timestamp.between(*x) for x in filt]).any(level=0)]
Out[175]:
timestamp x
0 0 1
3 5 4
4 6 5
5 7 6
9 100 1
Can't you use filtering with .isin():
left = df[df['timestamp'].isin([0, 5, 6, 7, 100])]
removed = df[df['timestamp'].isin([1, 2, 11, 22, 33])]
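If you want the exact func(df, filt) signature from the question, a hedged sketch (not taken from either answer) is to combine the per-tuple interval masks with NumPy's logical_or.reduce:

import numpy as np
import pandas as pd

def func(df, filt):
    # True where timestamp falls inside any (start, end) interval
    removed_mask = np.logical_or.reduce(
        [df["timestamp"].between(lo, hi) for lo, hi in filt]
    )
    return df[~removed_mask], df[removed_mask]

df = pd.DataFrame({"timestamp": [0, 1, 2, 5, 6, 7, 11, 22, 33, 100],
                   "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1]})
left, removed = func(df, [(1, 4), (10, 40)])
print(left.timestamp.tolist())     # [0, 5, 6, 7, 100]
print(removed.timestamp.tolist())  # [1, 2, 11, 22, 33]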

Finding rows with highest means in dataframe

I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as reference to where the scan starts. I am trying to find the object placed, through out my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby
moy = base.sort_values().tail(1)  # base from df.mean(axis=1) is a Series, so no column name is needed
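Alternatively, since base = df.mean(axis=1) is a Series of row means, idxmax returns the label of the row with the highest mean directly (a minimal sketch with invented data):

import pandas as pd

df = pd.DataFrame({"a": [4.1, 4.9, 4.5], "b": [4.7, 4.6, 4.7]})
base = df.mean(axis=1)        # Series of per-row means
print(base.idxmax())          # index label of the row with the highest mean
print(df.loc[base.idxmax()])  # the row itself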
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)

Pandas/Python - Updating dataframes based on value match

I want to update the mergeAllGB.Intensity column's NaN values with values from another dataframe where SId, Hour and Weekday match. I'm trying:
mergeAllGB.Intensity[mergeAllGB.Intensity.isnull()] = precip_hourly[precip_hourly.SId == mergeAllGB.SId & precip_hourly.Hour == mergeAllGB.Hour & precip_hourly.Weekday == mergeAllGB.Weekday].Intensity
However, this returns ValueError: Series lengths must match to compare. How could I do this?
Minimal example:
Inputs:
_______
mergeAllGB
SId Hour Weekday Intensity
1 12 5 NaN
2 5 6 3
precip_hourly
SId Hour Weekday Intensity
1 12 5 2
Desired output:
________
mergeAllGB
SId Hour Weekday Intensity
1 12 5 2
2 5 6 3
TL;DR this will (hopefully) work:
# Set the index to compare by
df = mergeAllGB.set_index(["SId", "Hour", "Weekday"])
fill_df = precip_hourly.set_index(["SId", "Hour", "Weekday"])
# Fill the nulls with the relevant values of intensity
df["Intensity"] = df.Intensity.fillna(fill_df.Intensity)
# Cancel the special indexes
mergeAllGB = df.reset_index()
Alternatively, the line before the last could be
df.loc[df.Intensity.isnull(), "Intensity"] = fill_df.Intensity
Assignment and comparison in pandas are done by index (which isn't shown in your example).
In the example, running precip_hourly.SId == mergeAllGB.SId results in ValueError: Can only compare identically-labeled Series objects. This is because we try to compare the two columns by value, but precip_hourly doesn't have a row with index 1 (default indexing starts at 0), so the comparison fails.
Even if we assume the comparison succeeded, the assignment stage is problematic.
Pandas tries to assign according to the index - but this doesn't have the intended meaning.
Luckily, we can use this for our own benefit: by setting the index to ["SId", "Hour", "Weekday"], any comparison and assignment will be done relative to this index, so running df.Intensity = fill_df.Intensity will assign to df.Intensity the values in fill_df.Intensity wherever the indexes match, that is, wherever the rows have the same ["SId", "Hour", "Weekday"].
In order to assign only to the places where Intensity is NA, we need to filter first (or use fillna). Note that filtering with df.Intensity[df.Intensity.isnull()] will work, but assignment to it will probably fail if you have several rows with the same (SId, Hour, Weekday) values.
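For reference, a runnable sketch of the fillna approach on the question's minimal example:

import numpy as np
import pandas as pd

mergeAllGB = pd.DataFrame({"SId": [1, 2], "Hour": [12, 5],
                           "Weekday": [5, 6], "Intensity": [np.nan, 3]})
precip_hourly = pd.DataFrame({"SId": [1], "Hour": [12],
                              "Weekday": [5], "Intensity": [2]})

df = mergeAllGB.set_index(["SId", "Hour", "Weekday"])
fill_df = precip_hourly.set_index(["SId", "Hour", "Weekday"])
df["Intensity"] = df.Intensity.fillna(fill_df.Intensity)
mergeAllGB = df.reset_index()
print(mergeAllGB)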
