Currency conversion DataFrame - skip columns - Python

I am retrieving Yahoo stock ticker data and want to convert the given currency to euros. For this purpose I am using the Python Library Currency Converter and the pandas method multiply.
One of the columns, trading volume, shouldn't be "converted" - what's the best way to skip it?
This is what I currently have:
import pandas as pd
import datetime
import pandas_datareader.data as web
from currency_converter import CurrencyConverter

start = datetime.datetime(2017, 1, 1)
end = datetime.datetime(2020, 12, 31)

c = CurrencyConverter()
df = web.DataReader("EXK", 'yahoo', start, end)
df.tail()

# One USD in EUR; multiplying the whole frame also scales Volume, which is wrong
conversion = c.convert(1, 'USD', 'EUR')
eurodf = df.multiply(conversion, axis='rows')
eurodf.tail()
One approach I thought of taking, was to maybe join the "volume" column after multiplication.
Alternatively I could just target that one column and convert it back?

You can use loc to select all columns except one. For example:
   A  B  C
0  0  1  2
1  3  4  5
2  6  7  8
df.loc[:, df.columns.drop('B')] *= 10
Result:
    A  B   C
0   0  1  20
1  30  4  50
2  60  7  80
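Applied to the question's DataFrame, the same pattern skips the trading-volume column; a minimal sketch, assuming the Yahoo data carries its usual 'Volume' column name:
# Multiply every column except 'Volume' by the USD -> EUR rate
eurodf = df.copy()
eurodf.loc[:, eurodf.columns.drop('Volume')] *= conversion
eurodf.tail()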

Related

How to retrieve unnamed columns after a groupby and unstack?

I have a dataset of events with a date column which I need to display in a weekly plot and then process further. After some googling I found pd.Grouper(freq="W"), so I am using that to group the events by week and display them. My problem is that after the groupby and unstack I end up with a data frame where there is an unnamed column that I am unable to refer to except using iloc. This is an issue because in later plots I am grouping by other columns, so I need a way to refer to this column by name, not iloc.
Here's a reproducible example of my dataset:
from datetime import datetime

import pandas as pd
from faker import Faker

fake = Faker()
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 2, 1)
# Generate a data frame of 30 random dates in January 2023
df = pd.DataFrame(
    {"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
     "dummy": [1 for i in range(30)]})  # There's probably a better way of counting than this
grouper = df.set_index("date").groupby([pd.Grouper(freq="W"), 'dummy'])
result = grouper['dummy'].count().unstack('dummy').fillna(0)
The result data frame that I get has weird indexes/columns that I am unable to navigate:
>>> print(result)
dummy       1
date
2023-01-01  1
2023-01-08  3
2023-01-15  4
2023-01-22  9
2023-01-29  8
2023-02-05  5
>>> print(result.columns)
Int64Index([1], dtype='int64', name='dummy')
The only column here is dummy, but even result.dummy raises an AttributeError.
I've also tried result.reset_index():
dummy        date  1
0      2023-01-01  1
1      2023-01-08  3
2      2023-01-15  4
3      2023-01-22  9
4      2023-01-29  8
5      2023-02-05  5
But for this data frame I can only get the date column - the counts column named "1" cannot be accessed using result.reset_index()["1"] as I get an AttributeError
I am completely perplexed by what is going on here, pandas is really powerful but sometimes I find it incredibly unintuitive. I've checked several pages of the docs and checked if there's another index level (there isn't). Can someone who's better at pandas help me out here?
I just want a way to convert the grouped data frame into something like this:
date counts
0 2023-01-01 1
1 2023-01-08 3
2 2023-01-15 4
3 2023-01-22 9
4 2023-01-29 8
5 2023-02-05 5
Where date and counts are columns and there is an unnamed index
You can solve this by simply doing:
from datetime import datetime

import pandas as pd
from faker import Faker

fake = Faker()
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 2, 1)
# Generate a data frame of 30 random dates in January 2023
df = pd.DataFrame(
    {"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
     "dummy": [1 for i in range(30)]})

# Group by week directly via the 'date' column (no set_index/unstack needed)
result = df.groupby([pd.Grouper(freq="W", key='date'), 'dummy'])['dummy'].count()
result = result.reset_index(name='counts')
result = result.drop(['dummy'], axis=1)
which gives
date counts
0 2023-01-01 3
1 2023-01-08 7
2 2023-01-15 5
3 2023-01-22 5
4 2023-01-29 8
5 2023-02-05 2
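As an aside, the gotcha in the original attempt is that after unstack('dummy') the single column label is the integer 1 (the unique value of dummy), not the string "1"; dummy is merely the name of the columns axis. So the counts column can also be reached with an integer key, a small sketch:
# 'result' here means the question's unstacked frame, not the one above
flat = result.reset_index()
flat[1]                                    # works: the column label is the integer 1
flat = flat.rename(columns={1: "counts"})  # give it a usable name
flat.columns.name = None                   # drop the leftover 'dummy' axis name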

Python: Count Unique Value over rolling past 3 days

I have a df that is a time series of user access data
UserID  Access Date
a       10/01/2019
b       10/01/2019
c       10/01/2019
a       10/02/2019
b       10/02/2019
d       10/02/2019
e       10/03/2019
f       10/03/2019
a       10/03/2019
b       10/03/2019
a       10/04/2019
b       10/04/2019
c       10/05/2019
I have another df that lists out the dates, and I want to count the unique UserIDs over the rolling past 3 days. The expected output would look like below:
Date        Past_3_days_unique_count
10/01/2019  NaN
10/02/2019  NaN
10/03/2019  6
10/04/2019  5
10/05/2019  5
How would I be able to achieve this?
It's quite straightforward - let me walk you through it via the following snippet and its comments.
import pandas as pd
import numpy as np

# Generate some dates
dates = pd.date_range("01-01-2016", "01-10-2016", freq="6H")
# Generate some user ids
ids = np.random.randint(1, 5, len(dates))
df = pd.DataFrame({"id": ids, "date": dates})

# Count unique IDs for each day
q = df.groupby(df["date"].dt.to_period("D"))["id"].nunique()
# 3-day rolling sum of the daily unique counts (the current day plus the two before it)
q.rolling(3).sum()
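Note that summing daily nunique values double-counts users who appear on more than one day (in the question's sample, a and b are active almost every day), so the rolling sum can exceed the true distinct count. A minimal sketch that reproduces the question's expected output, assuming the column names 'UserID' and 'Access Date' from the question:
import pandas as pd

df['Access Date'] = pd.to_datetime(df['Access Date'])
dates = df['Access Date'].drop_duplicates().sort_values()

def unique_past_3_days(d):
    # distinct users on day d and the two days before it
    window = df[(df['Access Date'] > d - pd.Timedelta(days=3)) &
                (df['Access Date'] <= d)]
    return window['UserID'].nunique()

out = pd.DataFrame({
    'Date': dates.values,
    'Past_3_days_unique_count': [unique_past_3_days(d) for d in dates],
})
# Mask the first two rows with NaN if a full 3-day window is required,
# as in the expected output.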
Use pandas groupby; the documentation is very good.

Python Dataframe always select middle row

My output dataframe will sometimes have 1, 5, or 10 rows. How do I select exactly the middle row?
My code:
df =
   val
0   10
1   20
2   30
3   40

mid_rw = round(len(df)/2)
print(df.iloc[mid_rw])
But the above does not work if there is only one row. How do I make it work for one row as well?
How about this:
import pandas as pd
df = pd.DataFrame({'val':[10,20,30,40]})
mid_rw = int(len(df)/2)
print(df.iloc[mid_rw])
int() truncates toward zero, which is a floor for non-negative values, so a one-row frame yields index 0.
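For what it's worth, floor division expresses the same rounding directly (a small style note, not part of the original answer):
mid_rw = len(df) // 2  # floor division: 0 for 1 row, 2 for 5 rows, 5 for 10
print(df.iloc[mid_rw])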

How to assign random values from a list to a column in a pandas dataframe?

I am working with Python in BigQuery and have a large dataframe df (circa 7m rows). I also have a list lst that holds some dates (say, all days in a given month).
I am trying to create an additional column "random_day" in df with a random value from lst in each row.
I tried a loop and an apply function, but with such a large dataset it is proving challenging.
My first attempt was the loop solution:
df["rand_day"] = ""
for i in a["row_nr"]:
    rand_day = sample(day_list, 1)[0]
    df.loc[i, "rand_day"] = rand_day
And the apply solution, first defining my function and then calling it:
def random_day():
    rand_day = sample(day_list, 1)[0]
    return rand_day

df["rand_day"] = df.apply(lambda row: random_day(), axis=1)
Any tips on this?
Thank you
Use numpy.random.choice, and if necessary convert the dates with to_datetime:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
})
day_list = pd.to_datetime(['2015-01-02', '2016-05-05', '2015-08-09'])
# alternative
# day_list = pd.DatetimeIndex(['2015-01-02','2016-05-05','2015-08-09'])
df["rand_day"] = np.random.choice(day_list, size=len(df))
print(df)
A B rand_day
0 a 4 2016-05-05
1 b 5 2016-05-05
2 c 4 2015-08-09
3 d 5 2015-01-02
4 e 5 2015-08-09
5 f 4 2015-08-09
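As a side note, newer NumPy code tends to prefer the Generator API over the legacy np.random functions; a minimal sketch, reusing df and day_list from above (the seed is arbitrary, for reproducibility only):
import numpy as np

rng = np.random.default_rng(42)  # seeded Generator, the recommended modern interface
df["rand_day"] = rng.choice(day_list, size=len(df))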

Slicing Multi Index Header DataFrames in Python With Custom Sorting

I'm trying to get a handle on slicing. I've got the following dataframe, df:
   Feeder # 1                          Feeder # 2
   TimeStamp   MW  Month  Day  Hour    TimeStamp   MW  Month  Day  Hour
0        2/3  1.2      1   30    22          2/3  2.4      1   30    22
1        2/4  2.3      1   31    23          2/3  4.1      1   31    23
2        2/5  3.4      2    1     0          2/3  3.7      2    1     0
There are 8 feeders in total.
If I want to select all the MW columns in all the Feeders, I can do:
df.xs('MW', level=1, axis=1, drop_level=False)
If I want Feeders 2 through 4, I can do:
df.loc[:,'Feeder #2':'Feeder #4']
BUT if I want columns MW through Day in just Feeders 2 through 4 via:
df.loc[:,pd.IndexSlice['Feeder #2':'Feeder #4','MW':'Day']]
I get the following error.
MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)
So if I sort the dataframe, then I'm able to do:
df.sort_index(axis=1, level=0).loc[:, pd.IndexSlice['Feeder #2':'Feeder #4', 'Day':'MW']]
But sorting the dataframe destroys the original order of level 1 in the header-- everything gets alphabetized (lexsorted in Python-speak?). And my desired contents get jumbled: 'Day':'MW' yields the Day, Hour and MW columns. But what I want is 'MW':'Day' which would yield the MW, Month, and Day columns.
So my question is: is it possible to slice through my dataframe and preserve the order of the columns? Alternatively, can I lexsort the dataframe, perform the slices I need and then put the dataframe back in its original order?
Thanks in advance.
I think you can use CategoricalIndex to keep the order:
import pandas as pd
import numpy as np

level0 = "Feeder#1 Feeder#2 Feeder#3 Feeder#4".split()
level1 = "TimeStamp MW Month Day Hour".split()

# Ordered categorical levels remember the original column order
idx0 = pd.CategoricalIndex(level0, categories=level0, ordered=True)
idx1 = pd.CategoricalIndex(level1, categories=level1, ordered=True)

columns = pd.MultiIndex.from_product([idx0, idx1])
df = pd.DataFrame(np.random.randint(0, 10, (10, 20)), columns=columns)
Then you can do this:
df.loc[:, pd.IndexSlice["Feeder#2":"Feeder#3", "MW":"Day"]]
Edit:
to convert the levels of an existing frame to ordered categoricals:
columns = df.columns
arrays = []
for i in range(columns.nlevels):
    level = pd.unique(columns.get_level_values(i))
    # ordered=True is the valid keyword (sorted=True is not an argument)
    arrays.append(pd.Categorical(columns.get_level_values(i),
                                 categories=level, ordered=True))
df.columns = pd.MultiIndex.from_arrays(arrays)
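With the levels converted this way, the codes stay monotonic in the original column order, so the ordered slice from above should work without any extra sorting (reusing the names from the answer's example):
df.loc[:, pd.IndexSlice["Feeder#2":"Feeder#3", "MW":"Day"]]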
