Pandas - Operate with series and delays - python

I have the following dataframe in Pandas. My question is how to operate on series where one of them has a time delay. For example, I would like to calculate the result of dividing the GDP of a period by the population of the next period.
GDP Population
1950 3.31 1
1951 3.5 1
...
2000 15.2 2

To do that you can just use:
df['new_col'] = df['GDP'] / df['Population'].shift(-1)

Have you considered using shift?
import pandas as pd
import numpy as np
df = pd.DataFrame({"GDP": np.random.normal(3,1,51),
"Pop":np.random.randint(1,10,51)},
index=np.arange(1950,2001))
df["res"] = df.GDP.shift(1)/df.Pop
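To see how the direction of the shift affects the result, here is a minimal sketch on a tiny made-up frame (the numbers and the column names gdp_over_next_pop and prev_gdp_over_pop are only illustrative). Shifting Population by -1 keeps the ratio on the year of the GDP, while shifting GDP by 1 gives the same ratio labelled with the following year.

```python
import pandas as pd

# Tiny made-up frame (values are illustrative, not real statistics).
df = pd.DataFrame(
    {"GDP": [3.0, 4.0, 6.0], "Population": [1.0, 2.0, 3.0]},
    index=[1950, 1951, 1952],
)

# shift(-1) pulls the next period's population up onto the current row,
# so each row holds GDP[t] / Population[t + 1].
df["gdp_over_next_pop"] = df["GDP"] / df["Population"].shift(-1)

# shift(1) pushes the previous period's GDP down onto the current row,
# so the same ratio ends up labelled with the later year.
df["prev_gdp_over_pop"] = df["GDP"].shift(1) / df["Population"]

print(df)
```

Either way, the row that has no neighbour to pair with (the last year in the first case, the first year in the second) comes out as NaN.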

Related

Issue with read html in pandas

I want to read a table from Wikipedia:
import pandas as pd
caption="Edit section: 2019 inequality-adjusted HDI (IHDI) (2020 report)"
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index',match=caption)
df
But I got this error:
"ValueError: No tables found matching pattern 'Edit section: 2019 inequality-adjusted HDI (IHDI) (2020 report)'"
This method worked for a table like the one below:
caption = "Average daily maximum and minimum temperatures for selected cities in Minnesota"
df = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', match=caption)
df
But I am confused by this one. How can I solve this problem?
You have multiple problems here.
The lxml parser that pandas uses by default may fail to fetch https URLs directly, and the page has no table with the caption you're looking for.
Try this:
import pandas as pd
import requests
caption = "Table of countries by IHDI"
df = pd.read_html(
requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index").text,
match=caption,
)
print(df[0].head())
Output:
Rank Country ... 2019 estimates (2020 report)[4][5][6]
Rank Country ... Overall loss (%) Growth since 2010
0 1 Norway ... 6.1 0.021
1 2 Iceland ... 5.8 0.055
2 3 Switzerland ... 6.9 0.015
3 4 Finland ... 5.3 0.040
4 5 Ireland ... 7.3 0.066
[5 rows x 6 columns]
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index')
df[2]
Or, if you wish to use the match argument:
import pandas as pd
caption="Table of countries by IHDI"
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index',match=caption)
df[0]

Resample a pandas dataframe with multiple variables

I have a dataframe in long format with data on a 15 min interval for several variables. If I apply the resample method to get the average daily value, I get the average values of all variables for a given time interval (and not the average value for speed, distance).
Does anyone know how to resample the dataframe and keep the 2 variables?
Note: The code below contains an EXAMPLE dataframe in long format, my real example loads data from csv and has different time intervals and frequencies for the variables, so I cannot simply resample the dataframe in wide format.
import pandas as pd
import numpy as np
dti = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index = dti)
# Average speed in miles per hour
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
# Distance in miles (speed * 0.25 hours, i.e. one 15 min interval)
df['distance'] = df['speed'] * 0.25
df.reset_index(inplace=True)
df2 = df.melt (id_vars = 'index')
df3 = df2.resample('d', on='index').mean()
IIUC (this uses the wide df from before reset_index, while it still has its DatetimeIndex):
>>> df.groupby(df.index.date).mean()
speed distance
2015-01-01 29.562500 7.390625
2015-01-02 31.885417 7.971354
2015-01-03 30.895833 7.723958
2015-01-04 30.489583 7.622396
2015-01-05 28.500000 7.125000
... ... ...
2015-12-27 28.552083 7.138021
2015-12-28 29.437500 7.359375
2015-12-29 29.479167 7.369792
2015-12-30 28.864583 7.216146
2015-12-31 48.000000 12.000000
[365 rows x 2 columns]
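If the data has to stay in long format, as the question says, one possible approach is to group by the variable name and by calendar day at the same time. This is only a sketch under assumptions: it rebuilds a three-day version of the example, relies on the default melt column names (variable, value), and the names wide, long_df and daily are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Rebuild a small long-format frame like df2 in the question (3 days of data).
dti = pd.date_range("2015-01-01", periods=3 * 96, freq="15min")
wide = pd.DataFrame({"speed": np.random.randint(0, 60, size=len(dti))}, index=dti)
wide["distance"] = wide["speed"] * 0.25
long_df = wide.reset_index().melt(id_vars="index")

# Group by variable and by calendar day, then average the values.
# pd.Grouper bins the timestamp column much like resample('D') would.
daily = (
    long_df.groupby(["variable", pd.Grouper(key="index", freq="D")])["value"]
    .mean()
    .unstack("variable")
)
print(daily)
```

Because each variable is grouped separately, this should also cope with variables that were recorded at different intervals, since no wide-format alignment is needed before averaging.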

How could I transform the numpy array to pandas dataframe?

I am new to data analysis with Python. I wonder how I can transform the format of the left table into that of the right one. My initial thought is to create a nested for loop.
The desired table
First, I read the required csv file.
Imported csv
Then, I count the number of countries in the column 'country' and the number of names in the list of new columns.
`countries = len(test['country'])`
`columns = len(['Year', 'Values'])`
After that, I should write the nested for loop, but I have no idea how to write the code. What I have come up with is as follows:
`for i in countries:`
`for j in columns:`
You can use df.melt here:
In [3575]: df = pd.DataFrame({'country':['Afghanistan', 'Albania'], '1970':[1.36, 6.1], '1971':[1.39, 6.22], '1972':[1.43, 6.34]})
In [3576]: df
Out[3576]:
country 1970 1971 1972
0 Afghanistan 1.36 1.39 1.43
1 Albania 6.10 6.22 6.34
In [3609]: df = df.melt('country', var_name='Year', value_name='Values').sort_values('country')
In [3610]: df
Out[3610]:
country Year Values
0 Afghanistan 1970 1.36
2 Afghanistan 1971 1.39
4 Afghanistan 1972 1.43
1 Albania 1970 6.10
3 Albania 1971 6.22
5 Albania 1972 6.34
Not sure what you want to do, but:
If you want to transform a column into a numpy array, you can use the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"foo": [1,2,3], "bar": [10,20,30]})
print(df)
foo_array = np.array(df["foo"])
print(foo_array)
and then iterate over foo_array.
You can also loop over your data frame using:
for row in df.iterrows():
    print(row)
But this is not recommended, since a built-in pandas function can often do the same job.
Iterating over the data frame directly yields only the column names:
for d in df:
    print(d)
# output:
# foo
# bar

Select a time range in DataFrame without date

I'm using/learning Pandas to load a csv style dataset where I have a time column that can be used as index. The data is sampled roughly at 100Hz. Here is a simplified snippet of the data:
Time (sec) Col_A Col_B Col_C
0.0100 14.175 -29.97 -22.68
0.0200 13.905 -29.835 -22.68
0.0300 12.257 -29.32 -22.67
... ...
1259.98 -0.405 2.205 3.825
1259.99 -0.495 2.115 3.735
There are 20 min of data, resulting in about 120,000 rows at 100 Hz. My goal is to select those rows within a certain time range, say 100-200 sec.
Here is what I've figured out
import pandas as pd
df = pd.DataFrame(my_data) # my_data is a numpy array
df.set_index(0, inplace=True)
df.columns = ['Col_A', 'Col_B', 'Col_C']
df.index = pd.to_datetime(df.index, unit='s', origin='1900-1-1') # the date in origin is just a space-holder
My dataset doesn't include the date. How to avoid setting a fake date like I did above? It feels wrong, and also is quite annoying when I plot the data against time.
I know there are ways to remove the date from the datetime object, like here.
But my goal is to select some rows that are in a certain time range, which means I need to use pd.date_range(). This function does not seem to work without date.
It's not the end of the world if I just use a fake date throughout my project. But I'd like to know if there are more elegant ways around it.
I don't see why you need to use datetime64 objects for this. Your time column is a number, so you can very easily select time intervals with inequalities. You can also plot the columns without issue.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Time': np.arange(0,1200,0.01),
'Col_A': np.random.randint(1,100,120000),
'Col_B': np.random.randint(1,10,120000)})
Select Data between 100 and 200 seconds.
df[df.Time.between(100,200)]
Outputs:
Time Col_A Col_B
10000 100.00 75 9
10001 100.01 23 7
...
19999 199.99 39 7
20000 200.00 25 2
Plotting against time
#First 100 rows just for illustration
df[0:100].plot(x='Time')
Convert to timedelta64
If you really wanted to, you could convert the column to a timedelta64[ns]
df['Time'] = pd.to_datetime(df.Time, unit='s') - pd.to_datetime('1970-01-01')
print(df.head())
# Time Col_A Col_B
#0 00:00:00 67 6
#1 00:00:00.010000 93 1
#2 00:00:00.020000 99 3
#3 00:00:00.030000 18 2
#4 00:00:00.040000 84 3
df.dtypes
#Time timedelta64[ns]
#Col_A int32
#Col_B int32
#dtype: object
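As a variation on the conversion above, pd.to_timedelta can turn the seconds into timedeltas directly, without subtracting a reference date. The sketch below assumes a short synthetic 100 Hz signal rather than the real data, and uses the result as the index so that a time range can be selected with a label slice.

```python
import numpy as np
import pandas as pd

# Synthetic 100 Hz signal, 20 s long (values and column names are illustrative).
df = pd.DataFrame({"Time": np.arange(2000) / 100.0, "Col_A": np.arange(2000)})

# Convert the seconds straight to a TimedeltaIndex; no placeholder date involved.
df.index = pd.to_timedelta(df["Time"].to_numpy(), unit="s")

# Label-based slicing on the sorted index includes both endpoints.
subset = df.loc[pd.Timedelta(seconds=5):pd.Timedelta(seconds=10)]
print(subset.head())
```

For plain selection the between approach shown earlier is simpler; a timedelta index mainly helps if you later want time-aware operations such as resampling.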

How to find out about what N the resample function in pandas did its job?

I use the Python module pandas and its resample function to calculate means of a dataset. I wonder how I can find out the N (number of observations) that the resampled value for each day or month is based on.
In the example below I calculate means for the three months January, February and March.
The answer to my question in that case is: N for January = 31, N for February = 29, N for March = 31. Is there a way to get that information about N for more complex data?
import pandas as pd
import numpy as np
#create dates as index
dates = pd.date_range('1/1/2000', periods=91)
index = pd.Index(dates, name = 'dates')
#create DataFrame df
df = pd.DataFrame(np.random.randn(91, 1), index, columns=['A'])
print(df['A'])
#calculate monthly_mean
monthly_mean = df.resample('M', how='mean')
Thanks in advance.
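You could use how='count', IIUC (the how argument belongs to older pandas releases; current versions spell this df.resample('M').count()):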
>>> df.resample('M', how='count')
2000-01-31 A 31
2000-02-29 A 29
2000-03-31 A 31
dtype: int64
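On current pandas, where resample no longer accepts how, a rough equivalent is to aggregate on the resampler and ask for the mean and the count together. This sketch uses the "MS" (month start) alias, which is my choice rather than the original 'M', because the spelling of the month-end alias has changed between releases; the name monthly is illustrative.

```python
import numpy as np
import pandas as pd

# Same setup as the question: 91 daily values starting on 2000-01-01.
dates = pd.date_range("2000-01-01", periods=91, name="dates")
df = pd.DataFrame({"A": np.random.randn(91)}, index=dates)

# "MS" bins by calendar month and labels each bin with the month start.
# agg returns one column per statistic, so N sits next to the mean.
monthly = df["A"].resample("MS").agg(["mean", "count"])
print(monthly)
```

Note that count only counts non-missing values, so it reports the N that the mean was actually computed from.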
