I have the petrole_price dataset and I'm checking whether today's price is greater than yesterday's. Where the condition is true, I want to create a subset.
petrole_price
Country Today Yesterday
0 India 120 117
1 US 90 92
2 UAE 32 31
3 Russia 70 69
4 UK 55 55
While executing the code below I'm getting a KeyError: 'Today'.
petrole_price = petrole_price[petrole_price['Today'] > petrole_price['Yesterday']]
Here is the entire error:
petrole_price = petrole_price[petrole_price['Today'] > petrole_price['Yesterday']]
File "/home/tgphamifm/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3458, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/tgphamifm/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'Today'
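A frequent cause of a KeyError like this is a column label that isn't exactly 'Today' — for example, trailing whitespace picked up from a file header. The whitespace below is an assumed illustration, not something visible in the printout; the point is to inspect the exact labels and normalize them:

```python
import pandas as pd

# Toy frame whose headers carry stray whitespace -- an assumed
# illustration of why ['Today'] can raise a KeyError.
petrole_price = pd.DataFrame(
    {"Country": ["India", "US"], "Today ": [120, 90], " Yesterday": [117, 92]}
)

print(list(petrole_price.columns))  # reveals the exact labels, spaces included

# Strip whitespace from every header; the comparison then works.
petrole_price.columns = petrole_price.columns.str.strip()
higher = petrole_price[petrole_price["Today"] > petrole_price["Yesterday"]]
print(higher)
```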
I have inherited code from a different person and I'm trying to reproduce the dataframes on my local machine. I got a CSV file that I query to pull up the data below:
df1 = pd.read_csv(filepath, engine='python')
df1 = df1.iloc[:, 0:3];
The query returns the following data from the CSV:
Data By Grade Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 Cumulative FY 2018 FY 2019 NaN NaN NaN
1 Grade QL-05 QL-06 QL-07 QL-08
2 Females 0 7 1 0
.. ... ... ... ... ...
61 New Hires % 80.00% 29.29% 43.90% 5.00%
62 Scalability Rate 307 101 206 0
Next I want to pull out the Grade field alone, so I try the query below:
dfg = df1[df1['Grade'].str.contains(#cmpntfirst, na=False)].set_index('Grade')
print(dfg)
This query results in the below error:
Traceback (most recent call last):
File "..\..\test.py", line 21, in <module>
print(df1["Grade"]);
File "C:\Users\xxx\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 'Grade'
Can someone explain this issue, please? I am trying to unwind the code and understand what the line above does. Any help on how to fix this? I'm new to Python and this is my first Python question.
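The printout above suggests why 'Grade' fails: the CSV's real headers are 'Data By Grade', 'Unnamed: 1', …, and the row containing 'Grade', 'QL-05', … is an ordinary data row. One possible fix (a sketch, assuming that row really holds the intended headers, illustrated on a cut-down stand-in frame) is to promote it:

```python
import pandas as pd

# Cut-down stand-in for df1, mirroring the printout: the intended
# headers live in data row 1, not in the column labels.
df1 = pd.DataFrame(
    {
        "Data By Grade": ["Cumulative", "Grade", "Females"],
        "Unnamed: 1": ["FY 2018", "QL-05", 0],
        "Unnamed: 2": ["FY 2019", "QL-06", 7],
    }
)

df1.columns = df1.iloc[1]                       # promote the 'Grade' row to headers
df1 = df1.drop(index=[0, 1]).reset_index(drop=True)  # drop the header-ish rows
print(df1.columns.tolist())  # ['Grade', 'QL-05', 'QL-06']
print(df1["Grade"])          # no more KeyError
```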
Here is my code:
l_names = []
for l in links:
    l_names.append(l.get_text())

df = []
for u in urls:
    req = s.get(u)
    req_soup = BeautifulSoup(req.content, 'lxml')
    req_tables = req_soup.find_all('table', {'class': 'infobox vevent'})
    req_df = pd.read_html(str(req_tables), flavor='bs4', header=0)
    dfr = pd.concat(req_df)
    dfr = dfr.drop(index=0)
    dfr.columns = range(dfr.columns.size)
    dfr[1] = dfr[1].str.replace(r"([A-Z])", r" \1").str.strip().str.replace(' ', ' ')
    dfr = dfr[~dfr[0].isin(remove_list)]
    dfr = dfr.dropna()
    dfr = dfr.reset_index(drop=True)
    dfr.insert(loc=0, column='Title', value='Change')
    df.append(dfr)
Here is some info about l_names and df:
len(l_names)
83
len(df)
83
display(df)
[ Title 0 1
0 Change Genre Melodrama Revenge
1 Change Written by Kwon Soon-won Park Sang-wook
2 Change Directed by Yoon Sung-sik
3 Change Starring Park Si-hoo Jang Hee-jin
4 Change No. of episodes 16
5 Change Running time 60 minutes
6 Change Original network T V Chosun
7 Change Original release January 27 – March 24, 2019,
Title 0 1
0 Change Genre Romance Comedy
1 Change Written by Jung Do-yoon Oh Seon-hyung
2 Change Directed by Lee Jin-seo Lee So-yeon
3 Change Starring Jang Na-ra Choi Daniel Ryu Jin Kim Min-seo
4 Change No. of episodes 20
5 Change Running time Mondays and Tuesdays at 21:55 ( K S T)
6 Change Original network Korean Broadcasting System
7 Change Original release 2 May –5 July 2011,
Title 0 1
0 Change Genre Mystery Thriller Suspense
1 Change Directed by Kim Yong-soo
2 Change Starring Cho Yeo-jeong Kim Min-jun Shin Yoon-joo ...
3 Change No. of episodes 4
4 Change Running time 61-65 minutes
5 Change Original network K B S2
6 Change Original release March 14 – March 22, 2016,
Title 0 1
0 Change Genre Melodrama Comedy Romance
1 Change Written by Yoon Sung-hee
2 Change Directed by Lee Joon-hyung
3 Change Starring Ji Chang-wook Wang Ji-hye Kim Young-kwang P...
4 Change No. of episodes 24
5 Change Running time Wednesdays and Thursdays at 21:20 ( K S T)
6 Change Original network Channel A
7 Change Original release December 21, 2011 – March 8, 2012,
I want to replace 'Change' with TV show names which are stored in l_names.
For this example, only four TV shows will be given but I have 83 in total.
print(l_names)
['Babel', 'Baby Faced Beauty', 'Babysitter', "Bachelor's Vegetable Store"]
But when I try to plug l_names into my for loop as the values, I get an error.
    dfr.insert(loc=0, column='Title', value=l_names)
    df.append(dfr)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [96], in <cell line: 19>()
29 dfr = dfr.dropna()
30 dfr = dfr.reset_index(drop=True)
---> 31 dfr.insert(loc=0, column='Title', value=l_names)
32 df.append(dfr)
File ~/anaconda3/envs/beans/lib/python3.9/site-packages/pandas/core/frame.py:4444, in DataFrame.insert(self, loc, column, value, allow_duplicates)
4441 if not isinstance(loc, int):
4442 raise TypeError("loc must be int")
-> 4444 value = self._sanitize_column(value)
4445 self._mgr.insert(loc, column, value)
File ~/anaconda3/envs/beans/lib/python3.9/site-packages/pandas/core/frame.py:4535, in DataFrame._sanitize_column(self, value)
4532 return _reindex_for_setitem(value, self.index)
4534 if is_list_like(value):
-> 4535 com.require_length_match(value, self.index)
4536 return sanitize_array(value, self.index, copy=True, allow_2d=True)
File ~/anaconda3/envs/beans/lib/python3.9/site-packages/pandas/core/common.py:557, in require_length_match(data, index)
553 """
554 Check the length of data matches the length of the index.
555 """
556 if len(data) != len(index):
--> 557 raise ValueError(
558 "Length of values "
559 f"({len(data)}) "
560 "does not match length of index "
561 f"({len(index)})"
562 )
ValueError: Length of values (83) does not match length of index (8)
I also tried nesting a second for loop inside my loop.
    for x in l_names:
        dfr.insert(loc=0, column='Title', value=x)
        df.append(dfr)
I get this error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [97], in <cell line: 19>()
30 dfr = dfr.reset_index(drop=True)
31 for x in l_names:
---> 32 dfr.insert(loc=0, column='Title', value=x)
33 df.append(dfr)
File ~/anaconda3/envs/beans/lib/python3.9/site-packages/pandas/core/frame.py:4440, in DataFrame.insert(self, loc, column, value, allow_duplicates)
4434 raise ValueError(
4435 "Cannot specify 'allow_duplicates=True' when "
4436 "'self.flags.allows_duplicate_labels' is False."
4437 )
4438 if not allow_duplicates and column in self.columns:
4439 # Should this be a different kind of error??
-> 4440 raise ValueError(f"cannot insert {column}, already exists")
4441 if not isinstance(loc, int):
4442 raise TypeError("loc must be int")
ValueError: cannot insert Title, already exists
I also added allow_duplicates=True, and all that did was make the titles and names repeat over and over again.
I also have tried other methods to add in the title name.
But my lack of skill in using pandas has led me to this dead end.
Thanks again for your help and expertise.
Solution 1: After you create df with the 83 dataframes in it, you can loop over df and update each Title column:
for i, dfr in enumerate(df):
    dfr['Title'] = l_names[i]
Solution 2: Inside your original loop you don't need an extra loop; use enumerate to get the index i, look up the matching title, and insert it:
for i, u in enumerate(urls):
    ...
    dfr.insert(loc=0, column="Title", value=l_names[i])
    df.append(dfr)
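As a self-contained illustration of Solution 1 — the names and the two toy frames here are invented stand-ins for the 83 scraped tables:

```python
import pandas as pd

l_names = ["Babel", "Baby Faced Beauty"]

# Two toy frames standing in for the scraped tables, each with the
# placeholder Title the loop originally inserted.
df = [
    pd.DataFrame({"Title": "Change", 0: ["Genre"], 1: ["Melodrama"]}),
    pd.DataFrame({"Title": "Change", 0: ["Genre"], 1: ["Romance"]}),
]

# Overwrite the placeholder Title in each frame with the matching name.
for i, dfr in enumerate(df):
    dfr["Title"] = l_names[i]

print(df[0]["Title"].iloc[0])  # Babel
print(df[1]["Title"].iloc[0])  # Baby Faced Beauty
```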
I was looking at this answer by Roman Pekar for using apply. I initially copied the code exactly and it worked fine. Then I used it on my df3, which is created from a CSV file, and I got a KeyError. I checked the datatypes; the columns I'm using are int64, so that is okay. I don't have nulls. If I can get this working, I will make the function more complex. How do I get this working?
def fxy(x, y):
    return x * y
df3 = pd.read_csv(path + 'test_data.csv', usecols=[0,1,2])
print(df3.dtypes)
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
Trace back
Traceback (most recent call last):
File "f:\...\my_file.py", line 54, in <module>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\...\apply.py", line 727, in apply
return self.apply_standard()
File "C:\...\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\...\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "f:\...\my_file.py", line 54, in <lambda>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\...\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\...\range.py", line 389, in get_loc
raise KeyError(key)
KeyError: 'Len'
I don't see a way to attach the CSV file. Below is a sample of df3; if I save it with Excel as "CSV (Comma delimited) (*.csv)" I get the same results.
ID  Len  Width
A   170      4
B   362      5
C    12     15
D    42      7
E    15      3
F    46     49
G    71     74
I think you're missing axis=1 on the apply:
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']), axis=1)
But in your case, you can just do:
df3['Area'] = df3['Len'] * df3['Width']
print(df3)
# Output
ID Len Width Area
0 A 170 4 680
1 B 362 5 1810
2 C 12 15 180
3 D 42 7 294
4 E 15 3 45
5 F 46 49 2254
6 G 71 74 5254
I have a dataframe with the following information:
ticker date close gap
0 BHP 1981-07-31 0.945416 -0.199458
1 BHP 1981-08-31 0.919463 -0.235930
2 BHP 1981-09-30 0.760040 -0.434985
3 BHP 1981-10-30 0.711842 -0.509136
4 BHP 1981-11-30 0.778578 -0.428161
.. ... ... ... ...
460 BHP 2019-11-29 38.230000 0.472563
461 BHP 2019-12-31 38.920000 0.463312
462 BHP 2020-01-31 39.400000 0.459691
463 BHP 2020-02-28 33.600000 0.627567
464 BHP 2020-03-31 28.980000 0.784124
I developed the following code to find where the rows are when it crosses 0:
zero_crossings = np.where(np.diff(np.sign(BHP_data['gap'])))[0]
This returns:
array([ 52, 54, 57, 75, 79, 86, 93, 194, 220, 221, 234, 235, 236,
238, 245, 248, 277, 379, 381, 382, 383, 391, 392, 393, 395, 396],
dtype=int64)
I need to be able to do the following:
calculate the number of months between points where 'gap' crosses 0
remove items where the number of months is <12
average the remaining months
However, I don't know how to turn this ndarray into something useful that I can make the calculations from. When I try:
pd.DataFrame(zero_crossings)
I get the following df, which only returns the index:
0
0 52
1 54
2 57
3 75
4 79
5 86
.. ..
Please help...
I just extended your code a bit to get the zero crossings into the original dataframe, as required.
import pandas as pd
import numpy as np
BHP_data = pd.DataFrame({'gap': [-0.199458, 0.472563, 0.463312, 0.493318, -0.509136, 0.534985, 0.784124]})
BHP_data['zero_crossings'] = 0
zero_crossings = np.where(np.diff(np.sign(BHP_data['gap'])))[0]
print(zero_crossings) # [0 3 4]
# Updates the column to 1 based on the 0 crossing
BHP_data.loc[zero_crossings, 'zero_crossings'] = 1
print(BHP_data)
Output
gap zero_crossings
0 -0.199458 1
1 0.472563 0
2 0.463312 0
3 0.493318 1
4 -0.509136 1
5 0.534985 0
6 0.784124 0
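To go on and cover the three bullet points in the question, the crossing indices can be differenced directly: since the rows are monthly, the gap between consecutive indices is the number of months between crossings. A sketch, using a shortened assumed version of the index array from the question:

```python
import numpy as np

# Crossing row indices, as returned by np.where(...) in the question
# (shortened here for illustration).
zero_crossings = np.array([52, 54, 57, 75, 79, 86, 93, 194, 220])

months_between = np.diff(zero_crossings)          # months between crossings
long_gaps = months_between[months_between >= 12]  # drop gaps under 12 months
print(months_between)     # [  2   3  18   4   7   7 101  26]
print(long_gaps.mean())   # average of the remaining gaps
```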
I have a pandas DataFrame containing 5 columns.
['date', 'sensorId', 'readerId', 'rssi']
df_json['time'] = df_json.date.dt.time
I am aiming to find people who have entered a store (rssi > 380). However this would be much more accurate if I could also check every record a sensorId appears in and whether the time in that record is within 5 seconds of the current record.
Data from the dataFrame: (df_json)
date sensorId readerId rssi
0 2017-03-17 09:15:59.453 4000068 76 352
0 2017-03-17 09:20:17.708 4000068 56 374
1 2017-03-17 09:20:42.561 4000068 60 392
0 2017-03-17 09:44:21.728 4000514 76 352
0 2017-03-17 10:32:45.227 4000461 76 332
0 2017-03-17 12:47:06.639 4000046 43 364
0 2017-03-17 12:49:34.438 4000046 62 423
0 2017-03-17 12:52:28.430 4000072 62 430
1 2017-03-17 12:52:32.593 4000072 62 394
0 2017-03-17 12:53:17.708 4000917 76 335
0 2017-03-17 12:54:24.848 4000072 25 402
1 2017-03-17 12:54:35.738 4000072 20 373
I would like to use jezrael's answer of df['date'].diff(). However, I cannot use it successfully; I receive many different errors. The ['date'] column is of dtype datetime64[ns].
How the data is stored above is not useful, for the .diff() to be of any use the data must be stored as below (dfEntered):
Sample Data: dfEntered
date sensorId readerId time rssi
2017-03-17 4000046 43 12:47:06.639000 364
62 12:49:34.438000 423
4000068 56 09:20:17.708000 374
60 09:20:42.561000 392
76 09:15:59.453000 352
4000072 20 12:54:35.738000 373
12:54:42.673000 374
25 12:54:24.848000 402
12:54:39.723000 406
62 12:52:28.430000 430
12:52:32.593000 394
4000236 18 13:28:14.834000 411
I am planning on replacing 'time' with 'date'. Time is of dtype object and I cannot seem to cast it or diff() it. 'date' will be just as useful.
The only way (I have found) of having df_json appear as dfEntered is with:
dfEntered = df_json.groupby(by=[df_json.date.dt.time, 'sensorId', 'readerId', 'date'])
If I do:
dfEntered = df_json.groupby(by=[df_json.date.dt.time, 'sensorId', 'readerId'])['date'].diff()
results in:
File "processData.py", line 61, in <module>
dfEntered = df_json.groupby(by=[df_json.date.dt.date, 'sensorId', 'readerId', 'rssi'])['date'].diff()
File "<string>", line 17, in diff
File "C:\Users\danie\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 614, in wrapper
raise ValueError
ValueError
If I do:
dfEntered = df_json.groupby(by=[df_json.date.dt.date, 'sensorId', 'readerId', 'rssi'])['time'].count()
print(dfEntered['date'])
Results in:
File "processData.py", line 65, in <module>
print(dfEntered['date'])
File "C:\Users\danie\Anaconda2\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\danie\Anaconda2\lib\site-packages\pandas\core\indexes\multi.py", line 821, in get_value
raise e1
KeyError: 'date'
I applied a .count() to the groupby just so that I can output it. I had previously tried a .agg({'date':'diff'}), which results in the ValueError, but the dtype is datetime64[ns] (at least in the original df_json; I cannot view the dtype of dfEntered['date']).
If the above would work I would like to have a df of [df_json.date.dt.date, 'sensorId', 'readerId', 'mask'] mask being true if they entered a store.
I then have the below df (contains sensorIds that received a text)
sensor_id sms_status date_report rssi readerId
0 5990100 SUCCESS 2017-05-03 13:41:28.412800 500 10
1 5990001 SUCCESS 2017-05-03 13:41:28.412800 500 11
2 5990100 SUCCESS 2017-05-03 13:41:30.413000 500 12
3 5990001 SUCCESS 2017-05-03 13:41:31.413100 500 13
4 5990100 SUCCESS 2017-05-03 13:41:34.413400 500 14
5 5990001 SUCCESS 2017-05-03 13:41:35.413500 500 52
6 5990100 SUCCESS 2017-05-03 13:41:38.413800 500 60
7 5990001 SUCCESS 2017-05-03 13:41:39.413900 500 61
I would then like to merge the two together on day, sensorId, readerId.
I am hoping that would result in a df that could appear as [df_json.date.dt.date, 'sensorId', 'readerId', 'mask'] and therefore I could say that a sensorId with a mask of true is a conversion. A conversion being that sensorId received a text that day and also entered the store that day.
I'm beginning to worry that my end aim isn't even achievable, as I simply do not understand how pandas works yet :D (damn errors)
UPDATE
dfEntered = dfEntered.reset_index()
This is allowing me to access the date and apply a diff.
I don't quite understand the theory of how this problem occurred, and why reset_index() fixed this.
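On why reset_index() helped: the groupby keys become a (Multi)Index on the result, so 'date' is no longer an ordinary column until the index is reset. A minimal sketch with invented toy data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sensorId": [4000068, 4000068, 4000072],
        "readerId": [56, 60, 62],
        "rssi": [374, 392, 430],
    }
)

counts = df.groupby(["sensorId", "readerId"])["rssi"].count()
print(counts.index.names)  # the group keys have moved into the index

flat = counts.reset_index()       # keys become ordinary columns again
print(flat["sensorId"].tolist())  # now accessible without a KeyError
```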
I think you need boolean indexing with a mask created with diff:
df = pd.DataFrame({'rssi': [500,530,1020,1201,1231,10],
'time': pd.to_datetime(['2017-01-01 14:01:08','2017-01-01 14:01:14',
'2017-01-01 14:01:17', '2017-01-01 14:01:27',
'2017-01-01 14:01:29', '2017-01-01 14:01:30'])})
print (df)
rssi time
0 500 2017-01-01 14:01:08
1 530 2017-01-01 14:01:14
2 1020 2017-01-01 14:01:17
3 1201 2017-01-01 14:01:27
4 1231 2017-01-01 14:01:29
5 10 2017-01-01 14:01:30
print (df['time'].diff())
0 NaT
1 00:00:06
2 00:00:03
3 00:00:10
4 00:00:02
5 00:00:01
Name: time, dtype: timedelta64[ns]
mask = (df['time'].diff() >'00:00:05') & (df['rssi'] > 380)
print (mask)
0 False
1 True
2 False
3 True
4 False
5 False
dtype: bool
df1 = df[mask]
print (df1)
rssi time
1 530 2017-01-01 14:01:14
3 1201 2017-01-01 14:01:27