'expected string or buffer' when using re.match with pandas - python

I am trying to clean some data from a csv file. I need to make sure that whatever is in the 'Duration' category matches a certain format. This is how I went about that:
import re
import pandas as pd
data_path = './ufos.csv'
ufos = pd.read_csv(data_path)
valid_duration = re.compile('^[0-9]+ (seconds|minutes|hours|days)$')
ufos_clean = ufos[valid_duration.match(ufos.Duration)]
ufos_clean.head()
This gives me the following error:
TypeError                                 Traceback (most recent call last)
<ipython-input-4-5ebeaec39a83> in <module>()
6
7 valid_duration = re.compile('^[0-9]+ (seconds|minutes|hours|days)$')
----> 8 ufos_clean = ufos[valid_duration.match(ufos.Duration)]
9
10 ufos_clean.head()
TypeError: expected string or buffer
I used a similar method to clean data before without the regular expressions. What am I doing wrong?
Edit:
MaxU got me the closest, but what ended up working was:
valid_duration_RE = '^[0-9]+ (seconds|minutes|hours|days)$'
ufos_clean = ufos
ufos_clean = ufos_clean[ufos.Duration.str.contains(valid_duration_RE)]
There's probably a lot of redundancy in there, I'm pretty new to python, but it worked.

re.match() expects a single string, but ufos.Duration is a whole Series; hence the TypeError. You can use the vectorized .str.match() method instead:
valid_duration_RE = '^[0-9]+ (seconds|minutes|hours|days)$'
ufos_clean = ufos[ufos.Duration.str.match(valid_duration_RE)]

I guess you want it the other way round (not tested):
import re
import pandas as pd
data_path = './ufos.csv'
ufos = pd.read_csv(data_path)
def cleanit(val):
    # your regex solution here
    pass
ufos['ufos_clean'] = ufos['Duration'].apply(cleanit)
After all, ufos is a DataFrame.
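A concrete sketch of that apply idea, using a tiny made-up DataFrame (the column name Duration comes from the question; the sample rows here are invented):

```python
import re
import pandas as pd

valid_duration = re.compile(r'^[0-9]+ (seconds|minutes|hours|days)$')

# Made-up sample data standing in for ufos.csv
ufos = pd.DataFrame({'Duration': ['5 minutes', 'about an hour', '30 seconds']})

# re.match works on one string at a time, so apply it element-wise
mask = ufos['Duration'].apply(lambda s: bool(valid_duration.match(s)))
ufos_clean = ufos[mask]
print(ufos_clean)
```

This keeps only the rows whose Duration matches the pattern, which is what the boolean-indexing answers above do in vectorized form.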

Related

Python - input array has wrong dimensions

I'm an absolute beginner when it comes to coding, and recently I discovered talib.
I've been trying to calculate an RSI, but I encountered an error. I've been searching the internet for a solution like I usually do, but without success. I'm guessing my data has the wrong datatype for the talib.RSI function, but that's about as far as my knowledge goes.
It would be great if someone could come up with a solution and expand on it a little bit so I might be able to learn something along the way :-)
Many thanks in advance,
Mattie
import pandas as pd
import talib
import numpy as np
data = pd.read_excel (r'name.xlsx')
df = pd.DataFrame(data, columns = ['close'])
RSI_PERIOD = 14
close_prices = pd.DataFrame(df, columns = ['close'])
np_close_prices = np.array(close_prices)
print(np_close_prices)
rsi = talib.RSI(np_close_prices, RSI_PERIOD)
print(rsi)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
in
     12 print(np_close_prices)
     13
---> 14 rsi = talib.RSI(np_close_prices, RSI_PERIOD)
     15 print(rsi)
~\anaconda3\lib\site-packages\talib\__init__.py in wrapper(*args, **kwargs)
     25
     26     if index is None:
---> 27         return func(*args, **kwargs)
     28
     29     # Use Series' float64 values if pandas, else use values as passed
talib\_func.pxi in talib._ta_lib.RSI()
talib\_func.pxi in talib._ta_lib.check_array()
Exception: input array has wrong dimensions
@kcw78 thanks for your reply.
I searched the internet some more before I saw your reply and managed to find an answer. I have no clue what lambda is or what it does yet, but hopefully one day I'll find out and understand how this fixes the problem :)
import pandas as pd
import talib
import numpy as np
RSI_PERIOD = 14
data = pd.read_excel (r'name.xlsx')
df = pd.DataFrame(data, columns = ['close'])
rsi = df.apply(lambda x: talib.RSI(x, RSI_PERIOD))
rsi.columns = ['RSI']
print(rsi)
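The underlying issue is dimensionality: wrapping a one-column DataFrame in np.array() produces a 2-D array of shape (n, 1), while talib.RSI expects a 1-D array. A small sketch of the difference, with made-up prices (talib itself is not needed to see the shapes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'close': [10.0, 11.0, 10.5, 12.0]})

two_d = np.array(df[['close']])   # shape (4, 1) -> "input array has wrong dimensions"
one_d = df['close'].to_numpy()    # shape (4,)   -> the 1-D shape talib.RSI accepts
also_one_d = two_d.flatten()      # flattening the 2-D array works too

print(two_d.shape, one_d.shape)
```

The df.apply(lambda ...) answer works for the same reason: apply hands each column to the lambda as a 1-D Series.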

Why is this error occurring when I am using filter in pandas: TypeError: 'int' object is not iterable

When I want to remove some elements which satisfy a particular condition, Python throws the following error:
TypeError                                 Traceback (most recent call last)
<ipython-input-25-93addf38c9f9> in <module>()
      4
      5 df = pd.read_csv('fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv');
----> 6 df = filter(df,~('-02-29' in df['Date']))
      7 '''tmax = []; tmin = []
      8 for dates in df['Date']:
TypeError: 'int' object is not iterable
The following is the code :
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv');
df = filter(df,~('-02-29' in df['Date']))
What wrong could I be doing?
Following is sample data
Sample Data
Use df.filter() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html)
Also please attach the csv so we can run it locally.
Another way to do this is to use one of pandas' string methods for Boolean indexing:
df = df[~ df['Date'].str.contains('-02-29')]
You will still have to make sure that all the dates are actually strings first.
Edit:
Seeing the picture of your data, maybe this is what you want (slashes instead of hyphens):
df = df[~ df['Date'].str.contains('/02/29')]
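For context on the original error: 'in' applied to a Series tests the index, not the values, so '-02-29' in df['Date'] yields a plain bool; ~ turns that bool into an int, and the built-in filter() then fails trying to iterate it. A small sketch with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2008-02-28', '2008-02-29', '2008-03-01']})

# `in` on a Series checks the index labels, not the values:
print('-02-29' in df['Date'])  # a single bool, not an element-wise test

# Boolean indexing with a string method does the element-wise test:
clean = df[~df['Date'].astype(str).str.contains('-02-29')]
print(clean)
```

The .astype(str) is the "make sure the dates are actually strings" step mentioned above.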

Explain Function Mistake

I managed to write my first function; however, I do not understand it :-)
I approached my real problem with a simplified one. See the following code:
import pandas as pd
import matplotlib as plt
import numpy as np
from pyXSteam.XSteam import XSteam
steamTable = XSteam(XSteam.UNIT_SYSTEM_MKS)
T1_T_in = [398,397,395]
T1_p_in = [29,29,29]
T1_mPkt_in = [2.2,3,3.5]
def Power(druck,temp,menge):
    H = []
    Q = []
    for i in range(len(druck)):
        H.append(steamTable.h_pt(druck[i],temp[i]))
        Q.append(H[i]*menge[i])
    return Q
t1Q=Power(T1_p_in,T1_T_in,T1_mPkt_in)
t3Q = Power(T3_p_in,T3_T_in,T3_mPkt_in)
print(t1Q)
print(t3Q)
It works. The real problem differs in that I read the data from an Excel file. I got an error message and (according to my learnings from this good homepage :-)) I added ".tolist()" in the function, and now it works. I do not understand why I need to change it to a list. Can anybody explain it to me? Thank you for your help.
import pandas as pd
import matplotlib as plt
import numpy as np
from pyXSteam.XSteam import XSteam
steamTable = XSteam(XSteam.UNIT_SYSTEM_MKS)
pfad="XXX.xlsx"
df = pd.read_excel(pfad)
T1T_in = df.iloc[2:746,1]
T1p_in = df.iloc[2:746,2]
T1mPkt_in = df.iloc[2:746,3]
def Power(druck,temp,menge):
    H = []
    Q = []
    for i in range(len(druck)):
        H.append(steamTable.h_pt(druck.tolist()[i],temp.tolist()[i]))
        Q.append(H[i]*menge.tolist()[i])
    return Q
t1Q=Power(T1p_in,T1T_in,T1mPkt_in)
t1Q[0:10]
The reason your first example works is because you are passing the T1_mPkt_in variable into the menge parameter as a list:
T1_mPkt_in = [2.2,3,3.5]
Your second example is not working because you pass the T1_mPkt_in variable into the menge parameter as a series and not a list:
T1mPkt_in = df.iloc[2:746,3]
If you print out the type of T1mPkt_in, you will get:
<class 'pandas.core.series.Series'>
A Series sliced with df.iloc[2:746, 3] keeps its original index (here starting at 2), so menge[0] looks up the label 0 and fails. Calling .tolist() converts the Series to a plain list, whose [i] indexing is positional again.
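A minimal sketch of why plain [i] fails on such a slice (the index values here are made up to mimic df.iloc[2:746, ...]):

```python
import pandas as pd

druck = pd.Series([29, 29, 29], index=[2, 3, 4])  # like a slice df.iloc[2:746, 2]

# druck[0] would raise a KeyError: there is no label 0 in the index.
# Positional access needs .iloc, or a conversion to a plain list:
first_by_position = druck.iloc[0]
first_from_list = druck.tolist()[0]
print(first_by_position, first_from_list)
```

Using druck.iloc[i] inside the loop would avoid the repeated .tolist() conversions entirely.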

Python: json normalize "String indices must be integers" error

I am getting a TypeError, "string indices must be integers", from the following code.
import pandas as pd
import json
from pandas.io.json import json_normalize
full_json_df = pd.read_json('data/world_bank_projects.json')
json_nor = json_normalize(full_json_df, 'mjtheme_namecode')
json_nor.groupby('name')['code'].count().sort_values(ascending=False).head(10)
Output:
TypeError
Traceback (most recent call last)
<ipython-input-28-9401e8bf5427> in <module>()
1 # Find the top 10 major project themes (using column 'mjtheme_namecode')
2
----> 3 json_nor = json_normalize(full_json_df, 'mjtheme_namecode')
4 #json_nor.groupby('name')['code'].count().sort_values(ascending = False).head(10)
TypeError: string indices must be integers
According to pandas documentation, for data argument of the method json_normalize :
data : dict or list of dicts Unserialized JSON objects
Above, pd.read_json returns a DataFrame, not a dict.
So you can try converting the DataFrame to a dictionary using .to_dict(). There are various options for to_dict() as well.
Maybe something like below:
json_normalize(full_json_df.to_dict(), ......)
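A sketch of the underlying idea with made-up records in the same shape as the question's data (the real world_bank_projects.json is not reproduced here): json_normalize wants a dict or list of dicts, not a DataFrame.

```python
import pandas as pd

# Made-up records mimicking the nested 'mjtheme_namecode' structure
records = [
    {"id": "P1", "mjtheme_namecode": [{"code": "8", "name": "Human development"}]},
    {"id": "P2", "mjtheme_namecode": [{"code": "1", "name": "Economic management"},
                                      {"code": "8", "name": "Human development"}]},
]

# Flatten the nested lists of dicts into one row per theme entry
flat = pd.json_normalize(records, 'mjtheme_namecode')
top = flat.groupby('name')['code'].count().sort_values(ascending=False)
print(top)
```

Loading the raw file with json.load() (instead of pd.read_json) produces exactly such a list of dicts, which is another way to avoid the error.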

Normalize columns in a numpy array- results in typeerror

I want to do a simple normalization of the data in a numpy ndarray: specifically (X - mu) / sigma. I tried the exact code I found in earlier questions but kept getting TypeError: cannot perform reduce with flexible type. I gave up and tried a simpler normalization, (X - min) / X.ptp(), and got the same error.
import csv
import numpy as np
from numpy import *
import urllib.request
#Import comma separated data from git.hub
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
urllib.request.urlretrieve(url,'F:/Python/Wine Dataset/wine_data')
#open file
filename = 'F:/Python/Wine Dataset/wine_data';
raw_data = open(filename,'rt');
#Put raw_data into a numpy.ndarray
reader = csv.reader(raw_data);
x = list(reader);
data = np.array(x)
#First column is classification, other columns are features
y= data[:,0];
X_raw = data[:,1:13];
# Attempt at normalizing data- really wanted X-mu/sigma gave up
# even this simplified version doesn't work
# latest error is TypeError cannot perform reduce with flexible type?????
X = (X_raw - X_raw.min(0)) / X_raw.ptp(0);
print(X);
Finally figured it out. The line data = np.array(x) returned an array containing string data.
was:
data = np.array(x)
changed to:
data = np.array(x).astype(float)
(The original fix used astype(np.float), but np.float has since been removed from NumPy; plain float does the same thing.)
After that everything worked. A simple issue cost me hours.
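A compact sketch of the failure and the fix, using a tiny made-up array instead of the wine data:

```python
import numpy as np

# csv.reader yields strings, so the array gets a string ("flexible") dtype
data = np.array([['1.0', '2.0'], ['3.0', '4.0']])
print(data.dtype)  # a unicode dtype such as <U3

# data.min(0) would raise: cannot perform reduce with flexible type.
# Converting to float first makes the reductions work:
X_raw = data.astype(float)
X = (X_raw - X_raw.min(0)) / np.ptp(X_raw, axis=0)
print(X)
```

The "flexible type" in the error message refers to string/bytes dtypes, over which NumPy cannot compute min, mean, or similar reductions.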
