Regex to remove specific characters - Python

I'm currently trying to get into regular expressions - at the moment I want to write one which acts as follows:
import re
a = '[[0.1 0.1 0.1 0.1]\n [1.2 1.2 1.2 1.2]\n [2.3 2.3 2.3 2.3]\n [3.4 3.4 3.4 3.4]]'
a_transformed = re.sub(regex_expression, '', a)
# a_transformed = '0.1 0.1 0.1 0.1 1.2 1.2 1.2 1.2 2.3 2.3 2.3 2.3 3.4 3.4 3.4 3.4'
Basically I only need to sub out all occurrences of \n, [ and ], but I'm struggling to get the expression right.
Thanks for the help in advance!

You can try the following:
>>> re.sub(r'[^\d. ]', '', a)
'0.1 0.1 0.1 0.1 1.2 1.2 1.2 1.2 2.3 2.3 2.3 2.3 3.4 3.4 3.4 3.4'
Here '[^\d. ]' matches anything except a digit, a '.' or a space. The ^ at the start of a character class [] negates the group.
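For reference, a small runnable version of the above; the second pattern is not from the answer, it just strips exactly the characters the question lists (\n, [ and ]), which happens to give the same result for this string:
import re

a = '[[0.1 0.1 0.1 0.1]\n [1.2 1.2 1.2 1.2]\n [2.3 2.3 2.3 2.3]\n [3.4 3.4 3.4 3.4]]'

# Keep only digits, dots and spaces (the approach above).
print(re.sub(r'[^\d. ]', '', a))

# Alternative: remove only newlines and square brackets.
print(re.sub(r'[\n\[\]]', '', a))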

Related

Python: multiplication with RollingGroupby

I have a dataframe as follows:
id        return1  return2
weekday1  0.1      0.2
weekday1  0.2      0.4
weekday1  0.3      0.5
weekday2  0.4      0.7
weekday2  0.5      0.6
weekday2  0.6      0.1
I know how to do the rolling-groupby-sum, which is
(df.groupby(df.index.dayofweek)  # originally the index is a time series
   .rolling(52).sum()
   .droplevel(level=0).sort_index())
Now I need to add 1 to all the elements first and then multiply those in the same group as follows.
Step 1 - add 1:
id        return1  return2
weekday1  1.1      1.2
weekday1  1.2      1.4
weekday1  1.3      1.5
weekday2  1.4      1.7
weekday2  1.5      1.6
weekday2  1.6      1.1
Step 2 - multiply by group:
id        return1      return2
weekday1  1.1×1.2×1.3  1.2×1.4×1.5
weekday2  1.4×1.5×1.6  1.7×1.6×1.1
I used the following code:
(df.transform(lambda x: x + 1).groupby(df.index.dayofweek)
   .rolling(52).mul()
   .droplevel(level=0).sort_index())
but it gives an AttributeError: 'RollingGroupby' object has no attribute 'mul'.
cumprod() doesn't work either. Perhaps it has something to do with the rolling part, since there is no such thing as rolling.cumprod() or rolling.mul().
Is there a better way to do the multiplication within a group together with the rolling part?
Use numpy.prod in Rolling.apply:
df.add(1).groupby(df.index.dayofweek).rolling(52).apply(np.prod)
Btw, from the expected output it seems you need GroupBy.prod:
df.add(1).groupby(df.index.dayofweek).prod()
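Not part of the original answer - just a self-contained sketch of both suggestions on a toy frame whose DatetimeIndex holds three Mondays and three Tuesdays, so the dayofweek groups mirror the weekday1/weekday2 example (the rolling window is shortened to 2 so the toy data yields non-NaN values; the question uses rolling(52)):
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2021-01-04', '2021-01-11', '2021-01-18',   # Mondays
                      '2021-01-05', '2021-01-12', '2021-01-19'])  # Tuesdays
df = pd.DataFrame({'return1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                   'return2': [0.2, 0.4, 0.5, 0.7, 0.6, 0.1]}, index=idx)

# Per-weekday product of (1 + return): Step 1 and Step 2 in one go.
print(df.add(1).groupby(df.index.dayofweek).prod())

# Rolling variant with numpy.prod inside each weekday group.
print(df.add(1).groupby(df.index.dayofweek)
        .rolling(2).apply(np.prod)
        .droplevel(level=0).sort_index())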

Pandas: df["<tab> finds a completion, df.loc["<tab> WON'T find any completion. Is this normal?

I read somewhere that the preferred way of accessing DataFrame columns is through the .loc method, but I found some drawbacks and I am wondering whether this is normal.
Say that I do the following:
import pandas as pd
df = pd.read_csv("MyFile.csv")
and assume that the dataframe df looks like this:
ColA ColB ColC
Time
0.0 9.2 -3.5 2.0
0.1 10.2 -0.9 1.1
0.2 4.3 2.1 4.2
If I type df[" and then hit TAB, the autocompletion kicks in and I can choose the column name from a pop-up list, whereas if I type df.loc[" and then hit TAB, nothing happens. I am wondering whether this is normal behavior.
Also, it seems that if the column names are tuples, e.g.
('ColA','X') ('ColB','Y') ('ColC','Z')
Time
0.0 9.2 -3.5 2.0
0.1 10.2 -0.9 1.1
0.2 4.3 2.1 4.2
then I can access them with e.g. df[('ColA','X')] but I cannot with df.loc[('ColA','X')].
I am running IPython 7.2.2 (console) on a Windows 10 machine, if that helps.
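There is no accepted answer here, but for the second part, a small sketch (the data below is made up to mirror the frame above) showing the tuple-column behavior: .loc indexes rows first, so selecting a column through it needs an explicit row slice.
import pandas as pd

# Made-up stand-in for MyFile.csv with the tuple-style column labels from the question.
cols = pd.MultiIndex.from_tuples([('ColA', 'X'), ('ColB', 'Y'), ('ColC', 'Z')])
df = pd.DataFrame([[9.2, -3.5, 2.0],
                   [10.2, -0.9, 1.1],
                   [4.3, 2.1, 4.2]],
                  index=pd.Index([0.0, 0.1, 0.2], name='Time'),
                  columns=cols)

print(df[('ColA', 'X')])         # plain [] looks the tuple up in the columns: works
print(df.loc[:, ('ColA', 'X')])  # .loc works too, but only with an explicit row slice
# df.loc[('ColA', 'X')]          # raises an error: 'ColA' is looked up in the row index here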

How to generate a list within a list delimited by a space

How do I replicate the structure of the result of itertools.product?
As you know, itertools.product gives us an object and we need to put it in a list so we can print it,
something like this, right?
import itertools
import numpy as np
CN = np.asarray(list(itertools.product([0, 1], repeat=5)))
print(CN)
I want to be able to make something like that, but I want the data to come from a CSV file, so I want to make something like this:
# PSEUDOCODE
import pandas as pd
df = pd.read_csv('csv here')
# a, b, c, d are the columns that I want to get
x = list(df['a'], df['b'], df['c'], df['d'])
print(x)
So the result will be something like this:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
How can I do that?
EDIT:
I am trying to learn how to do recursive feature elimination, and I saw in some code on Google that they use the iris data set:
from sklearn import datasets
dataset = datasets.load_iris()
x = dataset.data
print(x)
and when printed it looks something like this:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
How could I make my dataset look like that, so I can use this RFE template?
# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load the iris datasets
dataset = datasets.load_iris()
# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, 3)
print(rfe)
rfe = rfe.fit(dataset.data, dataset.target)
print("features:",dataset.data)
print("target:",dataset.target)
print(rfe)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
You don't have to. If you want to use the rfe.fit function, you need to feed the features and the target separately.
So if your df is like:
a b c d target
0 5.1 3.5 1.4 0.2 1
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 1
5 5.4 3.9 1.7 0.4 1
6 4.6 3.4 1.4 0.3 0
7 5.0 3.4 1.5 0.2 0
8 4.4 2.9 1.4 0.2 1
9 4.9 3.1 1.5 0.1 1
you can use:
...
rfe = rfe.fit(df[['a', 'b', 'c', 'd']], df['target'])
...
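If you do also want the nested array structure from the question (e.g. just for printing), a minimal sketch, assuming the same hypothetical column names a, b, c, d and CSV placeholder as above:
import pandas as pd

df = pd.read_csv('csv here')  # placeholder path from the question

# Select the feature columns and convert them to a 2-D NumPy array;
# it prints in the same nested [[...] [...]] style as dataset.data.
x = df[['a', 'b', 'c', 'd']].to_numpy()
print(x)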

Matplotlib - Need to find source data from a class's attributes

I have a lines object which was created with the following:
junk = plt.plot([xxxx], [yyyy])
for x in junk:
    print(type(x))
<class 'matplotlib.lines.Line2D'>
I need to find the names of the two lists 'xxxx' and 'yyyy'. How can I get them from the class attributes?
You can use dir to see the contents of an object in Python, or check the docs for the class. I guess the attributes you are looking for are xdata and ydata (although I'm a bit confused: in your post you ask for the names of the lists?)
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.1)
y = np.sin(x)
junk = plt.plot(x, y)
for x in junk:
    # print(dir(x))
    print(x.get_xdata())
    print(x.get_ydata())
[ 0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. 1.1 1.2 1.3 1.4
1.5 1.6 1.7 1.8 1.9 2. 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
3. 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4. 4.1 4.2 4.3 4.4
4.5 4.6 4.7 4.8 4.9]
[ 0. 0.09983342 0.19866933 0.29552021 0.38941834 0.47942554
0.56464247 0.64421769 0.71735609 0.78332691 0.84147098 0.89120736
0.93203909 0.96355819 0.98544973 0.99749499 0.9995736 0.99166481
0.97384763 0.94630009 0.90929743 0.86320937 0.8084964 0.74570521
0.67546318 0.59847214 0.51550137 0.42737988 0.33498815 0.23924933
0.14112001 0.04158066 -0.05837414 -0.15774569 -0.2555411 -0.35078323
-0.44252044 -0.52983614 -0.61185789 -0.68776616 -0.7568025 -0.81827711
-0.87157577 -0.91616594 -0.95160207 -0.97753012 -0.993691 -0.99992326
-0.99616461 -0.98245261]
Hope it helps.
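A side note that is not part of the original answer: if the list returned by plt.plot was not kept around, the same Line2D objects can also be retrieved from the Axes, e.g.:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.1)
plt.plot(x, np.sin(x))

# The Axes keeps references to its Line2D artists, so the plotted
# data can be recovered without the return value of plt.plot.
for line in plt.gca().get_lines():
    print(line.get_xdata())
    print(line.get_ydata())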

Wrong decimal calculations with pandas

I have a data frame (df) in pandas with four columns, and I want a new column to represent the mean of these four columns: df['mean'] = df.mean(axis=1)
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400
So far so good. But when I save the results to a csv file this is what I found:
5.9,5.4,2.4,3.2,4.2250000000000005
0.6,0.7,0.7,0.7,0.6749999999999999
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
I guess I can force the format in the mean column, but any idea why this is happening?
I am using WinPython with Python 3.3.2 and pandas 0.11.0.
You could use the float_format parameter:
import pandas as pd
import io
content = '''\
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400'''
df = pd.read_table(io.StringIO(content), sep=r'\s+')
df.to_csv('/tmp/test.csv', float_format='%g', index=False)
yields
1,2,3,4,mean
,,,,
5.9,5.4,2.4,3.2,4.225
0.6,0.7,0.7,0.7,0.675
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
The results are actually correct: floating point numbers cannot always be represented exactly on our systems, so there are bound to be some small differences. Read The Floating Point Guide.
>>> a = 5.9+5.4+2.4+3.2
>>> a / 4
4.2250000000000005
As you said, you could always format the results if you only want a fixed number of digits after the decimal point.
>>> "{:.3f}".format(a/4)
'4.225'
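If you prefer to round the stored values themselves rather than only the CSV output, a minimal sketch (the frame below just re-creates the example data; 'out.csv' is a placeholder filename):
import pandas as pd

df = pd.DataFrame({'1': [5.9, 0.6, 2.5, 0.4],
                   '2': [5.4, 0.7, 1.6, 0.4],
                   '3': [2.4, 0.7, 1.5, 0.4],
                   '4': [3.2, 0.7, 1.2, 0.4]})
df['mean'] = df.mean(axis=1).round(3)   # round the column itself
df.to_csv('out.csv', index=False)       # or pass float_format='%.3f' here instead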
