Subsetting columns of a data frame into variables - python

I have a data frame (a .txt from R) that looks like this:
my_sample my_coord1 my_coord2 my_cl
A 0.34 0.12 1
B 0.2 1.11 1
C 0.23 0.10 1
D 0.9 0.34 2
E 0.21 0.6 2
... ... ... ...
Using Python I would like to extract columns 2 and 3 into one variable and column 4 into another. In R this would be: my_var1 = mydf[,c(2:3)] and my_var2 = mydf[,4]. I don't know how to do this in Python, but I tried:
mydf = open("mydf.txt", "r")
lines = mydf.readlines()
for line in lines:
    sline = line.split(' ')
    print(sline)
mydf.close()
But I don't know how to save each subset into its own variable.
I know it seems quite a simple question, but I'm a newbie in the field.
Thank you in advance.

You can use read_table from pandas to deal with tabular data files. The code:
import pandas as pd
mydf = pd.read_table('mydf.txt', delim_whitespace=True)
my_var1 = mydf[['my_coord1','my_coord2']]
my_var2 = mydf['my_cl']
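If you prefer positional indexing, matching the R code above, iloc does the same thing (a short sketch; the column positions come from the sample file):

my_var1 = mydf.iloc[:, 1:3]  # columns 2 and 3 (0-based slice 1:3)
my_var2 = mydf.iloc[:, 3]    # column 4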

Related

How to write grid in csv file in python

I have a list of tuples. Each tuple contains two values, together with the result of an operation between the two values. Here is an example:
my_list = [(1,1,1.0), (1,2,0.8), (1,3,0.3), (2,1,0.8), (2,2,1.0), (2,3,0.5), (3,1,0.3), (3,2,0.5), (3,3,1.0)]
I need to store these values in a csv file so that they look like this:
0 1 2 3
1 1 0.8 0.3
2 0.8 1 0.5
3 0.3 0.5 1
In other words, I need to start a new row every time the first number of the tuple changes.
This is the function I am currently using, which writes each tuple to a new row (not what I want):
import csv

def write_csv(my_list, fname=''):
    with open(fname, mode='a+') as f:
        f_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for x in my_list:
            f_writer.writerow([str(x[0]), str(x[1]), str(x[2])])
Any suggestions on how to modify it (or rewrite it from scratch)?
You could use a combination of the NumPy and pandas Python libraries.
import pandas as pd
import numpy as np
my_list = [(1,1,1.0), (1,2,0.8), (1,3,0.3), (2,1,0.8), (2,2,1.0), (2,3,0.5), (3,1,0.3), (3,2,0.5), (3,3,1.0)]
new_list = [cell[2] for cell in my_list] # Extract cell values
np_array = np.array(new_list).reshape(3,3) # Create 3x3 matrix
df = pd.DataFrame(np_array) # Create a dataframe
df.to_csv("test.csv") # Write to a csv
For clarity, the dataframe will look like:
df
0 1 2
0 1.0 0.8 0.3
1 0.8 1.0 0.5
2 0.3 0.5 1.0
And the csv file will look like:
,0,1,2
0,1.0,0.8,0.3
1,0.8,1.0,0.5
2,0.3,0.5,1.0
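If you would rather keep the 1-3 row and column labels from the tuples themselves, a pandas pivot does the same reshaping without hard-coding the 3x3 shape (a sketch on the same input; the column names row/col/value are just illustrative):

import pandas as pd

my_list = [(1,1,1.0), (1,2,0.8), (1,3,0.3), (2,1,0.8), (2,2,1.0), (2,3,0.5), (3,1,0.3), (3,2,0.5), (3,3,1.0)]
df = pd.DataFrame(my_list, columns=['row', 'col', 'value'])
grid = df.pivot(index='row', columns='col', values='value')  # rows and columns labelled 1-3
grid.to_csv('test.csv')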

How can I remove extra digits of a float64 value?

I have a data frame column:
P08107 3.658940e-11
P62979 4.817399e-05
P16401 7.784275e-05
Q96B49 7.784275e-05
Q15637 2.099078e-04
P31689 1.274387e-03
P62258 1.662718e-03
P07437 3.029516e-03
O00410 3.029516e-03
P23381 3.029516e-03
P27348 5.733834e-03
P29590 9.559550e-03
P25685 9.957186e-03
P09429 1.181282e-02
P62937 1.260040e-02
P11021 1.396807e-02
P31946 1.409311e-02
P19338 1.503901e-02
Q14974 2.213431e-02
P11142 2.402201e-02
I want to keep one decimal and remove the extra digits, so that it looks like
3.7e-11
instead of
3.658940e-11
and so on for all the others.
I know how to slice a string, but that doesn't seem to work here.
If you have a pandas dataframe you could set the display options.
import pandas as pd
import numpy as np
pd.options.display.float_format = '{:.2f}'.format
pd.DataFrame(dict(randomvalues=np.random.random_sample((5,))))
Returns:
randomvalues
0 0.02
1 0.66
2 0.24
3 0.87
4 0.63
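Note that float_format only changes how the dataframe is displayed; the underlying float64 values keep their full precision.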
You could use str.format:
>>> '{:.2g}'.format(3.658940e-11)
'3.7e-11'
String slicing will not work here, because it does not round the values:
>>> s = '3.658940e-11'
>>> s[:3] + 'e' + s.split('e')[1]
'3.6e-11'
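To apply the same rounding to a whole column rather than a single value, Series.map works (a sketch; s stands for your column and is an assumed name):

formatted = s.map('{:.2g}'.format)  # a Series of strings such as '3.7e-11'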

How to append to the right ID's column in CSV with pandas?

I have got a test file and 100 models which I would like to evaluate on the test.
In the test file there are two columns: the first is IDs and the second is Probability.
I would like each model to append its evaluation to a new column next to the relevant ID.
My code right now stacks the results under each other, like this:
1 0.1
2 0.12
3 0.32
1 0.21
2 0.22
3 0.17
And I would need form like this:
1 0.1 0.21
2 0.12 0.22
3 0.32 0.17
to a csv.
My code looks like this:
for chunk in pd.read_csv('test_numeric_out.csv', chunksize=10000):
    chunk = chunk.drop(chunk.columns[len(chunk.columns) - 1], axis=1)
    for model in models:
        X_test = chunk.drop(['Id'], axis=1)
        inputnames = X_test.columns.values
        X_test['p_0'] = 0
        X_test['p_1'] = 0
        X_test[['p_0', 'p_1']] = model.predict_proba(X_test[inputnames])
        submission = pd.DataFrame({
            "Id": chunk['Id'],
            "Response": X_test['p_1']
        })
        if head == 0:
            submission.to_csv(proba_out_csv,
                              index=False,
                              header=True,
                              mode='a',
                              chunksize=100000)
        else:
            submission.to_csv(proba_out_csv,
                              index=False,
                              header=False,
                              mode='a',
                              chunksize=100000)
        head = 1
I believe it can be done a bit easier.
inputnames = chunk.columns.drop('Id').values
# drop works here too, so there is no need to create an additional
# dataframe just to get inputnames
for i, model in enumerate(models):
    # we are interested only in the second column of predict_proba;
    # no need for a separate dataframe to store the results,
    # just create a distinct column for each model
    chunk['p_1_{}'.format(i)] = model.predict_proba(chunk[inputnames])[:, 1]
chunk.to_csv(proba_out_csv)
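For completeness, a minimal sketch of the whole chunked loop built on that idea, reusing the names from the question (models and proba_out_csv are assumed to be defined as before; mode and header are set so that later chunks append instead of overwriting):

import pandas as pd

first = True
for chunk in pd.read_csv('test_numeric_out.csv', chunksize=10000):
    inputnames = chunk.columns.drop('Id').values
    for i, model in enumerate(models):
        chunk['p_1_{}'.format(i)] = model.predict_proba(chunk[inputnames])[:, 1]
    chunk.to_csv(proba_out_csv, index=False, mode='w' if first else 'a', header=first)
    first = False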

Splitting dataframe python

I have this relatively large (9 MB) JSON file; it's a list of dicts (I don't know if that's the convention for JSON). Anyway, I've been able to read it in and turn it into a data frame.
The data is a backtest for a predictive model and is of the format:
[{"assetname":"xxx", 'return':0.9, "timestamp":1451080800},{"assetname":"xxx", 'return':0.9, "timestamp":1451080800}...{"assetname":"yyy", 'return':0.9, "timestamp":1451080800},{"assetname":"yyy", 'return':0.9, "timestamp":1451080800} ]
I would like to separate all the assets into their own data frames, can anyone help?
Here's the data btw
http://www.mediafire.com/view/957et8za5wv56ba/test_predictions.json
Just put your data into DataFrame:
import pandas as pd
df = pd.DataFrame([{"assetname": "xxx", 'return': 0.9, "timestamp": 1451080800},
                   {"assetname": "xxx", 'return': 0.9, "timestamp": 1451080800},
                   {"assetname": "yyy", 'return': 0.9, "timestamp": 1451080800},
                   {"assetname": "yyy", 'return': 0.9, "timestamp": 1451080800}])
print(df)
Output:
assetname return timestamp
0 xxx 0.9 1451080800
1 xxx 0.9 1451080800
2 yyy 0.9 1451080800
3 yyy 0.9 1451080800
You can load a dataframe from a json file like this:
In [9]: from pandas.io.json import read_json
In [10]: d = read_json('Descargas/test_predictions.json')
In [11]: d.head()
Out[11]:
market_trading_pair next_future_timestep_return ohlcv_start_date \
0 Poloniex_ETH_BTC 0.003013 1450753200
1 Poloniex_ETH_BTC -0.006521 1450756800
2 Poloniex_ETH_BTC 0.003171 1450760400
3 Poloniex_ETH_BTC -0.003083 1450764000
4 Poloniex_ETH_BTC -0.001382 1450767600
prediction_at_ohlcv_end_date
0 -0.157053
1 -0.920074
2 0.999806
3 0.627140
4 0.999857
You may split it like this:
Poloniex_ETH_BTC = d[d['market_trading_pair'] == 'Poloniex_ETH_BTC']
Extending rapto's answer, you can split the whole dataframe by the value of one column like this:
df_dict = dict()
for name, df in d.groupby('market_trading_pair'):
    df_dict[name] = df
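The same thing reads a bit tighter as a dict comprehension:

df_dict = {name: df for name, df in d.groupby('market_trading_pair')}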

Pandas Read CSV with string delimiters via regex

I am trying to import a weirdly formatted text file into a pandas DataFrame. Two example lines are below:
LOADED LANE 1 MAT. TYPE= 2 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.474 LOADEFFECT 5075. LMAX= 3643. COV= .13
LOADED LANE 1 MAT. TYPE= 3 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.515 LOADEFFECT10009. LMAX= 9732. COV= .08
First I tried the following:
df = pd.read_csv('beta.txt', header=None, delim_whitespace=True, usecols=[2,5,7,9,11,13,15,17,19])
This seemed to work fine; however, it got messed up when it hit the second example line above, where there is no whitespace after the LOADEFFECT string (you may need to scroll right a bit to see it in the example). I got a result like:
632 1 2 1 200 10 3.474 5075. 3643. 0.13
633 1 3 1 200 10 3.515 LMAX= COV= NaN
Then I decided to use a regular expression to define my delimiters. After many trial-and-error runs (I am no expert in regex), I managed to get close with the following line:
df = pd.read_csv('beta.txt', header=None, sep='/s +|LOADED LANE|MAT. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=', engine='python')
This almost works, but for some reason creates a NaN column at the very beginning:
632 NaN 1 2 1 200 10 3.474 5075 3643 0.13
633 NaN 1 3 1 200 10 3.515 10009 9732 0.08
At this point I think I can just delete that first column and get away with it. However, I wonder what the correct way would be to set up the regex so that it parses this text file in one shot. Any ideas? Other than that, I am sure there is a smarter way to parse this file; I would be glad to hear your recommendations.
Thanks!
import re
import pandas as pd
import csv

csvfile = open("parsing.txt")  # open the text file
reader = csv.reader(csvfile)
new_list = []
for line in reader:
    for i in line:
        new_list.append(re.findall(r'(\d*\.\d+|\d+)', i))  # pull out all numeric tokens
table = pd.DataFrame(new_list)
table  # a pandas DataFrame holding the extracted values
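As for the one-shot read_csv approach in the question: the leading NaN column appears because each line starts with the LOADED LANE delimiter itself, so the split yields an empty first field. Dropping that column is therefore a legitimate fix rather than a workaround; a sketch along those lines, keeping the question's delimiter list:

import pandas as pd

labels = r'LOADED LANE|MAT\. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV='
df = pd.read_csv('beta.txt', header=None, sep=labels, engine='python')
df = df.drop(columns=0)  # the empty field before the first delimiter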
