Specifying Range of Columns in Python - python

Specifically, I am asking how one would take several columns from a text file and input those into an array without specifying each column individually.
1 2 1 4.151E-12 4.553E-12 4.600E-12 4.852E-12 6.173E-12 7.756E-12 9.383E-12 1.096E-11 1.243E-11 1.379E-11 1.504E-11 1.619E-11 1.724E-11 2.139E-11 2.426E-11 2.791E-11 3.009E-11 3.152E-11 3.252E-11 3.326E-11 3.382E-11 3.426E-11 3.462E-11 3.572E-11 3.640E-11 3.698E-11 3.752E-11
2 3 1 1.433E-12 1.655E-12 1.907E-12 2.014E-12 2.282E-12 2.682E-12 3.159E-12 3.685E-12 4.246E-12 4.833E-12 5.440E-12 6.059E-12 6.688E-12 9.845E-12 1.285E-11 1.810E-11 2.238E-11 2.590E-11 2.886E-11 3.139E-11 3.359E-11 3.552E-11 3.724E-11 4.375E-11 4.832E-11 5.192E-11 5.486E-11
For example, I want the second column of this data set in an array by itself, and I want the third column in an array by itself. However, I want column four through the last column in arrays that are separated by column. I don't know how to do this without specifying each individual column.

Given that you mentioned a text file, I'm gonna treat it as the content is fetched from a text file line by line:
with open("data.txt") as f:
for line in f:
data = line.split()
# I want the second column of this data set in an array by itself
second_column = data[1]
# I want the third column in an array by itself
third_column = data[2]
# I want column four through the last column in arrays that are separated by column
fourth_to_last_column = data[3:]
If you then print second_column, third_column and fourth_to_last_column of the first line/input, it would look like this:
2
1
['4.151E-12', '4.553E-12', '4.600E-12', '4.852E-12', '6.173E-12', '7.756E-12', '9.383E-12', '1.096E-11', '1.243E-11', '1.379E-11', '1.504E-11', '1.619E-11', '1.724E-11', '2.139E-11', '2.426E-11', '2.791E-11', '3.009E-11', '3.152E-11', '3.252E-11', '3.326E-11', '3.382E-11', '3.426E-11', '3.462E-11', '3.572E-11', '3.640E-11', '3.698E-11', '3.752E-11']

Related

splitting of urls from a list in dataframe where column name is company_urls

I have a dataframe(df) like this:
company_urls
0 [https://www.linkedin.com/company/gulf-capital...
1 [https://www.linkedin.com/company/gulf-capital...
2 [https://www.linkedin.com/company/fajr-capital...
3 [https://www.linkedin.com/company/goldman-sach...
And df.company_urls[0] is
['https://www.linkedin.com/company/gulf-capital/about/',
'https://www.linkedin.com/company/the-abraaj-group/about/',
'https://www.linkedin.com/company/abu-dhabi-investment-company/about/',
'https://www.linkedin.com/company/national-bank-of-dubai/about/',
'https://www.linkedin.com/company/efg-hermes/about/']
So I have to create a new columns like this:
company_urls company_url1 company_url2 company_url3 ...
0 [https://www.linkedin.com/company/gulf-capital... https://www.linkedin.com/company/gulf-capital/about/ https://www.linkedin.com/company/the-abraaj-group/about/...
1 [https://www.linkedin.com/company/gulf-capital... https://www.linkedin.com/company/gulf-capital/about/ https://www.linkedin.com/company/gulf-related/about/...
2 [https://www.linkedin.com/company/fajr-capital... https://www.linkedin.com/company/fajr-capital/about/...
3 [https://www.linkedin.com/company/goldman-sach... https://www.linkedin.com/company/goldman-sachs/about/...
How do I do that?
I have created this function for my personal use, and I think will work for your needs:
a) Specify the df name
b) Specify the column you want to split
c) Specify the delimiter
def composition_split(dat,col,divider =','): # set your delimiter here
"""
splits the column of interest depending on how many delimiters we have
creates all the columns needed to make the split
"""
x1 = dat[col].astype(str).apply(lambda x: x.count(divider)).max()
x2 = ["company_url_"+str(i) for i in np.arange(0,x1+1,1)]
dat[x2] = dat[col].str.split(divider,expand = True)
return dat
Basically this will create as many columns needed depending on how you specify the delimiter. For example, if the URL can be split 3 times based on a certain delimiter, it will create 3 new columns.
your_new_df = composition_split(df,'col_to_split',',') # for example

pandas multiple dataframe plot

I have two data frames. They have the same structure but they come from two different model. Basically, I would like to compare them in order to find the differences. The first thing that I would like to do is to plot two rows, the first from the first data frames and the second from the other.
This is what I do:
I read the two csv file,
PRICES = pd.read_csv('test_model_1.csv',sep=';',index_col=0, header = 0)
PRICES_B = pd.read_csv('bench_mark.csv',sep=';',index_col=0, header = 0)
then I plot the 8th column of both, as:
rowM = PRICES.iloc[8]
rowB = PRICES_B.iloc[8]
rowM.plot()
rowB.plot()
It does not seem the correct way. Indeed, I am not able to choose the labels or the legends.
This the results:
comparison between the 8th row of the first dataframe and the 8th row of the second dataframe
Someone could suggest me the correct way to compare the two data frames and plot some of the selected columns?
lets prepare some test data:
mtx1 = np.random.rand(10,8)*1.1+2
mtx2 = np.random.rand(10,8)+2
df1 = pd.DataFrame(mtx1)
df2 = pd.DataFrame(mtx2)
example output for df1:
Out[60]:
0 1 2 3
0 2.604748 2.233979 2.575730 2.491230
1 3.005079 2.984622 2.745642 2.082218
2 2.577554 3.001736 2.560687 2.838092
3 2.342114 2.435438 2.449978 2.984128
4 2.416953 2.124780 2.476963 2.766410
5 2.468492 2.662972 2.975939 3.026482
6 2.738153 3.024694 2.916784 2.988288
7 2.082538 3.030582 2.959201 2.438686
8 2.917811 2.798586 2.648060 2.991314
9 2.133571 2.162194 2.085843 2.927913
now let's plot it:
import matplotlib.pyplot as plt
%matplotlib inline
i = range(0,len(df1.loc[6,:])) # from 0 to 3
plt.plot(i,df1.loc[6,:]) # take whole row 6
plt.plot(i,df2.loc[6,:]) # take whole row 6
result:

how can I really change the values of specific text file with pandas

all
I have a txt file with many columns with no headers.
I use
df=pd.read_csv('a.txt',sep=' ',header=None,usecols=[2,3],names=['waiting time','running time')
suppose the columns would be like this:
waiting time running time
0 8617344 8638976
1 8681728 8703360
2 8703488 8725120
3 8725120 8725760
4 4185856 4207488
for the third column, I want to subtract values of the second columns, then I can get
waiting time running time
0 8617344 21632
1 8681728 21632
2 8703488 21632
3 8725120 640
4 4185856 21632
My question is that how let change really happen in txt file? It means the txt file has been really changed correspondingly.
If your question is how to update the text file with your new data, you just use the write version of your first line:
# Save to file a.txt
# Use a space as the separator
# Don't save the index
df.to_csv('a.txt', sep=' ', index=False)

How to pre-process a very large data in python

I have a couple of files 100 MB each. The format for those files looks like this:
0 1 2 5 8 67 9 122
1 4 5 2 5 8
0 2 1 5 6
.....
(note the actual file does not have the alignment spaces added in, only one space separates each element, added alignment for aesthetic effect)
this first element in each row is it's binary classification, and the rest of the row are indices of features where the value is 1. For instance, the third row says the row's second, first, fifth and sixth features are 1, the rest are zeros.
I tried to read each line from each file, and use sparse.coo_matrix create a sparse matrix like this:
for train in train_files:
with open(train) as f:
row = []
col = []
for index, line in enumerate(f):
record = line.rstrip().split(' ')
row = row+[index]*(len(record)-4)
col = col+record[4:]
row = np.array(row)
col = np.array(col)
data = np.array([1]*len(row))
mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))
mmwrite(train+'trans',mtx)
but this took forever to finish. I started reading the data at night, and let the computer run after I went to sleep, and when I woke up, it still haven't finish the first file!
What are the better ways to process this kind of data?
I think this would be a bit faster than your method because it does not read file line by line. You can try this code with a small portion of one file and compare with your code.
This code also requires to know the feature number in advance. If we don't know the feature number, it would require another line of code which was commented out.
import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial
def writeMx(result, row):
# zero-based matrix requires the feature number minus 1
col_ind = row.dropna().values - 1
# Assign values without duplicating row index and values
result[row.name, col_ind] = 1
def fileToMx(f):
# number of features
col_n = 136
df = pd.read_csv(f, names=list(range(0,col_n+2)),sep=' ')
# This is the label of the binary classification
label = df.pop(0)
# Or get the feature number by the line below
# But it would not be the same across different files
# col_n = df.max().max()
# Number of row
row_n = len(label)
# Generate feature matrix for one file
result = lil_matrix((row_n, col_n))
# Save features in matrix
# DataFrame.apply() is usually faster than normal looping
df.apply(partial(writeMx, result), axis=0)
return(result)
for train in train_files:
# result is the sparse matrix you can further save or use
result = fileToMx(train)
print(result.shape, result.nnz)
# The shape of matrix and number of nonzero values
# ((420, 136), 15)

reading variable number of columns with Python

I need to read a variable number of columns from my input file ( the number of columns is defined by the user, there's no limitation ). For every column I have multiple variables to read, three in my case, set by the user as well.
So the file to read is like:
2 3 5
6 7 9
3 6 8
In Fortran this is really easy to do:
DO 180 I=1,NMOD
READ(10,*) QARR(I),RARR(I),WARR(I)
NMOD is defined by the user, as well as all the values in the example. All of them are input parameters to be stored in memory. By doing these I can save all the variables I need and I can use it whenever I want, recalling them by changing the I index. How can I obtain the same result with Python?
Example file 'text'
2 3 5
6 7 9
3 6 8
Python code
data = []
with open('text') as file:
columns_to_read = 1 # here you tell how many columns you want to read per line
for line in file:
data.append(list(map(int, line.split()[:columns_to_read])))
print(data) # print: [[2], [6], [3]]
data will hold an array of arrays that represent your lines.
from itertools import islice
with open('file.txt', 'rt') as f:
# default slice from row 0 until end with step 1
# example islice(10, 20, 2) take only row 10,12,14,16,18
dat = islice(f, 0, None, 1)
column = None # change column here, default to all
# this keep the list value as string
# mylist = [i.split() for i in dat]
# this keep the list value as int
mylist = [[int(j) for j for i.split()[:column] for i in dat]
Code above construct 2d list
access with mylist[row][column]
Example - mylist[2][3] access row 2 column 3
Edit : improve code efficiency with #Guillaume #Javier suggestion

Categories

Resources