Only one index label in the dataset - python

I am working with the ecoli dataset from http://archive.ics.uci.edu/ml/datasets/Ecoli. The values are separated by tabs. I would like to give each column a name. But when I do that using the following code:
import pandas as pd
ecoli_cols = ['N_ecoli', 'info1', 'info2', 'info3', 'info4', 'info5', 'info6', 'info7', 'type']
d = pd.read_table('ecoli.csv', sep=' ', header=None, names=ecoli_cols)
Instead of naming each existing column, it creates 6 new columns. But I would like to have those names on the columns that I already have. Later I would like to extract information from this dataset, so it is important that the values are properly separated into columns. Thanks

You can use the URL of the data file directly, with separator \s+ (one or more whitespace characters):
import pandas as pd

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data'
ecoli_cols = ['N_ecoli', 'info1', 'info2', 'info3', 'info4', 'info5', 'info6', 'info7', 'type']
df = pd.read_table(url, sep='\s+', header=None, names=ecoli_cols)
# alternative: use the delim_whitespace parameter
# df = pd.read_table(url, delim_whitespace=True, header=None, names=ecoli_cols)
print(df.head())
N_ecoli info1 info2 info3 info4 info5 info6 info7 type
0 AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35 cp
1 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44 cp
2 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46 cp
3 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36 cp
4 ADI_ECOLI 0.23 0.32 0.48 0.5 0.55 0.25 0.35 cp
But if you want to use your file with tab as the separator:
d = pd.read_table('ecoli.csv', sep='\t', header=None, names=ecoli_cols)
And if the separator is ;:
d = pd.read_table('ecoli.csv', sep=';', header=None, names=ecoli_cols)

Related

Initial value of multiple variables dataframe for time dilation

Dataframe:
product1  product2  product3  product4  product5
straws    orange    melon     chair     bread
melon     milk      book      coffee    cake
bread     melon     coffe     chair     book
CountProduct1  CountProduct2  CountProduct3  CountProduct4  CountProduct5
1              1              1              1              1
2              1              1              1              1
2              3              2              2              2
RatioProduct1  RatioProduct2  RatioProduct3  RatioProduct4  RatioProduct5
0.28           0.54           0.33           0.35           0.11
0.67           0.25           0.13           0.11           0.59
2.5            1.69           1.9            2.5            1.52
I want to create five other columns that keep the initial ratio of each item along the dataframe.
Output:
InitialRatio1  InitialRatio2  InitialRatio3  InitialRatio4  InitialRatio5
0.28           0.54           0.33           0.35           0.11
0.33           0.25           0.13           0.31           0.59
0.11           0.33           0.31           0.35           0.13
Check your data again: is there a typo between product3 = coffe and product4 = coffee? I fixed coffe to coffee below, and as a result the expected 0.31 values cannot be right.
import pandas as pd

pd.set_option('display.max_rows', None)     # print all rows
pd.set_option('display.max_columns', None)  # print all columns

df = pd.DataFrame({
    'product1': ['straws', 'melon', 'bread'],
    'product2': ['orange', 'milk', 'melon'],
    'product3': ['melon', 'book', 'coffee'],
    'product4': ['chair', 'coffee', 'chair'],
    'product5': ['bread', 'cake', 'book'],
    'time': [1, 2, 3],
    'Count1': [1, 2, 2],
    'Count2': [1, 1, 3],
    'Count3': [1, 1, 2],
    'Count4': [1, 1, 2],
    'Count5': [1, 1, 2],
    'ratio1': [0.28, 0.67, 2.5],
    'ratio2': [0.54, 0.25, 1.69],
    'ratio3': [0.33, 0.13, 1.9],
    'ratio4': [0.35, 0.11, 2.5],
    'ratio5': [0.11, 0.59, 1.52],
})
print(df)
product = df[['product1', 'product2', 'product3', 'product4', 'product5']].stack().reset_index()
count = df[['Count1', 'Count2', 'Count3', 'Count4', 'Count5']].stack().reset_index()
ratio = df[['ratio1', 'ratio2', 'ratio3', 'ratio4', 'ratio5']].stack().reset_index()
print(ratio)

arr = pd.unique(product[0])
# indexes into arr of products that occur more than once
aaa = [i for i in range(len(arr)) if product[product[0] == arr[i]].count()[0] > 1]
for i in aaa:
    prod_ind = product[product[0] == arr[i]].index
    val_ratio = ratio.loc[prod_ind[0], 0]
    ratio.loc[prod_ind, 0] = val_ratio

print(ratio.pivot_table(index='level_0', columns='level_1', values=[0]))
Output:
level_1 ratio1 ratio2 ratio3 ratio4 ratio5
level_0
0 0.28 0.54 0.33 0.35 0.11
1 0.33 0.25 0.13 0.11 0.59
2 0.11 0.33 0.11 0.35 0.13
To work with the data, each group of columns first needs to be turned into one column using stack().reset_index(). Then I create the list of unique products, arr. In the list aaa I collect the indexes of arr entries that occur more than once.
prod_ind = product[product[0] == arr[i]].index
In the loop, I get the row indexes of each repeated product.
val_ratio = ratio.loc[prod_ind[0], 0]
Get the ratio of the product's first occurrence.
ratio.loc[prod_ind, 0] = val_ratio
Set this value for all occurrences of the product.
To access the values, explicit loc indexing is used: the row indices go on the left of the comma and the column names on the right.
With pivot_table I rebuild the original table shape.
To insert the processed data into the original dataframe, simply use the following:
table = ratio.pivot_table(index='level_0', columns='level_1', values=[0])
df[['ratio1', 'ratio2', 'ratio3', 'ratio4', 'ratio5']] = table
print(df)
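The same first-occurrence logic can also be written without the index bookkeeping, using groupby(...).transform('first') on the flattened arrays. A sketch with the same sample data (coffe already fixed to coffee):

```python
import pandas as pd

df = pd.DataFrame({
    'product1': ['straws', 'melon', 'bread'],
    'product2': ['orange', 'milk', 'melon'],
    'product3': ['melon', 'book', 'coffee'],
    'product4': ['chair', 'coffee', 'chair'],
    'product5': ['bread', 'cake', 'book'],
    'ratio1': [0.28, 0.67, 2.5],
    'ratio2': [0.54, 0.25, 1.69],
    'ratio3': [0.33, 0.13, 1.9],
    'ratio4': [0.35, 0.11, 2.5],
    'ratio5': [0.11, 0.59, 1.52],
})

prod_cols = ['product1', 'product2', 'product3', 'product4', 'product5']
ratio_cols = ['ratio1', 'ratio2', 'ratio3', 'ratio4', 'ratio5']

# flatten both blocks row by row so positions line up
flat_prod = df[prod_cols].to_numpy().ravel()
flat_ratio = pd.Series(df[ratio_cols].to_numpy().ravel())

# every occurrence of a product gets the ratio of its first occurrence
first_ratio = flat_ratio.groupby(flat_prod).transform('first')

init = pd.DataFrame(first_ratio.to_numpy().reshape(len(df), 5),
                    columns=['InitialRatio' + str(i) for i in range(1, 6)])
print(init)
```

This produces the same three rows as the pivot_table output above, one InitialRatio column per product column.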
If you're after code to create the init_rateX columns, then the following will work:
import numpy as np

pd.DataFrame(
    np.divide(
        df[["ratio1", "ratio2", "ratio3", "ratio4", "ratio5"]].to_numpy(),
        df[["Count1", "Count2", "Count3", "Count4", "Count5"]].to_numpy(),
    ),
    columns=["init_rate1", "init_rate2", "init_rate3", "init_rate4", "init_rate5"],
)
which gives
init_rate1 init_rate2 init_rate3 init_rate4 init_rate5
0 0.28 0.25 0.33 0.57 0.835
1 0.33 0.13 0.97 0.65 0.760
2 0.54 0.11 0.45 0.95 1.160
3 0.35 0.59 0.34 1.25 1.650
However, it does not agree with your calculations for init_rate4 or init_rate5, so some clarification might be needed.
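The detour through numpy is optional: DataFrame.div accepts a plain array and divides positionally, sidestepping column-label alignment. A sketch on a three-column subset of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Count1': [1, 2, 2], 'Count2': [1, 1, 3], 'Count3': [1, 1, 2],
    'ratio1': [0.28, 0.67, 2.5], 'ratio2': [0.54, 0.25, 1.69], 'ratio3': [0.33, 0.13, 1.9],
})

# .to_numpy() strips the Count labels so the division is purely positional
init = df[['ratio1', 'ratio2', 'ratio3']].div(df[['Count1', 'Count2', 'Count3']].to_numpy())
init.columns = ['init_rate1', 'init_rate2', 'init_rate3']
print(init)
```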

Pandas read_csv Multiple spaces delimiter

I have a file with 7 aligned columns, with empty cells. Example:
SN 1995ap 0.230 40.44 0.46 0.00 silver
SN 1995ao 0.300 40.76 0.60 0.00 silver
SN 1995ae 0.067 37.54 0.34 0.00 silver
SN 1995az 0.450 42.13 0.21 gold
SN 1995ay 0.480 42.37 0.20 gold
SN 1995ax 0.615 42.85 0.23 gold
I want to read it using pandas.read_csv(), but I have some trouble. The separator can be either 1 or 2 spaces. If I use sep='\s+' it works, but it ignores empty cells, so the values get shifted to the left and the empty cells end up in the last columns. I tried the regex separator sep='\s{1,2}', but I get the following error:
pandas.errors.ParserError: Expected 7 fields in line 63, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
My code:
import pandas as pd
riess_2004b = pd.read_csv('Riess_2004b.txt', skiprows=22, header=None, sep='\s{1,2}', engine='python')
What am I not getting right?
A fixed-width file reader (read_fwf) seems like a better fit for your case:
df = pd.read_fwf("Riess_2004b.txt", colspecs="infer", header=None)
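When "infer" guesses wrong, colspecs can also be given explicitly as (start, end) character ranges. A sketch with two of the sample rows; the positions are assumptions derived from the sample, not from the real file:

```python
import io
import pandas as pd

data = ("SN 1995ap 0.230 40.44 0.46 0.00 silver\n"
        "SN 1995az 0.450 42.13 0.21       gold\n")

# explicit (start, end) character ranges for the 7 columns
colspecs = [(0, 2), (3, 9), (10, 15), (16, 21), (22, 26), (27, 31), (32, 38)]
df = pd.read_fwf(io.StringIO(data), colspecs=colspecs, header=None)
print(df)
```

The empty field in the second row stays in its own column as NaN instead of shifting gold to the left.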
If there are no extra spaces inside your field values and no consecutive empty cells in one row, you can try the delim_whitespace argument and then shift the NaN part one column to the left:
df = pd.read_csv('xx', delim_whitespace=True)
def shift(col):
    m = col.isna().shift(-1, fill_value=False)
    col = col.fillna(method='ffill')
    col[m] = pd.NA
    return col

df = df.T.apply(shift, axis=0).T
print(df)
SN 1995ap 0.230 40.44 0.46 0.00 silver
0 SN 1995ao 0.3 40.76 0.6 0.00 silver
1 SN 1995ae 0.067 37.54 0.34 0.00 silver
2 SN 1995az 0.45 42.13 0.21 <NA> gold
3 SN 1995ay 0.48 42.37 0.2 <NA> gold
4 SN 1995ax 0.615 42.85 0.23 <NA> gold

filter dataframe by condition when the condition is for the columns and not on the values

I have dataframe with float columns that look similar to this:
>>>397.55 400.231 404.42 407.12 465.23 478.92 492.3 501.2 505.6 ...
0 0.23 0.122 0.43 0.11 0.345 0.22 0.66 0.34 0.21
1 0.12 0.443 0.76 0.12 0.22 0.24 0.56 0.11 0.04
2 0.45 0.87 0.23 0.99 0.11 0.44 0.78 0.65 0.23
...
I want to filter the dataframe so it has only the columns whose names are between 405.2 and 472.7.
I have tried to filter it with a condition on the columns, but it did not work:
df[(df.columns>405.2)]
>>>ValueError: Item wrong length 224 instead of 10783.
224 is the number of columns and 10783 is number of rows.
Is there any way I can filter my dataframe to be between two values when the values are the column name?
Use DataFrame.loc: the first : means select all rows, and the columns are selected by the condition:
df.loc[:, (df.columns>405.2)]
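To apply both bounds from the question, combine two comparisons on df.columns with &. A small sketch with a handful of the question's column labels:

```python
import numpy as np
import pandas as pd

# toy frame with float column labels like the question's
df = pd.DataFrame(np.ones((2, 5)),
                  columns=[397.55, 404.42, 407.12, 465.23, 478.92])

# the mask has one boolean per column, so it must go on the column axis of loc
mask = (df.columns > 405.2) & (df.columns < 472.7)
filtered = df.loc[:, mask]
print(filtered.columns.tolist())  # [407.12, 465.23]
```

The ValueError in the question came from passing the column-length mask to df[...], which tries to align it against the rows.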

Create new DF with values representing difference between two dataframes

I am working with two numeric data.frames, both with 13803 obs and 13803 variables. Their column and row names are identical; however, their entries differ. What I want to do is create a new data.frame where I have subtracted the df2 values from the df1 values.
The "formula" would be: df1(entry values) - df2(entry values) = df3 (difference). In other words, the purpose is to find the difference between all entries.
My problem illustrated here.
DF1
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.71 0.98 0.32
[GENE128] 0.23 0.61 0.90
[GENE271] 0.87 0.95 0.63
DF2
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.70 0.94 0.30
[GENE128] 0.25 0.51 0.80
[GENE271] 0.82 0.92 0.60
NEW DF3
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.01 0.04 0.02
[GENE128] -.02 0.10 0.10
[GENE271] 0.05 0.03 0.03
So, in DF3 the values are the difference between DF1 and DF2 for each entry.
DF1(GENE231) - DF2(GENE231) = DF3(DIFFERENCE-GENE231)
DF1(GENE271) - DF2(GENE271) = DF3(DIFFERENCE-GENE271)
and so on...
Help would be much appreciated!
Kind regards,
Harebell
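The question is about R data.frames, but the equivalent elementwise subtraction in pandas (the language used elsewhere on this page) is a plain -, which aligns on the shared row and column labels. A sketch with the question's values:

```python
import pandas as pd

genes = ['GENE231', 'GENE128', 'GENE271']
cols = ['GENE128', 'GENE271', 'GENE2983']

df1 = pd.DataFrame([[0.71, 0.98, 0.32],
                    [0.23, 0.61, 0.90],
                    [0.87, 0.95, 0.63]], index=genes, columns=cols)
df2 = pd.DataFrame([[0.70, 0.94, 0.30],
                    [0.25, 0.51, 0.80],
                    [0.82, 0.92, 0.60]], index=genes, columns=cols)

# subtraction aligns on row and column labels, so identical
# dimnames guarantee entrywise differences
df3 = df1 - df2
print(df3.round(2))
```

Because the operation aligns on labels rather than positions, it stays correct even if the two frames store their rows in different orders.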

python pandas How to remove outliers from a dataframe and replace with an average value of preceding records

I have a dataframe with 16k records and multiple groups of countries and other fields. I have produced an initial output of the data that looks like the snippet below. Now I need to do some data cleansing and manipulation, removing skews or outliers and replacing them with values based on certain rules.
I.e., in the data below, how could I identify the skewed points (any value greater than 1) and replace them with the average of the next two records, or the previous record if there are no later records (within that group)?
So in the dataframe below I would like to replace Bill%4 for IT week1 (1.21) with the average of week2 and week3 for IT, so it becomes 0.81.
Any tricks for this?
Country Week Bill%1 Bill%2 Bill%3 Bill%4 Bill%5 Bill%6
IT week1 0.94 0.88 0.85 1.21 0.77 0.75
IT week2 0.93 0.88 1.25 0.80 0.77 0.72
IT week3 0.94 1.33 0.85 0.82 0.76 0.76
IT week4 1.39 0.89 0.86 0.80 0.80 0.76
FR week1 0.92 0.86 0.82 1.18 0.75 0.73
FR week2 0.91 0.86 1.22 0.78 0.75 0.71
FR week3 0.92 1.29 0.83 0.80 0.75 0.75
FR week4 1.35 0.87 0.84 0.78 0.78 0.74
I don't know of any built-ins to do this, but you should be able to customize this to meet your needs, no?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.index = list('abcdeflght')

# Define cutoff value
cutoff = 0.90

for col in df.columns:
    # Identify index locations above cutoff
    outliers = df[col][df[col] > cutoff]
    # Browse through outliers and average according to index location
    for idx in outliers.index:
        # Get index location
        loc = df.index.get_loc(idx)
        # If not one of last two values in dataframe
        if loc < df.shape[0] - 2:
            df[col][loc] = np.mean(df[col][loc+1:loc+3])
        else:
            df[col][loc] = np.mean(df[col][loc-3:loc-1])
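For the specific rule in the question (replace values above 1 with the average of the next two rows in the group, else fall back to the previous row), a vectorized sketch for one column, using the IT rows from the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['IT', 'IT', 'IT', 'IT'],
    'Week': ['week1', 'week2', 'week3', 'week4'],
    'Bill%4': [1.21, 0.80, 0.82, 0.80],
})

col = 'Bill%4'
# average of the next two rows within each country group
nxt_avg = df.groupby('Country')[col].transform(
    lambda s: (s.shift(-1) + s.shift(-2)) / 2)
# fall back to the previous row where fewer than two later rows exist
prev = df.groupby('Country')[col].shift(1)
repl = nxt_avg.fillna(prev)
# replace only the outliers (values greater than 1)
df[col] = df[col].mask(df[col] > 1, repl)
print(df)
```

This turns the week1 value of 1.21 into (0.80 + 0.82) / 2 = 0.81, matching the example in the question, and leaves non-outlier rows untouched.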
