Python - Create lists from columns

How can I create from this (two columns, fixed width):
0.35 23.8
0.39 23.7
0.43 23.6
0.47 23.4
0.49 23.1
0.51 22.8
0.53 22.4
0.55 21.6
Two lists:
list1 = [0.35, 0.39, 0.43, ...]
list2 = [23.8, 23.7, 23.6, ...]
Thank you.

You are probably looking for something like this:
>>> str1 = """0.35 23.8
0.39 23.7
0.43 23.6
0.47 23.4
0.49 23.1
0.51 22.8
0.53 22.4
0.55 21.6"""
>>> zip(*(e.split() for e in str1.splitlines()))
[('0.35', '0.39', '0.43', '0.47', '0.49', '0.51', '0.53', '0.55'), ('23.8', '23.7', '23.6', '23.4', '23.1', '22.8', '22.4', '21.6')]
You can easily extend the above solution to any iterable, including file objects:
>>> with open("test1.txt") as fin:
...     print zip(*(e.split() for e in fin))
[('0.35', '0.39', '0.43', '0.47', '0.49', '0.51', '0.53', '0.55'), ('23.8', '23.7', '23.6', '23.4', '23.1', '22.8', '22.4', '21.6')]
If you want the numbers as floats instead of strings, pass each value through the float function, for example with map:
>>> zip(*(map(float, e.split()) for e in str1.splitlines()))
[(0.35, 0.39, 0.43, 0.47, 0.49, 0.51, 0.53, 0.55), (23.8, 23.7, 23.6, 23.4, 23.1, 22.8, 22.4, 21.6)]
And finally, to unpack the result into two separate lists:
>>> from itertools import izip
>>> column_tuples = izip(*(map(float, e.split()) for e in str1.splitlines()))
>>> list1, list2 = map(list, column_tuples)
>>> list1
[0.35, 0.39, 0.43, 0.47, 0.49, 0.51, 0.53, 0.55]
>>> list2
[23.8, 23.7, 23.6, 23.4, 23.1, 22.8, 22.4, 21.6]
How it works:
zip takes several iterables and returns a list of tuples, pairing up the elements at each position. itertools.izip is similar, but instead of a list of tuples it returns an iterator of tuples, which is more memory friendly.
map applies a function to each element of an iterable, so map(float, e.split()) converts the strings to floats. An alternative way to express a map is a list comprehension or a generator expression.
Finally, str.splitlines splits the newline-separated string into a list of individual lines.
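Note that the snippets above are Python 2 (print statement, izip, zip returning a list). A minimal Python 3 sketch of the same pipeline, assuming the same str1 input, would be:

```python
str1 = """0.35 23.8
0.39 23.7
0.43 23.6
0.47 23.4
0.49 23.1
0.51 22.8
0.53 22.4
0.55 21.6"""

# In Python 3, zip returns an iterator, so wrap each column in list()
list1, list2 = (list(col) for col in
                zip(*(map(float, line.split()) for line in str1.splitlines())))
print(list1)  # [0.35, 0.39, 0.43, 0.47, 0.49, 0.51, 0.53, 0.55]
print(list2)  # [23.8, 23.7, 23.6, 23.4, 23.1, 22.8, 22.4, 21.6]
```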

Try this (where columns is the input string):
splitted = columns.split()
list1 = splitted[::2]   # column 1
list2 = splitted[1::2]  # column 2
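For example, with the first three rows of the data above (note the slicing yields strings; wrap each slice in map(float, ...) if numbers are needed):

```python
columns = """0.35 23.8
0.39 23.7
0.43 23.6"""

splitted = columns.split()   # str.split() with no argument splits on any whitespace, newlines included
list1 = splitted[::2]        # even-indexed tokens -> column 1
list2 = splitted[1::2]       # odd-indexed tokens -> column 2
print(list1)  # ['0.35', '0.39', '0.43']
print(list2)  # ['23.8', '23.7', '23.6']
```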


Why does SciPy ttest_rel return numpy ndarray of NaN?

I am trying to calculate the t-test score of data stored in different dataframes using ttest_rel from scipy.stats. But when calculating the t-test between the same data, it returns a numpy ndarray of NaNs instead of a single NaN. What am I doing wrong that I am getting a numpy array instead of a single value?
My code with a sample dataframe is as follows:
import pandas as pd
import numpy as np
import re
from scipy.stats import ttest_rel
cities_nhl = pd.DataFrame({'metro': ['NewYork', 'LosAngeles', 'StLouis', 'Detroit', 'Boston', 'Baltimore'],
                           'total_ratio': [0.45, 0.51, 0.62, 0.43, 0.26, 0.32]})
cities_nba = pd.DataFrame({'metro': ['Boston', 'LosAngeles', 'Phoenix', 'Baltimore', 'Detroit', 'NewYork'],
                           'total_ratio': [0.50, 0.41, 0.34, 0.53, 0.33, 0.42]})
cities_mlb = pd.DataFrame({'metro': ['Seattle', 'Detroit', 'Boston', 'Baltimore', 'NewYork', 'LosAngeles'],
                           'total_ratio': [0.48, 0.27, 0.52, 0.33, 0.28, 0.67]})
cities_nfl = pd.DataFrame({'metro': ['LosAngeles', 'Atlanta', 'Detroit', 'Boston', 'NewYork', 'Baltimore'],
                           'total_ratio': [0.47, 0.41, 0.82, 0.13, 0.56, 0.42]})
needed_cols = ['metro', 'total_ratio']  # metro is a string and total_ratio is a float column
df_dict = {'NHL': cities_nhl[needed_cols], 'NBA': cities_nba[needed_cols],
           'MLB': cities_mlb[needed_cols], 'NFL': cities_nfl[needed_cols]}  # keeping all dataframes in a dictionary for ease of access
sports = ['NHL', 'NBA', 'MLB', 'NFL']  # names of the sports
p_values_dict = {'NHL': [], 'NBA': [], 'MLB': [], 'NFL': []}  # dictionary to store p values
for clm1 in sports:
    for clm2 in sports:
        # merge the dataframes of two sports and then calculate their ttest score
        _df = pd.merge(df_dict[clm1], df_dict[clm2],
                       how='inner', on='metro', suffixes=[f'_{clm1}', f'_{clm2}'])
        _pval = ttest_rel(_df[f"total_ratio_{clm1}"], _df[f"total_ratio_{clm2}"])[1]
        p_values_dict[clm1].append(_pval)
p_values = pd.DataFrame(p_values_dict, index=sports)
p_values
| |NHL |NBA |MLB |NFL |
|-------|-----------|-----------|-----------|----------|
|NHL |[nan, nan] |0.589606 |0.826298 |0.38493 |
|NBA |0.589606 |[nan, nan] |0.779387 |0.782173 |
|MLB |0.826298 |0.779387 |[nan, nan] |0.713229 |
|NFL |0.38493 |0.782173 |0.713229 |[nan, nan]|
The problem here is actually not related to scipy, but is due to duplicate column labels in your dataframes. In this part of your code:
_df = pd.merge(df_dict[clm1], df_dict[clm2],
               how='inner', on='metro', suffixes=[f'_{clm1}', f'_{clm2}'])
When clm1 and clm2 are equal (say they are both NHL), you get a _df dataframe like this:
metro total_ratio_NHL total_ratio_NHL
0 NewYork 0.45 0.45
1 LosAngeles 0.51 0.51
2 StLouis 0.62 0.62
3 Detroit 0.43 0.43
4 Boston 0.26 0.26
5 Baltimore 0.32 0.32
Then, when you pass the columns to the ttest_rel function, you end up passing both columns when you refer to a single column label, because they have the same label:
ttest_rel(_df[f"total_ratio_{clm1}"], _df[f"total_ratio_{clm2}"])
And that's how you get two t-statistics and two p-values.
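You can see this duplicate-label behaviour in isolation. Selecting a duplicated label returns a two-column DataFrame rather than a Series (a small standalone sketch, not part of the original answer):

```python
import pandas as pd

# Two columns sharing the same label, as produced by the self-merge above
_df = pd.DataFrame([[0.45, 0.45], [0.51, 0.51], [0.62, 0.62]],
                   columns=['total_ratio_NHL', 'total_ratio_NHL'])

# Selecting the duplicated label returns both columns at once,
# which is why ttest_rel then produces an array of two statistics/p-values
sel = _df['total_ratio_NHL']
print(type(sel).__name__)  # DataFrame
print(sel.shape)           # (3, 2)
```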
So, you can modify those two lines to eliminate duplicate column labels, like this:
_df = pd.merge(df_dict[clm1], df_dict[clm2],
               how='inner', on='metro', suffixes=[f'_{clm1}_1', f'_{clm2}_2'])
_pval = ttest_rel(_df[f"total_ratio_{clm1}_1"], _df[f"total_ratio_{clm2}_2"])[1]
The result will look like this:
NHL NBA MLB NFL
NHL NaN 0.589606 0.826298 0.384930
NBA 0.589606 NaN 0.779387 0.782173
MLB 0.826298 0.779387 NaN 0.713229
NFL 0.384930 0.782173 0.713229 NaN

Multiindex DataFrame from two DataFrames, joined by column headers

There are dozens of similar-sounding questions here; I think I've searched them all and could not find a solution to my problem:
I have 2 df: df_c:
CAN-01 CAN-02 CAN-03
CE
ce1 0.84 0.73 0.50
ce2 0.06 0.13 0.05
And df_z:
CAN-01 CAN-02 CAN-03
marker
cell1 0.29 1.5 7
cell2 1.00 3.0 1
I want to join for each 'marker' + 'CE' combination over their column names
Example: cell1 + ce1:
[[0.29, 0.84],[1.5,0.73],[7,0.5], ...]
(Continuing for cell1 + ce2, cell2 + ce1, cell2 + ce2)
I have a working example using two loops and .loc twice, but it takes forever on the full data set.
I think the best thing to build is a multiindex DF with some merge/join/concat magic:
CAN-01 CAN-02 CAN-03
Source
0 CE 0.84 0.73 0.50
Marker 0.29 1.5 7
1 CE ...
Marker ...
Sample Code
dc = [['ce1', 0.84, 0.73, 0.5], ['ce2', 0.06, 0.13, 0.05]]
dat_c = pd.DataFrame(dc, columns=['CE', 'CAN-01', 'CAN-02', 'CAN-03'])
dat_c.set_index('CE', inplace=True)
dz = [['cell1', 0.29, 1.5, 7], ['cell2', 1, 3, 1]]
dat_z = pd.DataFrame(dz, columns=['marker', "CAN-01", "CAN-02", "CAN-03"])
dat_z.set_index('marker', inplace=True)
Bad/Slow Solution
for ci, c_row in dat_c.iterrows():  # for each CE in CE table
    tmp = []
    for j, colz in enumerate(dat_z.columns[1:]):
        if colz not in dat_c:
            continue
        entry_c = c_row.loc[colz]
        if len(entry_c.shape) > 0:
            continue
        tmp.append([dat_z.loc[marker, colz], entry_c])
IIUC:
use append()+groupby():
dat_c.index=[f"cell{x+1}" for x in range(len(dat_c))]
df=dat_c.append(dat_z).groupby(level=0).agg(list)
output of df:
CAN-01 CAN-02 CAN-03
cell1 [0.84, 0.29] [0.73, 1.5] [0.5, 7.0]
cell2 [0.06, 1.0] [0.13, 3.0] [0.05, 1.0]
If needed list:
dat_c.index=[f"cell{x+1}" for x in range(len(dat_c))]
lst=dat_c.append(dat_z).groupby(level=0).agg(list).to_numpy().tolist()
output of lst:
[[[0.84, 0.29], [0.73, 1.5], [0.5, 7.0]],
[[0.06, 1.0], [0.13, 3.0], [0.05, 1.0]]]
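Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on recent pandas the same stacking can be sketched with pd.concat (an adaptation, not part of the original answer, with the dat_c index already renamed to match dat_z):

```python
import pandas as pd

dat_c = pd.DataFrame([[0.84, 0.73, 0.5], [0.06, 0.13, 0.05]],
                     columns=['CAN-01', 'CAN-02', 'CAN-03'],
                     index=['cell1', 'cell2'])  # index already renamed to match dat_z
dat_z = pd.DataFrame([[0.29, 1.5, 7], [1, 3, 1]],
                     columns=['CAN-01', 'CAN-02', 'CAN-03'],
                     index=['cell1', 'cell2'])

# pd.concat stacks the two frames; groupby(level=0) pairs rows sharing an index label
df = pd.concat([dat_c, dat_z]).groupby(level=0).agg(list)
print(df)
```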

why my dataframe columns to list gives tuples, not a simple list

I have a dataframe. I tried a simple function to convert the column names to a list. Surprisingly, I am getting tuples. Why?
my code:
import numpy as np
import pandas as pd

big_list = [[np.nan, 21.0, 3.2, 12.0, 3.24],
            [42.0, 23.799999999999997, 6.0, 13.599999999999998, 5.24],
            [112.0, 32.199999999999996, 14.400000000000002, 18.4, 11.24],
            [189.0, 46.2, 28.400000000000002, 26.400000000000002, 21.240000000000002]]
df = pd.DataFrame(np.array(big_list), index=range(0, 4, 1), columns=[sns])
df =
ig abc def igh klm
0 NaN 21.0 3.2 12.0 3.24
1 42.0 23.8 6.0 13.6 5.24
2 112.0 32.2 14.4 18.4 11.24
3 189.0 46.2 28.4 26.4 21.24
print(list(df))
present output:
[('ig',), ('abc',), ('def',), ('igh',), ('klm',)]
Expected output:
['ig','abc','def','igh','klm']
The following code should do it:
df = pd.DataFrame([[np.nan, 21.0, 3.2, 12.0, 3.24]], columns=['ig','abc','def','igh','klm'])
print(list(df.columns))
It gives the following output:
['ig', 'abc', 'def', 'igh', 'klm']
If your output is different, the dataframe was probably constructed incorrectly: passing columns=[sns] (a list wrapped inside another list) makes pandas build a MultiIndex, whose labels are one-element tuples. Pass the list of names directly, e.g. columns=sns.
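A small sketch of the likely cause, assuming sns is the list of column names: wrapping it in another list produces a MultiIndex whose labels are 1-tuples:

```python
import numpy as np
import pandas as pd

sns = ['ig', 'abc', 'def', 'igh', 'klm']
data = [[np.nan, 21.0, 3.2, 12.0, 3.24]]

df_bad = pd.DataFrame(data, columns=[sns])   # list-of-lists -> one-level MultiIndex
print(list(df_bad))  # [('ig',), ('abc',), ('def',), ('igh',), ('klm',)]

df_ok = pd.DataFrame(data, columns=sns)      # plain list -> flat Index
print(list(df_ok))   # ['ig', 'abc', 'def', 'igh', 'klm']
```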

Python / Pandas - Turning a list with inner dictionaries into a DataFrame

I'm new to pandas and I need to turn a list with inner dictionaries into a DataFrame
Below an example of the list:
[('Fwd1', {'Score': 1.0, 'Prediction': 7.6, 'MAPE': 2.37}), ('Fwd2', {'Score': 1.0, 'Prediction': 7.62, 'MAPE': 2.57}), ('Fwd3', {'Score': 1.0, 'Prediction': 7.53, 'MAPE': 2.54})]
I would like it to look like this:
Prediction MAPE Score
Date
Fwd1 7.6 2.37 1
Fwd2 7.62 2.57 1
Fwd3 7.53 2.54 1
Anyone available to enlighten my journey?
You can convert the list of tuples into a dict first and then create a dataframe. Unfortunately, in this structure the index and columns end up swapped relative to your desired output, so you need to transpose it.
Assume x is your list:
In [18]: dict(x)
Out[18]:
{'Fwd1': {'MAPE': 2.37, 'Prediction': 7.6, 'Score': 1.0},
'Fwd2': {'MAPE': 2.57, 'Prediction': 7.62, 'Score': 1.0},
'Fwd3': {'MAPE': 2.54, 'Prediction': 7.53, 'Score': 1.0}}
In [19]: pd.DataFrame(dict(x))
Out[19]:
Fwd1 Fwd2 Fwd3
MAPE 2.37 2.57 2.54
Prediction 7.60 7.62 7.53
Score 1.00 1.00 1.00
In [20]: pd.DataFrame(dict(x)).T
Out[20]:
MAPE Prediction Score
Fwd1 2.37 7.60 1
Fwd2 2.57 7.62 1
Fwd3 2.54 7.53 1
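As a side note, the transpose can be avoided with DataFrame.from_dict and orient='index', which treats each inner dict as a row keyed by the outer key (a sketch, assuming x is the same list of tuples):

```python
import pandas as pd

x = [('Fwd1', {'Score': 1.0, 'Prediction': 7.6, 'MAPE': 2.37}),
     ('Fwd2', {'Score': 1.0, 'Prediction': 7.62, 'MAPE': 2.57}),
     ('Fwd3', {'Score': 1.0, 'Prediction': 7.53, 'MAPE': 2.54})]

# orient='index' makes the outer dict keys the row index directly
df = pd.DataFrame.from_dict(dict(x), orient='index')
print(df)
```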

Getting values out of a python dictionary

I am struggling with python dictionaries. I created a dictionary that looks like:
d = {'0.500': ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4'],
'1.000': ['14.8 0.5', '14.9 0.5', '15.6 0.4', '15.9 0.3'],
'0.000': ['23.2 0.5', '23.2 0.8', '23.2 0.7', '23.2 0.1']}
and I would like to end up having:
0.500 17.95 0.425
which is the key, average of (18.4+17.9+16.9+18.6), average of (0.5+0.4+0.4+0.4)
(and the same for 1.000 and 0.000 with their corresponding averages)
Initially my dictionary had only two values, so I could rely on indexes:
for key in d:
    dvdl1 = d[key][0].split(" ")[0]
    dvdl2 = d[key][1].split(" ")[0]
    average = ((float(dvdl1)+float(dvdl2))/2)
but now I would like my code to work for value lists of different lengths, with, let's say, 4 (example above), 5, or 6 values each...
Cheers!
for k, v in d.iteritems():
    col1, col2 = zip(*[map(float, x.split()) for x in v])
    print k, sum(col1)/len(v), sum(col2)/len(v)
...
0.500 17.95 0.425
1.000 15.3 0.425
0.000 23.2 0.525
How this works:
>>> v = ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4']
First, split each item at whitespace and apply float to the pieces, so we get a list of lists:
>>> zipp = [map(float,x.split()) for x in v]
>>> zipp
[[18.4, 0.5], [17.9, 0.4], [16.9, 0.4], [18.6, 0.4]] #list of rows
Now we can use zip with *, which acts as un-zipping, to get a list of columns.
>>> zip(*zipp)
[(18.4, 17.9, 16.9, 18.6), (0.5, 0.4, 0.4, 0.4)]
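In Python 3, dict.iteritems is gone (use items) and print is a function, so a sketch of the same averaging would be:

```python
d = {'0.500': ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4'],
     '1.000': ['14.8 0.5', '14.9 0.5', '15.6 0.4', '15.9 0.3'],
     '0.000': ['23.2 0.5', '23.2 0.8', '23.2 0.7', '23.2 0.1']}

averages = {}
for k, v in d.items():
    # unzip the "value error" string pairs into two float columns
    col1, col2 = zip(*(map(float, x.split()) for x in v))
    averages[k] = (sum(col1) / len(v), sum(col2) / len(v))
    print(k, *averages[k])  # matches the Python 2 output above
```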
