verify if cell is the same and build an excel file - python

I have some measurements (as a dict) and a list with labels. I need to verify whether the labels are in my measurements and write the result to an Excel file.
My output Excel file needs to look like this:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
#Output
             'A' 'B' 'C' 'D'
measurement1  1   1   0   0
measurement2  0   0   1   1
I have no idea how to build the matrix of 0s and 1s.
I hope you can help me.
EDIT
Finally I got a solution. First I iterated over all measurements and added every missing label to the measurements dict. Then I built a DataFrame of ones and, with three nested loops, put zeros at the missing positions using .loc:
d = pd.DataFrame(1, index=measurements.keys(), columns=list1)
for y in measurements.keys():
    for z in measurements[y]:
        for x in list1:
            if x == z:
                d.loc[y, z] = 0
Maybe it's possible to do it with only two loops.
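Indeed, a membership test can replace the third loop. A minimal sketch, assuming measurements maps each measurement name to its missing labels as described in the edit above:
d = pd.DataFrame(1, index=measurements.keys(), columns=list1)
for y, missing in measurements.items():
    for z in missing:
        if z in list1:  # the membership test replaces the innermost loop
            d.loc[y, z] = 0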

Use a nested list comprehension with filtering to check membership in list1, and then create the DataFrame with its constructor:
import pandas as pd

list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A': 1, 'B': 1}
measurement2 = {'C': 3, 'D': 4}
L = [measurement1, measurement2]
d = [dict.fromkeys([y for y in x.keys() if y in list1], 1) for x in L]
df = pd.DataFrame(d).fillna(0).astype(int)
print(df)
A B C D
0 1 1 0 0
1 0 0 1 1

This should work, using only standard Python:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
measurements = [measurement1, measurement2]
headers = {h: i for i, h in enumerate(list1)}
matrix = []
for measurement in measurements:
    row = [0] * len(headers)
    for header in measurement.keys():
        row[headers[header]] = 1
    matrix.append(row)
For your example, the output will be:
matrix
=> [[1, 1, 0, 0], [0, 0, 1, 1]]

You can use a list of the dictionaries to create a DataFrame, then reindex with the list and convert to boolean by checking notna:
pd.DataFrame([measurement1,measurement2]).reindex(columns=list1).notna().astype(int)
A B C D
0 1 1 0 0
1 0 0 1 1
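For completeness, none of the snippets above actually writes the Excel file the question asks for. A minimal sketch, assuming pandas with an Excel engine such as openpyxl installed:
import pandas as pd

list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A': 1, 'B': 1}
measurement2 = {'C': 3, 'D': 4}

# build the 0/1 matrix as in the answers above
df = pd.DataFrame([measurement1, measurement2]).reindex(columns=list1).notna().astype(int)
df.index = ['measurement1', 'measurement2']  # row labels from the desired output

df.to_excel('output.xlsx')  # writes the labeled matrix to an Excel file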

Related

Given a dataframe, how do I bucket columns according to their names and merge columns in the same bucket into one?

Suppose I have a dataframe with (for example) 10 columns: a, b, c, d, e, f, g, h, i, j.
I want to bucket these columns as follows: a, b, c into x; d, f, g into y; e, h, i into z; and j into j.
Each row of the output will have the x column value equal to the non-NaN a, b, or c value of the original df. If there are multiple non-NaN values among the a, b, c columns for a particular row, the output df will contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
a b c d e f g h i j
0 1 - - - 2 - 4 3 - -
1 - 6 - 0 4 - - - - 2
2 - 3 2 - - - - 1 - 9
The output will be:
x y z j
0 1 4 [2,3] -
1 6 0 4 2
2 [3,2] - 1 9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
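For reference, the example frame can be reconstructed like this (a sketch that treats '-' as NaN, as the question states; note that the second answer below instead keeps the '-' placeholders as literal strings):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, np.nan], 'b': [np.nan, 6, 3], 'c': [np.nan, np.nan, 2],
    'd': [np.nan, 0, np.nan], 'e': [2, 4, np.nan], 'f': [np.nan] * 3,
    'g': [4, np.nan, np.nan], 'h': [3, np.nan, 1], 'i': [np.nan] * 3,
    'j': [np.nan, 2, 9],
})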
One way is to create a dictionary with your mappings, map it onto your column names, stack, apply your groupby operation, and unstack back to your original shape.
I couldn't see any logic in your mappings, so it will have to be a manual operation, I'm afraid.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map({i: x for x, y in buckets.items() for i in y})
out = df.stack().groupby(level=[0, 1]).agg(list).unstack(1)[buckets.keys()]
print(out)
x y z j
0 [1] [4] [2, 3] NaN
1 [6] [0] [4] [2]
2 [3, 2] NaN [1] [9]
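If you want single values unwrapped from their lists, as in the desired output, one option (an extra step, not part of the original answer) is:
out = out.applymap(lambda v: v[0] if isinstance(v, list) and len(v) == 1 else v)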
First create the dict for the mapping, then groupby (this assumes the '-' placeholders are kept as literal strings rather than NaN):
d = {'a': 'x', 'b': 'x', 'c': 'x', 'd': 'y', 'f': 'y', 'g': 'y', 'e': 'z', 'h': 'z', 'i': 'z', 'j': 'j'}
out = df.groupby(d, axis=1).agg(lambda x: [y[y != '-'] for y in x.values])
Out[138]:
j x y z
0 [] [1] [4] [2, 3]
1 [2] [6] [0] [4]
2 [9] [3, 2] [] [1]
Starting with a very basic approach, let's define our buckets and simply iterate, then clean up:
import numpy as np
import pandas as pd

buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}

def clean(val):
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    else:
        return val

new_df = pd.DataFrame()
for new_col, old_cols in buckets.items():
    new_df[new_col] = [clean(row) for row in df[old_cols].values.tolist()]
Here's how you can do it.
First, we define a method to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row of a dataframe (here, a sub-dataframe, if you will, since we get the sub-df using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd

bucket_cols = [["a", "b", "c"], ["d", "f", "g"], ["e", "h", "i"], ["j"]]
bucket_names = ["x", "y", "z", "j"]
buckets = {}

def bucket_rows(row):
    row = row.dropna().to_list()  # applying pd.Series.dropna to remove NaN values
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # return the bucketed row
    return row

# looping through buckets and performing the bucketing operation
for idx, cols in enumerate(bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    buckets[idx] = bucket

# creating the bucketed df from the buckets dict
df_bucketed = pd.DataFrame(buckets)
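Note that bucket_names is defined but never used in the snippet; to label the columns as presumably intended (an assumption about the desired output), you could finish with:
df_bucketed.columns = bucket_names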

data accumulation from csv file in python

out_gate,in_gate,num_connection
a,b,1
a,b,3
b,a,2
b,c,4
c,a,5
c,b,5
c,b,3
c,a,4
Shown above is a sample CSV file.
My final goal is a table of the number of connections between gates, like below:
a b c
a 0 4 0
b 2 0 4
c 9 8 0
So far I have made a list of the first column (out_gate), like this: listfile = ['a','b','c'], and I am trying to match each of these gates (a, b, c) one by one to the in_gate column.
So, for example, out_gate 'c' -> in_gate 'b' gives 8 connections, and 'c' -> 'a' gives 9.
I can match out_gate and in_gate in a row with its connection number, but it is hard to accumulate the connection numbers for each out_gate.
Is there any solution?
In plain Python you should look at the csv module for the input and a collections.defaultdict for collecting the totals:
from csv import reader
from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))
with open('file.csv') as f:
    r = reader(f)
    next(r)  # skip headers
    for row in r:
        if len(row) >= 3:
            x, y, count = row
            d[x][y] += int(count)

keys = sorted(d)
for x in keys:
    print(' '.join(str(d[x][y]) for y in keys))
0 4 0
2 0 4
9 8 0
If you do this for large amounts of data, you should absolutely check out numpy and pandas, which both have more effective and natural methods of handling tables than native Python.
In case you only need a solution right now, accumulation can be done straightforwardly in pure Python with collections.defaultdict:
from collections import defaultdict

con = defaultdict(int)
# 'connections' is an iterable over the CSV lines, e.g. open('file.csv')
for count, line in enumerate(connections):
    if count == 0:  # skip the header line
        continue
    out_gate, in_gate, number = line.split(',')
    con[f"{out_gate}->{in_gate}"] += int(number)
Now you can access the entries the following way:
print(con['a->b'])
>> 4
print(con['a->c'])
>> 0
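To render the accumulated dict as the question's matrix, a short follow-up sketch (assuming the gates are exactly those appearing in the keys):
gates = sorted({g for key in list(con) for g in key.split('->')})
for out_gate in gates:
    print(' '.join(str(con[f'{out_gate}->{in_gate}']) for in_gate in gates))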
This is a one-line high-level answer via pandas.pivot_table, if you do not wish to resort to line-by-line readers and defaultdict.
import pandas as pd
df = pd.DataFrame([['a', 'b', 1], ['a', 'b', 3], ['b', 'a', 2], ['b', 'c', 4],
                   ['c', 'a', 5], ['c', 'b', 5], ['c', 'b', 3], ['c', 'a', 4]],
                  columns=['out_gate', 'in_gate', 'num_connection'])
pd.pivot_table(df, index='out_gate', columns='in_gate', values='num_connection', aggfunc='sum').fillna(0)
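For the sample data this should produce the following (as floats, since fillna runs after the NaN-producing pivot; chain .astype(int) if you want integers):
in_gate     a    b    c
out_gate
a         0.0  4.0  0.0
b         2.0  0.0  4.0
c         9.0  8.0  0.0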
You can use itertools.groupby:
import csv
import itertools

data = list(csv.reader(open('filename.csv')))[1:]  # [1:] skips the header row
new_data = [b + [int(a)] for *b, a in data]
grouped = itertools.groupby(sorted(new_data, key=lambda x: x[:2]), key=lambda x: x[:2])
final_data = {tuple(a): sum(x[-1] for x in b) for a, b in grouped}
letters = sorted({i for pair in final_data for i in pair})
matrix = '\n'.join(' '.join(str(final_data.get((r, c), 0)) for c in letters) for r in letters)
Output:
0 4 0
2 0 4
9 8 0

pandas data frame sort

I have a pandas dataframe like this, which I am trying to sort by the column 'dist'. The sorted dataframe should start with E or F, as per below. I use sort_values, but it is not working for me. The function computes distances from the 'Start' location to a list of locations ['C', 'B', 'D', 'E', 'A', 'F'] and is then supposed to sort the dataframe in ascending order using the 'dist' column.
Could someone advise me why the sorting is not working?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
loc_list
Out[194]: ['C', 'B', 'D', 'E', 'A', 'F']
def closest_locations(from_loc_point, to_loc_list):
    lresults = list()
    for list_index in range(len(to_loc_list)):
        # Euclidean distance
        dist = hypot(locations[from_loc_point[0]][0] - locations[to_loc_list[list_index]][0],
                     locations[from_loc_point[0]][1] - locations[to_loc_list[list_index]][1])
        lista_dist = [from_loc_point[0], to_loc_list[list_index], dist]
        lresults.append(lista_dist[:])
    RESULTS = pd.DataFrame(np.array(lresults))
    RESULTS.columns = ['from', 'to', 'dist']
    RESULTS.sort_values(['dist'], ascending=[True], inplace=True)
    RESULTS.index = range(len(RESULTS))
    return RESULTS
closest_locations(['Start'], loc_list)
Out[189]:
from to dist
0 Start D 10.19803902718557
1 Start A 10.19803902718557
2 Start C 15.132745950421555
3 Start B 15.132745950421555
4 Start E 6.08276253029822
5 Start F 6.08276253029822
closest_two_loc.dtypes
Out[247]:
from object
to object
dist object
dtype: object
Is this what you want?
import numpy as np
import pandas as pd

locations = {'Start': (20, 5), 'A': (10, 3), 'B': (5, 3), 'C': (5, 7), 'D': (10, 7), 'E': (14, 4), 'F': (14, 6)}
df = pd.DataFrame.from_dict(locations, orient='index').rename(columns={0: 'x', 1: 'y'})
df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc['Start', 'x'])**2 + (row['y'] - df.loc['Start', 'y'])**2), axis=1)
df.drop(['Start']).sort_values(by='dist')
x y dist
E 14 4 6.082763
F 14 6 6.082763
A 10 3 10.198039
D 10 7 10.198039
C 5 7 15.132746
B 5 3 15.132746
or, if you want to wrap it in a function:
def dist_from(df, col):
    df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc[col, 'x'])**2 + (row['y'] - df.loc[col, 'y'])**2), axis=1)
    df['from'] = col
    df = df.drop([col]).sort_values(by='dist')
    df.index.name = 'to'
    return df.reset_index().loc[:, ['from', 'to', 'dist']]
You need to convert the values in the "dist" column to float:
df = closest_locations(['Start'], loc_list)
df.dist = list(map(lambda x: float(x), df.dist)) # convert each value to float
print(df.sort_values('dist')) # now it will sort properly
Output:
from to dist
4 Start E 6.082763
5 Start F 6.082763
0 Start D 10.198039
1 Start A 10.198039
2 Start C 15.132746
3 Start B 15.132746
Edit: As mentioned by @jezrael in the comments, the following is a more direct method:
df.dist = df.dist.astype(float)
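For completeness, the root cause is that np.array(lresults) upcasts the mixed rows to strings, which is why dist comes out with object dtype. Building the DataFrame directly from the list avoids the problem; a minimal sketch of the changed lines inside closest_locations:
# instead of: RESULTS = pd.DataFrame(np.array(lresults))
RESULTS = pd.DataFrame(lresults, columns=['from', 'to', 'dist'])  # pandas infers float64 for 'dist'
RESULTS = RESULTS.sort_values('dist').reset_index(drop=True)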

Pandas assign label based on index value

I have a dataframe with an index and multiple columns. I also have a few lists containing index values sampled according to certain criteria. Now I want to create columns with labels based on whether or not the index of a given row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'

df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This works, but it is quite slow. Each call takes about 3 minutes, and considering I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
                        np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10, 1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
                        np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
A 2_name
0 8 A
1 8 A
2 3 B
3 7 B
4 7 A
5 0 B
6 4 B
7 2 A
8 5 not_assigned
9 2 not_assigned
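If you ever need more than two labels, the same idea extends with numpy.select instead of nesting np.where calls; a sketch using the sample lists above:
conditions = [df.index.isin(random_m + model_m),
              df.index.isin(random_y + model_y)]
choices = ['A', 'B']
df['2_name'] = np.select(conditions, choices, default='not_assigned')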

For Loop Filter on Pandas DataFrame not Working

I have a very simple for loop:
## Keep or Drop Rows from Ad Servers
dataframes = [atlas_df, flashtalking_df, innovid_df, ias_viewability_df, ias_fraud_df]
for df in dataframes:
    df = df[df['Placement Name'].str.contains("»")]
When I run the for loop, though, nothing filters.
However, if I write it down manually as:
ias_fraud_df = ias_fraud_df[ias_fraud_df['Placement Name'].str.contains("»")]
The filter works.
Any ideas on what I am missing?
You're working on the loop variable; you need to reference the original df by using an index into the list:
for i in range(len(dataframes)):
    df = dataframes[i]
    dataframes[i] = df[df['Placement Name'].str.contains("»")]
This is so the original df in the list is modified.
Example:
In [108]:
l = list('abcd')
for i in range(len(l)):
    l[i] = 'new_' + l[i]
l
Out[108]:
['new_a', 'new_b', 'new_c', 'new_d']
Versus:
In [110]:
l = list('abcd')
for x in l:
    x = 'new_' + x
l
Out[110]:
['a', 'b', 'c', 'd']
So you see that the latter, which is semantically the same as your code, never modifies the original elements in the list, whilst the indexed version does.
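Equivalently, and a little more idiomatic than range(len(...)), you can use enumerate to rebind each slot; a minimal sketch:
for i, df in enumerate(dataframes):
    dataframes[i] = df[df['Placement Name'].str.contains("»")]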
You can use a list comprehension; the output is a list of filtered DataFrames:
dataframes = [df[df['Placement Name'].str.contains(u"»")] for df in dataframes]
Sample:
atlas_df = pd.DataFrame({'Placement Name': ['deu_gathf»', 'deu_gahf', 'fra_gagg'],
                         'another_col': [1, 2, 3]})
flashtalking_df = pd.DataFrame({'Placement Name': ['deu_gahf»', 'fra_ga', 'deu_gatt'],
                                'another_col': [4, 5, 6]})
dataframes = [atlas_df, flashtalking_df]
print (dataframes)
[ Placement Name another_col
0 deu_gathf» 1
1 deu_gahf 2
2 fra_gagg 3, Placement Name another_col
0 deu_gahf» 4
1 fra_ga 5
2 deu_gatt 6]
dataframes = [df[df['Placement Name'].str.contains(u"»")] for df in dataframes]
print (dataframes)
[ Placement Name another_col
0 deu_gathf» 1, Placement Name another_col
0 deu_gahf» 4]
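One caveat with the list-comprehension approach: the original variables (atlas_df, flashtalking_df, and so on) still point at the unfiltered frames. If you need those names rebound, unpack the result; a sketch with the two sample frames:
atlas_df, flashtalking_df = dataframes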
