data accumulation from csv file in python - python

out_gate,in_gate,num_connection
a,b,1
a,b,3
b,a,2
b,c,4
c,a,5
c,b,5
c,b,3
c,a,4
Shown above is a sample CSV file.
My final goal is to compile the data into a table of the number of connections between gates (rows are out_gate, columns are in_gate), like the one below:
  a b c
a 0 4 0
b 2 0 4
c 9 8 0
So far I have built a list of the first column (out_gate), listfile = ['a','b','c'], and I am trying to match each entry (a, b, c) one by one against in_gate.
For example, for out_gate 'c' -> in_gate 'b' the number of connections is 8, and
'c' -> 'a' is 9.
I can match out_blk and in_blk within a row together with its connection count, but I am struggling to accumulate the connection counts for each out_gate.
Is there a solution?

In plain Python you should look at the csv module for the input and a collections.defaultdict for collecting the totals:
from csv import reader
from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))
with open('file.csv') as f:
    r = reader(f)
    next(r)  # skip headers
    for row in r:
        if len(row) >= 3:
            x, y, count = row
            d[x][y] += int(count)
keys = sorted(d)
for x in keys:
    print(' '.join(str(d[x][y]) for y in keys))
0 4 0
2 0 4
9 8 0

If you do this for large amounts of data, you should definitely check out numpy and pandas, which both have more efficient and natural ways of handling tables than native Python.
If you only need a solution right now, the accumulation can be done straightforwardly in pure Python with collections.defaultdict:
from collections import defaultdict

con = defaultdict(int)
# `connections` is assumed to be an iterable of the CSV lines, e.g. an open file
for count, line in enumerate(connections):
    if count == 0:
        continue  # skip the header line
    out_gate, in_gate, number = line.strip().split(',')
    con[f"{out_gate}->{in_gate}"] += int(number)
Now you can access the entries like this:
print(con['a->b'])
>> 4
print(con['a->c'])
>> 0

This is a one-line high-level answer via pandas.pivot_table, if you do not wish to resort to line-by-line readers and defaultdict.
import pandas as pd
df = pd.DataFrame([['a', 'b', 1], ['a', 'b', 3], ['b', 'a', 2], ['b', 'c', 4],
                   ['c', 'a', 5], ['c', 'b', 5], ['c', 'b', 3], ['c', 'a', 4]],
                  columns=['out_gate', 'in_gate', 'num_connection'])
pd.pivot_table(df, index='out_gate', columns='in_gate', values='num_connection', aggfunc='sum').fillna(0)
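To run the same pivot on CSV input rather than a hand-built frame, something like the following sketch may help. It assumes the sample data from the question; io.StringIO stands in for the real file, and the names csv_text, gates, and table are illustrative. The reindex step is an addition that keeps the table square even if some gate never appears as an out_gate or an in_gate.

```python
import io

import pandas as pd

# Stand-in for the real file; replace io.StringIO(csv_text) with a file path.
csv_text = """out_gate,in_gate,num_connection
a,b,1
a,b,3
b,a,2
b,c,4
c,a,5
c,b,5
c,b,3
c,a,4
"""

df = pd.read_csv(io.StringIO(csv_text))

# Sum num_connection for every (out_gate, in_gate) pair; missing pairs become 0.
table = pd.pivot_table(df, index='out_gate', columns='in_gate',
                       values='num_connection', aggfunc='sum', fill_value=0)

# Use the same sorted gate list on both axes so the table is square.
gates = sorted(set(df['out_gate']) | set(df['in_gate']))
table = table.reindex(index=gates, columns=gates, fill_value=0)

print(table)
```

Passing fill_value=0 to pivot_table avoids the separate fillna(0) call.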

You can use itertools.groupby:
import csv
import itertools

with open('filename.csv') as f:
    rows = csv.reader(f)
    next(rows)  # skip the header row so int() only sees numbers
    data = list(rows)
new_data = [b+[int(a)] for *b, a in data]
final_data = {tuple(a):sum(map(lambda x:x[-1], list(b))) for a, b in itertools.groupby(sorted(new_data, key=lambda x:x[:2]), key=lambda x:x[:2])}
letters = sorted(set([i for b in final_data.keys() for i in b]))
matrix = '\n'.join([' '.join(map(str, [final_data.get((b, i), 0) for i in letters])) for b in letters])
print(matrix)
Output:
0 4 0
2 0 4
9 8 0

Related

How to compare every value in a Pandas dataframe to all the next values?

I am learning Pandas and moving my Python code to it. I want to compare every value with all the following values using a function: the first with the second and so on, then the second with the third, but not with the first, because I already did that. In Python I use two nested loops over a list:
def match_values(a, b):
    # do some stuff...
    pass

l = ['a', 'b', 'c']
length = len(l)
for i in range(length):
    for j in range(i + 1, length):  # starts after i, not from the start!
        if match_values(l[i], l[j]):
            # do some stuff...
            pass
How do I apply a similar technique in Pandas when my list is a column in a dataframe? Do I simply reference every value as before, or is there a clever "vector-style" way to do this quickly and efficiently?
Thanks in advance,
Jo
Can you please check this? For each row, it produces a list with the result of comparing that row's value against every later value.
>>> import pandas as pd
>>> import numpy as np
>>> val = [16,19,15,19,15]
>>> df = pd.DataFrame({'val': val})
>>> df
   val
0   16
1   19
2   15
3   19
4   15
>>>
>>>
>>> df['match'] = df.apply(lambda x: [ (1 if (x['val'] == df.loc[idx, 'val']) else 0) for idx in range(x.name+1, len(df)) ], axis=1)
>>> df
   val         match
0   16  [0, 0, 0, 0]
1   19     [0, 1, 0]
2   15        [0, 1]
3   19           [0]
4   15            []
Yes, you can use vectorized comparisons, since pandas is built on NumPy:
df['columnname'] > 5
This returns a Boolean Series. If you also want the matching rows of the dataframe:
df[df['columnname'] > 5]
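For the pairwise case in the question, a broadcasting-based sketch may be closer to what is wanted. It assumes the comparison is plain equality (the question's match_values is not specified) and reuses the sample values from the answer above; the names vals, eq, upper, and pairs are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [16, 19, 15, 19, 15]})
vals = df['val'].to_numpy()

# eq[i, j] is True when vals[i] == vals[j]; broadcasting compares all pairs at once.
eq = vals[:, None] == vals[None, :]

# Keep only the strict upper triangle, i.e. pairs with j > i,
# so each pair is considered once and no value is compared with itself.
upper = np.triu(eq, k=1)

# Row positions (i, j) of the matching pairs.
pairs = [(int(i), int(j)) for i, j in zip(*np.nonzero(upper))]
print(pairs)
```

This builds an n-by-n matrix, so it trades memory for speed and is only practical for columns of moderate length.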

Given a dataframe, how do I bucket columns according to their names and merge columns in the same bucket into one?

Suppose I have a dataframe with (for example) 10 columns: a,b,c,d,e,f,g,h,i,j
I want to bucket these columns as follows: a,b,c into x, d,f,g into y, e,h,i into z and j into j.
Each row of the output will have the x column value equal to the non-NaN a or b or c value of the original df. In case of multiple non-NaN values for a,b,c columns for a particular row in the original df, the output df will just contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
   a  b  c  d  e  f  g  h  i  j
0  1  -  -  -  2  -  4  3  -  -
1  -  6  -  0  4  -  -  -  -  2
2  -  3  2  -  -  -  -  1  -  9
The output will be:
       x  y      z  j
0      1  4  [2,3]  -
1      6  0      4  2
2  [3,2]  -      1  9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
One way is to create a dictionary with your mappings, rename the columns with it, stack, apply a groupby, and unstack back to the original shape.
I couldn't see any logic in your mappings, so defining them will have to be a manual operation, I'm afraid.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map( {i : x for x,y in buckets.items() for i in y})
out = df.stack().groupby(level=[0,1]).agg(list).unstack(1)[buckets.keys()]
print(out)
        x    y       z    j
0     [1]  [4]  [2, 3]  NaN
1     [6]  [0]     [4]  [2]
2  [3, 2]  NaN     [1]  [9]
First create the dict for the mapping, then group by it:
d = {'a':'x','b':'x','c':'x','d':'y','f':'y','g':'y','e':'z','h':'z','i':'z','j':'j'}
out = df.groupby(d,axis=1).agg(lambda x : [y[y!='-']for y in x.values])
Out[138]:
j x y z
0 [] [1] [4] [2, 3]
1 [2] [6] [0] [4]
2 [9] [3, 2] [] [1]
Starting with a very basic approach, let's define our buckets and simply iterate, then clean up:
import numpy as np
import pandas as pd

buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}

def clean(val):
    # drop NaN entries, then unwrap empty and single-element lists
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    else:
        return val

new_df = pd.DataFrame(index=df.index)
for new_col, old_cols in buckets.items():
    # one cleaned value (scalar, list, or NaN) per row of the bucket's columns
    new_df[new_col] = [clean(row) for row in df[old_cols].values.tolist()]
Here's how you can do it.
First, we define a function to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row of a dataframe (here, a sub-dataframe, if you will, since we select it using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd

bucket_cols = [["a", "b", "c"], ["d", "f", "g"], ["e", "h", "i"], ["j"]]
bucket_names = ["x", "y", "z", "j"]
buckets = {}

def bucket_rows(row):
    row = row.dropna().to_list()  # applying pd.Series.dropna method to remove NaN values
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # returns bucketed row
    return row

# looping through buckets and performing the bucketing operation
for name, cols in zip(bucket_names, bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    buckets[name] = bucket  # key by bucket name so the output columns are x, y, z, j
# creating bucketed df from buckets dict
df_bucketted = pd.DataFrame(buckets)
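None of the snippets above defines df, so here is a self-contained sketch that builds the sample frame from the question and collapses each bucket into a scalar, a list, or NaN. The helper name collapse and the variable out are illustrative and not from the original answers; the bucket mapping follows the question (d, f, g into y; e, h, i into z).

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame from the question ('-' in the question stands for NaN).
df = pd.DataFrame(
    [[1, nan, nan, nan, 2, nan, 4, 3, nan, nan],
     [nan, 6, nan, 0, 4, nan, nan, nan, nan, 2],
     [nan, 3, 2, nan, nan, nan, nan, 1, nan, 9]],
    columns=list('abcdefghij'))

buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'],
           'z': ['e', 'h', 'i'], 'j': ['j']}

def collapse(values):
    """Return NaN for no values, the value itself for one, a list for several."""
    kept = [v for v in values if not pd.isna(v)]
    if not kept:
        return np.nan
    return kept[0] if len(kept) == 1 else kept

# Build each output column from the rows of the corresponding input columns.
out = pd.DataFrame(
    {name: [collapse(row) for row in df[cols].to_numpy().tolist()]
     for name, cols in buckets.items()},
    index=df.index)

print(out)
```

Because the frame contains NaN, the numbers are stored as floats, so the output shows 1.0 rather than 1.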

verify if cell is the same and build an excelfile

I have some measurements (as dicts) and a list of labels. I need to check which labels appear in each measurement and write the result to an Excel file.
My output Excel file needs to look like this:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
#Output
              'A' 'B' 'C' 'D'
measurement1   1   1   0   0
measurement2   0   0   1   1
I have no idea how to build the matrix of 0s and 1s.
I hope you can help me.
EDIT
I finally found a solution. First I iterated over all measurements and wrote all missing labels to the dict measurements.
Then I built a dataframe of ones and, with three loops, used .loc to put zeros at the missing positions:
d = pd.DataFrame(1, index = measurements.keys(), columns = list1)
for y in measurements.keys():
    for z in measurements[y]:
        for x in list1:
            if x == z:
                d.loc[y,z] = 0
Maybe it's possible to do it with only 2 loops.
Use a nested list comprehension that filters the keys by membership in list1, then create the DataFrame with the constructor:
import pandas as pd

list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
L = [measurement1, measurement2]
d = [dict.fromkeys([y for y in x.keys() if y in list1], 1) for x in L]
df = pd.DataFrame(d).fillna(0).astype(int)
print(df)
   A  B  C  D
0  1  1  0  0
1  0  0  1  1
This should work, using only standard Python:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
measurements = [measurement1, measurement2]
headers = {h: i for i, h in enumerate(list1)}
matrix = []
for measurement in measurements:
    row = [0] * len(headers)
    for header in measurement.keys():
        row[headers[header]] = 1
    matrix.append(row)
For your example, the output will be:
matrix
=> [[1, 1, 0, 0], [0, 0, 1, 1]]
You can create a dataframe from a list of the dictionaries, reindex its columns with the list, and convert to 0/1 by checking notna:
pd.DataFrame([measurement1,measurement2]).reindex(columns=list1).notna().astype(int)
A B C D
0 1 1 0 0
1 0 0 1 1
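The answers above produce rows indexed 0 and 1, while the desired output labels the rows by measurement name. A minimal sketch of one way to get that, assuming the measurements are collected in a dict keyed by name (the dict measurements and the variable table are illustrative):

```python
import pandas as pd

list1 = ['A', 'B', 'C', 'D']
# Assumed layout: measurement name -> measurement dict.
measurements = {
    'measurement1': {'A': 1, 'B': 1},
    'measurement2': {'C': 3, 'D': 4},
}

# One row per measurement, one column per label: 1 if the label is present, else 0.
table = pd.DataFrame(
    [[int(label in m) for label in list1] for m in measurements.values()],
    index=list(measurements),
    columns=list1,
)

print(table)
```

From here, table.to_excel('out.xlsx') would write the result to a file, assuming an Excel writer engine such as openpyxl is installed; the file name is a placeholder.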

Iterate in a dataframe with strings

I'm trying to create a cognitive task called the 2-back test.
I created a semi-random list with certain conditions, and now I want to know what the correct answer is for the participant at each position.
I want a column in my dataframe saying yes or no, depending on whether the letter two positions earlier was the same letter.
Here is my code:
from random import choice, shuffle
import pandas as pd
num = 60
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
# letters_1 = [1, 2, 3, 4, 5, 6]
my_list = [choice(letters), choice(letters)]
probab = list(range(num - 2))
shuffle(probab)
# We want 20% of the letters to repeat the letter 2 letters back
pourc = 20
repeatnum = num * pourc // 100
for i in probab:
    ch = prev = my_list[-2]
    if i >= repeatnum:
        while ch == prev:
            ch = choice(letters)
    my_list.append(ch)
df = pd.DataFrame(my_list, columns=["letters"])
df.head(10)
  letters
0       F
1       I
2       D
3       I
4       H
5       C
6       L
7       G
8       D
9       L
# Create a list to store the data
response = []
# For each row in the column,
for i in df['letters']:
    # if more than a value,
    if i == [i - 2]:
        response.append('yes')
    else:
        response.append('no')
# Create a column from the list
df['response'] = response
First error:
    if i == [i - 2]:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I use numbers instead of letters, I can get past this error, but I would prefer to keep letters.
After that, if I run it with numbers, I get no errors, but my new column response only contains 'no'. But I know that 12 times it should be 'yes'.
It seems like you want to compare the column with the same column shifted by two elements. Use shift + np.where:
import numpy as np
df['response'] = np.where(df.letters.eq(df.letters.shift(2)), 'yes', 'no')
df.head(10)
  letters response
0       F       no
1       I       no
2       D       no
3       I      yes
4       H       no
5       C       no
6       L       no
7       G       no
8       D       no
9       L       no
Regarding "But I know that 12 times it should be 'yes'.":
df.response.eq('yes').sum()
12
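For comparison, the explicit loop from the question can also be made to work. The original loop iterates over the letters themselves, so i - 2 tries to subtract from a string; iterating over positions instead allows looking two places back. This is a minimal sketch on the ten letters shown in the question's output, with illustrative names (letters, pos, response):

```python
# The first ten letters from the question's sample output.
letters = list('FIDIHCLGDL')

response = []
for pos, letter in enumerate(letters):
    # The first two positions have nothing two steps back, so they are 'no'.
    if pos >= 2 and letter == letters[pos - 2]:
        response.append('yes')
    else:
        response.append('no')

print(response)
```

The shift-based version does the same comparison without a Python-level loop, which tends to matter more as the column grows.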

How to select values from pandas dataframe by column value

I am doing an analysis of a dataset with 6 classes, zero based. The dataset is many thousands of items long.
I need two dataframes with classes 0 & 1 for the first data set and 3 & 5 for the second.
I can get 0 & 1 together easily enough:
mnist_01 = mnist.loc[mnist['class']<= 1]
However, I am not sure how to get classes 3 & 5... so what I would like to be able to do is:
mnist_35 = mnist.loc[mnist['class'] == (3 or 5)]
...rather than doing:
mnist_3 = mnist.loc[mnist['class'] == 3]
mnist_5 = mnist.loc[mnist['class'] == 5]
mnist_35 = pd.concat([mnist_3,mnist_5],axis=0)
You can use isin with the class values you want to keep (a set works, as does a list):
mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
                      'val': ['a', 'b', 'c', 'd', 'e', 'f']})
>>> mnist.loc[mnist['class'].isin({3, 5})]
class val
3 3 d
5 5 f
>>> mnist.loc[mnist['class'].isin({0, 1})]
class val
0 0 a
1 1 b
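As for why the attempted mnist['class'] == (3 or 5) does not work: Python evaluates (3 or 5) first, and that expression is just 3. A short sketch, reusing the small frame from the answer above, shows this alongside an explicit boolean mask that selects the same rows as isin (the names mask and mnist_35 are illustrative):

```python
import pandas as pd

mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
                      'val': ['a', 'b', 'c', 'd', 'e', 'f']})

# `or` returns its first truthy operand, so this comparison only tests for class 3.
only_three = mnist.loc[mnist['class'] == (3 or 5)]

# Element-wise OR of two comparisons; the parentheses are required
# because | binds more tightly than ==.
mask = (mnist['class'] == 3) | (mnist['class'] == 5)
mnist_35 = mnist.loc[mask]

print(mnist_35)
```

For more than two or three values, isin is usually the more readable choice.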
