 | name1 | name2 | tot |
 +-------+-------+-----+
1| A     | B     | 3   |
2| C     | A     | 3   |
3| B     | D     | 4   |
4| A     | E     | 2   |
5| B     | C     | 5   |
 +-------+-------+-----+
I want to select rows based on the previous rows: rows where a "letter" appears at least 2 times in the rows above (in either name1 or name2), and those earlier rows have tot >= 3.
In this example I want to select:
A E 2
B C 5
because in the 4th row we have A (name1), which appears in the 1st and 2nd rows, each with tot >= 3;
and the B C 5 row, because B appears in the 1st and 3rd rows, each with tot >= 3.
PS: I want to create another dataset based on these new results.
You can build a cache using collections.defaultdict:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'name1': list('ACBAB'), 'name2': list('BADEC'), 'tot': [3, 3, 4, 2, 5]})

seen = defaultdict(int)  # every new key will be initialized with 0
keep = []
for row in df.itertuples():
    keep.append(
        (seen[row.name1] > 1) |
        (seen[row.name2] > 1)
    )
    if row.tot >= 3:
        # safe from KeyError because `seen` is a defaultdict
        seen[row.name1] += 1
        seen[row.name2] += 1

out = df[keep]
Output
name1 name2 tot
3 A E 2
4 B C 5
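Since the question also asks to create another dataset from these results, a minimal follow-up sketch (using out from above):

new_df = out.reset_index(drop=True).copy()  # independent copy with a fresh 0-based index
print(new_df)
#   name1 name2  tot
# 0     A     E    2
# 1     B     C    5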
Related
I am trying to drop values in one data frame based on values in another data frame. I would appreciate your expertise on this, please.
Data frame 1 – df1:
| A | C  |
|---|----|
| f | 10 |
| c | 15 |
| b | 20 |
| d | 30 |
| h | 35 |
| e | 40 |
Data frame 2 – df2:
| A | B |
|---|---|
| a | w |
| b | 1 |
| c | w |
| d | 1 |
| e | w |
| f | 0 |
| g | 1 |
| h | 1 |
I want to modify df1 and drop (eliminate) the rows whose value in column A has a corresponding value of 'w' in column B of df2.
The resulting data frame looks like this:
| A | C  |
|---|----|
| f | 10 |
| b | 20 |
| d | 30 |
| h | 35 |
You can first create a list from df2 of the values of A that have an associated value of 'w' in B, and then use isin and ~ (which essentially means "not in"):
# values of A whose corresponding B contains 'w'
a = df2.loc[df2['B'].str.contains(r'w', case=False, na=False), 'A'].tolist()
# keep only the rows of df1 whose A is not in that list
b = df1[~df1['A'].isin(a)]
And get back your desired outcome:
print(b)
A C
0 f 10
2 b 20
3 d 30
4 h 35
I find this link particularly helpful if you want to read more on Python's operators:
https://www.w3schools.com/python/python_operators.asp
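As a quick illustration of how isin and ~ interact (a small sketch, not part of the original answer):

import pandas as pd

s = pd.Series(['a', 'b', 'c'])
mask = s.isin(['a', 'c'])    # elementwise membership test
print(mask.tolist())         # [True, False, True]
print((~mask).tolist())      # [False, True, False] -- ~ flips each element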
What you're looking for is a merge with certain conditions:
# recreating your data
>>> import pandas as pd
>>> df1 = pd.DataFrame.from_dict({'A': list('fcbdhe'), 'B': [10, 15, 20, 30, 35, 40]})
>>> df2 = pd.DataFrame.from_dict({'A': list('abcdefgh'), 'B': list('w1w1w011')})
# merge; we still need to project the result to fit the desired output
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')
A B_x B_y
0 f 10 0
1 b 20 1
2 d 30 1
3 h 35 1
# what you're looking for
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')[['A', 'B_x']].rename(columns={'B_x': 'C'})
A C
0 f 10
1 b 20
2 d 30
3 h 35
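For reference, a sketch with the question's actual column names ('A' and 'C' in df1, here as a hypothetical df1_c): only 'A' is shared between the frames, so the merge produces no _x/_y suffixes and the rename step disappears:

df1_c = pd.DataFrame({'A': list('fcbdhe'), 'C': [10, 15, 20, 30, 35, 40]})
df1_c.merge(df2[df2['B'] != 'w'], how='inner', on='A').drop(columns='B')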
I assume that your first dataframe is df1 and your second dataframe is df2.
First of all, list all the values of column A of df2 whose value in column B of df2 is 'w'.
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
# >>>output = ['a','c','e']
This lists all the values of column A of df2 that have the value 'w' in column B.
Then you can drop the rows of df1 whose values in column A lie in the list above.
for i, val in enumerate(df1['A'].values.tolist()):
    if val in df2_A:
        df1.drop(i, axis=0, inplace=True)
Full code:
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
for i, val in enumerate(df1['A'].values.tolist()):
    if val in df2_A:  # checking if this value in df1's column A is in the list above
        df1.drop(i, axis=0, inplace=True)
df1
df1.merge(df2[df2['B'] != 'w'], how='inner').drop(['B'], axis=1)
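This compact variant relies on merge joining on the shared column ('A') by default; the inner join keeps only the rows of df1 whose A survives the 'w' filter on df2, and dropping 'B' afterwards restores df1's original columns.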
I have a DataFrame with column names 'a', 'b', 'c'.
#Input
import pandas as pd
list_of_dicts = [
    {'a': 0, 'b': 4, 'c': 3},
    {'a': 1, 'b': 1, 'c': 2},
    {'a': 0, 'b': 0, 'c': 0},
    {'a': 1, 'b': 0, 'c': 3},
    {'a': 2, 'b': 1, 'c': 0}
]
df = pd.DataFrame(list_of_dicts)
#Input DataFrame
   | a | b | c |
---|---|---|---|
 0 | 0 | 4 | 3 |
 1 | 1 | 1 | 2 |
 2 | 0 | 0 | 0 |
 3 | 1 | 0 | 3 |
 4 | 2 | 1 | 0 |
I want to reduce the wide DataFrame to one column, where each column name appears as a value repeated as many times as the corresponding cell value. The operation must be done row-wise.
#Output
   | Values |
---|--------|
 0 |   b    |
 1 |   b    |
 2 |   b    |
 3 |   b    |
 4 |   c    |
 5 |   c    |
 6 |   c    |
 7 |   a    |
 8 |   b    |
 9 |   c    |
10 |   c    |
11 |   a    |
12 |   c    |
13 |   c    |
14 |   c    |
15 |   a    |
16 |   a    |
17 |   b    |
Explanation:
Row 0 in the Input DataFrame has 4 'b' and 3 'c', so the first seven elements of the output DataFrame are bbbbccc
Row 1 similarly has 1 'a', 1 'b', and 2 'c', so the output will have abcc as the next 4 elements.
Row 2 has 0's across, so it would be skipped entirely.
The order of the output is very important.
For example, the first row has 4 'b' and 3 'c', so the output DataFrame must be bbbbccc because column 'b' comes before column 'c'. The operation must be row-wise, from left to right.
I'm trying to find an efficient way to accomplish this; the real dataset is too big for me to compute naively. Please provide a Python 3 solution.
Stack the data (you could melt as well), and drop rows where the count is zero. Finally use numpy.repeat to build a new array, and build your new dataframe from that.
import numpy as np

reshape = df.stack().droplevel(0).loc[lambda x: x != 0]
pd.DataFrame(np.repeat(reshape.index, reshape), columns=['values'])
values
0 b
1 b
2 b
3 b
4 c
5 c
6 c
7 a
8 b
9 c
10 c
11 a
12 c
13 c
14 c
15 a
16 a
17 b
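To see why the order is preserved: after the stack and filter, reshape is a Series of nonzero counts whose index is the column labels, already in row-major order (a sketch of the intermediate state, not extra output from the answer):

# print(reshape) gives:
# b    4
# c    3
# a    1
# b    1
# c    2
# a    1
# c    3
# a    2
# b    1
# np.repeat(reshape.index, reshape) repeats each label by its count, preserving this order.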
I don't think pandas buys you anything in this process; especially if you have a large amount of data, you don't want to read it all into memory and reprocess it into another large data structure.
import csv

with open('input.csv', 'r') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        for key in reader.fieldnames:  # DictReader exposes the column names as `fieldnames`
            value = int(row[key])
            for i in range(value):
                print(key)
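If the result should itself be a dataset rather than printed output, the same streaming loop can write rows as it goes (a sketch, with input.csv and output.csv as hypothetical file names):

import csv

with open('input.csv', 'r') as fh, open('output.csv', 'w', newline='') as out:
    reader = csv.DictReader(fh)
    writer = csv.writer(out)
    writer.writerow(['values'])  # header for the single output column
    for row in reader:
        for key in reader.fieldnames:
            for _ in range(int(row[key])):
                writer.writerow([key])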
I have the following dataframe with multiple cols and rows,
A | B | C | D | E |....
2 | b | c | NaN | 1 |
3 | c | b | NaN | 0 |
4 | b | b | NaN | 1 |
.
.
.
Is there a way, using Python, to add Excel formulas to some columns of an output Excel file, in the manner shown in the example below?
For instance, I want the output to be something like this:
=SUM(A0:A2) | | | | =SUM(E0:E2)
A | B | C | D | E
0 2 | b | c | =IF(B0=C0, "Yes", "No") | 1
1 3 | c | b | =IF(B1=C1, "Yes", "No") | 0
2 4 | b | b | =IF(B2=C2, "Yes", "No") | 1
.
.
.
Final output,
9 | | | | 2
A | B | C | D | E
0 2 | b | c | No | 1
1 3 | c | b | No | 0
2 4 | b | b | Yes | 1
.
.
.
I want to add formulas to the final output Excel file so that if any column values change (in the final output Excel file), the other columns are updated in the Excel file in real time. For instance,
15 | | | | 3
A | B | C | D | E
0 2 | b | b | Yes | 1
1 9 | c | b | No | 1
2 4 | b | b | Yes | 1
.
.
.
If I change the value of, for instance, A1 from 3 to 9, then the sum of the column changes to 15; when I change the value of C0 from "c" to "b", its corresponding row value, that is, D0, changes from "No" to "Yes". The same goes for column E.
I know you can use the xlsxwriter library to write the formulas, but I am not able to figure out how to add them in the manner stated in the example above.
Any help would be really appreciated, thanks in advance!
You're best off writing all of the formulas you wish to keep via xlsxwriter, not pandas.
You would use pandas if you only wanted to export the computed result; since you want to preserve the formulas, add them when you write your spreadsheet.
The code below will write out the dataframe and formulas to an xlsx file called test.xlsx.
import xlsxwriter
import pandas as pd
from numpy import nan

data = [[2, 'b', 'c', nan, 1], [3, 'c', 'b', nan, 0], [4, 'b', 'b', nan, 1]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D', 'E'])

## Send values to a list so we can iterate over them, allowing row:column matching in the formula ##
values = df.values.tolist()

## Create workbook ##
workbook = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet()

row = 0
col = 0

## Iterate over the data extracted from the df, generating the cell formula for 'D' on each iteration ##
for line in values:
    d = f'=IF(B{row + 1}=C{row + 1}, "Yes", "No")'
    a, b, c, _, e = line
    ## Write cells into the spreadsheet ##
    worksheet.write(row, col, a)
    worksheet.write(row, col + 1, b)
    worksheet.write(row, col + 2, c)
    worksheet.write(row, col + 3, d)
    worksheet.write(row, col + 4, e)
    row += 1

## Write the total sums to the bottom row of the sheet, using the row counter as the stop point ##
worksheet.write(row, 0, f'=SUM(A1:A{row})')
worksheet.write(row, 4, f'=SUM(E1:E{row})')

workbook.close()
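If you'd rather let pandas write the data, another option (a sketch, not from the original answer; test2.xlsx is a hypothetical file name) is to export with the xlsxwriter engine and overlay the formula cells afterwards:

import pandas as pd

with pd.ExcelWriter('test2.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, index=False)  # header lands in Excel row 1; NaN cells in D stay blank
    ws = writer.sheets['Sheet1']
    n = len(df)
    for r in range(n):
        # data rows occupy Excel rows 2..n+1 because of the header row
        ws.write_formula(r + 1, 3, f'=IF(B{r + 2}=C{r + 2}, "Yes", "No")')
    ws.write_formula(n + 1, 0, f'=SUM(A2:A{n + 1})')
    ws.write_formula(n + 1, 4, f'=SUM(E2:E{n + 1})')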
I have a dataframe that looks like this:
Col1 | Col2 | Col1 | Col3 | Col1 | Col4
  a  |  d   |      |  h   |  a   |  p
  b  |  e   |  b   |  i   |  b   |  l
     |  l   |  a   |  l   |      |  a
  l  |  r   |  l   |  a   |  l   |  x
  a  |  i   |  a   |  w   |      |  i
     |  c   |      |  i   |  r   |  c
  d  |  o   |  d   |  e   |  d   |  o
Col1 is repeated multiple times in the dataframe. In each Col1 occurrence, some information is missing. I need to create a new column that combines the information from every Col1 occurrence.
How can I create a column with the complete information and then delete the previous duplicate columns?
Some information may be missing from multiple columns. This script is also meant to be used in the future, when there could be one, three, five, or any number of duplicated Col1 columns.
The desired output looks like this:
Col2 | Col3 | Col4 | Col5
  d  |  h   |  p   |  a
  e  |  i   |  l   |  b
  l  |  l   |  a   |  a
  r  |  a   |  x   |  l
  i  |  w   |  i   |  a
  c  |  i   |  c   |  r
  o  |  e   |  o   |  d
I have been looking over this question, but it is not clear to me how I could keep the desired Col1 with complete values. I can delete multiple columns of the same name, but I first need to create a column with complete information.
First replace empty values in your columns with nan as below:
import numpy as np
df = df.replace(r'^\s*$', np.nan, regex=True)
Then, you could use groupby and then first():
df.groupby(level=0, axis=1).first()
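On recent pandas versions, where axis=1 in groupby is deprecated, the same column-wise collapse can be done by transposing (a sketch under that assumption; note groupby sorts the labels, so column order may change):

df = df.T.groupby(level=0).first().T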
Maybe something like this is what you are looking for.
col_list = list(set(df.columns))
dicts = {}
for col in col_list:
    val = list(filter(None, set(df.filter(like=col).stack().reset_index()[0].str.strip(' ').tolist())))
    dicts[col] = val
max_len = max([len(v) for v in dicts.values()])
pd.DataFrame({k: pd.Series(v[:max_len]) for k, v in dicts.items()})
Output
Col3 Col4 Col1 Col2
0 h i d d
1 w l b r
2 i c r i
3 l x l l
4 a p a o
5 e o NaN c
6 NaN a NaN e
I have a dataframe like:
| column1 |
| a,b,c   |
| d,b     |
| a & b,c |
and I'd like to have it like this:
column_a | column_b | column_c | column_d | column_a & b
    1    |    1     |    1     |    0     |      0
    0    |    1     |    0     |    1     |      0
    1    |    1     |    1     |    0     |      1
Similar to get_dummies, except that I have multiple strings per cell.
I don't believe there are repeated strings in a cell, so no '2's.
Any help would be greatly appreciated!
You could start with something like this:
data = '''|column1 |
|a,b,c |
|d,b |
|a & b,c |'''
rows = [r.strip() for r in data.replace('\n', '').split('|')[3:] if r.strip() != '']
values = []
for r in rows:
    values += r.split(',')
values = set(values)
print(' | '.join(['column_' + v for v in values]))
for r in rows:
    output = ''
    for v in values:
        if v in r:
            output += '1'
        else:
            output += '0'
        output += ' | '
    print(output)
You'll have to use some string formatting to make it look pretty, but this should get you started.
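For data already in a dataframe, pandas' Series.str.get_dummies handles multi-valued cells directly (a sketch assuming the column is named column1 as in the question; note it treats 'a & b' as a single token, so unlike the substring matching above it will not also set column_a and column_b for that row):

import pandas as pd

df = pd.DataFrame({'column1': ['a,b,c', 'd,b', 'a & b,c']})
dummies = df['column1'].str.get_dummies(sep=',').add_prefix('column_')
print(dummies)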