Probably easy, but I couldn't find it.
I want to copy a large chunk of data within a column where concatenated data wasn't filled in properly, so a block of NaN sits below the values.
A small example:
import numpy as np
import pandas as pd

df1 = {'col1': ['a', 'b', 'c', 'd', 'e', 'f', 'g', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
df1 = pd.DataFrame(data=df1)
Did this:
df1['col1'][7:14] = df1['col1'][0:7]
Worked fine.
But what about larger data sets where I don't know the index slicing? Is there a built-in function for this?
Try 1) not chaining indexers, and 2) passing a NumPy array on assignment:
df1.loc[7:, 'col1'] = df1.loc[:6, 'col1'].values
Output:
col1
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 a
8 b
9 c
10 d
11 e
12 f
13 g
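If the split point isn't known in advance, a hedged sketch (assuming the NaN block sits directly below the filled values, as in the example) is to count the non-NaN rows first and slice from there:
n_filled = df1['col1'].notna().sum()     # rows already filled at the top
n_missing = len(df1) - n_filled          # rows of NaN to fill below them
# copy the first n_missing filled values into the NaN tail
df1.loc[n_filled:, 'col1'] = df1.loc[:n_missing - 1, 'col1'].values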
I have a dataframe like
id col1 col2 col3 ......col25
1 a b c d ...........
2 d e f NA ........
3 a NA NA NA .......
What I want is:
id start end
1 a b
1 b c
1 c d
2 d e
2 e f
for names, row in data_final.iterrows():
    for i in range(0, 26):
        try:
            x = pd.Series([row["id"], row[i], row[i+1]], index=['id', 'start', 'end'])
            df1 = df1.append(x, ignore_index=True)
        except:
            break
This works, but it is definitely not the best solution, as its time complexity is too high.
I need a better, more efficient solution for this.
One way could be to stack to remove the missing values, then groupby and zip to pair each element with the one that follows it. Then we just need to flatten the result with itertools.chain and create a DataFrame:
from itertools import chain
l = [list(zip(v.values[:-1], v.values[1:])) for _,v in df.stack().groupby(level=0)]
pd.DataFrame(chain.from_iterable(l), columns=['start', 'end'])
start end
0 a b
1 b c
2 c d
3 d e
4 e f
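The output above drops the id; a hedged variant that carries it along, assuming id is a regular column here that we move into the index so it becomes the group key:
pairs = [(idx, a, b)
         for idx, v in df.set_index('id').stack().groupby(level=0)
         for a, b in zip(v.values[:-1], v.values[1:])]
pd.DataFrame(pairs, columns=['id', 'start', 'end'])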
I have a data frame like this,
df
col1 col2
1 A
2 A
3 A
4 A
5 A
6 A
7 B
8 B
9 A
10 A
11 A
12 A
13 B
14 A
15 B
16 A
17 A
18 A
Now, if there are consecutive Bs, or only one row between two Bs, I want to display the starting row of those Bs.
So final output would look like,
col1 col2
7 B
13 B
I could do this using a for loop by comparing the row values, but the execution time will be huge. I am looking for any pandas shortcut or any other method to do it most efficiently.
You can first replace non-B values with missing values, then forward fill them with limit=1, so that two Bs separated by a single row form one group, and finally take the first value of each B group:
m = df['col2'].where(df['col2'].eq('B')).ffill(limit=1).eq('B')
df = df[ m.ne(m.shift()) & m]
print (df)
col1 col2
6 7 B
12 13 B
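For readability, the same logic can be spelled out step by step (names here are just illustrative):
s = df['col2'].where(df['col2'].eq('B'))  # keep 'B', turn everything else into NaN
s = s.ffill(limit=1)                      # let each 'B' also cover the one row after it
m = s.eq('B')                             # True inside each bridged run of B
starts = m.ne(m.shift()) & m              # True only on the first row of each run
print(df[starts])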
cols = []
for i in range(len(df)):
    if i != 0:
        # a B whose previous row is not B starts a candidate group
        if df['col2'][i] == 'B' and df['col2'][i-1] != 'B':
            # skip it if the row two back is B, i.e. this B only continues a group
            if i < 2 or df['col2'][i-2] != 'B':
                cols.append(df['col1'][i])
print(df[df['col1'].isin(cols)])
Output:
col1 col2
7 B
13 B
Find the indexes with B whose i-1 and i-2 rows do not have B, and retrieve those rows from the data frame by the retrieved indexes.
You can use shift and vector logic:
a = df['col2']
mask = (a.shift(1) != a) & ((a.shift(-1) == a) | (a.shift(-2) == a)) & (a == 'B')
df = df[mask]
I am new to data science. I want to check which elements from one data frame exist in another data frame, e.g.
df1 = [1,2,8,6]
df2 = [5,2,6,9]
# for 1 output should be False
# for 2 output should be True
# for 6 output should be True
etc.
Note: I have a matrix, not a vector.
I have tried using the following code:
import pandas as pd
import numpy as np
priority_dataframe = pd.read_excel(prioritylist_file_path, sheet_name='Sheet1', index=None)
priority_dict = {column: np.array(priority_dataframe[column].dropna(axis=0, how='all').str.lower())
                 for column in priority_dataframe.columns}
keys_found_per_sheet = []
if file_path.lower().endswith('.csv'):
    file_dataframe = pd.read_csv(file_path)
else:
    file_dataframe = pd.read_excel(file_path, sheet_name=sheet, index=None)
file_cell_array = list()
for column in file_dataframe.columns:
    for file_cell in np.array(file_dataframe[column].dropna(axis=0, how='all')):
        if isinstance(file_cell, str):  # was isinstance(file_cell, str) == 'str', which is always False
            file_cell_array.append(file_cell)
        else:
            file_cell_array.append(str(file_cell))
converted_file_cell_array = np.array(file_cell_array)
for key, values in priority_dict.items():
    for priority_cell in values:
        if priority_cell in converted_file_cell_array[:]:
            keys_found_per_sheet.append(key)
            break
Am I doing something wrong in if priority_cell in converted_file_cell_array[:]?
Is there any more efficient way to do this?
You can take the .values from each dataframe, convert them to a set(), and take the set intersection.
set1 = set(df1.values.reshape(-1).tolist())
set2 = set(df2.values.reshape(-1).tolist())
common = set1 & set2
You can flatten all the values of the DataFrames with numpy.ravel and then use set.intersection():
df1 = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df1)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df2 = pd.DataFrame({'A':[2,3,13,4], 'Z':list('abfr')})
print (df2)
A Z
0 2 a
1 3 b
2 13 f
3 4 r
L = list(set(df1.values.ravel()).intersection(df2.values.ravel()))
print (L)
['f', 2, 3, 4, 'a', 'b']
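If what you actually need is a True/False answer per element of df1 (as in the 1/2/6 example above) rather than the set of common values, a possible approach is DataFrame.isin fed with the flattened values of the other frame:
common_values = set(df2.values.ravel())
mask = df1.isin(common_values)   # same shape as df1, True where the value also appears somewhere in df2
print(mask)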
I am trying to aggregate some statistics from a groupby object on chunks of data. I have to chunk the data because there are many (18 million) rows. I want to find the number of rows in each group in each chunk, then sum them together. I can add groupby objects but when a group is not present in one term, a NaN is the result. See this case:
>>> df = pd.DataFrame({'X': ['A','B','C','A','B','C','B','C','D','B','C','D'],
'Y': range(12)})
>>> df
X Y
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
6 B 6
7 C 7
8 D 8
9 B 9
10 C 10
11 D 11
>>> df[0:6].groupby(['X']).count() + df[6:].groupby(['X']).count()
Y
X
A NaN
B 4
C 4
D NaN
But I want to see:
>>> df[0:6].groupby(['X']).count() + df[6:].groupby(['X']).count()
Y
X
A 2
B 4
C 4
D 2
Is there a good way to do this? Note in the real code I am looping through a chunked iterator of a million rows per groupby.
Call add and pass fill_value=0; you could add iteratively like this while chunking:
In [98]:
df = pd.DataFrame({'X': ['A','B','C','A','B','C','B','C','D','B','C','D'],
'Y': np.arange(12)})
df[0:6].groupby(['X']).count().add(df[6:].groupby(['X']).count(), fill_value=0)
Out[98]:
Y
X
A 2
B 4
C 4
D 2
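A minimal sketch of the iterative version over a chunked reader (the file name, chunk size, and column name are just placeholders for your real data):
total = None
for chunk in pd.read_csv('big_file.csv', chunksize=1_000_000):
    counts = chunk.groupby('X').count()
    total = counts if total is None else total.add(counts, fill_value=0)
print(total)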
Consider a trivial example with a DataFrame df and a Series s:
import pandas as pd
matching_vals = range(20,30)
df = pd.DataFrame(columns=['a'], index=range(0,10))
df['a'] = matching_vals
s = pd.Series(list("ABCDEFGHIJ"), index=matching_vals)
df['b'] = s[df['a']]
At this point I would expect df['b'] to contain the letters A through J, but instead it's all NaN. However, if I replace the last line with
n = df['a'][2]
df['c'] = s[n]
then df['c'] is filled with Cs, as I'd expect, so I'm pretty sure it's not some strange type error.
I'm new to pandas, and this is driving me crazy.
s[df['a']] has an index which is different than df's index:
In [104]: s[df['a']]
Out[104]:
a
20 A
21 B
22 C
23 D
24 E
25 F
26 G
27 H
28 I
29 J
When you assign a Series to a column of a DataFrame, Pandas tries to assign values according to the index. Since s[df['a']] does not have any values associated with the indices of df, NaN values are assigned. The assignment does not add new rows to df.
If you don't want the index to enter into the assignment, you could use
df['b'] = s[df['a']].values
For a demonstration of the matching of indices, notice how
import pandas as pd
df = pd.DataFrame(columns=['a'], index=range(0,10))
df['a'] = range(0,10)[::-1]
s = pd.Series(list("ABCDEFGHIJ"), index=range(0,10)[::-1])
df['b'] = s[df['a']]
yields
In [123]: s[df['a']]
Out[123]:
a
9 A
8 B
7 C
6 D
5 E
4 F
3 G
2 H
1 I
0 J
dtype: object
In [124]: df
Out[124]:
a b
0 9 J
1 8 I
2 7 H
3 6 G
4 5 F
5 4 E
6 3 D
7 2 C
8 1 B
9 0 A
[10 rows x 2 columns]
The values of df['b'] are "flipped" to make the indices match.
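As an aside, the same by-value lookup can also be written with Series.map, which sidesteps index alignment entirely:
df['b'] = df['a'].map(s)   # look each value of df['a'] up in s; the result is aligned to df's index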