Count sequence within a column in pandas - python

I have the following problem. Suppose I have this dataframe:
import pandas as pd

d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'],
     'Project': ['aa', 'ab', 'bc', 'aa', 'ab', 'aa', 'ab', 'ca', 'cb'],
     'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)
I need to add a new column that adds a number to each project per name. The desired output is:
import pandas as pd

dnew = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'],
        'Project': ['aa', 'ab', 'bc', 'aa', 'ab', 'aa', 'ab', 'ca', 'cb'],
        'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3],
        'New_column': ['1', '1', '1', '2', '2', '2', '2', '3', '3']}
NEWdf = pd.DataFrame(data=dnew)
In other words: 'aa', 'ab', 'bc' is the set of projects in the first rows, so I add 1 to the new column. 'aa', 'ab' is the second set of projects from the beginning; it occurs for names 'a' and 'b', so I add 2 to the new column for both of them. 'ca', 'cb' is the third set of projects and it occurs only for name 'd', so I add 3 only for name 'd'.
I tried to combine groupby with a for loop, but it did not work for me. Thanks a lot for any help!

This looks like a job for networkx, since Name and Project are related; you can use:
import networkx as nx

# Build a graph linking each Name to its Projects, find the connected
# components, and map every Project to the number of its component.
G = nx.from_pandas_edgelist(df, 'Name', 'Project')
l = list(nx.connected_components(G))
s = pd.Series(map(list, l)).explode()
df['new'] = df['Project'].map({v: k for k, v in s.items()}).add(1)
print(df)
  Name Project  col2  new
0    a      aa     3    1
1    a      ab     4    1
2    b      bb     6    2
3    b      bc     6    2
4    c      aa     6    1
5    c      ab     6    1
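A pure-pandas sketch (not the networkx answer above) that reproduces the numbering from the desired output: label each Name by its set of Projects and number those sets in order of first appearance. The column name New_column comes from the question; everything else is an assumption.
# Sketch: one label per distinct set of projects, numbered by first appearance.
proj_sets = df.groupby('Name', sort=False)['Project'].agg(frozenset)
df['New_column'] = pd.factorize(df['Name'].map(proj_sets))[0] + 1
print(df)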

Related

Looking to drop first 5 rows of dataframe after every new value occurs

I am looking to drop the first 5 rows each time a new value occurs in a dataframe
import pandas as pd

data = {
    'col1': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
}
df = pd.DataFrame(data)
I am looking to drop the first 5 rows after each new value. Ex: 'A' value is new... delete first 5 rows. Now encounter 'B' value... delete its first 5 rows...
You can build a boolean mask with groupby.cumcount():
mask = df.groupby('col1').cumcount() >= 5
df = df.loc[mask]
You can use a negative tail:
df.groupby('col1').tail(-5)
To group by consecutive values:
group = df['col1'].ne(df['col1'].shift()).cumsum()
df.groupby(group).tail(-5)
Output:
   col1  col2
5     A     6
6     A     7
12    B    13
13    B    14
19    C    20
20    C    21
NB: as pointed out by @Mark, the negative tail has an issue on older pandas versions (< 1.4), in which case the cumcount approach can be used.
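For pandas < 1.4, a small sketch (assuming df as above) that combines the consecutive-group trick with the cumcount mask instead of the negative tail:
# Number consecutive runs of col1, then keep rows whose position within the run is 5 or more.
group = df['col1'].ne(df['col1'].shift()).cumsum()
out = df[df.groupby(group).cumcount() >= 5]
print(out)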

Create dataframe from values/columns from another dataframe [duplicate]

This question already has answers here: How do I melt a pandas dataframe? (3 answers)
Closed 12 months ago.
I have a dataframe like this:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6],
                    'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})
I would like to use the values of df1 (and their respective columns) as a basis for filling in df2.
Expected:
expected = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'],
                         'value': [1, 'a', 4, 'e', 7, 2, 'b', 5, 'f', 8],
                         'id': [25] * 5 + [15] * 5})
I tried using iterrows, but since I need this for a large amount of data, the performance was not acceptable. Can you help me?
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6], 'D': ['e', 'f', 'g'], 'E': [7, 8, 9], 'id': [25, 15, 30]})
pd.melt(df1, id_vars=['id'], var_name = 'column')
    id column value
0   25      A     1
1   15      A     2
2   30      A     3
3   25      B     a
4   15      B     b
5   30      B     c
6   25      C     4
7   15      C     5
8   30      C     6
9   25      D     e
10  15      D     f
11  30      D     g
12  25      E     7
13  15      E     8
14  30      E     9
Have you tried DataFrame.melt? I guess something like this could do the trick:
df1.melt(ignore_index=False).merge(
    df1, left_index=True, right_index=True
)[['variable', 'value', 'id']].reset_index()
There are some rows to be ignored, but that should be easy. I don't know about performance regarding large data frames, though.
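A small sketch of that "ignore some rows" step, assuming df1 as defined in the question: drop the melted rows where variable is 'id' itself.
# Melt everything, attach id by index, then drop the rows that came from melting 'id'.
out = (df1.melt(ignore_index=False)
          .merge(df1[['id']], left_index=True, right_index=True)
          .query("variable != 'id'")
          .reset_index(drop=True))
print(out)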

how to treat all data as a single group in pandas groupby

I have a dataset and I need to group it based on the column group:
import numpy as np
import pandas as pd

arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'group': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                   "target": arr})

for g_name, g_df in df.groupby("group"):
    print("GROUP: {}".format(g_name))
    print(g_df)
However, sometimes group might not exist as a column, and in this case I am trying to treat the whole data as a single group.
for g_name, g_df in df.groupby(SOMEPARAMETERS):
    print(g_df)
output:
target
1
2
4
7
11
16
22
29
37
46
Is it possible to change the parameter of groupby to get whole data as a single group?
Assuming you mean something like this where you have two columns on which you want to group:
import numpy as np
import pandas as pd

arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'group1': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                   'group2': ['C', 'D', 'D', 'C', 'D', 'D', 'C', 'D', 'D', 'C'],
                   'target': arr})
Then you can easily extend your example with:
for g_name, g_df in df.groupby(["group1", "group2"]):
    print("GROUP: {}".format(g_name))
    print(g_df)
Is this what you meant?
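Not part of the answer above, but for the literal question (treat everything as one group when the column is missing) a common trick is to group by a constant key; a small sketch, assuming df as in the question:
# Fall back to a constant grouping key when 'group' is absent, so the whole
# DataFrame comes back as a single group.
key = "group" if "group" in df.columns else (lambda _: "all")
for g_name, g_df in df.groupby(key):
    print("GROUP: {}".format(g_name))
    print(g_df)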

Look up value in an array

Suppose I have two datasets
DS1
ArrayCol
[1,2,3,4]
[1,2,3]
DS2
Key Name
1 A
2 B
3 C
4 D
How do I look up the values in the array to map the "Name", so that I can have another dataset like the following?
DS3
COlNew
[A,B,C,D]
[A,B,C]
Thanks. It's in Databricks, so any method is OK: Python, SQL, Scala…
You can try this:
ds1 = [[1, 2, 3, 4], [1, 2, 3]]
ds2 = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}
new_data = [[ds2[cell] for cell in col] for col in ds1]
print(new_data)
output:
[['A', 'B', 'C', 'D'], ['A', 'B', 'C']]
Hope that helps. :)
Let's assume your datasets are in files; then you can do something like this, making use of a dict:
# ds2.txt holds tab-separated "key<TAB>name" pairs; ds1.txt holds one
# bracketed array per line, e.g. [1,2,3,4].
f = open("ds1.txt").readlines()
g = open("ds2.txt").readlines()
u = dict(item.rstrip().split("\t") for item in g)
for i in f:
    i = i.rstrip().strip('][').split(',')
    print([u[col] for col in i])
Output
['A', 'B', 'C', 'D']
['A', 'B', 'C']
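Since the question mentions Databricks but says any method is fine, here is a hedged pandas sketch; the names ArrayCol, Key, Name and ColNew come from the question, everything else is an assumption.
import pandas as pd

# Map each array element through a Key -> Name lookup Series.
ds1 = pd.DataFrame({'ArrayCol': [[1, 2, 3, 4], [1, 2, 3]]})
ds2 = pd.DataFrame({'Key': [1, 2, 3, 4], 'Name': ['A', 'B', 'C', 'D']})
lookup = ds2.set_index('Key')['Name']
ds1['ColNew'] = ds1['ArrayCol'].apply(lambda arr: [lookup[k] for k in arr])
print(ds1['ColNew'].tolist())   # [['A', 'B', 'C', 'D'], ['A', 'B', 'C']]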

Pandas String Series to int normalisation for Tensor

I have a pandas Series object with repeated string values that I need to normalise into int values to feed into TensorFlow.
I have looked at converting this into a Category as per this but it creates a code per item rather than identifying duplicates.
e.g. I wish for the following conversion
['a', 'b', 'c', 'd', 'a', 'a', 'c'] -> [1, 2, 3, 4, 1, 1, 3]
You need factorize, with a small change:
print ((pd.factorize(['a', 'b', 'c', 'd', 'a', 'a', 'c'])[0] + 1).tolist())
[1, 2, 3, 4, 1, 1, 3]
You need to add cat.codes after converting to category:
pd.Series(['a', 'b', 'c', 'd', 'a', 'a', 'c']).astype('category').cat.codes+1
Out[1407]:
0 1
1 2
2 3
3 4
4 1
5 1
6 3
dtype: int8
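If the same mapping has to be reused on later batches (which usually matters when feeding a model), a small sketch that fixes the category list once; the values here are only an illustration:
import pandas as pd

# Encode a new batch against a fixed category list; values not in the list
# get code -1 before the +1, i.e. 0 here.
categories = ['a', 'b', 'c', 'd']
new_batch = pd.Series(['c', 'a', 'd', 'x'])
codes = pd.Categorical(new_batch, categories=categories).codes + 1
print(codes.tolist())   # [3, 1, 4, 0]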
