Searching through a database for partial and full integer matches - python

I'm trying to search a dataframe column that can hold one or more integer values, to match one or more given integers.
The integers in the database are separated by a '-'. For example:
--------------------------------------------------
| Customer 1 |1124 |
--------------------------------------------------
| Customer 2 |1124-1123 |
--------------------------------------------------
| Customer 3 |1124-1234-1642 |
--------------------------------------------------
| Customer 3 |1213-1234-1642 |
--------------------------------------------------
The objective here is to do a partial and full match, and to be able to find out how many integers didn't match.
So for example, let's say I have to find all customers with 1124; the output would look like this (going off the example I provided):
--------------------------------------------------
| Customer 1 |1124 |None
--------------------------------------------------
| Customer 2 |1124-1123 |1
--------------------------------------------------
| Customer 3 |1124-1234-1642 |2
--------------------------------------------------
Thanks ahead of time!

Use set
define x as the test set
make s a series of sets
s - x creates a series of differences
(s - x).str.len() are the sizes of the differences
s & x is a boolean series indicating whether there is an intersection. Or in this case, if x is in s
x = {'1124'}
s = df['col2'].str.split('-').apply(set)
df.assign(col3=(s - x).str.len())[s & x]
col1 col2 col3
0 Customer 1 1124 0
1 Customer 2 1124-1123 1
2 Customer 3 1124-1234-1642 2
Setup
df = pd.DataFrame({
'col1': ['Customer 1', 'Customer 2', 'Customer 3', 'Customer 3'],
'col2': ['1124', '1124-1123', '1124-1234-1642', '1213-1234-1642']
})
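The set arithmetic above relies on pandas broadcasting a plain Python set across an object-dtype Series, which may behave differently between pandas versions. A more explicit sketch of the same idea follows; the names wanted, code_sets, has_match and unmatched are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['Customer 1', 'Customer 2', 'Customer 3', 'Customer 3'],
    'col2': ['1124', '1124-1123', '1124-1234-1642', '1213-1234-1642'],
})

# Hypothetical name: the set of codes to look for (kept as strings).
wanted = {'1124'}

# Turn each '-'-separated string into a set of code strings.
code_sets = df['col2'].str.split('-').apply(set)

# True for rows that share at least one code with `wanted`.
has_match = code_sets.apply(lambda codes: bool(codes & wanted))

# Number of codes in each row that are not in `wanted`.
unmatched = code_sets.apply(lambda codes: len(codes - wanted))

# Attach the count and keep only the matching rows.
result = df.assign(col3=unmatched)[has_match]
print(result)
```

Because the intersection and the difference are computed row by row, the same code should work unchanged when wanted holds several codes.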

Related

Add prefix to ffill, identifying values which were carried forward

Is there a way to add a prefix when filling NAs with ffill in pandas? I have a dataframe containing taxonomic information like so:
| Kingdom | Phylum | Class | Order | Family | Genus |
| Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Lactobacillus |
| Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | | |
| Bacteria | Bacteroidetes | | | | |
Since not all of the taxa in my dataframe can be classified fully, I have some empty cells. Replacing the spaces with NA and using ffill I can fill these with the last valid string in each row but I would like to add a string to these (for example "Unknown_Bacteroidales") so I can identify which ones were carried forward.
So far I tried this taxa_formatted = "unknown_" + taxonomy.fillna(method='ffill', axis=1) but this of course adds the "unknown_" prefix to everything in the dataframe.
You can do this using boolean masking with df.isna.
df = df.replace("", np.nan) # if already NaN present skip this step
d = df.ffill()
d[df.isna()]+="(Copy)"
d
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales Lactobacillaceae(Copy) Lactobacillus(Copy)
2 Bacteria Bacteroidetes Bacteroidia(Copy) Bacteroidales(Copy) Lactobacillaceae(Copy) Lactobacillus(Copy)
You can use df.add here.
d = df.ffill(axis=1)
df.add("unknown_" + d[df.isna()],fill_value='')
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales unknown_Bacteroidales unknown_Bacteroidales
2 Bacteria Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes
You need to use mask and update:
#make true nan's first.
#df = df.replace('',np.nan)
s = df.isnull()
df = df.ffill(axis=1)
df.update('unknown_' + df.mask(~s) )
print(df)
Bacteria Firmicutes Bacilli Lactobacillales \
0 Bacteria Bacteroidetes Bacteroidia Bacteroidales
1 Bacteria Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes
Lactobacillaceae Lactobacillus
0 unknown_Bacteroidales unknown_Bacteroidales
1 unknown_Bacteroidetes unknown_Bacteroidetes
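The answers above all follow the same pattern: remember which cells were missing, forward-fill along each row, then prefix only the remembered cells. A self-contained sketch of that pattern using DataFrame.where is shown here; the variable names (was_missing, filled, labelled) are illustrative, and the sample frame is rebuilt from the table in the question.

```python
import numpy as np
import pandas as pd

taxonomy = pd.DataFrame({
    'Kingdom': ['Bacteria', 'Bacteria', 'Bacteria'],
    'Phylum': ['Firmicutes', 'Bacteroidetes', 'Bacteroidetes'],
    'Class': ['Bacilli', 'Bacteroidia', ''],
    'Order': ['Lactobacillales', 'Bacteroidales', ''],
    'Family': ['Lactobacillaceae', '', ''],
    'Genus': ['Lactobacillus', '', ''],
})

# Empty strings become real NaN so that ffill can see them.
taxonomy = taxonomy.replace('', np.nan)

# Remember which cells were missing before filling.
was_missing = taxonomy.isna()

# Carry the last valid value forward along each row.
filled = taxonomy.ffill(axis=1)

# Keep original cells as they are; prefix only the carried-forward ones.
labelled = filled.where(~was_missing, 'unknown_' + filled)
print(labelled)
```

Building the mask before filling is the important step, since the missing cells can no longer be told apart from the originals once ffill has run.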

How to enrich dataframe by adding columns in specific condition

I have two different datasets:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How can I get a dataset in the following format, so that I can build each user's taste profile and compare users by their similarity score?
+-------+------+---------+---------+--------+-----+
|user_id|Action|Adventure|Animation|Children|Drama|
+-------+------+---------+---------+--------+-----+
|  100  |  0   |    1    |    1    |   1    |  0  |
|  101  |  1   |    1    |    0    |   1    |  0  |
+-------+------+---------+---------+--------+-----+
Where df is the movies dataframe and dfu is the users dataframe
The 'genre' column needs to be split into a list with pandas.Series.str.split, and then using pandas.DataFrame.explode, transform each element of the list into a row, replicating index values.
pandas.merge the two dataframes on 'movie_id'
Use pandas.DataFrame.groupby on 'user_id' and 'genre' and aggregate by count.
Shape final
.unstack converts the groupby dataframe from long to wide format
.fillna replace NaN with 0
.astype changes the numeric values from float to int
Tested in python 3.10, pandas 1.4.3
import pandas as pd
# data
movies = {'movie_id': [1000, 1001, 1002],
'title': ['Toy Story', 'Jumanji', 'Iron Man'],
'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy', 'Action|Adventure|Sci-Fi']}
users = {'user_id': [100, 101, 101],
'movie_id': [1000, 1001, 1002],
'timestep': [20200728, 20200727, 20200726]}
# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)
# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')
# explode the lists in genre
df = df.explode('genre', ignore_index=True)
# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')
# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)
# display(final)
genre Action Adventure Animation Children Fantasy Sci-Fi
user_id
100 0 1 1 1 0 0
101 1 2 0 1 1 1
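The result above holds counts (user 101 has Adventure = 2), while the table in the question shows 0/1 indicators. If indicators are what is needed, one possible variation is to cross-tabulate and cap the counts at 1. This is a sketch rather than part of the original answer, and the names genres, merged and profile are illustrative.

```python
import pandas as pd

movies = pd.DataFrame({
    'movie_id': [1000, 1001, 1002],
    'title': ['Toy Story', 'Jumanji', 'Iron Man'],
    'genre': ['Adventure|Animation|Children',
              'Adventure|Children|Fantasy',
              'Action|Adventure|Sci-Fi'],
})
users = pd.DataFrame({
    'user_id': [100, 101, 101],
    'movie_id': [1000, 1001, 1002],
    'timestep': [20200728, 20200727, 20200726],
})

# One row per (movie, genre) pair.
genres = movies.assign(genre=movies['genre'].str.split('|')).explode('genre')

# Attach the genres of every movie each user has watched.
merged = users.merge(genres, on='movie_id')

# Count genres per user, then cap at 1 to get 0/1 indicators.
profile = pd.crosstab(merged['user_id'], merged['genre']).clip(upper=1)
print(profile)
```

Whether counts or indicators are the better input depends on the similarity score: counts keep how often a genre was watched, indicators only whether it was watched at all.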

How to split a column of dictionary type into two different pandas column of different type?

I have a dataframe with 2 columns (plus index) like this, it has around 14,000 lines.
Employee | RecordID
{'Id': 185, 'Title': 'Full Name'} | 9
I'd like to split the columns like this:
Id | Title | RecordID
185 | 'Full Name' | 9
I tried to use this solution:
df2 = pd.DataFrame(data_df["Employee"].values.tolist(), index=data_df.index)  # <- error
data_df = pd.concat([data_df, df2], axis = 1).drop(column, axis = 1)
but it gives this error on the df2 line
*** AttributeError: 'float' object has no attribute 'keys'
I have 2 theories: one, that it's because I have different column types in the employee dictionary; and two, that there are 3 records with an empty employee id, like this:
Employee | RecordID
nan | 7051
I need to keep those 3 records without an employee record and show their record Id, and in the final data_df show empty columns for employee id and employee name.
So in summary:
INPUT
Employee | RecordID
{'Id': 185, 'Title': 'Full Name'} | 9
nan | 7051
EXPECTED OUTPUT
Id | Title | RecordID
185 | 'Full Name' | 9
nan | nan | 7051
I made it work using data_df["Employee"].apply(pd.Series) but it's painfully slow.
Is there a way, without using pd.Series, to split a column of dictionaries (with mixed value types and some NaN entries) into separate columns in the parent pandas dataframe?
Thanks,
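You can drop the rows with a missing Employee first, build the new frame from the remaining dictionaries, and join it back on the index:
# only drop rows where Employee itself is missing
data_df1 = data_df.dropna(subset=["Employee"])
df2 = pd.DataFrame(data_df1["Employee"].values.tolist(), index=data_df1.index)
# rows without an Employee get NaN in the new columns
data_df = data_df.join(df2, how='left')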
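A self-contained sketch of this approach on the two sample rows from the question is shown below, including the final step of dropping the original Employee column and reordering. The names has_employee, expanded and result are illustrative.

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame({
    'Employee': [{'Id': 185, 'Title': 'Full Name'}, np.nan],
    'RecordID': [9, 7051],
})

# Rows that actually hold a dictionary.
has_employee = data_df['Employee'].notna()

# Expand only those dictionaries, keeping their original index labels.
expanded = pd.DataFrame(
    data_df.loc[has_employee, 'Employee'].tolist(),
    index=data_df.index[has_employee],
)

# Join back on the index; rows without an Employee get NaN in Id and Title.
result = data_df.drop(columns='Employee').join(expanded)
result = result[['Id', 'Title', 'RecordID']]
print(result)
```

Note that Id comes back as a float column (185.0), because NaN cannot be stored in a plain integer column.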

find rows that share values

I have a pandas dataframe that looks like this:
df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'], 'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'], ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many favefood's of other people overlap with their own.
I.e., for each person I want to find out how many other people have a non-empty intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be through a nested for-loop, i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Would there be any way to do this more efficiently using pandas notation? Thanks!
Logic behind it
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
s = df['favefood'].explode().str.get_dummies().groupby(level=0).sum()
s.dot(s.T).ne(0).sum(axis=1)-1
Out[84]:
0 3
1 2
2 2
3 1
4 0
dtype: int64
df['overlap']=s.dot(s.T).ne(0).sum(axis=1)-1
Method from sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s=pd.DataFrame(mlb.fit_transform(df['favefood']),columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1)-1
0 3
1 2
2 2
3 1
4 0
dtype: int64
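Both versions above compute s.dot(s.T), which is a dense matrix with one row and one column per person; at around 2 million rows that is on the order of 4 × 10^12 entries and is unlikely to fit in memory. One possible alternative, sketched here as an assumption rather than as part of the original answers, is an inverted index from each food to the people who like it. The names eaters and count_overlaps are hypothetical.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    'name': ['bob', 'tim', 'jane', 'john', 'andy'],
    'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'],
                 ['wendys', 'kfc'], ['tacobell', 'innout']],
})

# Inverted index: food -> set of row labels of the people who like it.
eaters = defaultdict(set)
for idx, foods in df['favefood'].items():
    for food in foods:
        eaters[food].add(idx)


def count_overlaps(idx, foods):
    """Count the other rows that share at least one food with row `idx`."""
    others = set()
    for food in foods:
        others |= eaters[food]
    others.discard(idx)  # a person does not overlap with themselves
    return len(others)


df['overlap'] = [count_overlaps(idx, foods)
                 for idx, foods in df['favefood'].items()]
print(df[['name', 'overlap']])
```

Memory use grows with the number of (person, food) pairs instead of the square of the number of people. It can still be slow when a single food is shared by a very large fraction of the rows, since each of those rows then builds a large set.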

How to create a new dataframe based on value_counts of a column in another dataframe but with certain conditions on other columns?

I have a pandas data-frame of tickets raised on a group of servers like this:
a b c Users Problem
0 data data data User A Server Down
1 data data data User B Server Down
2 date data data User C Memory Full
3 date data data User C Swap Full
4 date data data User D Unclassified
5 date data data User E Unclassified
6 data data data User B RAM Failure
I need to create another dataframe like this with the data grouped by the type of tickets and the count of tickets raised by only two users, A and B separately and a single column with the count for other users.
Expected new Dataframe:
+---------------+--------+--------+-------------+
| Type Of Error | User A | User B | Other Users |
+---------------+--------+--------+-------------+
| Server Down | 50 | 60 | 150 |
+---------------+--------+--------+-------------+
| Memory Full | 40 | 50 | 20 |
+---------------+--------+--------+-------------+
| Swap Full | 10 | 20 | 15 |
+---------------+--------+--------+-------------+
| Unclassified | 10 | 20 | 50 |
+---------------+--------+--------+-------------+
| | | | |
+---------------+--------+--------+-------------+
I've tried .value_counts() which provides total count of that type. I however need it to be based on the User.
Replace every user other than User A and User B with 'Other Users' using Series.where, and then use crosstab:
df['Users'] = df['Users'].where(df['Users'].isin(['User A','User B']), 'Other Users')
df = pd.crosstab(df['Problem'], df['Users'])[['User A','User B','Other Users']]
print (df)
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2
You could use pivot_table which is great at using aggregate functions:
users = df.Users.copy()
users[~users.isin(['User A', 'User B'])] = 'Other Users'
df.pivot_table(index='Problem', columns=users, aggfunc='count', values='a',
fill_value=0).reindex(['User A', 'User B', 'Other Users'], axis=1)
It gives:
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2
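Two details may be worth guarding against: the first answer overwrites the Users column of df, and selecting columns with [['User A','User B','Other Users']] raises a KeyError if one of those users has no tickets at all. A sketch that leaves the original frame untouched and fills absent columns with 0 follows; the names tickets, named, grouped_users and counts are illustrative, and only the two relevant columns of the sample data are rebuilt.

```python
import pandas as pd

tickets = pd.DataFrame({
    'Users': ['User A', 'User B', 'User C', 'User C',
              'User D', 'User E', 'User B'],
    'Problem': ['Server Down', 'Server Down', 'Memory Full', 'Swap Full',
                'Unclassified', 'Unclassified', 'RAM Failure'],
})

# Users that get their own column; everyone else is pooled.
named = ['User A', 'User B']

# Work on a separate Series so the original column stays unchanged.
grouped_users = tickets['Users'].where(tickets['Users'].isin(named),
                                       'Other Users')

# Count tickets per problem and user group; absent columns become 0.
counts = (pd.crosstab(tickets['Problem'], grouped_users)
          .reindex(columns=named + ['Other Users'], fill_value=0))
print(counts)
```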
