I have a dataframe with two columns:
key | value
"a" | 1
"a" | 2
"b" | 4
which I would like to map to a dictionary that would look like:
my_dict["a"] = [1,2]
my_dict["b"] = [4]
My current implementation is
my_dict = {}
for k in df["key"].unique():
    vals = df[df["key"] == k]["value"].tolist()
    my_dict[k] = vals
But this implementation takes a long time on my dataframe that has ~400k rows. Is there a better way to do this?
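A groupby-based approach avoids filtering the whole frame once per key and is usually much faster on a frame of this size; a minimal sketch, assuming the column is named "key" as in the example:
import pandas as pd
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 4]})
# group once and collect each group's values into a list
my_dict = df.groupby("key")["value"].apply(list).to_dict()
print(my_dict)  # {'a': [1, 2], 'b': [4]}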
What is a quick way to check if all the columns in a pandas dataframe are the same?
E.g. I have a dataframe with the columns a,b,c below, and I need to check that the columns are all the same, i.e. that a = b = c
+---+---+---+
| a | b | c |
+---+---+---+
| 5 | 5 | 5 |
| 7 | 7 | 7 |
| 9 | 9 | 9 |
+---+---+---+
I had thought of using apply to iterate over all the rows, but I am afraid it might be inefficient as it would be a non-vectorised loop.
I suspect looping over the columns would be quicker because I always have fewer columns than rows (a few dozen columns but hundreds of thousands of rows).
I have come up with the contraption below. I need to tidy it up and make it into a function, but it works - the question is whether there is a more elegant / faster way of doing it.
np.where(condition, 0, 1) returns 0 where the items match and 1 where they differ, so summing the output gives me the number of mismatches.
I iterate over all the columns (excluding the first), comparing them to the first.
The first output counts the matches/mismatches by column, the second by row.
If you add something like
df.iloc[3,2] = 100
after defining df, the output tells you that the row at index 3 of column c doesn't match
import numpy as np
import pandas as pd
df = pd.DataFrame()
x = np.arange(0,20)
df['a'] = x
df['b'] = x
df['c'] = x
df['d'] = x
#df.iloc[3,2] = 100
cols = df.columns
out = pd.DataFrame()
for c in np.arange(1, len(cols)):
    out[cols[c]] = np.where(df[cols[0]] == df[cols[c]], 0, 1)
print(out.sum(axis = 0))
print(out.sum(axis = 1))
Let's try duplicated:
(~df.T.duplicated()).sum()==1
# True
You can transpose and use drop_duplicates(), then check that the length is 1. That would mean that all columns are the same:
In [1254]: len(df.T.drop_duplicates()) == 1
Out[1254]: True
Use DataFrame.duplicated + DataFrame.all:
df.T.duplicated(keep=False).all()
#True
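For completeness, a sketch of one more vectorized option (not taken from the answers above): compare every column against the first with DataFrame.eq and require all entries to be equal:
# compare each column to the first column, aligned on the index
all_equal = df.eq(df.iloc[:, 0], axis=0).all().all()
print(all_equal)  # True for the df defined above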
I have a very large dataset (20GB+) and I need to select all distinct values from column A where there are at least two distinct values in column B for each distinct value in column A.
For the following dataframe:
| A | B |
|---|---|
| x | 1 |
| x | 2 |
| y | 1 |
| y | 1 |
It should return only x because it has two distinct values in column B, while y has only one distinct value.
The following code does the trick, but it takes a very long time (as in hours) since the dataset is very large:
def get_values(list_of_distinct_values, dataframe):
    valid_values = []
    for value in list_of_distinct_values:
        value_df = dataframe.loc[dataframe['A'] == value]
        if len(value_df.groupby('B')) > 1:
            valid_values.append(value)
    return valid_values
Can anybody suggest a faster way of doing this?
I think you can solve your problem with the DataFrame method drop_duplicates(). You need to use the parameters subset and keep (to remove all the rows with duplicates):
import pandas as pd
df = pd.DataFrame({
    'A': ["x", "x", "y", "y"],
    'B': [1, 2, 1, 1],
})
df.drop_duplicates(subset=['A', 'B'], keep=False).drop_duplicates(subset=['A'])['A']
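Alternatively, a sketch that expresses the "at least two distinct B values per A" condition directly with groupby/nunique (same column names as in the question; whether it fits in memory for the 20GB dataset depends on your setup):
# count distinct B values for each A, then keep the A values with more than one
distinct_counts = df.groupby('A')['B'].nunique()
valid_values = distinct_counts[distinct_counts > 1].index.tolist()
print(valid_values)  # ['x']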
In a Pandas.DataFrame, I would like to find the index of the row whose value in a given column is closest to (but below) a specified value. Specifically, say I am given the number 40 and the DataFrame df:
| | x |
|---:|----:|
| 0 | 11 |
| 1 | 15 |
| 2 | 17 |
| 3 | 25 |
| 4 | 54 |
I want to find the index of the row such that df["x"] is lower but as close as possible to 40. Here, the answer would be 3 because df.loc[3, 'x'] = 25 is smaller than the given number 40 but closest to it.
My dataframe has other columns, but I can assume that the column "x" is increasing.
For an exact match, I did (correct me if there is a better method):
matches = df[(df.x == number)].index.tolist()
if matches:
    result = matches[0]
But for the general case, I do not know how to do it in a "vectorized" way.
Filter rows below 40 by Series.lt in boolean indexing and get the maximal index value by Series.idxmax:
a = df.loc[df['x'].lt(40), 'x'].idxmax()
print (a)
3
To improve performance, it is possible to use numpy.where with np.max; this solution works with a default index:
a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
If the index is not the default RangeIndex:
df = pd.DataFrame({'x':[11,15,17,25,54]}, index=list('abcde'))
a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
print (df.index[a])
d
How about this:
import pandas as pd
data = {'x':[0,1,2,3,4,20,50]}
df = pd.DataFrame(data)
# keep only the rows strictly below the target value
sub_df = df[df['x'] < 40]
# get the index of the maximum remaining value in column 'x'
idx = sub_df['x'].idxmax()
print(idx)
Use Series.where to mask values greater than or equal to n, then use Series.idxmax to obtain the closest one:
n=40
val = df['x'].where(df['x'].lt(n)).idxmax()
print(val)
3
We could also use Series.mask:
df['x'].mask(df['x'].ge(40)).idxmax()
or a callable with loc[]:
df['x'].loc[lambda x: x.lt(40)].idxmax()
#alternative
#df.loc[lambda col: col['x'].lt(40),'x'].idxmax()
If the index is not the default RangeIndex:
i = df.loc[lambda col: col['x'].lt(40),'x'].reset_index(drop=True).idxmax()
df.index[i]
I have a pandas DataFrame read from a CSV (gist with a small sample):
| title | genres |
--------------------------------------------------------
| %title1% |[{id: 1, name: '...'}, {id: 2, name: '...'}]|
| %title2% |[{id: 2, name: '...'}, {id: 4, name: '...'}]|
...
| %title9% |[{id: 3, name: '...'}, {id: 9, name: '...'}]|
Each title can be associated with a varying number of genres (one or more).
The task is to convert the arrays from the genres column into separate columns and put ones (or Trues) for each genre:
| title | genre_1 | genre_2 | genre_3 | ... | genre_9 |
---------------------------------------------------------
| %title1% | 1 | 1 | 0 | ... | 0 |
| %title2% | 1 | 0 | 0 | ... | 0 |
...
| %title9% | 0 | 0 | 1 | ... | 1 |
The set of genres is fixed (about 20 items).
Naive method is:
Create the set of all genres
Create columns for each genre filled with 0
For each row in the DataFrame, check which genres appear in the genres column and fill the column for each of those genres with 1.
This approach looks a bit weird.
I think pandas has a more suitable method for this.
As far as I know, there is no way to perform JSON-deserialization on a Pandas dataframe in a vectorized fashion. One way you ought to be able to do this is with .iterrows() which will let you do this in one loop (albeit slower than most built-in pandas operations).
import json
df = # ... your dataframe
for index, row in df.iterrows():
    # deserialize the JSON string
    json_data = json.loads(row['genres'])
    # add a new column for each of the genres (Pandas is okay with it being sparse)
    for genre in json_data:
        df.loc[index, genre['name']] = 1  # update the row in the df itself
df.drop(['genres'], axis=1, inplace=True)
Note that empty cells will be filled with NaN, not 0 -- you should use .fillna() to change this. A brief example with a vaguely similar dataframe looks like:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([{'title': 'hello', 'json': '{"foo": "bar"}'}, {'title': 'world', 'json': '{"foo": "bar", "baz": "boo"}'}])
In [3]: df.head()
Out[3]:
json title
0 {"foo": "bar"} hello
1 {"foo": "bar", "baz": "boo"} world
In [4]: import json
   ...: for index, row in df.iterrows():
   ...:     data = json.loads(row['json'])
   ...:     for k, v in data.items():
   ...:         df.loc[index, k] = v
   ...: df.drop(['json'], axis=1, inplace=True)
In [5]: df.head()
Out[5]:
title foo baz
0 hello bar NaN
1 world bar boo
If your CSV data looks like the linked sample (I added quotes to the keys of the genres JSON just to work easily with the json package; since that is not the main problem, you can do it as preprocessing), you will have to iterate through all the rows of the input DataFrame.
for index, row in inputDf.iterrows():
    fullDataFrame = pd.concat([fullDataFrame, get_dataframe_for_a_row(row)])
In the get_dataframe_for_a_row function:
prepare a DataFrame with a title column holding row['title'],
add columns with names formed by appending the id to 'genre_',
and assign them a value of 1.
Then build a DataFrame for each row and concat them into a full DataFrame.
pd.concat() concatenates the DataFrames obtained from each row and merges columns that already exist.
Finally, fullDataFrame.fillna(0) replaces NaN with 0.
Your final DataFrame will have one genre_<id> column per genre, filled with 1s and 0s.
Here is the full code:
import pandas as pd
import json
inputDf = pd.read_csv('title_genre.csv')
def labels_for_genre(a):
    # build a 'genre_<id>' label for each genre dict in the list
    labels = []
    for i in range(0, len(a)):
        label = 'genre' + '_' + str(a[i]['id'])
        labels.append(label)
    return labels
def get_dataframe_for_a_row(row):
    labels = labels_for_genre(json.loads(row['genres']))
    tempDf = pd.DataFrame()
    tempDf['title'] = [row['title']]
    for label in labels:
        tempDf[label] = ['1']
    return tempDf
fullDataFrame = pd.DataFrame()
for index, row in inputDf.iterrows():
    fullDataFrame = pd.concat([fullDataFrame, get_dataframe_for_a_row(row)])
fullDataFrame = fullDataFrame.fillna(0)
Full working solution without iterrows:
import pandas as pd
import itertools
import json
# read data
movies_df = pd.read_csv('https://gist.githubusercontent.com/feeeper/9c7b1e8f8a4cc262f17675ef0f6e1124/raw/022c0d45c660970ca55e889cd763ce37a54cc73b/example.csv', converters={ 'genres': json.loads })
# get genres for all items
all_genres_entries = list(itertools.chain.from_iterable(movies_df['genres'].values))
# create the list with unique genres
genres = list({v['id']:v for v in all_genres_entries}.values())
# fill genres columns
for genre in genres:
    movies_df['genre_{}'.format(genre['id'])] = movies_df['genres'].apply(lambda x: 1 if genre in x else 0)
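For what it's worth, here is a sketch of yet another iterrows-free route using explode plus crosstab (assumes pandas 0.25+ for explode and that genres has already been parsed by the json.loads converter as above):
# one row per (title, genre) pair
exploded = movies_df[['title', 'genres']].explode('genres')
exploded['genre_id'] = exploded['genres'].apply(lambda g: 'genre_{}'.format(g['id']))
# pivot to one column per genre, capped at 1 in case a genre repeats for a title
one_hot = pd.crosstab(exploded['title'], exploded['genre_id']).clip(upper=1).reset_index()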
I am iterating with a for loop over a table in an HTML file, and in the first iteration I have the following values in the variables name, gene_name_1, value1, gene_name_2, value2.
keyX and valueX are part of a dictionary but I don't know how many keys and values are present for each iteration.
My idea was to use a dictionary which looks more or less like this:
d = {'gene_name_1': 2, 'gene_name_2': 5}
But now I realize that the values of the dictionary would change in every loop iteration, so it could look like this in the next loop:
d = {'gene_name_1': 3, 'gene_name_2': 0, 'gene_name_3': 9}
So I am not quite sure whether a dictionary is the best data structure here. What I would like to obtain is a pandas data frame that looks more or less like this:
          | gene_name_1 | gene_name_2 | gene_name_3 | ...
organism1 |           2 |           5 |           0 | ...
organism2 |           3 |           0 |           9 | ...
...
Just to clarify: 0 is for those names where the key does not appear.
My problem is that I don't know the column names or the number of columns. I wanted to start with an empty data frame, but I am not sure that is the best way to do it.
How can I start a data frame when I don't know the names or the number of columns?
I hope this was understandable, if I should clarify somehow, please let me know.
I think you need to create a list of dicts and pass it to the DataFrame constructor, then replace NaN with 0 using fillna:
d = {'gene_name_1': 2, 'gene_name_2': 5}
d1 = {'gene_name_1': 3, 'gene_name_2': 0, 'gene_name_3': 9}
# in practice, build this list inside your loop
L = [d, d1]
df = pd.DataFrame(L).fillna(0)
print (df)
gene_name_1 gene_name_2 gene_name_3
0 2 5 0.0
1 3 0 9.0
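If you also want the organism names as the row index (assuming you collect them in a list alongside the dicts as you loop), you can pass them to the constructor, e.g.:
organisms = ['organism1', 'organism2']
df = pd.DataFrame(L, index=organisms).fillna(0)
print(df)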