I have two pandas dataframes:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": ["1", "2", np.nan, "3"]})
df2 = pd.DataFrame({"col1": [2.0, 3.0, 4.0, np.nan]})
I would like to know how many values of df1.col1 exist in df2.col1. In this case it should be 2 as I want "2" and 2.0 to be seen as equal.
I do have a working solution, but because I think I'll need this more often (and for learning purposes, of course), I wanted to ask if there is a more convenient way to do it.
df1.col1[df1.col1.notnull()].isin(df2.col1[df2.col1.notnull()].astype(int).astype(str)).value_counts()
Use Series.dropna and convert the strings to floats if you are working with integer-like values and missing data:
a = df1.col1.dropna().astype(float).isin(df2.col1.dropna()).value_counts()
Or:
a = df1.col1.dropna().isin(df2.col1.dropna().astype(int).astype(str)).value_counts()
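Note that value_counts() counts both the True and False entries; if you only want the number of matches, summing the boolean mask is a small variation on the same pattern (a minimal sketch using the sample data above):
# True entries mark df1 values that also appear in df2; summing counts them
matches = df1.col1.dropna().astype(float).isin(df2.col1.dropna()).sum()
print(matches)  # 2 for the sample data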
I have two different lists and an id:
id = 1
timestamps = [1,2,3,4]
values = ['A','B','C','D']
What I want to do with them is concatenating them into a pandas DataFrame so that:
| id | timestamp | value |
|:---|:----------|:------|
| 1  | 1         | A     |
| 1  | 2         | B     |
| 1  | 3         | C     |
| 1  | 4         | D     |
In each iteration of a for loop I will produce a new pair of lists and a new ID, which should then be concatenated to the existing data frame. The pseudocode would look like this:
# for each sample in group:
# do some calculation to create the two lists
# merge the lists into the data frame, using the ID as index
What I have tried so far is using pd.concat like this:
pd.concat([
    existing_dataframe,
    pd.DataFrame({
        "id": id,
        "timestamp": timestamps,
        "value": values,
    }),
])
But there seems to be a problem that the ID field and the other lists are of different lengths. Thanks for your help!
Use:
pd.DataFrame({"timestamp": timestamps, "value": values}) \
    .assign(id=id) \
    .reindex(columns=["id", "timestamp", "value"])
Or:
df = pd.DataFrame({"timestamp": timestamps, "value": values})
df.insert(loc=0, column='id', value=id)
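To cover the loop from the pseudocode in the question, a common pattern is to build one small frame per sample and concatenate once at the end, which is usually faster than concatenating inside the loop. A minimal sketch; group and compute_lists are placeholders for your own iterable and calculation:
frames = []
for sample_id, sample in group:                  # placeholder iterable
    timestamps, values = compute_lists(sample)   # placeholder calculation
    frames.append(
        pd.DataFrame({"timestamp": timestamps, "value": values}).assign(id=sample_id)
    )

result = pd.concat(frames, ignore_index=True)[["id", "timestamp", "value"]]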
I have the dataframe below, and I would like to transform the rows of the Custom Field column into columns (a pivot-table-style reshape). When I try the pandas pivot_table function I get an error:
import pandas as pd
data = {
"Custom Field": ["CF1", "CF2", "CF3"],
"id": ["RSA", "RSB", "RSC"],
"Name": ["Wilson", "Junior", "Otavio"]
}
### create the dataframe ###
df = pd.DataFrame(data)
print(df)
df2 = df.pivot_table(columns=['Custom Field'], index=['Name'])
print(df2)
I suspect it is because I am working with Strings.
Any suggestions?
Thanks in advance.
You need pivot, not pivot_table. The latter does aggregation on possibly repeating values whereas the former is just a rearrangement of the values and fails for duplicate values.
df.pivot(columns=['Custom Field'], index=['Name'])
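If you only need the id values and prefer single-level column names, passing values explicitly is a common variant (a small sketch; the output shown is what I would expect for the sample data):
df.pivot(index='Name', columns='Custom Field', values='id')

Custom Field  CF1  CF2  CF3
Name
Junior        NaN  RSB  NaN
Otavio        NaN  NaN  RSC
Wilson        RSA  NaN  NaN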
Update as per comment: if there are multiple values per cell, you need to use pivot_table and specify an appropriate aggregation function, e.g. concatenating the string values. You can also specify a fill value for empty cells (instead of NaN):
df = pd.DataFrame({"Custom Field": ["CF1", "CF2", "CF3", "CF1"],
                   "id": ["RSA", "RSB", "RSC", "RSD"],
                   "Name": ["Wilson", "Junior", "Otavio", "Wilson"]})
df.pivot_table(columns=['Custom Field'], index=['Name'], aggfunc=','.join, fill_value='-')
                   id
Custom Field      CF1  CF2  CF3
Name
Junior              -  RSB    -
Otavio              -    -  RSC
Wilson        RSA,RSD    -    -
I'm struggling with the correct syntax to flatten some data.
I have a dlt table with a column (named lorem for the sake of the example) where each row looks like this:
[{"field1": {"field1_1": null, "field1_2": null},
"field2": "blabla", "field3": 13209914,
"field4": {"field4_1": null, "field4_2": null}, "field5": 4}, ...
]
I want to create a new table based on the first one, with one row per element of the array shown above. The table should look like:
|field1_1|field1_2|field2|field3|field4_1|field4_2|field5|
|:-------|:-------|:-----|:-----|:-------|:-------|:------|
|null|null|blabla|13209914|null|null|4|
However, when I explode like select(explode("lorem")), I do not get the wanted output; instead I get the exploded column and the other fields, but not anything inside field4.
My question is, in what other way should I be flattening this data?
I can provide a clearer example if needed.
Use withColumn to add the additional columns you need. A simple example:
%%pyspark
from pyspark.sql.functions import col
df = spark.read.json("abfss://somelake@somestorage.dfs.core.windows.net/raw/flattenJson.json")
df2 = df \
.withColumn("field4_1", col("field4.field4_1")) \
.withColumn("field4_2", col("field4.field4_2"))
df2.show()
My results:
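If you also need one row per element of the array (as in the question), a sketch along these lines might help; it assumes the array column is literally named lorem and that the schema matches the sample above:
from pyspark.sql.functions import col, explode

# one row per array element, then pull each nested field into its own column
exploded = df.select(explode(col("lorem")).alias("item"))
flat = exploded.select(
    col("item.field1.field1_1").alias("field1_1"),
    col("item.field1.field1_2").alias("field1_2"),
    col("item.field2").alias("field2"),
    col("item.field3").alias("field3"),
    col("item.field4.field4_1").alias("field4_1"),
    col("item.field4.field4_2").alias("field4_2"),
    col("item.field5").alias("field5"),
)
flat.show()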
I'm looking for a clean, fast way to expand a pandas dataframe column which contains a json object (essentially a dict of nested dicts), so I could have one column for each element in the json column in json normalized form; however, this needs to retain all of the original dataframe columns as well. In some instances, this dict might have a common identifier I could use to merge with the original dataframe, but not always. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame([
{
'col1': 'a',
'col2': {'col2.1': 'a1', 'col2.2': {'col2.2.1': 'a2.1', 'col2.2.2': 'a2.2'}},
'col3': '3a'
},
{
'col1': 'b',
'col2': np.nan,
'col3': '3b'
},
{
'col1': 'c',
'col2': {'col2.1': 'c1', 'col2.2': {'col2.2.1': np.nan, 'col2.2.2': 'c2.2'}},
'col3': '3c'
}
])
Here is a sample dataframe. As you can see, col2 is a dict in all of these cases which has another nested dict inside of it, or could be a null value, containing nested elements I would like to be able to access. (For the nulls, I would want to be able to handle them at any level--entire elements in the dataframe, or just specific elements in the row.) In this case, they have no ID that could link up to the original dataframe. My end goal would be essentially to have this:
final = pd.DataFrame([
{
'col1': 'a',
'col2.1': 'a1',
'col2.2.col2.2.1': 'a2.1',
'col2.2.col2.2.2': 'a2.2',
'col3': '3a'
},
{
'col1': 'b',
'col2.1': np.nan,
'col2.2.col2.2.1': np.nan,
'col2.2.col2.2.2': np.nan,
'col3': '3b'
},
{
'col1': 'c',
'col2.1': 'c1',
'col2.2.col2.2.1': np.nan,
'col2.2.col2.2.2': 'c2.2',
'col3': '3c'
}
])
In my instance, the dict could have up to 50 nested key-value pairs, and I might only need to access a few of them. Additionally, I have about 50 - 100 other columns of data I need to preserve with these new columns (so an end goal of around 100 - 150). So I suppose there might be two methods I'd be looking for--getting a column for each value in the dict, or getting a column for a select few. The former option I haven't yet found a great workaround for; I've looked at some prior answers but found them to be rather confusing, and most threw errors. This seems especially difficult when there are dicts nested inside of the column. To attempt the second solution, I tried the following code:
def get_val_from_dict(row, col, label):
if pd.isnull(row[col]):
return np.nan
norm = pd.json_normalize(row[col])
try:
return norm[label]
except:
return np.nan
needed_cols = ['col2.1', 'col2.2.col2.2.1', 'col2.2.col2.2.2']
for label in needed_cols:
df[label] = df.apply(get_val_from_dict, args = ('col2', label), axis = 1)
This seemed to work for this example, and I'm perfectly happy with the output, but for my actual dataframe which had substantially more data, this seemed a bit slow--and, I would imagine, is not a great or scalable solution. Would anyone be able to offer an alternative to this sluggish approach to resolving the issue I'm having?
(Also, apologies about the massive amount of nesting in my naming here. If helpful, I am adding several images of the dataframes below: the original, then the target, and then the current output.)
Instead of using apply or pd.json_normalize on the column that holds a dictionary, convert the whole data frame to a list of records and use pd.json_normalize on that, then pick the fields you wish to keep. This works because, while the dictionary column of any given row may be null, the row as a whole will not be.
Example:
# note that this method also prefixes an extra `col2.`
# at the start of the names of the denested data,
# which is not present in the example output
# the column renaming conforms to your desired name.
import re
final_cols = ['col1', 'col2.col2.1', 'col2.col2.2.col2.2.1', 'col2.col2.2.col2.2.2', 'col3']
out = pd.json_normalize(df.to_dict(orient='records'))[final_cols]
out.rename(columns=lambda x: re.sub(r'^col2\.', '', x), inplace=True)
out
# out:
col1 col2.1 col2.2.col2.2.1 col2.2.col2.2.2 col3
0 a a1 a2.1 a2.2 3a
1 b NaN NaN NaN 3b
2 c c1 NaN c2.2 3c
but for my actual dataframe which had substantially more data, this was quite slow
Right now I have 1000 rows of data, each row has about 100 columns, and then the column I want to expand has about 50 nested key/value pairs in it. I would expect that the data could scale up to 100k rows with the same number of columns over the next year or so, and so I'm hoping to have a scalable process ready to go at that point
pd.json_normalize should be faster than your attempt, but it is not faster than doing the flattening in pure Python, so you might get more performance if you write a custom transform function and construct the dataframe as below.
out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))
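Here transform is just a placeholder; a minimal sketch of what such a function could look like for the sample schema above (flatten_dict and the hard-coded column names are illustrative, not pandas API):
def flatten_dict(d, prefix=''):
    # recursively flatten nested dicts, joining keys with '.'
    out = {}
    for key, value in d.items():
        name = f'{prefix}{key}'
        if isinstance(value, dict):
            out.update(flatten_dict(value, prefix=f'{name}.'))
        else:
            out[name] = value
    return out

def transform(record):
    row = {'col1': record['col1'], 'col3': record['col3']}
    nested = record['col2']
    # col2 may be NaN for some rows; leaving those keys out lets pandas fill NaN
    if isinstance(nested, dict):
        row.update(flatten_dict(nested))
    return row

out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))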
It seems to me that it would be eminently useful for pandas to support the idea of projection (omitting or selecting columns) during data parsing.
Many JSON datasets I find have a ton of extraneous fields I don't need, or I need to parse a specific field in the nested structure.
What I do currently is pipe through jq to create a file that contains only the fields I need. This becomes the "cleaned" file.
I would prefer a method where I didn't have to create a new cleaned file every time I want to look at a particular facet or set of facets, but I could instead tell pandas to load the JSON path .data.interesting and only project fields: A B C.
As an example:
{
  "data": {
    "not interesting": ["milk", "yogurt", "dirt"],
    "interesting": [{ "A": "moonlanding", "B": "1956", "C": 100000, "D": "meh" }]
  }
}
Unfortunately, it seems like there's no easy way to do it on load, but if you're okay with doing it immediately after...
# drop by index
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# drop by name
df.drop(['B', 'C'], axis=1, inplace=True)
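For the nested-path part specifically, loading with the json module and normalizing just the branch you care about is a common workaround; a small sketch, assuming the file layout shown above (the file name is illustrative):
import json
import pandas as pd

with open("dataset.json") as f:
    raw = json.load(f)

# drill down to the nested path, then project only the wanted fields
df = pd.json_normalize(raw["data"]["interesting"])[["A", "B", "C"]]
print(df)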