I have two different lists and an id:
id = 1
timestamps = [1,2,3,4]
values = ['A','B','C','D']
What I want to do with them is combine them into a pandas DataFrame, like this:
id  timestamp  value
1   1          A
1   2          B
1   3          C
1   4          D
By iterating in a for loop I will produce a new pair of lists and a new ID with each iteration, which should then be concatenated to the existing DataFrame. The pseudocode would look like this:
# for each sample in group:
# do some calculation to create the two lists
# merge the lists into the data frame, using the ID as index
What I tried to do so far is using concatenate like this:
pd.concat([
    existing_dataframe,
    pd.DataFrame(
        {
            "id": id,
            "timestamp": timestamps,
            "value": values,
        }
    ),
])
But there seems to be a problem: the id field and the other lists are of different lengths. Thanks for your help!
Use:
pd.DataFrame(
    {
        "timestamp": timestamps,
        "value": values,
    }
).assign(id=id).reindex(columns=["id", "timestamp", "value"])
Or:
df = pd.DataFrame(
    {
        "timestamp": timestamps,
        "value": values,
    }
)
df.insert(loc=0, column='id', value=id)
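A note on the loop: repeatedly concatenating to a growing DataFrame copies all the data on every iteration. A common pattern is to collect the per-iteration frames in a list and call pd.concat once at the end. A minimal sketch of the pseudocode above, where group and compute_lists are hypothetical stand-ins for your samples and your per-sample calculation:

import pandas as pd

frames = []
# for each sample in group:
for sample_id, sample in enumerate(group):
    # do some calculation to create the two lists (hypothetical helper)
    timestamps, values = compute_lists(sample)
    # build one small frame per sample, tagging it with its id
    frames.append(
        pd.DataFrame({"timestamp": timestamps, "value": values}).assign(id=sample_id)
    )

# concatenate once, then order the columns
result = pd.concat(frames, ignore_index=True)[["id", "timestamp", "value"]]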
I'm cleaning data (from a CSV file) in pandas, and one of the columns (pic of its first 4 rows) has hundreds of values in each row, separated by commas.
I used str.split(',', expand=True) and was able to get the values spread across various columns. However, rows of one column are shifting under another column (as can be seen in pic 2).
Is there any method to get the values under their respective columns?
Note: each row is associated with a unique ID.
I've been stuck on this problem for quite some time and couldn't resolve it. Any help would be highly appreciated!
Edit 1: TL;DR
-- Input -- the first 2 rows of the column, as an example --
{"crap1": 12, "NAME": "John", "AGE": "30","SEX": "M", "crap2": 34, ....... "ID": 01}
{"crap1": 56, "NAME": "Anna", "AGE": "25","SEX": "F", "crap2": 78, ....... "ID": 02}
-- Desired Output -- Derive 4 columns from 1, based on values in each row
NAME | AGE | SEX | ID
John | 30 | M | 01
Anna | 25 | F | 02
You can try expanding the column with multiple entries into a separate dataframe and then joining them back into the original dataframe.
df2 = df.col1.str.split(',',expand=True)
During this, you can also drop the original column that you wanted to expand and give the new columns meaningful names.
df2.columns = ['col2_%d'%idx for idx,__ in enumerate(df2.columns)]
df = df.drop(columns=['col1'])
df = pd.concat([df,df2],axis=1)
Since your example was an image, I couldn't test it out on that specific case. Here's a small working example to illustrate the idea :D
import pandas as pd


def get_example_data():
    df = pd.DataFrame(
        {
            'col1': ['abc', 'def', 'ghi,jkl', 'abc,def', 'def'],
            'col2': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ']
        }
    )
    return df


def clean_dataframe(df):
    # expand the column into a separate dataframe
    df2 = df.col1.str.split(',', expand=True)
    print(df2)
    # in case you would like to retain the original column name: col1 --> col1_0, col1_1
    df2.columns = ['col1_%d' % idx for idx, __ in enumerate(df2.columns)]
    print(df2)
    # drop the original column
    df = df.drop(columns=['col1'])
    # concat the expanded columns
    df = pd.concat([df, df2], axis=1)
    print(df)
    return df


if __name__ == '__main__':
    df = get_example_data()
    print(df)
    df = clean_dataframe(df)
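For the dict-like rows shown in Edit 1, splitting on commas alone will misalign values because the keys vary per row; parsing each row into a real dict keeps every value under its own column. A minimal sketch under that assumption, with quoted IDs so the rows are valid Python literals (column and key names mirror the example):

import ast

import pandas as pd

# Hypothetical input mirroring "Edit 1": each row of col1 is a dict-like string
df = pd.DataFrame({'col1': [
    '{"crap1": 12, "NAME": "John", "AGE": "30", "SEX": "M", "crap2": 34, "ID": "01"}',
    '{"crap1": 56, "NAME": "Anna", "AGE": "25", "SEX": "F", "crap2": 78, "ID": "02"}',
]})

# Parse each string into a dict, then normalize into columns;
# keys become column names, so values always land under the right column
parsed = df['col1'].apply(ast.literal_eval)
expanded = pd.json_normalize(parsed.tolist())

# Keep only the columns of interest
print(expanded[['NAME', 'AGE', 'SEX', 'ID']])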
I have a df with the origin and destination between two points, and I want to convert the strings to a numerical index; I also need a representation to convert back, for model interpretation.
df1 = pd.DataFrame({"Origin": ["London", "Liverpool", "Paris", "..."], "Destination": ["Liverpool", "Paris", "Liverpool", "..."]})
I separately created a new index on the sorted values.
df2 = pd.DataFrame({"Location": ["Liverpool", "London", "Paris", "..."], "Idx": ["1", "2", "3", "..."]})
What I want to get is this:
df3 = pd.DataFrame({"Origin": ["1", "2", "3", "..."], "Destination": ["1", "3", "1", "..."]})
I am sure there is a simpler way of doing this, but the only two methods I can think of are: do a left join onto the Origin column (matching Origin to Location) and the same for Destination, then remove the extraneous columns; or loop over every item in df1 and df2 and replace matching values. I've done the looped version and it works, but it's not very fast, which is to be expected.
I am sure there must be an easier way to replace these values but I am drawing a complete blank.
You can use .map():
mapping = dict(zip(df2.Location, df2.Idx))
df1.Origin = df1.Origin.map(mapping)
df1.Destination = df1.Destination.map(mapping)
print(df1)
Prints:
Origin Destination
0 2 1
1 1 3
2 3 1
3 ... ...
Or "bulk" .replace():
df1 = df1.replace(mapping)
print(df1)
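If df2 does not exist yet, the location-to-index mapping can also be built straight from the data. A minimal sketch, numbering the sorted unique locations from 1 as in df2 above (here the indices come out as integers rather than strings):

import pandas as pd

df1 = pd.DataFrame({"Origin": ["London", "Liverpool", "Paris"],
                    "Destination": ["Liverpool", "Paris", "Liverpool"]})

# collect every location that appears in either column
locations = sorted(pd.unique(df1[["Origin", "Destination"]].values.ravel()))
mapping = {loc: i + 1 for i, loc in enumerate(locations)}

df1 = df1.replace(mapping)  # or .map() per column, as above
print(mapping)              # keep this to convert back for interpretation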
I have a CSV column called ref_type, as shown in the screenshot below, with mixed types: some rows are plain strings and other rows are JSON. I am reading this CSV using pandas' read_csv method, which infers the type as object.
I would like to convert the JSON part as below.
Please help me parse the above scenario.
Thanks in advance.
I found a solution; it's not the best, but it works.
I already have a flatten-JSON function, shown below:
import json

import pandas as pd


def flatten_json_columns(df, json_cols, custom_df):
    """
    This function flattens JSON columns into individual columns.
    It merges the flattened dataframe with the expected dataframe to capture columns missing from the JSON.
    :param df: CSV raw dataframe
    :param json_cols: custom data columns in the CSVs
    :param custom_df: expected dataframe
    :return: pandas dataframe with flattened columns
    """
    # Loop through all JSON columns
    for column in json_cols:
        if not df[column].isnull().all():
            # Replace None and NaN with empty braces
            df[column].fillna(value='{}', inplace=True)
            # Deserialize each str containing a JSON document into a Python object
            df[column] = df[column].apply(json.loads)
            # Normalize the semi-structured JSON data into a flat table
            column_as_df = pd.json_normalize(df[column])
            # Prefix each sub-column name with the main column name
            column_as_df.columns = [f"{column}_{subcolumn}" for subcolumn in column_as_df.columns]
            # Merge the extracted result with the expected fields
            result_df = pd.merge(column_as_df, custom_df, how='left')
            # Merge the flattened dataframe with the original dataframe (the temp column is dropped later)
            df = df.merge(result_df, right_index=True, left_index=True)
        else:
            df = pd.concat([df, custom_df], axis=1)
    # Return the dataframe with the flattened columns
    return df
My dataframe looks like below.
I created another column called ref_type_json from ref_type, keeping only the JSON rows and ignoring all plain strings; instead of the strings I returned None:
ref_type_df['ref_type_json'] = [column if column[0] == '{' else None for column in ref_type_df['ref_type']]
Now ref_type_df looks as below.
I also created an empty expected dataframe, so that the output of the flatten-JSON function aligns with the expected columns:
ref_type_expected = {
    'ref_type_json_fromNumber': [],
    'ref_type_json_toNumber': [],
    'ref_type_json_comment': []
}
ref_type_expected_df = pd.DataFrame.from_dict(ref_type_expected)
Finally, I invoked the flatten-JSON function, which converts the JSON into columns:
result_df = flatten_json_columns(df=ref_type_df,
                                 json_cols=['ref_type_json'],
                                 custom_df=ref_type_expected_df)
result_df = result_df.drop('ref_type_json', axis=1)
My result dataframe looks as below.
Please let me know if you have a better solution for it.
I would just build a dataframe containing the new columns by hand and join it to the first one. Unfortunately, you have not provided copyable data, so I just used mine.
Original df:
df = pd.DataFrame({'ref': ['Outcomes', 'API-TEST', '{"from":"abc", "to": "def"}',
                           'Manual(add)', '{"from": "gh", "to": "ij"}', 'Migration']})
Giving:
ref
0 Outcomes
1 API-TEST
2 {"from":"abc", "to": "def"}
3 Manual(add)
4 {"from": "gh", "to": "ij"}
5 Migration
Extract only the JSON data from the ref column:
import json

data = []     # future data of the dataframe
ix = []       # future index
cols = set()  # future columns

for name, s in df[['ref']].iterrows():
    try:
        d = json.loads(s['ref'])
        ix.append(name)  # if we could decode, feed the future dataframe
        cols.update(set(d.keys()))
        data.append(d)
    except json.JSONDecodeError:
        pass  # else ignore the line

# a set is not accepted as columns, so convert it to a list
df = df.join(pd.DataFrame(data, ix, list(cols)), how='left')
gives:
ref to from
0 Outcomes NaN NaN
1 API-TEST NaN NaN
2 {"from":"abc", "to": "def"} def abc
3 Manual(add) NaN NaN
4 {"from": "gh", "to": "ij"} ij gh
5 Migration NaN NaN
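An alternative sketch of the same idea using pd.json_normalize, with a small helper (try_loads is a name made up here) that returns an empty dict for rows that are not JSON:

import json

import pandas as pd

def try_loads(s):
    # decode a JSON string, returning an empty dict for non-JSON values
    try:
        return json.loads(s)
    except (json.JSONDecodeError, TypeError):
        return {}

# rows that fail to decode become all-NaN in the expanded columns
expanded = pd.json_normalize(df['ref'].apply(try_loads).tolist())
df = df.join(expanded)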
I have a DataFrame like below:
rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({"ID": ["1", "2", "1", "2", "2"],
                   "category": ["A", "B", "A", "C", "B"],
                   "status": ["active", "finished", "active", "finished", "other"],
                   "Date": rng})
And I need to create a DataFrame and calculate 2 columns:
New1 = category of the last agreement with "active" status
New2 = category of the last agreement with "finished" status
To be more precise, the resulting DataFrame is given below:
Assuming the dataframe is already sorted by date, we want to keep the last row where "status" == "active" and the last row where "status" == "finished". We also want to keep only the first and second columns, renaming "category" to "New1" for the active status and to "New2" for the finished status.
last_active = df[df.status == "active"].iloc[-1, [0, 1]].rename({"category": "New1"})
last_finished = df[df.status == "finished"].iloc[-1, [0, 1]].rename({"category": "New2"})
We get two pandas Series that we want to concatenate side by side, then transpose to have one entry per row:
pd.concat([last_active, last_finished], axis=1, sort=False).T
Perhaps you also want to call reset_index() afterwards, to have a fresh new RangeIndex in your resulting DataFrame.
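Putting these steps together on the question's data gives a runnable sketch (the desired output was only shown as an image, so the final shape below is my reading of the description):

import pandas as pd

rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({"ID": ["1", "2", "1", "2", "2"],
                   "category": ["A", "B", "A", "C", "B"],
                   "status": ["active", "finished", "active", "finished", "other"],
                   "Date": rng})

last_active = df[df.status == "active"].iloc[-1, [0, 1]].rename({"category": "New1"})
last_finished = df[df.status == "finished"].iloc[-1, [0, 1]].rename({"category": "New2"})

# one row per kept agreement, with columns ID / New1 / New2
result = pd.concat([last_active, last_finished], axis=1, sort=False).T.reset_index(drop=True)
print(result)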
I have another problem with joining two dataframes using pandas. I want to merge a complete dataframe into a column/field of another dataframe, where the foreign key field of DF2 matches the unique key of DF1.
The input data are 2 CSV files roughly looking like this:
CSV 1 / DF 1:
cid;name;surname;address
1;Mueller;Hans;42553
2;Meier;Peter;42873
3;Schmidt;Micha;42567
4;Pauli;Ulli;98790
5;Dick;Franz;45632
CSV 2 / DF 2:
OID;ticketid;XID;message
1;9;1;fgsgfs
2;8;2;gdfg
3;7;3;gfsfgfg
4;6;4;fgsfdgfd
5;5;5;dgsgd
6;4;5;dfgsgdf
7;3;1;dfgdhfd
8;2;2;dfdghgdh
I want each row of DF2 whose XID matches a cid of DF1 to become a single field in DF1. My final goal is to convert the above input files into a nested JSON format.
Edit 1:
Something like this:
[
    {
        "cid": 1,
        "name": "Mueller",
        "surname": "Hans",
        "address": 42553,
        "ticket": [{
            "OID": 1,
            "ticketid": 9,
            "XID": 1,
            "message": "fgsgfs"
        }]
    },
    ...
]
Edit 2:
Some further thoughts: would it be possible to create a dictionary from each row in dataframe 2 and then append this dictionary to a new column in dataframe 1, wherever some value (XID) of the dictionary matches the unique id (cid) in a row?
Some pseudocode I have in mind:
Add new column "ticket" in DF1
Iterate over rows in DF2:
    convert the row to a dictionary
    iterate over DF1:
        find the row where cid == dict["XID"]
        append the dictionary to the "ticket" field
Convert DF1 to JSON
Non-Python solutions are also acceptable.
Not sure what you expect as output, but check merge:
df1.merge(df2, left_on="cid", right_on="XID", how="left")
[EDIT based on the expected output]
Maybe something like this:
(
    df1.merge(
        df2.groupby("XID").apply(lambda g: g.to_dict(orient="records")).reset_index(name="ticket"),
        how="left", left_on="cid", right_on="XID")
    .drop(["XID"], axis=1)
    .to_json(orient="records")
)