I just started with python and know almost nothing about pandas.
So now I have a directed graph, stored as an edge list, which looks like this:
from ID  to ID
13       22
13       56
14       10
14       15
14       16
Now I need to transform it to a horizontal data format like this:
from ID  To 0  To 1  To 2
13       22    56    NaN
14       10    15    16
I found something like Pandas Merge duplicate index into single index, but it seems that approach cannot merge rows into separate columns. Will pandas automatically fill the blanks with NaN when extra columns need to be added in the process?
You can do something like this to transform your dataframe; pandas will automatically fill with NaN for however many columns this produces:
# collect each node's targets into a list, one row per source
df = df.groupby('from ID')['to ID'].apply(list)
# expand the lists into columns; shorter lists are padded with NaN
df = pd.DataFrame(df.tolist(), index=df.index).reset_index()
# rename the columns to match the desired output
df.columns = ["from ID"] + [f"To {i}" for i in range(len(df.columns) - 1)]
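For reference, here is a minimal end-to-end sketch with the sample edge list from the question (assuming the input columns are named 'from ID' and 'to ID'):
import pandas as pd

# the edge list from the question
df = pd.DataFrame({"from ID": [13, 13, 14, 14, 14],
                   "to ID": [22, 56, 10, 15, 16]})

df = df.groupby('from ID')['to ID'].apply(list)
df = pd.DataFrame(df.tolist(), index=df.index).reset_index()
df.columns = ["from ID"] + [f"To {i}" for i in range(len(df.columns) - 1)]
print(df)
#    from ID  To 0  To 1  To 2
# 0       13    22    56   NaN
# 1       14    10    15  16.0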
I have a dataset that looks like this:
df = pd.DataFrame([[1,1,5,4],[1,1,6,3]], columns=['date','site','chemistry','measurement'])
df
I'm looking to transform this dataset so that the values in the chemistry column become separate columns holding the corresponding measurement values, and the repeated values in the other columns collapse into a single row, like this:
new_df = pd.DataFrame([[1,1,4,3]], columns=['date','site','5','6'])
new_df
I've tried some basic things like df.transpose() and pd.pivot(), but these don't get me what I need.
The pivot is closer but still not the format I'm looking for.
I'm imagining there's a way to loop through the dataframe to do this, but I'm not sure how. Any suggestions?
Try this:
df.set_index(['date','site','chemistry'])['measurement'].unstack().reset_index()
Output:
chemistry date site 5 6
0 1 1 4 3
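One caveat with this approach: set_index([...]).unstack() raises a ValueError if the same (date, site, chemistry) combination appears more than once. If your real data may contain such duplicates, pivot_table with an explicit aggregation is a sketch of one workaround (here aggfunc='first' keeps the first measurement; pick an aggregation that fits your data):
new_df = (df.pivot_table(index=['date', 'site'], columns='chemistry',
                         values='measurement', aggfunc='first')
            .reset_index())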
I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.
import pandas as pd
df1 = pd.DataFrame({"ID":[1,2,3],"colA_1":[1,2,3],"date1":["1.1.2001", "2.1.2001","3.1.2001"],"colA_2":[4,5,6],"date2":["1.1.2002", "2.1.2002","3.1.2002"]})
df2 = pd.DataFrame({"ID":[1,1,2,2,3,3,3],"col1":[1,1.5,2,2.5,3,3.5,4],"date":["1.1.2001","1.1.2002","2.1.2001","2.1.2002","3.1.2001","3.1.2002","4.1.2002"],"col3":[11,12,13,14,15,16,17],"col4":[21,22,23,24,25,26,27]})
df1 looks like:
ID colA_1 date1 colA_2 date2
0 1 1 1.1.2001 4 1.1.2002
1 2 2 2.1.2001 5 2.1.2002
2 3 3 3.1.2001 6 3.1.2002
df2 looks like:
ID col1 date col3 col4
0 1 1.0 1.1.2001 11 21
1 1 1.5 1.1.2002 12 22
2 2 2.0 2.1.2001 13 23
3 2 2.5 2.1.2002 14 24
4 3 3.0 3.1.2001 15 25
5 3 3.5 3.1.2002 16 26
6 3 4.0 4.1.2002 17 27
I want to take a given column from df2, "col3", and then:
(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".
(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".
(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.
So the output dataframe, df3, should look like this:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
What is the best way to do this?
I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify column matching. I think it's possible that df.mask might be able to help me, but I am not sure how to implement it.
For example, the following code:
df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df.mask(filter1, other=df2["col3"])
df3["colB_2"] = df.mask(filter2, other=df2["col3"])
gives the error
ValueError: Can only compare identically-labeled Series objects
I asked this question previously and it was closed as a duplicate of this one. However, that is not the case. The answers to the linked question suggest using either map or df.merge. map does not work with multiple conditions (in my case, ID and date), and df.merge (the answer given for matching multiple columns) does not work when the columns to be merged on are named differently in df1 and df2 ("date" and "date1", for example).
For example, the below code:
df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')
fails with a KeyError.
Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.
Any help with this would be much appreciated.
You can use pd.wide_to_long after removing the underscores; this will unpivot the dataframe, which you can then merge with df2 and pivot back using unstack:
# strip the underscores so wide_to_long can find the stubnames colA/date
m = df1.rename(columns=lambda x: x.replace('_',''))
# unpivot df1 to long format; 'v' holds the 1/2 suffix
unpiv = pd.wide_to_long(m,['colA','date'],'ID','v').reset_index()
# merge in col3 from df2, then pivot back to one row per ID
merge_piv = (unpiv.merge(df2[['ID','date','col3']],on=['ID','date'],how='left')
             .set_index(['ID','v'])['col3'].unstack().add_prefix('colB_'))
final = df1.merge(merge_piv,left_on='ID',right_index=True)
Output:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
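If you prefer a version that names every column and condition explicitly, as the question asks, a more verbose alternative (a sketch using only plain merges) is to merge df2 twice, once per date column. left_on/right_on handles the differing column names that caused the KeyError above, and how='left' silently ignores the df2 rows with no match:
# match on (ID, date1) -> colB_1
df3 = df1.merge(df2[['ID', 'date', 'col3']].rename(columns={'col3': 'colB_1'}),
                left_on=['ID', 'date1'], right_on=['ID', 'date'],
                how='left').drop(columns='date')
# match on (ID, date2) -> colB_2
df3 = df3.merge(df2[['ID', 'date', 'col3']].rename(columns={'col3': 'colB_2'}),
                left_on=['ID', 'date2'], right_on=['ID', 'date'],
                how='left').drop(columns='date')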
I have a pd.Series like
import numpy as np
import pandas as pd

myS = pd.Series(np.arange(1, 11, 1))
I also have a pd.DataFrame like
mydf = pd.DataFrame([[1,2,3],[7,8,9]])
I would like to select values from myS, using the values in mydf as indices, and have the results stored in a dataframe with the same shape as mydf.
So the desired resulting dataframe is pd.DataFrame([[2,3,4],[8,9,10]]).
What is the best way to achieve this?
Using replace
yourdf = mydf.replace(myS)
yourdf
Out[174]:
0 1 2
0 2 3 4
1 8 9 10
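A caveat: replace treats myS as a mapping from its index labels to its values, replacing matching cell values in mydf, so this works only because mydf's cells happen to be index labels of myS. A sketch of a lookup that indexes into the Series directly, assuming mydf holds valid positions into myS:
# take myS's values at the positions given by mydf, keeping mydf's shape
yourdf = pd.DataFrame(myS.to_numpy()[mydf.to_numpy()],
                      index=mydf.index, columns=mydf.columns)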
I'm having trouble creating a time-lag column for my data. It works fine for a dataframe containing a single kind of element, but not when there are several different elements. When using the command suggested:
data1['lag_t'] = data1['total_tax'].shift(1)
it just displaces all the 'total_tax' values by one row across the whole frame. However, I need to compute this lag for EACH ONE of the id_inf values (as separate items). My dataset is really huge, so I need an efficient way to solve this.
You can groupby on the index and shift:
# an example with sample data
data1 = pd.DataFrame({'id': [9,9,9,54,54,54],'total_tax':[5,6,7,1,2,3]}).set_index('id')
data1['lag_t'] = data1.groupby(level=0)['total_tax'].apply(lambda x: x.shift())
print (data1)
    total_tax  lag_t
id
9 5 NaN
9 6 5.0
9 7 6.0
54 1 NaN
54 2 1.0
54 3 2.0
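A small design note: the apply(lambda x: x.shift()) wrapper isn't needed, since groupby objects support shift directly; on a really huge dataset the direct, vectorized call is usually faster:
data1['lag_t'] = data1.groupby(level=0)['total_tax'].shift()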