I'm having trouble creating a time-lag column for my data. It works fine when the dataframe contains only one kind of element, but it does not work when I have different elements. For example, my dataset looks something like this:
when using the command suggested:
data1['lag_t'] = data1['total_tax'].shift(1)
I get a result like this:
As you can see, it just displaces all the 'total_tax' values by one row. However, I need to compute this lag for EACH ONE of the id_inf values (as separate items).
My dataset is really huge, so I need to find an efficient way to solve this issue, so I can get as a result a table like this:
You can group by the index and shift:
# an example with sample data
data1 = pd.DataFrame({'id': [9, 9, 9, 54, 54, 54], 'total_tax': [5, 6, 7, 1, 2, 3]}).set_index('id')
data1['lag_t'] = data1.groupby(level=0)['total_tax'].apply(lambda x: x.shift())
print(data1)
    total_tax  lag_t
id
9           5    NaN
9           6    5.0
9           7    6.0
54          1    NaN
54          2    1.0
54          3    2.0
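Since the dataset is really huge, the lambda-free form of the same operation may be preferable: GroupBy.shift performs the shift per group directly, without a Python-level lambda call for each group. A minimal sketch with the same sample data:

```python
import pandas as pd

data1 = pd.DataFrame({'id': [9, 9, 9, 54, 54, 54],
                      'total_tax': [5, 6, 7, 1, 2, 3]}).set_index('id')

# shift within each id group directly; equivalent to the apply/lambda version
data1['lag_t'] = data1.groupby(level=0)['total_tax'].shift(1)
print(data1)
```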
Related
I am working with a large dataset that has a column for reviews, composed of comma-separated strings, for example: "A,B,C", "A,B*,B", etc.
for example,
import pandas as pd
df=pd.DataFrame({'cat1':[1,2,3,4,5],
'review':['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})
df2 = df["review"].str.split(",",expand = True)
df.join(df2)
I want to split that column up into separate columns for each letter, then add those columns into the original data frame. I used df2 = df["review"].str.split(",",expand = True) and df.join(df2) to do that.
However, when I use df["A"].unique() there are entries that should not be in the column: I only want 'A' to appear there, but B and C show up as well. Also, B and B* are not being split into two separate columns.
My dataset is quite large, so I don't know how to properly illustrate this problem. I have tried to provide a small-scale example; however, everything seems to work correctly in the example.
I have looked through the original column with df['review'].unique() and all entries were entered correctly (no missing commas or anything like that), so I was wondering whether there is something wrong with my approach that would keep it from working correctly across all datasets, or whether there is something wrong with my dataset.
Does anyone have any suggestions as to how I should troubleshoot?
when i use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there
IIUC, you wanted to create dummy-style indicator columns instead?
df2 = df.join(df['review'].str.get_dummies(sep=',').pipe(lambda x: x*[*x]).replace('',float('nan')))
Here get_dummies builds a 0/1 indicator column per letter; multiplying each column by its own name (x*[*x]) turns every 1 into the letter and every 0 into an empty string, which replace then converts to NaN.
Output:
cat1 review A B B* C D
0 1 A,B,C A B NaN C NaN
1 2 A,B*,B,C A B B* C NaN
2 3 A,C A NaN NaN C NaN
3 4 A,B,C,D A B NaN C D
4 5 A,B,C,A,B A B NaN C NaN
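A more explicit, equivalent way to build the same output (a sketch using the example data above) spells the steps out:

```python
import pandas as pd

df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})

# 0/1 indicator column for each distinct token between the commas
dummies = df['review'].str.get_dummies(sep=',')
# int * str repeats the string: 1 * 'A' -> 'A', 0 * 'A' -> ''
letters = dummies * list(dummies.columns)
out = df.join(letters.replace('', float('nan')))
print(out)
```

If the full dataset misbehaves while a small sample works, one common culprit is stray whitespace around the separators (e.g. 'A, B'); stripping spaces first with df['review'].str.replace(' ', '') is worth trying.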
Using groupby in a df to transform a column is giving me a different result
I have a df like this:
The column data is a float column and group holds the categories of my data.
I want to transform my data column, stratified by category, with .pct_change().rolling(2).sum(),
like this:
df[df.group == 'D'].data.pct_change().rolling(2).sum()
That gives me:
data
0 NaN
2 NaN
3 -0.604782
5 -0.298356
6 1.036227
8 -0.008092
9 396.681408
16 397.087583
17 -0.427873
23 0.253185
29 0.040770
But, when trying to automate things using groupby,
like this:
df_modif = pd.concat([df.groupby(by='group').data.pct_change().rolling(2).sum(), df.group], axis=1)
That gives me:
Can someone help me understand what is going on?
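A likely explanation (sketched here on made-up data, since the original df isn't shown): in the groupby version, only pct_change() runs per group. Its result is a plain Series, so the subsequent .rolling(2).sum() runs over the whole column and mixes values across group boundaries. Running the entire chain inside the groupby, e.g. with transform, keeps every step per group:

```python
import pandas as pd

df = pd.DataFrame({'group': ['D', 'D', 'D', 'E', 'E', 'E'],
                   'data': [1.0, 2.0, 4.0, 10.0, 5.0, 20.0]})

# run the whole chain per group; transform aligns the result with df's index
df['modif'] = df.groupby('group')['data'].transform(
    lambda s: s.pct_change().rolling(2).sum())
print(df)
```

For each group, this matches the manual per-group computation such as df[df.group == 'D'].data.pct_change().rolling(2).sum().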
This is closely related to the question I asked earlier here: Python Pandas Dataframe Pivot Table Column and Values Order. Thanks again for the help, very much appreciated.
I'm trying to automate a report that will be distributed via email to a large audience so it needs to look "pretty" :)
I'm having trouble resetting/removing the Indexes and/or Axis post-Pivots to enable me to use the .style CSS functions (i.e. creating a Styler Object out of the df) to make the table look nice.
I have a DataFrame where two of the principal fields (in my example here they are "Name" and "Bucket") will be variable. The desired display order will also change (so it can't be hard-coded) but it can be derived earlier in the application (e.g. "Name_Rank" and "Bucket_Rank") into Integer "Sorting Values" which can be easily sorted (and theoretically dropped later).
I can drop the column Sorting Value but not the Row/Header/Axis(?). Additionally, no matter what I try I just can't seem to get rid of the blank row between the headers and the DataTable.
I (think) I need to set the Index = Bucket and Headers = "Name" and "TDY/Change" to use the .style style object functionality properly.
import pandas as pd
import numpy as np
data = [
['AAA',2,'X',3,5,1],
['AAA',2,'Y',1,10,2],
['AAA',2,'Z',2,15,3],
['BBB',3,'X',3,15,3],
['BBB',3,'Y',1,10,2],
['BBB',3,'Z',2,5,1],
['CCC',1,'X',3,10,2],
['CCC',1,'Y',1,15,3],
['CCC',1,'Z',2,5,1],
]
df = pd.DataFrame(data, columns =
['Name','Name_Rank','Bucket','Bucket_Rank','Price','Change'])
display(df)
  Name  Name_Rank Bucket  Bucket_Rank  Price  Change
0  AAA          2      X            3      5       1
1  AAA          2      Y            1     10       2
2  AAA          2      Z            2     15       3
3  BBB          3      X            3     15       3
4  BBB          3      Y            1     10       2
5  BBB          3      Z            2      5       1
6  CCC          1      X            3     10       2
7  CCC          1      Y            1     15       3
8  CCC          1      Z            2      5       1
Based on the prior question/answer I can pretty much get the table into the right format:
df2 = (pd.pivot_table(df, values=['Price','Change'],index=['Bucket_Rank','Bucket'],
columns=['Name_Rank','Name'], aggfunc=np.mean)
.swaplevel(1,0,axis=1)
.sort_index(level=0,axis=1)
.reindex(['Price','Change'],level=1,axis=1)
.swaplevel(2,1,axis=1)
.rename_axis(columns=[None,None,None])
).reset_index().drop('Bucket_Rank',axis=1).set_index('Bucket').rename_axis(columns=
[None,None,None])
which looks like this:
            1            2            3
          CCC          AAA          BBB
        Price Change Price Change Price Change
Bucket
Y          15      3    10      2    10      2
Z           5      1    15      3     5      1
X          10      2     5      1    15      3
Ok, so...
A) How do I get rid of the Row/Header/Axis(?) that used to be "Name_Rank" (i.e. the integer "Sorting Values" 1, 2, 3)? I figured out a hack where the df is exported to XLS and re-imported with header=(1,2), but that can't be the best way to accomplish the objective.
B) How do I get rid of the blank row above the data in the table? From what I've read online it seems like you should "rename_axis=[None]" but this doesn't seem to work no matter which order I try.
C) Is there a way to set the Header(s) such that the both what used to be "Name" and "Price/Change" rows are Headers so that the .style functionality can be employed to format them separate from the data in the table below?
Thanks a lot for whatever suggestions anyone might have. I'm totally stuck!
Cheers,
Devon
In pandas 1.4.0 the options for A and B are directly available using the Styler.hide method:
I have data that tracks a group of individuals over time. To give a small example it looks kind of like this:
ID TIME HEIGHT
0 0 10.2
0 1 3.3
0 2 2.1
1 0 11.3
1 1 8.6
1 2 9.1
2 0 10.0
2 1 35.0
2 2 4.1
.
.
.
100 0 1.0
100 1 3.0
100 2 9.0
Where, for illustration, ID refers to a particular person. Thus, plotting TIME on the x-axis and HEIGHT on the y-axis for all the rows with ID=0 gives us the change in person 0's height.
I want to graph a random sample of these people. For instance, I want to plot the change in height over time of 3 people. However, applying the usual df.sample(3) will not ensure that I get all of the observations for a particular person; instead it will randomly select 3 rows and plot them. Is there a preferred/convenient way in pandas to sample random groups?
A lot of questions like this one seem to be about sampling from every group which is not what I want to do.
You want to plot 'TIME' on the x-axis, so get a rectangular dataframe with 'TIME' as the index and 'ID' as the columns. From there, use sample with axis=1 to sample columns and leave the index intact.
df.set_index(['TIME', 'ID']).HEIGHT.unstack().sample(3, axis=1).plot()
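If you prefer to keep the data in long format rather than pivoting, an equivalent approach (a sketch on made-up data) is to sample the unique IDs first and then filter:

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 0, 1, 1, 2, 2, 3, 3],
                   'TIME': [0, 1, 0, 1, 0, 1, 0, 1],
                   'HEIGHT': [10.2, 3.3, 11.3, 8.6, 10.0, 35.0, 1.0, 3.0]})

# draw 3 distinct people, then keep every row belonging to them
chosen = df['ID'].drop_duplicates().sample(n=3, random_state=0)
sampled = df[df['ID'].isin(chosen)]
# sampled.set_index(['TIME', 'ID']).HEIGHT.unstack().plot()  # same plot as above
```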
I am trying to study the probability of having a zero value in my data, and I have developed code that outputs the values of one column where another column is zero, which is what I need. But doing that by hand for each of the other 28 columns of my 577-by-29 dataframe is difficult, so I decided to create a for loop that does it for me:
import numpy as np
import pandas as pd
allchan = pd.read_csv('allchan.csv', delimiter=' ')
allchanarray = np.array(allchan)
dfallchan = pd.DataFrame(allchanarray, range(1, 578), dtype=float)
for n in range(0, 29):
    print(dfallchan[(dfallchan[0] > 0) & (dfallchan[n] == 0)][0])
    print(dfallchan[(dfallchan[0] > 0) & (dfallchan[n] == 0)][0].count())
What I want to do is capture what each print statement produces in a column of some container (array, list, DataFrame, or Series), which I am struggling with, and then save the output as an Excel file using something.to_excel before moving on to the next comparison column. Note that the output should return different values of the first channel (the first column of the input data) each time, since the zeros are randomly distributed in my input file, so each output column is expected to have a different length.
Please help me with the code, and explain why you chose one type of variable over another.
Thank you!
You can use the following feature of pandas:
data = [[1,2,3], [4,5], [1,2,3,4,5,6,7]]
df = pd.DataFrame(data).transpose()
df
0 1 2
0 1.0 4.0 1.0
1 2.0 5.0 2.0
2 3.0 NaN 3.0
3 NaN NaN 4.0
4 NaN NaN 5.0
5 NaN NaN 6.0
6 NaN NaN 7.0
The variable data is a list of lists, and the lists may have different lengths. In your case, each inner list can hold the values filtered from the 1st column for one comparison, and transpose() flips the dimensions so each list becomes a column, padded with NaN.
Similarly, you can add the count of the values filtered from the 1st column.
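Putting it together, a sketch of the full loop (a small made-up frame stands in for the real dfallchan, which in practice comes from pd.read_csv('allchan.csv', delimiter=' '); the output filename is illustrative):

```python
import pandas as pd

# stand-in for dfallchan read from allchan.csv
dfallchan = pd.DataFrame([[5.0, 0.0, 1.0],
                          [3.0, 2.0, 0.0],
                          [4.0, 0.0, 0.0]])

columns = []   # one list of filtered first-column values per channel
counts = []    # how many zeros each channel produced
for n in range(dfallchan.shape[1]):
    vals = dfallchan[(dfallchan[0] > 0) & (dfallchan[n] == 0)][0].tolist()
    columns.append(vals)
    counts.append(len(vals))

# lists of unequal length become NaN-padded columns after transpose()
result = pd.DataFrame(columns).transpose()
# result.to_excel('zeros_by_channel.xlsx')  # needs openpyxl installed
print(result)
print(counts)
```

A plain list per channel is used here because each channel yields a different number of matches; pd.DataFrame handles the ragged lengths by NaN-padding, which a fixed-width NumPy array could not do directly.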