Pandas is printing more rows than expected - python

Currently I am working on a database and I am trying to sort my rows with pandas. I have a column called 'sessionkey' which refers to a session, so each row can be assigned to a session. I tried to separate the data into these sessions.
Furthermore, there can be duplicated rows, which I tried to drop with pandas' drop_duplicates function.
df = pd.read_csv((path_of_data+'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv'), keep_default_na=False, engine='python')
tmp = df['sessionkey'].values #I want to split data into different sessions
tmp = np.unique(tmp)
df.set_index('sessionkey', inplace=True)
watching = df.loc[tmp[10]].drop_duplicates(keep='first') #here I pick one example
print(watching.sort_values(by =['eventTimestamp', 'eventClickSequenz']))
print(watching.info())
I would have thought that this works fine, but when I checked my results by printing out the split dataframe, the output looked very odd to me. For example, the printed length of the dataframe says 38 rows x 4 columns, but when I print the same dataframe there are clearly more than 38 rows and there are still duplicates in it.
I already tried to split the data by using unique indices:
comparison = pd.DataFrame()
for index, item in enumerate(df['sessionkey'].values):
    if item == tmp: comparison = comparison.append(df.iloc[index])
comparison.drop_duplicates(keep='first', inplace=True)
print(comparison.sort_values(by=['eventTimestamp']))
But the problem is still the same.
The output also seems to follow a pattern: say we have 38 entries, then pandas prints entries 1-37 and then appends entries 2-38. So the last one is left out, the whole list is shifted, and it is printed again.
When I look at the numpy values there are just 38 different rows. So is this a problem with pandas' print function? Is there an error in my code? Does pandas have a problem with non-unique indexes?
EDIT:
Okay I figured out what the problem is. I wanted to look at a long dataframe so I used:
pd.set_option('display.max_rows', -1)
Now we can use some sample data:
data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns = columns)
print(df)
Printed it now looks like this:
   sessionkey  event
0         119      0
1         119      1
1         119      1
2         119      2
Although I expected it to look like this:
   sessionkey  event
0         119      0
1         119      1
2         119      2
I thought my dataframe had the wrong shape, but this is not the case.
So the event in the middle gets printed twice. Is this a bug or the intended output?

drop_duplicates() doesn't look at the index when getting rid of rows; instead it looks at the whole row. But it does have a useful subset kwarg which allows you to specify which columns to use.
You can try the following:
df = pd.read_csv((path_of_data+'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv'), keep_default_na=False, engine='python')
print(df.shape)
print(df["session"].nunique()) # number of unique sessions
df_unique = df.drop_duplicates(subset=["session"],keep='first')
# these two numbers should be the same
print(df_unique.shape)
print(df_unique["session"].nunique())

It sounds like you want to drop_duplicates based on the index; by default drop_duplicates drops based on the column values. To do that, try:
df.loc[~df.index.duplicated()]
This selects only the rows whose index value is not duplicated, keeping the first occurrence of each index value.
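For instance, a minimal sketch with made-up data (the index values here are purely illustrative):
import pandas as pd

# hypothetical frame with a repeated index value
df = pd.DataFrame({'event': [0, 1, 1, 2]}, index=[119, 119, 119, 120])
df.index.name = 'sessionkey'

# keep only the first row for each index value
print(df.loc[~df.index.duplicated()])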

I used your sample code.
data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns = columns)
print(df)
And I got your expected outcome.
   sessionkey  event
0         119      0
1         119      1
2         119      2
After I set the max_rows option, as you did:
pd.set_option('display.max_rows', -1)
I got the incorrect outcome.
   sessionkey  event
0         119      0
1         119      1
1         119      1
2         119      2
The problem is likely the "-1" setting. The docs state that "None" will set max_rows to unlimited; it is unclear what "-1" does in an option that accepts only positive integers or None.
Try:
pd.set_option('display.max_rows', None)
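Applied to the sample data from the question, a minimal sketch of the fix might look like this (pd.reset_option simply restores the default afterwards):
import numpy as np
import pandas as pd

data = np.array([[119, 0], [119, 1], [119, 2]])
df = pd.DataFrame(data, columns=['sessionkey', 'event'])

pd.set_option('display.max_rows', None)  # None means "no limit", unlike -1
print(df)                                # each row now prints exactly once
pd.reset_option('display.max_rows')      # restore the default display limit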


pandas dataframe: how to select rows where one column-value is like 'values in a list'

I have a requirement where I need to select rows from a dataframe where one column's value is like values in a list.
The requirement is for a large dataframe with millions of rows, and I need to search for rows where the column value matches any of a list of thousands of values.
Below is some sample data.
NAME,AGE
Amar,80
Rameshwar,60
Farzand,90
Naren,60
Sheikh,45
Ramesh,55
Narendra,85
Rakesh,86
Ram,85
Kajol,80
Naresh,86
Badri,85
Ramendra,80
My code is below. The problem is that I'm using a for loop, so as the number of values in the list I need to search (the variable names_like in my code) grows, the number of loop iterations and concat operations grows with it, and the code runs very slowly.
I can't use the isin() option, because isin is for exact matches and what I need is a 'like' condition.
Looking for a better, more performance-efficient way of getting the required result.
My Code:-
import pandas as pd
infile = "input.csv"
df = pd.read_csv(infile)
print(f"df=\n{df}")
names_like = ['Ram', 'Nar']
df_res = pd.DataFrame(columns=df.columns)
for name in names_like:
    df1 = df[df['NAME'].str.contains(name, na=False)]
    df_res = pd.concat([df_res, df1], axis=0)
print(f"df_res=\n{df_res}")
My Output:-
df_res=
NAME AGE
1 Rameshwar 60
5 Ramesh 55
8 Ram 85
12 Ramendra 80
3 Naren 60
6 Narendra 85
10 Naresh 86
You can join all the names with | into a single regex 'or' pattern, so the loop is not necessary:
df_res = df[df['NAME'].str.contains('|'.join(names_like), na=False)]
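For completeness, a minimal sketch using the sample data from the question (input.csv as shown above); escaping the names with re.escape is an extra precaution on my part, in case any of them contain regex metacharacters:
import re
import pandas as pd

df = pd.read_csv("input.csv")
names_like = ['Ram', 'Nar']

# build one regex alternation instead of looping and concatenating
pattern = '|'.join(re.escape(name) for name in names_like)
df_res = df[df['NAME'].str.contains(pattern, na=False)]
print(df_res)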

pandas error in df.apply() only for a specific dataframe

Noticed something very strange in pandas. My dataframe (3 rows and 3 columns) has the columns ID_Name, Score and From_To.
When I try to extract ID and Name (separated by an underscore) into their own columns using the command below, it gives me an error:
df[['ID','Name']] = df.apply(lambda x: get_first_last(x['ID_Name']), axis=1, result_type='broadcast')
Error is:
ValueError: cannot broadcast result
Here's the interesting part though... When I delete the "From_To" column from the original dataframe, performing the same df.apply() to split ID_Name works perfectly fine and I get the new ID and Name columns as expected.
I have checked a lot of SO answers but none seem to help. What did I miss here?
P.S. get_first_last is a very simple function like this:
def get_first_last(s):
    str_lis = s.split("_")
    return [str_lis[0], str_lis[1]]
From the doc of pandas.DataFrame.apply :
'broadcast' : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
So the problem is that the original shape of your dataframe is (3, 3), while the result of your apply function has 2 columns, so you have a mismatch. That also explains why, when you delete "From_To", the new shape is (3, 2) and you get a match.
You can use 'expand' instead of 'broadcast' and you will get your expected result.
import pandas as pd

table = [
    ['1_john', 23, 'LoNDon_paris'],
    ['2_bob', 34, 'Madrid_milan'],
    ['3_abdellah', 26, 'Paris_Stockhom']
]
df = pd.DataFrame(table, columns=['ID_Name', 'Score', 'From_to'])
# get_first_last is the function defined in the question
df[['ID', 'Name']] = df.apply(lambda x: get_first_last(x['ID_Name']), axis=1, result_type='expand')
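Running this sketch should add ID and Name columns holding 1/john, 2/bob and 3/abdellah next to the original three columns, without the broadcast error.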
hope this helps !!
It's definitely not a good use case for apply; you should rather do:
df[["ID", "Name"]]=df["ID_Name"].str.split("_", expand=True, n=1)
Which for your data will output (I took only the first 2 columns from your data frame):
ID_Name Score ID Name
0 1_john 23 1 john
1 2_bob 34 2 bob
2 3_janet 45 3 janet
Now, n=1 is just in case you have multiple _ characters (e.g. as part of the name), to make sure you return at most 2 columns (otherwise the above code would fail).
For instance, if we slightly modify your data, we get the following output:
ID_Name Score ID Name
0 1_john 23 1 john
1 2_bob_jr 34 2 bob_jr
2 3_janet 45 3 janet

python for loop using index to create values in dataframe

I have a very simple for loop problem and I haven't found a solution in any of the similar questions on Stack Overflow. I want to use a for loop to create values in a pandas dataframe: the values should be strings that contain a numerical index. I can make the correct value print, but I can't make this value get saved in the dataframe. I'm new to Python.
# reproducible example
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
for i in range(5):
    print("data_{i}.txt".format(i=i)) # this prints the value that I want
    df1['file'] = "data_{i}.txt".format(i=i)
This loop prints the exact value that I want to put into the 'file' column of df1, but when I look at df1, every row ends up with the value from the last iteration.
x file
0 0 data_4.txt
1 1 data_4.txt
2 2 data_4.txt
3 3 data_4.txt
4 4 data_4.txt
I have tried using enumerate, but can't find a solution with this. I assume everyone will yell at me for posting a duplicate question, but I have not found anything that works and if someone points me to a solution that solves this problem, I'll happily remove this question.
There are better ways to create a DataFrame, but to answer your question:
Replace the last line in your code:
df1['file'] = "data_{i}.txt".format(i=i)
with:
df1.loc[i, 'file'] = "data_{0}.txt".format(i)
For more information, read about the .loc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
On the same page, you can read about accessors like .at and .iloc as well.
You can use a list comprehension:
df1['file'] = ["data_{i}.txt".format(i=i) for i in range(5)]
print(df1)
Prints:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
OR at the creating of DataFrame:
df1 = pd.DataFrame({'x':range(5), 'file': ["data_{i}.txt".format(i=i) for i in range(5)]})
print(df1)
OR:
df1 = pd.DataFrame([{'x':i, 'file': "data_{i}.txt".format(i=i)} for i in range(5)])
print(df1)
I've found success with the .at method
for i in range(5):
    print("data_{i}.txt".format(i=i)) # this prints the value that I want
    df1.at[i, 'file'] = "data_{i}.txt".format(i=i)
Returns:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
When you assign a value to a dataframe column the way you do, using df['colname'] = val, it assigns that value across all rows.
That is why you are seeing only the last value.
Change your code to:
import pandas as pd

df1 = pd.DataFrame({'x': range(5)})
# for loop to build the values
to_assign = []
for i in range(5):
    print("data_{i}.txt".format(i=i)) # this prints the value that I want
    to_assign.append("data_{i}.txt".format(i=i))
# outside of the loop - only once - assign to all dataframe rows
df1['file'] = to_assign
As a thought, pandas has a great API for performing these types of actions without for loops, and you should start practicing those.
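As an illustration of that last point, a minimal loop-free sketch (assuming the same 'data_<i>.txt' naming) builds the strings directly from the index:
import pandas as pd

df1 = pd.DataFrame({'x': range(5)})
# map each index value through a format string - no explicit Python loop
df1['file'] = df1.index.map('data_{}.txt'.format)
print(df1)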

Grouper and axis must be same length in Python

I am a beginner in Python, and I am studying a textbook to learn the Pandas module.
I have a dataframe called Berri_bike, which comes from the following code:
bike_df=pd.read_csv(os.path.join(path,'comptagevelo2012.csv'),parse_dates=['Date'],\
encoding='latin1',dayfirst=True,index_col='Date')
Berri_bike=bike_df['Berri1'].copy() # get only the column='Berri1'
Berri_bike['Weekday']=Berri_bike.index.weekday
weekday_counts = Berri_bike.groupby('Weekday').aggregate(sum)
weekday_counts
I have 3 columns in my Berri_bike: a date index from 1/1/2012 to 12/31/2012, a value column with a number for each date, and the Weekday column I assigned to it. But when I want to group by the values, I get the error ValueError: Grouper and axis must be same length. I am not sure what this means; what I want to do is very simple, like in SQL: sum(value) grouped by weekday. Can anyone please let me know what happened here?
You copied your column into a pandas Series instead of a new dataframe, hence the following operations behave differently. You can see this if you print out Berri_bike: it doesn't show the column name.
Instead, you should copy the column into a new dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 30, size=(70, 2)),
                  columns=["A", "B"],
                  index=pd.date_range("20180101", periods=70))
Berri_bike = df[["A"]]
Berri_bike['Weekday'] = Berri_bike.index.weekday
weekday_counts = Berri_bike.groupby("Weekday").sum()
print(weekday_counts)
#sample output
A
Weekday
0 148
1 101
2 127
3 139
4 163
5 74
6 135
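To make the Series-versus-DataFrame point concrete, a minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'Berri1': [10, 20, 30]})
print(type(df['Berri1']))    # single brackets -> pandas Series
print(type(df[['Berri1']]))  # double brackets -> pandas DataFrame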

Pandas dataframe "all true" criterion

Python 2.7, Pandas 0.18.
I have a DataFrame, and I have methods that select a subset of the rows via a criterion parameter. I'd like to know a more idiomatic way to write a criterion that matches all rows.
Here's a very simple example:
import pandas as pd
def apply_to_matching(df, criterion):
    df.loc[criterion, 'A'] = df[criterion]['A']*df[criterion]['B']
df = pd.DataFrame({'A':[1,2,3,4],'B':[10,100,1000,10000]})
criterion = (df['A']<3)
result = apply_to_matching(df,criterion)
print df
The output would be:
A B
0 10 10
1 200 100
2 3 1000
3 4 10000
because the criterion applies to only the first two rows.
I would like to know the idiomatic way to create a criterion that selects all rows of the DataFrame.
This could be done by adding a column of all true values to the DataFrame:
# Add a column
df['AllTrue']=True
criterion = df['AllTrue']
result = apply_to_matching(df,criterion)
print df.drop('AllTrue',axis=1)
The output is:
A B
0 10 10
1 200 100
2 3000 1000
3 40000 10000
but that approach adds a column to my DataFrame, which I then have to filter out so it doesn't appear in my output.
So, is there a more idiomatic way to do this in Pandas? One which does not require me to know anything about the column names, and does not change the DataFrame?
When everything should be True, the boolean-indexing way would require a Series of True values. With the code you have above, another way to look at it is that the criterion argument can also receive slices: getting all the rows means slicing across all of them, as in df.loc[:, 'A']. Since you need to pass the criterion as an argument to the apply_to_matching function, use the slice builtin:
apply_to_matching(df, slice(None, None))
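As a minimal sketch of both options, adapting the example from the question (the all-True Series variant is an extra suggestion, not something the question requires):
import pandas as pd

def apply_to_matching(df, criterion):
    df.loc[criterion, 'A'] = df[criterion]['A'] * df[criterion]['B']

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 100, 1000, 10000]})

# option 1: a slice that covers every row
apply_to_matching(df, slice(None))

# option 2: an all-True boolean Series aligned to the index
all_true = pd.Series(True, index=df.index)
apply_to_matching(df, all_true)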
