Python: pandas Data Frame and meaning of {Series}0 in debugger

Python: pandas Data Frame and meaning of {Series}0 in debugger - python

I am using pandas in Python 2.7 and read a csv file like this:
import pandas as pd
df = pd.read_csv("test_file.csv")
df has a column titled rating, and a column titled 'review', I do some manipulations on df for example:
df3 = df[df['rating'] != 3]
Now if I look in a debugger at df['review'] and df3['review'] I see this information:
df['review'] = {Series}0
df3['review'] = {Series}1
Also if I want to see the first element of df['review'] I use:
df['review'][0]
which is fine, but if I do the same for df3, I get this error:
df3['review'][0]
{KeyError}0L
However, it looks like I can do this:
df3['review'][1]
Can someone please explain the difference?

Indexing with an integer on a Series doesn't work like a list. In particular, df['review'][0] doesn't get the first element of the "review" column, it gets the element with index 0:
In [4]: s = pd.Series(['a', 'b', 'c', 'd'], index=[1, 0, 2, 3])
In [5]: s
Out[5]:
1 a
0 b
2 c
3 d
dtype: object
In [6]: s[0]
Out[6]: 'b'
Presumably, in generating df3 you dropped the row with index 0. If you actually want to get the first element regardless of the index, use iloc:
In [7]: s.iloc[0]
Out[7]: 'a'

Related

How do i filter row and column from a 3 row matrix in Python?

enter image description here
How do I filter so that row 2 is == '1' and find the min value in row 1. And it prints the column its in?
So in this case the result would only bring up Driver 3 [.941, 0.210, 1] because row 2 is 1 and row1 is smaller than the rest of that would also have '1' in row 2.
When I try np.amin(df, axis=2) I get the min values i want but row 2 returns a 0 instead of 1.

You first need to select the subset of columns you're interested in, then find from those the minimum value, then figure out which column its in (or both at once with argmin). (Note that pandas API somewhat prefers filtering out rows vs columns.)
First setup some example data
import pandas as pd
import numpy as np
data = np.arange(24).reshape(4,6)
data[1,3:5]=1 # Adjust some values in row 2
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E', 'F'])
So transpose columns and rows
dfT = df.T
Get the columns where row 2 ==1
dfFiltered = dfT[dfT[1] == 1]
Find the minimum row with the minimum value and get its index
dfFiltered.index[dfFiltered[0].argmin()]
This produces 'D' for this example.

First of all, for this kind of application you really want your DataFrame to be transposed. It is just not convenient to filter on columns.
import pandas as pd
A = pd.DataFrame(...)
B = A.transpose()
# Now you can filter the drivers where "column" 2 is equal to 1:
C = B[B[2] == 1]
# And find where the minimum of "column" 1 is:
idx = C[1].argmin()
# Finally, find out the driver and its information:
driver = C.index[idx]
info = C.iloc[idx]
# info2 = C.loc[driver] # alternative
# To get the format you want:
print(driver, info.values)

Python csv Merge several column to one cell

How can I merge the several columns into one cell?
How can I convert a CSV file , that includes 1 by X cells where 1 is the row count and X is the column count unkown by the user,into a new CSV file including one single cell to combine all data from the original CSV file?
Right now://One row, four columns #####In fact, it will be variable number of columns as the data is extracted from a log file#############
1 A B C D
My purposes: //One row, one colum
1 A
B
C
D
The index of the row may not be always be 1 as I have many similar rows like that.
please refer to the original file and the expected new file at the following URL for details
https://drive.google.com/drive/folders/1cX0o86nbLAV5Foj5avCK6E2oAvExzt5t
161.csv is the file at the present
161-final.csv is what I want...
The count of rows will not be changed. However, the count of columns is a variable as the data is extracted from a log file. In the end, I only need each row has only one column.
I am just a fresh man with pandas. Will it be a way that I calculate the count of columns and then merge it into one cell?
Very appreciate your help.

code:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])
print(df)
df1 = np.empty(df.shape[0], dtype=object)
df1[:] = df.values.tolist()
print(df1)
output:
0 1 2 3
0 a a a a
1 b b b b
[list(['a', 'a', 'a', 'a']) list(['b', 'b', 'b', 'b'])]

Not sure this is what you want, but you can manage to put the content of one row in a single column with pd.groupby()
import pandas as pd
df = pd.DataFrame(np.arange(4).reshape(1, 4))
df.columns = ['A', 'B', 'C', 'D']
df = df.groupby(df.index).apply(lambda x: x.values.ravel())
df = df.to_frame('unique_col') # If needed
Output :
unique_col
0 [0, 1, 2, 3]
Not sure this is possible to have the output not in a list as you mentioned it in your example.

I used a "stupid" way to do that. It is the "pd.read_table(file)" do the magic.
def CsvMerge(file1, file2,output):
##It is for preparation of the getFormated()
##Merge the "Format.csv" and the "text.csv.csv
df1=pd.read_csv(file1)
###This is for merging the columns to one column
###It can also avoid an csv error
df2=pd.read_table(file2)
df3=pd.concat([df1,df2],axis=1)
print(df3)
with open(output,'w+',encoding="utf-8") as f:
df3.to_csv(f,index=False)
f.close()

pandas confusion with multiindex columns loc and get_level_values [duplicate]

I have a Dataframe with a pandas MultiIndex:
In [1]: import pandas as pd
In [2]: multi_index = pd.MultiIndex.from_product([['CAN','USA'],['total']],names=['country','sex'])
In [3]: df = pd.DataFrame({'pop':[35,318]},index=multi_index)
In [4]: df
Out[4]:
pop
country sex
CAN total 35
USA total 318
Then I remove some rows from that DataFrame:
In [5]: df = df.query('pop > 100')
In [6]: df
Out[6]:
pop
country sex
USA total 318
But when I consult the MutliIndex, it still has both countries in its levels.
In [7]: df.index.levels[0]
Out[7]: Index([u'CAN', u'USA'], dtype='object')
I can fix this myself in a rather strange way:
In [8]: idx_names = df.index.names
In [9]: df = df.reset_index(drop=False)
In [10]: df = df.set_index(idx_names)
In [11]: df
Out[11]:
pop
country sex
USA total 318
In [12]: df.index.levels[0]
Out[12]: Index([u'USA'], dtype='object')
But this seems rather messy. Is there a better way I'm missing?

From version pandas 0.20.0+ use MultiIndex.remove_unused_levels:
print (df.index)
MultiIndex(levels=[['CAN', 'USA'], ['total']],
labels=[[1], [0]],
names=['country', 'sex'])
df.index = df.index.remove_unused_levels()
print (df.index)
MultiIndex(levels=[['USA'], ['total']],
labels=[[0], [0]],
names=['country', 'sex'])

This is something that has bitten me before. Dropping columns or rows does NOT change the underlying MultiIndex, for performance and philosophical reasons, and this is officially not considered a bug (read more here). The short answer is that the developers say "that's not what the MultiIndex is for". If you need a list of the contents of a MultiIndex level after modification, for example for iteration or to check to see if something is included, you can use:
df.index.get_level_values(<levelname>)
This returns the current active values within that index level.
So I guess the "trick" here is that the API native way to do it is to use get_level_values instead of just .index or .columns

I will be surprised if there is a more "built-in" way to eliminate the unused country than to re-create the index in the way you're doing (or some similar way). If you look at your index before and after the slice:
In [165]: df.index
Out[165]:
MultiIndex(levels=[[u'CAN', u'USA'], [u'total']],
labels=[[0, 1], [0, 0]],
names=[u'country', u'sex'])
In [166]: df = df.query('pop > 100')
In [167]: df.index
Out[167]:
MultiIndex(levels=[[u'CAN', u'USA'], [u'total']],
labels=[[1], [0]],
names=[u'country', u'sex'])
you can see that the labels - which are indexes into the level values - have updated but not the level values. This may be an imperfect analogy, but it strikes me that the level values are analogous to an enumerated column in a database table, while the labels are analogous to the actual values of rows in the table. If you delete all the rows in a table with a value of "CAN", it doesn't change the fact that "CAN" is still a valid choice based on the column definition. To remove "CAN" from the enumeration, you have to alter the column definition; this is the equivalent of reindexing the dataframe in pandas.

Pandas DataFrame to List of Lists

It's easy to turn a list of lists into a pandas dataframe:
import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])
But how do I turn df back into a list of lists?
lol = df.what_to_do_now?
print lol
# [[1,2,3],[3,4,5]]

You could access the underlying array and call its tolist method:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]])
>>> lol = df.values.tolist()
>>> lol
[[1L, 2L, 3L], [3L, 4L, 5L]]

If the data has column and index labels that you want to preserve, there are a few options.
Example data:
>>> df = pd.DataFrame([[1,2,3],[3,4,5]], \
columns=('first', 'second', 'third'), \
index=('alpha', 'beta'))
>>> df
first second third
alpha 1 2 3
beta 3 4 5
The tolist() method described in other answers is useful but yields only the core data - which may not be enough, depending on your needs.
>>> df.values.tolist()
[[1, 2, 3], [3, 4, 5]]
One approach is to convert the DataFrame to json using df.to_json() and then parse it again. This is cumbersome but does have some advantages, because the to_json() method has some useful options.
>>> df.to_json()
{
"first":{"alpha":1,"beta":3},
"second":{"alpha":2,"beta":4},"third":{"alpha":3,"beta":5}
}
>>> df.to_json(orient='split')
{
"columns":["first","second","third"],
"index":["alpha","beta"],
"data":[[1,2,3],[3,4,5]]
}
Cumbersome but may be useful.
The good news is that it's pretty straightforward to build lists for the columns and rows:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> rows = [[i for i in row] for row in df.itertuples()]
This yields:
>>> print(f"columns: {columns}\nrows: {rows}")
columns: [None, 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]
If the None as the name of the index is bothersome, rename it:
df = df.rename_axis('stage')
Then:
>>> columns = [df.index.name] + [i for i in df.columns]
>>> print(f"columns: {columns}\nrows: {rows}")
columns: ['stage', 'first', 'second', 'third']
rows: [['alpha', 1, 2, 3], ['beta', 3, 4, 5]]

I wanted to preserve the index, so I adapted the original answer to this solution:
list_df = df.reset_index().values.tolist()
Now you can paste it somewhere else (e.g. to paste into a Stack Overflow question) and latter recreate it:
pd.Dataframe(list_df, columns=['name1', ...])
pd.set_index(['name1'], inplace=True)

I don't know if it will fit your needs, but you can also do:
>>> lol = df.values
>>> lol
array([[1, 2, 3],
[3, 4, 5]])
This is just a numpy array from the ndarray module, which lets you do all the usual numpy array things.

I had this problem: how do I get the headers of the df to be in row 0 for writing them to row 1 in the excel (using xlsxwriter)? None of the proposed solutions worked, but they pointed me in the right direction. I just needed one line more of code
# get csv data
df = pd.read_csv(filename)
# combine column headers and list of lists of values
lol = [df.columns.tolist()] + df.values.tolist()

Maybe something changed but this gave back a list of ndarrays which did what I needed.
list(df.values)

Not quite relate to the issue but another flavor with same expectation
converting data frame series into list of lists to plot the chart using create_distplot in Plotly
hist_data=[]
hist_data.append(map_data['Population'].to_numpy().tolist())

"df.values" returns a numpy array. This does not preserve the data types. An integer might be converted to a float.
df.iterrows() returns a series which also does not guarantee to preserve the data types. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
The code below converts to a list of list and preserves the data types:
rows = [list(row) for row in df.itertuples()]

If you wish to convert a Pandas DataFrame to a table (list of lists) and include the header column this should work:
import pandas as pd
def dfToTable(df:pd.DataFrame) -> list:
return [list(df.columns)] + df.values.tolist()
Usage (in REPL):
>>> df = pd.DataFrame(
[["r1c1","r1c2","r1c3"],["r2c1","r2c2","r3c3"]]
, columns=["c1", "c2", "c3"])
>>> df
c1 c2 c3
0 r1c1 r1c2 r1c3
1 r2c1 r2c2 r3c3
>>> dfToTable(df)
[['c1', 'c2', 'c3'], ['r1c1', 'r1c2', 'r1c3'], ['r2c1', 'r2c2', 'r3c3']]

The solutions presented so far suffer from a "reinventing the wheel" approach. Quoting #AMC:
If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
If you convert a dataframe to a list of lists you will lose information - namely the index and columns names.
My solution: use to_dict()
dict_of_lists = df.to_dict(orient='split')
This will give you a dictionary with three lists: index, columns, data. If you decide you really don't need the columns and index names, you get the data with
dict_of_lists['data']

We can use the DataFrame.iterrows() function to iterate over each of the rows of the given Dataframe and construct a list out of the data of each row:
# Empty list
row_list =[]
# Iterate over each row
for index, rows in df.iterrows():
# Create list for the current row
my_list =[rows.Date, rows.Event, rows.Cost]
# append the list to the final list
row_list.append(my_list)
# Print
print(row_list)
We can successfully extract each row of the given data frame into a list

This is very simple:
import numpy as np
list_of_lists = np.array(df)

Note: I have seen many cases on Stack Overflow where converting a Pandas Series or DataFrame to a NumPy array or plain Python lists is entirely unecessary. If you're new to the library, consider double-checking whether the functionality you need is already offered by those Pandas objects.
To quote a comment by #jpp:
In practice, there's often no need to convert the NumPy array into a list of lists.
If a Pandas DataFrame/Series won't work, you can use the built-in DataFrame.to_numpy and Series.to_numpy methods.

A function I wrote that allows including the index column or the header row:
def df_to_list_of_lists(df, index=False, header=False):
rows = []
if header:
rows.append(([df.index.name] if index else []) + [e for e in df.columns])
for row in df.itertuples():
rows.append([e for e in row] if index else [e for e in row][1:])
return rows

Creating an empty Pandas DataFrame, and then filling it

I'm starting from the pandas DataFrame documentation here: Introduction to data structures
I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. I'd like to initialize the DataFrame with columns A, B, and timestamp rows, all 0 or all NaN.
I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.
I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly or just a better way in general.
Note: I'm using Python 2.7.
import datetime as dt
import pandas as pd
import scipy as s
if __name__ == '__main__':
base = dt.datetime.today().date()
dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
dates.sort()
valdict = {}
symbols = ['A','B', 'C']
for symb in symbols:
valdict[symb] = pd.Series( s.zeros( len(dates)), dates )
for thedate in dates:
if thedate > dates[0]:
for symb in valdict:
valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]
print valdict

NEVER grow a DataFrame row-wise!
TLDR; (just read the bold text)
Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.
Here is my advice: Accumulate data in a list, not a DataFrame.
Use a list to collect your data, then initialise a DataFrame when you are ready. Either a list-of-lists or list-of-dicts format will work, pd.DataFrame accepts both.
data = []
for row in some_function_that_yields_data():
data.append(row)
df = pd.DataFrame(data)
pd.DataFrame converts the list of rows (where each row is a scalar value) into a DataFrame. If your function yields DataFrames instead, call pd.concat.
Pros of this approach:
It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again.
Lists also take up less memory and are a much lighter data structure to work with, append, and remove (if needed).
dtypes are automatically inferred (rather than assigning object to all of them).
A RangeIndex is automatically created for your data, instead of you having to take care to assign the correct index to the row you are appending at each iteration.
If you aren't convinced yet, this is also mentioned in the documentation:
Iteratively appending rows to a DataFrame can be more computationally
intensive than a single concatenate. A better solution is to append
those rows to a list and then concatenate the list with the original
DataFrame all at once.
*** Update for pandas >= 1.4: append is now DEPRECATED! ***
As of pandas 1.4, append has now been deprecated! Use pd.concat instead. See the release notes
These options are horrible
append or concat inside a loop
Here is the biggest mistake I've seen from beginners:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True) # yuck
# or similarly,
# df = pd.concat([df, pd.Series({'A': i, 'B': b, 'C': c})], ignore_index=True)
Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation.
The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:
df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)
df.dtypes
A object # yuck!
B float64
C object
dtype: object
Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:
df.infer_objects().dtypes
A int64
B float64
C object
dtype: object
loc inside a loop
I have also seen loc used to append to a DataFrame that was created empty:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
df.loc[len(df)] = [a, b, c]
As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It's just as bad as append, and even more ugly.
Empty DataFrame of NaNs
And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
It creates a DataFrame of object columns, like the others.
df.dtypes
A object # you DON'T want this
B object
C object
dtype: object
Appending still has all the issues as the methods above.
for i, (a, b, c) in enumerate(some_function_that_yields_data()):
df.iloc[i] = [a, b, c]
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.

Here's a couple of suggestions:
Use date_range for the index:
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B', 'C']
Note: we could create an empty DataFrame (with NaNs) simply by writing:
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # With 0s rather than NaNs
To do these type of calculations for the data, use a NumPy array:
data = np.array([np.arange(10)]*3).T
Hence we can create the DataFrame:
In [10]: df = pd.DataFrame(data, index=index, columns=columns)
In [11]: df
Out[11]:
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-03 4 4 4
2012-12-04 5 5 5
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9

If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:
newDF = pd.DataFrame() #creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print newDF.head() #again optional
In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.
If I have to keep appending new data into this newDF from more than
one oldDFs, I just use a for loop to iterate over
pandas.DataFrame.append()
Note: append() is deprecated since version 1.4.0. Use concat()

Initialize empty frame with column names
import pandas as pd
col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns = col_names)
my_df
Add a new record to a frame
my_df.loc[len(my_df)] = [2, 4, 5]
You also might want to pass a dictionary:
my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic
Append another frame to your existing frame
col_names = ['A', 'B', 'C']
my_df2 = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)
Performance considerations
If you are adding rows inside a loop consider performance issues. For around the first 1000 records "my_df.loc" performance is better, but it gradually becomes slower by increasing the number of records in the loop.
If you plan to do thins inside a big loop (say 10M‌ records or so), you are better off using a mixture of these two;
fill a dataframe with iloc until the size gets around 1000, then append it to the original dataframe, and empty the temp dataframe.
This would boost your performance by around 10 times.

Simply:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.zeros([rows,columns])
Then fill it.

Assume a dataframe with 19 rows
index=range(0,19)
index
columns=['A']
test = pd.DataFrame(index=index, columns=columns)
Keeping Column A as a constant
test['A']=10
Keeping column b as a variable given by a loop
for x in range(0,19):
test.loc[[x], 'b'] = pd.Series([x], index = [x])
You can replace the first x in pd.Series([x], index = [x]) with any value

This is my way to make a dynamic dataframe from several lists with a loop
x = [1,2,3,4,5,6,7,8]
y = [22,12,34,22,65,24,12,11]
z = ['as','ss','wa', 'ss','er','fd','ga','mf']
names = ['Bob', 'Liz', 'chop']
a loop
def dataF(x,y,z,names):
res = []
for t in zip(x,y,z):
res.append(t)
return pd.DataFrame(res,columns=names)
Result
dataF(x,y,z,names)

# import pandas library
import pandas as pd
# create a dataframe
my_df = pd.DataFrame({"A": ["shirt"], "B": [1200]})
# show the dataframe
print(my_df)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: pandas Data Frame and meaning of {Series}0 in debugger - python

Related

How do i filter row and column from a 3 row matrix in Python?

Python csv Merge several column to one cell

pandas confusion with multiindex columns loc and get_level_values [duplicate]

Pandas DataFrame to List of Lists

Creating an empty Pandas DataFrame, and then filling it

Categories

Resources