How to populate arrays with values read in from csv via pandas? - python

I have created a DataFrame using pandas by reading a csv file. What I want to do is iterate down the rows (for the values in column 1) into a certain array, and do the same for the values in column 2 for a different array. This seems like it would normally be a fairly easy thing to do, so I think I am missing something; however, I can't find much online that isn't too complicated and actually does what I want. Stack questions like this one appear to be asking the same thing, but the answers are long and complicated. Is there no way to do this in a few lines of code? Here is what I have set up:
import pandas as pd
#available possible players
playerNames = []
df = pd.read_csv('Fantasy Week 1.csv')
What I anticipate I should be able to do would be something like:
for row in df.columns[1]:
    playerNames.append(row)
This however does not return the desired result.
Essentially, if df =
[1,2,3
4,5,6
7,8,9], I would want my array to be [1,4,7]

Do:
for row in df[df.columns[1]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[1]].tolist())
In this case you want the 1st column's values, so do:
for row in df[df.columns[0]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[0]].tolist())
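Putting the question's setup together with this answer, a minimal end-to-end version (a sketch, assuming 'Fantasy Week 1.csv' really does have the player names in its first column) would be:
import pandas as pd

# read the csv into a DataFrame (file name taken from the question)
df = pd.read_csv('Fantasy Week 1.csv')

# take the first column's values as a plain Python list
playerNames = df[df.columns[0]].tolist()
# equivalently, positional selection with iloc
playerNames = df.iloc[:, 0].tolist()

print(playerNames)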

Related

Using Pandas in Python: Splitting one column into three with possible blanks?

Right now, I'm working with a csv file in which there is one column with a string.
This file is 'animals.csv'.
Row,Animal
1,big green cat
2,small lizard
3,gigantic blue bunny
The strings are either two or three elements long.
I'm practicing using pandas, with the expand=True option to separate the column into three. My ideal table would look like this:
Row,Size,Color,Animal
1,big,green,cat
2,small, ,lizard
3,gigantic,blue,bunny
But how can I deal with situations where one element is missing? In this example, "small lizard" has no color, but I still want to include it in the table. Here's the code I have so far.
import pandas as pd
file = 'animals.csv'
def copy_csv(file):
    filereader = pd.read_csv(file)
    filereader[['size', 'color', 'animal']] = filereader['Animal'].str.split(expand=True)
    filereader.to_csv('sorted' + 'animals.csv')

copy_csv(file)
I end up with this error, which I know is happening because one of the strings ("small lizard") only has two elements.
ValueError: Columns must be same length as key
Any suggestions for how to solve this?
Edit: Following the suggestion below, I tried this:
new = filereader['Animal'].str.split(r'\s', expand=True)
And get a little closer to the goal, but not quite:
Row,Size,Color,Animal
1,big,green,cat
2,small,lizard,None
3,gigantic,blue,bunny
Looks like I need to figure out a way to say "if there are only two elements, the middle element should be None".
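No accepted answer is quoted here, but one standard way (a sketch, not from the original thread, assuming the 'animals.csv' layout shown above) is to pull the pieces out with a regular expression whose middle group is optional, so two-word animals get a blank color:
import pandas as pd

filereader = pd.read_csv('animals.csv')

# first word = size, optional middle word = color, last word = animal
parts = filereader['Animal'].str.extract(r'^(\S+)(?:\s+(\S+))?\s+(\S+)$')
parts.columns = ['Size', 'Color', 'Animal']

result = pd.concat([filereader[['Row']], parts], axis=1)
result.to_csv('sortedanimals.csv', index=False)
Rows like "small lizard" end up with NaN in the Color column, which you can replace with an empty string via result['Color'].fillna('') if you want the blank shown in the target table.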

What is the most efficient(fastest) way to create a dataframe?

I am working on a project that reads a couple of thousand text documents, creates a dataframe from them, and then trains a model on the dataframe. The most time-consuming aspect of the code is the creation of the dataframe.
Here is how I create the dataframe:
I first create 4-5 lists, then create a dictionary with the column names as keys and those lists as the values, and pass the dictionary to pd.DataFrame. I have added print statements after each step, and the dataframe creation step takes the most time.
Method I am using:
line_of_interest = []
line_no = []
file_name = []
for file in file_names:
    with open(file) as txt:
        for i, line in enumerate(txt):
            if 'word of interest' in line:
                line_of_interest.append(line)
                line_no.append(i)
                file_name.append(file)
rows = {'Line_no': line_no, 'Line': line_of_interest, 'File': file_name}
df = pd.DataFrame(data=rows)
I was wondering if there is a more efficient and less time-consuming way to create the dataframe. I tried looking for similar questions and the only thing I could find was "Most Efficient Way to Create Pandas DataFrame from Web Scraped Data".
Let me know if there is a similar question with a good answer. The only other method of creating a dataframe I know is appending row by row all the values as I discover them, and I don't know a way to check if that is quicker. Thanks!
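No answer is quoted for this one. For what it's worth, building lists and constructing the DataFrame once, as above, is already the generally recommended pattern; a close variant (a sketch under the same assumptions, reusing the question's file_names list) accumulates a single list of tuples and hands it to pd.DataFrame.from_records, which avoids keeping three parallel lists in sync:
import pandas as pd

records = []
for file in file_names:
    with open(file) as txt:
        for i, line in enumerate(txt):
            if 'word of interest' in line:
                # one (line number, line text, file name) tuple per hit
                records.append((i, line, file))

# build the DataFrame once from the accumulated rows
df = pd.DataFrame.from_records(records, columns=['Line_no', 'Line', 'File'])
Either way the frame is constructed in a single call rather than grown incrementally.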

Selecting 1.6M rows of a pandas dataframe [duplicate]

This question already has answers here:
"Large data" workflows using pandas [closed]
(16 answers)
Closed 2 years ago.
I have a csv file with ~2.3M rows. I'd like to save the subset (~1.6M) of the rows which have non-nan values in two columns of the dataframe. I'd like to keep using pandas to do this. Right now, my code looks like:
import pandas as pd
catalog = pd.read_csv('catalog.txt')
slim_list = []
for i in range(len(catalog)):
    if (pd.isna(catalog['z'][i]) == False and pd.isna(catalog['B'][i]) == False):
        slim_list.append(i)
which holds the rows of catalog which have non-nan values. I then make a new catalog with those rows as entries
slim_catalog = pd.DataFrame(columns = catalog.columns)
for j in range(len(slim_list)):
    data = (catalog.iloc[j]).to_dict()
    slim_catalog = slim_catalog.append(data, ignore_index = True)
slim_catalog.to_csv('slim_catalog.csv')
This should, in principle, work. It's sped up a little by reading each row into a dict. However, it takes far, far too long to execute for all 2.3M rows. What is a better way to solve this problem?
This is completely the wrong way of doing this in pandas.
Firstly, never iterate over a range, i.e. for i in range(len(catalog)):, and then individually index into the row with catalog['z'][i]; that is incredibly inefficient.
Second, do not build a pandas.DataFrame using pd.DataFrame.append in a loop: each append copies the entire frame, a linear-time operation, so the whole loop ends up quadratic.
But you shouldn't be looping here to begin with. All you need is something like
catalog[catalog.loc[:, ['z', 'B']].notna().all(axis=1)].to_csv('slim_catalog.csv')
Or broken up to perhaps be more readable:
not_nan_zB = catalog.loc[:, ['z', 'B']].notna().all(axis=1)
catalog[not_nan_zB].to_csv('slim_catalog.csv')
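An equivalent, arguably more readable spelling (standard pandas, not from the original answer) uses dropna with a subset of columns:
# drop any row where 'z' or 'B' is NaN, then write out the survivors
catalog.dropna(subset=['z', 'B']).to_csv('slim_catalog.csv')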

Loop for creating csv out of dataframe column index

I want to create a loop which creates multiple csvs which have the same 9 columns in the beginning but differ iteratively in the last column.
[col1,col2,col3,col4,...,col9,col[i]]
I have a dataframe with a shape of (20000,209).
What I want is a loop which does not take too much computation power and resources but creates 200 csvs which differ only in the last column. All columns exist in one dataframe. The columns which should be added are columns i = [10:-1].
I thought of something like:
for col in df.columns[10:-1]:
    dfi = df[:9]
    dfi.concat(df[10])
    dfi.dropna()
    dfi.to_csv('dfi.csv')
Maybe it is also possible to use
dfi.to_csv('dfi.csv', sequence = [:9,i])
The i should display the number of the added column. Any idea how to make this happen easily? :)
Thanks a lot!
I'm not sure I understand fully what you want but are you saying that each csv should just have 10 columns, all should have the first 9 and then one csv for each of the remaining 200 columns?
If so I would go for something as simple as:
base_cols = list(range(9))
for i in range(9, 209):
    df.iloc[:, base_cols+[i]].to_csv('csv{}.csv'.format(i))
Which should work I think.
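If the column names matter more than their positions, a name-based variant of the same idea (a sketch; the output file naming is an assumption, not part of the original answer) would be:
base_cols = list(df.columns[:9])
for col in df.columns[9:]:
    # one csv per extra column, named after that column
    df[base_cols + [col]].to_csv('{}.csv'.format(col))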

Extract value from single row of pandas DataFrame

I have a dataset in a relational database format (linked by ID's over various .csv files).
I know that each data frame contains only one value of an ID, and I'd like to know the simplest way to extract values from that row.
What I'm doing now:
# the group has only one element
purchase_group = purchase_groups.get_group(user_id)
price = list(purchase_group['Column_name'])[0]
The third line is bothering me as it seems ugly; however, I'm not sure what the workaround is. The grouping (I guess) assumes that there might be multiple values and returns a <class 'pandas.core.frame.DataFrame'> object, while I'd like just a row returned.
If you want just the value and not a df/series, then call values and index the first element [0], so just:
price = purchase_group['Column_name'].values[0]
will work.
If purchase_group has a single row, then doing purchase_group = purchase_group.squeeze() would make it into a series, so you could simply call purchase_group['Column_name'] to get your value.
Late to the party here, but purchase_group['Column Name'].item() is now available and is cleaner than some other solutions.
This method is intuitive; for example, to get the first row of values from the dataframe (like the first list from a list of lists):
np.array(df)[0]
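For comparison, the suggestions above side by side (a small usage sketch with a made-up single-row frame, not from the original answers):
import pandas as pd

purchase_group = pd.DataFrame({'Column_name': [9.99], 'other': ['x']})  # one-row example

price = purchase_group['Column_name'].values[0]   # via the underlying numpy array
price = purchase_group['Column_name'].item()      # via Series.item()
price = purchase_group.squeeze()['Column_name']   # squeeze the one-row frame to a Series first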
