Using Pandas in Python: Splitting one column into three with possible blanks?

Right now, I'm working with a csv file in which there is one column with a string.
This file is 'animals.csv'.
Row,Animal
1,big green cat
2,small lizard
3,gigantic blue bunny
The strings are either two or three elements long.
I'm practicing using pandas, with the expand=True option to separate the column into three. My ideal table would look like this:
Row,Size,Color,Animal
1,big,green,cat
2,small, ,lizard
3,gigantic,blue,bunny
But how can I deal with situations where one element is missing? In this example, "small lizard" has no color, but I still want to include it in the table. Here's the code I have so far.
import pandas as pd
file = 'animals.csv'
def copy_csv(file):
    filereader = pd.read_csv(file)
    filereader[['size', 'color', 'animal']] = filereader['Animal'].str.split(expand=True)
    filereader.to_csv('sorted' + 'animals.csv')
copy_csv(file)
I end up with this error, which I know is happening because one of the strings ("small lizard") only has two elements.
ValueError: Columns must be same length as key
Any suggestions for how to solve this?
Edit: I tried the suggestion below and tried this:
new = filereader['Animal'].str.split('\s', expand=True)
And get a little closer to the goal, but not quite:
Row,Size,Color,Animal
1,big,green,cat
2,small,lizard,None
3,gigantic,blue,bunny
Looks like I need to figure out a way to say "if there are only two elements, the middle element should be None".
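One possible way to express "if there are only two elements, the middle element should be blank" is to split each string by hand and pad the middle slot before building the new columns. This is only a sketch: it assumes every string has two or three words and that a two-word string is always missing the color. The helper name split_animal and the inline sample frame are made up for illustration.

```python
import pandas as pd

# Sample data mirroring animals.csv from the question.
df = pd.DataFrame({
    "Row": [1, 2, 3],
    "Animal": ["big green cat", "small lizard", "gigantic blue bunny"],
})

def split_animal(text):
    """Hypothetical helper: return [size, color, animal], padding a missing color."""
    parts = text.split()
    if len(parts) == 2:
        # Assumption: with only two words, the color is the one left out.
        parts.insert(1, "")
    return parts

# Build the three new columns, keeping the original index so rows line up.
parts = pd.DataFrame(
    [split_animal(s) for s in df["Animal"]],
    columns=["Size", "Color", "Animal"],
    index=df.index,
)
result = pd.concat([df[["Row"]], parts], axis=1)
print(result.to_csv(index=False))
```

Strings with one word, or more than three, would break the fixed three-column layout, so they would need their own rule.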

Related

New dataframe in Pandas based on specific values(a lot of them) from existing df

Good evening! I'm using pandas in a Jupyter Notebook. I have a huge dataframe holding the full post history of 26 channels in a messenger. Its "dialog_id" column records which dialog each message was sent in, so the column has only 26 unique values, but there are more than 700k rows, and the df is sorted by time rather than by id, so it is rather chaotic. I have to split this dataframe into two: one with the full history of 13 channels, and the other with the history of the remaining 13. I know the ids I have to split by; they are random as well. For example, one is -1001232032465 and another is -1001153765346.
The question is: what is the most elegant and appropriate way to do it?
I know I can do it somehow with df.loc[], but I don't want to write something like 13 lines of df.loc[]. I've tried to use logical operators for this, like:
df1.loc[(df["dialog_id"] == '-1001708255880') & (df["dialog_id"] == '-1001645788710' )], but it doesn't work. I suppose I'm using them wrong. I expect a solution, using any method and logical operators, that creates a new df. In words, I think it should read like "put the row in a new df if the dialog_id is x, or dialog_id is y, or dialog_id is z, etc". Please help me!

The easiest way seems to be just setting up a query.
df = pd.DataFrame(dict(col_id=[1,2,3,4,], other=[5,6,7,8,]))
channel_groupA = [1,2]
channel_groupB = [3,4]
df_groupA = df.query(f'col_id == {channel_groupA}')
df_groupB = df.query(f'col_id == {channel_groupB}')
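The attempt in the question joins the comparisons with &, which asks for rows whose dialog_id equals both values at once, so nothing can match; "x or y or z" is what Series.isin() expresses. Below is a minimal sketch of that alternative. The frame is made up, and only the dialog_id column name comes from the question. It assumes the ids are stored as integers; if the column holds strings, the list would need strings too.

```python
import pandas as pd

# Made-up frame; only the dialog_id column name comes from the question.
df = pd.DataFrame({
    "dialog_id": [-1001232032465, -1001153765346, -1001232032465, -1001708255880],
    "text": ["a", "b", "c", "d"],
})

# Ids of the first group (in the real data this list would hold 13 ids).
group_a_ids = [-1001232032465, -1001708255880]

# True where dialog_id is any of the listed ids, i.e. "x or y or z".
in_group_a = df["dialog_id"].isin(group_a_ids)

df_group_a = df[in_group_a]    # rows for the listed channels
df_group_b = df[~in_group_a]   # every other row
```

Negating the mask with ~ gives the second dataframe without having to list the other 13 ids.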

Pandas series string manipulation using Python - 1st two chars flip and append to end of string

I have a column (series) of values in which I'm trying to move characters around, and I'm going nowhere fast! I found some snippets of code that got me where I am, but I need a "closer". I'm working with one column of datatype str. Each string is a series of digits. Some numbers are duplicated, and the duplicates have an (n-) prefix in front of the number. The (n) changes based on how many duplicates of a number are listed: some may have two duplicates, some eight. It doesn't matter; the order should stay the same.
I need to go down through each cell, pluck the (n-) from the left of the string, swap its two characters around, and append it to the end of the string. No number sorting is needed. The column is 4-5k lines long and looks like the example given all the way down. There are no other special characters or letters. Also, the duplicate rows will always be together, wherever they are in the column.
My problem is that the code below actually works: it steps through each string, checks it for a dash, then processes the numbers the way I need. However, I have not learned how to get the changes back into my dataframe from a Python for-loop. I was really hoping that somebody had a nifty lambda fix or a pandas apply function to address the whole column at once, but I haven't found anything that I can tweak to work. I know there is a better way than slowly traversing a series, and I would like to learn it.
Two possible fixes needed:
Is there a way to have the code below replace the old df.string value with the newly created df.string value? If so, please let me know.
I've been trying to read up on df.apply with the split feature so I can address the whole column at once. I understand it's the smarter play. Are there a couple of lines of code that would do what I need?
Please let me know what you think. I appreciate the help. Thank you for taking the time.
import re
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
df = pd.read_excel("E:\Book2.xlsx")
df.column1=df.column1.astype(str)
for r in df['column1']: #Finds Column
    if bool(re.search('-', r))!=True: #test if string has '-'
        continue
    else:
        a = [] #string holder for '-'
        b = [] #string holder for numbers
        for c in r:
            if c == '-': #if '-' then hold in A
                a.append(c)
            else:
                b.append(c) #if number then hold in B
        t = (''.join(b + a)) #puts '-' at the end of string
        z = t[1:] + t[:1] #picks up 1st position char and moves to end of string
        r = z #assigns new created string to df.column1 value
print(df)
Starting File:    Ending File:
column1           column1
41887             41887
1-41845           41845-1
2-41845           41845-2
40905             40905
1-41323           41323-1
2-41323           41323-2
3-41323           41323-3
41778             41778

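You can use Series.str.replace() with a regular expression:
If we recreate your example with a file containing all your values (and no header row) and keep column1 as the column name:
import pandas as pd
df = pd.read_csv('file.txt', header=None, names=['column1'], dtype=str)
df['column1'] = df['column1'].str.replace(r'^(\d)-(\d+)', r'\2-\1', regex=True)
print(df)
This gives the desired output: the old column is replaced with the new one in a single vectorized call, without loops. Note that regex=True must be passed explicitly in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.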
#in
41887
1-41845
2-41845
40905
1-41323
2-41323
3-41323
41778
#out
column1
0 41887
1 41845-1
2 41845-2
3 40905
4 41323-1
5 41323-2
6 41323-3
7 41778
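If the (n-) prefix can ever reach two digits, a pattern that captures exactly one leading digit would miss those rows. The sketch below uses an in-memory frame with values modeled on the question (the 12- row is invented to show the two-digit case) and widens the first group to \d+.

```python
import pandas as pd

# Values modeled on the question; "12-41323" is an invented two-digit case.
df = pd.DataFrame({"column1": ["41887", "1-41845", "2-41845", "40905", "12-41323"]})

# ^(\d+)- captures a prefix of any length; rows without a dash
# do not match the pattern and are left unchanged.
df["column1"] = df["column1"].str.replace(r"^(\d+)-(\d+)$", r"\2-\1", regex=True)

print(df["column1"].tolist())
# ['41887', '41845-1', '41845-2', '40905', '41323-12']
```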

How to populate arrays with values read in from csv via pandas?

I have created a DataFrame using pandas by reading a csv file. What I want to do is iterate down the rows and put the values in column 1 into one array, and do the same for the values in column 2 with a different array. This seems like it would normally be a fairly easy thing to do, so I think I am missing something; however, I can't find much online that doesn't get too complicated, and what I do find doesn't seem to do what I want. Stack questions like this one appear to be asking the same thing, but the answers are long and complicated. Is there no way to do this in a few lines of code? Here is what I have set up:
import pandas as pd
#available possible players
playerNames = []
df = pd.read_csv('Fantasy Week 1.csv')
What I anticipate I should be able to do would be something like:
for row in df.columns[1]:
    playerNames.append(row)
This however does not return the desired result.
Essentially, if df =
[1,2,3
4,5,6
7,8,9], I would want my array to be [1,4,7]

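df.columns[1] is only the column's label, so if that label is a string, looping over it walks through the characters of the name. Index the DataFrame with the label to get the values:
for row in df[df.columns[1]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[1]].tolist())
Column positions start at 0, so for the first column's values, as in your [1,4,7] example, do:
for row in df[df.columns[0]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[0]].tolist())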
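The same idea can be written with positional indexing, which avoids looking up the label first. This is a small sketch on the 3x3 example from the question; the column names a, b, c are placeholders, since the real CSV headers are not shown.

```python
import pandas as pd

# The 3x3 example from the question; column names are placeholders.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

# iloc[:, n] selects every row of the column at position n (0-based).
first_column = df.iloc[:, 0].tolist()
second_column = df.iloc[:, 1].tolist()

print(first_column)   # [1, 4, 7]
print(second_column)  # [2, 5, 8]
```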

How to modify cells in column conditionally in pandas?

I have a csv dataset which for whatever reason has an extra asterisk (*) at the end of some names. I am trying to remove them, but I'm having trouble. I just want to replace the name in the case where it ends with a *, otherwise keep it as-is.
I have tried a couple variations of the following, but with little success.
import pandas as pd
people = pd.read_csv("people.csv")
people.loc[people["name"].str[-1] == "*"] = people["name"].str[:-1]
Here I am getting the following error:
ValueError: Must have equal len keys and value when setting with an iterable
I understand why this is wrong, but I'm not sure how else to reference the values I want to change.
I could instead do something like:
starred = people.loc[people["name"].str[-1] == "*"]
starred["name"] = starred["name"].str[:-1]
I get a warning here, but this kind of works. The problem is that it only contains the previously starred people, not all of them.
I'm kind of new to this, so apologies if this is simple. I feel like it shouldn't be too hard, there should be some function to do this, but I don't know what it is.

Your syntax for pd.DataFrame.loc needs to include a column label:
df = pd.DataFrame({'name': ['John*', 'Rose', 'Summer', 'Mark*']})
df.loc[df['name'].str[-1] == '*', 'name'] = df['name'].str[:-1]
print(df)
     name
0    John
1    Rose
2  Summer
3    Mark
If you specify only the first part of the indexer, you filter by row only and get back a dataframe, so the assignment targets every column of the matching rows. You cannot assign a series to a dataframe that way.
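Since only a trailing asterisk has to go, another option is to skip the boolean mask entirely and strip it with a regular expression over the whole column. This is a sketch of that alternative, not the method from the answer above; it reuses the same made-up names.

```python
import pandas as pd

people = pd.DataFrame({"name": ["John*", "Rose", "Summer", "Mark*"]})

# \*$ matches a literal asterisk only at the end of the string,
# so names without one pass through unchanged.
people["name"] = people["name"].str.replace(r"\*$", "", regex=True)

print(people["name"].tolist())
# ['John', 'Rose', 'Summer', 'Mark']
```

Because every row is processed, the result keeps all people, not just the previously starred ones.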

Adding names and assigning data types to ASCII data

My professor uses IDL and sent me a file of ASCII data that I need to eventually be able to read and manipulate.
He used the following command to read the data:
readcol, 'sn-full.txt', format='A,X,X,X,X,X,F,A,F,A,X,X,X,X,X,X,X,X,X,A,X,X,X,X,A,X,X,X,X,F,X,I,X,F,F,X,X,F,X,F,F,F,F,F,F', $
sn, off1, dir1, off2, dir2, type, gal, dist, htype, d1, d2, pa, ai, b, berr, b0, k, kerr
Here's a picture of what the first two rows look like: http://i.imgur.com/hT7YIE3.png
Since I'm not going to be an astronomer, I am using Python but since I am new to it, I am having a hard time reading the data.
I know that his code assigns the data type A (string data) to column one, skips columns two through six by using an X, then assigns the data type F (floating point) to column seven, and so on. Then sn is assigned to the first column that isn't skipped, off1 to the next, and so on.
I have been trying to replicate this by using either numpy.loadtxt("sn-full.txt") or ascii.read("sn-full.txt") but am not sure how to enter the dtype parameter. I know I could assign everything to be a certain data type, but how do I assign data types to individual columns?

Using astropy.io.ascii you should be able to read your file relatively easily:
from astropy.io import ascii
# Give names for ALL of the columns, as there is no easy way to skip columns
# for a table with no column header.
colnames = ('sn', 'gal_name1', 'gal_name2', 'year', 'month', 'day', ...)
table = ascii.read('sn-full.txt', format='no_header', names=colnames)
This gives you a table with all of the data columns. The fact that you have some columns you don't need is not a problem unless the table is mega-rows long. For the table you showed you don't need to specify the dtypes explicitly since io.ascii.read will figure them out correctly.
One slight catch here is that the table you've shown is really a fixed width table, meaning that all the columns line up vertically. Notice that the first row begins with 1998S NGC 3877. As long as every row has the same pattern with three space-delimited columns indicating the supernova name and the galaxy name as two words, then you're fine. But if any of the galaxy names are a single word then the parsing will fail. I suspect that if the IDL readcol is working then the corresponding io.ascii version should work out of the box. If not then io.ascii has a way of reading fixed width tables where you supply the column names and positions explicitly.
[EDIT]
Looks like in this case a fixed width reader is needed to inform the parser how to split the columns instead of just using space as delimiter. So basically you need to add two rows at the top of the table file, where the first one gives the column names and the second has dashes that indicate the span of each column:
  a        b          c
---- ------------ ------
 1.2  hello there      2
 2.4       worlds      3
It's also possible in astropy.io.ascii to just specify by code the start and stop position of each column if you don't have the option of modifying the input data file, e.g.:
>>> ascii.read(table, format='fixed_width_no_header',
...            names=('Name', 'Phone', 'TCP'),
...            col_starts=(0, 9, 18),
...            col_ends=(5, 17, 28),
...            )

http://casa.colorado.edu/~ginsbura/pyreadcol.htm looks like it does what you want. It emulates IDL's readcol function.
Another possibility is https://pypi.python.org/pypi/fortranformat. It looks like it might be more capable and the data you're looking at is in fixed format and the format specifiers (X, A, etc.) are fortran format specifiers.

I would use Pandas for that particular purpose. The easiest way to do it, assuming your columns are separated by single tabs, is:
import numpy as np
import pandas as pd
mydata = pd.read_table(
    'filename.dat', sep='\t', header=None,
    names=['sn', 'gal_name1', 'gal_name2', 'year', 'month',...],
    dtype={'sn':object, 'gal_name1':object, 'year':np.int64, ...},)
(Strings here fall into the general 'object' datatype.)
Each column now has a name and can be accessed as mydata['colname'], which can then be sliced like a regular numpy 1D array, e.g. mydata['colname'][20:50].
Pandas has built-in plotting calls to matplotlib, so you can quickly get an overview of a numerical type column by mydata['column'].plot(), or two different columns against each other as mydata.plot('col1', 'col2'). All normal plotting keywords can be passed.
If you want to plot the data in a normal matplotlib routine, you can just pass the columns to matplotlib, where they will be treated as ordinary Numpy vectors.
Each column can be accessed as an ordinary Numpy vector as mydata['colname'].values.
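EDIT
If your data are not uniformly separated, numpy's genfromtxt() function is a better fit. You can then convert the resulting array to a Pandas DataFrame and set per-column dtypes with astype(), since the DataFrame constructor accepts only a single dtype:
mydf = pd.DataFrame(myarray, columns=['col1', 'col2', ...]).astype(
    {'col1':'float64', 'col2':object, ...})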
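To mirror what the X entries do in the IDL format string, pandas can also drop unwanted columns at read time through usecols, with dtypes given per column. The following is only a sketch on invented, tab-separated data: the values and the column names unused and gal are placeholders, and the real file may need a different separator or a fixed-width reader.

```python
import io
import pandas as pd

# Invented tab-separated sample standing in for the real file.
sample = io.StringIO(
    "SN_A\tskip\t17.0\tGAL_A\n"
    "SN_B\tskip\t11.5\tGAL_B\n"
)

table = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["sn", "unused", "dist", "gal"],     # name every column in the file
    usecols=["sn", "dist", "gal"],             # keep only these, like skipping with X
    dtype={"sn": str, "dist": "float64", "gal": str},
)

print(table)
```

Columns left out of usecols are never loaded, so they do not need a dtype entry.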
