I am trying to split a column of data into multiple columns on the delimiter ",".
But it should also split the data when it encounters ",,,,".
Basically, it should treat ",,,," the same as ",," and split on either, i.e. on any run of two or more commas.
My code:
import pandas as pd
df = pd.DataFrame()
df['data'] = data  # data is the raw list of strings (not shown here)
df
df.columns = ['header']
final = df["header"].str.split(",,", n=2, expand=True)
final
Thanks for your help !
If you just need to split a string on more than one delimiter,
you can use re.split(pattern=',,,,|,,', string=your_string)
after importing re.
If you need something specific for Pandas, I don't know that.
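For instance, a quick sketch with made-up data:

import re

# ",,,," is tried before ",," so a run of four commas produces one split, not two
parts = re.split(',,,,|,,', 'a,,b,,,,c')
print(parts)  # ['a', 'b', 'c']

An equivalent pattern is r",{2,}" (two or more commas), which also covers runs of three, five, and so on.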
I want to split all the columns of a dataframe on two or more whitespace characters, since all the columns have the same format. I know how to do this on one or a few columns, but I'm stuck implementing the same code on all the columns. The code below works for one column. Would really appreciate any help on this.
df2 = df1['Col 1'].str.split(r"\s{2,}", expand=True)
simply do this: since str.split(..., expand=True) returns a DataFrame for each column, split every column and concatenate the pieces:
df = pd.concat([df[col].str.split(r"\s{2,}", expand=True) for col in df.columns], axis=1)
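For example, on a small made-up frame:

import pandas as pd

# hypothetical data: every cell holds two fields separated by 2+ spaces
df = pd.DataFrame({"Col 1": ["a  b", "c   d"], "Col 2": ["e  f", "g    h"]})
out = pd.concat([df[c].str.split(r"\s{2,}", expand=True) for c in df.columns], axis=1)
print(out)
#    0  1  0  1
# 0  a  b  e  f
# 1  c  d  g  h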
I am using this code, but instead of a new file with just the required rows, I'm getting an empty .csv with just the header.
import pandas as pd
df = pd.read_csv("E:/Mac&cheese.csv")
newdf = df[df["fruit"]=="watermelon"+"*"]
newdf.to_csv("E:/Mac&cheese(2).csv",index=False)
I believe the problem is in how you select the rows containing the word "watermelon". Instead of:
newdf = df[df["fruit"]=="watermelon"+"*"]
Try:
newdf = df[df["fruit"].str.contains("watermelon")]
In your example, pandas is looking for cells whose entire value is literally "watermelon*".
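For instance, with a tiny made-up frame:

import pandas as pd

df = pd.DataFrame({"fruit": ["watermelon", "watermelon seedless", "apple"]})

print(df[df["fruit"] == "watermelon" + "*"])       # empty: nothing equals "watermelon*"
print(df[df["fruit"].str.contains("watermelon")])  # both watermelon rows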
You were also missing the underscore in pd.read_csv on the first call, and the actual file location looks incorrect; it's missing the // in the path.
I am trying to create a dataframe where the column lengths are not equal. How can I do this?
I was trying to use groupby, but I don't think that is the right way.
import pandas as pd
data = {'filename':['file1','file1'], 'variables':['a','b']}
df = pd.DataFrame(data)
grouped = df.groupby('filename')
print(grouped.get_group('file1'))
Above is my sample code, the output of which is:

  filename variables
0    file1         a
1    file1         b
What can I do to just have one entry of 'file1' under 'filename'?
Eventually I need to write this to a csv file.
Thank you
If you only have one entry in a column, the others will be NaN, so you could just filter out the NaNs with something like df = df[df["filename"].notnull()].
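If the end goal is just a CSV where 'file1' appears once, here is another sketch (assuming you want the repeated labels blanked rather than dropped; the output path is hypothetical):

import pandas as pd

data = {'filename': ['file1', 'file1'], 'variables': ['a', 'b']}
df = pd.DataFrame(data)

# blank the repeated filename entries so each label is written only once
df.loc[df['filename'].duplicated(), 'filename'] = ''
df.to_csv('output.csv', index=False)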
Please be gentle, total Python newbie here. I'm currently writing a script which is turning out to be very long, and I thought there must! be a for-loop method to make this easier. I'm currently going through a CSV, pulling the header titles, and placing each one within a str.replace call, manually.
df['Col 1'] = df['Col 1'].str.replace('text','replacement')
I figured it would start like this.. but no idea how to proceed!
Import pandas as pd
df = pd.read_csv('file.csv')
for row in df.columns:
if (df[,:] =...
Sorry I know this probably looks terrible, but this is all I could fathom with my limited knowledge!
Thanks!
jezrael's comment solved it much more elegantly.
But, in case you needed specific code for each column it would go something like this:
import pandas as pd
df = pd.read_csv('file.csv')
for column in df.columns:
    df[column] = df[column].str.replace('text', 'replacement')
No worries! We've all been there.
Your import statement should be lowercase: import pandas as pd
In your for loop, I think there's a misunderstanding of what you'll be iterating over. The for row in df.columns will iterate over the column names, not the rows.
Is it correct to say that you'd like to convert the column names to strings?
You can do a multiple-column replacement in one shot with replace by passing in a dictionary.
Say you want to replace t1 with r1 in column a, and t2 with r2 in column b; you can do
df.replace({"a": {"t1": "r1"}, "b": {"t2": "r2"}})
Note that, unlike str.replace, DataFrame.replace matches whole cell values by default; pass regex=True if you need substring replacement.
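For instance, a small sketch showing the difference (the data here is made up):

import pandas as pd

df = pd.DataFrame({"a": ["t1", "x"], "b": ["t2 stuff", "t2"]})

# whole-value match: only the exact cell "t2" changes
print(df.replace({"a": {"t1": "r1"}, "b": {"t2": "r2"}}))

# substring match: "t2 stuff" becomes "r2 stuff"
print(df.replace({"b": {"t2": "r2"}}, regex=True))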
df = pd.read_csv('file.csv',usecols=['List of column names you want to use from your csv'],
names=['list of names of column you want your pandas df to have'])
You should read the docs and identify the parameters that matter in your case.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
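As a sketch (the column names here are hypothetical, and the file is assumed to have exactly three columns plus a header row):

import pandas as pd

# header=0 tells pandas to discard the file's own header row and use `names`
# instead; usecols then refers to the new names
df = pd.read_csv('file.csv', header=0,
                 names=['col_a', 'col_b', 'col_c'],
                 usecols=['col_a', 'col_c'])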
I am trying to import a file from xlsx into a Python Pandas dataframe. I would like to prevent fields/columns being interpreted as integers and thus losing leading zeros or other desired heterogenous formatting.
So for an Excel sheet with 100 columns, I would do the following, using a dict comprehension with range(100).
import pandas as pd
filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(100)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files do have a varying number of columns all the time, and I am looking to handle this differently than changing the range manually all the time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
# xl.book is the underlying xlrd workbook (see the UPDATE below), so this relies on the xlrd engine
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i: str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel()
import pandas as pd
filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution is:
1. Read in one row of data just to get the column names and the number of columns.
2. Create the converters dictionary automatically, mapping each column to str.
3. Re-read the full data using the dictionary created at step 2.
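A minimal sketch of those steps (file name as in the question; nrows=0 assumes a reasonably recent pandas):

import pandas as pd

filename = r'C:\DemoFile.xlsx'

# 1. read only the header row to discover the columns
header = pd.read_excel(filename, nrows=0)

# 2. build the converters dict automatically, one str entry per column
converters = {col: str for col in header.columns}

# 3. re-read the full sheet with every field treated as a string
df = pd.read_excel(filename, converters=converters)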