This question already has answers here:
Create Pandas DataFrame from a string
(7 answers)
How to make good reproducible pandas examples
(5 answers)
Closed 3 years ago.
Often people ask questions on Stack Overflow that include the output of print(dataframe). It is convenient to have a way of quickly loading that data into a pandas.DataFrame object.
What are the recommended ways of loading a dataframe from a dataframe-string (which may or may not be properly formatted)?
Example-1
If you want to load the following string as a dataframe what would you do?
# Dummy Data
s1 = """
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
"""
Example-2
This type is closer to what you would find in a CSV file.
# Dummy Data
s2 = """
Client, NumberOfProducts, ID
A, 1, 2
A, 5, 1
B, 1, 2
B, 6, 1
C, 9, 1
"""
Expected Output
References
Note: The following two links do not address the specific situation presented in Example-1. The reason I think my question is not a duplicate is that I think one cannot load the string in Example-1 using any of the solutions already posted on those links (at the time of writing).
Create Pandas DataFrame from a string. Note that pd.read_csv(StringIO(s1), sep), as suggested here, doesn't really work for Example-1. You get the following output.
This question was marked as a duplicate of two Stack Overflow questions. One of them is the one above, which fails to address the case presented in Example-1. The second one is "How to make good reproducible pandas examples". Among all the answers presented there, only one looked like it might work for Example-1, but it did not.
# could not read the clipboard and threw error
pd.read_clipboard(sep=r'\s\s+')
Error Thrown:
PyperclipException:
Pyperclip could not find a copy/paste mechanism for your system.
For more information, please visit https://pyperclip.readthedocs.org
I can suggest two methods to approach this problem.
Method-1
Process the string with regex and numpy to make the dataframe. In my experience this works most of the time. It works for the case presented in "Example-1".
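import pandas as pd
import numpy as np
import re
# Make Dataframe
s = s1  # the string from Example-1
ncols = 3  # number of columns
# collapse every run of whitespace (including newlines) into a single comma
ss = re.sub(r'\s+', ',', s.strip())
# first row holds the column names, the remaining rows hold the data
sa = np.array(ss.split(',')).reshape(-1, ncols)
df = pd.DataFrame(dict((k, v) for k, v in zip(sa[0, :], sa[1:, ].T)))
df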
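As a possible alternative for whitespace-separated text like Example-1, pandas.read_csv accepts a regular expression as the separator, so the string can be read directly. This is only a sketch: it assumes that no cell value contains whitespace, and it reuses the s1 string from Example-1. Unlike Method-1, the numeric columns come back with numeric dtypes rather than as strings.

```python
from io import StringIO

import pandas as pd

# Same content as s1 in Example-1
s1 = """
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
"""

# Treat any run of whitespace as the column separator;
# the leading blank line is skipped by default
df = pd.read_csv(StringIO(s1), sep=r'\s+')
print(df)
```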
Method-2
Use io.StringIO to feed the string into pandas.read_csv(). This works only if the separator is well defined, for instance if your data looks similar to "Example-2". Source credit
import pandas as pd
from io import StringIO
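# Make Dataframe
s = s2  # the string from Example-2
# skipinitialspace drops the blank that follows each comma
df = pd.read_csv(StringIO(s), sep=',', skipinitialspace=True)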
Output
This question already has answers here:
Split one column into multiple columns by multiple delimiters in Pandas
(2 answers)
Closed 1 year ago.
I'd like to begin processing some data for analysis, but I have to separate the responses into multiple values. Currently each column contains one value that combines 3 responses: Agree: #score, Disagree: #score, Neither agree nor disagree: #score. I'd like to separate the responses in each column into individual values so I can build an analysis for a visualization. Would I need to use a regular expression to do this?
So far the code I have just loads the data, along with some libraries I plan to use:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def load_data():
    # importing datasets
    df = pd.read_csv('dataset.csv')
    return df
load_data().head()
You need to use str.split(';') to first split the values into multiple columns. Then for each column value, split the string again using str.split(':') but take [-1] part of it.
Here's how you can do it.
import pandas as pd
df = pd.DataFrame({'username':['Dragonfly','SpeedHawk','EagleEye'],
'Question1':['Comfortable:64;Neither comfortable nor uncomfortable:36',
'Comfortable:0;Neither comfortable nor uncomfortable:100',
'Comfortable:10;Neither comfortable nor uncomfortable:90'],
'Question2':['Agree:46;Disagree:13;Neither agree nor disagree:41',
'Agree:96;Disagree:0;Neither agree nor disagree:4',
'Agree:90;Disagree:5;Neither agree nor disagree:5']})
df[['Q1_Comfortable','Q1_Neutral']] = df['Question1'].str.split(';',expand=True)
df[['Q2_Agree','Q2_Disagree','Q2_Neutral']] = df['Question2'].str.split(';',expand=True)
df.drop(columns=['Question1','Question2'],inplace=True)
for col in df.columns[1:]:
    df[col] = df[col].str.split(':').str[-1]
print(df)
The output of this will be:
username Q1_Comfortable Q1_Neutral Q2_Agree Q2_Disagree Q2_Neutral
0 Dragonfly 64 36 46 13 41
1 SpeedHawk 0 100 96 0 4
2 EagleEye 10 90 90 5 5
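One thing to keep in mind is that str.split returns strings, so the scores above are still text. For analysis or plotting they probably need to be converted to numbers. The sketch below repeats the same split on Question2 only and casts the result with astype(int); the names parts and scores are illustrative and not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'username': ['Dragonfly', 'SpeedHawk', 'EagleEye'],
    'Question2': ['Agree:46;Disagree:13;Neither agree nor disagree:41',
                  'Agree:96;Disagree:0;Neither agree nor disagree:4',
                  'Agree:90;Disagree:5;Neither agree nor disagree:5'],
})

# First split on ';' into one column per response
parts = df['Question2'].str.split(';', expand=True)
parts.columns = ['Q2_Agree', 'Q2_Disagree', 'Q2_Neutral']

# Then keep the text after the last ':' and convert it to an integer
scores = parts.apply(lambda col: col.str.split(':').str[-1].astype(int))
print(scores.dtypes)
```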
This question already has answers here:
Problem with getting rid of specific columns [closed]
(2 answers)
Closed 3 years ago.
I have code that slices data and is then supposed to calculate different indices according to the columns.
My code worked well, but today I had to slice the data differently, and since then I get a KeyError whenever I try to compute the indices.
Unfortunately I can't share my original data, but I hope this code helps in understanding what happened here.
This is my code with some explanations:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_plants = pd.read_csv('my_data')
#My data contains columns with numerical data, and their column titles are numbers
#here I have changed the number titles into floats
float_cols = [float(i) for i in df_plants.columns.tolist()[4:] if type(i)==str]
df_plants.columns.values[4:] = float_cols
#detector edges removal
#Here my goal is to remove some of the columns that have wrong data.
#this part was added today and might be the reason for the problem
cols = df_plants.columns.tolist()
df_plants=df_plants[cols[:4] + cols[11:]].copy()
#Trying to calculate indices:
filter_plants['NDVI']=(filter_plants['801.03']-filter_plants['680.75'])/(filter_plants['801.03']+filter_plants['680.75'])
KeyError: '801.03'
In order to solve this problem I tried adding these lines again before the calculation:
float_cols = [float(i) for i in df_plants.columns.tolist()[4:] ]
df_plants.columns.values[4:] = float_cols
but I still got the KeyError.
My end goal is to be able to do calculations with my indices, which I believe is related to the change in the type of the column labels.
Try changing the last line to:
filter_plants['NDVI']=(filter_plants[801.03]-filter_plants[680.75])/(filter_plants[801.03]+filter_plants[680.75])
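The likely reason this helps is that once the column labels have been converted to floats, the string '801.03' no longer matches any label, while the float 801.03 does. A small self-contained sketch of that behaviour follows. The frame and its values are made up for illustration, and the labels are converted by assigning to df.columns, which is an assumption rather than the asker's exact approach.

```python
import pandas as pd

# Made-up reflectance values; the labels start out as strings,
# as they would after pd.read_csv
df = pd.DataFrame({'801.03': [0.5, 0.6], '680.75': [0.1, 0.2]})

# Convert the labels to floats
df.columns = [float(c) for c in df.columns]

print('801.03' in df.columns)  # the string label is gone
print(801.03 in df.columns)    # the float label is present

# Index with floats, not strings
ndvi = (df[801.03] - df[680.75]) / (df[801.03] + df[680.75])
print(ndvi)
```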
This question already has answers here:
How to copy/paste DataFrame from Stack Overflow into Python
(3 answers)
Closed 4 years ago.
For pandas-related questions on Stack Overflow, people usually provide their sample data like below:
a b c d e
0 -0.420430 -0.394562 0.760232 0.152246 -0.671229
1 0.388447 0.676054 -0.058273 -0.246588 0.811332
2 -0.498263 -0.108011 0.952489 0.504729 -0.385724
3 1.069371 0.143752 0.414916 -1.180362 -0.029045
4 -0.245684 -0.150180 0.210579 0.063154 0.261488
5 0.064939 -0.396667 0.857411 -0.460206 0.039658
What's the most efficient way to create the data in my own Jupyter notebook, so I can further investigate the question?
Usually, I copy the data above into Notepad, replace the spaces with commas, and use the following code to create the sample data:
data = np.array([-0.420430,-0.394562,0.760232,0.152246,...]) # paste the result from notepad here
df = pd.DataFrame(data.reshape(-1,5),columns=[HEADERS_OF_DATA]) # 5 is number of columns
However, this is quite slow and inconvenient. Is there any faster way to do so?
Wonderfully, you can do this with pd.read_clipboard().
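Just copy the posted DataFrame from the question, and then this line of code will parse the clipboard contents into a DataFrame (pandas hands the text to its delimited-text reader, pd.read_table() in older releases and pd.read_csv() in newer ones):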
df = pd.read_clipboard()
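If no clipboard is available (for example on a remote notebook server), a similar result can probably be obtained by pasting the text into a string and reading it with a whitespace separator. This is a sketch under the assumption that no value contains spaces. Because the data rows have one more field than the header row, pandas uses the first column as the index.

```python
from io import StringIO

import pandas as pd

# First rows of the sample data, pasted as plain text
text = """
a b c d e
0 -0.420430 -0.394562 0.760232 0.152246 -0.671229
1 0.388447 0.676054 -0.058273 -0.246588 0.811332
2 -0.498263 -0.108011 0.952489 0.504729 -0.385724
"""

# Any run of whitespace separates two fields
df = pd.read_csv(StringIO(text), sep=r'\s+')
print(df)
```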
I'm a Python beginner working with a large CSV file of online order data.
I'm trying to see what skus people most frequently purchase with a specific sku, we'll call it grey-shirt711.
I'm struggling to express how to say "show all orders that contain grey-shirt771 and at least one other sku". I keep merely retrieving all orders that have grey-shirt711 in it, which 90% of the time is only that sku.
Assuming I'm only dealing with these two columns ('sku' and 'orderID'), what's the simplest way I could express this statement?
Thank you!
We'd like to help, but you need to be a little more specific. Can you provide an example of what you've tried? Can you show us how you're reading in the data? As Boris suggests, you'll likely want to do this using Pandas. Here's a snippet that will filter a dataframe on a column of your choosing:
import pandas as pd
import numpy as np
d = {'col1': [1, 2], 'col2': [3, 4]} # Should be your data import line...
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100, 2)),
columns=['sku','orderID'])
#%% Alternatively, load your data using Pandas by uncommenting the lines below
# df = pd.read_excel('path_to_your_file') #If using excel
# Method 1
filter1 = 6 #replace 6 with grey-shirt771
filter2 = 3 # replace this with another sku of interest
df_items_of_interest1 = df[(df['sku'] == filter1) | (df['sku'] == filter2)]
# Method 2
filter1 = 'sku == 6'
filter2 = 'sku == 3'
df_items_of_interest2 = df.query(filter1 + '|' + filter2)
# Method 3
df_items_of_interest3 = df[df['sku'].isin([6,3])]
Refer to this SO Post and the Pandas documentation for clarity.
I hope that helps. On behalf of the Stack Overflow community, I say welcome. To maximize the value you'll get from using this site (and to help us help you) try out some of these tips
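For the specific request in the question (orders that contain a given sku and at least one other sku), one possible approach is to compute two per-order flags with groupby and transform. The sketch below uses made-up orders, and 'grey-shirt' is a stand-in for the real sku; only the column names sku and orderID come from the question.

```python
import pandas as pd

# Made-up order lines for illustration
df = pd.DataFrame({
    'orderID': [1, 1, 2, 3, 3, 3, 4, 4],
    'sku': ['grey-shirt', 'blue-hat', 'grey-shirt', 'grey-shirt',
            'red-sock', 'blue-hat', 'blue-hat', 'red-sock'],
})
target = 'grey-shirt'  # stand-in for the sku of interest

# True on every row of an order that contains the target sku
has_target = df['sku'].eq(target).groupby(df['orderID']).transform('any')
# Number of distinct skus in the order, repeated on every row of that order
n_skus = df.groupby('orderID')['sku'].transform('nunique')

# Orders with the target sku and at least one other sku
basket = df[has_target & (n_skus > 1)]

# How often each other sku appears alongside the target
co_purchased = basket.loc[basket['sku'] != target, 'sku'].value_counts()
print(co_purchased)
```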
Being able to define the ranges in a manner similar to excel, i.e. 'A5:B10' is important to what I need so reading the entire sheet to a dataframe isn't very useful.
So what I need to do is read the values from multiple ranges in the Excel sheet to multiple different dataframes.
valuerange1 = ['a5:b10']
valuerange2 = ['z10:z20']
df = pd.DataFrame(values from valuerange)
df = pd.DataFrame(values from valuerange1)
or
df = pd.DataFrame(values from ['A5:B10'])
I have searched but either I have done a very poor job of searching or everyone else has gotten around this problem but I really can't.
Thanks.
Using openpyxl
Since you have indicated that you are looking for a very user-friendly way to specify the range (like the Excel syntax), and as Charlie Clark already suggested, you can use openpyxl.
The following utility function takes a range string and a worksheet and returns a pandas DataFrame:
import re
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils import get_column_interval
def load_workbook_range(range_string, ws):
    # column letters of the range, e.g. 'B1:C2' gives 'B' and 'C'
    col_start, col_end = re.findall("[A-Z]+", range_string)
    # collect the cell values row by row
    data_rows = []
    for row in ws[range_string]:
        data_rows.append([cell.value for cell in row])
    return pd.DataFrame(data_rows, columns=get_column_interval(col_start, col_end))
Usage:
wb = load_workbook(filename='excel-sheet.xlsx',
read_only=True)
ws = wb.active
load_workbook_range('B1:C2', ws)
Output:
B C
0 5 6
1 8 9
Pandas only Solution
Given the following data in an excel sheet:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
You can load it with the following command:
pd.read_excel('excel-sheet.xlsx')
If you want to limit the data being read, the pandas.read_excel method offers a number of options. Use usecols, skiprows and skipfooter (called parse_cols and skip_footer in older pandas releases) to select the specific subset that you want to load:
pd.read_excel(
    'excel-sheet.xlsx',    # name of the Excel file
    names=['B','C'],       # new column header
    skiprows=range(0,1),   # list of rows you want to omit at the beginning
    skipfooter=1,          # number of rows you want to skip at the end
    usecols='B:C'          # columns to parse (note the Excel-like syntax)
)
Output:
B C
0 5 6
1 8 9
Some notes:
The API of the read_excel method is not meant to support more complex selections. In case you require a complex filter it is much easier (and cleaner) to load the whole data into a DataFrame and use the excellent slicing and indexing mechanisms provided by pandas.
The easiest way is to use pandas to get the range of values from Excel.
import pandas as pd
# to read a single range of columns, use the following
src=pd.read_excel(r'August.xlsx',usecols='A:C',sheet_name='S')
# to read several column ranges at once, here A:C together with G:I
src=pd.read_excel(r'August.xlsx',usecols='A:C,G:I',sheet_name='S')
If you want a particular range, for example "B3:E5", you can use the following structure. Here header=2 makes sheet row 3 the header, and [0:2] keeps the two data rows below it.
src=pd.read_excel(r'August.xlsx',usecols='B:E',sheet_name='S',header=2)[0:2]
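To get closer to the 'A5:B10' style asked for in the question, the range string could be translated into read_excel keyword arguments. The helper below is a hypothetical sketch (range_to_read_excel_kwargs is not part of pandas). It assumes a simple rectangular range and treats every row in the range as data (header=None), which differs from the header=2 example above.

```python
import re


def range_to_read_excel_kwargs(range_string):
    """Translate an Excel-style range such as 'B3:E5' into read_excel kwargs."""
    match = re.fullmatch(r'([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)', range_string.strip())
    if match is None:
        raise ValueError(f'not a rectangular range: {range_string!r}')
    col_start, row_start, col_end, row_end = match.groups()
    first, last = int(row_start), int(row_end)
    return {
        'usecols': f'{col_start.upper()}:{col_end.upper()}',  # e.g. 'B:E'
        'skiprows': first - 1,       # rows above the range
        'nrows': last - first + 1,   # rows inside the range
        'header': None,              # every row in the range is data
    }


kwargs = range_to_read_excel_kwargs('B3:E5')
print(kwargs)
# Possible usage, assuming the workbook from the answer above exists:
# src = pd.read_excel(r'August.xlsx', sheet_name='S', **kwargs)
```

Each range then needs one call, so the multiple ranges in the question would become multiple DataFrames.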