My Excel worksheet has some columns, one of which is a Python list-like column. If I import this Excel data using pandas.read_excel, is it possible for pandas to recognize that column as a list at this stage or later? I am asking because I have comma-separated values residing in Excel and I want to use pandas' explode() after importing the Excel file.
I tried to wrap the Excel cells with [""] but the importing and exploding did not work as desired. Any guidance?
Thanks!
import pandas as pd

data = {
    "Name": ["A", "B", "C", "D"],
    "Product Sold": [["Apple", "Banana"], ["Apple", "Pear"], ["Pear"], ["Berry"]],
    "Prices": [[5, 6], [5, 8], [4], [3]],
}
df = pd.DataFrame(data)
df.explode(['Product Sold', 'Prices'])
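For reference, exploding both columns at once (supported in pandas 1.3+) should give something like:

  Name Product Sold Prices
0    A        Apple      5
0    A       Banana      6
1    B        Apple      5
1    B         Pear      8
2    C         Pear      4
3    D        Berry      3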
You could try something like this:
import pandas as pd

data = {
    "Name": "Apple,Pear",
}
df = pd.DataFrame(data, index=[1])

# split every column whose cells contain commas into lists
for c in df.columns:
    if df[c].astype(str).str.contains(',').any():
        df[c] = df[c].apply(lambda x: str(x).split(','))

print(type(df.Name.iloc[0]))
Read in your Excel file, then pass it through the for loop above and it should turn the comma-delimited cells into lists.
Let me know if it helps.
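Putting it together for your Excel case, a minimal sketch (the file name is hypothetical, and I'm assuming 'Product Sold' and 'Prices' hold comma-separated text):

import pandas as pd

df = pd.read_excel('sales.xlsx')  # hypothetical file name

# split the comma-separated cells into Python lists
for c in ['Product Sold', 'Prices']:
    df[c] = df[c].astype(str).str.split(',')

# exploding several columns at once requires pandas >= 1.3
df = df.explode(['Product Sold', 'Prices'])
# note: values stay strings here; convert Prices with pd.to_numeric afterwards if needed
print(df)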
I have a question about storing data from .dat files in the right row of a dataframe. Here is a minimal example.
I have already a dataframe like this:
import pandas as pd

data = {'col1': [1, 2, 3, 4], 'col2': ["a", "b", "c", "d"]}
df = pd.DataFrame(data, index=['row_exp1', 'row_exp2', 'row_exp3', 'row_exp4'])
Now I want to add a new column called col3 with a numpy array in each cell. Thus, I will have 4 numpy arrays, one in every cell.
I get the numpy arrays from a .dat file.
The important part is that I have to find the right row. I have 4 .dat files, and every .dat file matches a row name. For example, the first .dat file has the name 230109_exp3_foo.dat, so it matches the third row of my dataframe.
Then the algorithm has to put the data from the .dat file in the right cell:
          col1  col2  col3
row_exp1  1     a     NaN
row_exp2  2     b     NaN
row_exp3  3     c     [1,2,3,4,5,6]
row_exp4  4     d     NaN
The other entries should be NaN and I would fill them with the right numpy array in the next loop.
I think the difficult part is to select the right row and to match it with the file name of the .dat file.
If you're working with time series data, this isn't how you want to structure your dataframe. Read up on "tidy" data. (https://r4ds.had.co.nz/tidy-data.html)
Every column is a variable. Every row is an observation.
So let's assume you're loading your data with a function called load_data that accepts a file name:
def load_data(filename):
# load the data, fill in your own details
pass
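If your .dat files are plain numeric text, load_data could be as simple as the sketch below (an assumption about the file format; adjust the delimiter, skiprows, etc. to match your actual files):

import numpy as np

def load_data(filename):
    # assumes one numeric value per line; returns a 1-D numpy array
    return np.loadtxt(filename)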
Then you would build up your dataframe like this:
meta_data = {
    'col1': [1, 2, 3, 4],
    'col2': ["a", "b", "c", "d"],
}

list_of_dataframes = []
for n, fname in enumerate(filenames):
    this_array = load_data(fname)
    list_of_dataframes.append(
        pd.DataFrame({
            'row_num': list(range(len(this_array))),
            'col1': meta_data['col1'][n],
            'col2': meta_data['col2'][n],
            'values': this_array,
        })
    )

df = pd.concat(list_of_dataframes, ignore_index=True)
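With the data in this long form, pulling out one experiment's values or aggregating per experiment is a plain filter or groupby, for example (using the frame built above):

# all values belonging to the experiment labelled 'c'
exp3_values = df.loc[df['col2'] == 'c', 'values'].to_numpy()

# or a summary per experiment
print(df.groupby('col2')['values'].mean())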
Maybe this helps:
# Do you have a similar pattern in each .dat file name? (I assume that yes)
list_of_files = ['230109_exp3_foo.dat', '230109_exp2_foo.dat', '230109_exp1_foo.dat', '230109_exp4_foo.dat']

# for each index, try to find the value after 'row_' in the file list
files_match = df.reset_index()['index'].map(lambda x: [y for y in list_of_files if x.replace('row_', '') in y])

# if I understand correctly, you know how to read a .dat file,
# so you can insert your function instead of function_for_reading_dat_file
df['col3'] = files_match.map(lambda x: function_for_reading_dat_file(x[0]) if len(x) != 0 else 'None')
I have a dictionary into which I read an SQL table:
df = {}
df['abc'] = pd.read_sql_table('abc', engine, schema_z2, columns=cols)
Now, I want to filter the data such that only rows with the values "R" and "P" in the art column are kept. This is what I tried after reading a code snippet somewhere.
df_merged = df['abc'][df['abc']['art'].isin(['R','P'])]
print(df_merged)
When I hover over df_merged in Visual Studio Code, it says that it is a dataframe, which is what I wanted. However, at the end when I run my code, df_merged is empty, even though it should have rows. I could be using the wrong syntax here: df['abc'][df['abc']['art'].isin(['R','P'])] but I am unable to identify how to change it.
A similar question, How to filter Pandas dataframe using 'in' and 'not in' like in SQL, does not help because I am already using isin() and I am trying to filter values from a dictionary, not a dataframe, initially.
and if I just do this:
df_merged =df['abc']['art_kennz'].isin(['R','P','SP','GP'])
df_merged shows a Series[_bool] type instead of Dataframe.
Edit:
I tried this with the following test data https://docs.google.com/spreadsheets/d/1cykNjViW_DacwWZNaIHWEh3E8OqxsVon/edit?usp=sharing&ouid=115380043465372211112&rtpof=true&sd=true:
import pandas as pd
df = {}
df['abc'] = pd.read_excel('./testing.xlsx')
print(df)
df_merged = df['abc'][df['abc']['art_kennz'].isin(['S','P','SP','GP'])]
df_merged.head()
and I get an empty dataset upon printing, which shouldn't be the case
Try:
df_merged = df[df['art'].isin(['R','P'])]['abc']
I seem to be able to filter out what you want just fine:
import pandas as pd
df = {}
df["abc"] = pd.DataFrame({"art": ["A", "B", "C", "P", "R", "S"], "bert": [1, 2, 3, 4, 5, 6]})
df_merged = df["abc"][df["abc"]["art"].isin(["R", "P"])]
df_merged
#  art  bert
#3   P     4
#4   R     5
df_merged shows a Series[_bool] type instead of Dataframe.
Should be correct, since the pandas.Series.isin() method returns a boolean Series marking the rows that satisfy the condition, so you can easily filter using it:
df["abc"]["art"].isin(["R", "P"])
#3 True
#4 True
#Name: art, dtype: bool
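The boolean Series is just a mask; passing it back into the frame (or to .loc) gives the filtered rows rather than the mask itself:

mask = df["abc"]["art"].isin(["R", "P"])
df["abc"].loc[mask]
#  art  bert
#3   P     4
#4   R     5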
When I use the following dict:
dfs = {}
dfs["abc"] = pd.DataFrame({
    "mandant": [9, 9, 9, 9],
    "fk_eakopf_posnr": [552025046, 552025047, 552035009, 552035009],
    "zeile": [5, 5, 5, 5],
    "art_kennz": ["G", "G", "G", "S"],
})
I perform the following process:
dfs["abc"][dfs["abc"]['art_kennz'].isin(['S','P','SP','GP'])]
Then I get the following output:
mandant fk_eakopf_posnr zeile art_kennz
3 9 552035009 5 S
Is this what you want?
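If your real data still comes back empty even though those values should be present, it may be worth inspecting the actual cell contents first; a small debugging sketch (not part of the original answers, and assuming stray whitespace is a possible culprit):

# see which values really occur in the column
print(dfs["abc"]["art_kennz"].unique())

# strip whitespace before comparing, in case cells contain e.g. "S "
mask = dfs["abc"]["art_kennz"].astype(str).str.strip().isin(["S", "P", "SP", "GP"])
print(dfs["abc"][mask])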
I am new to pandas, and I would appreciate any help. I have a pandas dataframe that comes from a CSV file. The data contains 2 columns: dates and cashflows. Is it possible to convert these columns into a list of tuples? Here is how my dataset looks:
2021/07/15 4862.306832
2021/08/15 3474.465543
2021/09/15 7121.260118
The desired output is:
[(2021/07/15, 4862.306832),
(2021/08/15, 3474.465543),
(2021/09/15, 7121.260118)]
Use apply with a lambda function:
import pandas as pd

data = {
    "date": ["2021/07/15", "2021/08/15", "2021/09/15"],
    "value": ["4862.306832", "3474.465543", "7121.260118"],
}
df = pd.DataFrame(data)
listt = df.apply(lambda x: (x["date"], x["value"]), axis=1).tolist()
Output:
[('2021/07/15', '4862.306832'),
('2021/08/15', '3474.465543'),
('2021/09/15', '7121.260118')]
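Alternatively, the same list of tuples can be built without apply; both of the following are standard pandas/Python idioms:

# zip the two columns together
listt = list(zip(df["date"], df["value"]))

# or let pandas emit plain tuples row by row
listt = list(df.itertuples(index=False, name=None))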
Hi, I have a very large dataset in a CSV file, which I read into a pandas dataframe. One column has JSON strings from which I want to extract values into new columns. The pic below shows a few rows of my CSV file.
The fourth column (data) is the one that needs to be extracted. The key at the first level (605, 254, 834, 265, etc.) is always changing, but the number is always the same as that in the last column ('reg'). I want to extract the values of 'price', 'status' and '#result' and put them in new columns.
The code I am using is
import pandas as pd
import numpy as np
import json

df = pd.read_csv('sample.csv')

df["result"] = np.nan  # create empty column
df["price"] = np.nan
df["status"] = np.nan

for i in range(0, len(df['data'])):
    df['result'].iloc[i] = json.loads(df['data'].iloc[i])[str(df['reg'].iloc[i])]['#result']
    df['price'].iloc[i] = json.loads(df['data'].iloc[i])[str(df['reg'].iloc[i])]['price']
    df['status'].iloc[i] = json.loads(df['data'].iloc[i])[str(df['reg'].iloc[i])]['status']

print(df)
So I got the dataframe with new columns (result, price and status) as below:
The code gives me the output that I want. However, since I am using a for loop it takes very long to run on the big dataframe. I think there must be a cleverer way of doing this. I know there are different ways to do it if the first-level key is constant. Does anyone have a better idea for extracting this type of JSON string in a pandas dataframe?
Cheers!
In your example you're parsing the same JSON multiple times. It's enough to parse it only once. For example:
import pandas as pd
import json

d1 = '{"605":{"price":"570", "address":"946", "status": "done", "#result":"good" }}'
d2 = '{"254":{"price":"670", "address":"300", "status": "done", "classification_id": "102312321", "#result":"good" }}'

df = pd.DataFrame({'num': [1771, 905],
                   'item': ['orange', 'mango'],
                   'id': [190384, 2500003],
                   'data': [d1, d2],
                   'reg': [605, 254]})

df = df.join(pd.DataFrame(list(json.loads(d).values())[0] for d in df.pop('data')))

# drop the columns we don't want
del df['address']
del df['classification_id']

print(df)
Prints:
num item id reg price status #result
0 1771 orange 190384 605 570 done good
1 905 mango 2500003 254 670 done good
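If you would rather keep only the keys you care about than delete the rest afterwards, you can select them from the parsed frame instead; a variation on the same idea (starting again from the original df with its 'data' column):

parsed = pd.DataFrame(list(json.loads(d).values())[0] for d in df.pop('data'))
df = df.join(parsed[['price', 'status', '#result']])
print(df)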
It seems to me that it would be eminently useful for pandas to support the idea of projection (omitting or selecting columns) during data parsing.
Many JSON datasets I find have a ton of extraneous fields I don't need, or I need to parse a specific field in the nested structure.
What I do currently is pipe through jq to create a file that contains only the fields I need. This becomes the "cleaned" file.
I would prefer a method where I didn't have to create a new cleaned file every time I want to look at a particular facet or set of facets; instead I could tell pandas to load the JSON path .data.interesting and only project the fields A, B, and C.
As an example:
{
    "data": {
        "not interesting": ["milk", "yogurt", "dirt"],
        "interesting": [{ "A": "moonlanding", "B": "1956", "C": 100000, "D": "meh" }]
    }
}
Unfortunately, it seems like there's no easy way to do it on load, but if you're okay with doing it immediately after...
# drop by index
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# drop by name
df.drop(['B', 'C'], axis=1, inplace=True)
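For the nested example above, one way to approximate projecting on load without writing a cleaned file is to parse the JSON yourself and normalize only the path you want; a sketch (the file name is hypothetical, json_normalize and the field names come from the example):

import json
import pandas as pd

with open('data.json') as f:  # hypothetical file name
    obj = json.load(f)

# keep only the interesting path, then only the columns A, B, C
df = pd.json_normalize(obj['data']['interesting'])[['A', 'B', 'C']]
print(df)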