Sort each dataframe in a dictionary of dataframes - python

Thanks to @Woody Pride's answer here: https://stackoverflow.com/a/19791302/5608428, I've got to 95% of what I want to achieve.
Which is, by the way, creating a dict of sub-dataframes from a large df.
All I need to do is sort each dataframe in the dictionary. It's such a small thing but I can't find an answer on here or Google.
import pandas as pd
import numpy as np
import itertools

def points(row):
    if row['Ob1'] > row['Ob2']:
        val = 2
    else:
        val = 1
    return val

#create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16), 'Ob2': np.random.rand(16)})

#create list of unique pairs
comboNames = list(itertools.combinations(data.Names.unique(), 2))

#create a data frame dictionary to store your data frames
DataFrameDict = {elem: pd.DataFrame for elem in comboNames}
for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]

#Add test calculated column
for tbl in DataFrameDict:
    DataFrameDict[tbl]['Test'] = DataFrameDict[tbl].apply(points, axis=1)
#############################
#Checking test and sorts
#############################

#access df's to print head
for tbl in DataFrameDict:
    print(DataFrameDict[tbl].head())
    print()

#access df's to print summary
for tbl in DataFrameDict:
    print(str(tbl[0]) + " vs " + str(tbl[1]) + ": " + str(DataFrameDict[tbl]['Ob2'].sum()))
    print()

#trying to sort each df
for tbl in DataFrameDict:
    #Doesn't work
    DataFrameDict[tbl].sort_values(['Ob1'])
    #mistakenly deleted other attempts (facepalm)

for tbl in DataFrameDict:
    print(DataFrameDict[tbl].head())
    print()
The code runs, but the dataframes never come out sorted no matter what I try. I can access each df with no problem for printing etc., but .sort_values() seems to have no effect.
As an aside, creating the dfs with tuples of names as keys was/is kind of hacky. Is there a better way to do this?
Many thanks

Looks like you just need to assign the sorted DataFrame back into the dict; sort_values() returns a new, sorted DataFrame rather than sorting in place (unless you pass inplace=True):
for tbl in DataFrameDict:
    DataFrameDict[tbl] = DataFrameDict[tbl].sort_values(['Ob1'])
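As for the aside about tuple keys: they work fine, and the whole dict can also be built (and sorted) in one dict comprehension. A small sketch using only the names already defined in the question's code:
# build each sub-frame and sort it in the same step
DataFrameDict = {pair: data[data.Names.isin(pair)].sort_values(['Ob1'])
                 for pair in comboNames}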

Related

Compare entire rows for equality if some condition is satisfied

Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have scores collected for a match stored in a list, say x = [1,0,4]. I have found where in the data these scores exist using pandas and I can print "found" or "not found". However, I want my code to print out which name these scores correspond to. In this case the program should output "Charlie", since Charlie has all these values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that df == x, or equivalently df.eq(x), is not quite what you want, because it compares element-wise and returns a True/False value for every cell. So you need to aggregate each row with .all(axis=1), which keeps only the rows where every comparison was True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
You need to use DataFrame.loc, which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
Maybe try something like this:
import pandas as pd
import numpy as np

# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df

# example of the list you want to get the data from
x = [1,0,4]
#x = [2,4,3]

# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])]

# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])])

# Loop that prints out the name based on the index of mydf
# (if several names match it prints all of them; if only one matches, just that one)
for i in range(0, len(mydf)):
    print(mydf['name'].iloc[i])
You can use this. Here data is your data frame; change the name accordingly.
Considering [1,0,4] is of int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the data is of object (string) type, then use this:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])

using pandas to find the string from a column

I am a complete beginner in programming and trying to learn to code, so please bear with my bad coding. I am using pandas to find a string in a column (the combinations column in the code below) of the data frame and print the entire row containing the string. Basically I need to find all the instances where the string occurs and print the entire row each time. My code is below; I am not able to figure out how to find that particular instance of the column and print it.
import pandas as pd
data = pd.read_csv("signallervalues.csv",index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5']= data['col5'].astype(str)
data.head()
combinations = data['col1'] + data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations'] = combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
    if data['combinations'].str.contains(i).any():
        print(i + ' data occurs in row')
        # I need to print the row containing the string here
    else:
        print(i + ' is occurring only once')
my data frame looks like this
import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of column 6 and counting how many times they occur
g=data.groupby('col6')['index']
count= pd.DataFrame(g.count())
count=count.rename(columns={'index':'occurences'})
count.reset_index(inplace=True)
# create a df that keeps only the rows whose 'col6' value is in 'list_of_combinations'
count[count['col6'].isin(list_of_combinations)]
My result
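If the goal is simply to print every full row whose combined string appears more than once, a short sketch against the question's own data frame (assuming the 'combinations' column built above) could be:
# keep rows whose combination value occurs more than once, then print them whole
repeated = data[data['combinations'].duplicated(keep=False)]
print(repeated)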

Pandas Correction Previous Row

I have dataframe like this.
import pandas as pd
#create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100],
                   "Number is Corrected": [0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]})
But this dataframe has a problem: some numbers are wrong.
The previous number always has to be smaller than (or equal to) the next number, which is violated in runs like 6,4,6,7,8,7 ... 50,75,60,45,100.
I don't use df.sort because it's not about sorting, it's about correction.
Edit: I added the corrected numbers in the "Number is Corrected" column.
Guessing from your "Number is Corrected" column, you could probably use this:
import pandas as pd

#create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100]})
# "Number is Corrected": [0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]

def correction():
    df['Number is Corrected'] = df['Number']
    cache = 0
    for num, content in enumerate(df['Number is Corrected'], start=0):
        if content < cache:
            # .loc avoids pandas' chained-assignment warning
            df.loc[num, 'Number is Corrected'] = cache
        else:
            cache = content
    print(df)

if __name__ == "__main__":
    correction()
But there is some inconsistency, as in your conversation with jezrael. You may need to update the logic of the code once it becomes clearer what output you want. Good luck.
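As a side note, the loop above amounts to a running maximum, so a vectorized one-liner with cummax should behave the same way as the loop (a sketch; given the inconsistency mentioned, it may not match your expected column exactly):
# carry the largest value seen so far forward over the whole column
df['Number is Corrected'] = df['Number'].cummax()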

Create Forecasts Looping over SKUs and Export to CSV using Facebook Prophet

I am new to Python so please bear with me.
I am trying to convert what I think may be a nested dictionary into a csv that I can export. Below is my code:
import pandas as pd
import os
from fbprophet import Prophet
# Read in File
df1 = pd.read_csv('File_Path.csv')
#Create Loop to Forecast Multiple SKUs
def get_prediction(df):
    prediction = {}
    df1 = df.rename(columns={'Date': 'ds', 'qty_ordered': 'y', 'item_no': 'item'})
    list_items = df1.item.unique()
    for item in list_items:
        item_df = df1.loc[df1['item'] == item]
        # set the uncertainty interval to 95% (the Prophet default is 80%)
        my_model = Prophet(yearly_seasonality=True, seasonality_prior_scale=1.0)
        my_model.fit(item_df)
        future_dates = my_model.make_future_dataframe(periods=12, freq='M')
        forecast = my_model.predict(future_dates)
        prediction[item] = forecast
    return prediction
# Save predictions to dictionary
df2 = get_prediction(df1)
# Convert dictionary
df3 = pd.DataFrame.from_dict(df2, orient='columns')
So the last part of the code is where I am struggling. I need to convert the df2 dictionary to a dataframe (df3) so I can export it to a csv. But it looks as if it is a nested dictionary? Not sure if I need to update my function or not.
This is what a snippet of the dictionary looks like
I need to export it so it will look like this
Any help would be greatly appreciated!
The following code should help flatten df2 (a dictionary of dataframes, if I understand correctly):
def flatten(dict_of_df):
    # insert column 'item' so each forecast keeps track of which SKU it belongs to
    for key, value in dict_of_df.items():
        value['item'] = key
    # return vertically concatenated dataframe with all the items
    return pd.concat(dict_of_df.values())
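A possible usage, assuming df2 is the dictionary returned by get_prediction and 'forecasts.csv' is just a placeholder output path:
df3 = flatten(df2)          # one long dataframe, one row per forecast date per item
df3.to_csv('forecasts.csv', index=False)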

iterate over list of dicts to create different strings

I have a pandas dataframe with 3 different columns that I turn into a list of dictionaries with to_dict:
df = [
    {'HEADER1': 'col1-row1', 'HEADER2': 'col2-row1', 'HEADER3': 'col3-row1'},
    {'HEADER1': 'col1-row2', 'HEADER2': 'col2-row2', 'HEADER3': 'col3-row2'}
]
Now my problem is that I need the value of 'col2-rowX' and 'col3-rowX' to build a URL and use requests and bs4 to scrape the websites.
I need my result to be something like the following:
requests.get("'http://www.website.com/' + row1-col2 + 'another-string' + row1-col3 + 'another-string'")
And I need to do that for every dictionary in the list.
I have tried iterating over the dictionaries using for-loops.
something like:
import pandas as pd
import os
os.chdir('C://Users/myuser/Desktop')
df = pd.DataFrame.from_csv('C://Users/myuser/Downloads/export.csv')
#Remove 'Code' column
df = df.drop('Code', axis=1)
#Remove 'Code2' as index
df = df.reset_index()
#Rename columns for easier manipulation
df.columns = ['CB', 'FC', 'PO']
#Convert to dictionary for easy URL iteration and creation
df = df.to_dict('records')
for row in df:
    for key in row:
        print(key)
You only ever iterate twice, and you short-circuit out of the nested for loop every time it runs because of the return statement there. Looking up the necessary information from each dictionary will let you build up your URLs. One possible example:
def get_urls(l_d):
    l = []
    for d in l_d:
        l.append('http://www.website.com/' + d['HEADER2'] + 'another-string' + d['HEADER3'] + 'another-string')
    return l

df = [{'HEADER1': 'col1-row1', 'HEADER2': 'col2-row1', 'HEADER3': 'col3-row1'},
      {'HEADER1': 'col1-row2', 'HEADER2': 'col2-row2', 'HEADER3': 'col3-row2'}]
print(get_urls(df))
>>> ['http://www.website.com/col2-row1another-stringcol3-row1another-string', 'http://www.website.com/col2-row2another-stringcol3-row2another-string']
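From there, a minimal sketch of actually fetching each page (assuming the built URLs are reachable and the requests package is installed; parsing with bs4 would follow the same pattern):
import requests

for url in get_urls(df):
    # fetch each built URL and report its HTTP status
    response = requests.get(url)
    print(url, response.status_code)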
