I'm trying to solve a trading problem, but I'll rephrase it in a different way.
I have an array of countries as
countries = {'country_name': ['France','Germany','Italy','Japan']}
For each country, I have a CSV stored on my laptop. Each CSV has 3 columns [Date, Birth, Death].
I am looping over the array, reading each CSV, and creating a DataFrame object:
import pandas as pd

countries = {'country_name': ['France','Germany','Italy','Japan']}
countries = pd.DataFrame(countries)
for country in countries['country_name']:
    country_file_name = country + '.csv'
    vars()[country] = pd.read_csv(country_file_name)
    # Here I want to prepend the country name to each column except the index
When I do France.head() I get the output as:
index        Birth  Deaths
2020-01-01   9      10
2002-01-02   5      12
...
2002-12-10   14     10
But I want the output as:
index        France_Birth  France_Deaths
2020-01-01   9             10
2002-01-02   5             12
...
2002-12-10   14            10
Note - I do not want to do France.columns = ['France_Birth','France_Deaths'] because it would take me days to do that for all the CSVs.
I am using a Jupyter notebook here.
https://colab.research.google.com/drive/1aOg3eOhsigbewAhRwQE1QsxGDKzEPyW5?usp=sharing
Not sure whether there is any way to do this or whether I have to change my approach.
This can be achieved using the rename function of pandas.DataFrame:
countries = {'country_name': ['France','Germany','Italy','Japan']}
countries = pd.DataFrame(countries)
for country in countries['country_name']:
    country_file_name = country + '.csv'
    vars()[country] = pd.read_csv(country_file_name).rename(columns=lambda s: country + "_" + s)
You can check the documentation for DataFrame.rename.
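If every column should get the same prefix, DataFrame.add_prefix does the same job a bit more concisely. A minimal sketch under the same assumptions as above (one <country>.csv file per country, pandas imported as pd):
for country in countries['country_name']:
    # add_prefix prepends the string to every column label; the index is left untouched
    vars()[country] = pd.read_csv(country + '.csv').add_prefix(country + '_')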
[Sample data from a related question: OCR-extracted ID card fields in a single CSV column, including Name, Father Name, Gender, Country (Oman), Identity Number, Date of Birth (28.10.1995), Date of Issue, and Date of Expiry.]
To extract a specific column from a CSV file, you can use the iloc indexer from the pandas library after reading the initial CSV file.
import pandas as pd

dataset = pd.read_csv("path_of_csv")
# Now that you've read the original CSV file you can slice along the columns
# to get the desired column (example: Name, the 1st column)
Name = dataset.iloc[:, 0]
Or, if you use an older version of pandas, this just might work (it definitely works for pandas version 1.3.5):
dataset = pd.read_csv("path_of_csv")
Name = dataset['Name']
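If you only ever need that one column, you can also avoid loading the rest of the file. A minimal sketch, assuming the column really is named Name and that the path is a placeholder:
# usecols makes pandas parse only the listed column(s) from the CSV
name_only = pd.read_csv("path_of_csv", usecols=['Name'])['Name']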
I have a dataframe with three columns, let's say:
Name Address Date
faraz xyz 2022-01-01
Abdul abc 2022-06-06
Zara qrs 2021-02-25
I want to compare each date in the Date column with all the other dates in the Date column and only keep those rows whose date lies within 6 months of at least one of the other dates.
for example: (2022-01-01 - 2022-06-06) = 5 months so we keep both these dates
but,
(2022-06-06 - 2021-02-25) and (2022-01-01 - 2021-02-25) exceed the 6 month limit
so we will drop that row.
Desired Output:
Name Address Date
faraz xyz 2022-01-01
Abdul abc 2022-06-06
I have tried a couple of approaches, such as nested loops, but I have 1 million+ entries and it takes forever to run that loop. Some of the dates repeat too; not all are unique.
for index, row in dupes_df.iterrows():
    for date in uniq_dates_list:
        format_date = datetime.strptime(date, '%d/%m/%y')
        if ((format_date.year - row['JournalDate'].year) * 12 + (format_date.month - row['JournalDate'].month) <= 6):
            print("here here")
            break
    else:
        dupes_df.drop(index, inplace=True)
I need a much more optimal solution for it. I have studied lambda functions, but couldn't get to the depths of them.
IIUC, this should work for you:
import pandas as pd
import itertools
from io import StringIO

data = StringIO("""Name;Address;Date
faraz;xyz;2022-01-01
Abdul;abc;2022-06-06
Zara;qrs;2021-02-25
""")

df = pd.read_csv(data, sep=';', parse_dates=['Date'])

# every pair of dates, later date first, plus the gap between them in days
df_date = pd.DataFrame([sorted(l, reverse=True) for l in itertools.combinations(df['Date'], 2)], columns=['Date1', 'Date2'])
df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days

# keep the rows whose date occurs in the first (index 0) pair with a gap of at most 180 days (~6 months)
df[df.Date.isin(df_date[df_date['diff'] <= 180].iloc[:, :-1].T[0])]
Output:
Name Address Date
0 faraz xyz 2022-01-01
1 Abdul abc 2022-06-06
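Not part of the answer above, but for moderate row counts the same idea can be fully vectorised with NumPy broadcasting; note that this builds an n x n matrix of pairwise gaps, so it will not scale to millions of rows. A rough sketch, assuming df is the frame read above:
import numpy as np

dates = df['Date'].to_numpy()
# n x n matrix of absolute gaps in days between every pair of dates
diff_days = np.abs((dates[:, None] - dates[None, :]) / np.timedelta64(1, 'D'))
np.fill_diagonal(diff_days, np.inf)      # ignore a row's distance to itself
keep = (diff_days <= 180).any(axis=1)    # within ~6 months (180 days) of at least one other date
df[keep]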
First, I think it'd be easier if you use relativedelta from dateutil.
Reference: https://pynative.com/python-difference-between-two-dates-in-months/
Second, I think you need to add a column; let's call it score.
In the second loop, if the delta is <= 6 months, set score = 1 and 'continue'.
This way each row is compared to all the other rows.
Then delete all rows that have score == 0.
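A minimal sketch of that idea, using relativedelta and a score column on the sample data from the question (it breaks out of the inner loop as soon as one match is found):
from dateutil.relativedelta import relativedelta
import pandas as pd

df = pd.DataFrame({
    'Name': ['faraz', 'Abdul', 'Zara'],
    'Address': ['xyz', 'abc', 'qrs'],
    'Date': pd.to_datetime(['2022-01-01', '2022-06-06', '2021-02-25']),
})

df['score'] = 0
for i, d1 in df['Date'].items():
    for j, d2 in df['Date'].items():
        if i == j:
            continue
        delta = relativedelta(max(d1, d2), min(d1, d2))
        if delta.years * 12 + delta.months <= 6:   # within 6 months of at least one other date
            df.loc[i, 'score'] = 1
            break

df = df[df['score'] == 1].drop(columns='score')    # delete all rows that have score == 0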
I want to make a different dataframe for those Numbers (column B) where Main Date > Reported Date (see the image below). If this condition comes true, then I have to make another dataframe displaying that Number's data.
Example: take Number (column B) 223311; if any Main Date > Reported Date, then display all the records of that Number.
Here is a simple solution with Pandas. You can separate out DataFrames very easily by the values of a particular column. From there, iterate the new DataFrame after resetting its index (if you want to keep the original index, use dataframe.shape to iterate instead). I appended the results to a list for convenience; they could easily be extracted into labeled DataFrames or combined. The long variable names are there to help comprehension.
import pandas as pd

df = pd.read_csv('forstack.csv')
list_of_dataframes = []   # A place to store each dataframe. You could also name them as you go
checked_Numbers = []      # Simply to avoid multiple of the same dataframe
for aNumber in df['Number']:                   # For every number in the column "Number"
    if aNumber not in checked_Numbers:         # While this number has not been processed
        checked_Numbers.append(aNumber)        # Mark as checked
        df_forThisNumber = df[df.Number == aNumber].reset_index(drop=True)  # "Make a different Dataframe" per request, with a new index
        for index in range(0, len(df_forThisNumber)):  # Parse each row of this dataframe to see if it matches the criteria
            if df_forThisNumber.at[index, 'Main Date'] > df_forThisNumber.at[index, 'Reported Date']:
                list_of_dataframes.append(df_forThisNumber)  # If it matches the criteria, append it
                break  # Stop after the first match so the same dataframe is not appended more than once
Outputs :
Main Date Number Reported Date Fee Amount Cost Name
0 1/1/2019 223311 1/1/2019 100 12 20 11
1 1/7/2019 223311 1/1/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/2/2019 111111 1/2/2019 100 12 20 11
1 1/6/2019 111111 1/2/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/3/2019 222222 1/3/2019 100 12 20 11
1 1/8/2019 222222 1/3/2019 100 12 20 11
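A groupby-based alternative (not the answer above, just a sketch assuming the same 'forstack.csv' layout with 'Number', 'Main Date' and 'Reported Date' columns) collects one dataframe per Number that has at least one row with Main Date > Reported Date:
import pandas as pd

df = pd.read_csv('forstack.csv', parse_dates=['Main Date', 'Reported Date'])

# one dataframe per Number whose group contains at least one Main Date > Reported Date
list_of_dataframes = [
    group.reset_index(drop=True)
    for _, group in df.groupby('Number')
    if (group['Main Date'] > group['Reported Date']).any()
]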
I'm trying to put together a generic piece of code that would:
Take a time series for some price data and divide it into deciles, e.g. take the past 18m of gold prices and divide it into deciles [DONE, see below]
date 4. close decile
2017-01-03 1158.2 0
2017-01-04 1166.5 1
2017-01-05 1181.4 2
2017-01-06 1175.7 1
... ...
2018-04-23 1326.0 7
2018-04-24 1333.2 8
2018-04-25 1327.2 7
[374 rows x 2 columns]
Pull out the dates for a particular decile, then create a secondary datelist with an added 30 days
#So far only for a single decile at a time
firstdecile = gold.loc[gold['decile'] == 1]
datelist = list(pd.to_datetime(firstdecile.index))
datelist2 = list(pd.to_datetime(firstdecile.index) + pd.DateOffset(months=1))
Take an average of those 30-day price returns for each decile
level1 = gold.ix[datelist]
level2 = gold.ix[datelist2]
level2.index = level2.index - pd.DateOffset(months=1)
result = pd.merge(level1,level2, how='inner', left_index=True, right_index=True)
def ret(one, two):
return (two - one)/one
pricereturns = result.apply(lambda x :ret(x['4. close_x'], x['4. close_y']), axis=1)
mean = pricereturns.mean()
Return the list of all 10 averages in a single CSV file
So far I've been able to put together something functional that does steps 1-3, but only for a single decile, and I'm struggling to expand this into looped code for all 10 deciles at once with a clean CSV output.
First append the close price at t + 1 month as a new column on the whole dataframe.
gold2_close = gold.loc[gold.index + pd.DateOffset(months=1), 'close']
gold2_close.index = gold.index
gold['close+1m'] = gold2_close
In practice, however, what matters is the number of trading days, i.e. you won't have prices for weekends or holidays. So I'd suggest you shift by a number of rows rather than by date range, i.e. the next 20 trading days:
gold['close+20'] = gold['close'].shift(periods=-20)
Now calculate the expected return for each row
gold['ret'] = (gold['close+20'] - gold['close']) / gold['close']
You can also combine steps 1 and 2 directly so you don't need the additional column (only if you shift by number of rows, not by a fixed date range, due to reindexing):
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
Since you already have your deciles, you just need to groupby the deciles and aggregate the returns with mean()
gold_grouped = gold.groupby(by="decile").mean()
Putting in some random data, you get something like the dataframe below; close and ret are the averages for each decile. You can create a CSV from a dataframe via pandas.DataFrame.to_csv (a short end-to-end sketch follows after the table).
close ret
decile
0 1238.343597 -0.018290
1 1245.663315 0.023657
2 1254.073343 -0.025934
3 1195.941312 0.009938
4 1212.394511 0.002616
5 1245.961831 -0.047414
6 1200.676333 0.049512
7 1181.179956 0.059099
8 1214.438133 0.039242
9 1203.060985 0.029938
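Putting the pieces above together, a short end-to-end sketch; it assumes gold is a DataFrame indexed by date with close and decile columns, uses the 20-trading-day horizon from above, and the output filename is just a placeholder:
# look 20 trading days ahead, compute the return, average per decile, write to CSV
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
decile_means = gold.groupby('decile')['ret'].mean()
decile_means.to_csv('decile_returns.csv', header=True)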
I have a frame moviegoers that includes zip codes but not cities.
I then redefined moviegoers to be zipcodes and changed the data type of zip codes to be a data frame instead of a series.
zipcodes = pd.read_csv('NYC1-moviegoers.csv',dtype={'zip_code': object})
I know the dataset URL I need is this: https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv.
I defined a dataframe, zip_codes, to call the data from that dataset and changed the dataset type from series to dataframe so it's in the same format as the zipcodes dataframe.
I want to merge the dataframes so I can have the movie goer data. But, instead of zipcodes, I want to have the state abbreviation. This is where I am having issues.
The end goal is to count the number of movie goers per state. Example ideal output:
CA 116
MN 78
NY 60
TX 51
IL 50
Any ideas would be greatly appreciated.
I think you need map by Series and then use value_counts for the count:
print (zipcodes)
zip_code
0 85711
1 94043
2 32067
3 43537
4 15213
s = zip_codes.set_index('Zipcode')['State']
df = zipcodes['zip_code'].map(s).value_counts().rename_axis('state').reset_index(name='count')
print (df.head())
state count
0 OH 1
1 CA 1
2 FL 1
3 AZ 1
4 PA 1
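For completeness, zip_codes above is assumed to be the reference table loaded from the URL mentioned in the question, with Zipcode and State columns; a minimal sketch of loading both frames:
url = "https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv"
zip_codes = pd.read_csv(url, dtype={'Zipcode': object})                     # reference table: Zipcode -> State
zipcodes = pd.read_csv('NYC1-moviegoers.csv', dtype={'zip_code': object})   # moviegoer zip codes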
Simply merge both datasets on Zipcode columns then run groupby for state counts.
# READ DATA FILES WITH RENAMING OF ZIP COLUMN IN FIRST
url = "https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv"
moviegoers = pd.read_csv('NYC1-moviegoers.csv', dtype={'zip_code': object}).rename(columns={'zip_code': 'Zipcode'})
zipcodes = pd.read_csv(url, dtype={'Zipcode': object})
# MERGE ON COMMON FIELD
merged_df = pd.merge(moviegoers, zipcodes, on='Zipcode')
# AGGREGATE BY INDICATOR (STATE)
merged_df.groupby('State').size()
# ALTERNATIVE GROUP BY COUNT
merged_df.groupby('State')['Zipcode'].agg('count')
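If you also want the counts sorted as in the example output from the question, a small follow-up:
# SORT STATE COUNTS IN DESCENDING ORDER
merged_df.groupby('State').size().sort_values(ascending=False)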