I have 2 dataframes. My main dataframe
dffinal
        date  id  och  och1  och2  och3  cch1  LCH  L#
0  3/27/2020   1 -2.1     3     3     1     5  NaN NaN
1   4/9/2020   2  2.0     1     2     1     3  NaN NaN
My second dataframe
df2
        date  och  cch  och1  och2  och3  cch1
0  5/30/2012 -0.7 -0.7     3    -1     1    56
1  9/16/2013  0.9 -1.0     6     4     3     7
2  9/26/2013  2.5  5.4     2     3     2     4
3  8/26/2016  0.1 -0.7     4     3     5    10
I have this loop
for i in dffinal.index:
    df3 = df2.copy()
    df3 = df3[df3['och1'] > dffinal['och1'].iloc[i]]
    df3 = df3[df3['och2'] > dffinal['och2'].iloc[i]]
    df3 = df3[df3['och3'] > dffinal['och3'].iloc[i]]
    df3 = df3[df3['cch1'] > dffinal['cch1'].iloc[i]]
    dffinal['LCH'][i] = df3["och"].mean()
    dffinal['L#'][i] = len(df3.index)
As the code shows, for each row of dffinal the values of LCH and L# are computed from the rows of df2 (filtered into df3) that satisfy the conditions above.
This code works, but it is very slow. I found out that I can improve efficiency with pandas vectorization, but I could not figure out how to do that for my case.
This is my desired result
        date  id  och  och1  och2  och3  cch1       LCH   L#
0  3/27/2020   1 -2.1     3     3     1     5  0.900000  1.0
1   4/9/2020   2  2.0     1     2     1     3  1.166667  3.0
I would greatly appreciate any help making this code more efficient.
Correct answer
I personally use @shadowtalker's answer, the easy method, simply because I can understand how it works.
The most efficient answer is fast but complex.
score:2
Here is one way to approach your problem
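def fast(A, B):
    # A, B: 2-D arrays; column 0 holds 'och', the remaining columns are compared
    for a in A:
        # rows of B that exceed row a in every compared column
        m = (B[:, 1:] > a[1:]).all(1)
        yield B[m, 0].mean(), m.sum()

c = ['och', 'och1', 'och2', 'och3', 'cch1']
# df1 here is the question's dffinal
df1[['LCH', 'L#']] = list(fast(df1[c].to_numpy(), df2[c].to_numpy()))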
date id och och1 och2 och3 cch1 LCH L#
0 3/27/2020 1 -2.1 3 3 1 5 0.900000 1
1 4/9/2020 2 2.0 1 2 1 3 1.166667 3
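The per-row loop in this answer can also be replaced by a single broadcast comparison. The following is only a sketch, not part of the original answer: it rebuilds the question's sample data by hand, and it assumes the n × m × 4 boolean array fits in memory (n rows in dffinal, m rows in df2).

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
dffinal = pd.DataFrame({
    "date": ["3/27/2020", "4/9/2020"],
    "id": [1, 2],
    "och": [-2.1, 2.0],
    "och1": [3, 1],
    "och2": [3, 2],
    "och3": [1, 1],
    "cch1": [5, 3],
})
df2 = pd.DataFrame({
    "date": ["5/30/2012", "9/16/2013", "9/26/2013", "8/26/2016"],
    "och": [-0.7, 0.9, 2.5, 0.1],
    "cch": [-0.7, -1.0, 5.4, -0.7],
    "och1": [3, 6, 2, 4],
    "och2": [-1, 4, 3, 3],
    "och3": [1, 3, 2, 5],
    "cch1": [56, 7, 4, 10],
})

cols = ["och1", "och2", "och3", "cch1"]
A = dffinal[cols].to_numpy()   # shape (n, 4)
B = df2[cols].to_numpy()       # shape (m, 4)
och = df2["och"].to_numpy()    # shape (m,)

# mask[i, j] is True when df2 row j exceeds dffinal row i in every column
mask = (B[None, :, :] > A[:, None, :]).all(axis=2)   # shape (n, m)

counts = mask.sum(axis=1)
sums = np.where(mask, och, 0.0).sum(axis=1)
# Leave the mean as NaN where no df2 row qualifies
means = np.divide(sums, counts, out=np.full(len(counts), np.nan), where=counts > 0)

dffinal["LCH"] = means
dffinal["L#"] = counts
print(dffinal)
```

For large inputs the mask can be built in chunks of dffinal rows to bound memory use.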
score:3
It may be very difficult to avoid iteration, given the logic you use to select a subset of rows in df2 for each dffinal row, but you should be able to speed up the iterative method (hopefully by a lot) using this.
(Note: if you're repeatedly accessing the row of the dataframe you're iterating through, use .iterrows so you can grab values much more simply and quickly.)
for i, row in dffinal.iterrows():
    # 'och' values of the df2 rows that exceed the current dffinal row in all four columns
    och_array = df2.loc[(df2['och1'] > row['och1']) &
                        (df2['och2'] > row['och2']) &
                        (df2['och3'] > row['och3']) &
                        (df2['cch1'] > row['cch1']), 'och'].values
    dffinal.at[i, 'LCH'] = och_array.mean()
    dffinal.at[i, 'L#'] = len(och_array)
This avoids repeated lookups in dffinal and avoids copying df2 on every iteration. I can't test this without a data sample, but I think it will work.
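Since this answer was written without test data, here is a sketch that runs the same idea against the question's sample. It is a variation rather than the answer's exact code: it iterates over only the four compared columns (so each row stays numeric), compares them in one call with DataFrame.gt, and creates the LCH and L# columns up front.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
dffinal = pd.DataFrame({
    "date": ["3/27/2020", "4/9/2020"],
    "id": [1, 2],
    "och": [-2.1, 2.0],
    "och1": [3, 1],
    "och2": [3, 2],
    "och3": [1, 1],
    "cch1": [5, 3],
})
df2 = pd.DataFrame({
    "date": ["5/30/2012", "9/16/2013", "9/26/2013", "8/26/2016"],
    "och": [-0.7, 0.9, 2.5, 0.1],
    "cch": [-0.7, -1.0, 5.4, -0.7],
    "och1": [3, 6, 2, 4],
    "och2": [-1, 4, 3, 3],
    "och3": [1, 3, 2, 5],
    "cch1": [56, 7, 4, 10],
})

cols = ["och1", "och2", "och3", "cch1"]
dffinal["LCH"] = np.nan
dffinal["L#"] = np.nan

# dffinal[cols] is a separate frame, so dffinal can be updated while looping
for i, row in dffinal[cols].iterrows():
    # True for df2 rows that exceed this row in every compared column
    mask = df2[cols].gt(row, axis=1).all(axis=1)
    och_vals = df2.loc[mask, "och"]
    dffinal.at[i, "LCH"] = och_vals.mean()
    dffinal.at[i, "L#"] = len(och_vals)

print(dffinal)
```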
score:3
This answer is based on https://stackoverflow.com/a/68197271/2954547, except that it uses itertuples instead of iterrows. itertuples is generally safer than iterrows, because it preserves dtypes correctly. See the "Notes" section of the DataFrame.iterrows documentation.
It also is self-contained, in that it can be executed top-to-bottom without having to copy/paste data, etc.
Note that I iterate over df1.itertuples and not df_final.itertuples. Never mutate something that you are iterating over, and never iterate over something that you are mutating. Modifying a DataFrame in-place is a form of mutation.
import io
import pandas as pd
data1_txt = """
date id och och1 och2 och3 cch1 LCH L#
3/27/2020 1 -2.1 3 3 1 5 NaN NaN
4/9/2020 2 2.0 1 2 1 3 NaN NaN
"""
data2_txt = """
date och cch och1 och2 och3 cch1
5/30/2012 -0.7 -0.7 3 -1 1 56
9/16/2013 0.9 -1.0 6 4 3 7
9/26/2013 2.5 5.4 2 3 2 4
8/26/2016 0.1 -0.7 4 3 5 10
"""
df1 = pd.read_fwf(io.StringIO(data1_txt), index_col='id')
df2 = pd.read_fwf(io.StringIO(data2_txt))
df_final = df1.copy()
for row in df1.itertuples():
    row_mask = (
        (df2['och1'] > row.och1) &
        (df2['och2'] > row.och2) &
        (df2['och3'] > row.och3) &
        (df2['cch1'] > row.cch1)
    )
    och_vals = df2.loc[row_mask, 'och']
    i = row.Index
    df_final.at[i, 'LCH'] = och_vals.mean()
    df_final.at[i, 'L#'] = len(och_vals)
print(df_final)
The output is
date och och1 och2 och3 cch1 LCH L# LCH L#
id
1 3/27/2020 -2.1 3 3 1 5 NaN NaN 0.900000 1.0
2 4/9/2020 2.0 1 2 1 3 NaN NaN 1.166667 3.0
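The dtype point made in this answer can be checked with a small sketch. The frame and its column names (qty, price) are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical two-column frame: one integer column, one float column
df = pd.DataFrame({"qty": [1, 2], "price": [0.5, 1.5]})

# iterrows builds one Series per row, so the integer is upcast to float
_, first_row = next(df.iterrows())
print(first_row.dtype, type(first_row["qty"]))

# itertuples yields a namedtuple, so each field keeps its column's type
first_tuple = next(df.itertuples())
print(type(first_tuple.qty), type(first_tuple.price))
```

With a string column in the frame, iterrows rows fall back to object dtype instead, which is another reason to prefer itertuples for row-wise comparisons.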
score:3
The only way I can think of using pandas methods without loops is a cross join after resetting the index, then comparing with df.all(1):
cols = ['och1','och2','och3','cch1']
u = df2.reset_index().assign(k=1).merge(
    dffinal.reset_index().assign(k=1), on='k', suffixes=('','_y'))
# pandas 1.2 and later also support merge(..., how='cross')
dffinal['NewLCH'] = (u[u[cols].gt(u[[f"{i}_y" for i in cols]].to_numpy()).all(1)]
                     .groupby("index_y")['och'].mean())
print(dffinal)
date id och och1 och2 och3 cch1 LCH L# NewLCH
0 3/27/2020 1 -2.1 3 3 1 5 NaN NaN 0.900000
1 4/9/2020 2 2.0 1 2 1 3 NaN NaN 1.166667
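As a sketch of the how='cross' variant mentioned in the code comment, the following computes both LCH and L# in one pass. It assumes pandas 1.2 or later and rebuilds the question's sample data by hand; the intermediate names (left, right, pairs, stats) are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
dffinal = pd.DataFrame({
    "date": ["3/27/2020", "4/9/2020"],
    "id": [1, 2],
    "och": [-2.1, 2.0],
    "och1": [3, 1],
    "och2": [3, 2],
    "och3": [1, 1],
    "cch1": [5, 3],
})
df2 = pd.DataFrame({
    "date": ["5/30/2012", "9/16/2013", "9/26/2013", "8/26/2016"],
    "och": [-0.7, 0.9, 2.5, 0.1],
    "cch": [-0.7, -1.0, 5.4, -0.7],
    "och1": [3, 6, 2, 4],
    "och2": [-1, 4, 3, 3],
    "och3": [1, 3, 2, 5],
    "cch1": [56, 7, 4, 10],
})

cols = ["och1", "och2", "och3", "cch1"]
left = dffinal[cols].reset_index()   # "index" column keeps the dffinal row labels
right = df2[["och"] + cols]

# Every dffinal row paired with every df2 row; shared names get _x / _y suffixes
pairs = left.merge(right, how="cross", suffixes=("_x", "_y"))

x = pairs[[f"{c}_x" for c in cols]].to_numpy()   # dffinal thresholds
y = pairs[[f"{c}_y" for c in cols]].to_numpy()   # df2 values
keep = (y > x).all(axis=1)

stats = pairs[keep].groupby("index")["och"].agg(["mean", "size"])
dffinal["LCH"] = stats["mean"]   # aligned on index; NaN where nothing matched
dffinal["L#"] = stats["size"].reindex(dffinal.index, fill_value=0)
print(dffinal)
```

The cross join materializes len(dffinal) × len(df2) rows, so this trades memory for the absence of a Python-level loop.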
Credit To: stackoverflow.com